Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

Michael Schrenk

Language: English

Pages: 392

ISBN: 1593273975

Format: PDF / Kindle (mobi) / ePub


There's a wealth of data online, but sorting and gathering it by hand can be tedious and time-consuming. Rather than click through page after endless page, why not let bots do the work for you?

Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:

  • Send email or SMS notifications to alert you to new information quickly
  • Search different data sources and combine the results on one page, making the data easier to interpret and analyze
  • Automate purchases, auction bids, and other online activities to save time

Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.

This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you'll see how webbots can save you precious time and give you much greater control over the data available on the Web.

Further Exploration
29 Designing Webbot-Friendly Websites
Optimizing Web Pages for Search Engine Spiders
Well-Defined Links

While this isn’t particularly useful, this function becomes more interesting when the pattern matches more than one possible result set. For example, if the pattern had described an email address, we could have extracted all the email addresses from a web page. Or, if you were developing a spider, the pattern could have described hyperlinks and extracted all the links in a web page. We’ll cover this in detail as we progress.

preg_split(pattern, subject)

Finally,

Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Splitting a String at a Delimiter: split_string()

them. Both scripts are available at this book’s website. The PHP sections of this script appear in bold.


authentication or encryption. Use of these features is outside the scope of this book, but they’re available for you to explore on the official PHP website available at http://www.php.net.

Further Exploration

Since FTP is often the only application-level protocol that computer systems share, it is a convenient communication bridge between new and old computer systems. Moreover, in addition to using FTP as a common
