Tuesday 7 May 2013

WP Web Scraper

What is web scraping? Why do I need it?

Web scraping (also called Web harvesting or Web data extraction) is a software technique for extracting information from websites. It focuses on transforming unstructured Web content, typically in HTML format, into structured data that can be formatted and displayed, or stored and analyzed. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Typical uses of Web scraping include online price comparison, weather data monitoring, market data tracking, Web content mashups and Web data integration.

Sounds interesting, but how do I actually use it?

WP Web Scraper can be used through a shortcode (for posts, pages or sidebar) or template tag (for direct integration in your theme) for scraping and displaying web content. Here's the actual usage detail:

For use directly in posts, pages or sidebar (text widget): [wpws url="" selector=""]

Example usage as a shortcode: [wpws url="http://google.com" selector="title" user_agent="My Bot" on_error="error_show"] (Displays the title tag of Google's home page, using My Bot as the user agent.)

For use within themes: <?php echo wpws_get_content($url, $selector, $xpath, $wpwsopt)?>

Example usage in theme: <?php echo wpws_get_content('http://google.com','title','','user_agent=My Bot&on_error=error_show&')?> (Displays the title tag of Google's home page, using My Bot as the user agent.)
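
A theme that calls the template tag directly will fail with a fatal error if the plugin is ever deactivated. The following is a minimal sketch of a guard around that call. The helper name my_theme_scrape and its fallback argument are hypothetical and not part of the plugin; only wpws_get_content and the option string come from the examples above.

```php
<?php
/**
 * Hypothetical helper (not part of the plugin): returns scraped content,
 * or a fallback string when WP Web Scraper is not active.
 */
function my_theme_scrape($url, $selector, $fallback = '')
{
    // wpws_get_content() is only defined while the plugin is active.
    if (!function_exists('wpws_get_content')) {
        return $fallback;
    }
    // Same argument order as the template tag above: URL, selector,
    // XPath (left empty here), then the options string.
    return wpws_get_content($url, $selector, '', 'user_agent=My Bot&on_error=error_show');
}

echo my_theme_scrape('http://google.com', 'title', 'Title unavailable');
```

When this runs without the plugin loaded, the helper simply echoes the fallback text instead of stopping the page.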

For other advanced parameters, refer to the Usage Manual.

Further details about selector syntax are available in Selectors.

Wow! I can actually create a complete mashup using this!

Yes, you can. However, you should consider the copyright of the content owner. It's best to at least attribute the content owner with a linkback or, better, to obtain written permission. Apart from rights, scraping is in general a very resource-intensive task. It consumes the bandwidth of your host as well as that of the content owner's host, so it's best not to overdo it. Ideally, find single pages with enough content to create your mashup.

Okay. Then what's the best way to optimize its usage?

Here are some tips to help you optimize the usage:

    Keep the timeout as low as possible (the minimum is 1 second). A higher timeout can increase your page processing time if you are dealing with content on slow servers.
    If you plan to use multiple scrapers on a single page, make sure you set the cache timeout to a longer period, possibly as long as a day (i.e. 1440 minutes) or even more. This caches the content on your server and reduces scraping.
    Use fast-loading pages as your content source. Also prefer small pages to optimize performance.
    Keep a close watch on your scraper. If the website changes its page layout, your selector may fail to fetch the right content.
    If you are scraping a lot, keep a watch on your cache size too. Clear the cache occasionally.
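
To see why a longer cache timeout reduces scraping, here is a small illustrative sketch of a time-based cache. It is not the plugin's implementation (the plugin relies on the Transients API); the function cached_fetch, the in-memory array and the fake fetcher are all assumptions made for the example.

```php
<?php
/**
 * Illustrative time-based cache; NOT the plugin's implementation.
 * $fetch is any callable that returns the page content for a URL.
 * $now lets the caller supply the current time (in seconds) for testing.
 */
function cached_fetch(array &$cache, string $url, int $ttlMinutes, callable $fetch, ?int $now = null): string
{
    $now = $now ?? time();
    // Serve the stored copy while it is younger than the cache timeout.
    if (isset($cache[$url]) && $now - $cache[$url]['stored_at'] < $ttlMinutes * 60) {
        return $cache[$url]['body'];
    }
    // Otherwise contact the remote server once and remember the result.
    $body = $fetch($url);
    $cache[$url] = ['body' => $body, 'stored_at' => $now];
    return $body;
}

$cache = [];
$requests = 0;
$fetch = function (string $url) use (&$requests): string {
    $requests++; // stands in for a real HTTP request
    return "content of $url";
};

cached_fetch($cache, 'http://example.com', 1440, $fetch, 0);     // empty cache: fetches
cached_fetch($cache, 'http://example.com', 1440, $fetch, 3600);  // one hour later: served from cache
cached_fetch($cache, 'http://example.com', 1440, $fetch, 86400); // one day later: expired, fetches again
echo $requests, "\n";
```

With a 1440-minute timeout, three page views spread over a day cause only two remote requests; a shorter timeout would cause more.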

What libraries are used? What are the minimum requirements apart from WordPress?

For scraping, the plugin primarily uses the WP_HTTP classes. For caching, it uses the Transients API. For parsing HTML with CSS-style selectors, it uses phpQuery, a server-side, chainable, CSS3 selector-driven Document Object Model (DOM) API based on the jQuery JavaScript library. For XPath parsing, it uses JS_Extractor.
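
As a rough illustration of what XPath parsing of a fetched page involves, the sketch below pulls text out of an HTML string. It deliberately uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than phpQuery or JS_Extractor, so it shows the general technique and not the plugin's internals; the function name extract_by_xpath and the sample markup are assumptions.

```php
<?php
/**
 * Returns the trimmed text of every node matching an XPath query.
 * Uses PHP's bundled DOM extension, not the plugin's own parsers.
 */
function extract_by_xpath(string $html, string $query): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    $nodes = (new DOMXPath($doc))->query($query);
    if ($nodes === false) {
        return $results; // malformed XPath expression
    }
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

$html = '<html><head><title>Example Domain</title></head>'
      . '<body><ul><li class="price">10</li><li class="price">12</li></ul></body></html>';

print_r(extract_by_xpath($html, '//title'));
print_r(extract_by_xpath($html, '//li[@class="price"]'));
```

The first call returns the page title and the second returns both price entries, which is the kind of structured data a scraper hands back for display.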

Source: http://wordpress.org/extend/plugins/wp-web-scrapper/faq/
