Thursday, April 14, 2011

Simple Ruby Screen Scraper in just 5 lines without using XPath

Yes, a simple Ruby screen scraper in just 5 lines. It is a very basic scraper, and I will explain it in this article. You can call it screen scraping, web screen scraping, data scraping, website scraping, web page scraping or even web crawling; my main aim is to explain how to extract a particular link or label from a web page.

For scraping, I am using Ruby 1.8.7, which should be installed on your system, along with the Mechanize (1.0.0) and Hpricot (0.8.4) gems. I will be scraping the "Quality Assurance" link shown in the left panel under the "My Blogs" section of my website, "http://www.kumarritesh.com/". So let's start.

1. Open a command prompt (I am working on Windows 7), then start the Interactive Ruby Shell (irb).

C:\Users\ritesh>irb
irb(main):001:0>

2. Open the site "http://www.kumarritesh.com/" and inspect the div that contains the "Quality Assurance" link. The markup for that block looks like this:

<div id="module_77" class="tabcontent tabopen" tabindex="-1" role="tabpanel" aria-hidden="false" aria-expanded="true" aria-labelledby="link_77">
  <ul class="menu">
    <li id="item-466"><a href="/index.php/quality-assurance">Quality Assurance</a></li>
    <li id="item-467"><a href="/index.php/search-engine-optimization">Search Engine Optimization</a></li>
    <li id="item-468"><a href="/index.php/ruby-on-rails">Ruby on Rails</a></li>
    <li id="item-470"><a href="/index.php/blogs-links">Blog Links</a></li>
  </ul>
</div>

3. Enter the following code:
require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = Mechanize.new
page = agent.get('http://www.kumarritesh.com')
@response = page.content
doc = Hpricot(@response)
doc.search("//div[@class='tabcontent tabopen']").search("//li[@id ='item-466']").search("a").innerHTML

The last line gives you the output "Quality Assurance". If the code looks confusing, let me take you through it step by step.

The first three lines load rubygems, mechanize and hpricot:
require 'rubygems'
require 'mechanize'
require 'hpricot'
Next, instantiate a new Mechanize object:
agent = Mechanize.new  
Now we use the agent we've created to fetch the page "http://www.kumarritesh.com/" and store its content in a variable, which we then pass to Hpricot. Hpricot parses that content into a document object.
page = agent.get('http://www.kumarritesh.com')
@response = page.content
doc = Hpricot(@response)

Finally, the last line parses the doc and gives you the expected output:
doc.search("//div[@class='tabcontent tabopen']").search("//li[@id ='item-466']").search("a").innerHTML

It searches for the div with class 'tabcontent tabopen', then inside that div searches for the li with id 'item-466', and finally searches for the anchor link inside that li.
'innerHTML' returns whatever sits inside the matched anchor tag. Since the anchor here contains only text, you get the result "Quality Assurance".
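
If you want to try the same narrowing without the gems or a network connection, here is a minimal sketch. It is not the Hpricot code above: it swaps in REXML (shipped with Ruby) and parses a shortened copy of the block from step 2, which only works because that snippet happens to be well-formed XML. It assumes a reasonably recent Ruby rather than 1.8.7, and the names MENU_HTML, label, inner_markup and link are just illustrative.

```ruby
require 'rexml/document'
require 'uri'

# Shortened copy of the menu block from step 2, kept as a string
# so the example runs offline (assumption: the markup is well-formed).
MENU_HTML = <<~MARKUP
  <div id="module_77" class="tabcontent tabopen">
    <ul class="menu">
      <li id="item-466"><a href="/index.php/quality-assurance">Quality Assurance</a></li>
      <li id="item-467"><a href="/index.php/search-engine-optimization">Search Engine Optimization</a></li>
    </ul>
  </div>
MARKUP

doc = REXML::Document.new(MENU_HTML)

# Same three steps as the chained search: div by class, li by id, then the anchor.
div    = REXML::XPath.first(doc, "//div[@class='tabcontent tabopen']")
item   = REXML::XPath.first(div, ".//li[@id='item-466']")
anchor = REXML::XPath.first(item, "a")

# Text inside the <a> tag.
label = anchor.text

# Markup inside the <li>, tags included, to show the difference from plain text.
inner_markup = item.children.map(&:to_s).join

# The href is relative, so resolve it against the site address.
link = URI.join('http://www.kumarritesh.com/', anchor.attributes['href']).to_s

puts label
puts inner_markup
puts link
```

Here label holds the link text, inner_markup still contains the a tag because it is taken one level higher (from the li), and link is the absolute address of the "Quality Assurance" page.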

In the coming articles, I will take you deeper into the ocean of scraping, where you will dive like a big whale and find some useful and interesting concepts, so stay tuned.
