Options for HTML scraping?
[+411] [40] Mark Harrison
[2008-08-05 21:09:11]
[ html web-scraping html-parsing html-content-extraction ]
[ https://stackoverflow.com/questions/2861/options-for-html-scraping ] [DELETED]

I'm thinking of trying Beautiful Soup [1], a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.

The story so far:

(related) Best Methods to parse HTML - Gordon
Tag Soup link is dead. - Tapper7
HtmlUnit is a complete Java browser implementation that you cannot dissect into parts (you cannot download just an HTML page and scrape it; it will download all referred files, execute scripts, etc.). As such I don't think it belongs here. - Mark Jeronimus
Stock Java can walk HTML with XPath expressions, although not without issues. The parser part (DocumentBuilder) chokes on incorrect HTML, and 100% correct HTML is actually quite rare on the web. Therefore I like to replace the parser with JTidy. As for XPath, Java's own XPathExpression can be used (which has existed since Java 1.5). - Mark Jeronimus
[+66] [2008-08-05 21:13:32] Joey deVilla [ACCEPTED]

The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot [1].

[1] https://github.com/hpricot/hpricot

(13) These days Ruby folks have switched to Nokogiri for scraping. - Mark Thomas
[+46] [2008-08-07 18:38:30] Jon Galloway

In the .NET world, I recommend the HTML Agility Pack. Not nearly as simple as some of the above options (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.

http://www.codeplex.com/htmlagilitypack


(2) Combine LINQ with it and it seems more like HTMLSQL, no? - Bless Yahu
(3) Combine SharpQuery with it, and it becomes just like jQuery! code.google.com/p/sharp-query - mpen
(1) HTML Agility Pack fails to correctly structure the DOM for a number of HTML documents I've tried. - Ash Berlin-Taylor
[+39] [2008-08-07 18:18:59] Cristian

BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options, and it's a lot more Pythonic. If you want to try Ruby, there is a port of BeautifulSoup called RubyfulSoup, but it hasn't been updated in a while.

Other useful tools are HTMLParser and sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter or exit a tag and encounter HTML text. They're like Expat, if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where building a DOM tree would be slow and expensive.
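
A minimal sketch of that event-driven style, using the built-in parser (modern module path html.parser; sgmllib was removed in Python 3):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # called once for every opening tag the parser encounters
        if tag == "a":
            print(dict(attrs).get("href"))

LinkCollector().feed('<a href="/one">first</a> <a href="/two">second</a>')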

Regular expressions aren't really necessary on top of it: BeautifulSoup's search methods accept regular expressions, so if you need their power you can use it there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.
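
A minimal sketch of that combination, under the modern package name bs4 (the answers in this thread predate it):

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup('<a href="/q/1">one</a> <a href="/about">two</a>', "html.parser")
# find_all accepts tag names, attribute filters, and compiled regexes
for a in soup.find_all("a", href=re.compile(r"^/q/")):
    print(a["href"], a.get_text())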


[+22] [2008-08-07 18:31:17] deadprogrammer

I found HTMLSQL [1] to be a ridiculously simple way to screen-scrape. It takes literally minutes to get results with it.

The queries are super-intuitive - like:

SELECT title from img WHERE $class == 'userpic'

There are now some other alternatives that take the same approach.

[1] http://www.jonasjohn.de/lab/htmlsql.htm

(7) FYI, this is a PHP library - Tristan Havelick
[+20] [2008-09-17 12:44:55] akaihola

The Python lxml [1] library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and its pretty-printing of the in-memory XML structure. It also supports parsing broken HTML. And I don't think you can find other Python libraries/bindings that parse XML faster than lxml.

[1] http://codespeak.net/lxml/
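
A minimal sketch of both points, lenient parsing plus XPath (the input string is just an illustration):

from lxml import etree, html

tree = html.fromstring("<p>one<p>two")          # unclosed tags are tolerated
print(tree.xpath("//p/text()"))                 # ['one', 'two']
print(etree.tostring(tree, pretty_print=True))  # the repaired, pretty-printed structure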

[+18] [2008-08-05 23:37:44] andrewrk

For Perl, there's WWW::Mechanize.


[+18] [2009-12-28 16:59:14] filippo

Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:

  • mechanize [1]: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages
  • lxml [2]: Python binding for libxml2 and libxslt. Supports various options to traverse and select elements (e.g. XPath [3] and CSS selectors)
  • scrapemark [4]: high-level library that uses templates to extract information from HTML.
  • pyquery [5]: allows you to make jQuery-like queries on XML documents (see the sketch below).
  • scrapy [6]: a high-level scraping and web-crawling framework. It can be used to write spiders, for data mining, and for monitoring and automated testing
[1] http://wwwsearch.sourceforge.net/mechanize/
[2] http://codespeak.net/lxml/
[3] http://en.wikipedia.org/wiki/XPath
[4] http://arshaw.com/scrapemark/
[5] https://github.com/gawel/pyquery
[6] http://scrapy.org/
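
A minimal sketch of the pyquery style mentioned above (the HTML string is just an illustration):

from pyquery import PyQuery as pq

d = pq('<div><h3 class="title"><a href="/q/1">A question</a></h3></div>')
print(d("h3.title a").attr("href"))  # jQuery-style CSS selection: /q/1
print(d("h3.title a").text())        # A question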

(1) The Python Standard Library has a built-in HTML Parser... why not just use that? docs.python.org/2.7/library/htmlparser.html - ArtOfWarfare
[+14] [2009-07-31 19:39:57] user67627

'Simple HTML DOM Parser' is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, you will find yourself at home.

Find it here [1]

There is also a blog post about it here. [2]

[1] http://simplehtmldom.sourceforge.net/
[2] http://blog.dougalmatthews.com/2008/08/html-dom-and-easy-screen-scraping-in-php/

(1) I second this one. You don't need to install mod_python, etc. into the web server just to make it work - Brock Woolf
[+14] [2012-02-10 19:42:50] cookie_monster

Why has no one mentioned JSOUP yet for Java? http://jsoup.org/


[+11] [2008-09-18 20:13:40] akaihola

The templatemaker [1] utility from Adrian Holovaty (of Django [2] fame) uses a very interesting approach: you feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML-specific, so it would be good for scraping any other plaintext content as well. I've also used it with PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).

[1] http://code.google.com/p/templatemaker/
[2] http://www.djangoproject.com/
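
A minimal sketch of that learning approach, as I recall it from the project's README (the exact method names may have drifted):

from templatemaker import Template

t = Template()
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')
print(t.as_text('!'))                       # '<b>! and !</b>' -- the learned holes
print(t.extract('<b>larry and curly</b>'))  # ('larry', 'curly')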

how did you get templatemaker working for large HTML pages? I found it crashes when I give it anything non-trivial. - hoju
I suppose I've had no large HTML pages. No filed Issues seem to exist for that problem at code.google.com/p/templatemaker/issues/list so it's probably appropriate to send a test case there. It doesn't look like Adrian is maintaining the library though. I wonder what he uses nowadays at EveryBlock since they surely do a lot of scraping. - akaihola
[+10] [2009-08-16 20:56:23] ra9r

I know and love Screen-Scraper [1].

Screen-Scraper is a tool for extracting data from websites. Screen-Scraper automates:

* Clicking links on websites
* Entering data into forms and submitting
* Iterating through search result pages
* Downloading files (PDF, MS Word, images, etc.)

Common uses:

* Download all products, records from a website
* Build a shopping comparison site
* Perform market research
* Integrate or migrate data

Technical:

* Graphical interface--easy automation
* Cross platform (Linux, Mac, Windows, etc.)
* Integrates with most programming languages (Java, PHP, .NET, ASP, Ruby, etc.)
* Runs on workstations or servers

Three editions of screen-scraper:

* Enterprise: The most feature-rich edition of screen-scraper. All capabilities are enabled.
* Professional: Designed to be capable of handling most common scraping projects.
* Basic: Works great for simple projects, but not nearly as many features as its two older brothers.
[1] http://www.screen-scraper.com

Unfortunately not even the Basic version is FOSS. It only seems to be free as in beer. - Andreas Kuckartz
[+9] [2008-08-05 21:11:29] GateKiller

I would first find out whether the site(s) in question provide an API or RSS feeds for accessing the data you require.
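
If there is a feed, consuming it is far easier than scraping. A minimal sketch using Python's third-party feedparser package (the URL is hypothetical):

import feedparser

feed = feedparser.parse("http://example.com/rss")  # hypothetical feed URL
for entry in feed.entries:
    print(entry.title, entry.link)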


[+8] [2008-08-22 10:20:38] Frank Krueger

Scraping Stack Overflow is especially easy with Shoes [1] and Hpricot [2].

require 'hpricot'

Shoes.app :title => "Ask Stack Overflow", :width => 370 do
  SO_URL = "http://stackoverflow.com"
  stack do
    stack do
      caption "What is your question?"
      flow do
        @lookup = edit_line "stackoverflow", :width => "-115px"
        button "Ask", :width => "90px" do
          download SO_URL + "/search?s=" + @lookup.text do |s|
            doc = Hpricot(s.response.body)
            @rez.clear()
            (doc/:a).each do |l|
              href = l["href"]
              if href.to_s =~ /\/questions\/[0-9]+/ then
                @rez.append do
                  para(link(l.inner_text) { visit(SO_URL + href) })
                end
              end
            end
            @rez.show()
          end
        end
      end
    end
    stack :margin => 25 do
      background white, :radius => 20
      @rez = stack do
      end
    end
    @rez.hide()
  end
end
[1] http://code.whytheluckystiff.net/shoes/
[2] http://code.whytheluckystiff.net/hpricot/

[+8] [2008-08-26 22:46:37] dpavlin

Another option for Perl would be Web::Scraper [1], which is based on Ruby's Scrapi [2]. In a nutshell, with nice and concise syntax, you can write a robust scraper that puts results directly into data structures.

[1] http://search.cpan.org/~miyagawa/Web-Scraper/lib/Web/Scraper.pm
[2] http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/

[+7] [2008-08-31 12:09:33] Henry

I've had some success with HtmlUnit [1], in Java. It's a simple framework for writing unit tests on web UIs, but it's equally useful for HTML scraping.

[1] http://htmlunit.sourceforge.net

You can also use it to evaluate JavaScript execution, if you ever have the need :) - David
[+7] [2011-03-02 12:15:28] mvark

Yahoo! Query Language (YQL) can be used along with jQuery, AJAX, and JSONP to screen-scrape web pages [1].

[1] http://projects.ischool.washington.edu/tabrooks/343INFOAutumn09/JSONP/jsonpJqueryYQL.htm

[+6] [2009-02-13 12:58:01] GeekyMonkey

Another tool for .NET is MhtBuilder [1]

[1] http://www.codeproject.com/KB/files/MhtBuilder.aspx

[+6] [2011-05-11 18:28:29] jbst

There is this solution too: netty HttpClient [1]

[1] http://docs.jboss.org/netty/3.2/xref/org/jboss/netty/example/http/snoop/HttpClient.html

[+5] [2008-08-06 05:57:24] Mike Minutillo

I use Hpricot on Ruby. As an example, this is a snippet of code that I use to retrieve all book titles from the six pages of my HireThings account (since they don't seem to provide a single page with this information):

pagerange = 1..6
proxy = Net::HTTP::Proxy(proxy, port, user, pwd)
proxy.start('www.hirethings.co.nz') do |http|
  pagerange.each do |page|
    resp, data = http.get "/perth_dotnet?page=#{page}" 
    if resp.class == Net::HTTPOK
      (Hpricot(data)/"h3 a").each { |a| puts a.innerText }
    end
  end
end 

It's pretty much complete. All that comes before this are library imports and the settings for my proxy.


[+5] [2008-08-22 13:58:44] Acuminate

I've used Beautiful Soup a lot with Python. It is much better than regular-expression checking, because it works like using the DOM [1], even if the HTML is poorly formatted. You can find HTML tags and text quickly, with simpler syntax than regular expressions. Once you find an element, you can iterate over it and its children, which is more useful for understanding the contents in code than regular expressions are. I wish Beautiful Soup had existed years ago, when I had to do a lot of screen scraping -- it would have saved me a lot of time and headache, since HTML structure was so poor before people started validating it.

[1] http://en.wikipedia.org/wiki/Document_Object_Model
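
A minimal sketch of that find-then-walk style, under the modern bs4 package name (the HTML string is just an illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>first</li><li>second</li></ul>", "html.parser")
ul = soup.find("ul")      # find an element...
for li in ul.children:    # ...then iterate over its children
    print(li.get_text())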

[+5] [2008-08-27 09:43:07] JonnyGold

Although it was designed for .NET [1] web testing, I've been using the WatiN [2] framework for this purpose. Since it is DOM-based, it is pretty easy to capture HTML, text, or images. Recently, I used it to dump a list of links from a MediaWiki [3] All Pages namespace query into an Excel spreadsheet. The following VB.NET [4] code fragment is pretty crude, but it works.


Sub GetLinks(ByVal PagesIE As IE, ByVal MyWorkSheet As Excel.Worksheet)

    Dim PagesLink As Link
    For Each PagesLink In PagesIE.TableBodies(2).Links
        With MyWorkSheet
            .Cells(XLRowCounterInt, 1) = PagesLink.Text
            .Cells(XLRowCounterInt, 2) = PagesLink.Url
        End With
        XLRowCounterInt = XLRowCounterInt + 1
    Next
End Sub
[1] http://en.wikipedia.org/wiki/.NET_Framework
[2] http://en.wikipedia.org/wiki/Watir#Similar_tools
[3] http://en.wikipedia.org/wiki/MediaWiki
[4] http://en.wikipedia.org/wiki/Visual_Basic_.NET

[+3] [2008-08-17 14:13:14] kaybenleroll

I have used LWP [1] and HTML::TreeBuilder [2] with Perl and have found them very useful.

LWP (short for libwww-perl) lets you connect to websites and scrape the HTML; you can get the module here [3], and the O'Reilly book seems to be online here [4].

TreeBuilder allows you to construct a tree from the HTML; documentation and source are available at HTML::TreeBuilder - Parser that builds a HTML syntax tree [5].

There might be too much heavy lifting still to do with an approach like this, though. I have not looked at the Mechanize module [6] suggested by another answer, so I may well do that.

[1] http://en.wikipedia.org/wiki/Library_for_WWW_in_Perl
[2] http://search.cpan.org/~cjm/HTML-Tree-5.02/lib/HTML/TreeBuilder.pm
[3] http://search.cpan.org/dist/libwww-perl/
[4] http://lwp.interglacial.com/
[5] http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/TreeBuilder.pm
[6] http://search.cpan.org/dist/WWW-Mechanize/

[+3] [2008-08-24 10:32:37] Peter Hilton

In Java, you can use TagSoup [1].

[1] https://github.com/websdotcom/tagsoup

[+3] [2008-09-17 12:56:25] crojac

You would be a fool not to use Perl... Here come the flames...

Bone up on the following modules and ginsu any scrape around.

use LWP;
use HTML::TableExtract;
use HTML::TreeBuilder;
use HTML::Form;
use Data::Dumper;

[+3] [2008-10-09 20:53:21] hsivonen

Implementations of the HTML5 parsing algorithm [1]: html5lib [2] (Python, Ruby), Validator.nu HTML Parser [3] (Java, JavaScript; C++ in development), Hubbub [4] (C), Twintsam [5] (C#; upcoming).

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html
[2] http://code.google.com/p/html5lib/
[3] http://about.validator.nu/htmlparser/
[4] http://www.netsurf-browser.org/projects/hubbub/
[5] http://code.google.com/p/twintsam/
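
A minimal sketch of the Python implementation (html5lib with its default etree tree builder; note that it puts elements in the XHTML namespace):

import html5lib

doc = html5lib.parse("<p>one<p>two")   # browser-grade error recovery
ns = "{http://www.w3.org/1999/xhtml}"
for p in doc.iter(ns + "p"):
    print(p.text)                      # one, two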

[+3] [2012-10-29 15:59:36] pedro

Well, if you want it done from the client side using only a browser, there's jcrawl.com [1]. After designing your scraping service in the web application (http://www.jcrawl.com/app.html), you only need to add the generated script to an HTML page to start using/presenting your data.

All the scraping logic happens in the browser via JavaScript. I hope you find it useful. Click this link for a live example that extracts the latest news from Yahoo tennis [2].

[1] http://www.jcrawl.com
[2] http://www.jcrawl.com/app.html#service=yahoo+tennis+news&resultsview=1&data=%7B%22serviceId%22%3A%22service%20705%22%2C%22service%22%3A%22yahoo%20tennis%20news%22%2C%22host%22%3A%22http%3A%2F%2Fsports.yahoo.com%2Ftennis%2F%22%2C%22renderHtml%22%3Atrue%2C%22height%22%3A394%2C%22dataType%22%3A%22list%22%2C%22dataModel%22%3A%5B%7B%22column%22%3A%5B%7B%22indexPath%22%3A%5B0%2C0%2C1%2C1%2C0%2C0%2C1%2C0%2C1%2C0%2C0%2C0%5D%2C%22attribute%22%3A%22innerHtml%22%2C%22field%22%3A%22title%22%7D%2C%7B%22indexPath%22%3A%5B0%2C0%2C1%2C1%2C0%2C0%2C1%2C0%2C1%2C0%2C0%2C0%2C0%5D%2C%22attribute%22%3A%22href%22%2C%22field%22%3A%22link%22%7D%5D%7D%2C%7B%22column%22%3A%5B%7B%22indexPath%22%3A%5B0%2C0%2C1%2C1%2C0%2C0%2C1%2C0%2C1%2C0%2C1%2C0%5D%2C%22attribute%22%3A%22innerHtml%22%2C%22field%22%3A%22Field%201%22%7D%2C%7B%22indexPath%22%3A%5B0%2C0%2C1%2C1%2C0%2C0%2C1%2C0%2C1%2C0%2C1%2C0%2C0%5D%2C%22attribute%22%3A%22href%22%2C%22field%22%3A%22Field%202%22%7D%5D%7D%5D%7D

[+2] [2008-08-05 22:58:06] Grant

You probably have as much already, but I think this is what you are trying to do:

from __future__ import with_statement
import re, os

profile = ""

# fetch the profile page ("(SeCreTCODe)" is a placeholder for the real cookie value)
os.system('wget --no-cookies --header "Cookie: soba=(SeCreTCODe)" http://stackoverflow.com/users/30/myProfile.html')
with open("myProfile.html") as f:
    for line in f:
        profile = profile + line
p = re.compile(r'summarycount">(\d+)</div>')  # rep is found here
m = p.search(profile)
print m.group(1)
os.system("espeak \"Rep is at " + m.group(1) + " points\"")
os.remove("myProfile.html")

[+2] [2008-08-27 18:49:53] Shawn Miller

I've had mixed results in .NET using SgmlReader, which was originally started by Chris Lovett [1] and appears to have been updated by MindTouch [2].

[1] http://robgarrett.com/cs/blogs/software/archive/2005/08/09/1499.aspx
[2] http://wiki.developer.mindtouch.com/Community/SgmlReader

[+2] [2008-11-19 19:11:58] kkubasik

I've also had great success using Aptana's Jaxer + jQuery to parse pages. It's not as fast or 'script-like' in nature, but jQuery selectors + real JavaScript/DOM is a lifesaver on more complicated (or malformed) pages.


[+2] [2010-07-22 04:31:27] Neil McGuigan

I like Google Spreadsheets' ImportXML(URL, XPath) function.

It will repeat cells down the column if your XPath expression returns more than one value.

You can have up to 50 ImportXML() functions in one spreadsheet.

RapidMiner's Web Plugin is also pretty easy to use. It can do POSTs, accept cookies, and set the user agent [1].

[1] http://en.wikipedia.org/wiki/User_agent

[+1] [2008-08-05 21:29:51] pix0r

Regular expressions work pretty well for HTML scraping as well ;-) Though after looking at Beautiful Soup, I can see why this would be a valuable tool.


(3) Regular expressions? The center cannot hold; it is too late. - Andrew Grimm
[+1] [2008-08-25 12:02:23] robintw

Scrubyt [1] uses Ruby and Hpricot to do nice and easy web scraping. I wrote a scraper for my university's library service using this in about 30 minutes.

[1] https://github.com/scrubber/scrubyt

[+1] [2010-05-17 15:58:39] seagulf

For more complex scraping applications, I would recommend the IRobotSoft web scraper. It is dedicated free software for screen scraping. It has a strong query language for HTML pages, and it provides a very simple web-recording interface that frees you from much programming effort.


[+1] [2010-11-22 17:04:29] tim

The recent talk by Dav Glass, Welcome to the Jungle! (YUIConf 2011 Opening Keynote) [1], shows how you can use YUI [2] 3 on Node.js [3] to do client-side-style programming (with DOM selectors instead of string processing) on the server. It is very impressive.

[1] http://developer.yahoo.com/yui/theater/video.php?v=glass-node
[2] http://en.wikipedia.org/wiki/Yahoo!_UI_Library
[3] http://en.wikipedia.org/wiki/Nodejs

[+1] [2010-12-01 05:28:20] Justin Thomson

I've been using Feedity - http://feedity.com - for some of the scraping work (and conversion into RSS feeds) at my library. It works well for most web pages.


[+1] [2011-04-12 00:20:14] hoju

I do a lot of advanced web scraping, so I wanted to have total control over my stack and understand the limitations. This webscraping library [1] is the result.

[1] http://code.google.com/p/webscraping/

It is implemented in Python, and the project is still alive. - Andreas Kuckartz
[+1] [2012-07-04 10:43:20] BeniBela

I made a very nice library, Internet Tools [1], for web scraping.

The idea is to match a template against the web page; the template extracts all data from the page and also validates that the page structure is unchanged.

So you can just take the HTML of the web page you want to process, remove all dynamic or irrelevant content, and annotate the interesting parts.

E.g. the HTML for a new question on the stackoverflow.com index page is:

<div id="question-summary-11326954" class="question-summary narrow">

    <!-- skipped, this is getting too long -->

    <div class="summary">

        <h3><a title="Some times my tree list have vertical scroll ,then I scrolled very fast and the tree list shivered .Have any solution for this.
" class="question-hyperlink" href="/questions/11326954/about-scroll-bar-issue-in-tree">About Scroll bar issue in Tree</a></h3>

    <!-- skipped -->

    </div>
</div>

So you just remove that particular id, title, and summary to create a template that reads all new questions into title, summary, and link arrays:

 <t:loop>
   <div class="question-summary narrow">
     <div class="summary">
       <h3>
          <a class="question-hyperlink">
            {title:=text(), summary:=@title, link:=@href}
          </a>
       </h3>
     </div>
   </div>
 </t:loop>

And of course it also supports the basic techniques: CSS 3 selectors, XPath 2, and XQuery 1 expressions.

The only problem is that I was so stupid as to make it a Free Pascal [2] library. But there is also a language-independent web demo [3].

[1] http://www.benibela.de/sources_en.html#internettools
[2] http://en.wikipedia.org/wiki/Free_Pascal
[3] http://videlibri.sourceforge.net/cgi-bin/xidelcgi

[0] [2011-04-04 22:44:35] Neil McGuigan

For those who would prefer a graphical workflow tool, RapidMiner (FOSS) has a nice web-crawling and scraping facility.

Here's a series of videos:

http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html


[0] [2013-05-10 18:28:51] Hemerson Varela

When it comes to extracting data from an HTML document on the server side, Node.js [1] is a fantastic option. I have used it successfully with two modules called request [2] and cheerio [3].

You can see an example of how it works here [4].

[1] http://en.wikipedia.org/wiki/Node.js
[2] https://github.com/mikeal/request
[3] https://github.com/MatthewMueller/cheerio
[4] http://procbits.com/2012/04/11/quick-and-dirty-screen-scraping-with-node-js-using-request-and-cheerio

[-1] [2010-12-01 05:41:47] mpen

SharpQuery [1]

It's basically jQuery for C#. It depends on HTML Agility Pack [2] for parsing the HTML.

[1] http://code.google.com/p/sharp-query/
[2] http://htmlagilitypack.codeplex.com/
