What are the pros and cons of the leading Java HTML parsers?
[+182] [6] Avi Flax
[2010-06-30 18:16:26]
[ java html parsing ]
[ https://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers ]

Searching SO and Google, I've found that there are a few Java HTML parsers which are consistently recommended by various parties. Unfortunately it's hard to find any information on the strengths and weaknesses of the various libraries. I'm hoping that some people have spent some time comparing these libraries, and can share what they've learned.

Here's what I've seen:

  • JTidy
  • NekoHTML
  • jsoup
  • TagSoup

And if there's a major parser that I've missed, I'd love to hear about its pros and cons as well.

Thanks!

[+232] [2010-07-01 00:00:32] BalusC [ACCEPTED]

General

Almost all known HTML parsers implement the W3C DOM API [1] (part of the JAXP API, the Java API for XML Processing) and give you an org.w3c.dom.Document [2] back which is ready for direct use with the JAXP API. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-wellformed HTML ("tagsoup"), like JTidy [3], NekoHTML [4], TagSoup [5] and HtmlCleaner [6]. You usually use this kind of HTML parser to "tidy" the HTML source (e.g. replacing the HTML-valid <br> with an XML-valid <br />), so that you can traverse it "the usual way" using the W3C DOM and JAXP API.
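As a minimal sketch of such "tidying" with JTidy (the file names here are just placeholders): this reads tag soup, writes cleaned-up XHTML, and hands back a W3C DOM document in one go.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.w3c.tidy.Tidy;

Tidy tidy = new Tidy();
tidy.setXHTML(true); // emit well-formed XHTML, e.g. <br> becomes <br />
// parseDOM() writes the tidied markup to the output stream and returns a W3C DOM document
org.w3c.dom.Document document = tidy.parseDOM(new FileInputStream("dirty.html"), new FileOutputStream("clean.html"));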

The only ones which jump out are HtmlUnit [7] and Jsoup [8].

HtmlUnit

HtmlUnit [9] provides a completely own API which gives you the possibility to act like a web browser programmatically: entering form values, clicking elements, invoking JavaScript, et cetera. It's much more than an HTML parser alone. It's a real "GUI-less web browser" and HTML unit testing tool.
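A minimal sketch of that browser-like usage (the URL and the form field names here are made up for illustration):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://example.com/search"); // hypothetical page
HtmlForm form = page.getForms().get(0);
HtmlTextInput field = form.getInputByName("q"); // hypothetical field name
field.setValueAttribute("java html parser");
HtmlSubmitInput button = form.getInputByName("go"); // hypothetical button name
HtmlPage result = button.click(); // JavaScript handlers fire, as in a real browser
System.out.println(result.getTitleText());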

Jsoup

Jsoup [10] also provides a completely own API. It gives you the possibility to select elements using jQuery [11]-like CSS selectors [12] and provides a slick API to traverse the HTML DOM tree and get the elements of interest.

Particularly the traversing of the HTML DOM tree is the major strength of Jsoup. Those who have worked with org.w3c.dom.Document know what a hell of a pain it is to traverse the DOM using the verbose NodeList [13] and Node [14] APIs. True, XPath [15] makes life easier, but still, it's another learning curve and it can end up being verbose anyway.

Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since, without it, the code needed to gather the information of interest would grow ten times as big, short of writing utility/helper methods).

import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

String url = "http://stackoverflow.com/questions/3152138";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();

Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());

NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
    System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}

And here's an example of how to do exactly the same with Jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://stackoverflow.com/questions/3152138";
Document document = Jsoup.connect(url).get();

Element question = document.select("#question .post-text p").first();
System.out.println("Question: " + question.text());

Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
    System.out.println("Answerer: " + answerer.text());
}

Do you see the difference? It's not only less code; Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (e.g. from developing websites and/or using jQuery).

Summary

The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse the HTML, then go for the first-mentioned group of parsers. There are quite a lot [16] of them. Which one to choose depends on the features it provides (how easy does it make HTML cleaning for you? are there listeners/interceptors and tag-specific cleaners?) and on the robustness of the library (how often is it updated/maintained/fixed?). If you want to unit test the HTML, then HtmlUnit is the way to go. If you want to extract specific data from the HTML (which is more often than not the real-world requirement), then Jsoup is the way to go.

[1] http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/package-summary.html
[2] http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/Document.html
[3] http://jtidy.sourceforge.net/
[4] http://nekohtml.sourceforge.net
[5] http://home.ccil.org/%7Ecowan/XML/tagsoup/
[6] http://htmlcleaner.sourceforge.net/
[7] http://htmlunit.sourceforge.net/
[8] http://jsoup.org/
[9] http://htmlunit.sourceforge.net/
[10] http://jsoup.org/
[11] http://jquery.com
[12] http://www.w3.org/TR/css3-selectors/
[13] http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/NodeList.html
[14] http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/Node.html
[15] http://docs.oracle.com/javase/6/docs/api/javax/xml/xpath/XPath.html
[16] http://java-source.net/open-source/html-parsers

There is a huge pro/con that is omitted here: Jericho is the only parser I know that allows you to manipulate nasty HTML while preserving whitespace formatting and the incorrectness of the HTML (if there is any). - Adam Gent
(3) Jsoup is good. I tried to interface it with another module that works with org.w3c.dom.* API. Found that Jsoup doesn't obey the org.w3c.dom.* contract - Thamme Gowda
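If you need that interop, later jsoup releases ship a converter, org.jsoup.helper.W3CDom, which turns jsoup's tree into a real org.w3c.dom.Document; a minimal sketch:

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

// parse with jsoup, then convert to a contract-compliant W3C DOM document
org.w3c.dom.Document w3cDoc = new W3CDom().fromJsoup(Jsoup.connect("http://stackoverflow.com/questions/3152138").get());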
I think github.com/fleeksoft/ksoup is the best right now; it is multiplatform and works very well on Android, iOS, desktop, and some other platforms. - William
[+13] [2010-06-30 20:43:02] Matt Solnit

This article [1] compares certain aspects of the following parsers:

  • NekoHTML
  • JTidy
  • TagSoup
  • HtmlCleaner

It is by no means a complete summary, and it is from 2008. But you may find it helpful.

[1] http://www.benmccann.com/blog/java-html-parsing-library-comparison/

(1) This is a link-only answer. Can you add the pertinent details here? - Reinstate Monica -- notmaynard
[+8] [2010-06-30 18:39:53] Alohci

Add The validator.nu HTML Parser [1], an implementation of the HTML5 parsing algorithm in Java, to your list.

On the plus side, it's specifically designed to match HTML5, and, as it's at the heart of the HTML5 validator, it's highly likely to match future browsers' parsing behaviour to a very high degree of accuracy.

On the minus side, no browser's legacy parsing works exactly like this, and, as HTML5 is still in draft, it is subject to change.

In practice, such problems only affect obscure corner cases; it is, for all practical purposes, an excellent parser.
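Getting a document out of it is close to the standard JAXP routine; a minimal sketch, assuming the htmlparser artifact is on the classpath:

import javax.xml.parsers.DocumentBuilder;
import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// HtmlDocumentBuilder is a DocumentBuilder that applies the HTML5 parsing algorithm
DocumentBuilder builder = new HtmlDocumentBuilder();
Document document = builder.parse(new InputSource("http://stackoverflow.com/questions/3152138"));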

[1] http://about.validator.nu/htmlparser/

[+7] [2010-06-30 23:09:32] MJB

I found the Jericho [1] HTML Parser to be very well written, kept up to date (which many of the parsers are not), dependency-free, and easy to use.

[1] http://jericho.htmlparser.net/docs/index.html

[+6] [2013-03-31 14:18:10] Adam Gent

I'll just add to @MJB's answer: after working with most of the HTML parsing libraries in Java, there is a huge pro/con that is omitted, namely parsers that preserve the formatting and incorrectness of the HTML on input and output.

That is, when you change the document, most parsers will blow away the whitespace, comments, and incorrectness of the DOM, particularly if they are XML-like libraries.

Jericho [1] is the only parser I know that allows you to manipulate nasty HTML while preserving whitespace formatting and the incorrectness of the HTML (if there is any).
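A minimal sketch of that kind of surgical edit (the markup here is deliberately broken):

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

Source source = new Source("<html><title>Old</title><p>messy   <b>unclosed");
OutputDocument output = new OutputDocument(source);
for (Element title : source.getAllElements(HTMLElementName.TITLE)) {
    output.replace(title.getContent(), "New");
}
// everything outside the replaced segment is reproduced byte-for-byte,
// including the stray whitespace and the unclosed tags
System.out.println(output);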

[1] http://jericho.htmlparser.net/docs/index.html

[+3] [2013-05-02 01:06:10] Mark Butler

Two other options are HTMLCleaner [1] and HTMLParser [2].

I have tried most of the parsers here for a crawler / data extraction framework I have been developing. I use HTMLCleaner for the bulk of the data extraction work. This is because it supports reasonably modern dialects of HTML, XHTML, and HTML 5, with namespaces, and it supports the DOM, so it is possible to use it with Java's built-in XPath implementation [3].
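A minimal sketch of that HTMLCleaner-to-DOM-to-XPath pipeline, with org.htmlcleaner.DomSerializer doing the bridging:

import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

CleanerProperties props = new CleanerProperties();
TagNode tagNode = new HtmlCleaner(props).clean(new URL("http://stackoverflow.com/questions/3152138"));
Document document = new DomSerializer(props).createDOM(tagNode); // now Java's built-in XPath applies

XPath xpath = XPathFactory.newInstance().newXPath();
String title = (String) xpath.evaluate("//title", document, XPathConstants.STRING);
System.out.println(title);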

It's a lot easier to do this with HTMLCleaner than with some of the other parsers: JSoup, for example, supports a DOM-like interface rather than the W3C DOM, so some assembly is required. Jericho has a SAX-like interface, so again it requires some work, although Sujit Pal has a good description of how to do this [4]; in the end, HTMLCleaner just worked better.

I also use HTMLParser and Jericho for a table extraction task, which replaced some code written using Perl's libhtml-tableextract-perl [5]. I use HTMLParser to filter the HTML for the table, then use Jericho to parse it. I agree with MJB's and Adam's comments that Jericho is good in some cases because it preserves the underlying HTML. It has a kind of non-standard SAX interface, so for XPath processing HTMLCleaner is better.
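The filtering step looks roughly like this (a sketch, not the exact code from the framework):

import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

Parser parser = new Parser("http://stackoverflow.com/questions/3152138");
NodeList tables = parser.extractAllNodesThatMatch(new TagNameFilter("table"));
for (int i = 0; i < tables.size(); i++) {
    // each table fragment can then be handed to Jericho for the fine-grained parse
    String tableHtml = tables.elementAt(i).toHtml();
}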

Parsing HTML in Java is a surprisingly hard problem, as all the parsers seem to struggle on certain types of malformed HTML content.

[1] http://htmlcleaner.sourceforge.net/
[2] http://htmlparser.sourceforge.net/
[3] https://stackoverflow.com/questions/9022140/using-xpath-contains-against-html-in-java
[4] http://sujitpal.blogspot.tw/2009/04/xpath-over-html-using-jericho-and-jaxen.html
[5] http://search.cpan.org/~msisk/HTML-TableExtract-2.11/lib/HTML/TableExtract.pm
