JTidy or Jsoup for Java

Question

I recently developed web scrapers in Python using BeautifulSoup. Now I want to know which libraries are preferred in Java. From some searching, I mostly see JTidy and JSoup mentioned. What is the difference between the two?

+5

java web-crawler web scraping screen-scraping

torayeff Sep 15 '12 at 16:23

1 answer

João Silva · Accepted Answer · 2012-09-15T16:32:44+0000

JTidy is more often used to tidy HTML, that is, to correct malformed or faulty HTML such as unclosed tags, for example turning <div><span>text</div> into <div><span>text</span></div>.

JSoup, on the other hand, provides a full-blown API for parsing HTML and for extracting parts of it. It lets you find elements with jQuery-like selectors, or with DOM methods equivalent to the ones you would use in JavaScript, such as getElementById. I would say that JSoup is really the Java equivalent of BeautifulSoup.

For example, to extract the first paragraph of a Wikipedia article using JSoup, you can use the following:

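import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://en.wikipedia.org/wiki/Potato";
Document doc = Jsoup.connect(url).get(); // fetches and parses the page; throws IOException
Elements paragraphs = doc.select(".mw-content-ltr p"); // every <p> inside the article body
String firstParagraph = paragraphs.first().text(); // first() returns null if nothing matched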

Or, to extract the title from this very question:

Document doc = Jsoup.connect("http://stackoverflow.com/questions/12439078/jtidy-or-jsoup-for-java").get();
String question = doc.select("#question-header a").text(); // JTidy or Jsoup for Java
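The same selector and DOM-style calls also work on a string that is already in memory, which is handy for experimenting without a network request. Below is a minimal sketch, assuming jsoup is on the classpath; the class name JsoupSketch, the sample markup, and the "intro" id and "note" class are made up for illustration. Note that the sample markup has an unclosed <span>, which the parser copes with on its own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    // Hypothetical sample markup; the <span> is deliberately left unclosed.
    static final String HTML =
        "<div id=\"intro\"><span>text</div>"
        + "<p class=\"note\">first</p><p class=\"note\">second</p>";

    // jQuery-like CSS selector: text of the first <p class="note">, or "" if none.
    static String firstNote(Document doc) {
        Element first = doc.select("p.note").first();
        return first == null ? "" : first.text();
    }

    // DOM-style lookup, like document.getElementById in JavaScript.
    static String introText(Document doc) {
        Element intro = doc.getElementById("intro");
        return intro == null ? "" : intro.text();
    }

    public static void main(String[] args) {
        // Parse from a string instead of fetching a URL.
        Document doc = Jsoup.parse(HTML);
        System.out.println(firstNote(doc));
        System.out.println(introText(doc));
    }
}
```

Both helpers check for null because first() and getElementById return null when nothing matches, which is easy to hit when a page's markup changes.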

Pretty good API, huh? :-)
