JTidy or Jsoup for Java

I recently developed web scrapers in Python using BeautifulSoup. Now I want to know which libraries are preferred in Java. I did some searching, and mostly I see JTidy and JSoup mentioned. What is the difference between the two?

1 answer

JTidy is more often used to tidy HTML, that is, to repair malformed or invalid HTML such as unclosed tags, for example turning <div><span>text</div> into <div><span>text</span></div>.
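
As a rough illustration of that kind of repair, here is a minimal sketch. It assumes the JTidy jar (package org.w3c.tidy) is on the classpath; the class name JTidySketch and the helper tidy(...) are hypothetical names made up for this example, and the exact layout of the output depends on the JTidy version and settings.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

// Hypothetical helper class for illustration; not part of JTidy itself.
class JTidySketch {

    // Runs the given markup through JTidy and returns the cleaned-up document.
    static String tidy(String html) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit well-formed XHTML
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // do not print warnings such as "missing </span>"

        // The sample input is plain ASCII, so the byte encoding is not an issue here.
        ByteArrayInputStream in =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);          // parse the input and write the repaired markup
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The unclosed <span> from the example above.
        System.out.println(tidy("<div><span>text</div>"));
    }
}
```

Note that JTidy wraps a fragment like this in a complete document (html, head, body), so the result is a full page rather than just the repaired div.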

JSoup, on the other hand, provides a full-blown API for parsing HTML and for extracting parts of it. It lets you use jQuery-like selectors to find elements, or DOM methods equivalent to those you would use in JavaScript, such as getElementById. I would say that JSoup is really the Java equivalent of BeautifulSoup.

For example, to extract the first paragraph of a Wikipedia article using JSoup, you can use the following:

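import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://en.wikipedia.org/wiki/Potato";
Document doc = Jsoup.connect(url).get(); // fetches and parses the page; throws IOException on failure
Elements paragraphs = doc.select(".mw-content-ltr p"); // every <p> inside the content container
String firstParagraph = paragraphs.first().text(); // note: first() returns null if nothing matched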

Or, to extract the title from this very question:

Document doc = Jsoup.connect("http://stackoverflow.com/questions/12439078/jtidy-or-jsoup-for-java").get();
String question = doc.select("#question-header a").text(); // JTidy or Jsoup for Java
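
If you want to try the selector and DOM-style calls without hitting the network, something like the following sketch should work. It assumes the jsoup jar is on the classpath; the class name JsoupSelectorSketch, its method names, and the sample markup are all made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the HTML below is invented sample markup.
class JsoupSelectorSketch {

    static final String HTML =
        "<div id=\"main\"><p class=\"intro\">Hello <b>world</b></p><p>Second</p></div>";

    // jQuery-like CSS selector: the paragraph with class "intro" inside #main.
    static String introText() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("#main p.intro").text();
    }

    // DOM-style lookup: find the element by id, then count its child elements.
    static int mainChildCount() {
        Document doc = Jsoup.parse(HTML);
        Element main = doc.getElementById("main");
        return main.children().size();
    }

    public static void main(String[] args) {
        System.out.println(introText());      // text of the intro paragraph
        System.out.println(mainChildCount()); // number of <p> children of #main
    }
}
```

Jsoup.parse(String) builds the same kind of Document that Jsoup.connect(url).get() returns, so selectors tested this way carry over to fetched pages.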

Pretty good API, huh? :-)
