Parsing a large stream of HTML with Jsoup

Can anyone offer pointers or advice on how to parse an extremely large HTML stream or file? For example, I have a table with approximately 270,000 rows, and I would like to bring it into my application about 20,000 rows at a time. The jsoup parse method accepts HTML fragments, but I don't see what the most efficient and clean way would be to read the XXX bytes that make up such a fragment.

Any help is most appreciated.

1 answer

If this is XHTML and you do not need to hold all of it in memory at once, your best bet is a SAX parser: pick out the data you need from the start-tag and end-tag events.
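A minimal sketch of that idea, using the SAX parser that ships with the JDK. The class name SaxRowBatcher, the batch callback, and the plain tr/td table layout are assumptions for illustration, not something from the question. It also assumes the input is well-formed XML and that rows are not nested inside cells. A DOCTYPE that points at an external DTD may make the parser try to fetch it, so treat this as a starting point only.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams a well-formed XHTML table and hands rows over in batches.
class SaxRowBatcher {

    static void parse(InputStream in, int batchSize,
                      Consumer<List<List<String>>> onBatch) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private List<List<String>> batch = new ArrayList<>();
            private List<String> row;    // non-null only while inside a <tr>
            private StringBuilder cell;  // non-null only while inside a <td>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("tr")) {
                    row = new ArrayList<>();
                } else if (qName.equals("td")) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it.
                if (cell != null) {
                    cell.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("td") && row != null && cell != null) {
                    row.add(cell.toString().trim());
                    cell = null;
                } else if (qName.equals("tr") && row != null) {
                    batch.add(row);
                    row = null;
                    if (batch.size() == batchSize) {
                        flush();
                    }
                }
            }

            @Override
            public void endDocument() {
                // Deliver the final, possibly smaller, batch.
                if (!batch.isEmpty()) {
                    flush();
                }
            }

            private void flush() {
                onBatch.accept(batch);
                batch = new ArrayList<>();
            }
        };
        // The default factory is not namespace-aware, so element names arrive in qName.
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>";
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        parse(in, 2, batch -> System.out.println("batch of " + batch.size() + " rows"));
    }
}
```

For the real table you would pass a FileInputStream or network stream and a batch size of 20,000. Only the current batch is held in memory, since each batch is handed to the callback and then replaced with a fresh list.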

Another option is a StAX parser, which lets your code pull events on demand instead of receiving callbacks.
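A pull-style sketch with the JDK's StAX API might look like the following. StaxRowReader and nextBatch are hypothetical names, and the code assumes well-formed XHTML whose td cells contain text only, because getElementText fails on cells that contain child elements.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: the caller asks for the next batch of rows when it is ready.
class StaxRowReader implements AutoCloseable {
    private final XMLStreamReader reader;

    StaxRowReader(InputStream in) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser not to process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        reader = factory.createXMLStreamReader(in);
    }

    /** Reads up to max rows; returns an empty list once the document is exhausted. */
    List<List<String>> nextBatch(int max) throws Exception {
        List<List<String>> batch = new ArrayList<>();
        List<String> row = null;
        while (batch.size() < max && reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("tr")) {
                    row = new ArrayList<>();
                } else if (name.equals("td") && row != null) {
                    // Consumes the cell text and leaves the reader on the closing </td>.
                    row.add(reader.getElementText().trim());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("tr") && row != null) {
                batch.add(row);
                row = null;
            }
        }
        return batch;
    }

    @Override
    public void close() throws Exception {
        reader.close();
    }
}
```

Calling nextBatch(20000) in a loop until it returns an empty list processes the table in chunks. The reader stops right after the last row of each batch, so the next call resumes from that position in the stream.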
