Removing HTML entities while preserving line breaks with JSoup

Question

I have been using JSoup to parse text, and so far it has been great, but I have run into a problem.

I can use Node.html() to return the full HTML of the desired node, which preserves line breaks like this:

Gl&oacute;andi augu, silfurn&aacute;tt
<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;
<br />&Oacute;&eth;ur hundur er &iacute; v&iacute;gam&oacute;&eth;, &iacute; maga... m&eacute;r
<br />
<br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r
<br />Kolni&eth;ur svart, hvergi bjart n&eacute;

Unfortunately, this has the side effect of keeping the HTML entities and tags.

However, if I use Node.text(), I get a much cleaner result, free of tags and entities:

Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,

That has a different unfortunate side effect: the line breaks are removed and everything is compressed onto a single line.

Simply replacing <br /> with a newline character before calling Node.text() gives the same result, so it seems the method itself collapses the text onto one line and ignores newlines.

Is there a way to strip the tags and entities while keeping the line breaks?

+3

java html parsing jsoup

joshschreuder, 18 Mar '11 at 5:28

2 answers

fooobar.com/questions/76182/...

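    // Mark every <br> and every source newline with a placeholder so that text() cannot collapse them.
    String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
    // Turn the placeholders (and the space text() may leave after them) back into newlines.
    text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();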

+1

petrumo 13 . '12 15:54

qwerty · Accepted Answer · 2011-03-18T05:44:20+0000

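From a (brief) look at the API, it seems you could iterate over the node's children yourself: append the text of each text node, and add a line break whenever you hit a <br>.

TextNode.getWholeText() may also be useful here.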
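
A rough sketch of that child-walking idea, in case it helps. The class name BrToNewline and the helper textWithLineBreaks are made up for illustration and are not part of JSoup; the sketch also assumes that whitespace in the HTML source (including raw newlines) is insignificant and that only <br> should produce a line break. It relies on Jsoup.parseBodyFragment(), Node.childNodes(), Element.tagName() and TextNode.getWholeText().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class BrToNewline {

    // Hypothetical helper (not a JSoup API): returns the text of an HTML
    // fragment with entities decoded and each <br> turned into a newline.
    public static String textWithLineBreaks(String html) {
        Element body = Jsoup.parseBodyFragment(html).body();
        StringBuilder out = new StringBuilder();
        appendText(body, out);
        // Drop the single space that may surround each inserted newline.
        return out.toString().replaceAll(" ?\\n ?", "\n").trim();
    }

    // Walks the children recursively, in document order.
    private static void appendText(Node node, StringBuilder out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                // getWholeText() gives the decoded text without JSoup's own
                // normalization; source whitespace is collapsed here by
                // assumption, so only <br> can start a new line.
                out.append(((TextNode) child).getWholeText().replaceAll("\\s+", " "));
            } else if (child instanceof Element) {
                if ("br".equals(((Element) child).tagName())) {
                    out.append('\n');
                } else {
                    appendText(child, out);
                }
            }
        }
    }

    public static void main(String[] args) {
        String html = "Gl&oacute;andi augu, silfurn&aacute;tt\n"
                + "<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;\n"
                + "<br />\n"
                + "<br />Kolni&eth;ur gref";
        System.out.println(textWithLineBreaks(html));
    }
}
```

Compared with the placeholder trick, this never puts marker strings into the text, so a document that happens to contain the marker itself is not affected. Block-level tags such as <p> are not treated as line breaks in this sketch; that would need an extra branch.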
