Removing HTML entities while preserving line breaks with JSoup

I have been using JSoup to parse text, and so far it has been great, but I have run into a problem.

I can use Node.html() to return the full HTML of the desired node, which preserves line breaks like this:

Glóandi augu, silfurnátt
<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;
<br />&Oacute;&eth;ur hundur er &iacute; v&iacute;gam&oacute;&eth;, &iacute; maga... m&eacute;r
<br />
<br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r
<br />Kolni&eth;ur svart, hvergi bjart n&eacute;

Unfortunately, this has the side effect of keeping the HTML entities and tags.

However, if I use Node.text(), I get a cleaner result, free of tags and entities:

Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,

This has another unfortunate side effect: it removes the line breaks and compresses everything onto a single line.

Simply replacing <br /> with a newline in the node before calling Node.text() gives the same result; the method itself seems to compress the text onto one line, ignoring newlines.

Is there a way to strip the tags and entities while keeping the line breaks?


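Going by the API, you would have to walk the node's children yourself and handle the text nodes and the <br> elements separately.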

TextNode.getWholeText() may help here.
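
One way to keep the breaks is to walk the child nodes by hand. The following is only a rough sketch of that idea, not a tested recipe: the class BreakPreservingText and the method textWithBreaks are hypothetical names, not part of jsoup, and the code assumes that newlines in the source markup are pure formatting (as in the html() output above, where each one sits next to a <br />), so it drops them and emits a newline only for <br> elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

final class BreakPreservingText {

    // Hypothetical helper (not a jsoup API): parse an HTML fragment and
    // return its text with every <br> turned into a newline.
    static String textWithBreaks(String html) {
        StringBuilder out = new StringBuilder();
        appendText(Jsoup.parseBodyFragment(html).body(), out);
        return out.toString().trim();
    }

    // Depth-first walk over the children of the given node.
    private static void appendText(Node node, StringBuilder out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                // getWholeText() returns the decoded text without whitespace
                // normalization. Assumption: source newlines are formatting
                // only, so they are removed here.
                String raw = ((TextNode) child).getWholeText();
                out.append(raw.replace("\r", "").replace("\n", ""));
            } else if (child instanceof Element
                    && "br".equals(((Element) child).tagName())) {
                out.append('\n');
            } else {
                // Any other element: descend into its children.
                appendText(child, out);
            }
        }
    }

    public static void main(String[] args) {
        String html = "Gl&oacute;andi augu, silfurn&aacute;tt\n"
                + "<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;";
        System.out.println(textWithBreaks(html));
    }
}
```

Entities are decoded by the parser, so nothing extra is needed for them. If the markup relies on newlines as word separators, replacing them with a space instead of removing them would probably be the safer choice.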


fooobar.com/questions/76182/...

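    // Swap <br> tags and literal newlines for a marker that survives text(), which collapses whitespace
    String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
    // Turn the markers (and the space text() may leave after them) back into real newlines
    text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();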

