Prevent lxml from touching data in <script> tags

Question

I am trying to write a Python script that modifies the contents of a <script> tag in the files I am processing. I use lxml.html (rather than BeautifulSoup, etc.) for this because of its speed. The contents of the script tag are wrapped in comment markers (<!-- and -->):

<script>
<!--
...
-->
</script>

The problem is that when I try something like scriptNode.text = '<!-- ...', lxml changes the angle brackets to their HTML entities (&lt; and &gt;) when I write the HTML back to the file. I tried escaping them in the string ('\<...'), but that does not seem to help.
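A minimal reproduction of the symptom, as a sketch: the page source and the script body here are made-up stand-ins, and the file is replaced by an in-memory buffer. It assumes the tree is written with write(), whose default serialization method in lxml is XML.

```python
import io

import lxml.html

# Hypothetical stand-in for one of the files being processed.
source = "<html><head><script><!--\nvar a = 1;\n--></script></head><body></body></html>"

root = lxml.html.document_fromstring(source)
script_node = root.find(".//script")
script_node.text = "<!--\nvar b = '<b>bold</b>';\n-->"

# write() serializes as XML unless told otherwise, so < and > in text are escaped.
buffer = io.BytesIO()
root.getroottree().write(buffer, pretty_print=False)
written = buffer.getvalue().decode("ascii")

print(written)
```

The printed markup contains &lt;!-- and &lt;b&gt; inside the <script> element instead of the literal text that was assigned.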

Looking at most modern websites, it seems these comment markers are not needed. I could remove them, but many scripts also contain some HTML, and if that is converted to entities as well, it is a problem.

I am surprised that lxml modifies this data at all; I had heard that HTML parsers are designed to avoid changing or interpreting data in <script> tags.

Is there a parameter / command that I can use to prevent this?

Thanks

+3

python html-parsing lxml

Alexander Tsepkov Jun 16 '11 at 17:22

2 answers

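Serialize with tostring() rather than write():

import lxml.html

# lxml.html.tostring() uses the HTML serializer, which leaves <script> text unescaped.
# It returns bytes by default, so open the file in binary mode.
with open('file.html', 'wb') as main:
    main.write(lxml.html.tostring(htmlTree))

# write() defaults to XML serialization, which escapes < and > inside <script>.
htmlTree.write('file.html', pretty_print=False)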
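The two calls above differ in which serializer they use, which can be checked with a short sketch. The document and script body are hypothetical, and the files go to a temporary directory. Passing method="html" to write() is an alternative not mentioned in the thread; it is shown here on the assumption that keeping write() is preferred.

```python
import os
import tempfile

import lxml.html

# Hypothetical document with an empty <script> to fill in.
root = lxml.html.document_fromstring(
    "<html><head><script></script></head><body></body></html>"
)
root.find(".//script").text = "<!--\nif (1 < 2) { document.write('<b>hi</b>'); }\n-->"

with tempfile.TemporaryDirectory() as workdir:
    tostring_path = os.path.join(workdir, "tostring.html")
    write_path = os.path.join(workdir, "write.html")

    # lxml.html.tostring() serializes as HTML and returns bytes by default.
    with open(tostring_path, "wb") as handle:
        handle.write(lxml.html.tostring(root))

    # write() can be asked for the HTML serializer explicitly.
    root.getroottree().write(write_path, method="html")

    with open(tostring_path, "rb") as handle:
        tostring_output = handle.read().decode("ascii")
    with open(write_path, "rb") as handle:
        write_output = handle.read().decode("ascii")

print(tostring_output)
print(write_output)
```

In both outputs the script body is written as assigned, with its angle brackets intact.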

, , , CDATA, , .

0

Alexander Tsepkov Jun 16 '11 at 18:15

Ignacio Vazquez-Abrams · Accepted Answer · 2011-06-16T17:41:10+0000

Put them in a CDATA section.
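The answer does not say how to do this. One possible reading in lxml terms, sketched here as an assumption, is to wrap the script text in etree.CDATA before serializing as XML. The element and script body below are illustrative only.

```python
from lxml import etree

# Illustrative element; a <script> node from a parsed tree would be used in practice.
script = etree.Element("script")

# CDATA-wrapped text is emitted verbatim by the XML serializer.
script.text = etree.CDATA("if (a < b) { run(); }")

serialized = etree.tostring(script, encoding="unicode")
print(serialized)
```

This suits XML/XHTML output. A browser parsing the page as HTML treats the CDATA markers as part of the script source, so they are usually hidden behind JavaScript comments in that case.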
