Prevent lxml from touching data in <script> tags

I am trying to write a Python script that modifies the contents of a <script> tag in the files I am processing. I use lxml.html (rather than BeautifulSoup, etc.) for this because of its speed. The contents of the script tag are wrapped in comment markers (<!-- and -->):

<script>
<!--
...
-->
</script>

The problem is that when I try something like scriptNode.text = '<!-- ...', lxml changes the angle brackets to their HTML entities (&lt; and &gt;) when I write the HTML back to the file. I tried escaping them in the string ('\<...'), but that does not seem to help.

Looking at most modern websites, it seems that these comment markers are not needed. I could delete them, but many scripts also contain HTML, and if that is converted to entities as well, it is a problem.

I am surprised that lxml modifies this data at all; I recently heard that HTML parsers are designed to avoid changing or interpreting data in <script> tags.

Is there a parameter / command that I can use to prevent this?

Thanks.

2 answers

Put them in a CDATA section.
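
A minimal sketch of what this could look like with lxml, assuming the tree is serialized as XML (where text would otherwise be escaped). The script body below is a made-up placeholder, not taken from the question; etree.CDATA is lxml's wrapper for marking text as a CDATA section.

```python
from lxml import etree

# Hypothetical script body, used only for illustration.
js = "<!--\nif (a < b) { document.write('<b>hi</b>'); }\n-->"

script = etree.Element("script")
script.text = etree.CDATA(js)  # mark the text as a CDATA section

# XML serialization keeps the CDATA content verbatim instead of escaping it.
out = etree.tostring(script, encoding="unicode")
print(out)
```

Note that a CDATA section cannot itself contain the sequence ]]>, so a script body that includes it would need to be handled differently.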


[...] tostring() [...] write():

import lxml.html

# tostring() returns bytes, so open the file in binary mode
with open('file.html', 'wb') as main:
    main.write(lxml.html.tostring(htmlTree))

htmlTree.write('file.html', pretty_print=False)

[...] CDATA [...].
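
For comparison, here is a small sketch of how the serialization method affects <script> content in lxml. It assumes the escaping described in the question comes from XML-style output, which is the default for etree.tostring() and for ElementTree.write(); the page and script body are hypothetical, and a BytesIO object stands in for the output file.

```python
import io

import lxml.html
from lxml import etree

# Hypothetical page and script body, used only for illustration.
doc = lxml.html.fromstring(
    "<html><head><script></script></head><body><p>x</p></body></html>"
)
script = doc.find(".//script")
script.text = "<!--\nif (a < b) { document.write('<b>hi</b>'); }\n-->"

# HTML serialization: <script> content is written out unescaped.
html_bytes = lxml.html.tostring(doc)

# XML serialization (the default for etree.tostring and ElementTree.write):
# angle brackets in text become &lt; and &gt;.
xml_bytes = etree.tostring(doc)

# write() accepts a method argument too; a BytesIO stands in for the file.
buf = io.BytesIO()
doc.getroottree().write(buf, method="html")
written = buf.getvalue()

print(html_bytes.decode())
print(xml_bytes.decode())
```

If the output file shows &lt; and &gt; inside the script, checking which method the serializing call uses is a reasonable first step.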
