I am trying to write a Python script that modifies the contents of a <script> tag in the files I am processing. I use lxml.html (rather than BeautifulSoup, etc.) for this because of its speed. The contents of the script tag are surrounded by comment tags (<!-- and -->):
<script>
</script>
The problem is that when I try something like scriptNode.text = '<!-- ...', lxml changes the angle brackets to their HTML entities (&lt; and &gt;) when I write the HTML back to the file. I tried escaping them in the string ('\<...'), but that does not seem to help.
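A minimal sketch of the behaviour described above. The code that writes the file is not shown here, so this assumes the escaping comes from serializing with `etree.tostring` in its default XML mode. The sample markup and the names `doc` and `script_node` are made up for illustration. For comparison, the same tree is also serialized with `lxml.html.tostring`, which uses the HTML output method.

```python
import lxml.html
from lxml import etree

# Hypothetical sample document; the real files are not shown in the post.
doc = lxml.html.fromstring(
    "<html><head><script></script></head><body></body></html>"
)

# Locate the <script> element and give it comment-wrapped content.
script_node = doc.find(".//script")
script_node.text = "<!--\nif (a < b) { run(); }\n-->"

# Default XML serialization: text content is entity-escaped.
as_xml = etree.tostring(doc, encoding="unicode")

# HTML serialization: <script> content is written out as-is.
as_html = lxml.html.tostring(doc, encoding="unicode")

print(as_xml)
print(as_html)
```

If the file is written through `ElementTree.write`, passing `method="html"` selects the same HTML serializer. Whether this matches the actual write path in the script is an assumption.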
Looking at most modern websites, it seems that these comment tags are no longer needed. I could delete them, but many scripts also contain HTML, and if that is converted to entities as well, it is a problem.
I am surprised that lxml modifies this data at all; I recently read that HTML parsers are designed to avoid changing or interpreting the contents of <script> tags.
Is there a parameter or option that I can use to prevent this?
Thanks