Python lxml xpath returns escape characters in a list with text

Until last week, my experience with Python was very limited to large database files on our network, and suddenly I entered the world trying to extract information from html tables.

After a lot of reading, I decided to use lxml and xpath with Python 2.7 to get the data. I got one field using the following code:

xpath = "//table[@id='resultsTbl1']/tr[position()>1]/td[@id='row_0_partNumber']/child::text()" 

who prepared the following list:

['\r\n\t\tBAR18FILM/BKN', '\r\n\t\t\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t\r\n\t\t']

I recognized the CR / LF and tab escape characters, I was wondering how to avoid them?

+3
source share
1 answer

These characters are part of the XML document, so they are returned. You cannot escape them, but you can break them. You can call a method .strip()for each returned item:

results = [x.strip() for x in results]

. , .

, script:

#!/usr/bin/python

from lxml import etree

with open('data.xml') as fd:
    doc = etree.parse(fd)

results = doc.xpath(
    "//table[@id='results']/tr[position()>1]/td/child::text()")

print 'Before stripping'
print repr(results)

print 'After stripping'
results = [x.strip() for x in results]
print repr(results)

:

<doc>
  <table id="results">
    <tr>
      <th>ID</th><th>Name</th><th>Description</th>
    </tr>

    <tr>
      <td>
      1
      </td>
      <td>
      Bob
      </td>
      <td>
      A person
      </td>
      </tr>
    <tr>
      <td>
      2
      </td>
      <td>
      Alice
      </td>
      <td>
      Another person
      </td>
    </tr>
  </table>
</doc>

:

Before stripping
['\n\t\t\t1\n\t\t\t', '\n\t\t\tBob\n\t\t\t', '\n\t\t\tA person\n\t\t\t', '\n\t\t\t2\n\t\t\t', '\n\t\t\tAlice\n\t\t\t', '\n\t\t\tAnother person\n\t\t\t']
After stripping
['1', 'Bob', 'A person', '2', 'Alice', 'Another person']
0

All Articles