Using Beautiful Soup for HTML tables that are missing </td> tags

I am struggling to parse some flaky HTML tables into lists with Beautiful Soup. The tables in question have no </td> tags.

The following code (not the real tables I parse, but functionally similar):

import bs4

test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"

def walk_table2(text):
    "Take an HTML table and spit out a list of lists (of entries in a row)."
    # No parser is named here, so Beautiful Soup picks one itself.
    soup = bs4.BeautifulSoup(text)
    return [[x for x in row.findAll('td')] for row in soup.findAll('tr')]

print(walk_table2(test))

Gives me:

[[<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>], [<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>]]

Instead of the expected:

[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]

It seems that the lxml parser, which Beautiful Soup uses, adds each missing </td> tag before the next </tr> rather than before the next <td>.

At this point, I wonder whether there is a parser option that puts the closing </td> tags in the right place, or whether it would be easier to insert them with a regular expression before passing the string to BeautifulSoup. Any thoughts? Thanks in advance!


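This looks like a quirk of Python's built-in HTML parser, which Beautiful Soup falls back on when no other parser is installed. With html5lib or lxml as the parser, the cells come out as expected: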

>>> soup = bs4.BeautifulSoup(test, "lxml")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]

>>> soup = bs4.BeautifulSoup(test, "html5lib")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
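
If neither lxml nor html5lib can be installed, the implied cell boundaries can probably also be recovered with the standard library alone. The following is only a sketch, written for Python 3 and assuming flat (non-nested) tables. The names TableWalker and walk_table are made up for this example, and it returns the text of each cell rather than Beautiful Soup tag objects.

```python
from html.parser import HTMLParser


class TableWalker(HTMLParser):
    """Collect cell text row by row.

    A new <td>/<th>, a <tr>, a </tr> or a </table> is treated as the end
    of whatever cell is still open, so missing </td> tags do not matter.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the current row, None outside a row
        self._cell = None  # text pieces of the open cell, None outside a cell

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:  # cell without an enclosing <tr>
                self._row = []
            self._close_cell()
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def finish(self):
        """Flush a row left open at the end of the input and return all rows."""
        self._close_row()
        return self.rows


def walk_table(text):
    """Return the table in `text` as a list of rows, each a list of cell strings."""
    parser = TableWalker()
    parser.feed(text)
    parser.close()
    return parser.finish()
```

Calling walk_table(test) on the string from the question should give one list of three strings per row.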

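Check your BeautifulSoup version. BS 3.1 is far less tolerant than 3.0.8 when the HTML is malformed (it raises "bad end tag" errors), so the result depends heavily on the version. BS4 may again behave differently from BS 3.1 here.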


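If you do want to go the regular-expression route, this works for the example:

(Yes, I know: regular expressions on HTML, on stackoverflow. C'MON, ...)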

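import re

# A <td that does not directly follow <tr> must first close the previous cell.
r1 = re.compile(r'(?<!<tr>)<td', re.IGNORECASE)
# The last cell of every row is closed just before </tr>.
r2 = re.compile(r'</tr>', re.IGNORECASE)
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
test = r1.sub('</td><td', test)
test = r2.sub('</td></tr>', test)
print(test)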

The value of test is now:

<table> <tr><td>1</td><td>2</td><td>3</td></tr> <tr><td>1</td><td>2</td><td>3</td></tr> </table>
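
One caveat with the two substitutions: the look-behind only recognises a bare <tr> placed directly before the <td>, so a row tag with attributes or followed by whitespace would get a stray </td>, and markup that already has its </td> tags would get them twice. A possible variation, offered as a sketch rather than a tested fix for real-world pages, tracks whether a cell is open instead. It is written for Python 3, and the name close_cells is made up for this example.

```python
import re

# Matches the four tags that decide where a cell starts or ends:
# <tr ...>, </tr>, <td ...> and </td>.
_TAG = re.compile(r"<(/?)(tr|td)\b[^>]*>", re.IGNORECASE)


def close_cells(html):
    """Insert a </td> before any <td>, <tr> or </tr> met while a cell is still open."""
    open_cell = False

    def fix(match):
        nonlocal open_cell
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        prefix = ""
        if name == "td" and closing:
            open_cell = False      # the cell was closed properly, nothing to add
        elif name == "td":
            if open_cell:
                prefix = "</td>"   # close the previous cell first
            open_cell = True
        else:                      # <tr> or </tr>
            if open_cell:
                prefix = "</td>"   # close the last cell of the row
            open_cell = False
        return prefix + match.group(0)

    return _TAG.sub(fix, html)
```

Run on the test string above, this should produce the same output as the two substitutions, while leaving already well-formed rows unchanged.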