Using Beautiful Soup for HTML tables that are missing </td> tags
I am struggling to parse some flaky HTML tables into lists with Beautiful Soup. The tables in question have no </td> tags.
The following code (not the real tables I parse, but functionally similar):
import bs4
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
def walk_table2(text):
    "Take an HTML table and spit out a list of lists (of entries in a row)."
    soup = bs4.BeautifulSoup(text)
    return [[x for x in row.findAll('td')] for row in soup.findAll('tr')]

print(walk_table2(test))
Gives me:
[[<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>], [<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>]]
Instead of the expected:
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
It seems that the lxml parser, which Beautiful Soup uses, adds each closing </td> tag before the next </tr> rather than before the next <td>.
At this point I am wondering whether there is a parser option that puts the closing td tags in the right place, or whether it would be easier to insert them manually with a regular expression before passing the string to BeautifulSoup. Any thoughts? Thanks in advance!
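It looks like you are using Python's built-in HTML parser. If so, pass a different parser to Beautiful Soup. Both html5lib and lxml handle this table correctly: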
>>> soup = bs4.BeautifulSoup(test, "lxml")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
>>> soup = bs4.BeautifulSoup(test, "html5lib")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
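To see why the parser choice can matter, here is a small sketch that uses only the standard library's html.parser.HTMLParser. The TagLogger class is a hypothetical helper written for this illustration, not part of Beautiful Soup. It records the tag events for one row of the test table. No end event is ever reported for td, so any tree builder sitting on top of these events has to decide for itself where each cell closes.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: record start/end tag events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
logger.feed("<tr><td>1<td>2<td>3</tr>")
logger.close()
events = logger.events
print(events)
# [('start', 'tr'), ('start', 'td'), ('start', 'td'), ('start', 'td'), ('end', 'tr')]
```

html5lib and lxml, by contrast, apply HTML's rules for implied end tags, which would explain the flat result shown above.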
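Alternatively, you can add the missing tags with regular expressions first:
(Yes, yes, I know: parsing HTML with regex on Stack Overflow, c'mon...)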
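import re
# Match every <td that does not directly follow <tr> (every cell but the first in a row).
r1 = re.compile(r'(?<!<tr>)<td', re.IGNORECASE)
# Match the end of each row.
r2 = re.compile(r'</tr>', re.IGNORECASE)
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
# Close the previous cell before each later <td, then close the last cell before </tr>.
test = r1.sub('</td><td', test)
test = r2.sub('</td></tr>', test)
print(test)
Output of test: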
<table> <tr><td>1</td><td>2</td><td>3</td></tr> <tr><td>1</td><td>2</td><td>3</td></tr> </table>
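One caveat: the lookbehind in r1 only skips a <td that directly follows a literal <tr>, so a row written as <tr> <td> (with whitespace) or <tr class="x"><td> would get a stray </td> before its first cell. A possible alternative is to walk the tr/td tags in order and track whether a cell is currently open. The sketch below assumes the tables are not nested and use only td cells (no th). The names close_cells and TAG_RE are made up for this illustration.

```python
import re

# Matches <tr ...>, </tr>, <td ...> and </td>; group 1 is "/" for closing tags.
TAG_RE = re.compile(r"<(/?)(tr|td)\b[^>]*>", re.IGNORECASE)


def close_cells(html):
    """Insert </td> wherever a cell is still open at the next <td>, <tr> or </tr>."""
    out = []
    pos = 0
    cell_open = False
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # text between the previous tag and this one
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name == "td" and not closing:
            if cell_open:
                out.append("</td>")  # previous cell was never closed
            cell_open = True
        elif name == "td":
            cell_open = False  # explicit </td> already present
        else:
            if cell_open:
                out.append("</td>")  # row boundary reached with a cell still open
            cell_open = False
        out.append(m.group(0))
        pos = m.end()
    out.append(html[pos:])
    return "".join(out)


test_html = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
print(close_cells(test_html))
print(close_cells("<tr> <td>a<td>b</tr>"))
```

Rows that already contain their </td> tags pass through unchanged, so the function can be applied to a mix of well-formed and broken tables before handing the string to BeautifulSoup.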