How to extract tables from websites in Python

Question

The page at

http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500

contains a table. My goal is to extract the table and save it to a CSV file. I wrote this code:

import urllib
import os

web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")

s = web.read()
web.close()

ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()

I'm lost here. Can anyone help with this? Thanks!

+3

python urllib

Bill tp May 11 '12 at 17:33

7 answers

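Pandas can do this out of the box, so you don't have to parse the HTML yourself. read_html() extracts all tables from the HTML and puts them in a list of dataframes. to_csv() converts a dataframe to a CSV file. For the page in your example, the relevant table is the last one, which is why the code below uses df_list[-1].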

import requests
import pandas as pd

url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')

Or, more compactly, as a one-liner:

pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)

+25

MarredCheese 12 '17 at 18:36

BeautifulSoup is one option. As an alternative, I posted a small TableParser class to Pastebin:

http://pastebin.com/RPNbtX8Q

It can be used like this:

import csv
from urllib2 import Request, urlopen
from TableParser import TableParser  # module from the Pastebin link above

url_addr = 'http://foo/bar'
req = Request(url_addr)
url = urlopen(req)
tp = TableParser()
tp.feed(url.read())

# NOTE: Here you need to know exactly how many tables are on the page and which one
# you want. Let's say it is the first table.
my_table = tp.get_tables()[0]
filename = 'table_as_csv.csv'
# Python 2: the csv module expects the file to be opened in binary mode
with open(filename, 'wb') as f:
    writer = csv.writer(f)
    for row in my_table:
        writer.writerow(row)

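Note that this depends on the TableParser module from the Pastebin link, so you need to save that code locally first.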
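The Pastebin code is not reproduced here, so the following is only a rough idea of what such a parser could look like in Python 3, built on the standard library's html.parser. The class name SimpleTableParser and the sample markup are made up for illustration; this is not the Pastebin TableParser, and it does not handle nested tables.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Hypothetical stand-in for the Pastebin TableParser: collects every
    <table> as a list of rows, each row being a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def get_tables(self):
        return self.tables


# Made-up sample markup; a real page would be fetched and decoded first.
sample = """
<table>
  <tr><th>Tract</th><th>Population</th></tr>
  <tr><td>0001.00</td><td>1234</td></tr>
</table>
"""
parser = SimpleTableParser()
parser.feed(sample)
first_table = parser.get_tables()[0]
print(first_table)
```

As with the snippet above, you still have to know which table on the page you want and index get_tables() accordingly.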

+2

aquil.abdullah May 11 '12 at 18:56

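You need to parse the table into an internal data structure and then write it out as CSV.

Use BeautifulSoup to parse the table. Note that older answers on this topic use version 3.0.8, which is out of date; you can still use it, or adapt the instructions to BeautifulSoup 4.

Once you have the table in a data structure (probably a list of lists in this case), you can write it out with csv.writer.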
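For the writing step, here is a minimal Python 3 sketch. The rows are made-up placeholders for whatever the parsing step produces, and the file goes to a temporary directory only so the example is safe to run anywhere. In Python 3 the file is opened in text mode with newline="" rather than in binary mode.

```python
import csv
import os
import tempfile

# Made-up rows standing in for a parsed table: header first, then data.
rows = [
    ["Tract", "Population"],
    ["0001.00", "1234"],
    ["0002.00", "5,678"],  # the comma forces the writer to quote this cell
]

# Temporary location chosen for illustration; use your own path in practice.
path = os.path.join(tempfile.mkdtemp(), "table_as_csv.csv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm the rows survive the round trip.
with open(path, newline="", encoding="utf-8") as f:
    round_trip = list(csv.reader(f))

print(round_trip)
```

Quoting of cells that contain commas or quotes is handled by the writer, which is the main reason to prefer the csv module over joining strings by hand.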

0

Andrew Gorcester May 11 '12 at 17:42

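Look at the BeautifulSoup module. Its documentation has many examples of parsing HTML.

For the CSV part there is a ready-made solution: the csv module.

It should be fairly easy.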

0

Adam May 11 '12 at 17:42

Look at the answer to "parsing table with BeautifulSoup and write in a text file". Also try searching Google for "python beautifulsoup".

0

Kovadim May 11 '12 at 17:42

import requests
import pandas as pd

url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')

0

Sham pat May 27 '19 at 3:27

Vikas · Accepted Answer · 2012-05-11T17:41:49+0000

So, you want to parse the HTML file to get elements from it. You can use BeautifulSoup or lxml for this task.

You already have solutions using BeautifulSoup. I will post the solution with lxml:

from lxml import etree
import urllib

web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()

html = etree.HTML(s)

## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## Get text from all the remaining 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
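To try the same extraction logic without network access, here is a small offline sketch. It swaps lxml for the standard library's xml.etree.ElementTree, which only accepts well-formed markup and supports a limited XPath subset, so treat it as an illustration rather than a replacement. The sample markup and its values are invented; wrapping the header text in a child element is an assumption based on the i[0].text lookup above.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the report page (not the real data).
sample = """
<html><body>
<table id="Report1_dgReportDemographic">
  <tr><th><a>Tract Code</a></th><th><a>Tract Population</a></th></tr>
  <tr><td>0001.00</td><td>1234</td></tr>
  <tr><td>0002.00</td><td></td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(sample)

# Same idea as the XPath above: every 'tr' directly under the target table.
tr_nodes = root.findall(".//table[@id='Report1_dgReportDemographic']/tr")

# Header text sits in the first child of each 'th' (assumed structure).
header = [th[0].text for th in tr_nodes[0].findall("th")]

# .text is None for an empty cell, so fall back to an empty string.
td_content = [
    [(td.text or "").strip() for td in tr.findall("td")]
    for tr in tr_nodes[1:]
]

# Header plus data rows: a list of lists that csv.writer can consume.
table = [header] + td_content
print(table)
```

The None check matters with lxml too, since td.text is None for empty cells and for cells whose text lives inside a child element.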
