I have cleared web data thanks to beautifulsoup, but I have problems converting the output to a matrix / array that I can manipulate.
from bs4 import BeautifulSoup
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://statsheet.com/mcb/teams/duke/game_stats', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'sortable statsb'})
rows = table.findAll('tr')
for tr in rows:
text = []
cols = tr.findAll('td')
for td in cols:
try:
text = ''.join(td.find(text=True))
except Exception:
text = "000"
print text+",",
print
Note: ''.join(td.find(text=True))- this is to prevent program crashes in empty cells.
which outputs:
W, GSU, 32, 42, 74, 24-47, 51.1, 15-23, 65.2, 11-24, 45.8, 6, 25, 31, 17, 4, 6, 15, 19,
W, UK, 33, 42, 75, 26-57, 45.6, 15-22, 68.2, 8-18, 44.4, 11, 20, 31, 16, 6, 6, 8, 17,
W, FGCU, 52, 36, 88, 30-63, 47.6, 19-23, 82.6, 9-31, 29.0, 16, 21, 37, 19, 9, 4, 18, 14,
W, @MINN, 40, 49, 89, 30-55, 54.5, 21-26, 80.8, 8-10, 80.0, 10, 22, 32, 12, 12, 4, 15, 21,
W, VCU, 29, 38, 67, 20-48, 41.7, 24-27, 88.9, 3-15, 20.0, 4, 30, 34, 14, 4, 8, 8, 18,
W, Lville, 36, 40, 76, 24-55, 43.6, 23-27, 85.2, 5-20, 25.0, 8, 25, 33, 13, 8, 6, 14, 20,
W, OSU, 23, 50, 73, 24-51, 47.1, 20-27, 74.1, 5-12, 41.7, 8, 29, 37, 11, 3, 5, 8, 19,
which is perfect, only now I can’t figure out how to get the data into the matrix so that I can manipulate certain columns, add new columns, etc.
I played with numpy, but every time I try, I get something like this:
[u'W,']
[u'GSU,']
[u'32,']
[u'42,']
[u'74,']
[u'24-47,']
[u'51.1,']
[u'15-23,']
[u'65.2,']
[u'11-24,']
[u'45.8,']
I want my cleared data to be able to add columns, move columns, change text in columns, split data in one column into two columns (hyphenated columns).
python. , / - . , .