Cross tab (contingency table) using Python ORM?

Anyone who does even very simple statistical research on data in a relational database has probably had to compute crosstabs, also called contingency tables (see the Wikipedia page). They are needed when you want to count how many items fall into more than one category at a time. For example: how many customers are women and like chocolate?

NumPy and SciPy have ways to do this for matrices, such as histogram2d and its variants. But for meaningful statistical analysis you need a table (with variable names) from which you can specify which variables you want to cross-tabulate. Moreover, it should work for other types of variables, not just numeric ones. In fact, numeric cross-tabulation is the more complex case, since it requires binning. R naturally has a function called table, which could easily be ported to Python.

However, remember that the title mentions an ORM. Why? Because a crosstab is much smaller than the data used to create it: you could have a 2x2 table calculated from billions of records in the database. In serious applications you cannot afford to load all your data into memory and loop over it. You therefore have to translate the cross-tabulation into an SQL query so that the whole calculation is performed by the database engine, and an ORM takes care of the SQL dialect differences so that the same code runs on any database backend. An example of the SQL (in the MySQL dialect) for a simple crosstab can be found here.
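To make the target concrete: the database-side computation amounts to a GROUP BY over the two variables. Here is a minimal sketch using Python's built-in sqlite3 module. The customers table, its column names and its rows are all made up for illustration, and this is plain SQL rather than an ORM.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (gender TEXT, likes_chocolate TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("F", "yes"), ("F", "yes"), ("F", "no"),
     ("M", "yes"), ("M", "no"), ("M", "no")],
)

# The database does the counting; only one row per
# (gender, likes_chocolate) combination is sent back to Python,
# no matter how many records the table holds.
crosstab = conn.execute(
    "SELECT gender, likes_chocolate, COUNT(*) "
    "FROM customers "
    "GROUP BY gender, likes_chocolate "
    "ORDER BY gender, likes_chocolate"
).fetchall()
conn.close()

for row in crosstab:
    print(row)
```

The result is in "long" form, one (category, category, count) tuple per cell, which is what any ORM-based solution would need to reproduce.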

So, now that I think I have motivated the problem, here are the questions: Is this functionality implemented in any Python ORM? How would you implement it using, say, SQLAlchemy or the Django ORM?

1 answer

I am sorry to have to answer my own question, but sometimes we just can't wait for help. And since I found an answer, and a good one, I feel obliged to share it with the community. So here it is:

table = self.session.query(Table.var1, Table.var2, func.count(Table)).group_by(Table.var1, Table.var2).all()

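This returns a list of tuples (var1, var2, count). In my case, on a table with 296110 records, it took .28 seconds, with var1 and var2 having 5 and 90 distinct values.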

To pretty-print the result as a 2D table:

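from collections import defaultdict

def pprint_table(table):
    # table: list of (var1, var2, count) tuples, as returned by the query
    # distinct var2 values become the columns, in first-seen order
    colnames = list(dict.fromkeys(r[1] for r in table))
    col_index = {name: i for i, name in enumerate(colnames)}
    # one row per var1 value, zero-filled for combinations that never occur
    rows = defaultdict(lambda: [0] * len(colnames))
    for var1, var2, count in table:
        rows[var1][col_index[var2]] = count
    print(colnames, 'total')
    for rowname, counts in rows.items():
        print(rowname, counts, sum(counts))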
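If column totals and a grand total are wanted as well, the same list of tuples can be pivoted into dictionaries. This is only a sketch and does not depend on any ORM; the function name crosstab_with_margins and the sample tuples are made up for illustration.

```python
from collections import defaultdict

def crosstab_with_margins(table):
    """Pivot (row_value, col_value, count) tuples into cells plus totals."""
    cells = defaultdict(dict)
    row_totals = defaultdict(int)
    col_totals = defaultdict(int)
    for row_value, col_value, count in table:
        cells[row_value][col_value] = count
        row_totals[row_value] += count
        col_totals[col_value] += count
    grand_total = sum(row_totals.values())
    return dict(cells), dict(row_totals), dict(col_totals), grand_total

# Hypothetical query result, invented for illustration only.
sample = [("F", "no", 1), ("F", "yes", 2), ("M", "no", 2), ("M", "yes", 1)]
cells, row_totals, col_totals, grand_total = crosstab_with_margins(sample)
print(cells["F"].get("yes", 0), row_totals, col_totals, grand_total)
```

Missing combinations are simply absent from cells, so read them with .get(col_value, 0) to treat them as zero counts.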