Indexing 20M records with python and mongoDB

Question

I would like to describe my small project and ask whether I am on the right track. I need to work with all Medline articles ( http://www.nlm.nih.gov/bsd/licensee/2011_stats/baseline_doc.html ). For those who are not familiar with the Medline database, here is some background:

Approx. 20,000,000 records (83.4 GB disk space), each of which has many fields and subfields.
You can download this database (with license) in XML format.
These 20M records are distributed in 653 files.
Each file has one MedlineCitationSet, and this is a set of records (MedlineCitation).

I want to process these records and extract information such as the title and abstract, so I decided to index these files (or records) using Python and MongoDB. My current plan is:

I wrote a parser for Medline. For each record it creates a JSON document for MongoDB, and afterwards I index on pubmedID. Then I can write a function such as get_abstract(pubmedID) that returns the abstract as a string.
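A minimal sketch of the parsing step described above, using only the standard library. It assumes the usual Medline layout (MedlineCitation elements containing PMID, Article/ArticleTitle and Article/Abstract/AbstractText); the function name iter_citations and the dictionary keys are hypothetical, chosen for illustration. It streams each file with iterparse so that a whole MedlineCitationSet never has to be held in memory.

```python
import xml.etree.ElementTree as ET


def _full_text(node):
    """Return all text inside node (including inline markup), or ''."""
    if node is None:
        return ""
    return "".join(node.itertext()).strip()


def iter_citations(source):
    """Yield one dict per MedlineCitation found in source.

    source is a filename or a binary file object. The element paths
    below are assumptions about the Medline XML layout.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "MedlineCitation":
            continue
        # An abstract may be split across several AbstractText elements.
        parts = [
            _full_text(node)
            for node in elem.iterfind("Article/Abstract/AbstractText")
        ]
        yield {
            "pubmedID": _full_text(elem.find("PMID")),
            "title": _full_text(elem.find("Article/ArticleTitle")),
            "abstract": " ".join(p for p in parts if p),
        }
        # Drop the finished record's children and text to limit memory use.
        # The emptied element itself stays attached to the root.
        elem.clear()
```

Each yielded dict is already JSON-like, so it can be passed to a MongoDB driver as-is or after renaming keys.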

My questions:

Is that a good idea? (XML parsing -> JSON -> insert and indexing!)
Is it possible to use GridFS and get chunks equivalent to the records in each file? How?
Do you know of another way?

+3

python mongodb pymongo gridfs

Àlex May 03 '11 at 12:01

1 answer

Gates vp · Accepted Answer · 2011-05-03T19:39:47+0000

Is that a good idea? (XML parsing -> JSON -> insert and indexing!)

? JSON , XML, , , .

Is it possible to use GridFS and get chunks equivalent to the records in each file? How?

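GridFS stores large files in MongoDB by splitting them into chunks. It exists to work around MongoDB's limit on the size of a single document (16MB).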

, GridFS. , GridFS .

GridFS . , GridFS - . MongoDB.

PS: It looks like pubmedID is a unique identifier. If so, you can store the pubmedID in _id, which MongoDB indexes automatically.

For example: collection.insert_one({"_id": xml_obj.pubmedID, "text": xml_obj.article_text})
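A rough sketch of how the _id suggestion could look for bulk loading and lookup. It assumes each parsed record is a dict with the hypothetical keys pubmedID, title and abstract, and that collection is a PyMongo Collection; the function names and the batch size are illustrative choices, not part of the original answer. The only driver calls used are insert_many and find_one.

```python
def to_document(record):
    """Map a parsed record to a MongoDB document keyed by its PubMed ID."""
    return {
        "_id": record["pubmedID"],  # unique, so no separate index is needed
        "title": record["title"],
        "text": record["abstract"],
    }


def store_records(collection, records, batch_size=1000):
    """Insert records in batches and return how many were sent."""
    batch = []
    stored = 0
    for record in records:
        batch.append(to_document(record))
        if len(batch) >= batch_size:
            # ordered=False lets the server continue past a failed document.
            collection.insert_many(batch, ordered=False)
            stored += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        collection.insert_many(batch, ordered=False)
        stored += len(batch)
    return stored


def get_abstract(collection, pubmed_id):
    """Return the stored abstract for pubmed_id, or None if it is absent."""
    doc = collection.find_one({"_id": pubmed_id}, {"text": 1})
    return doc["text"] if doc else None
```

With a running server, the collection could be obtained with something like MongoClient()["medline"]["citations"] (database and collection names are placeholders). Batching keeps the number of round trips low across 20M records, and lookups by _id use the index MongoDB maintains by default.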
