Indexing 20M records with Python and MongoDB

I would like to describe my little project and ask whether I am on the right track. I need to work with all Medline articles ( http://www.nlm.nih.gov/bsd/licensee/2011_stats/baseline_doc.html ). For those who are not familiar with the Medline database, here is some background:

  • Approx. 20,000,000 records (83.4 GB disk space), each of which has many fields and subfields.
  • You can download this database (with license) in XML format.
  • These 20M records are distributed in 653 files.
  • Each file has one MedlineCitationSet, and this is a set of records (MedlineCitation).

I want to process these records and extract information such as the title, the abstract, and so on. I decided to index these files (or records) using Python and MongoDB, and I have one option in mind:

I wrote a parser for Medline that creates a JSON document for each record, inserts it into MongoDB, and then indexes the collection on pubmedID. Then I can write a function like get_abstract('pubmedID') that returns a string.
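The parsing step described above could look roughly like the sketch below. It is not the asker's parser: `iter_citations` is a hypothetical helper, and the element paths (`PMID`, `Article/ArticleTitle`, `Article/Abstract/AbstractText`) are assumptions that should be checked against the Medline DTD. It streams each file with the standard library, so a whole MedlineCitationSet never has to be loaded at once.

```python
import xml.etree.ElementTree as ET


def iter_citations(source):
    """Yield one dict per MedlineCitation element found in `source`.

    `source` is a filename or a binary file object. Each citation is
    cleared after it has been converted, which keeps memory use modest.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "MedlineCitation":
            continue
        # Assumed element names; verify them against the Medline DTD.
        abstract_parts = [
            "".join(node.itertext()).strip()
            for node in elem.iterfind("Article/Abstract/AbstractText")
        ]
        yield {
            "_id": elem.findtext("PMID"),
            "title": elem.findtext("Article/ArticleTitle"),
            "abstract": " ".join(abstract_parts),
        }
        elem.clear()  # drop the children of the citation just processed
```

Each yielded dict is already in the shape of a MongoDB document, so the 653 files can be processed one after another and the results inserted in batches.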

My questions:

  • Is that a good idea? (XML parsing -> JSON -> insert and indexing!)
  • Is it possible to use GridFS and get chunks equivalent to the records in each file? How?
  • Do you know of another way?
1 answer

Is that a good idea? (XML parsing -> JSON -> insert and indexing!)

It could be. If your tools work well with JSON but not with XML, then yes, converting is probably a good idea.

Is it possible to use GridFS? How?

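GridFS is mostly a specification for storing large files in chunks. It exists to work around the MongoDB limit on the size of a single document (16MB at the time of writing). If an individual record is larger than that, it has to be split up.

Here, each record should be far below that limit, so you do not need GridFS. Chunking the source files with GridFS would not give you access to individual records either.

GridFS is meant for static files. For lookups by record, GridFS is the wrong tool - store each record as a regular document in MongoDB.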
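For a rough sense of scale against the 16MB limit mentioned above, the figures given in the question can be turned into an average record size. This is only back-of-the-envelope arithmetic: it measures XML bytes rather than stored document size, and individual records can be much larger than the average.

```python
TOTAL_BYTES = 83.4e9                # 83.4 GB of XML, as stated in the question
RECORD_COUNT = 20_000_000           # approximate number of records
DOC_LIMIT_BYTES = 16 * 1024 * 1024  # 16MB limit on a single MongoDB document

avg_record_bytes = TOTAL_BYTES / RECORD_COUNT
headroom = DOC_LIMIT_BYTES / avg_record_bytes

print(f"average record: about {avg_record_bytes / 1000:.1f} kB of XML")
print(f"the limit is roughly {headroom:.0f} times the average record")
```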


PS: It looks like pubmedID is a unique identifier. If so, use _id to store the pubmedID.

i.e.: collection.insert_one({"_id": xml_obj.pubmedID, "text": xml_obj.article_text})
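Building on that idea, here is a minimal sketch of how storing a record and the get_abstract function from the question might fit together. The helper `store_record` and the field name `abstract` are assumptions for illustration; `collection` is expected to be a PyMongo collection, and only its `replace_one` and `find_one` methods are used.

```python
def store_record(collection, record):
    """Insert or overwrite one record, keyed by its PubMed ID."""
    doc = dict(record)
    doc["_id"] = doc.pop("pubmedID")  # reuse the unique ID as the primary key
    # upsert=True makes re-running the import safe: existing IDs are replaced.
    collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)


def get_abstract(collection, pubmed_id):
    """Return the abstract for `pubmed_id`, or None if it is not stored."""
    # The second argument is a projection: fetch only the abstract field.
    doc = collection.find_one({"_id": pubmed_id}, {"abstract": 1})
    return doc.get("abstract") if doc else None
```

With PyMongo, the collection could be obtained with something like `MongoClient()["medline"]["citations"]` (the database and collection names here are made up). Because MongoDB always indexes `_id`, no separate index on pubmedID is needed for this lookup.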
