I have a very large collection (~7M documents) in MongoDB, consisting mainly of documents with three fields.
I would like to be able to iterate over all the unique values for one of the fields in an expedient manner.
Currently, I query for only this field and then process the returned results, iterating over the cursor to filter for uniqueness. This works, but it's pretty slow, and I suspect there is a better way.
I know that mongo has a db.collection.distinct() function, but its result is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there a way to iterate over something similar to db.collection.distinct(), but using a cursor or some other method, so that the record size limit is not a problem?
I think something like map/reduce might be suited to this kind of thing, but I don't really understand the map/reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partly a way to learn about working with various database tools, so I'm pretty inexperienced.
I'm using PyMongo, if that's relevant (I don't think it is). This should depend mostly on MongoDB alone.
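For reference, here is a minimal plain-Python sketch of what the map/reduce idea would look like for this problem. It is not MongoDB's mapReduce API; the function names (map_doc, reduce_counts, map_reduce) are made up for illustration, and everything runs in memory on a toy list of documents. The point is only that the keys of the reduced output are the distinct values of the field.

```python
from collections import defaultdict


def map_doc(doc, field):
    # "Map" step: emit one (key, value) pair per document.
    yield doc[field], 1


def reduce_counts(key, values):
    # "Reduce" step: collapse every value emitted for one key into one result.
    return sum(values)


def map_reduce(docs, field):
    # Group the emitted pairs by key, then reduce each group.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc, field):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}


# Toy input, not the real collection.
docs = [{"basePath": "foo"}, {"basePath": "bar"}, {"basePath": "foo"}]
result = map_reduce(docs, "basePath")
```

Each key in result appears exactly once, however many documents share it, so iterating over the keys is the same as iterating over the unique values.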
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I would like to do is iterate over just the basePath field. For the above dataset, this means that I would iterate over the values foo, bar and baz exactly once each.
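To make the desired behaviour concrete, here is a small in-memory sketch that runs on the five example documents. The helper name iter_unique is hypothetical, and this is essentially the client-side deduplication that is too slow at 7M documents; it only shows the output I am after.

```python
def iter_unique(docs, field):
    # Yield each distinct value of `field` once, in first-seen order.
    seen = set()
    for doc in docs:
        value = doc[field]
        if value not in seen:
            seen.add(value)
            yield value


sample = [
    {"basePath": "foo", "internalPath": "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"},
    {"basePath": "foo", "internalPath": "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"},
    {"basePath": "bar", "internalPath": "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"},
    {"basePath": "baz", "internalPath": "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"},
    {"basePath": "foo", "internalPath": "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"},
]

unique_paths = list(iter_unique(sample, "basePath"))
```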
My current approach (query for only this field, then deduplicate on the client):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
    items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Calling self.dbInt.coll.distinct("basePath") fails with:
OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Here is the version I ended up with, using the aggregation framework:
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
    {"$match":
        {
            "basePath": pathRE
        }
    },
    {"$group":
        {
            "_id": "$basePath"
        }
    },
    {"$out": "tmp_unique_coll"}
]
# $out writes the grouped results to tmp_unique_coll, so they are read back from there.
self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
    retItems += 1
    items.add(item["_id"])
self.log.info("Received items = %d", retItems)
self.log.info("total unique items = %s", len(items))
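A variation I have not benchmarked: drop the $out stage and iterate over the aggregation results directly. This sketch assumes PyMongo 3 or later, where Collection.aggregate() returns a cursor, so only each individual result document has to fit in 16 MB rather than the whole result set. The helper names build_distinct_pipeline and iter_distinct are hypothetical, and coll is assumed to be a pymongo Collection.

```python
def build_distinct_pipeline(field, match=None):
    # Optionally filter first, then group on the field so that each
    # distinct value becomes the _id of exactly one result document.
    pipeline = []
    if match is not None:
        pipeline.append({"$match": match})
    pipeline.append({"$group": {"_id": "$" + field}})
    return pipeline


def iter_distinct(coll, field, match=None):
    # `coll` is assumed to be a pymongo.collection.Collection (PyMongo 3+).
    # allowDiskUse lets the server spill large $group stages to disk.
    pipeline = build_distinct_pipeline(field, match)
    for doc in coll.aggregate(pipeline, allowDiskUse=True):
        yield doc["_id"]
```

Usage would look something like `for path in iter_distinct(self.dbInt.coll, "basePath", {"basePath": pathRE}): ...`, with no temporary collection to clean up afterwards.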
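This solution is about 2X faster. On a query that matches 834273 items, of which 11467 are unique:
Original method (retrieve, stuff into a Python set to enforce uniqueness):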
real 0m22.538s
user 0m17.136s
sys 0m0.324s
New method (aggregation pipeline):
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So, at about 2X faster, this is an improvement.
Update:
If I were doing this in SQL, it would be a simple query: SELECT DISTINCT(colName) WHERE xxx.
However, MongoDB is a NoSQL database, so the same approach does not carry over directly.
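To show the SQL behaviour I have in mind, here is a self-contained sketch using Python's built-in sqlite3 module and the example documents from above. The table name items and the LIKE filter are made up for illustration; this is not how the real data is stored.

```python
import sqlite3

rows = [
    ("foo", "Neque", "49f4c6804be2523e2a5e74b1ffbf7e05"),
    ("foo", "porro", "ffc8fd5ef8a4515a0b743d5f52b444bf"),
    ("bar", "quisquam", "cf34a8047defea9a51b4a75e9c28f9e7"),
    ("baz", "est", "c07bc6f51234205efcdeedb7153fdb04"),
    ("foo", "qui", "5aa8cfe2f0fe08ee8b796e70662bfb42"),
]

# In-memory database, so nothing touches the disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (basePath TEXT, internalPath TEXT, itemhash TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)

# The database deduplicates; the client just iterates over the cursor.
cursor = conn.execute(
    "SELECT DISTINCT basePath FROM items WHERE basePath LIKE ? ORDER BY basePath",
    ("ba%",),
)
distinct_paths = [row[0] for row in cursor]
conn.close()
```

The part I would like to reproduce in MongoDB is that the deduplication happens on the server and the results come back through a cursor.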