Iterate over single elements in one field in MongoDB

Question

Iterate over single elements in one field in MongoDB

I have a very large collection (~ 7M elements) in MongoDB, mainly consisting of documents with three fields.

I would like to be able to iterate over all the unique values for one of the fields in an expedient manner.

Currently, I request only this field, and then process the returned results, repeating the cursor for uniqueness. This works, but it's pretty slow, and I suspect there should be a better way.

I know that mongo has a function db.collection.distinct(), but this is limited by the maximum BSON size (16 MB), which exceeds my dataset.

Is there a way to iterate over something similar to db.collection.distinct(), but using a cursor or some other method, so limiting the record size is not a problem?

I think that maybe something like the map / reduce function might be suitable for this kind of thing, but I really don't understand the map reduction paradigm in the first place, so I have no idea what I'm doing, Project, which I am working on, partially learns about working with various database tools, so I'm pretty inexperienced.

_{I use PyMongo, if appropriate (I don't think so). This should basically depend only on MongoDB.}

Example:

For this dataset:

{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}

What I would like to do is iterate over the field only basePath. For the above dataset, this means that I will iterate over the values foo, barand bazonly once.

, , , , , ( ).

, (: , ):

    self.log.info("Running path query")
    itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
    self.log.info("Query complete. Processing")
    self.log.info("Query returned %d items", itemCursor.count())
    self.log.info("Filtering returned items to require uniqueness.")
    items = set()
    for item in itemCursor:
        # print item
        items.add(item["basePath"])

    self.log.info("total unique items = %s", len(items))

self.dbInt.coll.distinct("basePath") OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap

, , . , , .

    reStr = "^%s" % fqPathBase
    pathRE = re.compile(reStr)
    self.log.info("Running path query")

    pipeline = [
        { "$match" :
            {
                "basePath" : pathRE
            }
        },
        # Group the keys
        {"$group":
            {
                "_id": "$basePath"
            }
        },

        # Output to a collection "tmp_unique_coll"
        {"$out": "tmp_unique_coll"}
        ]

    itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
    itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)

    self.log.info("Query complete. Processing")
    self.log.info("Query returned %d items", itemCursor.count())
    self.log.info("Filtering returned items to require uniqueness.")
    items = set()
    retItems = 0
    for item in itemCursor:
        retItems += 1
        items.add(item["_id"])


    self.log.info("Recieved items = %d", retItems)
    self.log.info("total unique items = %s", len(items))

2X . , 834273 , 11467 :

(retreive, stuff python set ):

real    0m22.538s
user    0m17.136s
sys     0m0.324s

:

real    0m9.881s
user    0m0.548s
sys     0m0.096s

, 2X , .

Update:

SQL, . SELECT DISTINCT(colName) WHERE xxx.

, MongoDB NoSQL , .

+3

mongodb mongodb-query aggregation-framework pymongo

Fake Name 22 . '14 11:44

4

A MapReduce, Mongo , 16 .

map = function() {
    if(this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can just call the emit:
    // emit(this.basePath);
};

reduce = function(key, values) {
    return Array.sum(values);
};

basePath , . . basePath .

, , , out, .

db.yourCollectionName.mapReduce(
                 map,
                 reduce,
                 { out: "distinctMR" }
               )

+1

WiredPrairie 22 . '14 13:36

@ :

field = 'basePath' # Field I want db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])

$project . , '_id': 0 _id.

? $limit $skip:

field = 'basePath' # Field I want db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$limit': X}, {'$skip': Y}])

0

reading_ant 19 '17 11:45

, . , "" , . , , , . , mongo , , .

You can define this strategy using the $ gt operator in mongo, but you should take into account values such as empty or empty lines, and possibly discard them using the $ ne or $ nin operator. You can also expand this strategy using multiple keys, using operators such as $ gte for one key and $ gt for another.

This strategy should give you different string field values in alphabetical order or different numerical values sorted in ascending order.

0

nsblenin Apr 28 '19 at 11:22

source share

Neil Lunn · Accepted Answer · 2014-02-22T12:36:54+0000

. , 2.6 MongoDB , , .

FYI, , .distinct() - , , , .

, , 2.6 dev 2.5.3
mapReduce,

, , [ ].

db.collection.aggregate([

    // Group the key and increment the count per match
    {$group: { _id: "$basePath", count: {$sum: 1}  }},

    // Hey you can even sort it without breaking things
    {$sort: { count: 1 }},

    // Output to a collection "output"
    {$out: "output"}

])

, $out, , 16 . , .

2.6 " ", , .

allowDiskUse runCommand, .

, . mapReduce. . 2.5.5 .

Iterate over single elements in one field in MongoDB

More articles: