, , 10 , , , dict heapq, sotapme ( WoLpH) WoLpH:
wordcounter = collections.Counter(article_one)
leastcommon = word counter.nsmallest(10)
, , , 5 , 6 69105 , :
wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)
, heapifying, M list, a heap. - log N, . .
pastebin, , cat reut2* >reut2.sgm Reuters-21578 corpus ( , , , SGML ...):
$ python leastwords.py reut2.sgm
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm
heap: 23.95291996
sort: 16.1843900681
( : takewhile genexp yield , nsmallest , list , decorate-sort-undecorate partial lambda ..), 5% ( ).
Anyway, this is closer than I expected, so I will probably go with one that is simpler and more readable. But I think sorting beats a bunch there, so ...
Once again: if you just need the N least common, for a reasonable N, I bet I can’t even check that a bunch will win.