Finding the least common elements in a list

I want to generate an ordered list of the least common words in a large body of text, with the least common word appearing first, along with a value indicating how many times it appears in the text.

I scraped the text of some articles from an online magazine, then simply assigned it to a variable and split it:

article_one = """ large body of text """.split()
=> ['large', 'body', 'of', 'text']

It seems like a regex would be appropriate for the next steps, but being new to programming, I am not very well versed in them. If the best answer includes a regex, could someone point me to a good regular expression tutorial other than pydoc?

+5
5 answers

A ready-made answer from the mothership (the official documentation).

# From the official documentation:
>>> from collections import Counter
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

>>> import heapq
>>> from operator import itemgetter
>>> def least_common(adict, n=None):
...     if n is None:
...         return sorted(adict.items(), key=itemgetter(1))
...     return heapq.nsmallest(n, adict.items(), key=itemgetter(1))

Obviously, adapt to suit :D
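As a standalone script rather than an interpreter session, the helper above might look like the following minimal sketch. The sample word list is made up for illustration and is not from the question.

```python
import heapq
from collections import Counter
from operator import itemgetter


def least_common(counts, n=None):
    """Return (word, count) pairs ordered from rarest to most frequent."""
    if n is None:
        # No limit given: sort every pair by its count.
        return sorted(counts.items(), key=itemgetter(1))
    # Only the n rarest pairs are needed, so a heap avoids a full sort.
    return heapq.nsmallest(n, counts.items(), key=itemgetter(1))


# Hypothetical sample data, not from the original question.
words = ['red', 'blue', 'red', 'green', 'blue', 'blue']
print(least_common(Counter(words), 2))  # [('green', 1), ('red', 2)]
```

Words with equal counts come back in no particularly meaningful order; if that matters, sorting on a (count, word) key would make the result deterministic.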

0

What about a shorter, simpler version with defaultdict? Counter is nice but needs Python 2.7, while this works from 2.5 and up :)

import collections

counter = collections.defaultdict(int)
article_one = """ large body of text """

for word in article_one.split():
    counter[word] += 1

# Sort by (count, word) so the rarest words come first
print(sorted(counter.items(), key=lambda x: x[::-1]))
+4

Use the idiom given in the Counter documentation:

c.most_common()[:-n-1:-1]       # n least common elements

So, for the single least common element:

from collections import Counter
Counter( mylist ).most_common()[:-2:-1]

And for the two least common elements:

from collections import Counter
Counter( mylist ).most_common()[:-3:-1]
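To see why the reversed slice works, here is a small sketch with made-up data (the contents of `mylist` are an assumption for illustration only):

```python
from collections import Counter

# Hypothetical sample data for illustration.
mylist = ['a', 'a', 'a', 'b', 'b', 'c']
counts = Counter(mylist)

n = 2
# most_common() is ordered from most to least frequent, so a reversed
# slice of its tail yields the n rarest entries, rarest first.
rarest = counts.most_common()[:-n - 1:-1]
print(rarest)  # [('c', 1), ('b', 2)]
```

Note that `most_common()` with no argument sorts every entry, so this is convenient rather than fast for very large vocabularies.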

+2

Here is an approach that uses a plain dict and also strips punctuation and ignores capitalization:

#!/usr/bin/env python
import operator
import string

article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}

for word in article_one:
    # Strip word of punctuation and capitalization
    word = word.lower().strip(string.punctuation)
    if word not in wordbank:
        # Create a new dict key if necessary
        wordbank[word] = 1
    else:
        # Otherwise, increment the existing key value
        wordbank[word] += 1

# Sort dict by value
sortedwords = sorted(wordbank.iteritems(), key=operator.itemgetter(1))

for word in sortedwords:
    print word[1], word[0]

Output:

1 efg
2 d
3 c
4 b
5 a

This works in Python >= 2.4; for Python 3+, change print to a function call and iteritems to items.
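Applying those two changes, a Python 3 version of the script might look like the sketch below. Using `dict.get` in place of the if/else is my own simplification, not part of the original answer.

```python
import string
from operator import itemgetter

article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}

for word in article_one:
    # Normalize case and strip surrounding punctuation.
    word = word.lower().strip(string.punctuation)
    # dict.get supplies 0 the first time a word is seen.
    wordbank[word] = wordbank.get(word, 0) + 1

# items() replaces iteritems() in Python 3.
sortedwords = sorted(wordbank.items(), key=itemgetter(1))

for word, count in sortedwords:
    # print is a function in Python 3.
    print(count, word)
```

It should print the same five lines as the output listed above.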

+1

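If you need a fixed number of least common words, e.g., the 10 least common, you probably want a solution using a counter dict and a heapq, as suggested by sotapme's answer (with WoLpH's suggestion) or WoLpH's answer: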

wordcounter = collections.Counter(article_one)
leastcommon = heapq.nsmallest(10, wordcounter.items(), key=operator.itemgetter(1))

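However, if you need an unbounded number of them, e.g., all words with fewer than 5 appearances, which could be 6 in one run and 69105 in the next, you might be better off just sorting the list: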

wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)

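Here sorting is slower than heapifying, but extracting the first M elements is much faster from a list than from a heap. Algorithmically, the difference is just some log N factors, so the constants matter. The best thing to do is test.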
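The two strategies being compared can be sketched side by side as follows. This is not the benchmark code itself; the function names `rare_by_sort` and `rare_by_heap` and the sample text are hypothetical, chosen only to illustrate the idea.

```python
import heapq
import itertools
import operator
from collections import Counter


def rare_by_sort(counts, limit):
    """Sort all pairs by count, then keep those with count below limit."""
    allwords = sorted(counts.items(), key=operator.itemgetter(1))
    return list(itertools.takewhile(lambda x: x[1] < limit, allwords))


def rare_by_heap(counts, limit):
    """Heapify (count, word) pairs, then pop while the count is below limit."""
    heap = [(count, word) for word, count in counts.items()]
    heapq.heapify(heap)
    result = []
    while heap and heap[0][0] < limit:
        count, word = heapq.heappop(heap)
        result.append((word, count))
    return result


# Hypothetical sample data for illustration.
counts = Counter("a a a a a b b b c c d".split())
print(rare_by_sort(counts, 3))  # [('d', 1), ('c', 2)]
print(rare_by_heap(counts, 3))  # [('d', 1), ('c', 2)]
```

Both return the same words; when several words share a count, the order among them may differ between the two, since the heap version breaks ties alphabetically and the sort version keeps insertion order.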

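Taking my code from pastebin, and a file made by just running cat reut2* >reut2.sgm on the Reuters-21578 corpus (without processing it to extract the text, so this is obviously not very good for serious work, but it should be fine for benchmarking, because none of the SGML tags will be among the least common words...):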

$ python leastwords.py reut2.sgm # Apple 2.7.2 64-bit
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm # python.org 3.3.0 64-bit
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm # 1.9.0/2.7.2 64-bit
heap: 23.95291996
sort: 16.1843900681

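I tried various things to speed both of them up (including: takewhile around a genexp instead of a loop around yield in the heap version, popping optimistic batches with nsmallest and throwing away any excess, making a list and sorting it in place, decorate-sort-undecorate instead of a key, partial instead of lambda, etc.), but none of them made more than a 5% improvement (and some made things significantly slower).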

At any rate, these are closer than I expected, so I would probably go with whichever one is simpler and more readable. But I think sort beats heap there as well, so ...

Once again: if you just need the N least common, for a reasonable N, I am willing to bet, without even testing, that the heap implementation will win.

0