Counting words in Python using a regular expression

Question

What is the correct way to count English words in a document using a regular expression?

I tried:

words=re.findall('\w+', open('text.txt').read().lower())
len(words)

but it seems that I am missing a few words (compared with the word count in gedit). Am I doing it right?

Thank you so much!

python regex count word

Zhe li May 16, '11 at 13:12

2 answers

Mrab · Answer 1 · 2011-05-17T14:34:00+0000

Using \w+ does not correctly count words that contain apostrophes or hyphens; for example, "can't" will be counted as 2 words. It will also count numbers (strings of digits): “12.345” and “6.7” will each be counted as two words (“12” and “345”, “6” and “7”).
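
The splitting described above is easy to reproduce on a short string. The sketch below is only an illustration: the sample sentence and the alternative pattern `WORD` are assumptions made for this example, not part of the original answer, and the pattern simply allows a single apostrophe, hyphen, or dot between runs of word characters.

```python
import re

# Made-up sample text containing an apostrophe, a hyphen and two decimals.
text = "I can't count well-known words: 12.345 and 6.7"

# Plain \w+ breaks tokens at apostrophes, hyphens and decimal points.
naive = re.findall(r"\w+", text)
print(len(naive), naive)

# Hypothetical alternative: keep runs of word characters together when they
# are joined by a single apostrophe, hyphen or dot.
WORD = re.compile(r"\w+(?:['.-]\w+)*")
joined = WORD.findall(text)
print(len(joined), joined)
```

Whether numbers should count as words at all, and whether a dot between letters should join them, depends on the definition of "word" being used, so the pattern would need adjusting for real documents.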

Johnsyweb · Answer 2 · 2011-05-16T13:17:23+0000

This seems to work as expected.

>>> import re
>>> words=re.findall('\w+', open('/usr/share/dict/words').read().lower())
>>> len(words)
234936
>>> 
bash-3.2$ wc /usr/share/dict/words
  234936  234936 2486813 /usr/share/dict/words

A variant that uses a raw string for the pattern and skips the lowercasing step:

words=re.findall(r'\w+', open('/usr/share/dict/words').read())
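
When the regex count disagrees with another tool, it can help to compare it against a whitespace-based count on the same text. The following is a minimal sketch under stated assumptions: the helper names `word_counts` and `word_counts_in_file` and the sample string are invented for illustration, and `str.split()` is used as a rough stand-in for the whitespace rule that `wc -w` applies. How gedit counts words is not known here.

```python
import re

def word_counts(text):
    """Return (regex_count, whitespace_count) for the given text."""
    regex_count = len(re.findall(r"\w+", text))
    # Tokens separated by whitespace, roughly what `wc -w` reports.
    whitespace_count = len(text.split())
    return regex_count, whitespace_count

def word_counts_in_file(path, encoding="utf-8"):
    # Hypothetical helper: the with-block closes the file promptly instead
    # of leaving the handle from a bare open(...).read() to be cleaned up.
    with open(path, encoding=encoding) as handle:
        return word_counts(handle.read())

# Made-up sample: apostrophes, hyphens, a lone dash and a decimal number.
sample = "can't stop\nwell-known fact - 6.7\n"
print(word_counts(sample))
```

On text like the sample, the two counts differ in both directions: \w+ splits "can't" into two tokens, while the whitespace rule counts the lone "-" as a word.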
