Accent detection in words (Python)

Question

Here's the deal: I wrote a program that finds all the anagram classes in a dictionary. However, I have a problem with accented characters. Currently my code reads them and processes them as if they were invisible, but still prints some kind of replacement code at the end, in the form "\xc3\x??". I would like to drop all words with accents, but I do not know how to detect them.

Things I tried:

checking if the type is unicode
using a regex to check for words containing '\xc3'
decoding / encoding (I do not fully understand unicode, and nothing I tried worked).

QUESTION / PROBLEM: I need to figure out how to recognize accents, but my program prints accented characters on the command line as strange '\xc3\x??' sequences that do not match what the program sees internally: I could not find any words containing "\xc3\x??" even though that is what gets printed on the command line.

Example: sé → s\xc3\xa9, and sé and s are considered anagrams by my program.

Test Dictionary:

stop
tops
pots
hello
world
pit
tip
\xc3\xa9
sé
s
se

Code output:

Found
\xc3\xa9
['pit', 'tip']
['world']
['s\xc3\xa9', 's']
['\\xc3\\xa9']
['stop', 'tops', 'pots']
['se']
['hello']

The program itself:

import re

anadict = {};

for line in open('fakedic.txt'):#/usr/share/dict/words'):
        word = line.strip().lower().replace("'", "")
        line = ''.join(sorted(ch for ch in word if word if ch.isalnum($
        if isinstance(word, unicode):
                print word
                print "UNICODE!"
        pattern = re.compile(r'xc3')
        if pattern.findall(word):
               print 'Found'
               print word
        if anadict.has_key(line):
                if not (word in anadict[line]):
                        anadict[line].append(word)
        else:
                anadict[line] = [word]

for key in anadict:
        if (len(anadict[key]) >= 1):
                print anadict[key]

python command-line regex unicode non-ascii-characters

Worcestershire Feb 18 '14 at 3:50

2 answers

(The text of this answer did not survive extraction; the only recoverable fragments are "Python ASCII?", "ord char 128", and "unicode".)
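One possible reading of the `ord` / 128 hint above is a per-character code point check. The sketch below is only that: it assumes Python 3 (the question's code is Python 2, where lines read from a file are byte strings and would need decoding first), and `has_non_ascii` is a hypothetical helper name, not something from the original answer. The last lines also illustrate why a regex for the literal text `xc3` finds nothing in an accented word: the word holds the single byte 0xC3, and `\xc3` is just how that byte is displayed.

```python
def has_non_ascii(word):
    """Return True if any character lies outside the 7-bit ASCII range."""
    return any(ord(ch) > 127 for ch in word)

# "s\u00e9" is the word "sé" written with an escape so the file stays ASCII.
words = ["stop", "tops", "s\u00e9", "s", "se"]
plain = [w for w in words if not has_non_ascii(w)]
print(plain)

# The accented letter is stored as two UTF-8 bytes, 0xC3 and 0xA9.
raw = "s\u00e9".encode("utf-8")
print(raw)

# The three literal characters x, c, 3 do not occur in the data,
# while the byte 0xC3 does.
print(b"xc3" in raw)
print(b"\xc3" in raw)
```

In the test dictionary, the only line that really contains the characters `x`, `c`, `3` is the literal `\xc3\xa9` entry, which is consistent with it being the only word reported as "Found".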

ForgetfulFellow 18 . '14 4:03

Worcestershire · Accepted Answer · 2014-03-04T03:39:45+0000

In the end, I used regular expressions (mainly to test for anything that wasn't an alphabetic character):

if re.match('^[a-zA-Z_]+$', word):

This let me filter out any word that contains a backslash, a digit, or any other funky symbol. Not an ideal solution, but it worked.
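For anyone wanting to see how that check might slot into the grouping loop, here is a rough sketch. It assumes Python 3, takes the words as a list instead of reading `fakedic.txt`, and uses a hypothetical function name `group_anagrams`; it is not the exact code I ended up with.

```python
import re
from collections import defaultdict

# Same pattern as above: ASCII letters and underscore only.
ASCII_WORD = re.compile(r'^[a-zA-Z_]+$')

def group_anagrams(words):
    """Group words by their sorted letters, skipping non-matching words."""
    groups = defaultdict(list)
    for raw in words:
        word = raw.strip().lower().replace("'", "")
        if not ASCII_WORD.match(word):
            continue  # drops accented words, digits, backslashes, empty lines
        key = ''.join(sorted(word))
        if word not in groups[key]:
            groups[key].append(word)
    return dict(groups)

# The test dictionary from the question; "s\u00e9" is "sé".
test_words = ["stop", "tops", "pots", "hello", "world", "pit", "tip",
              "\\xc3\\xa9", "s\u00e9", "s", "se"]

for group in group_anagrams(test_words).values():
    print(group)
```

With this filter, sé and the literal `\xc3\xa9` line are both skipped, so s ends up in a group of its own.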
