Here's the deal: I wrote a program that finds all the anagram classes in a dictionary. However, I have a problem with accented characters. Currently my code reads them and processes them as if they were invisible, but still prints some kind of replacement code at the end in the form "\xc3\???". I would like to drop all words with accents, but I do not know how to detect them.
Things I tried:
- checking if the type is unicode
- using a regex to check for words containing '\xc3'
- decoding / encoding (I do not fully understand Unicode, and nothing I tried worked).
QUESTION / PROBLEM: I need to figure out how to recognize accented characters. My program prints them on the command line as strange '\xc3\???' sequences, but these do not seem to match what the program sees internally: I could not find any words containing "\xc3\???", even though that is what gets printed on the command line.
Example: sé → s\xc3\xa9, and sé and s are considered anagrams in my program.
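To make the example concrete, here is a minimal sketch of where the \xc3\xa9 seems to come from. It assumes the dictionary file is stored as UTF-8, which I have not verified; the names text and raw are just for illustration.

```python
# Assumption: the dictionary file is stored as UTF-8.
text = u"s\u00e9"               # the word "sé" as Unicode text
raw = text.encode("utf-8")      # the bytes that would sit in the file

# "é" is encoded as the two bytes 0xC3 0xA9, so the byte string
# is one element longer than the text it represents.
assert len(text) == 2
assert len(raw) == 3
assert raw == b"s\xc3\xa9"

# Decoding the bytes gives back the original two-character word.
assert raw.decode("utf-8") == text
```

If that assumption holds, the printed \xc3 is a single byte with value 0xC3, not the four characters backslash, x, c, 3, which would explain why searching for the text 'xc3' only matches the dictionary line that literally contains those characters.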
Test Dictionary:
stop
tops
pots
hello
world
pit
tip
\xc3\xa9
sé
s
se
Code output:
Found
\xc3\xa9
['pit', 'tip']
['world']
['s\xc3\xa9', 's']
['\\xc3\\xa9']
['stop', 'tops', 'pots']
['se']
['hello']
The program itself:
import re

anadict = {}
for line in open('fakedic.txt'):
    word = line.strip().lower().replace("'", "")
    # Anagram key: the word's alphanumeric characters, sorted
    line = ''.join(sorted(ch for ch in word if word if ch.isalnum()))
    # Attempt 1: check whether the word is a unicode object
    if isinstance(word, unicode):
        print word
        print "UNICODE!"
    # Attempt 2: look for the text 'xc3' inside the word
    pattern = re.compile(r'xc3')
    if pattern.findall(word):
        print 'Found'
        print word
    # Group words that share the same key
    if anadict.has_key(line):
        if not (word in anadict[line]):
            anadict[line].append(word)
    else:
        anadict[line] = [word]
for key in anadict:
    if (len(anadict[key]) >= 1):
        print anadict[key]
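For what it is worth, this is a rough sketch of the kind of filter I am after, not something I have working in the program above. It assumes the file is UTF-8 and decodes each line to Unicode text before testing it; has_non_ascii and load_plain_words are hypothetical helper names. Note that it rejects every non-ASCII character, not only accented letters.

```python
import io


def has_non_ascii(word):
    """Return True if any character lies outside the ASCII range."""
    return any(ord(ch) > 127 for ch in word)


def load_plain_words(path, encoding="utf-8"):
    """Read one word per line and skip words with non-ASCII characters.

    Assumption: the file is encoded as `encoding` (UTF-8 by default).
    """
    words = []
    # io.open decodes each line to Unicode text while reading
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            word = line.strip().lower().replace("'", "")
            if word and not has_non_ascii(word):
                words.append(word)
    return words
```

With the test dictionary above, load_plain_words('fakedic.txt') should skip sé but keep the line that literally spells out \xc3\xa9, since that line consists of ordinary ASCII characters.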