Python Regex - Removing Special Characters but Keeping Apostrophes

I am trying to remove all special characters from some text. Here is my regex:

pattern = re.compile(r'[\W_]+', re.UNICODE)
words = str(pattern.sub(' ', words))

Super simple, but unfortunately it causes problems with apostrophes (single quotes). For example, if I have the word "doesn't", this code returns "doesn".

Is there a way to adapt this regular expression so that it does not remove apostrophes in such cases?

Edit: this is what I am after. The input

doesn't this mean it -technically- works?

should become:

doesn't this mean it technically works

4 answers

Like this?

>>> import re
>>> pattern = re.compile(r"[^\w']")
>>> pattern.sub(' ', "doesn't it rain today?")
"doesn't it rain today "

If underscores should also be filtered out:

>>> re.compile(r"[^\w']|_").sub(" ", "doesn't this _technically_ means it works? naïve I am ...")
"doesn't this  technically  means it works  naïve I am    "
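The output above still contains doubled and trailing spaces, while the question asks for single spaces. A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse each run of special characters into one space and trim the ends (the name strip_special is made up for illustration):

```python
import re

# Anything that is neither a word character nor an apostrophe, plus the
# underscore (which \w would otherwise keep). The trailing + makes a whole
# run of such characters count as a single match.
_SPECIAL = re.compile(r"(?:[^\w']|_)+")


def strip_special(text):
    """Replace each run of special characters with one space, keeping apostrophes."""
    return _SPECIAL.sub(" ", text).strip()


print(strip_special("doesn't this mean it -technically- works?"))
# doesn't this mean it technically works
```

Because the pattern is a str pattern, \w is Unicode-aware by default in Python 3, so a word like "naïve" stays intact.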

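Another option is to match the words themselves rather than the characters to remove, with a pattern such as: [a-z]*'?[a-z]+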
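The word-matching pattern above could be tried out roughly as follows. This is only a sketch: the helper name keep_words, the re.IGNORECASE flag, and joining the matches with a space are assumptions made for illustration.

```python
import re

# Optional letters, an optional apostrophe, then at least one letter.
# IGNORECASE is an assumption so that capitalised words match as well.
WORD = re.compile(r"[a-z]*'?[a-z]+", re.IGNORECASE)


def keep_words(text):
    """Extract the words from text and join them with single spaces."""
    return " ".join(WORD.findall(text))


print(keep_words("doesn't this mean it -technically- works?"))
# doesn't this mean it technically works
```

Since the character class is ASCII-only, digits are dropped and a word such as "naïve" is split into "na" and "ve", so this only suits plain English text.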


re.sub(r"[^\w' ]", "", "doesn't this mean it -technically- works?")
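For reference, here is what that call returns, wrapped in a small function (drop_special is a hypothetical name). Note that it deletes the unwanted characters instead of replacing them with a space, which matters for hyphenated words, and that it leaves underscores alone because \w matches them:

```python
import re


def drop_special(text):
    # Delete (rather than replace) everything except word characters,
    # apostrophes and spaces.
    return re.sub(r"[^\w' ]", "", text)


print(drop_special("doesn't this mean it -technically- works?"))
# doesn't this mean it technically works
print(drop_special("well-known snake_case"))
# wellknown snake_case
```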

([^\w']|_)+?

Please note that this will not work for things like:

doesn't this mean it 'technically' works?

This may not be exactly what you need.
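If quoted words like the one above do matter, one possible workaround (an assumption, not part of this answer) is to treat an apostrophe as special unless it has a word character on both sides. A rough sketch using lookarounds, with clean as a made-up name:

```python
import re

# A match is a run of any of:
#   [^\w']     a character that is neither a word character nor an apostrophe
#   _          an underscore
#   (?<!\w)'   an apostrophe not preceded by a word character
#   '(?!\w)    an apostrophe not followed by a word character
_SPECIAL = re.compile(r"(?:[^\w']|_|(?<!\w)'|'(?!\w))+")


def clean(text):
    """Replace special characters with one space, keeping only in-word apostrophes."""
    return _SPECIAL.sub(" ", text).strip()


print(clean("doesn't this mean it 'technically' works?"))
# doesn't this mean it technically works
```

The trade-off is that a trailing possessive apostrophe is lost as well, so "dogs' toys" becomes "dogs toys".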

