Storing only certain characters in a string using Python?

In my program, I have a line like this:

ag ct oso gcota

Using Python, my goal is to get rid of the spaces and keep only the characters a, t, c and g. I understand how to get rid of the spaces (I just use line = line.replace(" ", "")). But how can I get rid of the characters I don't need when they could be any other letter of the alphabet?

3 answers

A very elegant and fast way is to use regular expressions:

import re

seq = 'ag ct oso gcota'
# Remove every character that is not a, t, c or g
seq = re.sub('[^atcg]', '', seq)

# seq is now 'agctgcta'

I could do something like:

chars_i_want = set('atcg')
final_string = ''.join(c for c in start_string if c in chars_i_want)

This is probably the easiest way to do this.
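
For a self-contained version of the snippet above, here is a minimal sketch. The function name keep_only and its allowed parameter are illustrative and not part of the original answer.

```python
def keep_only(text, allowed="acgt"):
    """Return text with every character not in `allowed` removed."""
    # A set gives constant-time membership tests for each character
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set)


result = keep_only("ag ct oso gcota")
print(result)
```

Passing a different allowed string (for example "acgu") would reuse the same function for another alphabet.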


Another option is to use str.translate to do the work (this two-argument form is Python 2 only):

import string
chars_to_remove = string.printable.translate(None,'acgt')
final_string = start_string.translate(None,chars_to_remove)
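
On Python 3, str.translate accepts a single table argument, so the call above would need a table built first, for instance with str.maketrans. A minimal sketch, assuming Python 3; the name delete_table is illustrative.

```python
import string

# Every printable ASCII character except the four we want to keep
chars_to_remove = "".join(c for c in string.printable if c not in "acgt")

# With three arguments, str.maketrans treats the third as characters to delete
delete_table = str.maketrans("", "", chars_to_remove)

start_string = "ag ct oso gcota"
final_string = start_string.translate(delete_table)
print(final_string)
```

Note that this only deletes characters listed in string.printable; anything outside that set passes through unchanged.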

I'm not sure which performs better; you would need to time them with timeit to know for certain.


Update: timings!

import re
import string

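def test_re(s,regex=re.compile('[^atgc]')):
    # Note: the arguments are swapped (it should be regex.sub('', s)),
    # so this substitutes into an empty string and always returns ''.
    return regex.sub(s,'')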

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s,chars_to_remove = string.printable.translate(None,'acgt')):
    return s.translate(None,chars_to_remove)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func))

In these timings, regex comes out fastest:

test_re 0.901512145996
test_join1 6.00346088409
test_join2 3.66561293602
translate 1.0741918087

Is mgilson's test_re() right? The arguments to re.sub() appear to be reversed, so it does not measure the real substitution.

I reran the timings with Python 3.4; there, str.translate() takes a mapping such as a dict. The dict is built once, outside the timed function.

import re
import string

regex=re.compile('[^atgc]')

# Every printable character except a, c, g and t
chars_to_remove = string.printable.translate({ord(c): None for c in 'acgt'})
# Map each unwanted code point to None so that str.translate() deletes it
cmap = {ord(c): None for c in chars_to_remove}

def test_re(s):
    return regex.sub('',s)

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s):
    return s.translate(cmap)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print(func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func)))

Results:

test_re 3.3141989699797705
test_join1 2.4452173250028864
test_join2 2.081048655003542
translate 1.9390292020107154

It is a pity that str.translate() offers no way to control what happens to characters that are not in the map. The current implementation keeps them, but it would be useful to be able to delete them instead, for cases where the set of characters we want to keep is much smaller than the set we want to delete (oh, hello, Unicode).
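
One possible workaround, offered as a sketch rather than a built-in feature: str.translate() looks each character up with table[code_point] and deletes any character whose lookup returns None, so a dict subclass whose __missing__ returns None drops everything that is not explicitly kept. The class name KeepOnly is hypothetical, and Python 3 is assumed.

```python
class KeepOnly(dict):
    """Translation table that deletes every character not explicitly kept."""

    def __init__(self, keep):
        # Map each kept character's code point to itself
        super().__init__((ord(c), ord(c)) for c in keep)

    def __missing__(self, code_point):
        # str.translate() deletes characters whose lookup yields None
        return None


table = KeepOnly("acgt")
# The trailing \u00e9 shows that non-ASCII characters are dropped as well
cleaned = "ag ct oso gcota \u00e9".translate(table)
print(cleaned)
```

The table stays as small as the set of kept characters, however large the set of characters to delete is.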
