I'm having problems using a parallel version of map (the ppmap wrapper, Kirk Strauser's implementation).
The function that I am trying to run in parallel performs a simple regular expression search on a large number of strings (protein sequences) that are parsed from files using BioPython's SeqIO. Each function call uses its own file.
If I run the function using the regular map, everything works as expected. However, when using ppmap, some of the runs simply freeze: there is no CPU usage, and the main program does not even respond to KeyboardInterrupt. Also, when I look at the running processes, the workers are still there (but they no longer use any CPU).
For example:
/usr/bin/python -u /usr/local/lib/python2.7/dist-packages/pp-1.6.1-py2.7.egg/ppworker.py 2>/dev/null
In addition, the workers do not seem to freeze on any particular data record: if I manually kill the process and restart the execution, it stops at a different point. (As a temporary workaround, I keep a list of records that have already been completed and re-run the program several times.)
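A minimal sketch of that kind of workaround, assuming the names of completed records are appended to a plain text file, one per line. The helper names and the file layout are illustrative only, not the actual script:

```python
import os


def load_done(path):
    """Return the set of record names already processed (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(line.strip() for line in fh if line.strip())


def mark_done(path, name):
    """Append one finished record name so that a restart can skip it."""
    with open(path, "a") as fh:
        fh.write(name + "\n")


def pending(all_names, path):
    """Names that still need processing, in their original order."""
    done = load_done(path)
    return [name for name in all_names if name not in done]
```

On each restart, only the result of `pending(...)` would be handed to the parallel map, and `mark_done(...)` would be called as each result comes back.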
Is there any way to see where the problem is?
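One idea that might help narrow this down, sketched under assumptions: a background thread that periodically writes the stack of every thread in the main process, using the standard `sys._current_frames()`. This is not part of pp or ppmap, the function names are made up for illustration, and it would only show where the calling process is waiting, not what the ppworker.py processes are doing.

```python
import sys
import threading
import time
import traceback


def format_all_stacks():
    """Return a text dump of the current stack of every thread in this process."""
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread %s:" % thread_id)
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
    return "\n".join(lines)


def start_stack_dumper(interval, out=sys.stderr):
    """Start a daemon thread that writes all stacks to `out` every `interval` seconds."""
    def run():
        while True:
            time.sleep(interval)
            out.write(format_all_stacks() + "\n")
            out.flush()

    dumper = threading.Thread(target=run)
    dumper.daemon = True  # do not keep the interpreter alive on exit
    dumper.start()
    return dumper
```

Calling `start_stack_dumper(30)` before the parallel map would then print a dump every 30 seconds, which should at least show which call the main program is blocked in when a run freezes.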
Sample code that I run:
import Bio.SeqIO  # module-level import used by the plain map version


def analyse_repeats(data):
    """
    Loads whole proteome in memory and then looks for repeats in sequences,
    flags both real repeats and sequences not containing particular aminoacid
    """
    (organism, organism_id, filename) = data
    import re
    letters = ['C','M','F','I','L','V','W','Y','A','G','T','S','Q','N','E','D','H','R','K','P']
    try:
        handle = open(filename)
        data = Bio.SeqIO.parse(handle, "fasta")
        records = [record for record in data]
        store_records = []
        for record in records:
            sequence = str(record.seq)
            uniprot_id = str(record.name)
            for letter in letters:
                # distinct runs of this letter, e.g. set(['A', 'AA', 'AAA'])
                items = set(re.compile("(%s+)" % tuple(([letter] * 1))).findall(sequence))
                if items:
                    for item in items:
                        store_records.append((organism_id,len(item), uniprot_id, letter))
                else:
                    # letter does not occur in this sequence at all
                    store_records.append((organism_id,0, uniprot_id, letter))
        handle.close()
        return (organism,store_records)
    except IOError as e:
        print e
        return (organism, [])

res_generator = ppmap.ppmap(
    None,
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
for res in res_generator:
If I use a simple map instead of ppmap, everything works fine:
res_generator = map(
    analyse_repeats,
    zip(todo_list, organism_ids, filenames)
)
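
To make clear what the per-letter search in analyse_repeats computes, here is a small standalone sketch of just that step on a toy sequence. The helper name `run_lengths` is hypothetical and does not appear in the real code; `re.escape` is added here only as a precaution.

```python
import re


def run_lengths(sequence, letter):
    """Sorted distinct lengths of consecutive runs of `letter` in `sequence`."""
    # findall returns every maximal run of the letter; the set drops duplicates
    runs = set(re.findall("(%s+)" % re.escape(letter), sequence))
    return sorted(len(run) for run in runs)


if __name__ == "__main__":
    print(run_lengths("AAGAAAGA", "A"))  # [1, 2, 3]
    print(run_lengths("AAGAAAGA", "C"))  # [] (letter absent)
```

In the real function, an empty result corresponds to the branch that stores a length of 0 for that letter.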