I need to work with an open source project (BioJava), but I am not satisfied with its performance in some areas and want to spend some time improving it.
For example, I have a text database encoded this way:
chrX Cufflinks exon 65175856 65175971 . . . gene_id "XLOC_002576"; transcript_id "TCONS_00004217"; exon_number "1"; gene_name "RP6-159A1.2"; oId "CUFF.3698.1"; nearest_ref "ENST00000456392"; class_code "p"; tss_id "TSS3873";
chrX Cufflinks exon 128986006 128986088 . . . gene_id "XLOC_002577"; transcript_id "TCONS_00004218"; exon_number "1"; oId "CUFF.3750.1"; class_code "u"; tss_id "TSS3874";
Not every field is required. Each gene_id can be associated with several transcript_id values (1..n), and each transcript_id has one or more exons.
The library loads the entire text file into an ArrayList, and every search iterates over the whole list. This works well with small lists, but in my case I have 10^10 queries against a really big list, and it takes a couple of days on a good computer.
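To make the problem concrete, here is a minimal sketch of the difference between scanning the list on every query and looking records up in a hash index built once at load time. This is not BioJava's actual code: `ExonIndex`, `Exon`, `scanByGene` and `exonsOfGene` are hypothetical names, and I am assuming each line has eight whitespace-separated fields followed by `key "value";` attributes, as in the two sample lines above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of BioJava.
final class ExonIndex {
    // Matches attribute pairs such as: gene_id "XLOC_002576";
    private static final Pattern ATTR = Pattern.compile("(\\w+) \"([^\"]*)\";");

    // One parsed line: coordinates plus its key/value attributes.
    static final class Exon {
        final String chrom;
        final long start;
        final long end;
        final Map<String, String> attrs = new HashMap<>();

        Exon(String chrom, long start, long end) {
            this.chrom = chrom;
            this.start = start;
            this.end = end;
        }
    }

    private final List<Exon> all = new ArrayList<>();
    private final Map<String, List<Exon>> byGene = new HashMap<>();
    private final Map<String, List<Exon>> byTranscript = new HashMap<>();

    // Parses one line and registers it in the list and in both indexes.
    void add(String line) {
        String[] f = line.trim().split("\\s+", 9);
        if (f.length < 9) {
            throw new IllegalArgumentException("Expected 9 fields: " + line);
        }
        Exon e = new Exon(f[0], Long.parseLong(f[3]), Long.parseLong(f[4]));
        Matcher m = ATTR.matcher(f[8]);
        while (m.find()) {
            e.attrs.put(m.group(1), m.group(2));
        }
        all.add(e);
        index(byGene, e.attrs.get("gene_id"), e);
        index(byTranscript, e.attrs.get("transcript_id"), e);
    }

    // Attributes are optional, so a missing key is simply not indexed.
    private static void index(Map<String, List<Exon>> map, String key, Exon e) {
        if (key != null) {
            map.computeIfAbsent(key, k -> new ArrayList<>()).add(e);
        }
    }

    // Linear scan: every query walks the whole list, O(n) per query.
    List<Exon> scanByGene(String geneId) {
        List<Exon> hits = new ArrayList<>();
        for (Exon e : all) {
            if (geneId.equals(e.attrs.get("gene_id"))) {
                hits.add(e);
            }
        }
        return hits;
    }

    // Hash lookup: expected O(1) per query once the index is built.
    List<Exon> exonsOfGene(String geneId) {
        return byGene.getOrDefault(geneId, Collections.emptyList());
    }

    List<Exon> exonsOfTranscript(String transcriptId) {
        return byTranscript.getOrDefault(transcriptId, Collections.emptyList());
    }
}
```

Both lookups return the same records; only the cost per query differs. Note that a plain `HashMap` is not safe for concurrent writes, so this sketch assumes the index is fully built before any query threads start reading it.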
Would Neo4j be a good choice? What would be a good way to implement it? For example, is it a bad idea to create nodes that hold only a String and establish relationships between them? Or would it be better to use HSQLDB with a single table?
Please note that I do not need persistence, but speed and synchronization are required.
EDIT: If you want, you can see the project here.