Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I currently have an MGF file containing MS2 spectral data (QE_2706_229_sequest_high_conf.mgf). An example entry from the file is shown below:

BEGIN IONS
TITLE=File3249 Spectrum10594 scans: 11084
PEPMASS=499.59366 927079.3
CHARGE=3+
RTINSECONDS=1710
SCANS=11084
104.053180 3866.360000
110.071530 178805.000000
111.068610 1869.210000
111.074780 10738.600000
112.087240 13117.900000
113.071150 7148.790000
114.102690 4146.490000
115.086840 11835.600000
116.070850 6230.980000
... ...
END IONS

This unannotated spectral file contains thousands of such entries; the total file size is ~150 MB. I also have a series of annotated text files to parse. Each file is similar to the format above; its first column is read into a NumPy array, and the unannotated spectra file is then scanned entry by entry until an entry whose m/z array matches the annotated file's array is found.

(Filename GRPGPVAGHHQMPR)

       m/z              i matches
 104.05318        3866.4
 110.07153      178805.4
 111.06861        1869.2
 111.07478       10738.6
 112.08724       13117.9
 113.07115        7148.8
 114.10269        4146.5
 115.08684       11835.6
 116.07085        6231.0
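For context, here is a minimal sketch of the column read and the comparison step the script relies on; the file contents are mocked, and the header handling of the real files is elided:

```python
import numpy as np
from tempfile import NamedTemporaryFile

# Hypothetical annotated file body: two columns (m/z, intensity), no header.
rows = " 104.05318 3866.4\n 110.07153 178805.4\n 111.06861 1869.2\n"
with NamedTemporaryFile('w', suffix='.txt', delete=False) as fh:
    fh.write(rows)
    name = fh.name

# Read only the first column: the m/z values to match against MGF entries.
mz = np.genfromtxt(name, dtype=float, usecols=(0,))

# The match test used in the script: exact element-wise equality.
# np.allclose would be the safer choice if the two sources ever
# format their floats differently.
spectrum_mz = np.array([104.05318, 110.07153, 111.06861])
print(np.array_equal(mz, spectrum_mz))   # True
```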

Once a match is found, an annotated MGF file is written that contains the full entry from the unannotated file, plus a line recording the filename (the peptide sequence) of the annotated text file that matched that entry. The output is below:

BEGIN IONS
SEQ=GRPGPVAGHHQMPR
TITLE=File3249 Spectrum10594 scans: 11084
PEPMASS=499.59366 927079.3
... ...
END IONS

There may be a much less computationally expensive way to do this. With 2,000 annotated files to search through against the large unannotated file above, parsing currently takes ~12 hours on a 2.6 GHz quad-core Intel Haswell CPU.

import numpy as np
import subprocess as sp
import sys
from pyteomics import mgf, auxiliary

def main():
    pep_files = []

    if (len(sys.argv) > 0):
        spec_in = sys.argv[1]
    else:
        print 'Not enough Command Line Arguments!'

    path = '/DeNovo/QE_2706_229_sequest_high_conf.mgf'
    print spec_in
    pep_files.append(spec_in)

    for ann_spectra in pep_files:
        seq = ann_spectra[:ann_spectra.find('-') - 1]
        print seq
        a = np.genfromtxt(ann_spectra, dtype=float, invalid_raise=False, usemask=False, filling_values=0.0, usecols=(0))              
        b = np.delete(a, 0)

        entries = []

        with mgf.read(path) as reader:
            for spectrum in reader:   
                if np.array_equal(b, spectrum['m/z array']):
                    entries.append(spectrum)
                    file_name = 'DeNovo/good_training_seq/{}.mgf'.format(ann_spectra[:-4])
                    with open(file_name, 'wb') as mgf_out:
                        for entry in entries:
                            mgf_out.write('BEGIN IONS')
                            mgf_out.write('\nSEQ={}'.format(seq))
                            mgf_out.write('\nTITLE={}'.format(entry['params']['title']))
                            mgf_out.write('\nPEPMASS={} {}'.format(entry['params']['pepmass'][0], entry['params']['pepmass'][1]))
                            mgf_out.write('\nCHARGE={}'.format(entry['params']['charge']))
                            mgf_out.write('\nRTINSECONDS={}'.format(str(entry['params']['rtinseconds'])))
                            mgf_out.write('\nSCANS={}'.format(entry['params']['scans']))
                            mgf_out.write('\n')
                            p = np.vstack([entry['m/z array'], entry['intensity array']])
                            output = p.T
                            np.savetxt(mgf_out, output, delimiter=' ', fmt='%f')
                            mgf_out.write('END IONS')


if __name__ == '__main__':
    main()

The Python script is invoked, one file at a time, by the following bash loop:

for f in *.txt ; do python2.7 mgf_parser.py "$f"; done

This was done to be able to parse only a given number of files at a time. Any suggestions for more efficient parsing methods?
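As an aside, each iteration of that bash loop starts a fresh interpreter and re-reads the ~150 MB MGF. A sketch of letting one Python process expand the pattern itself, which amortises that cost across all files (the function name and the glob-pattern interface are made up for illustration):

```python
import glob
import os
import sys
import tempfile

# Hypothetical sketch: expand a pattern once inside Python instead of
# invoking the interpreter once per file from bash, so the large MGF
# only needs to be read by a single process.
def collect_annotated_files(pattern):
    files = sorted(glob.glob(pattern))
    if not files:
        sys.exit('No annotated files match {!r}'.format(pattern))
    return files

# Demo against a throwaway directory.
tmp = tempfile.mkdtemp()
for name in ('a.txt', 'b.txt'):
    open(os.path.join(tmp, name), 'w').close()
found = collect_annotated_files(os.path.join(tmp, '*.txt'))
print(len(found))   # 2
```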

You shouldn't update your code once answers have been given, as it invalidates the work done by other people. –  Josay Jul 15 at 15:54

The original code in a Stack Exchange question should always be kept, along with any refactoring that has resulted from the answers received, as long as the question is current. –  user2277435 Jul 15 at 16:07

@user2277435: We actually no longer encourage updated code to be appended to the question. For ease of reading the question and answers, the answers should map to one (or more) original code blocks. Any updated code not meant for further review can be posted on a third-party site. Otherwise, it's not mandatory. –  Jamal Jul 15 at 16:48

Noted for the future. On Stack Overflow, for example, multiple answers can refer back to the original posting, which has updated sections responding to each poster, especially in optimization problems that compare computational expense. Further discussion I'll leave for meta. –  user2277435 Jul 15 at 17:06

1 Answer (accepted)

Disclaimer: I have very limited skill/knowledge with anything related to np (NumPy). Also, I haven't run your code to identify the bottleneck, which should be the very first thing to do before trying to perform optimisations.

Common ways to make things faster are:

  • not to do something if you don't need to.
  • not to do something multiple times if once is enough.

In the first category, your import subprocess as sp and from pyteomics import auxiliary are not required: you can get rid of them. This shouldn't change anything from a performance point of view, but it's always good to make things simpler.

In the second category, it seems like you could open file_name once per ann_spectra instead of reopening it on every match. Also, for each entry, you could retrieve entry['params'] only once. Similarly, you could call mgf_out.write once with the whole header instead of once per line.

This is the code I have at this stage:

def main():
    pep_files = []

    if (len(sys.argv) > 0):
        spec_in = sys.argv[1]
    else:
        print 'Not enough Command Line Arguments!'

    path = '/DeNovo/QE_2706_229_sequest_high_conf.mgf'
    print spec_in
    pep_files.append(spec_in)

    for ann_spectra in pep_files:
        seq = ann_spectra[:ann_spectra.find('-') - 1]
        print seq
        a = np.genfromtxt(ann_spectra, dtype=float, invalid_raise=False, usemask=False, filling_values=0.0, usecols=(0))              
        b = np.delete(a, 0)

        entries = []
        file_name = 'DeNovo/good_training_seq/{}.mgf'.format(ann_spectra[:-4])
        with open(file_name, 'wb') as mgf_out, mgf.read(path) as reader:
            for spectrum in reader:   
                if np.array_equal(b, spectrum['m/z array']):
                    entries.append(spectrum)
                    for entry in entries:
                        param = entry['params']
                        mgf_out.write('BEGIN IONS\nSEQ={}\nTITLE={}\nPEPMASS={} {}\nCHARGE={}\nRTINSECONDS={}\nSCANS={}\n'.format(
                            seq,
                            param['title'],
                            param['pepmass'][0],
                            param['pepmass'][1],
                            param['charge'],
                            param['rtinseconds'],
                            param['scans']))
                        p = np.vstack([entry['m/z array'], entry['intensity array']])
                        output = p.T
                        np.savetxt(mgf_out, output, delimiter=' ', fmt='%f')
                        mgf_out.write('END IONS')

Now, I have realised something that I find quite confusing: you keep adding elements to entries and then loop over it inside the match branch: it loops once for the first match, twice for the second, n times for the n-th, so every earlier match gets rewritten into each later file. Is this something we really need/want to do?
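If the intent is to write each matching spectrum exactly once, the list can be dropped and the write done directly in the match branch. A minimal sketch of that control flow (plain dicts stand in for pyteomics spectra, and write_entry is a hypothetical callback):

```python
# Sketch: act on each match directly instead of appending to a list
# and re-walking all previous matches on every new match.
def write_matches(spectra, target, write_entry):
    count = 0
    for spectrum in spectra:
        if spectrum['m/z'] == target:   # stands in for np.array_equal(...)
            write_entry(spectrum)       # write this one entry, once
            count += 1
    return count

written = []
spectra = [{'m/z': 1}, {'m/z': 2}, {'m/z': 1}]
print(write_matches(spectra, 1, written.append))   # 2
```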

Finally, you are doing things in an unusual way: you have pep_files = [], then pep_files.append(spec_in). This could easily be written pep_files = [spec_in]. Then you do for ann_spectra in pep_files:; you might as well get rid of pep_files, ann_spectra and the loop, and work directly with spec_in (and maybe you should exit when the argument is not provided).
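Going further (an untested sketch, with all names hypothetical): since the bash loop re-parses the large MGF for every one of the 2,000 annotated files, reading it once and indexing every spectrum by a hashable key built from its m/z array would turn each annotated file into a single dictionary lookup:

```python
import numpy as np

def mz_key(mz_array, decimals=5):
    # Round so tiny formatting differences between the MGF and the
    # annotated text files do not break exact-match lookups.
    return tuple(np.round(np.asarray(mz_array, dtype=float), decimals))

# Build the index in one pass over the big file (mocked here with dicts
# shaped like pyteomics' mgf.read() output).
spectra = [
    {'m/z array': np.array([104.05318, 110.07153])},
    {'m/z array': np.array([200.1, 300.2])},
]
index = {mz_key(s['m/z array']): s for s in spectra}

# Each annotated file then costs one O(1) lookup instead of a full scan.
query = np.array([104.053180, 110.071530])
print(mz_key(query) in index)   # True
```

The index holds every spectrum in memory, which should be affordable for a ~150 MB file, and the rounding in mz_key is only needed if the two sources print their floats differently.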

Points taken, appreciated. Some of the temporary lists were left over from refactoring and aren't necessary anymore. You cannot use format on NoneType objects; it gives an attribute error. This improves the readability, but still doesn't provide a performance increase. The main bottleneck is parsing the 10,000+ entries from path on every run. I'm thinking of a method to delete each matched entry from the file once found, so every iteration would parse an ever-decreasing search space. Any other thoughts? –  user2277435 Jul 15 at 15:22
