Tagged Questions

The use of IT to analyze biological data.

4
votes
2 answers
30 views

Convert impute2 files to mach format

Here is a program for converting Impute2 files into MaCH format (related to genetics). Source files include one xxx_haps file and one xxx_samples file, for example: ...
3
votes
0 answers
50 views

Finding the Cox regression coefficients in a mixed model for microarray data

I have written code for a project that aims to find the Cox regression coefficients in a mixed model for microarray data. The study was carried out on the Affymetrix Hgu133a platform. In the ...
2
votes
1 answer
163 views

Slow Python text-processing script

This script of mine merges columns 1 and 2 from one input file and sees if these merged combinations exist in the other infile (and vice versa). I know I get stuck in appending. It did not get past ...
4
votes
0 answers
50 views

Vectorize Fisher's Exact Test

I have two data frames/ lists of data, humanSplit and ratSplit, and they are of the form ...
0
votes
1 answer
78 views

Faster way to parse file to array, compare to array in second file, write final file

I currently have an MGF file containing MS2 spectral data (QE_2706_229_sequest_high_conf.mgf). The file template is here, as well as a snippet of example: ...
3
votes
0 answers
199 views

Genetic Algorithm in Python

I'm a new programmer, so any help is appreciated, preferably on making it faster, avoiding heavy memory usage, and so on. EDIT: Updated the code, now including a functional test program. Fixed the PEP-8 ...
6
votes
2 answers
138 views

Comparing 2 lists of peptide to spectrum rankings generated by 2 different algorithms

I'm seeking a general review, but I'm particularly interested in style. This program gets 2 lists of peptide to spectrum matches, so every spectrum title is linked to a list of 1 or 10 possible ...
10
votes
3 answers
625 views

Counting DNA nucleotides in C

I have written code to solve the following Rosalind problem. This is my first time writing in C and I would like a review of my code, particularly in regard to correctness and performance. ...
3
votes
1 answer
48 views

Calculating overlap of segments in chromosome data

I wrote R code that basically performs 2 operations: (1) for each segment in file A, find all segments in file B that lie in that segment; (2) find the percentage of overlap for each case in previous ...
1
vote
2 answers
128 views

Parsing BLAST output in XML format using Regular Expression

There are many better ways to parse BLAST output in .xml format, but I was curious to try using regex, even if it is not so straightforward or common. Here is the code to extract translated ...
3
votes
2 answers
156 views

Rosalind's 3rd problem in Scheme

I have an imperative programming background and I've decided to study functional programming by applying it to problems found on sites such as Project Euler and Rosalind. My language of choice is ...
4
votes
2 answers
85 views

Data screening using Perl

Background information: I've been asked to write a little Perl script that allows genomic data to be screened against reference files in order to determine locations of specific mutations. The input ...
6
votes
2 answers
81 views

Foreach-loop for and print commands

How can I make the following code shorter or more efficient (maybe with other loops or other nice ideas), and keep the current functionality? ...
5
votes
2 answers
198 views

Genomic Range Query

Recently I worked on one of the Codility Training exercises, Genomic Range Query (please refer to one of the evaluation reports for the details of this training). The proper approach for this question is using ...
10
votes
2 answers
392 views

Calculating the joint probability of n events from a sample sequence of occurrences

I'm writing an algorithm to take in a sample list of sequences of events, calculate 1-step transitional probabilities from the sequences, forward or in reverse, then calculate the joint probability of ...
4
votes
2 answers
500 views

How to improve this Needleman-Wunsch implementation in C#?

I split my implementation of this sequence alignment algorithm into three methods, where the NeedlemanWunsch method makes use of the ScoringFunction and Traceback methods. Further, I decided to go with ...
5
votes
3 answers
347 views

Optimization for SQLite result set parsing

I am retrieving information from an SQLite database that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I ...
4
votes
1 answer
102 views

Cutting strings into smaller ones based on specific criteria

So, I've got this largish (for me) script, and I want to see if anybody could tell me if there are any ways to improve it, both in terms of speed, amount of code and the quality of the code. I still ...
6
votes
1 answer
418 views

Calculate query coverage from BLAST output

I have a BLAST output file and want to calculate query coverage, appending the query lengths as an additional column to the output. Let's say I have 2 7 15 ...
7
votes
4 answers
1k views

Genome string clump finding problem

I am trying to solve a bioinformatics problem from a Stepic course. The problem posed: find clumps of the same pattern within a longer genome. Motivation: Identifying 3 occurrences of the same ...
2
votes
3 answers
760 views

Longest DNA sequence that appears at least twice (only one DNA string as input)

My question is how to find the longest DNA sub-sequence that appears at least twice. The input is only one DNA string, NOT TWO strings as in other LCS programs. I have written my 4th program and it seems to be ...
5
votes
1 answer
409 views

Performance: equivalent C and C++ programs

I write quite a bit of code in C, but haven't done much C++ since my college CS classes. I have been revisiting C++ recently, and thought I would re-implement a program I had previously written in C, ...
4
votes
2 answers
404 views

Efficient parsing of FASTQ

FASTQ is a notoriously bad format. This is because it uses the same @ character for the id line as it does for quality scores. Deciding what is a quality score and ...
8
votes
3 answers
521 views

Statistical calculations with sets of genes

The following piece of code executes 20 million times each time the program is called, so I need a way to make this code as optimized as possible. ...
5
votes
4 answers
291 views

Generating DNA sequences and looking for correlations

I've written a script to generate DNA sequences and then count the appearance of each step to see if there is any long-range correlation. My program runs really slowly for a length 100000 sequence 100 ...
11
votes
1 answer
1k views

Simple DNA sequence finder w/ mismatch tolerance

The goal with this function is to find one DNA sequence within another sequence, with a specified amount of mismatch tolerance. For example: ...
2
votes
2 answers
137 views

Feedback on text parsing and control structures

I threw together this C program today to handle a bioinformatics data processing task. The program seems to work correctly, but I wanted to know if anyone has suggestions regarding how the input data ...
5
votes
2 answers
142 views

Finding database matches and storing them in a glycopeptide structure

I am relatively new to C and would like some feedback on a function that I have written: whether it adheres to C standards, and whether there are other things I could have done better or differently. ...
7
votes
6 answers
337 views

Explicit Function Notation in Perl

I've gone back and forth a few times recently on my Perl coding style when it comes to module subroutines. If you have an object and you want to call the method bar ...