The use of IT to analyze biological data.

2
votes
0 answers
16 views

Implementation of fast optimal global sequence alignment algorithm

The link to the paper, where the algorithm is explained in depth and its optimality and termination are proven, can be found here. I also found a C++ implementation, which is a bit unreadable and does ...
8
votes
1 answer
317 views

Genetic Algorithm in Python

I'm a new programmer, so any help is welcome. I'd especially like suggestions that make it faster, avoid heavy memory usage, and so on. ...
3
votes
1 answer
65 views

Python Longest Repeat

I am trying to find the longest repeated string in text with Python, both quickly and space-efficiently. I created an implementation of a suffix tree in order to make the processing fast, but the ...
4
votes
1 answer
125 views

Cutting strings into smaller ones based on specific criteria

I've got this largish (for me) script, and I want to see if anybody could tell me if there are any ways to improve it in terms of speed, amount of code, and code quality. I still ...
4
votes
2 answers
82 views

bash script for constructing RNA pipeline

I have written a bash script that consists of multiple commands and Python scripts. The goal is to make a pipeline for detecting long non-coding RNA from a certain input. Ultimately I would like to ...
5
votes
1 answer
54 views

Reading an Excel file and comparing the amino acid sequence of each data pair

Since I am fairly new to Python I was wondering whether anyone can help me by making the code more efficient. I know the output stinks; I will be using Pandas to make this a little nicer. ...
1
vote
2 answers
73 views

Counting adenine and cytosine bases

I've started a little challenge on a website, and the first one was about counting different DNA letters. I've done it, but I found my method very brutal. I have a little experience, and I know that ...
2
votes
2 answers
146 views

Data processing task for bioinformatics

I threw together this C program today to handle a bioinformatics data processing task. The program seems to work correctly, but I wanted to know if anyone has suggestions regarding how the input data ...
5
votes
1 answer
99 views

Case study with biological populations: a list of lists of lists

I have a population (Pop) which has an attribute which is a list of individuals (Ind) where each individual has an attribute ...
4
votes
2 answers
144 views

Reflecting emotion classification based on the Lövheim cube

Background: I created a simple class to reflect emotion classification based on the Lövheim cube. The code is not scientific at all, and I just did it for fun, but I want all code I write to be as ...
2
votes
1 answer
36 views

A Java class for reading MaCH dosage files v2.0

Version 2 of A Java class for reading MaCH dosage files ...
3
votes
1 answer
62 views

A Java class for reading MaCH dosage files

A dosage file (used in computational genetics) is formatted like this: ...
8
votes
3 answers
529 views

Statistical calculations with sets of genes

The following piece of code executes 20 million times each time the program is called, so I need a way to make this code as optimized as possible. ...
4
votes
2 answers
50 views

Convert impute2 files to mach format

Here is a program for converting Impute2 files into MaCH format (related to genetics). Source files include one xxx_haps file and one xxx_samples file, for example: ...
7
votes
4 answers
2k views

Genome string clump finding problem

I am trying to solve a bioinformatics problem from a Stepic course. The problem posed: find clumps of the same pattern within a longer genome. Motivation: Identifying 3 occurrences of the same ...
7
votes
6 answers
341 views

Explicit Function Notation in Perl

I've gone back and forth a few times recently on my Perl coding style when it comes to module subroutines. If you have an object and you want to call the method bar ...
5
votes
4 answers
309 views

Generating DNA sequences and looking for correlations

I've written a script to generate DNA sequences and then count the appearance of each step to see if there is any long-range correlation. My program runs really slowly for a length 100000 sequence 100 ...
3
votes
0 answers
62 views

Finding the Cox regression coefficients in a mixed model for microarray data

I have written code for a project that aims to find the Cox regression coefficients in a mixed model for microarray data. The study was carried out on the Affymetrix Hgu133a platform. In the ...
3
votes
1 answer
88 views

Calculating overlap of segments in chromosome data

I wrote R code that performs 2 operations: (1) For each segment in file A, find all segments in file B that lie in that segment. (2) Find the percentage of overlap for each case in previous ...
2
votes
1 answer
177 views

Slow Python text-processing script

This script of mine merges columns 1 and 2 from one input file and sees if these merged combinations exist in the other infile (and vice versa). I know I get stuck in appending. It did not get past ...
4
votes
0 answers
64 views

Vectorize Fisher's Exact Test

I have two data frames/lists of data, humanSplit and ratSplit, and they are of the form ...
5
votes
2 answers
144 views

Finding database matches and storing them in a glycopeptide structure

I am relatively new to C and would like some feedback on a function that I have written: whether it adheres to C standards, and whether there are other things I could have done better or differently. ...
0
votes
1 answer
116 views

Faster way to parse file to array, compare to array in second file, write final file

I currently have an MGF file containing MS2 spectral data (QE_2706_229_sequest_high_conf.mgf). The file template is here, along with an example snippet: ...
6
votes
2 answers
157 views

Comparing 2 lists of peptide to spectrum rankings generated by 2 different algorithms

I'm seeking a general review, but I'm particularly interested in style. This program gets 2 lists of peptide-to-spectrum matches, so every spectrum title is linked to a list of 1 or 10 possible ...
10
votes
3 answers
696 views

Counting DNA nucleotides in C

I have written code to solve the following Rosalind problem. This is my first time writing in C and I would like a review of my code, particularly in regard to correctness and performance. ...
1
vote
2 answers
201 views

Parsing BLAST output in XML format using Regular Expression

There are many other better ways to parse BLAST output in .xml format, but I was curious to try using regex, even if it is not so straightforward and common. Here is the code to extract translated ...
5
votes
3 answers
529 views

Optimization for SQLite result set parsing

I am retrieving information from an SQLite database that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I ...
3
votes
2 answers
211 views

Rosalind's 3rd problem in Scheme

I have an imperative programming background and I've decided to study functional programming by applying it to problems found on sites such as Project Euler and Rosalind. My language of choice is ...
4
votes
2 answers
93 views

Data screening using Perl

Background information: I've been asked to write a little Perl script that allows genomic data to be screened against reference files in order to determine locations of specific mutations. The input ...
6
votes
2 answers
86 views

Foreach-loop for and print commands

How can I make the following code shorter or more efficient (maybe with other loops or other nice ideas), and keep the current functionality? ...
5
votes
2 answers
282 views

Genomic Range Query

Recently I worked on one of the Codility training tasks, Genomic Range Query (please refer to one of the evaluation reports for the details of this task). The proper approach for this question is using ...
10
votes
2 answers
846 views

Calculating the joint probability of n events from a sample sequence of occurrences

I'm writing an algorithm to take in a sample list of sequences of events, calculate 1-step transitional probabilities from the sequences, forward or in reverse, then calculate the joint probability of ...
4
votes
2 answers
741 views

How to improve this Needleman-Wunsch implementation in C#?

I split my implementation of this sequence alignment algorithm into three methods, where the NeedlemanWunsch method makes use of the ScoringFunction and Traceback methods. Further, I decided to go with ...
6
votes
1 answer
650 views

Calculate query coverage from BLAST output

I have a BLAST output file and want to calculate query coverage, appending the query lengths as an additional column to the output. Let's say I have 2 7 15 ...
11
votes
1 answer
2k views

Simple DNA sequence finder w/ mismatch tolerance

The goal with this function is to find one DNA sequence within another sequence, with a specified amount of mismatch tolerance. For example: ...
2
votes
3 answers
986 views

Longest DNA sequence that appears at least twice (only one DNA string as input)

My question is how to find the longest DNA sub-sequence that appears at least twice. The input is only one DNA string, NOT TWO strings as in other LCS programs. I have written my 4th program and it seems to be ...
5
votes
1 answer
488 views

Performance: equivalent C and C++ programs

I write quite a bit of code in C, but haven't done much C++ since my college CS classes. I have been revisiting C++ recently, and thought I would re-implement a program I had previously written in C, ...
4
votes
2 answers
602 views

Efficient parsing of FASTQ

FASTQ is a notoriously bad format. This is because it uses the same @ character for the id line as it does for quality scores. Deciding what is a quality score and ...