Bioinformatics is the use of software tools to analyse biological data.
6
votes
3 answers
45 views
Categorizing gene sequences read from a CSV file
I am relatively new to programming and would love to get some feedback on the following section of my code.
...
3
votes
1 answer
43 views
V Snare T Snare Model
Initially, every value is defined as 10, but I have to change them to suit different possible values, hence they keep changing. I'm an (im)mature C coder, hence there might ...
1
vote
1 answer
26 views
Compare a sequence with the reference frequency of hexamers
I have written this function (and others similar to it), but I am not sure I am using references to their full power.
My current concern is whether I am using a huge amount of memory. The subroutine ...
4
votes
2 answers
96 views
DNA base pair match counter
My code is done and outputs exactly what it needs to; I'm just wondering whether it is possible to make it a lot simpler using objects. If so, could someone tell me what I would need member-wise ...
3
votes
2 answers
52 views
Rosalind string algorithm problems
I've been starting to learn Rust by going through some of the Rosalind String Algorithm problems.
If anyone would like to point out possible improvements, or anything else, that would be great. There ...
7
votes
3 answers
102 views
Prefix Sum in Ruby, Genomic Range Query from Codility
I'm currently going through some lessons on Codility. I've just spent a couple of hours with GenomicRangeQuery, which is intended to demonstrate the use of prefix sums.
The task description is here. ...
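As a rough illustration of the prefix-sum idea mentioned in the excerpt above, here is a minimal Python sketch. It assumes the usual form of the task: a DNA string, a list of inclusive query ranges, and impact factors A=1, C=2, G=3, T=4, with the answer for each query being the smallest factor in the range. The function name `min_impact_factors` and its parameters are hypothetical and not taken from the question.

```python
from itertools import accumulate

# Assumed impact factors (not stated in the excerpt above).
IMPACT = {"A": 1, "C": 2, "G": 3, "T": 4}


def min_impact_factors(s, starts, ends):
    """Answer each inclusive range query (p, q) with the minimal impact factor in s[p..q]."""
    # One prefix-count array per nucleotide: counts[b][i] is the number of b in s[:i].
    counts = {
        base: [0] + list(accumulate(int(ch == base) for ch in s))
        for base in "ACG"
    }
    result = []
    for p, q in zip(starts, ends):
        # Check nucleotides from lowest to highest impact; the first one present wins.
        for base in "ACG":
            if counts[base][q + 1] - counts[base][p] > 0:
                result.append(IMPACT[base])
                break
        else:
            # None of A, C, G occurs in the range, so it holds only T.
            result.append(IMPACT["T"])
    return result
```

Building the three prefix arrays takes one pass over the string, after which every query is answered in constant time instead of rescanning the range.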
6
votes
1 answer
152 views
High performance parsing for large, well-formatted text files
I am looking to optimize the performance of a big data parsing problem I have using Python. The example data I show are segments of whole genome DNA sequence alignments for six primate species.
Each ...
6
votes
3 answers
90 views
4
votes
1 answer
56 views
Find allele frequencies at each site for each iteration for each population from FASTA file
The script takes a FASTA-format file as input and outputs the frequencies of each amino acid (A, C, ...
0
votes
2 answers
109 views
Comparing two columns in two different rows
I want to go through each line of a .csv file and check whether the first field of line 1 is the same as the first field of the next line, and so on. If it finds a match then I would like to ignore ...
2
votes
2 answers
59 views
RNA/DNA transcriber
I've been going through some of the exercises over on exercism and this is one of my solutions: a basic RNA/DNA transcriber. I was happy enough at first but now, looking at it again, the solution ...
4
votes
2 answers
104 views
Fast comparison of molecular structures and deleting duplicates
I have a program that reads in two xyz-files (molecular structures) and compares them by an intramolecular distance measure (dRMSD, Fig. 22). A friend told me that my program structure is bad, and as ...
5
votes
2 answers
56 views
Converting domain-specific regular-expressions to a list of all matching instances
There seem to be several questions floating around Stack Exchange regarding how to take a Python regular expression and list the matching instances. This problem is a bit different because 1) I need to ...
6
votes
1 answer
132 views
Statistics about gaps in DNA sequences
Newbie to Numba here. I'm trying to get faster code from an existing function, but the result is not faster. Ten times faster would be heaven, but I know nothing about optimization.
This is code about ...
3
votes
1 answer
245 views
Python Longest Repeat
I am trying to find the longest repeated string in text with Python, both quickly and space-efficiently. I created an implementation of a suffix tree in order to make the processing fast, but the ...
4
votes
2 answers
148 views
bash script for constructing RNA pipeline
I have written a Bash script that consists of multiple commands and Python scripts. The goal is to make a pipeline for detecting long non-coding RNA from a certain input. Ultimately I would like to ...
5
votes
1 answer
168 views
Reading an Excel file and comparing the amino acid sequence of each data pair
Since I am fairly new to Python I was wondering whether anyone can help me by making the code more efficient. I know the output stinks; I will be using Pandas to make this a little nicer.
...
1
vote
2 answers
84 views
Counting adenine and cytosine bases
I've started a little challenge on a website, and the first one was about counting different DNA letters. I've done it, but I found my method very brutal. I have a little experience, and I know that ...
4
votes
2 answers
256 views
Reflecting emotion classification based on the Lövheim cube
Background
I created a simple class to reflect emotion classification based on the Lövheim cube. The code is not scientific at all, and I just did it for fun, but I want all code I write to be as ...
2
votes
1 answer
38 views
A Java class for reading MaCH dosage files v2.0
Version 2 of A Java class for reading MaCH dosage files
...
3
votes
1 answer
68 views
A Java class for reading MaCH dosage files
A dosage file (used in computational genetics) is formatted like this:
...
4
votes
2 answers
107 views
Convert impute2 files to mach format
Here is a program for converting Impute2 files into MaCH format (related to genetics).
Source files include one xxx_haps file and one xxx_samples file, for example:
...
3
votes
0 answers
73 views
Finding the Cox regression coefficients in a mixed model for microarray data
I have written code for a project which aims at finding the Cox regression coefficients in a mixed model for microarray data. The study was carried out on the Affymetrix Hgu133a platform. In the ...
2
votes
1 answer
201 views
Slow Python text-processing script
This script of mine merges columns 1 and 2 from one input file and sees if these merged combinations exist in the other infile (and vice versa).
I know I get stuck in appending. It did not get past ...
4
votes
0 answers
87 views
Vectorize Fisher's Exact Test
I have two data frames/lists of data, humanSplit and ratSplit, and they are of the form
...
0
votes
1 answer
197 views
Faster way to parse file to array, compare to array in second file, write final file
I currently have an MGF file containing MS2 spectral data (QE_2706_229_sequest_high_conf.mgf). The file template is here, along with an example snippet:
...
8
votes
1 answer
695 views
Genetic Algorithm in Python
I'm a new programmer, so any help is welcome. Preferably to make it faster, avoid heavy memory usage, and so on.
...
6
votes
2 answers
194 views
Comparing 2 lists of peptide to spectrum rankings generated by 2 different algorithms
I'm seeking a general review, but I'm particularly interested in style.
This program gets 2 lists of peptide to spectrum matches, so every spectrum title is linked to a list of 1 or 10 possible ...
10
votes
3 answers
843 views
Counting DNA nucleotides in C
I have written code to solve the following Rosalind problem. This is my first time writing in C and I would like a review of my code, particularly in regard to correctness and performance.
...
3
votes
1 answer
194 views
Calculating overlap of segments in chromosome data
I wrote R code that basically performs two operations:
For each segment in file A, find all segments in file B that lie in that segment.
Find the percentage of overlap for each case in previous ...
1
vote
2 answers
343 views
Parsing BLAST output in XML format using Regular Expression
There are many better ways to parse BLAST output in .xml format, but I was curious to try using regex, even if it is not so straightforward or common. Here is the code showing how to extract translated ...
3
votes
2 answers
318 views
Rosalind's 3rd problem in Scheme
I have an imperative programming background and I've decided to study functional programming by applying it to problems found on sites such as Project Euler and Rosalind. My language of choice is ...
4
votes
2 answers
121 views
Data screening using Perl
Background information
I've been asked to write a little Perl script that allows genomic data to be screened against reference files in order to determine locations of specific mutations.
The input ...
6
votes
2 answers
90 views
Foreach-loop for and print commands
How can I make the following code shorter or more efficient (maybe with other loops or other nice ideas), and keep the current functionality?
...
5
votes
2 answers
554 views
Genomic Range Query
Recently I worked on one of the Codility training tasks, Genomic Range Query (please refer to one of the evaluation reports for the details of this task).
The proper approach for this question is using ...
10
votes
2 answers
2k views
Calculating the joint probability of n events from a sample sequence of occurrences
I'm writing an algorithm to take in a sample list of sequences of events, calculate 1-step transitional probabilities from the sequences, forward or in reverse, then calculate the joint probability of ...
5
votes
1 answer
102 views
Case study with biological populations: a list of lists of lists
I have a population (Pop) which has an attribute which is a list of individuals (Ind) where each individual has an attribute ...
4
votes
2 answers
1k views
How to improve this Needleman-Wunsch implementation in C#?
I split my implementation of this sequence alignment algorithm into three methods, where the NeedlemanWunsch method makes use of the ScoringFunction and Traceback methods. Further, I decided to go with ...
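For readers unfamiliar with the algorithm, the split described above (filling a score matrix, then tracing back) might look roughly like the following Python sketch. This is not the asker's C# code: the function name `needleman_wunsch` and the simple match/mismatch/gap scores are assumptions chosen for illustration, and it returns just one of possibly several optimal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""

    def pair_score(x, y):
        # Hypothetical scoring function: a flat reward or penalty per aligned pair.
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill step: each cell takes the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback step: walk from the bottom-right corner back to the origin.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Keeping the pair scoring in its own helper mirrors the separation into scoring and traceback that the question describes, and makes it easy to swap in a substitution matrix later.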
5
votes
3 answers
723 views
Optimization for SQLite result set parsing
I am retrieving information from an SQLite database that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I ...
4
votes
1 answer
145 views
Cutting strings into smaller ones based on specific criteria
I've got this largish (for me) script, and I want to see if anybody could tell me if there are any ways to improve it, both in terms of speed, amount of code and the quality of the code. I still ...
6
votes
1 answer
1k views
Calculate query coverage from BLAST output
I have a BLAST output file and want to calculate query coverage, appending the query lengths as an additional column to the output. Let's say I have
2 7 15
...
7
votes
4 answers
3k views
Genome string clump finding problem
I am trying to solve a bioinformatics problem from a Stepic course.
The problem posed: find clumps of the same pattern within a longer genome.
Motivation: Identifying 3 occurrences of the same ...
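One possible way to approach the clump search described above is sketched below in Python. It assumes the common formulation of the problem: report every k-mer that occurs at least `min_count` times inside some window of `window` characters. The function name and all parameter names are hypothetical, and the exact problem statement from the course is not reproduced here.

```python
from collections import defaultdict


def find_clumps(genome, k, window, min_count):
    """Return the set of k-mers occurring at least min_count times within some window."""
    # Record every start position of every k-mer in a single pass.
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        # Slide over runs of min_count consecutive occurrences; starts is already sorted.
        for j in range(len(starts) - min_count + 1):
            first = starts[j]
            last = starts[j + min_count - 1]
            # The run fits if the span from the first start to the last end is short enough.
            if last + k - first <= window:
                clumps.add(kmer)
                break
    return clumps
```

Indexing positions once avoids recounting k-mers for every window, which is usually the slow part of a naive solution.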
2
votes
3 answers
1k views
Longest DNA sequence that appears at least twice (only one DNA string as input)
My question is to find the longest DNA sub-sequence that appears at least twice. The input is only one DNA string, NOT TWO strings as other LCS programs.
I have done my 4th program and it seems to be ...
5
votes
1 answer
633 views
SAM mapped reads
I write quite a bit of code in C, but haven't done much C++ since my college CS classes. I have been revisiting C++ recently, and thought I would re-implement a program I had previously written in C, ...
4
votes
2 answers
1k views
Efficient parsing of FASTQ
FASTQ is a notoriously bad format. This is because it uses the same @ character for the id line as it does for quality scores. Deciding what is a quality score and ...
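To illustrate the ambiguity mentioned above, here is a minimal Python sketch that sidesteps it by assuming every record is exactly four lines (header, sequence, separator, quality) with no line wrapping. That assumption does not hold for every FASTQ file, and the function name `read_fastq` is hypothetical. Under it, a quality line that begins with `@` is never mistaken for a header, because lines are identified by position rather than by their first character.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        # Take the next four lines as one record; position decides each line's role.
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality
```

For example, wrapping a string in `io.StringIO` and passing it to `read_fastq` yields one tuple per record. Files with wrapped sequence lines would need a different strategy, such as reading quality characters until their count matches the sequence length.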
8
votes
3 answers
541 views
Statistical calculations with sets of genes
The following piece of code executes 20 million times each time the program is called, so I need a way to make this code as optimized as possible.
...
4
votes
3 answers
695 views
FASTA file processing using Python to invoke external filters
I am very new to programming and this is my first functional code. It works fine but I'm sure that it could use a lot of optimization. If you see any blunders or would be able to help condense the ...
5
votes
4 answers
380 views
Generating DNA sequences and looking for correlations
I've written a script to generate DNA sequences and then count the appearance of each step to see if there is any long range correlation.
My program runs really slowly for a length-100000 sequence 100 ...
12
votes
1 answer
2k views
Simple DNA sequence finder w/ mismatch tolerance
The goal with this function is to find one DNA sequence within another sequence, with a specified amount of mismatch tolerance. For example:
...
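The kind of search described above can be sketched in Python as a plain sliding-window comparison. This is a naive illustration rather than the asker's function: the name `find_with_mismatches` and its parameters are hypothetical, and only substitutions are counted as mismatches (no insertions or deletions).

```python
def find_with_mismatches(pattern, text, max_mismatches):
    """Return start positions where pattern matches text with at most max_mismatches substitutions."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        mismatches = 0
        for p_char, t_char in zip(pattern, text[start:start + len(pattern)]):
            if p_char != t_char:
                mismatches += 1
                if mismatches > max_mismatches:
                    # Too many differences: abandon this window early.
                    break
        else:
            # The inner loop finished without exceeding the tolerance.
            hits.append(start)
    return hits
```

The early exit keeps most windows cheap when the tolerance is small, although the worst case is still proportional to the product of the two lengths.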
2
votes
2 answers
147 views
Data processing task for bioinformatics
I threw together this C program today to handle a bioinformatics data processing task. The program seems to work correctly, but I wanted to know if anyone has suggestions regarding how the input data ...