Recommended sequence clustering algorithm for transcriptome data

Question

I'm working on a project where I'm going to analyze a large amount of transcriptome data. After assembling our RNA-Seq reads into contigs using Trinity, it looks like I'm going to have about 10 GB of sequences in FASTA format. Since these sequences are from several hundred tissue libraries but from a single species (chicken), I'm expecting a lot of redundancy, so I'd like to cluster these sequences and use just one representative sequence from each cluster as I go forward with my analysis. I see there are quite a few tools that do this sort of thing, and I'm wondering which you all would recommend. I'll be running this on a Linux machine with 64 CPU cores and ~500 GB of RAM.

I started looking at USEARCH, but it seems I'm going to run into memory issues with the free 32-bit version, and as much as I clicked around on their site, I couldn't figure out how much the 64-bit version costs or how to buy it.

I guess ClustalW should be able to do this, but I'm not sure how to get the clusters out of it; they may be in one of the output files, so check there. If you want to use multiple cores, you'll need a parallelizable tool. I'll look into it. – WYSIWYG Jul 30 '14 at 4:16

Josh Herr · Answer 1 · 2014-07-29 19:45:48Z


It does sound like you have a lot of data.

I would first try Robert Edgar's other, newer tool, UPARSE, which is faster and can handle more data using the free 32-bit version. I think you'll mainly be limited by machine memory, though, right?

Did you try CD-Hit?
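If you do go the CD-Hit route, here is a rough sketch of how a run on nucleotide contigs might be set up from Python. It only builds and prints the `cd-hit-est` command line; it does not execute anything. The file names, the 95% identity cutoff, and the helper function `cd_hit_est_command` are my own assumptions for illustration, not something from the question. The flags (`-i`, `-o`, `-c`, `-n`, `-T`, `-M`) are standard `cd-hit-est` options, but check them against the version you installed.

```python
import shlex


def cd_hit_est_command(fasta_in, fasta_out, identity=0.95, threads=64, memory_mb=0):
    """Build (but do not run) a cd-hit-est argument list.

    Hypothetical helper for illustration only.
    """
    # The CD-HIT guide ties the word size (-n) to the identity threshold (-c):
    # 10 or 11 for 0.95-1.0, and 8 or 9 for 0.90-0.95.
    if identity >= 0.95:
        word_size = 10
    elif identity >= 0.90:
        word_size = 8
    else:
        raise ValueError("this sketch only covers identity thresholds >= 0.90")

    return [
        "cd-hit-est",
        "-i", fasta_in,         # input FASTA (the assembled contigs)
        "-o", fasta_out,        # output FASTA of representative sequences
        "-c", str(identity),    # sequence identity threshold
        "-n", str(word_size),   # word size matched to the threshold
        "-T", str(threads),     # number of threads (0 means all CPUs)
        "-M", str(memory_mb),   # memory limit in MB (0 means unlimited)
    ]


# Assumed file names; substitute your own Trinity output.
cmd = cd_hit_est_command("Trinity.fasta", "Trinity.cdhit95.fasta")
print(shlex.join(cmd))
```

The output FASTA should hold one representative per cluster, and CD-Hit also writes a `.clstr` file next to it listing the members of each cluster. To actually launch the job you could pass `cmd` to `subprocess.run`.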


Yeah, memory is going to be the limiting factor. I did just set up CD-Hit on my machine, so I'll give that a try. – Colin Jul 29 '14 at 20:14


user1357 · Answer 2 · 2014-07-31 03:43:51Z


Colin, the only way to go is Edgar's software. Write to him at [email protected]. It was a thousand for a copy (in 2012), and worth every penny.




asked	10 months ago
viewed	85 times
active	10 months ago
