Campus Event Calendar

Event Entry

What and Who

Optimising de novo transcriptome assembly

Dilip Ariyur Durai
International Max Planck Research School for Computer Science - IMPRS
PhD Application Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Monday, 26 October 2015
10:20
90 Minutes
E1 4
024
Saarbrücken

Abstract

 Motivation: De novo transcriptome assembly is a widely used approach to transcriptome analysis. Most assemblers use a de Bruijn graph as their underlying data structure. The nodes of the graph are kmers (substrings of length k), and two nodes are connected if they overlap by k-1 characters. The value of k is a fundamental parameter that strongly influences the de Bruijn graph and hence the assembly. It has been shown that no single k value leads to an optimal result. Researchers therefore use multi kmer based assembly, which builds de Bruijn graphs over multiple k values and merges the resulting assemblies. One of the main constraints of this method is the amount of time and memory it requires for large datasets, and limited research has been done to tackle this issue. With this in view, we introduce two algorithms, KREATION and RE-READ, which significantly reduce the computational time and resources required.
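
The graph definition above can be illustrated with a minimal Python sketch. It is not taken from any assembler or from the talk; the function names `kmers` and `build_de_bruijn` are hypothetical, and the code simply connects every pair of kmers whose k-1 suffix and prefix match, ignoring reverse complements, coverage and sequencing errors.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all substrings of length k in a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph as an adjacency dict.

    Nodes are kmers; there is an edge u -> v when the last k-1
    characters of u equal the first k-1 characters of v.
    Illustrative assumption: k >= 2 and reads are plain strings.
    """
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))

    # Index nodes by their (k-1)-prefix so successors can be looked up directly.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)

    # node[1:] is the (k-1)-suffix of a kmer.
    return {node: sorted(by_prefix[node[1:]]) for node in sorted(nodes)}


if __name__ == "__main__":
    for k in (3, 4):
        print(k, build_de_bruijn(["ACGTAC"], k))
```

On the single read `ACGTAC`, k = 3 yields a cycle, because `TAC` overlaps `ACG`, while k = 4 yields a simple path ending at `GTAC`. Even on this toy input the choice of k changes the shape of the graph.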

KREATION: Most current multi kmer based assemblers run the assembly for a set of kmer values scattered over the entire read length. This results in a suboptimal assembly and an increase in runtime. We propose KREATION, a method that can be incorporated into an assembler and automatically learns at which kmer value to stop the assembly by analysing the transcripts generated in each single kmer iteration. It clusters the related assemblies to estimate whether an additional kmer assembly is necessary. We found that an approach based on a linear model fit works well for predicting the kmer value beyond which no further assembly is required. This approach was tested on datasets of different sequence coverage and read length. Compared to the assembly generated using the full range of kmer values, KREATION was found to produce lossless results with a significant reduction in runtime.
 RE-READ: Assembling transcriptome reads with high sequence coverage requires a large amount of memory. These datasets generally contain redundant data which, when removed, can reduce the memory requirement. Current algorithms risk losing kmers that form connections between nodes in the de Bruijn graph, which results in a suboptimal assembly. We propose an algorithm called RE-READ, which analyses the reads and predicts the connections in the de Bruijn graph. Reads are removed if they do not contribute to the connectivity or might contribute to a misassembly. The algorithm reduces the data significantly without a significant loss of assembly quality.
CONCLUSION: We put forward two completely automated methods which significantly reduce runtime and memory requirements. This would make assemblers more efficient and easier to use.

Contact

Andrea Ruffing

Andrea Ruffing, 10/23/2015 19:00 -- Created document.