TwoPaCo for fragmented genomes

The de Bruijn graphs are the key algorithmic technique in genome assembly and metagenomic analysis. The construction of a de Bruijn graph is one of the most resource intensive steps of many algorithms in bioinformatics. A de Bruijn graph can be compacted by collapsing non-branching paths into single edges. Building and storing the ordinary graph takes lots of space, so it is an interesting question if is it possible to construct the compacted graph directly and in more optimal way.

The second author developed TwoPaCo, a simple and scalable low memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. The main idea of this algorithm is the use of the bloom filter to narrow the set of graph vertex candidates. In this project we tried to transform TwoPaCo algorithm to work with fragmented genomes. The algorithm can be divided into two steps. The first step is finding vertices in compacted de Bruijn graph and the second one is determination of edges. The first step in both cases is the same but the second one is totally different. We implemented both steps, but for fragmented genomes it works a little worse than we expected.

 

Студент:
   Мария Атаманова
Куратор:
   Илья Минкин
Время выполнения проекта: Feb 2017 — Jun 2017