Repeat resolution using 10xGenomics data
The recently introduced GemCode technology provides a reagent delivery system that partitions long DNA molecules and prepares sequencing libraries such that all reads originating from the same molecule share a common barcode. Barcodes therefore carry additional information about the genomic location of a given sequence: reads with the same barcode are likely to come from nearby positions. This information has been used to generate phased haplotypes for human genomes, detect structural variants, and interrogate heterogeneous cell populations. However, no algorithms for genome assembly from these data have been published so far. The goal of this project was to investigate the potential of this technology for de novo genome assembly.
One of the main problems in genome assembly is the handling of repeats. Assembly algorithms usually concentrate first on locating and assembling stretches of unique sequence (contigs). This step is typically done using an assembly graph (e.g. a de Bruijn graph), in which the genome is represented by a certain path. The contigs are then ordered and connected with various scaffolding methods. Using the barcodes generated by GemCode and the alignment of reads to the assembly graph, we can color the edges of the graph by read barcodes. We expect the set of colors on an edge to change "smoothly" along the genomic path, and this can be used to decide which way to proceed at repetitive edges.
We analysed GemCode data to assess the potential of the proposed method. We confirmed a strong correlation between the number of barcodes shared by two fragments of the genome and the distance between them. We implemented a simple scaffolding algorithm based on these observations and tested it on a human genome fragment using an ideal de Bruijn graph and real GemCode data. In our experiments the algorithm showed very high accuracy, making false connections in only 0.2% of cases. We also showed that our scaffolding method is applicable even to contigs as short as 1000 bp. Thus this method has potential for the assembly of real mammalian genomes, since an initial short-read assembly of a mammalian genome usually has an N50 of at least 1000 bp.
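The barcode-guided choice at a repeat can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The function names, the `min_shared` threshold, and the toy barcode sets are all hypothetical and serve only to show the idea of following the edge whose barcode set overlaps most with that of the current edge.

```python
def shared_barcodes(barcodes_a, barcodes_b):
    """Number of barcodes present on both edges."""
    return len(barcodes_a & barcodes_b)


def choose_next_edge(current, candidates, edge_barcodes, min_shared=2):
    """Pick the candidate edge sharing the most barcodes with `current`.

    Returns None when there are no candidates, when the best candidate
    shares fewer than `min_shared` barcodes (assumed threshold), or when
    the two best candidates are tied and the choice is ambiguous.
    """
    scores = [
        (shared_barcodes(edge_barcodes[current], edge_barcodes[c]), c)
        for c in candidates
    ]
    if not scores:
        return None
    # Sort by shared-barcode count, highest first.
    scores.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_edge = scores[0]
    if best_score < min_shared:
        return None
    if len(scores) > 1 and scores[1][0] == best_score:
        return None
    return best_edge


# Toy data: barcodes of reads aligned to each edge (hypothetical values).
edge_barcodes = {
    "unique_in": {"BC1", "BC2", "BC3", "BC4"},
    "out_a": {"BC2", "BC3", "BC4", "BC9"},
    "out_b": {"BC1", "BC7", "BC8"},
}

next_edge = choose_next_edge("unique_in", ["out_a", "out_b"], edge_barcodes)
print(next_edge)
```

In this toy example `out_a` shares three barcodes with `unique_in` and `out_b` shares one, so `out_a` is selected. Returning `None` on ties or weak support is a conservative design choice: leaving a connection unresolved is preferable to making a false one.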