Antibody repertoire construction from Ion Torrent reads

An antibody repertoire is defined as the entire set of antibodies in an organism. It characterizes the adaptive immune system and thus serves as the input to most immunoinformatics studies. This makes antibody repertoire construction a fundamental problem in immunoinformatics.

An antibody is largely characterized by its variable region, which is formed by V(D)J recombination in B cells. An organism's genome encodes a set of V, D, and J segments. During V(D)J recombination, a single V, a single D, and a single J segment are chosen and concatenated to form the variable region of an antibody. Antibodies then acquire somatic hypermutations that improve their antigen binding. Modern Rep-seq libraries are typically produced by sequencing techniques whose reads cover the whole variable region. Computationally, repertoire construction can therefore be formulated as a clustering and error-correction problem on error-prone immunosequencing datasets. Standard clustering and error-correction solutions, however, do not cope well with the specifics of this problem: antibody multiplicities are distributed very unevenly, antibody sequences can share very long identical parts, and amplification errors are statistically very similar to the natural diversity of antibodies.

The most common sequencing technology for Rep-seq is Illumina MiSeq, and most repertoire construction tools, including IgReC developed at CAB, are designed for it. Ion Torrent sequencing is also used for Rep-seq, but it introduces very specific sequencing errors that require a separate approach. The goal of the project was to construct an antibody repertoire from a sample Ion Torrent dataset by investigating the technology-specific error types, such as large indels and homopolymer errors, that occur on top of amplification (PCR) errors.

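We propose run-length encoding (RLE) of the reads and the germline segments as a workaround that eliminates homopolymer errors, followed by decoding of the cluster consensus sequences. A homopolymer error changes only the length of a run of identical bases, so reads that differ only by such errors become identical once the runs are collapsed.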
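
The idea can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names (`rle_encode`, `rle_decode`, `consensus_runs`), the toy reads, and the majority vote used to restore run lengths in the consensus are all assumptions made for illustration. The sketch also assumes that the reads in a cluster already share the same collapsed sequence.

```python
from collections import Counter
from itertools import groupby


def rle_encode(seq):
    """Split a sequence into its collapsed bases and the length of each run."""
    bases, runs = [], []
    for base, group in groupby(seq):
        bases.append(base)
        runs.append(sum(1 for _ in group))
    return "".join(bases), runs


def rle_decode(bases, runs):
    """Rebuild a sequence from collapsed bases and run lengths."""
    return "".join(base * n for base, n in zip(bases, runs))


def consensus_runs(run_lists):
    """Most common run length per position (assumption: simple majority vote)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*run_lists)]


# Toy reads (hypothetical) that differ only in homopolymer lengths.
reads = ["ACCCGTTA", "ACCGTTA", "ACCCGTTTA"]
encoded = [rle_encode(read) for read in reads]

# After encoding, all three reads share one collapsed sequence.
collapsed = {bases for bases, _ in encoded}
assert collapsed == {"ACGTA"}

# Decode the cluster consensus by voting on the run lengths.
runs = consensus_runs([r for _, r in encoded])
consensus = rle_decode("ACGTA", runs)
print(consensus)  # ACCCGTTA
```

Clustering on the collapsed sequences makes the comparison insensitive to homopolymer length, while the stored run lengths keep enough information to decode a full-length consensus afterwards.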


   Anton Bankevich
Project duration: Feb 2017 to Jun 2017