The optimization of de novo transcriptome assembly strategies

De novo transcriptome assembly can provide essential biological information about an organism of interest, especially when full genome assembly and annotation are not feasible because of large genome size, ploidy, or other reasons.

In this study, we used 11 raw transcriptomes from different insect species of the Staphylinidae family. With approximately 120,000 species, this is the largest family among animals, and de novo transcriptome assembly will help to understand the molecular basis of the differences between individual species.

Because data quality varied between libraries, additional preparation was performed before assembly. We used libraries cleaned with Trimmomatic for the assembly. Note that reads whose mate was discarded during library cleaning, but which themselves passed the quality filter, were also used in the assembly.
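The handling of unpaired reads described above can be sketched as follows. This is a minimal illustration, not Trimmomatic's implementation: the function name `route_pairs`, the tuple layout, and the example reads are all hypothetical, and the pass/fail flags stand in for whatever quality criteria were actually applied.

```python
def route_pairs(pairs):
    """Split read pairs into intact pairs and orphan (single-end) reads.

    `pairs` is an iterable of (read_id, fwd_passed, rev_passed) tuples,
    where each boolean says whether that mate survived quality trimming.
    """
    paired, orphans = [], []
    for read_id, fwd_passed, rev_passed in pairs:
        if fwd_passed and rev_passed:
            paired.append(read_id)
        elif fwd_passed:
            orphans.append((read_id, "forward"))
        elif rev_passed:
            orphans.append((read_id, "reverse"))
        # Pairs in which both mates failed are dropped entirely.
    return paired, orphans


# Hypothetical example input, not data from the study.
example = [
    ("r1", True, True),
    ("r2", True, False),
    ("r3", False, True),
    ("r4", False, False),
]
paired, orphans = route_pairs(example)
```

Here both the intact pairs and the orphans would be passed on to the assembler; only reads that failed cleaning themselves are excluded.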

Assembly was performed with three tools: TransAbyss, SOAPdenovo-Trans, and Trinity. With the first two programs, we assembled transcriptomes at k-mer lengths of 32, 48, and 64 nucleotides. We then checked the presence and completeness of orthologous genes in the resulting transcriptomes with BUSCO, using the “insecta” dataset (odb9). The quality of individual contigs was assessed with Transrate.
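As background on what the k-mer length controls, the sketch below splits a read into overlapping k-mers. A read of length L yields L - k + 1 k-mers, so longer k-mers leave fewer overlapping words per read, and reads shorter than k contribute none. The read length of 100 nt and the synthetic sequence are assumptions for illustration only; they are not taken from the study, and this is not how any of the assemblers is implemented internally.

```python
def kmers(read, k):
    """Return every overlapping substring of length k in `read`."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


READ_LENGTH = 100  # assumed read length, for illustration only
read = "ACGT" * (READ_LENGTH // 4)  # synthetic read, not real data

# Number of k-mers obtained from one read at each k used in the text.
counts = {k: len(kmers(read, k)) for k in (32, 48, 64)}
```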

Our results showed that the number of full-length orthologous gene sequences in the assembled libraries decreases as k-mer length increases. We propose that the optimal strategy for working with transcriptomic data is to use all three assemblers, compare the resulting assemblies in detail, and generate the best assembly by picking the contigs with the highest quality scores.
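One simple way to picture the proposed contig selection is sketched below, under stated assumptions: contigs from all assemblies are pooled, each carries a per-contig quality score where higher is better, and a cutoff decides which are kept. The `Contig` type, the `select_best` function, the cutoff of 0.5, and the example scores are all hypothetical and are not results or parameters from the study.

```python
from typing import NamedTuple


class Contig(NamedTuple):
    assembler: str  # which assembly the contig came from
    name: str
    score: float    # per-contig quality score, higher is better


def select_best(contigs, min_score):
    """Keep contigs scoring at least `min_score`, best first."""
    kept = [c for c in contigs if c.score >= min_score]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Hypothetical pooled contigs with made-up scores.
pooled = [
    Contig("Trinity", "t1", 0.82),
    Contig("TransAbyss_k32", "a1", 0.35),
    Contig("SOAPdenovo-Trans_k48", "s1", 0.67),
]
best = select_best(pooled, min_score=0.5)
```

This sketch does not remove redundant contigs that represent the same transcript in different assemblies; a real merged assembly would need an additional clustering or deduplication step.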

 

Supervisor:
   Александр Предеус
Project duration: Feb 2017 to Jun 2017