Development and improvement of search strategies in the GeneQuery system

GeneQuery is a new computational tool for finding related phenotypes using ad hoc transcriptional signatures. Unlike earlier analogues, the service works with primary, uncurated data from the open Gene Expression Omnibus (GEO) expression data repository. GeneQuery has already proved its biological value by finding fundamentally new and biologically significant information in a huge volume of expression data.

GeneQuery currently uses only information obtained by processing microarray experiments. In this work, our goal was to add the results of RNA sequencing (RNA-seq) experiments for mouse, rat, and human to the GeneQuery database. First, we generated transcript sequences from the genome sequences and annotations for each species, taken from Gencode (mouse, human) or Ensembl (rat). Next, we built kallisto indices for these transcriptomes to allow fast quantification of RNA-seq experiments. We then generated a master table with the GSE, GSM, and SRR IDs of each sample, together with its download links. The developed pipeline allowed us to download, decompress, and quantify the samples efficiently. Finally, we merged the tsv files that belonged to the same GSE dataset and performed automated clustering with WGCNA.
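The merging step can be illustrated with a minimal sketch. It assumes that each sample's tsv file follows the standard kallisto abundance.tsv layout (columns target_id, length, eff_length, est_counts, tpm) and that the master table can be reduced to (GSE, GSM) pairs. The function names and the choice of TPM as the merged value are assumptions made for illustration, not the project's actual code.

```python
import csv
from collections import defaultdict


def read_tpm(handle):
    """Return {target_id: tpm} from a kallisto abundance.tsv-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


def merge_by_gse(master_rows, tpm_by_gsm):
    """Build one transcript-by-sample matrix per GSE dataset.

    master_rows: iterable of (gse, gsm) pairs (hypothetical master-table layout).
    tpm_by_gsm:  {gsm: {target_id: tpm}} for the samples that were quantified.
    Returns {gse: (sample_ids, {target_id: [tpm, one value per sample]})}.
    """
    samples = defaultdict(list)
    for gse, gsm in master_rows:
        # Skip samples without quantification results and duplicate rows.
        if gsm in tpm_by_gsm and gsm not in samples[gse]:
            samples[gse].append(gsm)

    merged = {}
    for gse, gsms in samples.items():
        # Union of transcript IDs; a transcript missing in a sample gets 0.0.
        targets = sorted(set().union(*(tpm_by_gsm[g] for g in gsms)))
        matrix = {t: [tpm_by_gsm[g].get(t, 0.0) for g in gsms] for t in targets}
        merged[gse] = (gsms, matrix)
    return merged
```

Each resulting matrix holds one row per transcript and one column per sample of the dataset, which is the shape a clustering tool such as WGCNA would take as input after any gene-level aggregation and filtering the project applies.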

As a result, we wrote a pipeline that downloads a set of samples, quantifies them using ultra-fast transcript quantification methods, and then removes the bulky raw data. Possible future directions for the project include testing novel clustering methods in the GeneQuery system, choosing the optimal one, and applying it to the collected experiments.
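The download, quantify, and clean-up cycle described above could look roughly like the following sketch. It only builds the kallisto quant command line and leaves the download and execution steps to caller-supplied functions, since the abstract does not say how they were implemented. The names fetch, quantify, and process_sample are hypothetical, and the default fragment length and standard deviation for single-end reads are placeholder assumptions.

```python
import os


def kallisto_quant_cmd(index, out_dir, fastqs, frag_len=200, frag_sd=20):
    """Build a `kallisto quant` command line as a list of arguments.

    Single-end input needs --single plus a fragment length (-l) and its
    standard deviation (-s); the defaults here are placeholder assumptions.
    """
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir]
    if len(fastqs) == 1:
        cmd += ["--single", "-l", str(frag_len), "-s", str(frag_sd)]
    return cmd + list(fastqs)


def process_sample(srr, fetch, quantify, work_dir):
    """Download one sample, quantify it, and always delete the raw reads.

    fetch(srr, work_dir) -> list of local read files (hypothetical callable).
    quantify(srr, files) -> any result, e.g. the path to the output table.
    """
    raw_files = fetch(srr, work_dir)
    try:
        return quantify(srr, raw_files)
    finally:
        # Raw reads are large; only the small quantification output is kept.
        for path in raw_files:
            if os.path.exists(path):
                os.remove(path)
```

In a real run, quantify would pass the generated command to something like subprocess.run; keeping it as a parameter makes the clean-up logic easy to check without kallisto installed.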

 

Supervisor:
   Александр Предеус
Project duration: Feb 2017 to Jun 2017