Applying deep-learning techniques to the identification of pathogenic amino acid substitutions by means of recurrent neural networks

Over the past decade, deep learning techniques have achieved great success in applications such as face recognition, text mining, and text generation. These techniques have found applications in computational biology as well, one branch of which deals with the functional analysis of protein molecules. It is well known that protein function can be changed dramatically by the substitution of a single amino acid. Based on data relating such substitutions to human diseases, we explore a deep-learning approach that uses recurrent neural networks to predict the possible effects of single amino acid substitutions on protein function. We also investigate different ways of representing protein structures.

For this purpose we used layers of long short-term memory (LSTM) cells combined with convolutional layers. An LSTM is a recurrent network architecture that excels at capturing both long- and short-range dependencies in a sequence, which we consider a key advantage for predicting the consequences of amino acid substitutions. We used the HUMSAVAR database as our source of reviewed, categorized amino acid variations. To represent amino acids, we tried one-hot encoded vectors, vectors of physico-chemical properties, and vector embeddings. We also tried to build a protein autoencoder similar to the molecular autoencoder described in recent scientific papers. In the autoencoder, each amino acid was represented as a one-hot vector, and the amino acid sequence of each protein was transformed into an internal latent space. The goal was to recover the initial amino acid sequence from the latent space. We achieved an accuracy of up to 40%.
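
The one-hot representation and the notion of a single amino acid substitution mentioned above can be illustrated with a minimal sketch. It is not the project's code: the alphabet ordering, the function names `one_hot_encode` and `apply_substitution`, and the example sequence are all assumptions made for illustration, and only the 20 standard residues are handled.

```python
# Illustrative sketch only: names and the example sequence are hypothetical,
# not taken from the project described above.

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional one-hot vector per residue of the sequence."""
    encoded = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        # Raises KeyError for residues outside the 20-letter alphabet.
        vector[INDEX[residue]] = 1
        encoded.append(vector)
    return encoded


def apply_substitution(sequence, position, ref, alt):
    """Apply a single substitution; position is 1-based."""
    if sequence[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    return sequence[:position - 1] + alt + sequence[position:]


# A made-up eight-residue sequence and an alanine-to-valine change at position 4.
wild_type = "MKTAYIAK"
variant = apply_substitution(wild_type, 4, "A", "V")

wild_type_encoded = one_hot_encode(wild_type)
variant_encoded = one_hot_encode(variant)
```

In this encoding the wild-type and variant inputs differ in exactly one row, so a sequence model has to rely on the surrounding context to judge the effect of the substitution.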


   Иван Сосин
   Андрей Афанасьев
Project duration: Feb 2017 to Jun 2017