Text-mining-based retrieval of omics pipelines

In this work, we developed a method for automatically extracting proteomics pipelines from scientific papers. The objective was to determine which proteomics tools are used together, and to characterize these combinations by the data formats they exchange, how frequently they occur, and the efficacy of the resulting data analysis.

We manually annotated 140 papers to produce a proteomics pipeline graph for each paper. Each node of a graph corresponds to an operation and stores the EDAM operation identifier together with the name and version of the tool used. Each edge corresponds to a data type and stores the data format along with the DOI and PMID of the paper. The per-paper graphs were then merged into a single structure representing the possible flows of data in the course of a proteomics analysis.
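The graph structure and the merge step described above can be sketched as follows. This is only a minimal illustration using the Python standard library, not the project's actual implementation: the function names `make_paper_graph` and `merge_graphs`, the dictionary layout, and all identifiers in the example (EDAM operation IDs, tool names, DOIs, PMIDs) are hypothetical placeholders. The sketch assumes that two nodes are the same when their operation, tool name, and version all agree, and that identical edges from different papers are pooled while keeping track of the papers that support them.

```python
from collections import defaultdict

# A node is a tuple: (EDAM operation ID, tool name, tool version).
# An edge links two nodes and carries a data format plus the paper's DOI and PMID.


def make_paper_graph(doi, pmid, steps):
    """Build one paper's graph from (source, target, data_format) triples."""
    nodes = set()
    edges = []
    for source, target, data_format in steps:
        nodes.update((source, target))
        edges.append({"source": source, "target": target,
                      "format": data_format, "doi": doi, "pmid": pmid})
    return {"nodes": nodes, "edges": edges}


def merge_graphs(paper_graphs):
    """Union the nodes; pool identical edges and record which papers support them."""
    nodes = set()
    support = defaultdict(list)
    for graph in paper_graphs:
        nodes |= graph["nodes"]
        for edge in graph["edges"]:
            key = (edge["source"], edge["target"], edge["format"])
            support[key].append((edge["doi"], edge["pmid"]))
    return {"nodes": nodes, "edges": dict(support)}


# Placeholder data: none of these identifiers refer to real tools or papers.
search = ("operation_A", "ToolA", "1.0")
validate = ("operation_B", "ToolB", "2.3")
paper1 = make_paper_graph("10.0000/example.1", "00000001",
                          [(search, validate, "mzIdentML")])
paper2 = make_paper_graph("10.0000/example.2", "00000002",
                          [(search, validate, "mzIdentML")])
merged = merge_graphs([paper1, paper2])
```

In this layout, the number of papers attached to an edge gives a simple measure of how often two tools are used together with a given data format.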

We developed a set of tools for automatic annotation using the merged graph. The first tool extracts the parts of a text that describe mass-spectrometry data analysis, using word clustering on a relevant corpus. The second tool extracts candidate tool names from a given text. The third tool searches the merged graph for nodes that may correspond to each tool, predicts the most likely connections, and outputs a graph in GML format. The automatically produced annotations can be used to enrich the manually created graph with new data.
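As an illustration of the third step, the sketch below matches candidate tool names against nodes of a merged graph, keeps the best-supported connection between each pair of matched nodes, and serializes the result as GML. The text does not state how connections are actually predicted, so ranking by the number of supporting papers is an assumption made here for the example. The function names, the case-insensitive name matching, and all data values are likewise hypothetical placeholders.

```python
def match_nodes(merged_nodes, tool_names):
    """Return merged-graph nodes whose tool name matches a candidate (case-insensitive)."""
    wanted = {name.lower() for name in tool_names}
    return sorted(node for node in merged_nodes if node[1].lower() in wanted)


def predict_edges(merged_edges, matched):
    """For each ordered pair of matched nodes, keep the format seen in the most papers.

    Assumption: paper count stands in for "most likely"; the real scoring may differ.
    """
    matched = set(matched)
    best = {}
    for (src, dst, fmt), papers in merged_edges.items():
        if src in matched and dst in matched:
            pair = (src, dst)
            if pair not in best or len(papers) > best[pair][1]:
                best[pair] = (fmt, len(papers))
    return best


def to_gml(nodes, edges):
    """Serialize nodes and predicted edges as a directed graph in GML text form."""
    ids = {node: index for index, node in enumerate(nodes)}
    lines = ["graph [", "  directed 1"]
    for (operation, tool, version), index in ids.items():
        lines += ["  node [", f"    id {index}", f'    label "{tool}"',
                  f'    operation "{operation}"', f'    version "{version}"', "  ]"]
    for (src, dst), (fmt, count) in sorted(edges.items()):
        lines += ["  edge [", f"    source {ids[src]}", f"    target {ids[dst]}",
                  f'    label "{fmt}"', f"    weight {count}", "  ]"]
    lines.append("]")
    return "\n".join(lines)


# Placeholder merged graph: nodes are (EDAM operation ID, tool name, version) tuples,
# edges map (source, target, format) to the (DOI, PMID) pairs that support them.
search = ("operation_A", "ToolA", "1.0")
validate = ("operation_B", "ToolB", "2.3")
quantify = ("operation_C", "ToolC", "0.9")
merged_nodes = {search, validate, quantify}
merged_edges = {
    (search, validate, "mzIdentML"): [("10.0000/example.1", "00000001"),
                                      ("10.0000/example.2", "00000002")],
    (search, validate, "pepXML"): [("10.0000/example.3", "00000003")],
    (validate, quantify, "mzTab"): [("10.0000/example.1", "00000001")],
}

matched = match_nodes(merged_nodes, ["toola", "ToolB"])
predicted = predict_edges(merged_edges, matched)
gml_text = to_gml(matched, predicted)
```

Writing GML by hand keeps the example free of third-party dependencies; a graph library with a GML writer would serve the same purpose.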

 

Student:
   Yulia Kondratenko
Supervisor:
Project duration: Feb 2017 to Jun 2017