Improving NLP in molecular biology
Natural language processing (NLP) techniques have grown steadily in complexity and popularity and are increasingly important for text mining in data-intensive fields such as computational biology. Unfortunately, many recent methods based on neural networks and unsupervised deep learning must be adapted before they can be applied efficiently to molecular-biological texts.
Text processing usually begins with parsing and tokenisation. Most modern parsers split space-separated multi-word terms (also known as n-grams or collocations), such as “amino acid” and “New York”, into separate tokens, which blurs their context information. This loss can be formally defined as the KL divergence between the order-sensitive and order-insensitive contexts of an n-gram. Naïvely tokenising all possible n-grams of a specified length greatly dilutes the token dictionary and hinders downstream machine learning. It is therefore important to select only the most informative n-grams, which we attempted using three methods: (1) word collocation networks with centrality measures (degree, closeness, betweenness and PageRank); (2) the tf-idf statistic, which penalises tokens overrepresented across the entire corpus; and (3) context approximation based on Bayesian inference. We limited our study to bigrams.
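As an illustration of the second method, the sketch below scores candidate bigrams with a plain tf-idf statistic and merges the selected ones into single tokens. It is a minimal sketch, not the implementation used in this study: the function names (tfidf_bigram_scores, merge_bigrams), the unsmoothed idf, the per-bigram maximum over documents and the underscore join are all assumptions made for the example.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return an iterator over adjacent token pairs of a token list."""
    return zip(tokens, tokens[1:])


def tfidf_bigram_scores(docs):
    """Map each bigram to its highest tf-idf value over all documents.

    docs is a list of already tokenised documents (lists of strings).
    Assumption: unsmoothed idf = log(N / df), so a bigram that occurs
    in every document scores exactly zero.
    """
    n_docs = len(docs)
    doc_counts = [Counter(bigrams(doc)) for doc in docs]

    # Document frequency: number of documents containing each bigram.
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())

    scores = {}
    for counts in doc_counts:
        total = sum(counts.values())
        for bg, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / df[bg])
            scores[bg] = max(scores.get(bg, 0.0), tf * idf)
    return scores


def merge_bigrams(tokens, selected):
    """Greedily join selected bigrams into single tokens, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in selected:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

A bigram such as ("amino", "acid") that recurs within a few documents scores highly, whereas one that appears in every document is penalised to zero. Only bigrams above a chosen threshold would then be passed to merge_bigrams, which keeps the token dictionary from being diluted.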
Another source of dictionary dilution is the abundance of extremely rare terms, which can be represented by a single token to facilitate term embedding. Most of these terms are chemical names, which we identify using an LSTM recurrent neural network that discriminates chemical notation (IUPAC and SMILES).
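The replacement step can be sketched as follows. This is a hedged illustration only: the regular-expression check looks_chemical is a hypothetical stand-in for the trained LSTM classifier, and the function name collapse_rare_chemicals, the min_count threshold and the "<CHEM>" placeholder are assumptions introduced for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for the LSTM classifier: flags tokens containing
# characters typical of IUPAC or SMILES strings (digits, brackets, bond
# symbols). A real system would call the trained network here instead.
_CHEM_LIKE = re.compile(r"[\d\[\]()=#@]")


def looks_chemical(token):
    """Crude heuristic check used only to make the sketch runnable."""
    return bool(_CHEM_LIKE.search(token))


def collapse_rare_chemicals(docs, min_count=2, is_chemical=looks_chemical,
                            placeholder="<CHEM>"):
    """Replace rare tokens judged to be chemical names with one placeholder.

    docs is a list of tokenised documents. A token is replaced when it
    occurs fewer than min_count times in the whole corpus and
    is_chemical(token) returns True; all other tokens are kept as-is.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [placeholder if counts[tok] < min_count and is_chemical(tok) else tok
         for tok in doc]
        for doc in docs
    ]
```

Passing the classifier as the is_chemical argument keeps the counting and replacement logic independent of how chemical names are actually recognised.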