New for: D2, D3
One existing approach to chemical name detection is dictionary look-up, which matches text against a dictionary of term variations and synonyms. This approach was chosen for the analysis in the current work. Its main challenge is that the available chemical databases are incomplete or sometimes focus on particular classes of compounds, such as metabolites or approved drugs. Several resources should therefore be merged to generate a comprehensive dictionary. Merging data sources requires defining criteria for compound identity, i.e. how to treat structures that differ only in stereochemistry, charge, isotopes, etc.
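The look-up idea described above can be illustrated with a minimal sketch. The dictionary contents and the function names (`build_index`, `find_chemical_names`) are hypothetical and chosen only for illustration; a real dictionary would be built from the merged databases and would need more careful tokenization.

```python
import re

# Hypothetical toy dictionary: preferred name -> synonyms / term variations.
TOY_DICTIONARY = {
    "acetylsalicylic acid": ["aspirin", "2-acetoxybenzoic acid"],
    "ethanol": ["ethyl alcohol"],
}

def build_index(dictionary):
    """Map every lower-cased term (preferred name or synonym) to its preferred name."""
    index = {}
    for preferred, synonyms in dictionary.items():
        for term in [preferred, *synonyms]:
            index[term.lower()] = preferred
    return index

def find_chemical_names(text, index):
    """Return (start, end, matched text, preferred name) for each dictionary hit."""
    # Longer terms come first so that they win over shorter terms at the same position.
    terms = sorted(index, key=len, reverse=True)
    # The look-arounds reject hits embedded in longer words (e.g. "ethanol" in "methanol").
    pattern = re.compile(
        r"(?<![\w-])(?:" + "|".join(re.escape(t) for t in terms) + r")(?![\w-])",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), index[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

index = build_index(TOY_DICTIONARY)
hits = find_chemical_names("Aspirin was dissolved in ethyl alcohol, not methanol.", index)
```

In this sketch both hits are normalized to the preferred name, which is why the coverage of synonyms in the merged dictionary directly limits what can be detected.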
Compounds can be merged on the basis of CAS numbers, InChI identifiers, or synonym overlap. The method proposed here instead merges databases by analyzing the 2D graph representation of the compounds. Direct comparison of structures is a more flexible approach because no structure information is lost.
To create the dictionary, a workflow is developed that merges databases by comparing the 2D graph representations of the compounds. The user can set the criteria for structure identity according to the research needs.
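How user-defined identity criteria might drive the merge can be sketched as follows. This is not the workflow itself: it assumes that each database record already carries a canonical string for its 2D graph (here the placeholder `"ALA_GRAPH"`, which a cheminformatics toolkit would normally compute) together with separate stereo, charge and isotope annotations. All class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    source: str                    # database the record came from
    connectivity: str              # assumed precomputed canonical 2D-graph string
    stereo: str = ""               # stereo annotation, empty if unspecified
    charge: int = 0                # net formal charge
    isotopes: str = ""             # isotope annotation, empty if natural abundance
    names: frozenset = frozenset() # names and synonyms from this source

@dataclass(frozen=True)
class IdentityCriteria:
    """True means the feature must match for two structures to count as identical."""
    stereo: bool = True
    charge: bool = True
    isotopes: bool = True

def identity_key(rec, criteria):
    """Build the comparison key, leaving out every feature the user chose to ignore."""
    return (
        rec.connectivity,
        rec.stereo if criteria.stereo else None,
        rec.charge if criteria.charge else None,
        rec.isotopes if criteria.isotopes else None,
    )

def merge_records(records, criteria):
    """Group records by identity key and pool their names into one synonym set."""
    merged = {}
    for rec in records:
        merged.setdefault(identity_key(rec, criteria), set()).update(rec.names)
    return merged

records = [
    Record("db_a", "ALA_GRAPH", stereo="S", names=frozenset({"L-alanine"})),
    Record("db_b", "ALA_GRAPH", stereo="R", names=frozenset({"D-alanine"})),
    Record("db_c", "ALA_GRAPH", names=frozenset({"alanine"})),
]
strict = merge_records(records, IdentityCriteria())               # stereo must match
relaxed = merge_records(records, IdentityCriteria(stereo=False))  # stereo ignored
print(len(strict), len(relaxed))  # 3 1
```

With strict criteria the three records stay separate dictionary entries; ignoring stereochemistry collapses them into one entry whose synonym set contains all three names. Which of the two behaviours is preferable depends on the research needs.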
In the course of the work, the criteria for structure identity that best serve text mining purposes should be defined, i.e. which structural features should be considered or ignored when comparing compounds. The performance of the resulting dictionary should then be compared with that of existing dictionaries.