http://bioinformatics.oxfordjournals.org/content/28/5/739.full
Abstract
Motivation: Pathway diagrams from PubMed and the World Wide Web (WWW) contain valuable, highly curated information that is difficult to access without tools specifically designed and customized for the biological semantics and high content density of the images. There is currently no search engine or tool that can analyze pathway images, extract their pathway components (molecules, genes, proteins, organelles, cells, organs, etc.) and indicate the relationships among them.
Results: Here, we describe a resource of pathway diagrams retrieved from article and web-page images through optical character recognition, in conjunction with data mining and data integration methods. The recognized pathways are integrated into the BiologicalNetworks research environment, linking them to the wealth of data available in the BiologicalNetworks knowledgebase, which integrates data from >100 public data sources and the biomedical literature. Multiple search and analytical tools are available that allow the recognized cellular pathways, molecular networks and cell/tissue/organ diagrams to be studied in the context of integrated knowledge, experimental data and the literature.
Availability: BiologicalNetworks software and the pathway repository are freely available at www.biologicalnetworks.org.
Contact: baitaluk@sdsc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
The process of image recognition, data extraction and integration consists of several steps (Supplementary Fig. S1). First, objects and relations are extracted from the image, together with their coordinates. This is done using mathematical morphology and binary analysis routines of ImageJ (http://rsbweb.nih.gov/ij/). The 32-bit RGB image is converted to gray-scale and binarized by applying a threshold adjustment with the Huang filter. Mathematical morphology 'Closing' followed by 'Opening' operations are then applied to reduce the number of domains for recognition by removing areas that are too small. Finally, an 'analyze particles' procedure applied to the resulting image retrieves all possible candidates for nodes as objects represented by the points of an enclosing polygon.
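The binarize/clean/label sequence above can be sketched in pure Python. This is only an illustration of the idea, not the ImageJ implementation: the real pipeline uses the Huang threshold and ImageJ's 'analyze particles', while here a fixed threshold, a 3x3 structuring element and simple connected-component bounding boxes stand in for them.

```python
def threshold(img, t):
    """Binarize a grayscale image (list of rows) at a fixed threshold t."""
    return [[1 if px >= t else 0 for px in row] for row in img]

def _inside(img, y, x):
    return 0 <= y < len(img) and 0 <= x < len(img[0])

def dilate(img):
    """3x3 binary dilation."""
    h, w = len(img), len(img[0])
    return [[1 if any(_inside(img, y + dy, x + dx) and img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
             for x in range(w)] for y in range(h)]

def erode(img):
    """3x3 binary erosion."""
    h, w = len(img), len(img[0])
    return [[1 if all(_inside(img, y + dy, x + dx) and img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
             for x in range(w)] for y in range(h)]

def closing(img):   # fills small holes in foreground regions
    return erode(dilate(img))

def opening(img):   # removes small foreground specks
    return dilate(erode(img))

def label_components(img):
    """Return bounding boxes (y0, x0, y1, x1) of 4-connected foreground
    components -- the candidate node 'particles'."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                stack, box = [(y, x)], [y, x, y, x]
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    box = [min(box[0], cy), min(box[1], cx),
                           max(box[2], cy), max(box[3], cx)]
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = cy + dy, cx + dx
                        if _inside(img, ny, nx) and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append(tuple(box))
    return boxes
```

Applying `opening(closing(img))` to a binarized diagram removes isolated noise pixels while preserving node-sized blobs, whose bounding boxes then serve as recognition candidates.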
Text recognition is done separately, with preliminary image processing (Kou et al., 2007; Li et al., 2008). Non-textual elements are removed using properties that characterize horizontal text objects, such as alignment, height-width ratio, character separation and connectedness. 'Cognitive OpenOCR (Cuneiform)' (http://en.openocr.org) software is used for batch image text recognition, and 'AutoIt v3' (http://www.autoitscript.com) for automated batch operations. The scanning procedure is executed both horizontally and vertically to extract horizontal and vertical text.
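The geometric cleanup heuristic can be made concrete with a small sketch: keep only components whose bounding boxes look like horizontal text (small height, wide aspect) and group boxes sharing a baseline into text lines. The thresholds and grouping rule here are illustrative assumptions, not the parameters used in the actual pipeline; vertical text would be handled the same way after rotating the image 90 degrees.

```python
def looks_like_text(box, max_height=20, min_aspect=1.5):
    """Heuristic: horizontal text components are short and wide.
    Thresholds are illustrative, not the paper's actual values."""
    y0, x0, y1, x1 = box
    h, w = y1 - y0 + 1, x1 - x0 + 1
    return h <= max_height and w / h >= min_aspect

def group_lines(boxes, baseline_tol=3):
    """Group text-like boxes whose bottom edges are vertically aligned,
    approximating words that sit on the same text line."""
    text = sorted((b for b in boxes if looks_like_text(b)),
                  key=lambda b: (b[2], b[1]))  # by baseline, then x
    lines = []
    for box in text:
        if lines and abs(lines[-1][-1][2] - box[2]) <= baseline_tol:
            lines[-1].append(box)
        else:
            lines.append([box])
    return lines
```

Each grouped line can then be cropped and passed to the OCR engine, while tall, thin components (arrows, membrane borders) are excluded from text recognition.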
Next, the extracted text, objects and relations are sent to the IntegromeDB (Baitaluk et al., 2010) (back-end database) data integration pipeline to check the consistency of the data (Supplementary Fig. S2 and Table S1). We check that all recognized objects are genes/proteins, processes, cell types, diseases or any other object type constituting the BioNets ontology (Baitaluk et al., 2010; Kozhenkov et al., 2011) in our database. Recognized relations are compared with the existing interactions, reactions and relations (from the literature or public databases) integrated in the IntegromeDB from >100 public databases. Only those recognized relations that are supported by at least one type of evidence from our integrated database are included in the final pathway.
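The evidence filter reduces to a simple lookup: a recognized relation survives only if the integrated database already records an interaction between the same pair of objects. A minimal sketch, with illustrative names and data shapes (the pipeline's actual matching also involves ontology typing and identifier mapping):

```python
def filter_by_evidence(recognized, integrated):
    """Keep (a, b, kind) relations supported by at least one integrated
    interaction between a and b, regardless of direction."""
    known = {frozenset(pair) for pair in integrated}
    return [rel for rel in recognized if frozenset(rel[:2]) in known]
```

For example, a recognized "TP53 inhibits MDM2" edge would be kept if IntegromeDB records any TP53-MDM2 interaction, while an edge to an unrecognized partner is dropped.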
We scanned a collection of >150 journals, 50 000 articles and ~25 000 figures available in PubMed Central and on the WWW through the Google Image service API by querying terms from the Pathway Ontology (e.g. 'Nicotine pathway'), the Gene Ontology Biological Process branch (e.g. 'molecular synthesis'), '$GeneName pathway' (e.g. 'NF-κB pathway'), etc. After filtering out images not containing objects/relations, the top 1012 pathway/network diagrams (richest in number of literature-supported relations) are stored on a remote server, and the Lucene open-source search engine (http://lucene.apache.org) is used to index, retrieve and rank the image text descriptions (using the default statistical ranking). In the case of a publication, the image description is the image legend, whereas in the case of a web page, a specifically designed algorithm retrieves the most appropriate description from the web-page text surrounding the image. Image publication date and source journal are stored as separate fields that can also be used to sort the results.
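Lucene's default statistical ranking is TF-IDF based; the toy scorer below makes the retrieval idea concrete. It is an assumption-laden sketch, not Lucene's exact scoring formula (which also includes length normalization and coordination factors).

```python
import math

def rank(query, descriptions):
    """Rank image descriptions against a query with a TF-IDF-style score,
    roughly mimicking Lucene's default statistical ranking."""
    docs = [d.lower().split() for d in descriptions]
    n = len(docs)

    def idf(term):
        df = sum(term in d for d in docs)  # document frequency
        return math.log((n + 1) / (df + 1)) + 1

    terms = query.lower().split()
    scores = [(sum(d.count(t) / len(d) * idf(t) for t in terms), i)
              for i, d in enumerate(docs)]
    return [i for _, i in sorted(scores, reverse=True)]
```

Indexing the legend (or the surrounding web-page text) as the description field lets a query such as 'nicotine pathway' rank metabolism diagrams above unrelated figures.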