Information extraction and linked open data in chemistry

Journal article

Abstract:: Chemists not only produce a large volume of data-rich scholarly communication artifacts but have also adopted a highly formulaic style of writing, which makes the literature of the discipline an attractive target for automated data extraction. In previous work, we demonstrated the identification and extraction of chemical entities from scientific papers.[1][2] However, we did not address the extraction of the relationships that link the chemical entities both to each other and to the document object from which they were extracted. Using chemical synthesis procedures as an exemplar, we present a methodology for extracting both chemical entities and the relationships between them. Chemical synthesis procedures are collected by data-mining the chemical literature. Natural language processing tools and entity recognisers are then used to analyse the individual elements within these procedures and to assign them a grammatical structure. Relationships between the individual entities are then established, and the resulting structured information is stored in RDF[3] using domain-specific ontologies. Once the information is expressed in a semantic format, it can be searched and indexed using the RDF query language SPARQL[4] and used to generate visualisations such as visual document summaries. The ultimate goal of the work documented here is to make the data contained in publications available to, and re-usable by, the scientific community.
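
As a rough illustration of the last two stages described in the abstract (storing extracted relationships as subject-predicate-object statements and then querying them), the following minimal Python sketch may help. It is not the authors' implementation: every "ex:" name, the step and paper identifiers, and the helper functions are hypothetical, and a plain list of tuples stands in for a real RDF store and SPARQL engine.

```python
from typing import List, Optional, Tuple

# One RDF-style statement: (subject, predicate, object).
Triple = Tuple[str, str, str]


def build_triples() -> List[Triple]:
    """Describe two steps of an imagined synthesis procedure.

    All "ex:" terms are made up for illustration; they are not the
    domain-specific ontology used in the article.
    """
    return [
        ("ex:step1", "rdf:type", "ex:AddAction"),
        ("ex:step1", "ex:hasReagent", "ex:sodiumHydride"),
        ("ex:step1", "ex:hasSolvent", "ex:THF"),
        # Link the step back to the document it was extracted from.
        ("ex:step1", "ex:extractedFrom", "ex:paper42"),
        ("ex:step2", "rdf:type", "ex:StirAction"),
        ("ex:step2", "ex:extractedFrom", "ex:paper42"),
    ]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit a pattern.

    None acts as a wildcard, playing the role of a SPARQL variable.
    """
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    store = build_triples()
    # Roughly: SELECT ?step WHERE { ?step ex:extractedFrom ex:paper42 }
    for subject, _, _ in match(store, p="ex:extractedFrom", o="ex:paper42"):
        print(subject)
```

In a real system the tuples would be replaced by an RDF graph with full URIs, and the pattern match by a SPARQL query; the sketch only shows why expressing entities and their relationships as triples makes them searchable by pattern.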

Files:: Information extraction and linked open data in chemistry

(Author's original, bin, 611.7KB, Terms of use)

Institution:: "University of Cambridge"
Department:: Unilever Centre for Molecular Science Informatics, Department of Chemistry
Role:: Author

Institution:: "University of Cambridge"
Department:: Unilever Centre for Molecular Science Informatics, Department of Chemistry
Role:: Author

Institution:: "University of Cambridge"
Department:: Unilever Centre for Molecular Science Informatics, Department of Chemistry
Role:: Author

Institution:: "University of Cambridge"
Department:: Unilever Centre for Molecular Science Informatics, Department of Chemistry
Role:: Author

Copyright holder:: L. Hawizy et al.
Notes:: References
[1] S. E. Adams, J. M. Goodman, R. J. Kidd, A. D. McNaught, P. Murray-Rust, F. R. Norton, J. A. Townsend, and C. A. Waudby, “Experimental data checker: Better information for organic chemists,” Organic and Biomolecular Chemistry, vol. 2, pp. 3067–3070, 2004.
[2] P. Corbett and P. Murray-Rust, “High-throughput identification of chemistry in life science texts,” 2006, pp. 107–118. [Online]. Available: http://dx.doi.org/10.1007/11875741_11
[3] World Wide Web Consortium (W3C), “RDF Primer,” http://www.w3.org/TR/rdf-primer/, last accessed: 07/08/09.
[4] World Wide Web Consortium (W3C), “SPARQL Query Language for RDF,” http://www.w3.org/TR/rdf-sparql-query/, last accessed: 07/08/09.

Licence:: Terms and Conditions of Use for Oxford University Research Archive
