Annotation Guidelines:

Heuristics applied while selecting terms for semantic markup from the text of Reis et al. (2008) Impact of Environment and Social Gradient on Leptospira Infection in Urban slums, PLoS Neglected Tropical Diseases 2(4): e228

by Katie Portwin and David Shotton

Image Bioinformatics Research Group, Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, UK


Introduction


Semantic markup of the text of the cited PLoS Neglected Tropical Diseases (PLoS NTD) article by Reis et al. (2008) was implemented manually by Katie Portwin and David Shotton, Image Bioinformatics Research Group, Department of Zoology, University of Oxford. The semantically enhanced version of that article was published on 3 September 2008 at doi:10.1371/journal.pntd.0000228.x001, and the paper by Shotton et al. (2009) describes the full range of semantic enhancement applied to that Reis et al. (2008) article. A separate document (Shotton and Portwin, 2009; doi:10.1371/journal.pntd.0000228.x009) describes the technical implementation of those semantic enhancements, while this document describes the heuristics we applied when deciding which textual terms were to be assigned to the semantic classes highlighted in the text of the enhanced version of the article.


Self-referencing information for this document


Citation: Portwin K and Shotton D (2009). Annotation Guidelines: Heuristics applied while selecting terms for semantic markup from the text of Reis et al. (2008) Impact of Environment and Social Gradient on Leptospira Infection in Urban slums, PLoS Neglected Tropical Diseases 2(4): e228. (doi:10.1371/journal.pntd.0000228.x010).


URL: http://dx.doi.org/10.1371/journal.pntd.0000228.x010.


Corresponding author: David Shotton <david.shotton@zoo.ox.ac.uk>.

Copyright and license statement

© 2009 David Shotton, University of Oxford. This document, the semantic enhancements we made, the enhanced version of the article and the original article are all open-access publications distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and sources are credited.

Background

To enable semantic mark-up of text to be applied cost-effectively in a journal publishing environment, it will be necessary to automate it. Sophisticated text mining and natural language processing tools are currently being developed to recognise textual instances and link them automatically to domain-specific ontologies. However, our own experience in marking up the text of the chosen PLoS NTD article by Reis et al. (2008) clearly showed the requirement for human intervention. For example, we wished to record 'slums' and 'slum environments' as types of habitat in which the disease leptospirosis was likely to occur. However, blindly marking up every occurrence of a phrase in which the word 'slum' appeared was not appropriate, since a 'slum dweller' is clearly a person, not a habitat. To guide our mark-up, we developed the following set of simple heuristics, which may be of assistance to others undertaking similar work.

Heuristics for semantic mark-up

We provided semantic enhancements to the title, abstracts and text of the PLoS NTD article and to the titles of the cited references in its reference list, in the form of optional coloured background highlighting, by marking up textual instances of nine classes of entities: date, disease, habitat, institution, organism (English name), person (a person's proper name), place, protein and taxon (i.e. Linnaean genus or species Latin name), each class being associated with a particular colour.  In the following explanations, members of these classes are called ‘controlled terms’, and commonly occurring phrases that it would not be sensible to highlight are called ‘stop words’.


The heuristics we developed when deciding whether or not to apply highlighting to an occurrence of a particular term are as follows:


1.   Adjectival use of controlled terms, e.g. 'slum dweller', 'Leptospira antibodies', 'Leptospira transmission', 'Mumbai slums', where the nouns 'slum', 'Leptospira' and 'Mumbai' are themselves controlled terms:

    

2.    Stop words, e.g. the occurrence of the noun 'disease', or of any other class name:


3.    Ambiguous terms, e.g. 'household' (which is used in the PLoS NTD article to mean either a physical house or a social group of persons):


4.   Long phrases, e.g. 'the sanitation infrastructure where slum inhabitants reside':


Variations in sentence structure between languages lead to interesting differences.  In the Conclusion of the English language abstract of our selected PLoS NTD article (http://dx.doi.org/10.1371/journal.pntd.0000228.x001#abstract0), the phrase 'slum residents' is not marked up, for the reason given above, since the word 'slum' is used adjectivally.  However, in the Portuguese language abstract (http://dx.doi.org/10.1371/journal.pntd.0000228.s003.x001), this phrase is translated as 'residentes de favelas', since Romance languages generally do not form compound nouns of this kind, and so in this case the noun 'favela' (meaning slum or shanty town) is marked up as a habitat.
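
The mark-up described in this document was carried out manually, so the following is only a minimal sketch of how the adjectival-use and stop-word heuristics might be approximated in code, not the procedure we actually used. It assumes that a controlled term is skipped when it is immediately followed by one of a small list of head nouns, and that class names are never highlighted. The vocabulary, the head-noun list and the function name mark_up are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical controlled vocabulary: lower-cased term -> semantic class.
CONTROLLED_TERMS = {
    "slum": "habitat",
    "slums": "habitat",
    "favela": "habitat",
    "favelas": "habitat",
    "leptospira": "taxon",
    "leptospirosis": "disease",
    "mumbai": "place",
}

# Stop words: the nine class names themselves are never highlighted.
STOP_WORDS = {
    "date", "disease", "habitat", "institution", "organism",
    "person", "place", "protein", "taxon",
}

# Hypothetical head nouns: if one of these directly follows a controlled
# term, the term is assumed to be used adjectivally and is skipped.
HEAD_NOUNS = {"dweller", "dwellers", "residents", "antibodies", "transmission"}


def mark_up(text):
    """Return (token, class) pairs for the terms that would be highlighted."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in STOP_WORDS:
            continue  # class names are stop words
        semantic_class = CONTROLLED_TERMS.get(word)
        if semantic_class is None:
            continue  # not a controlled term
        following = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if following in HEAD_NOUNS:
            continue  # adjectival use, e.g. 'slum residents'
        hits.append((token, semantic_class))
    return hits


print(mark_up("slum residents"))         # nothing is highlighted
print(mark_up("residentes de favelas"))  # 'favelas' is highlighted as a habitat
```

A fixed head-noun list of this kind is deliberately crude. It reproduces the contrast between the English and Portuguese abstracts, but ambiguous terms such as 'household' and long phrases would still need human judgement.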


References


Reis RB, Ribeiro GS, Felzemburgh RDM, Santana FS, Mohr S, Melendez AXTO, Queiroz A, Santos AC, Ravines RR, Tassinari WS, Carvalho MS, Reis MG and Ko AI (2008). Impact of environment and social gradient on Leptospira infection in urban slums. PLoS Neglected Tropical Diseases 2(4): e228 (doi:10.1371/journal.pntd.0000228).


Shotton D and Portwin K (2009). Technical implementation of the semantic enhancements applied to Reis et al. (2008) Impact of environment and social gradient on Leptospira infection in urban slums. PLoS Neglected Tropical Diseases 2(4): e228. (doi:10.1371/journal.pntd.0000228.x009).


Shotton D, Portwin K, Klyne G and Miles A (2009). Adventures in semantic publishing: exemplar semantic enhancement of a research article. (submitted for publication). Preprint available at http://purl.org/net/semanticpublication/Shotton_et_al_PLoS_enhancement_report.pdf.