A HANDBOOK OF INFORMATION TO ACCOMPANY THE LANCASTER SPEECH, THOUGHT AND WRITING PRESENTATION CORPUS (STOP)

http://www.comp.lancs.ac.uk/computing/users/eiamjw/handbook/ba.html

Martin Wynne
Department of Linguistics and Modern English Language,
Lancaster University,
Lancaster
LA1 4YT
[email protected]

1. INTRODUCTION AND HISTORY

A corpus of around 250,000 words has been constructed and annotated for categories of speech and thought presentation (also known as speech and thought reporting or representation) using a tagset which has been developed by Mick Short, Elena Semino, Jonathan Culpeper and Martin Wynne at Lancaster University. This tagset is an extension of the model of speech and thought presentation (ST&WP) proposed in Leech and Short (1981), which posits a continuum of categories along an axis representing degrees of narrator's intervention.

Originally a pilot corpus of some 40,000 words of fiction texts was compiled and annotated in 1994. A parallel pilot sample of 40,000 words of newspaper texts was then added in 1994 and 1995. This work was done with funds provided by the Faculty of Social Sciences at Lancaster University.

Following the award of a major British Academy research project grant, this 80,000-word pilot corpus was expanded in 1996 and 1997 to a nominally 240,000-word corpus. The fiction and newspaper sections were doubled in size, and a new section of biography and autobiography texts was added.

Analysis of the corpus is ongoing.


2. THE COMPOSITION OF THE CORPUS

2.1 Structure

There are approximately 250,000 words of text in the corpus, made up of 120 sections of about 2,000 words each. The final count is somewhat in excess of 240,000 because the texts were sampled so as to begin and end at fairly 'natural' breaks, allowing a reader of the corpus text to see enough of the relevant context to understand the narrative. It was usually preferred to find such a break after rather than before the 2,000-word mark, but as close to it as possible (for more on sampling strategies see 2.3 below).

The primary classification of the corpus is into three sections relating to narrative genres. These genres are: (i) fiction, (ii) newspaper news reports and (iii) biography and autobiography. There are a minimum of 80,000 words in each section.

Within each genre there is a division between 'serious' and 'popular' texts. While such a division is inevitably difficult to some extent, this classification was made on the basis of what would commonly be held to be the case by the average educated reader. This will enable the testing of such preconceptions by the analysis of the actual texts.

In the fiction and biography there is also a binary division (cutting across the popular/serious division) between first and third person narratives. In the biography this creates a division between biography and autobiography.

Here is a tree diagram representing the structure of the corpus at these levels:

2.2 List of texts sampled

2.3 Sampling Strategies

Fiction

The decision as to what counted as 'high' literature was made by nine members of Lancaster University's Stylistics Research Group, who were given a list of authors whose works were available in electronic form in the Oxford Text Archive. Authors whom six or more informants judged as 'high' literature were selected. The extracts which we took constituted relatively independent units (e.g. chapters, sections or short stories). The popular fiction extracts consisted of eight 3rd-person narratives taken from the relevant category of the British National Corpus, to which we added two 1st-person narratives, so that we would have a greater range of narrative styles. Six extracts were from romantic novels and four from action novels.

In the fiction section a further subdivision was made within each text type between texts with first and third person narrators. This is paralleled in the biography/autobiography section, where the biography texts are all third person narratives and the autobiography texts are all first person narratives.

News

All the press data was taken from articles published in British national daily newspapers. Only newspapers that were felt to be prototypical members of the broadsheet or tabloid categories were selected. Newspapers were taken from the same or consecutive days in four samples: 4-5 December 1994, 11-12 December 1994, 28-29 April 1996 and 12-13 May 1996. This enabled us to select articles that covered the same story, and thereby facilitated comparisons between different newspaper styles (work which we hope to carry out in later phases of the project). News stories rather than editorials or magazine-style articles were chosen so that the press data would be as similar as possible in type to the narrative fiction data. The main criterion for selecting articles was that they should appear in at least three newspapers.

The newspapers used for the sample were:
Broadsheets:

Tabloids:

(Auto)biography

It was less clear-cut how to make a serious/popular distinction in the biography/autobiography section. It was decided to rely on the perceived seriousness of the subject, so politicians, serious writers and artists are considered 'serious' and TV stars, royalty and sports people are considered 'popular'. At the same time some attention was paid to the writing style of the biography in question so as not to include problematic cases, such as particularly badly written autobiographies of serious politicians, or highbrow biographies of pop stars, for example.

2.4 Text markup

The following SGML tags were used to mark up the text:
div1
text divisions - fiction (types: serious, popular), newspapers (types: broadsheet, tabloid), biography (types: serious, popular).
div2
sample (c.2000 words)
div3
(in newspapers) articles; other subdivisions of samples
header
bibliographical information and the list of speakers in the text
head
a text heading (e.g. newspaper headline or a chapter heading)
pb
page break
p
paragraph break
note
a note indicating additional information about the ST&WP tagging
edit
a note to the text editor indicating what stage of processing the text is at
sptag
speech, thought and writing presentation category tag

3. SPEECH & THOUGHT PRESENTATION TAGGING

3.1 Categories

N Narrative
NRS Narrative Report of Speech
NRW Narrative Report of Writing
NRT Narrative Report of Thought
NI Narrative Report of Internal State
NV Narrative Report of Voice
NRSA Narrative Report of Speech Act
NRWA Narrative Report of Writing Act
NRTA Narrative Report of Thought Act
NRSAP Narrative Report of Speech Act with Topic
NRWAP Narrative Report of Writing Act with Topic
NRTAP Narrative Report of Thought Act with Topic
IS Indirect Speech
IW Indirect Writing
IT Indirect Thought
FIS Free Indirect Speech
FIW Free Indirect Writing
FIT Free Indirect Thought
DS Direct Speech
DW Direct Writing
DT Direct Thought
FDS Free Direct Speech
FDW Free Direct Writing
FDT Free Direct Thought

Affixes:
e embedded
q with quote
h hypothetical
i inferred (see section on NIi below)
+ speech summary [not used]

e.g. NRSAPq is "Narrative Report of Speech Act with Topic with an embedded quotation".
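
The way a category name combines with affixes can be illustrated with a short Python sketch. It is not part of the project's own tooling. The function name describe_tag is hypothetical, and the sketch assumes that affixes are lower-case letters written directly before or after the upper-case category name, as in NRSAPq and NIi.

```python
import re

# Base categories, copied from the list in 3.1.
CATEGORIES = {
    "N": "Narrative",
    "NRS": "Narrative Report of Speech",
    "NRW": "Narrative Report of Writing",
    "NRT": "Narrative Report of Thought",
    "NI": "Narrative Report of Internal State",
    "NV": "Narrative Report of Voice",
    "NRSA": "Narrative Report of Speech Act",
    "NRWA": "Narrative Report of Writing Act",
    "NRTA": "Narrative Report of Thought Act",
    "NRSAP": "Narrative Report of Speech Act with Topic",
    "NRWAP": "Narrative Report of Writing Act with Topic",
    "NRTAP": "Narrative Report of Thought Act with Topic",
    "IS": "Indirect Speech",
    "IW": "Indirect Writing",
    "IT": "Indirect Thought",
    "FIS": "Free Indirect Speech",
    "FIW": "Free Indirect Writing",
    "FIT": "Free Indirect Thought",
    "DS": "Direct Speech",
    "DW": "Direct Writing",
    "DT": "Direct Thought",
    "FDS": "Free Direct Speech",
    "FDW": "Free Direct Writing",
    "FDT": "Free Direct Thought",
}

# Affixes from 3.1; '+' is listed as not used and is left out here.
AFFIXES = {"e": "embedded", "q": "with quote", "h": "hypothetical", "i": "inferred"}

# Assumption: lower-case affix letters may precede or follow the base tag.
_TAG = re.compile(r"^([eqhi]*)([A-Z]+)([eqhi]*)$")


def describe_tag(tag):
    """Split a tag into (base, expansion, list of affix meanings)."""
    m = _TAG.match(tag)
    if not m or m.group(2) not in CATEGORIES:
        raise ValueError(f"not a recognised ST&WP tag: {tag!r}")
    base = m.group(2)
    affixes = [AFFIXES[c] for c in m.group(1) + m.group(3)]
    return base, CATEGORIES[base], affixes
```

For instance, describe_tag("NIi") separates the NI category from the 'inferred' affix. Tags written with an upper-case affix letter are rejected by this sketch and would need normalising first.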

Notes:

#
is to be used to flag problems for discussion, i.e. things that we weren't sure how to analyse. May be used in conjunction with a portmanteau tag to indicate choices (see below)
-
(portmanteau tagging) is to be used for genuine ambiguity, where it is preferable to indicate two possible interpretations (e.g. IS-IT, NV-NRSA).
e
embedded ST&WP is indented on the page to make it easier to read
line breaks
all ST&WP tags are printed on a line on their own, to make it easier to extract, sort, count etc.
wordcounts
the unit used is the orthographic word, simply defined as a string of alphanumeric characters surrounded by spaces or punctuation. Hyphenated and contracted words count as one unit and genitive diacritics are ignored. e.g. "she's a man-eater" (3 words)
scare quotes
these are tagged with a <note>
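
The wordcount definition given above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the counting program used for the corpus. The function name count_words is hypothetical, and treating hyphens and apostrophes as word-internal only when they stand between alphanumeric characters is our reading of the definition.

```python
import re

# One orthographic word: a run of alphanumeric characters, optionally
# continued across internal hyphens or apostrophes, so that "man-eater"
# and "she's" each count as a single unit. Quotation marks and other
# punctuation around a word are not part of it.
# Assumption: [^\W_] (any Unicode letter or digit) stands in for
# "alphanumeric character".
_WORD = re.compile(r"[^\W_]+(?:[-'’][^\W_]+)*")


def count_words(text):
    """Count orthographic words in a stretch of text."""
    return len(_WORD.findall(text))
```

Applied to the handbook's own example, count_words("she's a man-eater") gives 3.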

3.2 Tagging Guidelines

When tagging, consideration is taken of all three of these levels of analysis. For more details of how this is done, it is necessary to refer to the guidelines for the tagging of particular categories in the different text types.

Some problems arise because there are fuzzy areas on the boundary between the presentation of linguistic and non-linguistic acts. There are also mentions of written language where the focus is not on the production or reception of the text, but merely on its existence. Such cases were not annotated as writing presentation.

3.2.1 NIi

The NI category was invented to cover cases in fiction where an omniscient narrator is able to report on the internal states of characters, e.g.:
Jed's heart lifted in his ribs. 
(Rupert Thomson, The Five Gates to Hell)

For a moment she didn't know where she was. 
(Graham Greene, Brighton Rock)

In texts which are not fiction with an omniscient narrator, NI (and all categories of thought presentation) are only used where the character in question has access to the thoughts and internal states which are reported. This usually means that only states and thoughts of the reporter are tagged as NI (or thought).

Often passages tagged as N-NI are in fact inferences based on what someone has said, and may even be quite close to the form of the original utterance. For example:

<sptag cat=N-NI who=S next=NRSAP whonext=B s=1 w=16>
The Palace was keen that the Prime Minister should continue until a 
successor had been elected.
[baker]
This is formally presented as the report of an internal state ('is keen that'), but the reader will infer that this is a report of something that was probably said by a spokesman for the Palace. However it is impossible to tell what type of speech report this might be. It could be FIS, if the original utterance was something like 'The Palace is keen that...'; it could be NRS followed by IS if the original utterance was something like 'The Prime Minister should continue...'; or it could be NRSAP if this report is a summary of what was said, possibly on different occasions and by different people with the words here bearing little relation to the actual words said.

Given the impossibility of classifying the type of speech report, such examples are tagged as NIi, since they are formally presentations of internal states, even though pragmatically they can function as speech presentation.

In all cases where there is no omniscient narrator then, what is formally presented as the narration of internal states or thought is tagged as ambiguous between N and the relevant category of thought presentation, e.g. N-NI, N-NRT, N-IT.

3.2.2 NRSAP

The analysis of the press data highlighted the existence of particular variants of existing categories, which appear to be typical of newspaper reporting. An example of this is the use of extremely long and detailed NRSAs, such as those given below:
Mr Major warned yesterday of the dangers of Britain being left behind if a group of European Union
members pushed ahead with a single currency. 
(The Independent on Sunday, "Blair Puts Labour Troops on Alert for Snap Election") 

Labour called last night for a streamlined Scandinavian style monarchy to banish Britain's class-ridden
society. 
(The Daily Mirror, "Cut the Royals Down to Size")
In both cases the reporter spells out the speech act that the original speaker is supposed to have performed (warned, called), and then goes on to provide details of the content of the utterance in the form of lengthy and complex noun phrases. Clearly, such instances are not fully accounted for by the original definition of the NRSA category, which aimed to capture those cases where little more than the speech act is provided.

They are therefore tagged as NRSAP, or 'narrator's representation of speech act with topic', as in the following example:

<sptag cat=NRSAP who=M next=NRSA whonext=B s=0.67 w=18>
However, when he invited Beatrice Hastings to come and model 
for him nude early on in their affair, 
<sptag cat=NRSA who=B next=N s=0.07 w=2>
Modigliani objected 
<sptag cat=N next=NRS whonext=M s=0.26 w=10>
and she failed to keep the appointment. This happened twice.
[June Rose, Modigliani]

3.2.3 NV

Both the fictional and newspaper data contained instances of minimal speech presentation, which could not easily be accounted for by Leech and Short's categories. Consider the emboldened parts in the examples below:
"Don't you love Barrie's plays?" she asked. "I'm so fond of them". She talked on. Rampion made no
comment. 
(Aldous Huxley, Point Counter Point)

We spoke to vice madam Michaela Hamilton from Bullwell, Notts, who arranged girls for a Hudson orgy
at the Sanam curry house in Stoke. 
(The News of the World, "Hudson Fixed Sex Orgies as his Charity Fund Collapsed")
In both cases we are informed that someone engaged in verbal activity, but we are not given any explicit indication even as to what speech acts were performed, let alone what the form and content of the utterances were. In other words, we are faced with a form of speech presentation that is even more minimal, both formally and functionally, than that captured by the NRSA category, where the narrator specifies the illocutionary force of the utterance, and, possibly, its topic. We classified instances like these as Narrator's Report of Voice, and tagged them with the acronym NV.

3.3 The tagging process

All texts were tagged manually by the present author using a version of the Emacs text editor under the Unix operating system. All tagged texts were then checked by both Mick Short and Elena Semino; any problems were discussed in detail and the necessary changes made. Additionally, others have been involved in tagging and checking areas of the corpus, and numerous further checks have also been applied in order to ensure global consistency, to enforce evolving guidelines and to identify and correct typographical and other errors. Checking and refinement of the tagging is still in progress.

3.4 The tagging formalism

Formally an sptag can be defined as follows:
<sptag cat=[tag](-[tag]) (who=[A-Z]) next=[tag](-[tag]) (whonext=[A-Z]) 
	s=n(+n(+n)) w=n>

where elements in round brackets are optional; tag is an ST&WP tag from the tagset (e.g. N, NRS, FDS, etc); and n is a number. For example:

<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10>
'A criminal offence under the Defence of the Realm Act,'
<sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3>
I told her.
Here the tags tell us that there is a sequence of direct speech spoken by speaker B which is 10 words long (comprising 77% of the sentence), followed by a reporting clause for speech by speaker B which is 3 words long (comprising 23% of the sentence).
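
The formalism above lends itself to simple machine processing. The following Python sketch shows one way an opening sptag line might be read into a dictionary. It is an illustration only. The function name parse_sptag and the shape of the returned dictionary are hypothetical, and keeping the '+'-separated parts of the s attribute as a list of separate numbers is an assumption about how that value is meant to be read.

```python
import re

# attribute=value pairs; values run up to the next space or the closing '>'.
_ATTR = re.compile(r"(\w+)=([^\s>]+)")


def parse_sptag(line):
    """Parse one opening <sptag ...> line into a dictionary."""
    line = line.strip()
    if not (line.startswith("<sptag ") and line.endswith(">")):
        raise ValueError(f"not an opening sptag: {line!r}")
    attrs = dict(_ATTR.findall(line))
    return {
        # Portmanteau tags such as N-NI become a list of alternatives.
        "cat": attrs["cat"].split("-"),
        "who": attrs.get("who"),
        # Treated as optional here so that tags lacking it still parse.
        "next": attrs["next"].split("-") if "next" in attrs else [],
        "whonext": attrs.get("whonext"),
        # Assumption: each '+'-separated part is a separate sentence portion.
        "s": [float(part) for part in attrs["s"].split("+")],
        "w": int(attrs["w"]),
    }


first = parse_sptag("<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10>")
second = parse_sptag("<sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3>")

# The s values in this example agree with the word counts: 10 of 13 words.
total = first["w"] + second["w"]
share = round(first["w"] / total, 2)
```

Because every tag sits on a line of its own, a whole file can be processed by applying such a function to each line that begins with '<sptag '.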

3.5 Who's who

Where possible, speakers are explicitly named persons. However, it is sometimes necessary to attribute ST&WP to somewhat vaguer entities, such as groups of people or institutions, and sometimes the speaker is unknown. Occasionally it has been necessary to indicate the medium rather than the speaker, where this is the only information given, for example in The Prince of Wales by Jonathan Dimbleby:
F the avalanche bulletin
P the Sun
in the examples:
		<sptag cat=eNRSAPQ-eNRWAPQ level=2 who=F s=0.16 w=10>
		the avalanche bulletin warned of 'a considerable local
		avalanche danger'
		</sptag level=1>

<sptag cat=NRW who=P next=DW whonext=P s=0.88 w=7>
On 18 March, the <hi r=it>Sun</hi> headline read 
<sptag cat=DW who=P next=NW whonext=X s=0.13+1 w=8>
'ACCUSED. Official: Charles DID cause the killer avalanche'. 

First person narrators are always coded as B, and unknown speakers always as X. The main protagonists of third person narratives have also been coded as B. It may be preferable to change this so that all and only first person narrators are B.


4. Conference papers and publications

  • Plenary paper at Poetics & Linguistics Association conference, Granada, Spain, September 1995, delivered by M. Wynne, E. Semino, J. Culpeper and M. Short.
  • Aston Corpus Seminar, Birmingham, April 1996, by Semino.
  • ICAME, Chester, May 199.
  • PALA, Nottingham, July 1997.
  • IALS, Freiburg, Germany, September 1997.
  • ESSE, Debrecen, Hungary, September 1997.
  • Short, M, Semino, E and Culpeper, J, 'Using a corpus for stylistics research: speech and thought presentation' in Thomas, J and Short, M (eds), Using Corpora for Language Research, Longman, 1996.
  • Leech, G, McEnery, A and Wynne, M, 'Further levels of annotation' in Garside, R, Leech, G & McEnery, A (eds), Corpus Annotation, Longman 1997.
  • Semino, E, Short, M and Culpeper, J, 'Using a corpus to test a model of speech and thought presentation', Poetics, forthcoming.
  • Short, M, Semino, E and Wynne, M, 'A (free direct) reply to Paul Simpson's discourse', Journal of Literary Semantics, forthcoming.
  • Wynne, M, Short, M and Semino, E, 'A corpus-based investigation of speech, thought and writing presentation', submitted to the ICAME journal.

Mail queries to:

[email protected]