Originally a pilot corpus of some 40,000 words of fiction texts was compiled and annotated in 1994. A parallel pilot sample of 40,000 words of newspaper texts was then added in 1994 and 1995. This work was done with funds provided by the Faculty of Social Sciences at Lancaster University.
Following the award of a major British Academy research project grant, this 80,000-word pilot corpus was expanded in 1996 and 1997 to a nominally 240,000-word corpus. The fiction and newspaper sections were doubled in size, and a new section of biography and autobiography texts was added.
Analysis of the corpus is ongoing.
There are approximately 250,000 words of text in the corpus. It is made up of 120 sections of about 2,000 words each. The final count is somewhat in excess of 240,000 because the texts were sampled so as to begin and end at fairly 'natural' breaks, allowing a reader of the corpus text to see enough of the relevant context to understand the narrative. The break chosen was usually one after rather than before the 2,000-word mark, but as close to it as possible (for more on sampling strategies see 2.3 below).
The primary classification of the corpus is into three sections relating to narrative genres. These genres are: (i) fiction, (ii) newspaper news reports and (iii) biography and autobiography. There are a minimum of 80,000 words in each section.
Within each genre there is a division between 'serious' and 'popular' texts. While such a division is inevitably difficult to some extent, this classification was made on the basis of what would commonly be held to be the case by the average educated reader. This will enable the testing of such preconceptions by the analysis of the actual texts.
In the fiction and biography there is also a binary division (cutting across the popular/serious division) between first and third person narratives. In the biography this creates a division between biography and autobiography.
Here is a tree diagram representing the structure of the corpus at these levels:
In the fiction section a further subdivision was made within each text
type between texts with first and third person narrators.
This is paralleled in the biography/autobiography section, where
the biography texts are all third person narratives and the autobiography
texts are all first person narratives.
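As a rough illustration of the classification just described, the sketch below enumerates the cells produced by crossing genre, the serious/popular division and, where it applies, narrative person. All names here (`GENRES`, `REGISTERS`, `corpus_cells`, `render_tree`) are hypothetical and not part of the corpus. The sketch assumes that autobiography is the first person branch and biography the third person branch, and that the newspaper section has no first/third person division.

```python
# Hypothetical sketch of the corpus classification; not part of the corpus itself.
# None marks a genre with no first/third person subdivision (assumed for newspapers).
GENRES = {
    "fiction": ["first person", "third person"],
    "newspaper news reports": [None],
    "biography and autobiography": [
        "first person (autobiography)",
        "third person (biography)",
    ],
}
REGISTERS = ["serious", "popular"]


def corpus_cells():
    """Return every (genre, register, narration) combination."""
    cells = []
    for genre, narrations in GENRES.items():
        for register in REGISTERS:
            for narration in narrations:
                cells.append((genre, register, narration))
    return cells


def render_tree():
    """Render the classification as an indented plain-text tree."""
    lines = ["corpus"]
    for genre, narrations in GENRES.items():
        lines.append("  " + genre)
        for register in REGISTERS:
            lines.append("    " + register)
            for narration in narrations:
                if narration is not None:
                    lines.append("      " + narration)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_tree())
```

The sketch says nothing about how many samples fall into each cell; it only lists the categories named in the text.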
The newspapers used for the sample were:
Broadsheets:
| N | Narrative |
| NRS | Narrative Report of Speech |
| NRW | Narrative Report of Writing |
| NRT | Narrative Report of Thought |
| NI | Narrative Report of Internal State |
| NV | Narrative Report of Voice |
| NRSA | Narrative Report of Speech Act |
| NRWA | Narrative Report of Writing Act |
| NRTA | Narrative Report of Thought Act |
| NRSAP | Narrative Report of Speech Act with Topic |
| NRWAP | Narrative Report of Writing Act with Topic |
| NRTAP | Narrative Report of Thought Act with Topic |
| IS | Indirect Speech |
| IW | Indirect Writing |
| IT | Indirect Thought |
| FIS | Free Indirect Speech |
| FIW | Free Indirect Writing |
| FIT | Free Indirect Thought |
| DS | Direct Speech |
| DW | Direct Writing |
| DT | Direct Thought |
| FDS | Free Direct Speech |
| FDW | Free Direct Writing |
| FDT | Free Direct Thought |
Affixes:
| e | embedded |
| q | with quote |
| h | hypothetical |
| i | inferred (see section on NIi below) |
| + | speech summary [not used] |
e.g. NRSAPq is "Narrative Report of Speech Act with Topic with an embedded quotation":
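A composite tag of this kind can be unpacked mechanically. The sketch below is one possible way to do so, not a tool distributed with the corpus. The function name `decode_tag` and the dictionaries are hypothetical. It assumes that `e` is written before the base tag (as in eNRSAPQ), that `q`, `h` and `i` follow it (as in NRSAPq and NIi; the position of `h` is a guess), and that suffix letters may appear in either case.

```python
# Hypothetical decoder for tags such as NRSAPq, NIi or eNRSAPQ.
BASE_TAGS = {
    "N": "Narrative",
    "NRS": "Narrative Report of Speech",
    "NRW": "Narrative Report of Writing",
    "NRT": "Narrative Report of Thought",
    "NI": "Narrative Report of Internal State",
    "NV": "Narrative Report of Voice",
    "NRSA": "Narrative Report of Speech Act",
    "NRWA": "Narrative Report of Writing Act",
    "NRTA": "Narrative Report of Thought Act",
    "NRSAP": "Narrative Report of Speech Act with Topic",
    "NRWAP": "Narrative Report of Writing Act with Topic",
    "NRTAP": "Narrative Report of Thought Act with Topic",
    "IS": "Indirect Speech",
    "IW": "Indirect Writing",
    "IT": "Indirect Thought",
    "FIS": "Free Indirect Speech",
    "FIW": "Free Indirect Writing",
    "FIT": "Free Indirect Thought",
    "DS": "Direct Speech",
    "DW": "Direct Writing",
    "DT": "Direct Thought",
    "FDS": "Free Direct Speech",
    "FDW": "Free Direct Writing",
    "FDT": "Free Direct Thought",
}
# Assumption: these affixes are written after the base tag.
SUFFIXES = {"q": "with quote", "h": "hypothetical", "i": "inferred"}


def decode_tag(tag):
    """Split a tag into (base category name, list of affix meanings)."""
    affixes = []
    rest = tag
    # Assumption: a leading lower-case 'e' marks an embedded category.
    if rest.startswith("e"):
        affixes.append("embedded")
        rest = rest[1:]
    # Take the longest base tag that the remainder starts with.
    base = max((b for b in BASE_TAGS if rest.startswith(b)), key=len, default=None)
    if base is None:
        raise ValueError("unknown base tag in %r" % tag)
    for ch in rest[len(base):]:
        key = ch.lower()
        if key not in SUFFIXES:
            raise ValueError("unknown affix %r in %r" % (ch, tag))
        affixes.append(SUFFIXES[key])
    return BASE_TAGS[base], affixes


if __name__ == "__main__":
    for example in ("NRSAPq", "NIi", "eNRSAPQ"):
        print(example, decode_tag(example))
```

Ambiguous tags joined by a hyphen, such as N-NI, would need to be split on the hyphen before each part is passed to this function.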
Notes:
Some problems arise because there are fuzzy areas on the boundary between the presentation of linguistic and non-linguistic acts. There are also mentions of written language where the focus is not on the production or reception of the text, but merely on its existence. Such cases were not annotated as writing presentation.
Jed's heart lifted in his ribs. (Rupert Thomson, The Five Gates to Hell)
For a moment she didn't know where she was. (Graham Greene, Brighton Rock)
In texts which are not fiction with an omniscient narrator, NI (and all categories of thought presentation) are only used where the character in question has access to the thoughts and internal states which are reported. This usually means that only states and thoughts of the reporter are tagged as NI (or thought).
Often passages tagged as N-NI are in fact inferences based on what someone has said, and may even be quite close to the form of the original utterance. For example:
<sptag cat=N-NI who=S next=NRSAP whonext=B s=1 w=16> The Palace was keen that the Prime Minister should continue until a successor had been elected. [baker]
This is formally presented as the report of an internal state ('was keen that'), but the reader will infer that this is a report of something that was probably said by a spokesman for the Palace. However, it is impossible to tell what type of speech report this might be. It could be FIS, if the original utterance was something like 'The Palace is keen that...'; it could be NRS followed by IS if the original utterance was something like 'The Prime Minister should continue...'; or it could be NRSAP if this report is a summary of what was said, possibly on different occasions and by different people, with the words here bearing little relation to the actual words said.
Given the impossibility of classifying the type of speech report, such examples are tagged as NIi, since they are formally presentations of internal states, even though pragmatically they can function as speech presentation.
In all cases where there is no omniscient narrator, then, what is formally presented as the narration of internal states or thought is tagged as ambiguous between N and the relevant category of thought presentation, e.g. N-NI, N-NRT, N-IT.
Mr Major warned yesterday of the dangers of Britain being left behind if a group of European Union members pushed ahead with a single currency. (The Independent on Sunday, "Blair Puts Labour Troops on Alert for Snap Election")
Labour called last night for a streamlined Scandinavian style monarchy to banish Britain's class-ridden society. (The Daily Mirror, "Cut the Royals Down to Size")
In both cases the reporter spells out the speech act that the original speaker is supposed to have performed (warned, called), and then goes on to provide details of the content of the utterance in the form of lengthy and complex noun phrases. Clearly, such instances are not fully accounted for by the original definition of the NRSA category, which aimed to capture those cases where little more than the speech act is provided.
They are therefore tagged as NRSAP, or 'narrator's representation of speech act with topic', as in the following example:
<sptag cat=NRSAP who=M next=NRSA whonext=B s=0.67 w=18> However, when he invited Beatrice Hastings to come and model for him nude early on in their affair, <sptag cat=NRSA who=B next=N s=0.07 w=2> Modigliani objected <sptag cat=N next=NRS whonext=M s=0.26 w=10> and she failed to keep the appointment. This happened twice. [June Rose, Modigliani]
"Don't you love Barrie's plays?" she asked. "I'm so fond of them". She talked on. Rampion made no comment. (Aldous Huxley, Point Counter Point)
We spoke to vice madam Michaela Hamilton from Bullwell, Notts, who arranged girls for a Hudson orgy at the Sanam curry house in Stoke. (The News of the World, "Hudson Fixed Sex Orgies as his Charity Fund Collapsed")
In both cases we are informed that someone engaged in verbal activity, but we are not given any explicit indication even as to what speech acts were performed, let alone what the form and content of the utterances were. In other words, we are faced with a form of speech presentation that is even more minimal, both formally and functionally, than that captured by the NRSA category, where the narrator specifies the illocutionary force of the utterance, and, possibly, its topic. We classified instances like these as Narrator's Report of Voice, and tagged them with the acronym NV.
<sptag cat=[tag](-[tag]) (who=[A-Z]) next=[tag](-[tag]) (whonext=[A-Z]) s=n(+n(+n)) w=n>
where elements in round brackets are optional; tag is an ST&WP tag from the tagset (e.g. N, NRS, FDS, etc); and n is a number. For example:
<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10> 'A criminal offence under the Defence of the Realm Act,' <sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3> I told her.
Here the tags tell us that there is a sequence of direct speech spoken by speaker B which is 10 words long (comprising 77% of the sentence), followed by a reporting clause for speech by speaker B which is 3 words long (comprising 23% of the sentence).
In the examples below, the speaker codes are F (the avalanche bulletin) and P (the Sun):
<sptag cat=eNRSAPQ-eNRWAPQ level=2 who=F s=0.16 w=10> the avalanche bulletin warned of 'a considerable local avalanche danger' </sptag level=1>
<sptag cat=NRW who=P next=DW whonext=P s=0.88 w=7> On 18 March, the <hi r=it>Sun</hi> headline read <sptag cat=DW who=P next=NW whonext=X s=0.13+1 w=8> 'ACCUSED. Official: Charles DID cause the killer avalanche'.
First person narrators are always coded as B, and unknown speakers always as X. The main protagonists of third person narratives have also been coded as B. It may be preferable to change this so that all and only first person narrators are B.
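The tag format described above is regular enough to be read with a short script. The following is a minimal sketch under stated assumptions, not software supplied with the corpus. The names `parse_sptags`, `SPTAG_RE` and `ATTR_RE` are hypothetical. The sketch assumes that every opening tag has the form `<sptag attr=value ...>`, splits hyphenated `cat` and `next` values (such as N-NI) into their parts, and keeps a `+`-joined `s` value as a list of numbers without interpreting it.

```python
import re

# Hypothetical patterns: an opening <sptag ...> and its attr=value pairs.
SPTAG_RE = re.compile(r"<sptag\s+([^>]*)>")
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def parse_sptags(text):
    """Return one dict of attributes per opening <sptag ...> found in text."""
    tags = []
    for match in SPTAG_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        # s may be n, n+n or n+n+n; keep the parts as a list of floats.
        if "s" in attrs:
            attrs["s"] = [float(part) for part in attrs["s"].split("+")]
        # w is a whole number of words.
        if "w" in attrs:
            attrs["w"] = int(attrs["w"])
        # cat and next may be ambiguous between two tags, e.g. N-NI.
        for key in ("cat", "next"):
            if key in attrs:
                attrs[key] = attrs[key].split("-")
        tags.append(attrs)
    return tags


if __name__ == "__main__":
    sample = (
        "<sptag cat=DS who=B next=NRS whonext=B s=0.77 w=10> "
        "'A criminal offence under the Defence of the Realm Act,' "
        "<sptag cat=NRS who=B next=FDS whonext=C s=0.23 w=3> I told her."
    )
    for tag in parse_sptags(sample):
        print(tag)
```

Closing tags such as `</sptag level=1>` are ignored by this pattern, so handling embedded levels would need additional logic.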