Journal article

Towards a historical treebank of Middle and Early Modern Welsh part I: Workflow and POS tagging

Abstract:
This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for the automatic, exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts, comparable to tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. The first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, with particular attention to the major issues involved in defining word boundaries and in defining a robust and useful tagset.
Publication status:
Published
Peer review status:
Peer reviewed

Publisher copy:
10.16922/jcl.22.6

Authors


Institution:
University of Oxford
Division:
HUMS
Department:
Linguistics Philology and Phonetics Faculty
Oxford college:
Jesus College
Role:
Author


Publisher:
University of Wales Press
Journal:
Journal of Celtic Linguistics
Volume:
22
Issue:
1
Pages:
125-154
Publication date:
2021-01-01
Acceptance date:
2020-06-26
DOI:
10.16922/jcl.22.6
EISSN:
2058-5063
ISSN:
0962-1377


Language:
English
Keywords:
Pubs id:
1131395
Local pid:
pubs:1131395
Deposit date:
2020-09-10
