Journal article
Towards a historical treebank of Middle and Early Modern Welsh part I: Workflow and POS tagging
- Abstract:
- This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to give researchers a tool for automatic, exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts, comparable to similar tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. The first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, with particular focus on the major issues involved in defining word boundaries and in defining a robust and useful tagset.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
- Publisher copy:
- 10.16922/jcl.22.6
- Publisher:
- University of Wales Press
- Journal:
- Journal of Celtic Linguistics
- Volume:
- 22
- Issue:
- 1
- Pages:
- 125-154
- Publication date:
- 2021-01-01
- Acceptance date:
- 2020-06-26
- DOI:
- 10.16922/jcl.22.6
- EISSN:
- 2058-5063
- ISSN:
- 0962-1377
- Language:
- English
- Pubs id:
- 1131395
- Local pid:
- pubs:1131395
- Deposit date:
- 2020-09-10
Terms of use
- Copyright date:
- 2021
- Notes:
- This is the accepted manuscript version of the article. The final version is available online from University of Wales Press at https://doi.org/10.16922/jcl.22.6