Journal article
Geographic adaptation of pretrained language models
- Abstract:
- While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: The geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Files:
- Version of record (pdf, 758.9KB)
- Publisher copy:
- 10.1162/tacl_a_00652
Authors
- Publisher:
- Massachusetts Institute of Technology Press
- Journal:
- Transactions of the Association for Computational Linguistics
- Volume:
- 12
- Pages:
- 411–431
- Publication date:
- 2024-04-16
- Acceptance date:
- 2024-01-22
- DOI:
- 10.1162/tacl_a_00652
- EISSN:
- 2307-387X
- ISSN:
- 2307-387X
- Language:
- English
- Pubs id:
- 1616095
- Local pid:
- pubs:1616095
- Deposit date:
- 2024-02-11
Terms of use
- Copyright holder:
- Association for Computational Linguistics
- Copyright date:
- 2024
- Rights statement:
- © 2024 Association for Computational Linguistics. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
- Licence:
- CC Attribution (CC BY)