Language model tokenizers introduce unfairness between languages

Petrov, A; Malfa, EL; Torr, P; Bibi, A

Conference item

Abstract:: Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. This induces unfair treatment for some language communities in regard to the cost of accessing commercial language services, the processing time and latency, as well as the amount of content that can be provided as context to the models. Therefore, we make the case that we should train future language models using multilingually fair subword tokenizers.
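The character-level and byte-level comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes "encoding length" means the number of Unicode code points (character level) or UTF-8 bytes (byte level). The function names and sample phrases are hypothetical illustrations and are not taken from the paper or its data. Subword tokenizers are not covered here, since they would need an external tokenizer library.

```python
# Illustrative sketch only: the sample phrases are hypothetical examples,
# not data or results from the paper.

def encoding_lengths(text: str) -> dict[str, int]:
    """Return the length of `text` under two tokenizer-free encodings."""
    return {
        "chars": len(text),                  # character level: Unicode code points
        "bytes": len(text.encode("utf-8")),  # byte level: UTF-8 bytes
    }


def length_ratios(parallel: dict[str, str], reference: str = "en") -> dict[str, dict[str, float]]:
    """Ratio of each language's encoding length to the reference language's."""
    ref = encoding_lengths(parallel[reference])
    return {
        lang: {unit: n / ref[unit] for unit, n in encoding_lengths(text).items()}
        for lang, text in parallel.items()
    }


# The same short greeting in three languages (assumed parallel phrases).
samples = {
    "en": "Good morning",
    "ru": "Доброе утро",
    "zh": "早上好",
}

ratios = length_ratios(samples)
for lang, r in ratios.items():
    print(lang, r)
```

In this toy example the Russian phrase has fewer characters than the English one but needs more UTF-8 bytes, which shows that the measured disparity depends on the unit of encoding as well as on the language.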

Publication status:: Accepted

Peer review status:: Peer reviewed

Cite this record

APA Style

Petrov, A., Malfa, E. L., Torr, P., & Bibi, A. (2024). Language model tokenizers introduce unfairness between languages. 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 31, 36963–36990.

MLA Style

Petrov, A, et al. “Language Model Tokenizers Introduce Unfairness between Languages.” 37th Conference on Neural Information Processing Systems (NeurIPS 2023), vol. 31, 2024, pp. 36963–90.

Chicago Style

Petrov, A, EL Malfa, P Torr, and A Bibi. 2024. “Language Model Tokenizers Introduce Unfairness between Languages.” In 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 31:36963–90. Curran Associates.

Access Document

Files:: Petrov_et_al_2023_Language_model_tokenizers.pdf

(Accepted manuscript, pdf, 476.0KB)

Publication website:: https://papers.nips.cc/paper_files/paper/2023/hash/74bb24dca8334adce292883b4b651eda-Abstract-Conference.html

Authors

Petrov, A

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Malfa, EL

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Torr, P

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Bibi, A

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Funding

Engineering and Physical Sciences Research Council

Grant:: EP/W002981/1

Publisher:: Curran Associates
Host title:: Advances in Neural Information Processing Systems 36
Volume:: 31
Pages:: 36963-36990
Publication date:: 2024-07-01
Acceptance date:: 2023-09-21
Event title:: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
Event location:: New Orleans, LA, USA
Event website:: https://nips.cc/Conferences/2023
Event start date:: 2023-12-10
Event end date:: 2023-12-16
ISBN:: 9781713899921

Language:: English
Keywords:: FFR
Pubs id:: 1805054
Local pid:: pubs:1805054
Deposit date:: 2024-03-15
ARK identifier:: ark:/29072/ora_e15809fafebd4b98aa77966ec4838851

Terms of use

Copyright holder:: Petrov et al.
Notes:: This paper was presented at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 10th-16th December 2023, New Orleans, LA, USA, and published in Advances in Neural Information Processing Systems 36. This is the accepted manuscript version of the article. The final version is available online from Curran Associates at: https://papers.nips.cc/paper_files/paper/2023/hash/74bb24dca8334adce292883b4b651eda-Abstract-Conference.html

Licence:: Terms and Conditions of Use for Oxford University Research Archive
