Journal article
OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment
- Abstract:
- Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Publisher copy:
- 10.1007/s42001-021-00149-1
Authors
- Publisher:
- Springer
- Journal:
- Journal of Computational Social Science
- Volume:
- 5
- Issue:
- 1
- Pages:
- 861-882
- Publication date:
- 2021-11-22
- Acceptance date:
- 2021-10-06
- DOI:
- 10.1007/s42001-021-00149-1
- EISSN:
- 2432-2725
- ISSN:
- 2432-2717
- Language:
- English
- Pubs id:
- 1577541
- Local pid:
- pubs:1577541
- Deposit date:
- 2024-09-07
Terms of use
- Copyright holder:
- Thomas Hegghammer
- Copyright date:
- 2022
- Rights statement:
- Copyright © 2021, The Author(s). This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
- Licence:
- CC Attribution (CC BY)