Journal article

OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Abstract:

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
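
As a rough illustration of the kind of measurement the abstract describes, the sketch below runs Tesseract on a single scanned page and scores the output against a ground-truth transcription with a character error rate. It is not the authors' pipeline: the pytesseract and Pillow libraries, the file names, and the Levenshtein-based error metric are assumptions chosen for the example.

# Minimal sketch: OCR one page with Tesseract and compute a character error rate (CER).
# Assumptions (not from the article): pytesseract + Pillow, hypothetical file names,
# and a plain Levenshtein edit distance as the error metric.
from PIL import Image
import pytesseract

def levenshtein(a: str, b: str) -> int:
    # Edit distance between two strings (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(hypothesis: str, reference: str) -> float:
    # CER = edit distance divided by the length of the reference transcription.
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

if __name__ == "__main__":
    ocr_text = pytesseract.image_to_string(Image.open("noisy_page.png"), lang="eng")
    with open("ground_truth.txt", encoding="utf-8") as f:
        reference = f.read()
    print(f"CER: {character_error_rate(ocr_text, reference):.3f}")

The server-based processors compared in the article (Amazon Textract and Google Document AI) would be called through their respective cloud SDKs rather than pytesseract; the scoring step would be the same.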

Publication status:
Published
Peer review status:
Peer reviewed

Access Document

Publisher copy:
10.1007/s42001-021-00149-1

Authors


Institution:
University of Oxford
Division:
SSD
Department:
Politics & Int Relations
Oxford college:
All Souls College
Role:
Author
ORCID:
0000-0001-6253-1518


Publisher:
Springer
Journal:
Journal of Computational Social Science
Volume:
5
Issue:
1
Pages:
861-882
Publication date:
2021-11-22
Acceptance date:
2021-10-06
DOI:
10.1007/s42001-021-00149-1
EISSN:
2432-2725
ISSN:
2432-2717


Language:
English
Keywords:
Pubs id:
1577541
Local pid:
pubs:1577541
Deposit date:
2024-09-07
