Google OCR for Sanskrit

16 Jun 2018

Dear colleagues,

After noticing results in Google Books from Dēvanāgarī editions of Sanskrit
texts, I suspected that Google had some OCR technology for Dēvanāgarī. It
turns out that they have had it for some time, but only relatively recently
has an API been available to use it. Here is a comparison of Google OCR
with Oliver Hellwig's SanskritOCR:

https://sanskrit-coders.github.io/site/posts/ocr-comparison.html

I think the conclusion is that SanskritOCR is slightly better in terms of
quality, but Google OCR is faster and easier to use programmatically. I
have tried out Google OCR and managed to get relatively good results:

http://www.prakrit.info/blog/google-ocr-for-sanskrit/

I think this could be quite promising. If we manage to write a couple of
tools for processing PDFs into txt files (and thence into TEI if desired),
then we could go from scanned document to machine-readable text in minutes.
Of course the output would need to be checked, and Google OCR seems to have
particularly nasty problems with notes and other text set in a smaller size.
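
As a rough illustration of the "PDF to txt" step, here is a minimal sketch of
how per-page OCR results might be joined into one text. It assumes the results
arrive as JSON documents with a "responses" list whose entries carry
"fullTextAnnotation.text" and "context.pageNumber"; that layout is my
assumption about the OCR output files, and the function name pages_to_text is
hypothetical.

```python
import json


def pages_to_text(json_documents, page_separator="\n\n"):
    """Join per-page OCR text from JSON result documents into one string.

    Assumed layout (not confirmed by this post): each document has a
    "responses" list; each response may carry "fullTextAnnotation.text"
    and "context.pageNumber".
    """
    pages = []
    for document in json_documents:
        for response in json.loads(document).get("responses", []):
            # A page with no recognized text has no fullTextAnnotation.
            text = response.get("fullTextAnnotation", {}).get("text", "")
            page_number = response.get("context", {}).get("pageNumber", 0)
            pages.append((page_number, text.strip()))
    # Result files may be read in any order, so sort by page number.
    pages.sort(key=lambda page: page[0])
    return page_separator.join(text for _, text in pages if text)
```

The result could be written straight to a .txt file; pages without any
recognized text are simply skipped.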

My setup is a few local Python scripts that interact with the resources in
a Google Cloud Storage "bucket" (where they need to be in order for Google
OCR to work). Maybe others could come up with a more streamlined setup. Or
maybe we could even create an app that allows users to upload PDF documents
and get text data in return. Thoughts?
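
To make the bucket-based setup a little more concrete, here is a sketch of the
kind of request a script might prepare for a PDF that already sits in a
bucket. This is not my actual script: the field names follow my understanding
of the Cloud Vision REST method files:asyncBatchAnnotate and should be checked
against the current documentation, and the function name and the bucket and
file names in the usage line are hypothetical.

```python
import json


def build_ocr_request(bucket, pdf_name, output_prefix, batch_size=20):
    """Build a request body asking for OCR of one PDF stored in a bucket.

    Field names are assumed to match the Cloud Vision REST API; verify
    them before use. Nothing is sent over the network here.
    """
    return {
        "requests": [
            {
                # The PDF must already have been uploaded to the bucket.
                "inputConfig": {
                    "gcsSource": {"uri": f"gs://{bucket}/{pdf_name}"},
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # JSON results are written back under this prefix,
                # batch_size pages per output file.
                "outputConfig": {
                    "gcsDestination": {"uri": f"gs://{bucket}/{output_prefix}/"},
                    "batchSize": batch_size,
                },
            }
        ]
    }


# Hypothetical names, for illustration only.
print(json.dumps(build_ocr_request("my-ocr-bucket", "edition.pdf", "ocr-output"), indent=2))
```

An upload app would presumably wrap the same three steps: put the PDF in the
bucket, submit a request like this, then collect the JSON results.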

Andrew

Andrew Ollett
