Re: [Indic-texts] Google OCR for Sanskrit

17 Jun 2018

On Sat, Jun 16 2018, Andrew Ollett wrote:
...
Dear colleagues,
After noticing results in Google Books from Dēvanāgarī editions of Sanskrit
texts, I suspected that Google had some OCR technology for Dēvanāgarī. It
turns out that they have had it for some time, but only relatively recently
has an API been available to use it.
Here is a comparison of Google OCR with Oliver Hellwig's SanskritOCR:
https://sanskrit-coders.github.io/site/posts/ocr-comparison.html

I wonder whether Google’s API uses Tesseract?

https://en.wikipedia.org/wiki/Tesseract_(software)
https://github.com/tesseract-ocr/tesseract

It supports these languages, Sanskrit among them:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#l...

If that’s what the API uses, you can also install it locally (and
apparently even train it for specific typefaces or ‘fonts’).  I was not
able to find the page that was used for the comparison you link to, but
it would be interesting to see how a local instance of Tesseract
performs.
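
A minimal sketch of what such a local test might look like, assuming
Tesseract is installed with its Sanskrit language data (language code
"san") and is available on the PATH. The function names, the file name
page.png, and the choice of character error rate as a measure are my own
assumptions for illustration; nothing here says anything about what
Google's API does internally.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="san"):
    """Return the argument list for a Tesseract CLI call that prints text to stdout."""
    # "san" is Tesseract's language code for Sanskrit; the matching
    # traineddata file has to be installed for the call to succeed.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="san"):
    """Run a locally installed Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def character_error_rate(reference, hypothesis):
    """Levenshtein distance between the two strings, divided by the reference length.

    Works on Unicode code points, so a Devanagari consonant and its
    vowel sign count as two separate characters.
    """
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1] / max(len(reference), 1)
```

With a scanned page and a hand-corrected transcription of it, one could
then call run_ocr("page.png") and pass the result, together with the
transcription, to character_error_rate to get a rough figure to set
against the other OCR engines.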

Best,

--
Patrick McAllister

Email: patrick.mcallister@oeaw.ac.at
Phone: + 43 1 51581 6423

Institute for the Cultural and Intellectual History of Asia (IKGA)
Austrian Academy of Sciences
Hollandstraße 11-13, 2nd floor
1020 Vienna, Austria

http://www.ikga.oeaw.ac.at/