On Sat, Jun 16 2018, Andrew Ollett wrote:
Dear colleagues,
After noticing results in Google Books from Dēvanāgarī editions of Sanskrit texts, I suspected that Google had some OCR technology for Dēvanāgarī. It turns out that they have had it for some time, but only relatively recently has an API been available to use it.
Here is a comparison of Google OCR with Oliver Hellwig's SanskritOCR:
https://sanskrit-coders.github.io/site/posts/ocr-comparison.html
I wonder if Google’s API uses tesseract?
https://en.wikipedia.org/wiki/Tesseract_(software)
https://github.com/tesseract-ocr/tesseract

It supports these languages, Sanskrit among them:
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#l...

If that’s what the API uses, you can also install it locally (and apparently even train it for specific typefaces or ‘fonts’). I was not able to find the page that was used for the comparison you link to, but it would be interesting to see how a local instance of tesseract performs.

Best,
-- 
Patrick McAllister
Email: patrick.mcallister@oeaw.ac.at
Phone: + 43 1 51581 6423
Institute for the Cultural and Intellectual History of Asia (IKGA)
Austrian Academy of Sciences
Hollandstraße 11-13, 2nd floor
1020 Vienna, Austria
http://www.ikga.oeaw.ac.at/
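
For anyone who wants to try a local tesseract instance on a scanned page, here is a minimal sketch in Python. It only shells out to the tesseract command-line program, so it assumes that tesseract is installed and on the PATH, and that the Sanskrit traineddata (language code "san") is available. The function names and the file name page.png are hypothetical, chosen for illustration. Whether this is comparable to what Google's API does is an open question.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="san"):
    """Return the argument list for one tesseract run.

    CLI form: tesseract <image> <outputbase> -l <lang>.
    Using "stdout" as the output base makes tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_page(image_path, lang="san"):
    """Run a locally installed tesseract on one image and return the text.

    Assumes the tesseract executable and the traineddata for `lang`
    are installed; raises an error otherwise.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,   # collect stdout/stderr instead of printing
        encoding="utf-8",      # Devanagari output is UTF-8 text
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling ocr_page("page.png") should return the recognized text as a string, which could then be compared line by line with the output of the other two tools.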