Dear colleagues,

After noticing results in Google Books from Dēvanāgarī editions of Sanskrit texts, I suspected that Google had some OCR technology for Dēvanāgarī. It turns out that they have had it for some time, but only relatively recently has an API been available to use it.

Here is a comparison of Google OCR with Oliver Hellwig's SanskritOCR:
https://sanskrit-coders.github.io/site/posts/ocr-comparison.html

I think the conclusion is that SanskritOCR is slightly better in terms of quality, but Google OCR is faster and easier to use programmatically.

I have tried out Google OCR and managed to get relatively good results:
http://www.prakrit.info/blog/google-ocr-for-sanskrit/

I think this could be quite promising. If we manage to write a couple of tools for processing PDFs into txt files (and thence onto TEI if desired), then we could go from scanned document to machine-readable text in minutes. Of course things would need to be checked, and Google OCR seems to have particularly nasty problems with notes and smaller size text.

My setup is a few local Python scripts that interact with the resources in a Google Cloud Storage "bucket" (where they need to be in order for Google OCR to work). Maybe others could come up with a more streamlined setup. Or maybe we could even create an app that allows users to upload PDF documents and get text data in return.

Thoughts?

Andrew
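To make the PDF-to-txt idea a little more concrete, here is a minimal sketch of the two ends of such a pipeline. It is not the scripts mentioned above. It assumes the Cloud Vision REST format for asynchronous PDF annotation as I understand it (a request with `inputConfig`, `features` and `outputConfig`, and JSON output files whose `responses` each carry a `fullTextAnnotation.text`). The function names, the bucket paths and the `sa` language hint are hypothetical, and uploading, sending the request and downloading the results are deliberately left out.

```python
import json


def build_pdf_ocr_request(gcs_pdf_uri, gcs_output_prefix,
                          language_hints=("sa",), batch_size=20):
    """Build the JSON body for an asynchronous PDF OCR request.

    Assumption: field names follow the Cloud Vision REST format for
    files:asyncBatchAnnotate; check them against the current docs.
    """
    return {
        "requests": [{
            # The PDF must already sit in a Cloud Storage bucket.
            "inputConfig": {
                "gcsSource": {"uri": gcs_pdf_uri},
                "mimeType": "application/pdf",
            },
            "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            "imageContext": {"languageHints": list(language_hints)},
            # Results are written back to the bucket as JSON files,
            # each covering up to batch_size pages.
            "outputConfig": {
                "gcsDestination": {"uri": gcs_output_prefix},
                "batchSize": batch_size,
            },
        }]
    }


def pages_to_text(output_documents, page_separator="\n\n"):
    """Join the page texts of already-parsed output JSON documents.

    output_documents is a list of dicts, given in page order.
    Pages without recognised text contribute an empty string.
    """
    pages = []
    for document in output_documents:
        for response in document.get("responses", []):
            annotation = response.get("fullTextAnnotation", {})
            pages.append(annotation.get("text", ""))
    return page_separator.join(pages)


if __name__ == "__main__":
    # Hypothetical bucket and file names, for illustration only.
    request = build_pdf_ocr_request("gs://my-bucket/edition.pdf",
                                    "gs://my-bucket/ocr/edition-")
    print(json.dumps(request, indent=2, ensure_ascii=False))
```

Once the output JSON files have been downloaded and parsed with `json.load`, passing them in page order to `pages_to_text` gives a single string that can be written to a txt file and then proofread.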
On Sat, Jun 16 2018, Andrew Ollett wrote:
Dear colleagues,
After noticing results in Google Books from Dēvanāgarī editions of Sanskrit texts, I suspected that Google had some OCR technology for Dēvanāgarī. It turns out that they have had it for some time, but only relatively recently has an API been available to use it.
Here is a comparison of Google OCR with Oliver Hellwig's SanskritOCR:
https://sanskrit-coders.github.io/site/posts/ocr-comparison.html
I wonder if Google’s API uses tesseract?

https://en.wikipedia.org/wiki/Tesseract_(software)
https://github.com/tesseract-ocr/tesseract

It supports these languages: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#l... , Sanskrit among them. If that’s what the API uses, you can also install it locally (and apparently even train it for specific typefaces or ‘fonts’).

I was not able to find the page that was used for the comparison you link to, but it would be interesting to see how a local instance of tesseract performs.

Best,

--
Patrick McAllister
Email: patrick.mcallister@oeaw.ac.at
Phone: + 43 1 51581 6423
Institute for the Cultural and Intellectual History of Asia (IKGA)
Austrian Academy of Sciences
Hollandstraße 11-13, 2nd floor
1020 Vienna, Austria
http://www.ikga.oeaw.ac.at/
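For anyone who wants to try a local comparison, a minimal sketch of calling tesseract from Python follows. It assumes the `tesseract` command-line program is installed together with its Sanskrit language data (language code `san`), and it uses the standard `tesseract IMAGE OUTPUTBASE -l LANG` invocation with `stdout` as the output base. The function names and the image file name are hypothetical.

```python
import shutil
import subprocess


def tesseract_command(image_path, language="san"):
    """Return the command line for OCR of one page image.

    Using "stdout" as the output base makes tesseract print the
    recognised text instead of writing a .txt file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_page(image_path, language="san"):
    """Run a local tesseract on one image and return the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, language),
        capture_output=True,  # collect stdout and stderr
        check=True,           # raise if tesseract exits with an error
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    # Hypothetical file name: a single scanned page as an image.
    print(ocr_page("page-001.png"))
```

Running the same page image through this and through Google OCR would give a like-for-like comparison. Note that tesseract reads images rather than PDFs, so a scanned PDF would first have to be split into page images.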