Indic-texts

Download

indic-texts@lists.tei-c.org

February 2020

5 participants
1 discussions

Searching transliterated Indic texts in TEI
by Dániel Balogh 15 Feb '20

15 Feb '20

Dear All, in the near future, we in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add <w> tags, though we may do so for part of the corpus later on. We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for. I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: any already working open source specialised code we should check out? Any general search solutions that may be adapted to our purposes? Many thanks and apologies for the vague question, Dan

5 10