Re: [Indic-texts] Searching transliterated Indic texts in TEI

13 Feb 2020

Dear Dan,

I'm also very interested to see if there are suggestions in this direction.
It seems to me that there are two options:

   1. Creating a plain-text corpus from your TEI corpus (by means of XSL
   transformations), and searching this corpus the way that you would search a
   directory of texts on a local computer (e.g., grep). This is not a very
   "high-tech" solution, but I think this is how most of us search the GRETIL
   archive, and it's very straightforward. To get "fuzzy" results you would
   probably have to write custom shell scripts that replace a given search
   term (e.g., "dharma") with a regular expression that matches its variant
   spellings (e.g., "dha(r)*((ṁm)|(m)+)a" or whatever).
   2. A web application with a search interface. Good luck with this. It
   would probably involve Apache Lucene. The source code for the SARIT
   <http://sarit.indology.info/> web application, designed for a
   transitional version of eXist-db (between 3 and 4), is available on GitHub
   <https://github.com/sarit/sarit-existdb>.
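To make the first option a bit more concrete, here is a rough sketch in Python rather than XSLT plus shell scripts (that swap is mine, purely to keep everything in one file). It flattens a TEI document to plain text, keeping one reading from each <choice>, and then searches the result either strictly or leniently. The function names, the choice of which <choice> children to drop, and above all the leniency rules (doubled consonants, anusvāra before m) are assumptions for illustration only, not DHARMA conventions; the real rules would have to come from your own notes on epigraphic spelling.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Assumption: inside <choice>, drop these children and keep their alternatives
# (<corr>, <reg>, <expan>). Swap the set around to search the original readings.
DROPPED_IN_CHOICE = {"sic", "orig", "abbr"}

# Assumption: consonants that a lenient search allows to be doubled.
DOUBLABLE = set("kgṅcjñṭḍṇtdnpbyrlvśṣs")


def _local(tag):
    """Strip the TEI namespace from an element tag."""
    return tag[len(TEI_NS):] if tag.startswith(TEI_NS) else tag


def _collect(elem, parts):
    """Gather text in document order, keeping one reading per <choice>."""
    in_choice = _local(elem.tag) == "choice"
    if elem.text and not in_choice:
        parts.append(elem.text)
    for child in elem:
        if not (in_choice and _local(child.tag) in DROPPED_IN_CHOICE):
            _collect(child, parts)
        # Whitespace between the children of <choice> is layout, not text.
        if child.tail and not in_choice:
            parts.append(child.tail)


def tei_to_text(xml_string):
    """Return the text of <text> (or of the whole file) as one NFC string."""
    root = ET.fromstring(xml_string)
    start = root.find(f".//{TEI_NS}text")
    if start is None:
        start = root
    parts = []
    _collect(start, parts)
    flat = " ".join("".join(parts).split())
    return unicodedata.normalize("NFC", flat)


def lenient_pattern(term):
    """Build a regex tolerating a few (hypothetical) spelling variations."""
    pieces = []
    for ch in unicodedata.normalize("NFC", term):
        if ch == "m":
            # m, mm, or anusvāra + m (dharma, dharmma, dharṁma)
            pieces.append("(?:[ṁṃ]?m+)")
        elif ch in "ṁṃ":
            pieces.append("[ṁṃ]")
        elif ch in DOUBLABLE:
            pieces.append(re.escape(ch) + "+")
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces))


def search(text, term, lenient=False):
    """Return every matched string; strict mode matches the exact term only."""
    term = unicodedata.normalize("NFC", term)
    pattern = lenient_pattern(term) if lenient else re.compile(re.escape(term))
    return [m.group(0) for m in pattern.finditer(text)]
```

Both modes are case-sensitive, since case is significant in your transliteration scheme. Run over a directory of files, the strict mode is more or less what grep already does; the lenient mode is where the real work lies, because each rule added to reduce misses will also let in some false positives.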

Andrew

On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh <danbalogh@gmail.com> wrote:
...
Dear All,
in the near future, we in the DHARMA project will be looking into making
our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE
Asian languages will be marked up in TEI (EpiDoc) and will be Romanised,
mostly according to ISO-15919 but with some quirks on top of that,
including a few extra characters and case sensitivity. We will not be
lemmatising the corpus at short notice, nor is it likely that we'll add <w>
tags, though we may do so for part of the corpus later on. We're interested
in extracting transliterated text from TEI XML (sometimes including
alternative strings in <choice>) and searching it as fruitfully as
possible. Ideally, we should have two search methods, a lenient one to
gather fuzzy results and tolerate e.g. variations in epigraphic spelling
without returning too many false positives (at the moment we only have some
rudimentary notes for the specifics of this), and a strict one to return
the exact string searched for.
I myself do not have the level of technical preparedness even to
understand our options, and will be passing any suggestions on to people
with the necessary expertise. But to get started, I would welcome some
basic suggestions and pointers: any already working open source specialised
code we should check out? Any general search solutions that may be adapted
to our purposes?
Many thanks and apologies for the vague question,
Dan
_______________________________________________
Indic-texts mailing list
Indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts

Andrew Ollett