Peter,
Daniel’s message made clear that the DHARMA text corpus is in a variant of ISO-15919 transliteration and does not concern only Sanskrit but a variety of South and Southeast Asian languages. Hence it remains unclear to me how your response is pertinent to his question. Could you explain?
Best wishes,
Arlo
PS Incidentally, I happened to notice that your Library is imprecise in the source attribution for <https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html>. The Titus e-text was not edited by Jost Gippert but by myself. See <http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm>. It is also not clear to me why that source is mentioned at all, because the text in your library does not actually seem to be derived from the Titus version or, if it is, it has suppressed my identification of the Saṁhitā source of the mantrapratīkas.
On 13 Feb 2020, at 18:15, Peter Scharf <scharf@sanskritlibrary.org> wrote:
For reliable search results, you should keep your source text in the Sanskrit Library Phonetic (SLP) encoding and transcode search input and display to and from it. Patrick McCalister analyzed the disparities between searching in IAST and searching in Unicode Devanagari; both give unexpected results and require extra effort to work around. The Sanskrit Library uses the method I recommend for searching its texts.
******************************
Peter M. Scharf, President
The Sanskrit Library
******************************
On 13 Feb 2020, at 9:19 PM, Andrew Ollett <andrew.ollett@gmail.com> wrote:
Dear Dan,
I'm also very interested to see if there are suggestions in this direction. It seems to me that there are two options:
- Creating a plain-text corpus from your TEI corpus (by means of XSL transformations), and searching this corpus the way that you would search a directory of texts on a local computer (e.g., grep). This is not a very "high-tech" solution, but I think this is how most of us search the GRETIL archive, and it's very straightforward. To get "fuzzy" results you would probably have to write custom shell scripts that replace a given search term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g., "dha(r)*((ṁm)|(m)+)a" or whatever).
- A web application with a search interface. Good luck with this. It would probably involve Apache Lucene. The source code for the SARIT web application, designed for a transitional version of eXistDB (between 3 and 4), is available on GitHub.
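A minimal sketch of the first option, written in Python rather than as a shell script. The rule table `LENIENT_RULES` and the function names are hypothetical, and the two rules shown simply mirror the "dharma" pattern above; real rules would have to come from the project's own notes on epigraphic spelling.

```python
import re
import unicodedata

# Hypothetical leniency rules (an assumption, not an established standard):
# each key is a character of the search term, each value a regex fragment
# that also matches some assumed spelling variants.
LENIENT_RULES = {
    "r": "r*",          # r may be absent or repeated
    "m": "(?:ṁ|m)+",    # anusvāra in place of m, or doubled m
}

def nfc(text):
    """Normalise to NFC so precomposed and combining forms compare equal."""
    return unicodedata.normalize("NFC", text)

def fuzzy_pattern(term, rules=LENIENT_RULES):
    """Build a regex string from a search term, one character at a time."""
    return nfc("".join(rules.get(ch, re.escape(ch)) for ch in nfc(term)))

def search_lines(lines, term, fuzzy=False):
    """Return (line_number, line) pairs whose text matches the term.

    The search is case-sensitive in both modes.
    """
    pattern = fuzzy_pattern(term) if fuzzy else re.escape(nfc(term))
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, 1)
            if regex.search(nfc(line))]
```

Calling `search_lines(lines, "dharma")` gives the strict search, and `search_lines(lines, "dharma", fuzzy=True)` the lenient one. Normalising both pattern and text to NFC matters because the same diacritic can be encoded in more than one way.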
Andrew
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh <danbalogh@gmail.com> wrote:
Dear All,
in the near future, we in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add <w> tags, though we may do so for part of the corpus later on. We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: any already working open source specialised code we should check out? Any general search solutions that may be adapted to our purposes?
Many thanks and apologies for the vague question,
Dan
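As a rough illustration of the extraction step described above, here is a minimal Python sketch that flattens a TEI element to plain text and keeps one alternative per <choice>. The function name `flatten`, the `prefer` parameter, and the sample markup are assumptions for illustration; the actual DHARMA encoding may use <choice> differently.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation

def flatten(elem, prefer=("reg", "corr")):
    """Return the text of elem, keeping a single child of each <choice>.

    `prefer` lists child element names in order of preference; if none is
    present, the first child of the <choice> is used.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "choice":
            picked = next(
                (c for name in prefer for c in child if c.tag == TEI + name),
                None,
            )
            if picked is None and len(child):
                picked = child[0]
            if picked is not None:
                parts.append(flatten(picked, prefer))
        else:
            parts.append(flatten(child, prefer))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)

# Hypothetical sample, not taken from the DHARMA corpus.
SAMPLE = (
    '<ab xmlns="http://www.tei-c.org/ns/1.0">svasti '
    "<choice><orig>dharmma</orig><reg>dharma</reg></choice>-rājaḥ</ab>"
)
root = ET.fromstring(SAMPLE)
regularised = flatten(root)
original = flatten(root, prefer=("orig", "sic"))
```

Running the extraction twice, once per preference, yields two parallel plain-text versions, so a strict search can be run against either the original or the regularised spelling.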
Indic-texts mailing list
Indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts