Searching transliterated Indic texts in TEI

Dear All,

In the near future, we in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add <w> tags, though we may do so for part of the corpus later on.

We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.

I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: any already working open source specialised code we should check out? Any general search solutions that may be adapted to our purposes?

Many thanks and apologies for the vague question,
Dan

Dear Dan,

I'm also very interested to see if there are suggestions in this direction. It seems to me that there are two options:

1. Creating a plain-text corpus from your TEI corpus (by means of XSL transformations), and searching this corpus the way that you would search a directory of texts on a local computer (e.g., grep). This is not a very "high-tech" solution, but I think this is how most of us search the GRETIL archive, and it's very straightforward. To get "fuzzy" results you would probably have to write custom shell scripts that replace a given search term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g., "dha(r)*((ṁm)|(m)+)a" or whatever).

2. A web application with a search interface. Good luck with this. It would probably involve Apache Lucene. The source code for the SARIT <http://sarit.indology.info/> web application, designed for a transitional version of eXistDB (between 3 and 4), is available on GitHub <https://github.com/sarit/sarit-existdb>.

Andrew
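Andrew's shell-script idea can be sketched in a few lines: a hypothetical helper that rewrites a literal search term into a "fuzzy" regular expression, which can then be fed to grep -P or any regex engine. The rewriting rules below (anusvāra vs. homorganic nasal, gemination) are illustrative placeholders, not the DHARMA project's actual spelling conventions.

```python
import re

# Illustrative fuzzing rules; a real project would derive these from its
# own notes on epigraphic spelling variation.
FUZZ_RULES = {
    "rm": "r?(?:ṁm|mm|m)",   # dharma ~ dhaṁma ~ dhamma
    "mm": "(?:ṁm|mm|m)",     # gemination ~ anusvāra + consonant
}

def fuzzy_pattern(term: str) -> str:
    """Rewrite a literal term into a lenient regex, in one left-to-right pass."""
    alt = "|".join(map(re.escape, FUZZ_RULES))
    return re.sub(alt, lambda m: FUZZ_RULES[m.group(0)], re.escape(term))

pat = re.compile(fuzzy_pattern("dharma"))
for attested in ["dharma", "dhaṁma", "dhamma"]:
    assert pat.search(attested)
```

A shell wrapper could then do something like `grep -P "$(python fuzz.py dharma)" corpus/*.txt`. The single-pass `re.sub` matters: applying the rules one after another would corrupt earlier replacements (the replacement strings themselves contain "mm").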

For reliable search results, you should keep your source text in the Sanskrit Library Phonetic (SLP) encoding and transcode both search input and display to and from that encoding. Patrick McAllister analyzed the disparities that arise when searching in IAST or in Unicode Devanagari; both give unexpected results and require extra effort to work around. The Sanskrit Library uses the method I recommend for searching its texts. ****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************

Peter,

Daniel’s message made clear that the DHARMA text corpus is in a variant of ISO-15919 transliteration and does not concern only Sanskrit but a variety of South and Southeast Asian languages. Hence it remains unclear to me how your response is pertinent to his question. Could you explain?

Best wishes,

Arlo

PS Incidentally, I happen to notice that your Library is imprecise in the source attribution for <https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html>. The Titus e-text was not edited by Jost Gippert but by myself. See <http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm>. It is also not clear to me why that source is mentioned at all, because the text in your library does not actually seem to be derived from the Titus version, or, if it is, has suppressed my identification of the Saṁhitā source of the mantrapratīkas.

I think that Peter meant that, for searching purposes, it is advisable to convert the ISO-15919 texts to something like the SLP-1 encoding, which does not use digraphs to represent single phonemes (thus we shouldn't get "dharmya" when we search for "harmya," or "aitara" when we search for "itara," etc.). This is a good suggestion, but SLP-1 doesn't include representations for sounds found in languages other than Sanskrit that are part of the DHARMA project (e.g., short e and o, alveolar consonants in Tamil, retroflex approximants in the Dravidian languages, the Indonesian pepet, etc.).
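Andrew's digraph point is easy to demonstrate: in IAST/ISO-15919, "harmya" occurs as a substring of "dharmya" because "dh" is two characters, whereas in SLP1 every phoneme is one character. A minimal sketch, using a small (deliberately incomplete) slice of the published SLP1 table:

```python
# A small slice of the SLP1 transliteration table; the full table covers
# all Sanskrit phonemes. Longest matches must be tried first so that
# digraphs like "dh" and "ai" win over "d" and "a".
IAST_TO_SLP1 = {
    "ai": "E", "au": "O", "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "a": "a", "i": "i", "u": "u", "k": "k", "g": "g", "t": "t", "d": "d",
    "p": "p", "b": "b", "m": "m", "y": "y", "r": "r", "l": "l", "v": "v",
    "h": "h", "s": "s",
}

def to_slp1(text: str) -> str:
    out, i = [], 0
    keys = sorted(IAST_TO_SLP1, key=len, reverse=True)  # digraphs first
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(IAST_TO_SLP1[k])
                i += len(k)
                break
        else:                       # pass through anything unmapped
            out.append(text[i])
            i += 1
    return "".join(out)

# In IAST the substring test misfires; in SLP1 it does not.
assert "harmya" in "dharmya"
assert to_slp1("harmya") not in to_slp1("dharmya")   # "harmya" vs "Darmya"
assert to_slp1("itara") not in to_slp1("aitara")     # "itara" vs "Etara"
```

The same one-character-per-phoneme trick works for any target alphabet; as Andrew notes, SLP1 itself would need extending with symbols for the non-Sanskrit sounds in the corpus.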

Dear list members,

Just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh <danbalogh@gmail.com> wrote:
Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.

“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.

To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?

Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts for these differences (e.g., don’t differentiate between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways. E.g., one case sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings, and decide what things should be searchable.
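The “smudging” Patrick describes amounts to normalizing each token before it goes into the index, and keeping a second, unnormalized index for strict search. A sketch of the idea with made-up rules (the ba/va merger and anusvāra levelling below are examples only, not the project's conventions):

```python
import unicodedata

# Hypothetical lenient normalization: case-fold, merge v with b, and level
# the anusvāra with m. Real rules would come from the project's own notes
# on epigraphic spelling variation.
def lenient_key(token: str) -> str:
    t = unicodedata.normalize("NFC", token).lower()
    return t.replace("v", "b").replace("ṁ", "m")

# Two indexes over the same tokens: one strict, one lenient.
docs = {1: ["vaṁśa", "dharma"], 2: ["baṁśa"]}
strict, lenient = {}, {}
for doc_id, tokens in docs.items():
    for tok in tokens:
        strict.setdefault(tok, set()).add(doc_id)
        lenient.setdefault(lenient_key(tok), set()).add(doc_id)

assert strict.get("vaṁśa") == {1}                   # exact spelling only
assert lenient.get(lenient_key("vaṁśa")) == {1, 2}  # spelling variants merge
```

A lenient query is normalized with the same function as the indexed tokens, which is exactly what a search engine's "analyzer" configuration does for you at scale.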
In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough so that a search engine can index it properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.

Full text search engines usually have only very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT (most texts not having headings you’d want to search, these having been added by the editor---again, that’s a decision you make; for other researchers, that might be very interesting). You should remove all kinds of notes and other interferences (gaps, line numbers, etc.) that might detract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit with getting the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metric text, mostly), and paragraphs.

I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this in a search interface, and it’s usually easy to add fields to the index storing these kinds of things.

The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (perhaps thinking a bit again about notes, labels, etc.), and then use grep or something (make sure to handle linebreaks properly: many greps search only single lines).
With best wishes,

P.S.: Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: https://es.rdorte.org/_plugin/calaca/ (Andrew’s search for "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: https://es.rdorte.org/_plugin/head/ This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through Elasticsearch (https://www.elastic.co/). Elasticsearch is a very nice interface to Lucene, but the license model is weird; it has now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
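For the technically inclined: in Elasticsearch, the kind of index-time smudging discussed above is configured as a custom analyzer with a "mapping" char_filter when the index is created. A hypothetical fragment (the index name and the specific mappings are invented examples, not DHARMA's rules), expressed as the Python dict one would send as the JSON body of the create-index request:

```python
# Hypothetical Elasticsearch index settings: a lenient analyzer whose
# "mapping" char_filter folds spelling variants before tokenization,
# alongside a strict analyzer for exact search. Example rules only.
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "epigraphic_folding": {
                    "type": "mapping",
                    "mappings": ["va => ba", "ṁ => m"],
                }
            },
            "analyzer": {
                "lenient": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["epigraphic_folding"],
                    "filter": ["lowercase"],
                },
                "strict": {
                    "type": "custom",
                    "tokenizer": "standard",
                },
            },
        }
    }
}
```

One would create the index with this body (e.g., `PUT /inscriptions`) and then declare separate fields per @xml:lang value, each pointing at the lenient or strict analyzer as needed; Lucene-based engines index the same text once per analyzer, which is how the multiple-indexes idea above is usually realized.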
--
Patrick McAllister
long-term email: pma@rdorte.org

Dear Andrew, Thanks for your clarirfication of Peter’s message. Now I understand its pertinence. Dear Patrick, Thanks for all your useful information. I have created a github issue <https://github.com/erc-dharma/project-documentation/issues/7> assembling all info we have received on this matter so far. (Yes, we are carefully using the appropriate @xml:lang tags for language and script! And yes, we will want to make it possible to search both per individual language, or without regard to language, and I suppose while we’re at it we may as well try to make it possible to search in languages A and C but not in B.) Best wishes to all, Arlo
Le 13 févr. 2020 à 23:57, Patrick McAllister <pma@rdorte.org> a écrit :
Dear list members,
just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh <danbalogh@gmail.com> wrote:
Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.
“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.
To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?
Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts for this differences (e.g., don’t differentiate between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways. E.g., one case sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings, and decide what things should be searchable.
In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough so that a search engine can index it properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.
Full text search engines usually have only very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT (most texts not having headings you’d want to search, having been added by the editor---again, that’s a decision you make; for other researchers, that might be very interesting). You should remove all kinds of notes and other interferences (gaps, line numbers, etc.) that might detract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit with getting the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metric text, mostly), and paragraphs.
I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this in a search interface, and it’s usually easy to add fields to the index storing these kinds of things.
The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (perhaps thinking a bit again about notes, labels, etc.), and then use grep or something similar (make sure to handle linebreaks properly: many greps search only single lines).
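A minimal sketch of the linebreak caveat, in Python rather than grep: letting each space in the query match any run of whitespace in the extracted text:

```python
import re

def strict_find(corpus, phrase):
    """Exact-string search that tolerates a linebreak inside the phrase:
    each space in the query matches any run of whitespace in the text."""
    pattern = re.compile(r"\s+".join(re.escape(word) for word in phrase.split()))
    return [m.start() for m in pattern.finditer(corpus)]

text = "oṁ svasti śrī\nvijaya"
print(strict_find(text, "śrī vijaya"))  # found despite the linebreak
```

With grep itself, one would instead normalize the extracted files to one record per line, or use a multiline-capable tool.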
With best wishes,
P.S.: Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: https://es.rdorte.org/_plugin/calaca/ (Andrew’s search of "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: https://es.rdorte.org/_plugin/head/ This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through something called Elastic (https://www.elastic.co/). Elastic is a very nice interface to Lucene, but the license model is weird; it’s now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
On Thu, Feb 13 2020, Andrew Ollett wrote:
I think that Peter meant that, for searching purposes, it is advisable to convert the ISO-15919 texts to something like the SLP-1 encoding, which does not use digraphs to represent single phonemes (thus we shouldn't get "dharmya" when we search for "harmya," or "aitara" when we search for "itara," etc.). This is a good suggestion, but SLP-1 doesn't include representations for sounds found in languages other than Sanskrit (e.g., short e and o, alveolar consonants in Tamil, retroflex approximants in the Dravidian languages, the Indonesian pepet, etc.) which are part of the DHARMA project.
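The dharmya/harmya problem can be illustrated with a toy mapping (the table below is hypothetical, and nowhere near a full ISO-15919 or SLP inventory):

```python
# Toy version of the digraph problem: map multi-character phonemes to single
# code points before indexing. The table is hypothetical and far from a
# complete ISO-15919 (or SLP) inventory.
DIGRAPHS = {"ai": "E", "au": "O", "kh": "K", "gh": "G", "ch": "C",
            "jh": "J", "th": "T", "dh": "D", "ph": "P", "bh": "B"}

def collapse(text):
    for digraph, single in DIGRAPHS.items():
        text = text.replace(digraph, single)
    return text

# After collapsing, a substring search no longer mistakes "dharmya" for a
# hit on "harmya", or "aitara" for a hit on "itara":
print(collapse("harmya") in collapse("dharmya"))  # False
print(collapse("itara") in collapse("aitara"))    # False
```

A real scheme would also have to handle the extra sounds Andrew lists, which is exactly where SLP-1 falls short for the DHARMA corpus.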
On Thu, Feb 13, 2020 at 11:52 AM Arlo Griffiths <arlo.griffiths@efeo.net> wrote:
Peter,
Daniel’s message made clear that the DHARMA text corpus is in a variant of ISO-15919 transliteration and does not concern only Sanskrit but a variety of South and Southeast Asian languages. Hence it remains unclear to me how your response is pertinent to his question. Could you explain?
Best wishes,
Arlo
PS Incidentally, I happen to notice that your Library is imprecise in the source attribution for <https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html>. The Titus e-text was not edited by Jost Gippert but by myself. See <http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm>. It is also not clear to me why that source is mentioned at all, because the text in your library does not actually seem to be derived from the Titus version, or, if it is, it has suppressed my identification of the Saṁhitā source of the mantrapratīkas.
Le 13 févr. 2020 à 18:15, Peter Scharf <scharf@sanskritlibrary.org> a écrit :
You should have your source text in the Sanskrit Library Phonetic (SLP) encoding, and transcode input and display to and from it, to get reliable search results. Patrick McAllister analyzed the disparities that arise when searching in IAST or Unicode Devanagari; they give unexpected results and require extra effort to work around. The Sanskrit Library uses the method I recommend for searching its texts.
****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
-- Patrick McAllister long-term email: pma@rdorte.org

Dear All, let me add my thanks to Arlo's for the quick and detailed responses I have received to my vague question. I am sadly ignorant of the exact advantages offered by Lucene and of the level of difficulty involved in extracting a suitable index file from our corpus, but from the little I have gathered, that indeed seems to be the best way forward. I remain uncertain about SLP, mainly for the reasons pointed out by Andrew. A kind of "SLP2" which includes Unicode characters outside the ASCII range and dedicates some of these to replacing the digraphs of ISO-15919 (and IAST) might solve those, but developing yet another brand-new transliteration scheme does not seem feasible at the moment. We shall thus probably live with catching the occasional dharmya while fishing for harmya. The very best, Daniel On Fri, 14 Feb 2020 at 09:30, Arlo Griffiths <arlo.griffiths@efeo.net> wrote:
Dear Andrew,
Thanks for your clarification of Peter’s message. Now I understand its pertinence.
Dear Patrick,
Thanks for all your useful information. I have created a github issue < https://github.com/erc-dharma/project-documentation/issues/7> assembling all info we have received on this matter so far. (Yes, we are carefully using the appropriate @xml:lang tags for language and script! And yes, we will want to make it possible to search both per individual language, or without regard to language, and I suppose while we’re at it we may as well try to make it possible to search in languages A and C but not in B.)
Best wishes to all,
Arlo
Le 13 févr. 2020 à 23:57, Patrick McAllister <pma@rdorte.org> a écrit :
Dear list members,
just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh <danbalogh@gmail.com> wrote:
Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.
“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.

Patrick, Would it be possible to build a search interface that searched the TEI source and restricted that search to the content of desired elements, such as lg and s elements, rather than extracting to a text document and searching that? The Oxygen search interface permits XML-aware searching that differentiates between searching the content of elements, versus attribute values, versus element names, etc. It might similarly be possible to differentiate which elements to search. Yours, Peter ****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
On 14 Feb 2020, at 4:27 AM, Patrick McAllister <pma@rdorte.org> wrote:

On Fri, Feb 14 2020, Peter Scharf wrote:
Patrick, Would it be possible to build a search interface that searched the TEI source and restricted that search to the content of desired elements such as lg and s elements? This rather than extracting to a text document and searching that? The Oxygen search interface permits XML aware searching to differentiate between searching the content of elements, versus attribute values, versus element names, etc. It might be similarly possible to differentiate which elements to search.
Dear Peter,

yes, that is possible. But I don’t think it would be very useful to stick too closely to the TEI markup. Someone will have to decide which text (or, “content”) is relevant. Consider these cases:

  <lg xmlns="tei">
    <l>A B C</l>
  </lg>

  <lg xmlns="tei">
    <l>A <hi>B</hi> C</l>
  </lg>

  <lg xmlns="tei">
    <l>A <del>B</del> C</l>
  </lg>

  <lg xmlns="tei">
    <l>A <note xml:lang="en">B</note> C</l>
  </lg>

What would you expect a search for “A AND B” to return? And in which order? What is the “content” of the tei:lg elements? It gets quite tricky to think through all the variations, and for a group of texts with markup as diverse as what you find in SARIT’s library you’ll run into contradictions (forcing you to define rules per document). So it’s easiest to give “abstract” classes, metric texts vs. prose, for example, which can be extracted from your markup and exposed through a simple search interface that doesn’t presuppose acquaintance with the TEI.

Technically, it’s no big problem to store tag information alongside the strings: you can easily search for “tag:lg *harmy*” at https://es.rdorte.org/_plugin/calaca/. This is somewhere between XML-aware search and full text search. But the utility is rather limited, especially if you get users who don’t know what tei:lg or tei:p (or an XML element in general) is supposed to be.

Best wishes,
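Sketched in code, that include/drop decision might look like this (Python, stdlib only; dropping <del> and <note> is an illustrative policy, and the record shape is an assumption):

```python
import xml.etree.ElementTree as ET

DROP = {"del", "note"}  # the include/drop policy is the editorial decision

def content(elem):
    parts = [elem.text or ""]
    for child in elem:
        if child.tag.split("}")[-1] not in DROP:  # ignore namespaces here
            parts.append(content(child))
        parts.append(child.tail or "")
    return "".join(parts)

def records(root):
    """One (tag, text) record per lg or p, ready for a field-aware index
    that could answer queries like "tag:lg *harmy*"."""
    return [{"tag": e.tag.split("}")[-1], "text": content(e).strip()}
            for e in root.iter() if e.tag.split("}")[-1] in {"lg", "p"}]

doc = ET.fromstring('<lg xmlns="tei"><l>A <del>B</del> C</l></lg>')
print(records(doc))  # the deleted "B" is absent from the indexed text
```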
Yours, Peter
****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
On 14 Feb 2020, at 4:27 AM, Patrick McAllister <pma@rdorte.org> wrote:
Dear list members,
just to add a more general point to what’s been said already: you’ll
-- Patrick McAllister long-term email: pma@rdorte.org

Thanks, Patrick. Yes, it gets quite complicated with the subordinate elements. However, one has to deal with that complication anyway when extracting the text to be searched into a text document. In the extraction algorithm one would have to specify, for example, whether to include text within subordinate hi elements but not del or note elements. So one could do the same for search within the TEI document itself.

******************************
Peter M. Scharf, President
The Sanskrit Library
scharf@sanskritlibrary.org
https://sanskritlibrary.org
******************************
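[Editorial note: the same inclusion rules could equally drive an index that records, for each term, which element it came from — roughly what Patrick's "tag:lg" query illustrates. A toy inverted index sketching the idea; a real deployment would use Lucene or Elasticsearch as mentioned earlier in the thread, and all names here are illustrative:]

```python
from collections import defaultdict

# Toy inverted index: term -> set of (doc_id, enclosing_element) pairs.
# Storing the tag alongside the string is what allows queries like
# "tag:lg dharma"; this is not a real search engine's API.
index = defaultdict(set)

def add(doc_id, tag, text):
    """Index every whitespace-separated term under its enclosing element."""
    for term in text.lower().split():
        index[term].add((doc_id, tag))

def search(term, tag=None):
    """Find docs containing term, optionally restricted to one element type."""
    hits = index.get(term.lower(), set())
    if tag is not None:
        hits = {h for h in hits if h[1] == tag}
    return sorted(h[0] for h in hits)

add("doc1", "lg", "dharma artha")   # a verse (lg) passage
add("doc2", "p", "dharma kama")     # a prose (p) passage
print(search("dharma"))             # -> ['doc1', 'doc2']
print(search("dharma", tag="lg"))   # -> ['doc1'], only the verse occurrence
```

This illustrates Peter's point: the decision about which elements feed the index is the same decision one would make when extracting to plain text.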
On 15 Feb 2020, at 1:54 PM, Patrick McAllister <pma@rdorte.org> wrote:
participants (5)
- Andrew Ollett
- Arlo Griffiths
- Dániel Balogh
- Patrick McAllister
- Peter Scharf