Searching transliterated Indic texts in TEI
Dear All,

In the near future, we in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add <w> tags, though we may do so for part of the corpus later on.

We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods: a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes on the specifics of this), and a strict one to return the exact string searched for.

I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: is there any already working open-source specialised code we should check out? Any general search solutions that might be adapted to our purposes?

Many thanks, and apologies for the vague question,
Dan
Dear Dan,
I'm also very interested to see if there are suggestions in this direction.
It seems to me that there are two options:
1. Creating a plain-text corpus from your TEI corpus (by means of XSL
transformations), and searching this corpus the way that you would search a
directory of texts on a local computer (e.g., grep). This is not a very
"high-tech" solution, but I think this is how most of us search the GRETIL
archive, and it's very straightforward. To get "fuzzy" results you would
have to probably write custom shell scripts that replace a given search
term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g.,
"dha(r)*((ṁm)|(m)+)a" or whatever).
2. A web application with a search interface. Good luck with this. It
would probably involve Apache Lucene. The source code for the SARIT
http://sarit.indology.info/ web application, designed for a
transitional version of eXistDB (between 3 and 4), is available on GitHub
https://github.com/sarit/sarit-existdb.
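The term-rewriting Andrew sketches for option 1 could look something like the following in Python (rather than shell). The rewrite rules here — anusvāra/gemination variants for -rma-, b/v interchange — are purely illustrative assumptions, not the DHARMA project's actual normalization rules:

```python
import re

# Illustrative rewrite rules only: each maps a strict substring of the
# query to a lenient regex fragment tolerating epigraphic variation.
FUZZY_RULES = [
    ("rma", r"r?(?:ṁm|mm|m)a"),  # dharma ~ dhaṁma ~ dhamma
    ("v", "[bv]"),               # b/v interchange
]

def fuzzify(term: str) -> str:
    """Expand a strict search term into a lenient regex pattern."""
    pattern = re.escape(term)
    for strict, lenient in FUZZY_RULES:
        pattern = pattern.replace(re.escape(strict), lenient)
    return pattern

# fuzzify("dharma") == "dhar?(?:ṁm|mm|m)a", which matches
# "dharma", "dhaṁma", and "dhamma", but finds nothing in "harmya".
```

The resulting pattern can be fed to Python's re.search or to grep -P, so the same rule table serves both a script-based and an interactive workflow.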
Andrew
_______________________________________________ Indic-texts mailing list Indic-texts@lists.tei-c.org http://lists.lists.tei-c.org/mailman/listinfo/indic-texts
You should have your source text in the Sanskrit Library Phonetic (SLP) encoding and transcode input and display to that for searching for reliable results. Patrick McAllister analyzed the disparities between searching in IAST and in Unicode Devanagari; both give unexpected results and require extra effort to work around. The Sanskrit Library uses the method I recommend for searching its texts.

******************************
Peter M. Scharf, President
The Sanskrit Library
scharf@sanskritlibrary.org
https://sanskritlibrary.org
******************************
Peter,
Daniel’s message made clear that the DHARMA text corpus is in a variant of ISO-15919 transliteration and does not concern only Sanskrit but a variety of South and Southeast Asian languages. Hence it remains unclear to me how your response is pertinent to his question. Could you explain?
Best wishes,
Arlo
PS Incidentally, I happen to notice that your Library is imprecise in the source attribution for https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html. The Titus e-text was not edited by Jost Gippert but by myself. See http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm. It is also not clear to me why that source is mentioned at all, because the text in your library does not actually seem to be derived from the Titus version, or, if it is, it has suppressed my identification of the Saṁhitā source of the mantrapratīkas.
I think that Peter meant that, for searching purposes, it is advisable to
convert the ISO-15919 texts to something like the SLP-1 encoding, which
does not use digraphs to represent single phonemes (thus we shouldn't get
"dharmya" when we search for "harmya," or "aitara" when we search for
"itara," etc.). This is a good suggestion, but SLP-1 doesn't include
representations for sounds found in languages other than Sanskrit (e.g.,
short e and o, alveolar consonants in Tamil, retroflex approximants in the
Dravidian languages, the Indonesian pepet, etc.) which are part of the
DHARMA project.
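The digraph problem Andrew describes can be made concrete with a toy transcoder. The digraph list and the private-use code points below are assumptions for the sketch (SLP-1 itself uses a fixed ASCII table); the principle — one character per phoneme before indexing — is what matters:

```python
# Toy phoneme-level transcoder: map ISO-15919 digraphs to single
# private-use characters so that substring matching respects phoneme
# boundaries. The digraph list and code points are illustrative only.
DIGRAPHS = {
    "kh": "\ue000", "gh": "\ue001", "ch": "\ue002", "jh": "\ue003",
    "ṭh": "\ue004", "ḍh": "\ue005", "th": "\ue006", "dh": "\ue007",
    "ph": "\ue008", "bh": "\ue009", "ai": "\ue00a", "au": "\ue00b",
}

def to_phonemes(text: str) -> str:
    # Plain left-to-right replacement; a real transcoder would need
    # rules for sequences that are genuinely two phonemes (e.g. a
    # morpheme-final d followed by h), which naive replacement
    # cannot distinguish.
    for digraph, char in DIGRAPHS.items():
        text = text.replace(digraph, char)
    return text

# After transcoding, "harmya" is no longer a substring of "dharmya",
# and "itara" is no longer a substring of "aitara".
```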
Dear list members, just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh wrote:

Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.

“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English-language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.

To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?

Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts for these differences (e.g., don’t differentiate between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways: e.g., one case-sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings, and decide what things should be searchable.
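The pattern of indexing the same field several ways can be sketched as an Elasticsearch index definition (Elasticsearch being the Lucene front end used in the SARIT proof of concept). The field names and the single b→v smudging rule below are hypothetical:

```python
# Hypothetical Elasticsearch index body, written as the Python dict you
# would pass to the client's indices.create(). The same "text" field is
# indexed twice: once verbatim ("exact") and once through a "smudging"
# analyzer that neutralizes b/v before tokenizing ("lenient").
INDEX_BODY = {
    "settings": {
        "analysis": {
            "char_filter": {
                "epigraphic_smudge": {
                    "type": "mapping",
                    "mappings": ["b => v"],  # one illustrative rule
                }
            },
            "analyzer": {
                "lenient": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["epigraphic_smudge"],
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "exact": {"type": "keyword"},
                    "lenient": {"type": "text", "analyzer": "lenient"},
                },
            }
        }
    },
}
```

A query against text.lenient then matches both spellings, while text.exact still supports the strict search.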
In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough for a search engine to index properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.

Full text search engines usually have only a very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT (most texts not having headings you’d want to search, these having been added by the editor---again, that’s a decision you make; for other researchers, headings might be very interesting). You should remove all kinds of notes and other interference (gaps, line numbers, etc.) that might detract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit with getting the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metrical text, mostly) and paragraphs.

I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this to a search interface, and it’s usually easy to add fields to the index storing these kinds of things.

The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (perhaps thinking a bit again about notes, labels, etc.), and then use grep or something similar (make sure to handle linebreaks properly; many greps search only single lines).
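Extracting the searchable text from the TEI is itself a small program. A minimal sketch with Python's standard library, in which the set of elements to skip and the policy of keeping only the first <choice> alternative are assumptions — exactly the editorial decisions Patrick describes:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
# Which elements count as "interference" is an editorial decision;
# <note> is only an example here.
SKIP = {TEI + "note"}

def extract_text(elem) -> str:
    """Collect searchable text, skipping noise elements; inside
    <choice>, keep only the first alternative (a real pipeline might
    index each alternative separately)."""
    if elem.tag in SKIP:
        return elem.tail or ""
    if elem.tag == TEI + "choice":
        first = next(iter(elem))
        return extract_text(first) + (elem.tail or "")
    parts = [elem.text or ""]
    for child in elem:
        parts.append(extract_text(child))
    parts.append(elem.tail or "")
    return "".join(parts)

doc = ET.fromstring(
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    'śrī<note>emended</note>dha'
    '<choice><sic>ṁma</sic><corr>rma</corr></choice></p>'
)
print(extract_text(doc))  # → śrīdhaṁma
```

The output of this step is what you hand to the indexer (or to grep, for the strict search); the decisions about SKIP and <choice> are where most of the experimentation will happen.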
With best wishes,

P.S.: Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: https://es.rdorte.org/_plugin/calaca/ (Andrew’s search of "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: https://es.rdorte.org/_plugin/head/ This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through something called Elastic (https://www.elastic.co/). Elastic is a very nice interface to Lucene, but the license model is weird; it has now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
--
Patrick McAllister
long-term email: pma@rdorte.org
Dear Andrew,

Thanks for your clarification of Peter’s message. Now I understand its pertinence.

Dear Patrick,

Thanks for all your useful information. I have created a github issue https://github.com/erc-dharma/project-documentation/issues/7 assembling all the info we have received on this matter so far. (Yes, we are carefully using the appropriate @xml:lang tags for language and script! And yes, we will want to make it possible to search both per individual language and without regard to language, and I suppose while we’re at it we may as well try to make it possible to search in languages A and C but not in B.)

Best wishes to all,
Arlo
Le 13 févr. 2020 à 23:57, Patrick McAllister
a écrit : Dear list members,
just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh
wrote: Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.
“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.
To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?
Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts for this differences (e.g., don’t differentiate between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways. E.g., one case sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings, and decide what things should be searchable.
In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough so that a search engine can index it properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.
Full text search engines usually have only very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT (most texts not having headings you’d want to search, having been added by the editor---again, that’s a decision you make; for other researchers, that might be very interesting). You should remove all kinds of notes and other interferences (gaps, line numbers, etc.) that might detract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit with getting the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metric text, mostly), and paragraphs.
I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this in a search interface, and it’s usually easy to add fields to the index storing these kinds of things.
The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (perhaps thinking a bit again about notes, labels, etc.), and then use grep or something (make sure to take care of linebreaks properly, many greps search only single lines).
With best wishes,
P.S.: Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: https://es.rdorte.org/_plugin/calaca/ (Andrew’s search of "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: https://es.rdorte.org/_plugin/head/ This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through something called Elastic (https://www.elastic.co/). Elastic is a very nice interface to Lucene, but the license model is weird; it’s now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
On Thu, Feb 13 2020, Andrew Ollett wrote:
I think that Peter meant that, for searching purposes, it is advisable to convert the ISO-15919 texts to something like the SLP-1 encoding, which does not use digraphs to represent single phonemes (thus we shouldn't get "dharmya" when we search for "harmya," or "aitara" when we search for "itara," etc.). This is a good suggestion, but SLP-1 doesn't include representations for sounds found in languages other than Sanskrit (e.g., short e and o, alveolar consonants in Tamil, retroflex approximants in the Dravidian languages, the Indonesian pepet, etc.) which are part of the DHARMA project.
On Thu, Feb 13, 2020 at 11:52 AM Arlo Griffiths
wrote: Peter,
Daniel’s message made clear that the DHARMA text corpus is in a variant of ISO-15919 transliteration and does not concern only Sanskrit but a variety of South and Southeast Asian languages. Hence it remains unclear to me how your response is pertinent to his question. Could you explain?
Best wishes,
Arlo
PS Incidentally, I happen to notice that your Library is imprecise in the source attribution for < https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html>. The Titus e-text was not edited by Jost Gippert but by myself. See < http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm>. It is also not clear to me why that source is mentioned at all, because the text in your library does not actually seem to be derived from the Titus version, or if it is has suppressed my identification of the Saṁhitā source of the mantrapratīkas.
Le 13 févr. 2020 à 18:15, Peter Scharf
a écrit : You should have your source text in the Sanskrit Library Phonetic (SLP) encoding and transcode input and display to that for searching for reliable results. Patrick McCalister analyzed the disparities between searching in IAST or Unicode Devanagari; they give unexpected results and require extra effort to work around. The Sanskrit Library uses the method I recommend for searching its texts.
****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
On 13 Feb 2020, at 9:19 PM, Andrew Ollett
wrote: Dear Dan,
I'm also very interested to see if there are suggestions in this direction. It seems to me that there are two options:
1. Creating a plain-text corpus from your TEI corpus (by means of XSL transformations), and searching this corpus the way that you would search a directory of texts on a local computer (e.g., grep). This is not a very "high-tech" solution, but I think this is how most of us search the GRETIL archive, and it's very straightforward. To get "fuzzy" results you would have to probably write custom shell scripts that replace a given search term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g., "dha(r)*((ṁm)|(m)+)a" or whatever). 2. A web application with a search interface. Good luck with this. It would probably involve Apache Lucene. The source code for the SARIT http://sarit.indology.info/ web application, designed for a transitional version of eXistDB (between 3 and 4), is available on GitHub https://github.com/sarit/sarit-existdb.
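The "custom scripts" idea in option 1, rewriting a strict term into a fuzzy pattern, can be sketched in Python. The two rules below are illustrative only, not the project's actual spelling-variation rules:

```python
import re

# Illustrative fuzzy rules: tolerate an optional r, anusvāra written for a
# nasal, and consonant gemination (dharma ~ dhaṁma ~ dharmma ~ dhamma).
RULES = [
    ("rm", "r?(?:ṁ?m+)"),
    ("rṇ", "r?(?:ṁ?ṇ+)"),
]

def fuzzify(term: str) -> str:
    """Turn a strict search term into a lenient regular expression."""
    pattern = re.escape(term)
    for strict, loose in RULES:
        pattern = pattern.replace(re.escape(strict), loose)
    return pattern

pat = re.compile(fuzzify("dharma"))
for variant in ["dharma", "dhaṁma", "dharmma", "dhamma"]:
    assert pat.search(variant), variant
assert not pat.search("dharya")
```

The trade-off between recall and false positives lives entirely in the rule table, which is where the project's "rudimentary notes" on epigraphic spelling would go.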
Andrew
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh
wrote: Dear All, in the near future, we in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add <w> tags, though we may do so for part of the corpus later on. We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for. I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: any already working open source specialised code we should check out? Any general search solutions that may be adapted to our purposes? Many thanks and apologies for the vague question, Dan _______________________________________________ Indic-texts mailing list Indic-texts@lists.tei-c.org http://lists.lists.tei-c.org/mailman/listinfo/indic-texts
-- Patrick McAllister long-term email: pma@rdorte.org
Dear All, let me add my thanks to Arlo's for the quick and detailed
responses I have received to my vague question. I am sadly ignorant of the
exact advantages offered by Lucene and the level of difficulty involved in
extracting from our corpus a suitable index file, but from the little I
have gathered, that indeed seems to be the best way forward. I remain
uncertain about SLP, mainly for the reasons pointed out by Andrew. A kind
of "SLP2" which includes Unicode characters outside the ASCII range and
dedicates some of these to replacing the digraphs in ISO-15919 (and IAST)
may solve those, but developing yet another brand new transliteration
scheme does not seem feasible at the moment. We shall thus probably live
with catching the occasional dharmya while fishing for harmya.
The very best,
Daniel
On Fri, 14 Feb 2020 at 09:30, Arlo Griffiths
Dear Andrew,
Thanks for your clarification of Peter’s message. Now I understand its pertinence.
Dear Patrick,
Thanks for all your useful information. I have created a github issue < https://github.com/erc-dharma/project-documentation/issues/7> assembling all info we have received on this matter so far. (Yes, we are carefully using the appropriate @xml:lang tags for language and script! And yes, we will want to make it possible to search both per individual language, or without regard to language, and I suppose while we’re at it we may as well try to make it possible to search in languages A and C but not in B.)
Best wishes to all,
Arlo
On 13 Feb 2020, at 23:57, Patrick McAllister
wrote: Dear list members,
just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh
wrote: Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.
“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.
To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?
Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts for these differences (e.g., don’t differentiate between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways: e.g., one case sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings, and decide what things should be searchable.
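A minimal sketch of this multi-index idea in Python; the field names and the ba/va rule are illustrative, not anyone's actual configuration:

```python
# Index the same string in several normalized forms: one exact, one
# case-folded, one with "ba"/"va" smudged together. A query is normalized
# the same way as the field it is matched against.
def normalizations(text: str) -> dict:
    folded = text.lower()
    smudged = folded.replace("b", "v")   # treat ba and va as one letter
    return {"exact": text, "folded": folded, "smudged": smudged}

doc = normalizations("Vrāhmaṇa")
assert doc["exact"] == "Vrāhmaṇa"          # strict search still possible
assert doc["folded"] == "vrāhmaṇa"         # case-insensitive search
# a query for "brāhmaṇa", smudged the same way, now matches this document:
assert normalizations("brāhmaṇa")["smudged"] == doc["smudged"]
```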
In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough so that a search engine can index it properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.
Full text search engines usually have only very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT (most texts not having headings you’d want to search, having been added by the editor---again, that’s a decision you make; for other researchers, that might be very interesting). You should remove all kinds of notes and other interferences (gaps, line numbers, etc.) that might detract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit with getting the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metric text, mostly), and paragraphs.
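A SARIT-style extraction step of the kind described above (keep only tei:p and tei:lg, drop everything else) might look roughly like this with the Python standard library; the element list and the sample document are simplified:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

xml = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <head>An editorial heading</head>
  <p>prose passage</p>
  <lg><l>first verse line</l><l>second verse line</l></lg>
</body>"""

root = ET.fromstring(xml)
# keep only the text content of p and lg elements, in document order
chunks = ["".join(e.itertext()) for e in root.iter()
          if e.tag in (TEI + "p", TEI + "lg")]
assert chunks == ["prose passage", "first verse linesecond verse line"]
```

The editorial heading disappears, exactly the kind of per-project decision described above; a real pipeline would also restore word spacing across line elements.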
I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this in a search interface, and it’s usually easy to add fields to the index storing these kinds of things.
The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (perhaps thinking a bit again about notes, labels, etc.), and then use grep or something (make sure to take care of linebreaks properly, many greps search only single lines).
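A tiny illustration of the linebreak caveat, with a hypothetical wrapped plain-text export:

```python
# A match can straddle a linebreak in the extracted text; a line-by-line
# grep misses it, while joining lines before searching restores it.
text = "siddham svasti śrī-\ndharma"          # hypothetical wrapped export
assert "śrī-dharma" not in text.splitlines()[0]  # what a line-based grep sees
assert "śrī-dharma" in text.replace("\n", "")    # joined before searching
```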
With best wishes,
P.S.: Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: https://es.rdorte.org/_plugin/calaca/ (Andrew’s search of "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: https://es.rdorte.org/_plugin/head/ This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through something called Elastic (https://www.elastic.co/). Elastic is a very nice interface to Lucene, but the license model is weird; it’s now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
-- Patrick McAllister long-term email: pma@rdorte.org
Patrick, Would it be possible to build a search interface that searched the TEI source and restricted that search to the content of desired elements such as lg and s elements? This rather than extracting to a text document and searching that? The Oxygen search interface permits XML aware searching to differentiate between searching the content of elements, versus attribute values, versus element names, etc. It might be similarly possible to differentiate which elements to search. Yours, Peter ****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
On 14 Feb 2020, at 4:27 AM, Patrick McAllister
wrote: Dear list members,
just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Fri, Feb 14 2020, Peter Scharf wrote:
Patrick, Would it be possible to build a search interface that searched the TEI source and restricted that search to the content of desired elements such as lg and s elements? This rather than extracting to a text document and searching that? The Oxygen search interface permits XML aware searching to differentiate between searching the content of elements, versus attribute values, versus element names, etc. It might be similarly possible to differentiate which elements to search.
Dear Peter, yes, that is possible. But I don’t think it would be very useful to stick too closely to the TEI markup. Someone will have to decide which text (or, “content”) is relevant. Consider these cases:
┌────
│ <lg xmlns="tei">
│ <l>A B C</l>
│ </lg>
└────
┌────
│ <lg xmlns="tei">
│ <l>A <hi>B</hi> C</l>
│ </lg>
└────
┌────
│ <lg xmlns="tei">
│ <l>A <del>B</del> C</l>
│ </lg>
└────
┌────
│ <lg xmlns="tei">
│ <l>A <note xml:lang="en">B</note> C</l>
│ </lg>
└────
What would you expect a search for “A AND B” to return? And in which order? What is the “content” of the tei:lg elements? It gets quite tricky to think through all the variations, and for a group of texts with markup as diverse as what you find in SARIT’s library you’ll run into contradictions (forcing you to define rules per document). So it’s easiest to give “abstract” classes, metric texts vs. prose, for example, which can be extracted from your markup and exposed through a simple search interface that doesn’t presuppose acquaintance with the TEI.
Technically, it’s no big problem to store tag information alongside the strings: you can easily search for “tag:lg *harmy*” at https://es.rdorte.org/_plugin/calaca/. This is somewhere between XML-aware search and full text search. But the utility is rather limited, especially if you get users who don’t know what tei:lg or tei:p (or an XML element in general) is supposed to be.
Best wishes,
-- Patrick McAllister long-term email: pma@rdorte.org
Thanks, Patrick. Yes, it gets quite complicated with the subordinate elements. However, one has to deal with that complication anyway when extracting the text to be searched into a text document. In the extraction algorithm one would have to specify whether to include text within subordinate hi-elements but not del- or note-elements, for example. So one could do the same for search within the TEI document. ****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
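The include/exclude decisions Peter describes can be written down as one extraction policy. A sketch in Python, assuming one illustrative rule set for Patrick's four cases (keep hi content, drop del and note content); the policy itself is a project decision, not a given:

```python
import xml.etree.ElementTree as ET

DROP = {"del", "note"}   # illustrative: elements whose content is not indexed

def content(elem) -> str:
    """Recursively collect text, skipping the content of DROP elements."""
    parts = [elem.text or ""]
    for child in elem:
        tag = child.tag.split("}")[-1]   # ignore any namespace for brevity
        if tag not in DROP:
            parts.append(content(child))
        parts.append(child.tail or "")   # tail text always belongs to parent
    return "".join(parts)

assert content(ET.fromstring("<lg><l>A B C</l></lg>")) == "A B C"
assert content(ET.fromstring("<lg><l>A <hi>B</hi> C</l></lg>")) == "A B C"
assert content(ET.fromstring("<lg><l>A <del>B</del> C</l></lg>")) == "A  C"
assert content(ET.fromstring(
    '<lg><l>A <note xml:lang="en">B</note> C</l></lg>')) == "A  C"
```

Whether this policy runs during extraction to plain text or inside an XML-aware query engine, the decisions it encodes are the same.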
participants (5)
-
Andrew Ollett
-
Arlo Griffiths
-
Dániel Balogh
-
Patrick McAllister
-
Peter Scharf