Dear Andrew,

Thanks for your clarification of Peter’s message. Now I understand its pertinence.

Dear Patrick,

Thanks for all your useful information. I have created a GitHub issue https://github.com/erc-dharma/project-documentation/issues/7 assembling all the information we have received on this matter so far. (Yes, we are carefully using the appropriate @xml:lang tags for language and script! And yes, we will want to make it possible to search both per individual language and without regard to language, and I suppose while we’re at it we may as well try to make it possible to search in languages A and C but not in B.)

Best wishes to all,
Arlo
On 13 Feb 2020, at 23:57, Patrick McAllister wrote:

Dear list members,
just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh
wrote: Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.
“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.
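The order-independent matching described above can be sketched in a few lines (a toy illustration only; real full-text engines tokenize and analyze per language, which this sketch does not attempt):

```python
# A toy full-text match: a document is a hit if it contains every query term,
# regardless of order or of intervening words.
def matches(query: str, document: str) -> bool:
    tokens = set(document.lower().split())
    return all(term.lower() in tokens for term in query.split())

docs = ["A X B Y C", "Z Y X W", "X was in Y", "X but not Y", "only X here"]
print([d for d in docs if matches("X Y", d)])  # all but the last
```

A real indexer would additionally reduce tokens to stems ("did", "done" → "do"), which is exactly the language-analysis step that is still unreliable for Sanskrit.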
To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?
Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts over these differences (e.g., not differentiating between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways: e.g., one index case-sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings and decide what things should be searchable.
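Such a smudging step might look like the following (a minimal sketch; the rules shown are hypothetical examples, not a worked-out set for any actual corpus):

```python
import re

# Hypothetical normalization rules: each folds a common epigraphic spelling
# variation onto one canonical form before the text is indexed.
RULES = [
    (re.compile("v"), "b"),   # don't differentiate "ba" and "va"
]

def smudge(text: str) -> str:
    """Return the text with variant spellings reduced to a canonical form."""
    text = text.lower()       # this index is also case-insensitive
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

# Variant spellings now index (and therefore match) identically:
print(smudge("vala"), smudge("Bala"))  # -> bala bala
```

The same pipeline would run over both the indexed texts and the incoming queries, so the user never sees the smudged forms.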
In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough so that a search engine can index it properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.
Full text search engines usually have only a very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT, since most texts have no headings you’d want to search, the ones present having been added by the editor---again, that’s a decision you make; for other researchers, headings might be very interesting. You should remove all kinds of notes and other interferences (gaps, line numbers, etc.) that might distract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit to get the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metrical text, mostly) and paragraphs.
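The extraction step could be sketched like this (a minimal illustration using Python’s standard library; the set of elements to skip — note, lb, gap — is a hypothetical choice for the example, not SARIT’s or DHARMA’s actual configuration):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Elements whose content should not reach the search index (a hypothetical
# selection; each project must decide what counts as interference).
SKIP = {f"{TEI_NS}note", f"{TEI_NS}lb", f"{TEI_NS}gap"}

def extract_text(elem) -> str:
    """Recursively collect text content, dropping SKIP elements entirely."""
    parts = []
    if elem.tag not in SKIP:
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            parts.append(extract_text(child))
    # The tail follows the element's closing tag, so it is kept either way.
    if elem.tail:
        parts.append(elem.tail)
    return "".join(parts)

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>svasti śrī<note>an editorial note</note> mahārāja</p>
</body></text></TEI>"""

root = ET.fromstring(sample)
print(" ".join(extract_text(root).split()))  # -> svasti śrī mahārāja
```

Handling of &lt;choice&gt; (emitting both alternatives, or indexing each in a separate index) would be added at the same recursion step.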
I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this in a search interface, and it’s usually easy to add fields to the index storing these kinds of things.
The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (again thinking a bit about notes, labels, etc.), and then use grep or something similar (make sure to handle linebreaks properly; many greps search only within single lines).
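The linebreak caveat is worth making concrete. One common workaround is to treat any whitespace in the query as “any run of whitespace” in the text, so a match spanning a linebreak is still found (a small sketch, not tied to any particular grep implementation):

```python
import re

def strict_search(term: str, text: str) -> list[int]:
    # Replace each whitespace gap in the query with \s+ so that matches
    # crossing a linebreak in the extracted plain text are still found.
    pattern = re.compile(r"\s+".join(map(re.escape, term.split())))
    return [m.start() for m in pattern.finditer(text)]

corpus = "svasti śrī\nmahārāja"
print(strict_search("śrī mahārāja", corpus))  # -> [7]
```

With GNU grep the equivalent is to search NUL-delimited “lines” (`grep -z`) or to pre-join lines before searching.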
With best wishes,
P.S.: Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: https://es.rdorte.org/_plugin/calaca/ (Andrew’s search of "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: https://es.rdorte.org/_plugin/head/ This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through something called Elastic (https://www.elastic.co/). Elastic is a very nice interface to Lucene, but the license model is weird; it’s now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
On Thu, Feb 13 2020, Andrew Ollett wrote:
I think that Peter meant that, for searching purposes, it is advisable to convert the ISO-15919 texts to something like the SLP-1 encoding, which does not use digraphs to represent single phonemes (thus we shouldn't get "dharmya" when we search for "harmya," or "aitara" when we search for "itara," etc.). This is a good suggestion, but SLP-1 doesn't include representations for sounds found in languages other than Sanskrit (e.g., short e and o, alveolar consonants in Tamil, retroflex approximants in the Dravidian languages, the Indonesian pepet, etc.) which are part of the DHARMA project.
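The digraph problem can be demonstrated directly (a partial transcoding table in the spirit of SLP-1 is shown; a real table would cover the full ISO-15919 inventory, and, as noted, SLP-1 itself lacks characters for several non-Sanskrit sounds):

```python
# A few ISO-15919/IAST digraphs and their single-character stand-ins,
# in the spirit of SLP-1 (partial, for illustration only).
DIGRAPHS = [("ai", "E"), ("au", "O"),
            ("kh", "K"), ("gh", "G"), ("ch", "C"), ("jh", "J"),
            ("th", "T"), ("dh", "D"), ("ph", "P"), ("bh", "B")]

def transcode(text: str) -> str:
    for digraph, single in DIGRAPHS:
        text = text.replace(digraph, single)
    return text

# In romanized text, "harmya" is a (spurious) substring of "dharmya"...
assert "harmya" in "dharmya"
# ...but once the aspirate is a single character, the false hit disappears:
assert "harmya" not in transcode("dharmya")  # "Darmya"
```

The same applies to "itara" inside "aitara": after transcoding, "aitara" begins with "E" and no longer contains "itara" at a phoneme boundary it shouldn’t.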
On Thu, Feb 13, 2020 at 11:52 AM Arlo Griffiths
wrote: Peter,
Daniel’s message made clear that the DHARMA text corpus is in a variant of ISO-15919 transliteration and does not concern only Sanskrit but a variety of South and Southeast Asian languages. Hence it remains unclear to me how your response is pertinent to his question. Could you explain?
Best wishes,
Arlo
PS Incidentally, I happen to notice that your Library is imprecise in the source attribution for <https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html>. The Titus e-text was not edited by Jost Gippert but by myself. See <http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm>. It is also not clear to me why that source is mentioned at all, because the text in your library does not actually seem to be derived from the Titus version, or, if it is, it has suppressed my identification of the Saṁhitā source of the mantrapratīkas.
On 13 Feb 2020, at 18:15, Peter Scharf wrote:

You should have your source text in the Sanskrit Library Phonetic (SLP) encoding, and transcode input and display to that encoding when searching, for reliable results. Patrick McAllister analyzed the disparities between searching in IAST or Unicode Devanagari; they give unexpected results and require extra effort to work around. The Sanskrit Library uses the method I recommend for searching its texts.
****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
On 13 Feb 2020, at 9:19 PM, Andrew Ollett
wrote: Dear Dan,
I'm also very interested to see if there are suggestions in this direction. It seems to me that there are two options:
1. Creating a plain-text corpus from your TEI corpus (by means of XSL transformations), and searching this corpus the way that you would search a directory of texts on a local computer (e.g., with grep). This is not a very "high-tech" solution, but I think this is how most of us search the GRETIL archive, and it's very straightforward. To get "fuzzy" results you would probably have to write custom shell scripts that replace a given search term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g., "dha(r)*((ṁm)|(m)+)a" or whatever).

2. A web application with a search interface. Good luck with this. It would probably involve Apache Lucene. The source code for the SARIT (http://sarit.indology.info/) web application, designed for a transitional version of eXistDB (between 3 and 4), is available on GitHub: https://github.com/sarit/sarit-existdb.
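The term-rewriting idea in option 1 could be sketched as follows (the fuzziness rules are hypothetical examples in the spirit of the "dha(r)*((ṁm)|(m)+)a" pattern above, not a vetted set):

```python
import re

# Hypothetical fuzziness rules: each maps a character in the query to a
# regex fragment tolerating common epigraphic variants.
FUZZ = {
    "r": "(r)?",          # r sometimes dropped before a consonant
    "m": "((ṁm)|(m+))",   # anusvāra + m, or a geminate m
    "b": "[bv]",          # b/v confusion, in both directions
    "v": "[bv]",
}

def fuzzify(term: str) -> re.Pattern:
    """Expand a plain search term into a variant-tolerant regex."""
    return re.compile("".join(FUZZ.get(ch, re.escape(ch)) for ch in term))

pattern = fuzzify("dharma")   # -> dha(r)?((ṁm)|(m+))a
for variant in ["dharma", "dhamma", "dhaṁma"]:
    assert pattern.search(variant)
```

A shell wrapper would then feed `pattern.pattern` to grep -E (or use Python’s own `finditer` over the plain-text corpus).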
Andrew
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh
wrote: Dear All,

In the near future, we in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other South and Southeast Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add <w> tags, though we may do so for part of the corpus later on.

We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods: a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes on the specifics of this), and a strict one to return the exact string searched for.

I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: any already working open-source specialised code we should check out? Any general search solutions that may be adapted to our purposes?

Many thanks and apologies for the vague question,
Dan

_______________________________________________ Indic-texts mailing list Indic-texts@lists.tei-c.org http://lists.lists.tei-c.org/mailman/listinfo/indic-texts
-- Patrick McAllister long-term email: pma@rdorte.org