Dear All, let me add my thanks to Arlo's for the quick and detailed responses I have received to my vague question. I am sadly ignorant of the exact advantages offered by Lucene and the level of difficulty involved in extracting from our corpus a suitable index file, but from the little I have gathered, that indeed seems to be the best way forward. I remain uncertain about SLP, mainly for the reasons pointed out by Andrew. A kind of "SLP2" which includes Unicode characters outside the ASCII range and dedicates some of these to replacing the digraphs in ISO-15919 (and IAST) may solve those, but developing yet another brand new transliteration scheme does not seem feasible at the moment. We shall thus probably live with catching the occasional dharmya while fishing for harmya.
The very best,
Daniel

On Fri, 14 Feb 2020 at 09:30, Arlo Griffiths <arlo.griffiths@efeo.net> wrote:
Dear Andrew,

Thanks for your clarification of Peter’s message. Now I understand its pertinence.

Dear Patrick,

Thanks for all your useful information. I have created a GitHub issue <https://github.com/erc-dharma/project-documentation/issues/7> assembling all info we have received on this matter so far. (Yes, we are carefully using the appropriate @xml:lang tags for language and script! And yes, we will want to make it possible to search either per individual language or without regard to language, and I suppose while we’re at it we may as well try to make it possible to search in languages A and C but not in B.)

Best wishes to all,

Arlo


> On 13 Feb 2020, at 23:57, Patrick McAllister <pma@rdorte.org> wrote:
>
> Dear list members,
>
> just to add a more general point to what’s been said already: you’ll
> have to distinguish quite carefully between the two search methods you
> want:
>
>>> On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh <danbalogh@gmail.com> wrote:
>>>>
>>>> Ideally, we should have two search methods, a lenient one to gather
>>>> fuzzy results and tolerate e.g. variations in epigraphic spelling
>>>> without returning too many false positives (at the moment we only
>>>> have some rudimentary notes for the specifics of this), and a strict
>>>> one to return the exact string searched for.
>
> It’s unlikely that these two wishes can be fulfilled by the same search
> engine.
>
> “Fuzzy results” is a slightly, ahem, fuzzy term.  One important case of
> fuzzy search is full text search.  So, if you search for the terms “X”
> and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in
> Y”, and perhaps also “X but not Y”.  Usually, the utility of full text
> search depends heavily on how well the indexer can analyze a given
> language (not script!).  In an English language search, it’s standard to
> get matches on “did” and “done” for a search on “do”.  But we don’t
> really have that kind of thing for Sanskrit yet; Oliver Hellwig’s
> statistical approaches report success rates in the correct analysis of
> Sanskrit of around 85%.  And if you start mixing languages (not just
> scripts), then rather weird things might happen if you ask your search
> engine to index them all in the same way.
>
> To be on the safe side, your project should use the @xml:lang tags with
> some foresight.  You’ll probably want to note both the language and the
> script (e.g., “sa-Latn”), especially if any inscriptions write different
> languages in the same script, because that is what your search engine
> will need to know: which strings are Tamil, which are Sanskrit, and so
> on.  And it’s unclear if you want to search for the same terms across
> all languages in your collection.  Would that be useful?
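To illustrate why the language and script subtags are worth recording separately, here is a minimal sketch, assuming extracted passages are available as (tag, text) pairs. The function names `parse_lang_tag` and `select_by_language` are hypothetical, and the tag parsing covers only the simple "language-Script" shape mentioned above, not the full tag syntax.

```python
def parse_lang_tag(tag):
    """Split a tag such as "sa-Latn" into (language, script or None)."""
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    for part in parts[1:]:
        if part.lower() == "x":
            break  # private-use subtags follow; stop looking for a script
        if len(part) == 4 and part.isalpha():
            script = part.title()  # script subtags are four letters
            break
    return language, script


def select_by_language(passages, include):
    """Keep (tag, text) pairs whose language subtag is in `include`."""
    return [(tag, text) for tag, text in passages
            if parse_lang_tag(tag)[0] in include]
```

With passages tagged this way, a search restricted to "languages A and C but not B" amounts to passing a set of two language codes as `include`, while a search without regard to language simply skips the filter.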
>
> Spelling variations are a slightly different kind of problem.  If you
> manage to reduce them to regular expressions, you could probably
> configure your search engine’s indexing functions to smudge the texts
> for these differences (e.g., don’t differentiate between “ba” and “va”).
> It’s also quite common to index the same set of texts in several
> different ways.  E.g., one case sensitive, one not, one with
> punctuation, one without, one with corrections for spelling errors, and
> so on.  You will really need to experiment with the settings, and decide
> what things should be searchable.
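The idea of indexing the same text in several ways can be sketched roughly as follows. This is only an illustration in plain Python, not the configuration of any particular search engine; `smudge` and `index_views` are hypothetical names, and the single b/v rule stands in for whatever rules the project's notes on spelling variation eventually yield.

```python
import re
import string


def smudge(text):
    """Collapse one known spelling variation: treat "v" as "b"."""
    return text.replace("v", "b").replace("V", "B")


def index_views(text):
    """Return several differently normalised versions of the same text."""
    # string.punctuation covers ASCII punctuation only (an assumption that
    # would need extending for editorial signs used in the corpus).
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    no_punct = re.sub(r"\s+", " ", no_punct).strip()
    return {
        "exact": text,                        # for strict, case-sensitive search
        "caseless": text.lower(),             # case ignored
        "no_punct": no_punct,                 # punctuation removed
        "lenient": smudge(no_punct.lower()),  # all of the above plus b/v
    }
```

A query has to be passed through the same function as the view it is matched against, so a lenient search for "vala" would be smudged to "bala" before lookup.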
>
> In any case, all the full text searches work in broadly the same way:
> you need to get the text out of the TEI encoding and into a text-only
> format that is both simple enough so that a search engine can index it
> properly yet rich enough to satisfy your queries.  The trick is really
> to find the right balance.  Then you configure the search engine to
> either ignore or augment certain things in your texts.
>
> Full text search engines usually have only very minimal understanding of
> structural features: many can deal with simple HTML elements (e.g., rate
> a match in a heading higher than a match in a normal paragraph), but
> I’ve found that to be rather useless for what we were doing in SARIT
> (most texts not having headings you’d want to search, having been added
> by the editor---again, that’s a decision you make; for other
> researchers, that might be very interesting).  You should remove all
> kinds of notes and other interferences (gaps, line numbers, etc.)  that
> might detract from the text.  Of course, there’s no one right way to do
> this.  You’ll have to experiment quite a bit with getting the right
> extract of the text that is buried in your TEI markup.  In SARIT, we
> tried to remove everything except lg-s (metric text, mostly), and
> paragraphs.
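A minimal sketch of this kind of extraction, using only the Python standard library. The lists of elements to keep and to skip are assumptions for illustration and would need tuning per project; `flatten` and `extract_passages` are hypothetical names. Alternatives encoded in <choice> are not handled here: both readings would simply be concatenated, so that case needs a separate decision.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
KEEP = {TEI + "p", TEI + "lg"}   # containers whose text is extracted
SKIP = {TEI + "note", TEI + "gap", TEI + "lb", TEI + "pb"}  # assumed list
SPACE_AFTER = {TEI + "l"}        # verse lines should not run together


def flatten(elem):
    """Concatenate the text of elem, omitting the content of SKIP elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in SKIP:
            parts.append(flatten(child))
        if child.tag in SPACE_AFTER:
            parts.append(" ")
        parts.append(child.tail or "")  # text after a skipped element is kept
    return "".join(parts)


def extract_passages(tei_xml):
    """Return the whitespace-normalised text of every kept container."""
    passages = []

    def walk(elem):
        if elem.tag in KEEP:
            passages.append(" ".join(flatten(elem).split()))
        else:
            for child in elem:
                walk(child)

    walk(ET.fromstring(tei_xml))
    return passages
```

Because a skipped line-break element contributes nothing, a word split across two physical lines of an inscription is rejoined in the output, which is usually what a search needs.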
>
> I suppose you also have quite a bit of “hard” data about your
> inscriptions: size, location, etc.  It would be great to add this in a
> search interface, and it’s usually easy to add fields to the index
> storing these kinds of things.
>
> The other search you mention is strict search.  As Andrew wrote, that’s
> much easier in terms of technology.  You just extract the full text
> (perhaps thinking a bit again about notes, labels, etc.), and then use
> grep or something (make sure to take care of linebreaks properly, many
> greps search only single lines).
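The line-break caveat can be handled by letting any run of whitespace in the text stand for a single space in the query. A minimal sketch, assuming the extracted text fits in memory as one string; `strict_find` is a hypothetical name.

```python
import re


def strict_find(needle, text):
    """Return start offsets of exact, case-sensitive matches of needle,
    treating any run of whitespace in the text (including line breaks)
    as equivalent to the spaces in the needle."""
    pattern = r"\s+".join(re.escape(part) for part in needle.split())
    return [m.start() for m in re.finditer(pattern, text)]
```

This is still a plain substring search, so it does not by itself avoid the digraph problem discussed elsewhere in this thread.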
>
> With best wishes,
>
> P.S.: Perhaps of interest for your technical staff: you can try a simple
> search interface for the SARIT texts here:
> https://es.rdorte.org/_plugin/calaca/ (Andrew’s search of "dharmya" and
> "harmya" should work there), and you can look at the technical interface
> (and especially the index configuration) here:
> https://es.rdorte.org/_plugin/head/ This was just a proof of concept for
> SARIT and isn’t being updated anymore, but the searches still work.
> Technically, this uses Lucene through something called Elastic
> (https://www.elastic.co/).  Elastic is a very nice interface to Lucene,
> but the license model is weird; it’s now dropped out of several free
> software distributions, including Debian.  You might want to check that.
> But Lucene is certainly a very solid basis for anyone starting with full
> text search.
>
>
> On Thu, Feb 13 2020, Andrew Ollett wrote:
>
>> I think that Peter meant that, for searching purposes, it is advisable to
>> convert the ISO-15919 texts to something like the SLP-1 encoding, which
>> does not use digraphs to represent single phonemes (thus we shouldn't get
>> "dharmya" when we search for "harmya," or "aitara" when we search for
>> "itara," etc.). This is a good suggestion, but SLP-1 doesn't include
>> representations for sounds found in languages other than Sanskrit (e.g.,
>> short e and o, alveolar consonants in Tamil, retroflex approximants in the
>> Dravidian languages, the Indonesian pepet, etc.) which are part of the
>> DHARMA project.
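One way to avoid such false matches without converting the stored text to another scheme is to segment both the query and the text into phoneme-sized units and compare unit by unit. The following is a minimal sketch under stated assumptions: the digraph list is an illustrative subset rather than a project inventory, the names `segment` and `find_units` are hypothetical, and genuine sequences of two separate sounds that happen to spell a digraph would be mis-segmented unless the transliteration marks them.

```python
import unicodedata

# Illustrative subset of digraphs (assumption, not a complete inventory).
_RAW_DIGRAPHS = ["kh", "gh", "ch", "jh", "ṭh", "ḍh", "th", "dh", "ph", "bh",
                 "ai", "au"]
DIGRAPHS = {unicodedata.normalize("NFC", d) for d in _RAW_DIGRAPHS}


def segment(text):
    """Split text into units, treating each listed digraph as one unit."""
    text = unicodedata.normalize("NFC", text)
    units = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            units.append(pair)
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


def find_units(query, text):
    """Return positions (counted in units) where query matches text
    on unit boundaries."""
    q, t = segment(query), segment(text)
    return [i for i in range(len(t) - len(q) + 1) if t[i:i + len(q)] == q]
```

Here a search for "harmya" finds nothing in "dharmya", because the text begins with the single unit "dh" rather than with "d" followed by "h".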
>>
>> On Thu, Feb 13, 2020 at 11:52 AM Arlo Griffiths <arlo.griffiths@efeo.net>
>> wrote:
>>
>>> Peter,
>>>
>>> Daniel’s message made clear that the DHARMA text corpus is in a variant of
>>> ISO-15919 transliteration and does not concern only Sanskrit but a variety
>>> of South and Southeast Asian languages. Hence it remains unclear to me how
>>> your response is pertinent to his question. Could you explain?
>>>
>>> Best wishes,
>>>
>>> Arlo
>>>
>>> PS Incidentally, I happen to notice that your Library is imprecise in the
>>> source attribution for <
>>> https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html>. The
>>> Titus e-text was not edited by Jost Gippert but by myself. See <
>>> http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm>.
>>> It is also not clear to me why that source is mentioned at all, because the
>>> text in your library does not actually seem to be derived from the Titus
>>> version, or if it is has suppressed my identification of the Saṁhitā source
>>> of the mantrapratīkas.
>>>
>>>
>>>
>>>
>>> On 13 Feb 2020, at 18:15, Peter Scharf <scharf@sanskritlibrary.org> wrote:
>>>
>>> You should have your source text in the Sanskrit Library Phonetic (SLP)
>>> encoding and transcode input and display to that for searching for reliable
>>> results.  Patrick McAllister analyzed the disparities between searching in
>>> IAST or Unicode Devanagari; they give unexpected results and require extra
>>> effort to work around.  The Sanskrit Library uses the method I recommend
>>> for searching its texts.
>>>
>>> ******************************
>>> Peter M. Scharf, President
>>> The Sanskrit Library
>>> scharf@sanskritlibrary.org
>>> https://sanskritlibrary.org
>>> ******************************
>>>
>>> On 13 Feb 2020, at 9:19 PM, Andrew Ollett <andrew.ollett@gmail.com> wrote:
>>>
>>> Dear Dan,
>>>
>>> I'm also very interested to see if there are suggestions in this
>>> direction. It seems to me that there are two options:
>>>
>>>   1. Creating a plain-text corpus from your TEI corpus (by means of XSL
>>>   transformations), and searching this corpus the way that you would search a
>>>   directory of texts on a local computer (e.g., grep). This is not a very
>>>   "high-tech" solution, but I think this is how most of us search the GRETIL
>>>   archive, and it's very straightforward. To get "fuzzy" results you would
>>>   have to probably write custom shell scripts that replace a given search
>>>   term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g.,
>>>   "dha(r)*((ṁm)|(m)+)a" or whatever).
>>>   2. A web application with a search interface. Good luck with this. It
>>>   would probably involve Apache Lucene. The source code for the SARIT
>>>   <http://sarit.indology.info/> web application, designed for a
>>>   transitional version of eXistDB (between 3 and 4), is available on
>>>   GitHub <https://github.com/sarit/sarit-existdb>.
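The query rewriting described under option 1 could be sketched in Python rather than a shell script, along these lines. The variation table is purely illustrative (b/v interchange, and m written singly, doubled, or with anusvāra), and `VARIANTS` and `lenient_pattern` are hypothetical names; the resulting pattern could equally be handed to a grep that supports extended regular expressions.

```python
import re

# Hypothetical variation table: each key is a character of the search term,
# each value a regex fragment covering the spellings to tolerate.
VARIANTS = {
    "b": "[bv]",
    "v": "[bv]",
    "m": "(?:ṁm|m+)",
}


def lenient_pattern(term):
    """Build a compiled regex that tolerates the listed spelling variants."""
    return re.compile(
        "".join(VARIANTS.get(ch, re.escape(ch)) for ch in term)
    )
```

For example, `lenient_pattern("dharma")` matches "dharma" and "dharmma" alike, while every character without an entry in the table must still match exactly.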
>>>
>>> Andrew
>>>
>>> On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh <danbalogh@gmail.com> wrote:
>>>
>>>> Dear All,
>>>> in the near future, we in the DHARMA project will be looking into making
>>>> our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE
>>>> Asian languages will be marked up in TEI (EpiDoc) and will be Romanised,
>>>> mostly according to ISO-15919 but with some quirks on top of that,
>>>> including a few extra characters and case sensitivity. We will not be
>>>> lemmatising the corpus at short notice, nor is it likely that we'll add <w>
>>>> tags, though we may do so for part of the corpus later on. We're interested
>>>> in extracting transliterated text from TEI XML (sometimes including
>>>> alternative strings in <choice>) and searching it as fruitfully as
>>>> possible. Ideally, we should have two search methods, a lenient one to
>>>> gather fuzzy results and tolerate e.g. variations in epigraphic spelling
>>>> without returning too many false positives (at the moment we only have some
>>>> rudimentary notes for the specifics of this), and a strict one to return
>>>> the exact string searched for.
>>>> I myself do not have the level of technical preparedness even to
>>>> understand our options, and will be passing any suggestions on to people
>>>> with the necessary expertise. But to get started, I would welcome some
>>>> basic suggestions and pointers: any already working open source specialised
>>>> code we should check out? Any general search solutions that may be adapted
>>>> to our purposes?
>>>> Many thanks and apologies for the vague question,
>>>> Dan
>>>> _______________________________________________
>>>> Indic-texts mailing list
>>>> Indic-texts@lists.tei-c.org
>>>> http://lists.lists.tei-c.org/mailman/listinfo/indic-texts
>>>>
>
>
> --
> Patrick McAllister
> long-term email: pma@rdorte.org
