Dear Andrew,

Thanks for your clarification of Peter’s message. Now I understand its pertinence.

Dear Patrick,

Thanks for all your useful information. I have created a GitHub issue https://github.com/erc-dharma/project-documentation/issues/7 assembling all the information we have received on this matter so far. (Yes, we are carefully using the appropriate @xml:lang tags for language and script! And yes, we will want to make it possible to search both per individual language and without regard to language, and I suppose while we’re at it we may as well try to make it possible to search in languages A and C but not in B.)

Best wishes to all,
Arlo
On 13 Feb 2020, at 23:57, Patrick McAllister wrote:

Dear list members,
just to add a more general point to what’s been said already: you’ll have to distinguish quite carefully between the two search methods you want:
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh
wrote: Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.
It’s unlikely that these two wishes can be fulfilled by the same search engine.
“Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.
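The order-independent matching described above can be sketched in a few lines (a toy illustration only; real full-text engines tokenize and analyze per language, which this sketch does not attempt):

```python
# A toy full-text match: a document is a hit if it contains every query term,
# regardless of order or of intervening words.
def matches(query: str, document: str) -> bool:
    tokens = set(document.lower().split())
    return all(term.lower() in tokens for term in query.split())

docs = ["A X B Y C", "Z Y X W", "X was in Y", "X but not Y", "only X here"]
print([d for d in docs if matches("X Y", d)])  # all but the last
```

A real indexer would additionally reduce tokens to stems ("did", "done" → "do"), which is exactly the language-analysis step that is still unreliable for Sanskrit.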
To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?
Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts over these differences (e.g., not differentiating between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways: e.g., one index case-sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings and decide what things should be searchable.
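Such a smudging step might look like the following (a minimal sketch; the rules shown are hypothetical examples, not a worked-out set for any actual corpus):

```python
import re

# Hypothetical normalization rules: each folds a common epigraphic spelling
# variation onto one canonical form before the text is indexed.
RULES = [
    (re.compile("v"), "b"),   # don't differentiate "ba" and "va"
]

def smudge(text: str) -> str:
    """Return the text with variant spellings reduced to a canonical form."""
    text = text.lower()       # this index is also case-insensitive
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

# Variant spellings now index (and therefore match) identically:
print(smudge("vala"), smudge("Bala"))  # -> bala bala
```

The same pipeline would run over both the indexed texts and the incoming queries, so the user never sees the smudged forms.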
In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough so that a search engine can index it properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.
Full text search engines usually have only a very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT, since most texts have no headings you’d want to search, the ones present having been added by the editor---again, that’s a decision you make; for other researchers, headings might be very interesting. You should remove all kinds of notes and other interferences (gaps, line numbers, etc.) that might distract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit to get the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metrical text, mostly) and paragraphs.
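The extraction step could be sketched like this (a minimal illustration using Python’s standard library; the set of elements to skip — note, lb, gap — is a hypothetical choice for the example, not SARIT’s or DHARMA’s actual configuration):

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Elements whose content should not reach the search index (a hypothetical
# selection; each project must decide what counts as interference).
SKIP = {f"{TEI_NS}note", f"{TEI_NS}lb", f"{TEI_NS}gap"}

def extract_text(elem) -> str:
    """Recursively collect text content, dropping SKIP elements entirely."""
    parts = []
    if elem.tag not in SKIP:
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            parts.append(extract_text(child))
    # The tail follows the element's closing tag, so it is kept either way.
    if elem.tail:
        parts.append(elem.tail)
    return "".join(parts)

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>svasti śrī<note>an editorial note</note> mahārāja</p>
</body></text></TEI>"""

root = ET.fromstring(sample)
print(" ".join(extract_text(root).split()))  # -> svasti śrī mahārāja
```

Handling of &lt;choice&gt; (emitting both alternatives, or indexing each in a separate index) would be added at the same recursion step.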
I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this in a search interface, and it’s usually easy to add fields to the index storing these kinds of things.
The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (again thinking a bit about notes, labels, etc.), and then use grep or something similar (make sure to handle linebreaks properly; many greps search only within single lines).
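The linebreak caveat is worth making concrete. One common workaround is to treat any whitespace in the query as “any run of whitespace” in the text, so a match spanning a linebreak is still found (a small sketch, not tied to any particular grep implementation):

```python
import re

def strict_search(term: str, text: str) -> list[int]:
    # Replace each whitespace gap in the query with \s+ so that matches
    # crossing a linebreak in the extracted plain text are still found.
    pattern = re.compile(r"\s+".join(map(re.escape, term.split())))
    return [m.start() for m in pattern.finditer(text)]

corpus = "svasti śrī\nmahārāja"
print(strict_search("śrī mahārāja", corpus))  # -> [7]
```

With GNU grep the equivalent is to search NUL-delimited “lines” (`grep -z`) or to pre-join lines before searching.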
With best wishes,
P.S.: Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: https://es.rdorte.org/_plugin/calaca/ (Andrew’s search of "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: https://es.rdorte.org/_plugin/head/ This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through something called Elastic (https://www.elastic.co/). Elastic is a very nice interface to Lucene, but the license model is weird; it’s now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
On Thu, Feb 13 2020, Andrew Ollett wrote:
I think that Peter meant that, for searching purposes, it is advisable to convert the ISO-15919 texts to something like the SLP-1 encoding, which does not use digraphs to represent single phonemes (thus we shouldn't get "dharmya" when we search for "harmya," or "aitara" when we search for "itara," etc.). This is a good suggestion, but SLP-1 doesn't include representations for sounds found in languages other than Sanskrit (e.g., short e and o, alveolar consonants in Tamil, retroflex approximants in the Dravidian languages, the Indonesian pepet, etc.) which are part of the DHARMA project.
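The digraph problem can be demonstrated directly (a partial transcoding table in the spirit of SLP-1 is shown; a real table would cover the full ISO-15919 inventory, and, as noted, SLP-1 itself lacks characters for several non-Sanskrit sounds):

```python
# A few ISO-15919/IAST digraphs and their single-character stand-ins,
# in the spirit of SLP-1 (partial, for illustration only).
DIGRAPHS = [("ai", "E"), ("au", "O"),
            ("kh", "K"), ("gh", "G"), ("ch", "C"), ("jh", "J"),
            ("th", "T"), ("dh", "D"), ("ph", "P"), ("bh", "B")]

def transcode(text: str) -> str:
    for digraph, single in DIGRAPHS:
        text = text.replace(digraph, single)
    return text

# In romanized text, "harmya" is a (spurious) substring of "dharmya"...
assert "harmya" in "dharmya"
# ...but once the aspirate is a single character, the false hit disappears:
assert "harmya" not in transcode("dharmya")  # "Darmya"
```

The same applies to "itara" inside "aitara": after transcoding, "aitara" begins with "E" and no longer contains "itara" at a phoneme boundary it shouldn’t.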
On Thu, Feb 13, 2020 at 11:52 AM Arlo Griffiths
wrote: Peter,
Daniel’s message made clear that the DHARMA text corpus is in a variant of ISO-15919 transliteration and does not concern only Sanskrit but a variety of South and Southeast Asian languages. Hence it remains unclear to me how your response is pertinent to his question. Could you explain?
Best wishes,
Arlo
PS Incidentally, I happen to notice that your Library is imprecise in the source attribution for <https://sanskritlibrary.org/catalogsText/titus/vedic/vaits.html>. The Titus e-text was not edited by Jost Gippert but by myself. See <http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/av/vaits/vaits.htm>. It is also not clear to me why that source is mentioned at all, because the text in your library does not actually seem to be derived from the Titus version, or, if it is, it has suppressed my identification of the Saṁhitā source of the mantrapratīkas.
On 13 Feb 2020, at 18:15, Peter Scharf wrote:

You should have your source text in the Sanskrit Library Phonetic (SLP) encoding, and transcode input and display to that encoding when searching, for reliable results. Patrick McAllister analyzed the disparities between searching in IAST or Unicode Devanagari; they give unexpected results and require extra effort to work around. The Sanskrit Library uses the method I recommend for searching its texts.
****************************** Peter M. Scharf, President The Sanskrit Library scharf@sanskritlibrary.org https://sanskritlibrary.org ******************************
On 13 Feb 2020, at 9:19 PM, Andrew Ollett
wrote: Dear Dan,
I'm also very interested to see if there are suggestions in this direction. It seems to me that there are two options:
1. Creating a plain-text corpus from your TEI corpus (by means of XSL transformations), and searching this corpus the way that you would search a directory of texts on a local computer (e.g., with grep). This is not a very "high-tech" solution, but I think this is how most of us search the GRETIL archive, and it's very straightforward. To get "fuzzy" results you would probably have to write custom shell scripts that replace a given search term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g., "dha(r)*((ṁm)|(m)+)a" or whatever).

2. A web application with a search interface. Good luck with this. It would probably involve Apache Lucene. The source code for the SARIT (http://sarit.indology.info/) web application, designed for a transitional version of eXistDB (between 3 and 4), is available on GitHub: https://github.com/sarit/sarit-existdb.
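The term-rewriting idea in option 1 could be sketched as follows (the fuzziness rules are hypothetical examples in the spirit of the "dha(r)*((ṁm)|(m)+)a" pattern above, not a vetted set):

```python
import re

# Hypothetical fuzziness rules: each maps a character in the query to a
# regex fragment tolerating common epigraphic variants.
FUZZ = {
    "r": "(r)?",          # r sometimes dropped before a consonant
    "m": "((ṁm)|(m+))",   # anusvāra + m, or a geminate m
    "b": "[bv]",          # b/v confusion, in both directions
    "v": "[bv]",
}

def fuzzify(term: str) -> re.Pattern:
    """Expand a plain search term into a variant-tolerant regex."""
    return re.compile("".join(FUZZ.get(ch, re.escape(ch)) for ch in term))

pattern = fuzzify("dharma")   # -> dha(r)?((ṁm)|(m+))a
for variant in ["dharma", "dhamma", "dhaṁma"]:
    assert pattern.search(variant)
```

A shell wrapper would then feed `pattern.pattern` to grep -E (or use Python’s own `finditer` over the plain-text corpus).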
Andrew
On Thu, Feb 13, 2020 at 4:21 AM Dániel Balogh
wrote: Dear All,

In the near future, we in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other South and Southeast Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add <w> tags, though we may do so for part of the corpus later on.

We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods: a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes on the specifics of this), and a strict one to return the exact string searched for.

I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: any already working open-source specialised code we should check out? Any general search solutions that may be adapted to our purposes?

Many thanks and apologies for the vague question,
Dan

_______________________________________________ Indic-texts mailing list Indic-texts@lists.tei-c.org http://lists.lists.tei-c.org/mailman/listinfo/indic-texts
-- Patrick McAllister long-term email: pma@rdorte.org