On 15 Feb 2020, at 1:54 PM, Patrick McAllister <pma@rdorte.org> wrote:
On Fri, Feb 14 2020, Peter Scharf wrote:
Patrick,
Would it be possible to build a search interface that searches the TEI source directly and restricts the search to the content of selected elements, such as lg and s
elements, rather than extracting to a text document and searching that? The Oxygen search interface permits XML-aware searching that distinguishes
between the content of elements, attribute values, element names, etc. It might similarly be possible to choose which elements to
search.
Dear Peter,
yes, that is possible. But I don’t think it would be very useful to
stick too closely to the TEI markup. Someone will have to decide which
text (or, “content”) is relevant. Consider these cases:
┌────
│ <lg xmlns="tei">
│ <l>A B C</l>
│ </lg>
└────
┌────
│ <lg xmlns="tei">
│ <l>A <hi>B</hi> C</l>
│ </lg>
└────
┌────
│ <lg xmlns="tei">
│ <l>A <del>B</del> C</l>
│ </lg>
└────
┌────
│ <lg xmlns="tei">
│ <l>A <note xml:lang="en">B</note> C</l>
│ </lg>
└────
What would you expect a search for “A AND B” to return? And in which
order? What is the “content” of the tei:lg elements? It gets quite
tricky to think through all the variations, and for a group of texts
with markup as diverse as what you find in SARIT’s library you’ll run
into contradictions (forcing you to define rules per document). So it’s
easiest to define “abstract” classes, metrical texts vs. prose, for
example, which can be extracted from your markup and exposed through a
simple search interface that doesn’t presuppose acquaintance with the TEI.
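
To make the question concrete, here is a minimal sketch of how two extraction policies answer “A AND B” differently for the four cases above. It is only an illustration, not how any existing SARIT tool works: the function names `searchable_text` and `matches_all` and the choice to skip tei:del and tei:note are assumptions, and the cases are written on one line each, with the shorthand namespace “tei” kept as in the examples.

```python
import xml.etree.ElementTree as ET

# The four cases from the message, one line each (namespace "tei" as written there).
CASES = [
    '<lg xmlns="tei"><l>A B C</l></lg>',
    '<lg xmlns="tei"><l>A <hi>B</hi> C</l></lg>',
    '<lg xmlns="tei"><l>A <del>B</del> C</l></lg>',
    '<lg xmlns="tei"><l>A <note xml:lang="en">B</note> C</l></lg>',
]

# Assumption: one possible policy treats deletions and notes as "not content".
SKIP = frozenset({"del", "note"})


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def collect(elem, skip):
    """Gather text in document order; drop the inside of skipped elements,
    but keep the text that follows them (their tail)."""
    parts = []
    if local_name(elem.tag) not in skip:
        parts.append(elem.text or "")
        for child in elem:
            parts.extend(collect(child, skip))
    parts.append(elem.tail or "")
    return parts


def searchable_text(xml_string, skip=frozenset()):
    """Return the whitespace-normalised text an index would see."""
    root = ET.fromstring(xml_string)
    return " ".join("".join(collect(root, skip)).split())


def matches_all(text, terms):
    """A naive AND search: every term must occur as a whole word."""
    words = text.split()
    return all(term in words for term in terms)


if __name__ == "__main__":
    for case in CASES:
        everything = searchable_text(case)
        filtered = searchable_text(case, SKIP)
        print(matches_all(everything, ["A", "B"]),
              matches_all(filtered, ["A", "B"]))
```

If all text counts as content, every case matches; once tei:del and tei:note are skipped, only the first two do. Neither answer is wrong, which is why someone has to decide per class of text (or per document) what the “content” is.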
Technically, it’s no big problem to store tag information alongside the
strings: you can easily search for “tag:lg *harmy*” at
https://es.rdorte.org/_plugin/calaca/. This is somewhere between
XML-aware search and full text search. But the utility is rather
limited, especially if you get users who don’t know what tei:lg or tei:p
(or an XML element in general) is supposed to be.
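
As a rough sketch of what “tag information alongside the strings” can mean, the following builds one record per element, with the element name in a `tag` field, and filters on that field plus a wildcard. It does not reproduce the index behind the URL above: the record layout, the function names `build_records` and `search`, and the sample text are all made up for illustration, and the wildcard matching uses Python’s `fnmatch` rather than a real search engine.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Made-up sample; the shorthand namespace "tei" follows the examples above.
SAMPLE = """<body xmlns="tei">
  <lg><l>dharmyam idam</l></lg>
  <p>adharmyam iti</p>
  <p>anyad vacanam</p>
</body>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def build_records(xml_string, tags=("lg", "p")):
    """Turn each element of interest into a flat record: tag name plus text."""
    root = ET.fromstring(xml_string)
    records = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in tags:
            text = " ".join("".join(elem.itertext()).split())
            records.append({"tag": name, "text": text})
    return records


def search(records, pattern, tag=None):
    """Return records with a word matching the wildcard pattern,
    optionally restricted to one tag (roughly 'tag:lg *harmy*')."""
    hits = []
    for rec in records:
        if tag is not None and rec["tag"] != tag:
            continue
        if any(fnmatch.fnmatchcase(word, pattern) for word in rec["text"].split()):
            hits.append(rec)
    return hits


if __name__ == "__main__":
    records = build_records(SAMPLE)
    print(search(records, "*harmy*", tag="lg"))  # restricted to verse
    print(search(records, "*harmy*"))            # any indexed element
```

The restricted search returns only the tei:lg record, the unrestricted one also returns the first tei:p. The filter is cheap to provide, but it only helps users who already know what the tag names stand for.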
Best wishes,
Yours,
Peter
******************************
Peter M. Scharf, President
The Sanskrit Library
scharf@sanskritlibrary.org
https://sanskritlibrary.org
******************************
On 14 Feb 2020, at 4:27 AM, Patrick McAllister <pma@rdorte.org> wrote:
Dear list members,
just to add a more general point to what’s been said already: you’ll
--
Patrick McAllister
long-term email: pma@rdorte.org