Yes, I did not make this clear but I exclude the splitting of compounds when I say words are hardly ever split by pāda boundaries. I do of course know that this happens quite commonly, and I also realise that these are indeed split words in terms of most definitions of a "word" including that of the TEI <w> tag. But I still think that if one wants to mark up compounds with nested <w> elements (e.g. <w><w>jāla</w><w>pracaya</w></w>), the least problematic way to handle compounds crossing pāda boundaries would be to stick to <l> elements for pādas and link the <w> element at the end of one pāda to the one at the beginning of the next. Using a <caesura> or any other milestone type element for the pāda break does not seem better to me, as the caesura is not a part of the compound (e.g. <w><w>jāla</w><caesura /><w>pracaya</w></w> strikes me as odd), so the enclosing <w> would still need to be split and linked.
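
To make the linking proposal concrete, here is a rough sketch rather than a worked-out encoding. It assumes the two halves of the split compound are tied together with @xml:id, @next and @prev and flagged with @part (I for initial, F for final). The ids w1 and w2, the helper name joined_compounds and the bare two-pāda fragment are invented for illustration only. The Python part merely shows that a processor can recover the whole compound from the two <l> elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical fragment: each pāda is an <l>; the outer <w> of the compound
# is split at the pāda boundary and the two halves point at each other.
SAMPLE = """
<lg xmlns="http://www.tei-c.org/ns/1.0">
  <l n="a"><w xml:id="w1" part="I" next="#w2"><w>jāla</w></w></l>
  <l n="b"><w xml:id="w2" part="F" prev="#w1"><w>pracaya</w></w></l>
</lg>
""".strip()


def joined_compounds(lg):
    """Return the member words of every compound split across <l> elements."""
    # Index every <w> that carries an xml:id so that pointers can be resolved.
    by_id = {}
    for w in lg.iter(f"{{{TEI_NS}}}w"):
        xml_id = w.get(f"{{{XML_NS}}}id")
        if xml_id:
            by_id[xml_id] = w

    compounds = []
    for w in by_id.values():
        # Start only from the first fragment: it has @next but no @prev.
        if w.get("prev") is not None or w.get("next") is None:
            continue
        members = []
        node = w
        while node is not None:
            # Collect the inner <w> children of this fragment in order.
            members.extend(m.text for m in node.findall(f"{{{TEI_NS}}}w"))
            pointer = node.get("next")
            node = by_id.get(pointer.lstrip("#")) if pointer else None
        compounds.append(members)
    return compounds


compounds = joined_compounds(ET.fromstring(SAMPLE))
print(compounds)
```

With this arrangement the <l> elements stay intact as pādas, and nothing but the pointers needs to cross the boundary.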

I have no practical experience of word tagging so can't really comment on the second problem. But it seems to me that standoff markup is necessary to get a perfect solution. Without standoff markup, the <w> elements could be supplied with @lemma attributes, but where sandhi merges the end of a word with the beginning of the next, one would then inevitably have to truncate either the first or the second word.
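
A minimal sketch of what I mean by standoff, under assumptions of my own: the sandhied text is stored once as a plain string, and each word is a separate record with a lemma and character offsets into that string. The Token class, the offsets and the example tatraiva (tatra + eva) are illustrative and not taken from any existing encoding. The point is that two records may overlap on the merged vowel, so neither word has to be truncated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    lemma: str   # dictionary form, used for indexing
    start: int   # offset of the first character in the sandhied text
    end: int     # offset just past the last character


# Sandhied surface text, stored exactly as it should be displayed.
TEXT = "tatraiva"

# Standoff records: both words claim the merged vowel "ai" (offsets 4-6).
TOKENS = [
    Token(lemma="tatra", start=0, end=6),
    Token(lemma="eva", start=4, end=8),
]


def surface(text, token):
    """Return the stretch of sandhied text covered by one token."""
    return text[token.start:token.end]


def shared_spans(tokens):
    """Return the offset ranges claimed by two consecutive tokens."""
    return [
        (b.start, a.end)
        for a, b in zip(tokens, tokens[1:])
        if b.start < a.end
    ]


overlaps = shared_spans(TOKENS)
print([surface(TEXT, t) for t in TOKENS])
print([TEXT[s:e] for s, e in overlaps])
```

Inline <w> elements cannot overlap in this way, which is why the truncation problem arises with @lemma on <w>.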

Thanks, Dan. I certainly agree: I have always understood that the TEI element <l> maps precisely onto the Sanskritist’s pāda, and hence I was surprised by some of Peter’s reasoning. It would seem preferable to me to stick to this principle and adjust the rest of our encoding principles accordingly.

For an example of a pāda boundary inside a word (i.e. a compound), see three cases in stanzas I and VI here: <http://hisoma.huma-num.fr/exist/apps/EIAD/works/EIAD0186.xml?&odd=teipublisher.odd>. This is quite a common phenomenon, I believe, so you may have been asking about something else...

To change the topic slightly: I was a bit shocked by Andrew’s words "we've basically thrown word boundaries to the wind with sandhi, and if we want to do serious processing that involves word boundaries, it's probably better to create an alternative encoding without metrical structure (as Peter suggests), or not to use TEI at all."

I had thought that one of the main reasons for the creation of this SIG was precisely to try to come up with a smart collective decision on how to encode texts down to the level of <w> (by which, in the Sanskrit context, I mean an element of a compound), so as to allow indexing on this basis while still being able to publish text displayed with proper sandhi.

In my experience in the TEI world so far, making texts searchable once they have been encoded is a major headache, and I have never yet been offered a fully satisfactory solution. But none of the solutions offered to me by people more competent than myself has involved anything other than tagging at <w> level or using the space as a word delimiter (which works well for many languages but is ineffective for Sanskrit).

Nevertheless, I am not yet ready to give up on this ambition.

Best,

Arlo

On 16 June 2018 at 15:28, Balogh Dániel <danbalogh@gmail.com> wrote:

To me it is intuitively obvious (i.e. not necessarily correct, nor necessarily obvious to others) that what matters is the metrical line which is equivalent to a pāda. Typographic lines are a typesetting matter and can be left to styling. Varṇavṛtta stanzas are conceived as catuṣpadī, after all. They divide into four quarters, admittedly with pairs of quarters bound more closely together than the first pair to the second. But they are still pairs of quarters. What I'm trying to say is that the ardhaśloka is a less straightforward or natural unit than the pāda.

As for word boundaries or sandhi transgressing pāda boundaries, can anyone give actual examples of that happening? In, like, more than once in ten thousand cases? I believe I have seen a handful of instances in my life of pādas (ab or cd) joined in vowel sandhi in short and common metres like anuṣṭubh or upajāti (and maybe vasantatilaka). As far as I recall, no more than that, not in any complex metre ever, and no pāda boundary inside a morpheme, ever.

Dan