Dear Andrew and everyone else,

the solution I've come up with for Siddham is to mark up pādas as <l>, and to nest <lg> elements for half-verses, thus for all varṇavṛtta metres:

<lg n="1" met="pṛthvī">
<lg type="halfverse" n="ab">
    <l>pradāna-bhuja-vikkrama-praśama-śāstra-vākyodayair</l>
    <l>uparyyupari-sañcayocchritam aneka-mārggaṃ yaśaḥ</l>
</lg>
<lg type="halfverse" n="cd">
    <l>punāti bhuvana-trayaṃ paśupater jjaṭāntar-guhā-</l>
    <l>nirodha-parimokṣa-śīghram iva pāṇḍu gāṅgaṃ payaḥ</l>
</lg>
</lg>

I feel this method is fully compliant with TEI and it paves the way for typography, giving you the choice to print or not to print a line break at the end of odd pādas, and to add automated | and || punctuation if desired.
The shortcomings that I am aware of are 1) this works best with transliteration (e.g. pāda a would have to end at yai, and ru would have to move to pāda b in an alphasyllabary); and 2) it may interfere with lexical tagging, e.g. if you wanted to wrap all words including compounds in <w>, then the compound spanning c to d is problematic.
As I see things, problem 1 is universal, not restricted to this scheme; those who encode texts in Devanagari or another Indic script just have to do some things differently, and automated conversion between scripts remains tricky with markup. As for problem 2, it can still be handled in TEI if necessary; I am not tagging words in my corpus so I have not looked into linking elements together. In a language where lexical units stretch across pāda boundaries a lot of the time, it may be inconvenient to keep doing this, but it still looks like best practice to me.
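
Purely as an illustration of problem 2, and not something tested on this corpus, a compound split across two pādas could perhaps be tagged as two <w> fragments joined with the TEI @part, @next and @prev attributes. The xml:id values are made up for the example, and only the split compound is wrapped in <w> here:

```xml
<lg type="halfverse" n="cd">
    <!-- part="I" marks the initial fragment of a word, part="F" the final one -->
    <!-- the ids w1a and w1b are hypothetical; next/prev point from one fragment to the other -->
    <l>punāti bhuvana-trayaṃ paśupater <w xml:id="w1a" part="I" next="#w1b">jjaṭāntar-guhā-</w></l>
    <l><w xml:id="w1b" part="F" prev="#w1a">nirodha-parimokṣa-śīghram</w> iva pāṇḍu gāṅgaṃ payaḥ</l>
</lg>
```

This keeps every <w> properly nested inside a single <l>, at the cost of having to reassemble the fragments whenever the whole word is needed.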

Now in metres of the āryā family I mark up only two <l> elements, and each of those is also wrapped in <lg>, like this:

<lg n="42" met="āryā">
<lg type="halfverse" n="ab">
    <l>śaśineva nabho vimalaṃ kaustubha-maṇineva śārṇgiṇo vakṣaḥ|</l>
</lg>
<lg type="halfverse" n="cd">
    <l>bhavana-vareṇa tathedaṃ puram akhilam alaṃkṛtam udāraṃ||</l>
</lg>
</lg>

This leaves the caesura out, which is my choice. I have likewise chosen not to tag the caesura in varṇavṛtta metres, and I feel that the caesura in āryā is more akin to the caesura within a pāda of a catuṣpadī than to the yati at the end of an odd pāda of a catuṣpadī. This is subjective and one could argue differently. If I did want to mark up caesurae then I would use the <caesura> element in both āryā and within pādas of varṇavṛttas for that purpose. This seems to be much easier to work with than using <seg> elements for every colon.
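
A sketch of what that could look like, reusing lines from the two verses above. The placement of the caesurae is an assumption made for the example: after the twelfth mora in the āryā line, and after the eighth syllable in the pṛthvī pāda.

```xml
<!-- āryā: caesura inside the single <l> of a half-verse -->
<l>śaśineva nabho vimalaṃ <caesura/>kaustubha-maṇineva śārṇgiṇo vakṣaḥ|</l>

<!-- pṛthvī: caesura inside one pāda -->
<l>pradāna-bhuja-vikkrama-<caesura/>praśama-śāstra-vākyodayair</l>
```

Because <caesura/> is an empty element, it can sit in the middle of a compound without creating any overlap with word-level markup.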

All best,
Dan

On 2018. 06. 07. 15:17, Andrew Ollett wrote:

Dear colleagues,

I am in the midst of a workshop in which we are attempting to encode texts in Old Javanese in TEI format, and the issue of encoding pādas has (once again) reared its head. We discussed this issue at length in the context of the SARIT project, and we came to the conclusion that <l> should be used for a pair of pādas, and the boundary between even and odd pādas should be represented by the <caesura/> element. Hence the following vasantatilaka verse:

<lg n="3" met="vasantatilaka">
<l>vvantən mañumbana puḍak ginuritnya pārtha<caesura/>
ndān susvasusvani kinolnya hanan liniṅliṅ</l>
<l>rakryan vədinta tan akun ləvu paṅhavista<caesura/>
heman kitābapa niragraha māsku liṅnya</l>
</lg>

This is somewhat contrary to what many people would expect, namely, that each pāda should correspond to a single <l> element, as follows:

<lg n="3" met="vasantatilaka">
<l>vvantən mañumbana puḍak ginuritnya pārtha</l>
<l>ndān susvasusvani kinolnya hanan liniṅliṅ</l>
<l>rakryan vədinta tan akun ləvu paṅhavista</l>
<l>heman kitābapa niragraha māsku liṅnya</l>
</lg>

My arguments for the use of <caesura/> involved (a) the practical necessity of encoding texts from printed editions, where the pādas are not separated typographically in all cases, especially in shorter verse forms, and thus (b) the requirement that <l> should mean the same thing for an anuṣṭubh verse as for (e.g.) a śakvarī verse, i.e., it should not refer to a pādayuga in the first case, and a pāda in the second case; and (c) the frequent occurrence, in Sanskrit, of words that span the boundaries between odd and even pādas, and the undesirability of having structural elements like <l> overlap with the grammatical structure of the text (at least at the level of the word). The use of <caesura/> would be optional: it's not required (and often isn't marked typographically in shorter verse forms), but if it is present, the stylesheets will insert a space.
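
A minimal sketch of such a stylesheet rule, assuming XSLT 1.0 and the standard TEI namespace. This is only an illustration of the idea, not the actual SARIT stylesheet:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Hypothetical rule: output a single space wherever a caesura is encoded -->
  <xsl:template match="tei:caesura">
    <xsl:text> </xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Where no <caesura/> is encoded, the rule simply never fires, which matches the optional status of the element described above.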

But I can now think of counterarguments for all of these points, and in some ways, it might be easier if <l> always meant a "pāda." (<caesura/> also doesn't have a @type attribute in standard TEI, so it might be more difficult than I expected to differentiate this "pāda-boundary caesura" from the pāda-internal yati.) So I am asking everyone whether there are compelling reasons you've discovered for preferring one encoding solution over another. (Or if you have other suggestions altogether, including the use of <seg> or other such elements.) I know that there are some features that vary across Indic languages, such as the coincidence of metrical and grammatical (esp. lexical) boundaries: these structures always coincide in Old Javanese, and almost never in Kannada, so I am hoping to avoid the problem of overlapping hierarchies completely.

Andrew
_______________________________________________
indic-texts mailing list
indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts