Thanks Dan. I agree: I have always understood that the TEI element <l> maps precisely onto the Sanskritist's pāda, and so I was surprised by some of Peter's reasoning. It would seem preferable to me to stick to this principle and adjust the rest of our encoding principles accordingly.
For examples of a pāda boundary inside a word (i.e. a compound), see the three cases in stanzas I and VI here: <http://hisoma.huma-num.fr/exist/apps/EIAD/works/EIAD0186.xml?&odd=teipublisher.odd>. This is quite a common phenomenon, I believe, so you may have been asking about something else...

To change the topic slightly: I was a bit shocked by Andrew’s words « we've basically thrown word boundaries to the wind with sandhi, and if we want to do serious processing that involves word boundaries, it's probably better to create an alternative encoding without metrical structure (as Peter suggests), or not to use TEI at all. » 
I had thought that one of the main reasons for the creation of this SIG was precisely to try to come up with a smart collective decision on how to encode texts down to the level of <w> (by which, in the Sanskrit context, I mean an element of a compound), to allow indexing on this basis while still being able to publish text displayed with proper sandhi.
In my experience in the TEI world so far, making texts searchable once they have been encoded is a major headache, and I have never yet been offered a fully satisfactory solution. But none of the solutions that people more competent than myself have offered me has ever involved anything other than tagging at <w> level or using the space as word delimiter (which works well for many languages but is ineffective for Sanskrit).
Nevertheless, I am not yet ready to give up on this ambition.

Best,

Arlo


On 16 June 2018 at 15:28, Balogh Dániel <danbalogh@gmail.com> wrote:

To me it is intuitively obvious (i.e. not necessarily correct, nor necessarily obvious to others) that what matters is the metrical line which is equivalent to a pāda. Typographic lines are a typesetting matter and can be left to styling. Varṇavṛtta stanzas are conceived as catuṣpadī, after all. They divide into four quarters, admittedly with pairs of quarters bound more closely together than the first pair to the second. But they are still pairs of quarters. What I'm trying to say is that the ardhaśloka is a less straightforward or natural unit than the pāda.

As for word boundaries or sandhi transgressing pāda boundaries, can anyone give actual examples of that happening? In, like, more than once in ten thousand cases? I believe I have seen a handful of instances in my life of pādas (ab or cd) joined in vowel sandhi in short and common metres like anuṣṭubh or upajāti (and maybe vasantatilaka). As far as I recall, no more than that, not in any complex metre ever, and no pāda boundary inside a morpheme, ever.

Dan


On 2018. 06. 16. 9:20, Andrew Ollett wrote:
Dear Peter (and colleagues),

I notice that, between the people who have weighed in on this matter, we have exhausted every possible alternative for this particular markup question that the TEI makes available. Just one question: by "lines" you surely mean typographic lines, right? Otherwise I am not sure I follow, since if we thought of a Sanskrit equivalent for a metrical line, it would surely be a pāda. (I ask because I take this for granted but the meaning of "line" is precisely what is at issue here.)

With regard to the first major problem that we face in our world, and which other traditions encounter rarely if at all, namely the lack of coincidence between metrical and word boundaries, I suppose we shouldn't worry about it. After all, we've basically thrown word boundaries to the wind with sandhi, and if we want to do serious processing that involves word boundaries, it's probably better to create an alternative encoding without metrical structure (as Peter suggests), or not to use TEI at all. In any case, all of the solutions offered here are equally good (or bad) from this point of view.

With regard to the second such problem, the lack of coincidence between typographic and metrical lines, I am really not sure. I do not see it as practical to encode anuṣṭubh and āryā verses with four obligatory <l> or <seg> elements. But otherwise there will be a major difference between shorter and longer verse-forms in terms of their encoding, which we don't want. I realize that <caesura/> has problems, but I think an optional milestone element is really what we want in this context. Maybe just <milestone type="pada-boundary"/> would work. With regard to words breaking across pāda boundaries, we could either process it implicitly, based on the adjoining text, as in Charles' example, or explicitly, with @break (as we do with <lb/> and <pb/> elements).
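A minimal sketch of how such an optional milestone might be consumed downstream, in Python with the standard library only. The sample string, the function name split_padas, and the omission of the TEI namespace are assumptions for illustration; this is not taken from any project's stylesheets.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample (TEI namespace omitted for brevity): one <l> holding
# two pādas, with the boundary marked by an optional milestone.
SAMPLE = (
    '<l>vvantən mañumbana puḍak ginuritnya pārtha'
    '<milestone type="pada-boundary"/>'
    'ndān susvasusvani kinolnya hanan liniṅliṅ</l>'
)

def split_padas(l_elem):
    """Return the pāda strings of an <l>, cut at pada-boundary milestones."""
    padas = [l_elem.text or ""]
    for child in l_elem:
        if child.tag == "milestone" and child.get("type") == "pada-boundary":
            # Start a new pāda with the text that follows the milestone.
            padas.append(child.tail or "")
        else:
            # Any other inline element stays inside the current pāda.
            padas[-1] += "".join(child.itertext()) + (child.tail or "")
    return [p.strip() for p in padas]
```

An <l> without any milestone simply comes back as a one-item list, so the marker stays optional, as suggested above.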

Andrew


2018-06-08 17:41 GMT+02:00 Peter Scharf <scharf@sanskritlibrary.org>:
If one tags words within a verse, one has to dissolve sandhi, which destroys the meter and makes the text no longer metrical, i.e. no longer a verse. For this reason, I do word tagging in a separate file from the one that leaves the sandhi of the verse undisturbed, and coordinate it with the verse file by identical xml:id values.
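The coordination of two such files could be sketched roughly as follows, in Python with the standard library. Both XML fragments, the id "v1c", the element used to hold the word analysis, and the function names are hypothetical, and the sandhi-dissolved forms are illustrative only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical verse file: sandhi left intact.
VERSE_FILE = '<lg><l xml:id="v1c">jagad idam abhiṣiktan</l></lg>'
# Hypothetical word file: sandhi dissolved, same xml:id value.
WORD_FILE = (
    '<body><seg xml:id="v1c">'
    '<w>jagat</w><w>idam</w><w>abhiṣiktam</w>'
    '</seg></body>'
)

def index_by_id(root):
    """Map each xml:id value to the element that carries it."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}

def align(verse_root, word_root):
    """Pair each metrical line with the word list sharing its xml:id."""
    verses = index_by_id(verse_root)
    words = index_by_id(word_root)
    return {
        key: ("".join(verses[key].itertext()),
              [w.text for w in words[key].iter("w")])
        for key in verses if key in words
    }
```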
Yours,
Peter

******************************
Peter M. Scharf, President
The Sanskrit Library
******************************

On 7 Jun. 2018, at 11:38 PM, Arlo Griffiths <arlo.griffiths@efeo.net> wrote:

Dear all,

I am in agreement with Andrew and Patrick. The system proposed by Andrew is the one we have implemented on EIAD, where most texts have been tagged at the <w> level.

See e.g. the first stanza from EIAD 186:

<lg met="mālinī" n="I">
<l n="a">
<lb n="1"/>
<g type="dextrorotatory-spiral"/>
<w>jayati</w>
<w>munir</w>
<w part="I">udagrakhyātacandrāṁśujāla</w>
</l>
<l n="b">
<w part="F">pracayarucirakī<lb n="2" break="no"/>rttiśrīr</w>
<w>ajeyasya</w>
<w>yasya</w>
</l>
<l n="c">
<w>jagad</w>
<w>idam</w>
<w>abhiṣiktan</w>
<w>dakṣiṇāṁmbhobhir</w>
<w part="I">u<lb n="3" break="no"/>ccaiḥ</w>
</l>
<l n="d">
<w part="F">kṣubhitasalilanāthasparddhibhir</w>
<w>mmārasainyaiḥ</w>
<pc>||</pc>
</l>
</lg>

See <http://hisoma.huma-num.fr/exist/apps/EIAD/works/EIAD0186.xml?&odd=teipublisher.odd>. In our system, the layout is indeed based upon a test for the value of @met.
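For indexing, words split across <l> elements with @part can be stitched back together. The following is a rough sketch only, in Python with the standard library; it is not the EIAD code. The sample is the first half of the stanza above with the TEI namespace omitted, and the function name whole_words is made up.

```python
import xml.etree.ElementTree as ET

# First two pādas of the stanza quoted above (namespace omitted for brevity).
STANZA = """<lg met="mālinī" n="I">
<l n="a"><w>jayati</w><w>munir</w><w part="I">udagrakhyātacandrāṁśujāla</w></l>
<l n="b"><w part="F">pracayarucirakī<lb n="2" break="no"/>rttiśrīr</w>
<w>ajeyasya</w><w>yasya</w></l>
</lg>"""

def whole_words(lg):
    """Return complete words, rejoining <w part="I|M|F"> fragments."""
    out, pending = [], ""
    for w in lg.iter("w"):
        # itertext() drops the empty <lb/> but keeps the text on either side.
        text = "".join(w.itertext())
        part = w.get("part")
        if part in ("I", "M"):      # initial or medial fragment: keep collecting
            pending += text
        elif part == "F":           # final fragment: emit the joined word
            out.append(pending + text)
            pending = ""
        else:                       # unfragmented word
            out.append(text)
    return out
```

Here whole_words(ET.fromstring(STANZA)) yields five items, the third being the compound that runs from pāda a into pāda b.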

Best wishes,

Arlo



On 7 June 2018 at 23:04, Patrick McAllister <patrick.mcallister@oeaw.ac.at> wrote:

Dear list members,

I've found the alternative solution proposed by Andrew to serve most
of my purposes (one important one being ease of use):

       <lg n="3" met="vasantatilaka">
         <l>vvantən mañumbana puḍak ginuritnya pārtha</l>
         <l>ndān susvasusvani kinolnya hanan liniṅliṅ</l>
         <l>rakryan vədinta tan akun ləvu paṅhavista</l>
         <l>heman kitābapa niragraha māsku liṅnya</l>
       </lg>


All else being equal (problems with word boundaries across markup etc.),
I find it easier to think of all pādas as being on the same level and of
the same type: that’s why I prefer tei:l elements for them.  Introducing
tei:seg elements or additional (nested) tei:lg seems to complicate
things, and I’m not sure which problems they are supposed to solve.  If
I know that a verse is supposed to consist of, say, 4 pādas (by looking
at the @met attribute), I’d expect to just have those 4 items on the
same level, and immediately under the tei:lg.

All typesetting problems are easily solved by looking at the @met
attribute, and perhaps also considering the number of tei:l elements
immediately under a tei:lg.  In SARIT, as Andrew has pointed out, the
encoding is usually 2 pādas per tei:l (but not always if I remember
correctly).
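A minimal sketch of such a layout test, in Python with the standard library. The set TWO_PER_LINE, the function name printed_lines, and the placeholder anuṣṭubh verse are assumptions for illustration; a real stylesheet would more likely be XSLT.

```python
import xml.etree.ElementTree as ET

# Assumption: metres whose pādas are printed two to a line. Hypothetical set.
TWO_PER_LINE = {"anuṣṭubh"}

def printed_lines(lg):
    """Group the <l> children of an <lg> into typographic lines."""
    padas = ["".join(l.itertext()).strip() for l in lg.findall("l")]
    # Look at @met, and also at the number of <l> elements directly under <lg>.
    per_line = 2 if lg.get("met") in TWO_PER_LINE and len(padas) == 4 else 1
    return [" ".join(padas[i:i + per_line])
            for i in range(0, len(padas), per_line)]

LONG = ET.fromstring(
    '<lg n="3" met="vasantatilaka">'
    '<l>vvantən mañumbana puḍak ginuritnya pārtha</l>'
    '<l>ndān susvasusvani kinolnya hanan liniṅliṅ</l>'
    '<l>rakryan vədinta tan akun ləvu paṅhavista</l>'
    '<l>heman kitābapa niragraha māsku liṅnya</l>'
    '</lg>'
)
# Placeholder pādas, not a real verse.
SHORT = ET.fromstring('<lg met="anuṣṭubh"><l>a</l><l>b</l><l>c</l><l>d</l></lg>')
```

With these samples the vasantatilaka verse comes out as four printed lines and the placeholder anuṣṭubh as two.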

It all depends, of course, on the purpose of the text being encoded (I
haven’t done any automated metrical analysis, for example, but imagine
I’d strip out most of the markup first in any case and wouldn’t want
more sophisticated markup).

Best wishes,

On Thu, Jun 07 2018, Peter Scharf wrote:

I use the seg-element to mark pAdas

 <lg type="upendravajrA/indravajrA/indravajrA/upendravajrA" ana="upajAti" met="jtjgg/ttjgg/ttjgg/jtjgg" n="93">
  <l>
   <seg type='foot'>anantarodIritalakzmaBAjO</seg>
   <seg type='foot'>pAdO yadIyAv-upajAtayas-tAH</seg>
  </l>
  <l>
   <seg type='foot'>itTaM kilAnyAsv-api miSritAsu</seg>
   <seg type='foot'>vadanti2 jAtizv-idam-eva nAma ..93..</seg>
  </l>
 </lg>

I use the space element if there is a word break between pAdas in anuzwuB meter (where there often is not), though I see no harm in using the caesura element there.  There are however often caesurae even where there is no pAda break, so if consistency is the issue, the seg element seems to me to be the best solution.

Yours,
Peter

******************************
Peter M. Scharf, President
The Sanskrit Library
scharf@sanskritlibrary.org
http://sanskritlibrary.org
******************************

On 7 Jun. 2018, at 9:26 AM, Balogh Dániel <danbalogh@gmail.com> wrote:

Dear Andrew and everyone else,

the solution I've come up with for Siddham is to mark up pādas as <l>, and to nest <lg> elements for half-verses, thus for all varṇavṛtta metres:
<lg n="1" met="pṛthvī">
 <lg type="halfverse" n="ab">
   <l>pradāna-bhuja-vikkrama-praśama-śāstra-vākyodayair</l>
   <l>uparyyupari-sañcayocchritam aneka-mārggaṃ yaśaḥ</l>
 </lg>
 <lg type="halfverse" n="cd">
   <l>punāti bhuvana-trayaṃ paśupater jjaṭāntar-guhā-</l>
   <l>nirodha-parimokṣa-śīghram iva pāṇḍu gāṅgaṃ payaḥ</l>
 </lg>
</lg>

I feel this method is fully compliant with TEI and it paves the way for typography, giving you the choice to print or not to print a line break at the end of odd pādas, and to add automated | and || punctuation if desired.
The shortcomings that I am aware of are 1) this works best with transliteration (e.g. pāda a would have to end at yai, and ru would have to move to pāda b in an alphasyllabary); and 2) it may interfere with lexical tagging, e.g. if you wanted to wrap all words including compounds in <w>, then the compound spanning b to c is problematic.
As I see things, problem 1 is universal, not restricted to this scheme; those who encode texts in Devanagari or another Indic script just have to do some things differently, and automated conversion between scripts remains tricky with markup. As for problem 2, it can still be handled in TEI if necessary; I am not tagging words in my corpus so I have not looked into linking elements together. In a language where lexical units stretch across pāda boundaries a lot of the time, it may be inconvenient to keep doing this, but it still looks like best practice to me.
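The typographic side of this could be sketched as below, in Python with the standard library. This is an illustration under stated assumptions, not the Siddham stylesheet: the function name render is made up, the TEI namespace is omitted, and odd and even pādas are simply joined with a space.

```python
import xml.etree.ElementTree as ET

# The pṛthvī verse quoted above (namespace omitted for brevity).
VERSE = ET.fromstring(
    '<lg n="1" met="pṛthvī">'
    '<lg type="halfverse" n="ab">'
    '<l>pradāna-bhuja-vikkrama-praśama-śāstra-vākyodayair</l>'
    '<l>uparyyupari-sañcayocchritam aneka-mārggaṃ yaśaḥ</l>'
    '</lg>'
    '<lg type="halfverse" n="cd">'
    '<l>punāti bhuvana-trayaṃ paśupater jjaṭāntar-guhā-</l>'
    '<l>nirodha-parimokṣa-śīghram iva pāṇḍu gāṅgaṃ payaḥ</l>'
    '</lg>'
    '</lg>'
)

def render(lg):
    """One printed line per half-verse, with automated | and || added."""
    halves = lg.findall("lg[@type='halfverse']")
    lines = []
    for i, half in enumerate(halves):
        padas = ["".join(l.itertext()).strip() for l in half.findall("l")]
        # Single daṇḍa after every half-verse except the last, double after the last.
        danda = "||" if i == len(halves) - 1 else "|"
        lines.append(" ".join(padas) + " " + danda)
    return lines
```

Printing a line break after odd pādas instead would only mean replacing the space in the join with a newline.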

Now in metres of the āryā family I mark up only two <l> elements, and each of those is also wrapped in <lg>, like this:

<lg n="42" met="āryā">
 <lg type="halfverse" n="ab">
   <l>śaśineva nabho vimalaṃ kaustubha-maṇineva śārṇgiṇo vakṣaḥ|</l>
 </lg>
 <lg type="halfverse" n="cd">
   <l>bhavana-vareṇa tathedaṃ puram akhilam alaṃkṛtam udāraṃ||</l>
 </lg>
</lg>

This leaves the caesura out, which is my choice. I have likewise chosen not to tag the caesura in varṇavṛtta metres, and I feel that the caesura in āryā is more akin to the caesura within a pāda of a catuṣpadī than to the yati at the end of an odd pāda of a catuṣpadī. This is subjective and one could argue differently. If I did want to mark up caesurae then I would use the <caesura> element in both āryā and within pādas of varṇavṛttas for that purpose. This seems to be much easier to work with than using <seg> elements for every colon.

All best,
Dan

On 2018. 06. 07. 15:17, Andrew Ollett wrote:
Dear colleagues,

I am in the midst of a workshop in which we are attempting to encode texts in Old Javanese in TEI format, and the issue of encoding pādas has (once again) reared its head. We discussed this issue at length in the context of the SARIT project, and we came to the conclusion that <l> should be used for a pair of pādas, and the boundary between even and odd pādas should be represented by the <caesura/> element. Hence the following vasantatilaka verse:

       <lg n="3" met="vasantatilaka">
         <l>vvantən mañumbana puḍak ginuritnya pārtha<caesura/>
            ndān susvasusvani kinolnya hanan liniṅliṅ</l>
         <l>rakryan vədinta tan akun ləvu paṅhavista<caesura/>
            heman kitābapa niragraha māsku liṅnya</l>
       </lg>

This is somewhat contrary to what many people would expect, namely, that each pāda should correspond to a single <l> element, as follows:

       <lg n="3" met="vasantatilaka">
         <l>vvantən mañumbana puḍak ginuritnya pārtha</l>
         <l>ndān susvasusvani kinolnya hanan liniṅliṅ</l>
         <l>rakryan vədinta tan akun ləvu paṅhavista</l>
         <l>heman kitābapa niragraha māsku liṅnya</l>
       </lg>

My arguments for the use of <caesura/> involved (a) the practical necessity of encoding texts from printed editions, where the pādas are not separated typographically in all cases, especially in shorter verse forms, and thus (b) the requirement that <l> should mean the same thing for an anuṣṭubh verse as for (e.g.) a śakvarī verse, i.e., it should not refer to a pādayuga in the first case, and a pāda in the second case; and (c) the frequent occurrence, in Sanskrit, of words that span the boundaries between odd and even pādas, and the undesirability of having structural elements like <l> overlap with the grammatical structure of the text (at least at the level of the word). The use of <caesura/> would be optional: it's not required (and often isn't marked typographically in shorter verse forms), but if it is present, the stylesheets will insert a space.

But I can now think of counterarguments for all of these points, and in some ways, it might be easier if <l> always meant a "pāda." (<caesura/> also doesn't have a @type attribute in standard TEI, so it might be more difficult than I expected to differentiate this "pāda-boundary caesura" from the pāda-internal yati.) So I am asking everyone whether there are compelling reasons you've discovered for preferring one encoding solution over another. (Or if you have other suggestions altogether, including the use of <seg> or other such elements.) I know that there are some features that vary across Indic languages, such as the coincidence of metrical and grammatical (esp. lexical) boundaries: these structures always coincide in Old Javanese, and almost never in Kannada, so I am hoping to avoid the problem of overlapping hierarchies completely.

Andrew


_______________________________________________
indic-texts mailing list
indic-texts@lists.tei-c.org <mailto:indic-texts@lists.tei-c.org>
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts <http://lists.lists.tei-c.org/mailman/listinfo/indic-texts>

--
Patrick McAllister

Email: patrick.mcallister@oeaw.ac.at
Phone: + 43 1 51581 6423

Institute for the Cultural and Intellectual History of Asia (IKGA)
Austrian Academy of Sciences
Hollandstraße 11-13, 2nd floor
1020 Vienna, Austria

http://www.ikga.oeaw.ac.at/