Dear all,

Lines in Sanskrit verse are strings of text followed by dandas in printed editions and manuscripts. Yes, the line is a graphical unit, but it is also a well-recognised sub-unit of the verse, larger than the pAda. If one uses the l-element for the pAda, then one should devise a way of indicating, by an attribute, whether that pAda is the first or second in the line.

Regarding word analysis, accepting compounds as words and marking their components with the m-element, rather than nesting w-elements, seems most natural. m-elements can be nested for deeper analysis. This morphological analysis should certainly be done separately (in a separate file) from the higher-level text structure markup.

I'll demonstrate how I mark up at the WSC in Vancouver in a few weeks.

Yours,
Peter
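As a rough illustration of the two suggestions above (a sketch only; Peter's actual markup is not shown in this thread): the first fragment uses <l> for each pāda and records its position within the line in @n. The values "1a", "1b" and so on are a hypothetical convention, and another attribute could serve equally well. The second fragment treats a compound as a single <w> with <m> components. The compound is borrowed from the Kannada example quoted further down, the inner grouping of ādi and svara is assumed purely to show nesting, and whether <m> may nest inside <m> should be checked against the TEI schema in use.

```xml
<!-- Fragment 1: one <l> per pāda; the @n values are a hypothetical convention -->
<lg type="verse">
  <l n="1a"><!-- first pāda of line 1 --></l>
  <l n="1b"><!-- second pāda of line 1 --></l>
  <l n="2a"><!-- first pāda of line 2 --></l>
  <l n="2b"><!-- second pāda of line 2 --></l>
</lg>

<!-- Fragment 2 (kept in a separate file): compound as one <w>, components as <m> -->
<w lemma="ādisvarapada"><m><m>ādi</m><m>svara</m></m><m>padaṁ</m></w>
```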
On 16 Jun 2018, at 9:45 AM, Andrew Ollett wrote:

I share Dan and Arlo's intuition regarding the equivalence of <l> and a pāda. The problem is just that we can't apply those intuitions consistently without a lot of extra work. This particular inconsistency is probably not terrible (in processing we can always make a distinction between <lg> with two children and <lg> with four children), but if we wanted to solve it, I think a milestone element would do the trick.
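A minimal sketch of the milestone idea, assuming <l> is kept for the line (half-verse) and the pāda boundary inside it is marked with an empty element. The unit value "pada" is an assumed label, not an established convention; TEI's <caesura/> would be another candidate.

```xml
<lg type="verse">
  <l n="1">
    <!-- text of pāda 1 -->
    <milestone unit="pada" n="2"/>
    <!-- text of pāda 2 -->
  </l>
  <l n="2">
    <!-- text of pāda 3 -->
    <milestone unit="pada" n="4"/>
    <!-- text of pāda 4 -->
  </l>
</lg>
```

Encoded this way, every <lg> has two <l> children, and a processor that needs pādas can split each <l> at its milestone.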
There are two problems regarding metrical and word boundary coincidence. The first is more general: since the pāda boundary often falls in the seam of a compound, you can't use <w> for compounds in such cases, because your metrical element will end before its child element, a word, is finished. The second is specific to Dravidian languages (which are "Indic" for our purposes): metrical structure and word structure are much more independent than in Sanskrit. We have ci // tram, vi // cāra-, en // dum, ā // ṟaneya, a // tta, usually between odd and even pādas, but sometimes also between even and odd pādas.
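To make the first problem concrete, here is a sketch using ci // tram from the examples above. The word is written as citram, the rest of the line is left out, and the empty milestone with the unit value "pada" is an assumed device, not an established convention.

```xml
<!-- Not well-formed: <w> opens in one <l> and closes in the next -->
<!-- <l>... <w>ci</l> <l>tram</w> ...</l> -->

<!-- Well-formed alternative: one <l> for the line, boundary marked inside the word -->
<l>
  <!-- preceding text of the odd pāda -->
  <w>ci<milestone unit="pada"/>tram</w>
  <!-- remaining text of the even pāda -->
</l>
```

This only helps when the break falls between an odd and an even pāda. A word broken between an even and an odd pāda would still cross an <l> boundary.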
I of course believe that word analysis is something that we definitely want. But I don't think it's something we can do programmatically at the moment, and even if we do it manually, I completely agree with Dan that standoff markup (i.e., have the unanalyzed text in one place, and the analyzed text linked to it in another place) is the only way to avoid really nasty problems of OHCO violations, vowel sandhi, and so on. In principle one could always find a solution (the BHELA conventions sometimes used on GRETIL are a nice way of specifying how vowel sandhi is to be "undone") but in general I don't think we should worry too much about word analysis at the stage of applying structural markup to texts. Here is my "alternate encoding" (using TEI) for the above verse, just to show that I have not entirely thrown word analysis to the wind:
<s>
  <w lemma="ādisvarapada">
    <w lemma="ādi" ana="#ibc">ādi</w>
    <w lemma="svara" ana="#ibc">svara</w>
    <w lemma="pada">padaṁ</w>
  </w>
  <w lemma="anta" ana="#sing #loc">antadoḷ</w>
  <w lemma="āgu" ana="#p #adjp">āda</w>
  <w lemma="eḍe" ana="#sing #loc">eḍeyoḷ</w>
  <w lemma="āgu" ana="#habit">akkuṁ</w>
  <w lemma="eraḍaneya" ana="#adn">eraḍaneya</w>
  <w lemma="vibhaktyādānapada">
    <w lemma="vibhakti" ana="#ibc">vibhakti</w>
    <w lemma="ādāna" ana="#ibc">ādāna</w>
    <w lemma="pada" ana="#sing #nom">padaṁ</w>
  </w>
  <w lemma="dīrgha" ana="#sing #nom">dīrghaṁ</w>
  <w lemma="pādānta">
    <w lemma="pāda" ana="#ibc">pāda</w>
    <w lemma="anta" ana="#sing #loc">antadoḷ</w>
  </w>
  <w lemma="uḻi" ana="#p #adjp">uḻida</w>
  <w lemma="tāṇa" ana="#sing #loc">tāṇadoḷ</w>
  <w lemma="svacchanda" ana="#sing #nom">svacchandaṁ</w>
</s>
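A sketch of how an analysis like this could be kept apart from the running text, assuming the two are linked with @corresp. The identifier v1 is hypothetical, the verse text itself is left out, and only the first few words of the analysis are repeated.

```xml
<!-- Part 1: structural markup only, text left unanalyzed -->
<lg xml:id="v1" type="verse">
  <!-- <l> elements holding the running text, sandhi intact -->
</lg>

<!-- Part 2: word analysis pointing back at the verse -->
<s corresp="#v1">
  <w lemma="ādisvarapada">
    <w lemma="ādi" ana="#ibc">ādi</w>
    <w lemma="svara" ana="#ibc">svara</w>
    <w lemma="pada">padaṁ</w>
  </w>
  <w lemma="anta" ana="#sing #loc">antadoḷ</w>
  <!-- remaining words as in the analysis above -->
</s>
```

If the analysis lives in a separate file, the pointer would need the file name as well (for example a hypothetical verses.xml#v1). The unanalyzed text never has to accommodate word boundaries, so pāda and word structure cannot collide.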
_______________________________________________ indic-texts mailing list indic-texts@lists.tei-c.org http://lists.lists.tei-c.org/mailman/listinfo/indic-texts