Hi Patrick,
Thanks for the interesting question. In the past few months, I've been
experimenting with writing my own XSL stylesheets for extracting text and
structural information from the SARIT XML files for some natural language
processing purposes. My personal wish, as an only half-savvy user of these
things, is that the @(sub)type and @n attributes had indeed been used more,
if only as human-readable labels, i.e., regardless of whether the SARIT
system itself needed them. That is, I find the way the <div>s are encoded
in the Nyāyamañjarī file ...
<div n="1" type="level1" subtype="volume">
<div n="1" type="level2" subtype="āhnika">
<div n="2" type="level2" subtype="āhnika">
<div n="3" type="level2" subtype="āhnika">
etc.
...to be really clear and easy to work with for my purposes. By contrast,
I find the way the <div>s are encoded in the Pramāṇavārttikālaṅkāra file
(mostly just simple <div> elements, i.e., no attributes, with immediately
following <head> elements) to be much harder to make sense of or to use
at scale.
Fortunately, after considerable effort spent on understanding XML and XSLT,
I seem to be having gradual success with automatically reading out the
nesting structure of <div>s using XSL transforms (I'm sure there are
better ways, too; suggestions welcome!), but the point is that I was
surprised that I needed to do so.
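As an alternative to XSLT, the nesting of attribute-less <div>s can also be
read out with a short script. The following is only a minimal Python sketch
using the standard library; the sample document and the function name
`outline` are made up for illustration and are not taken from any SARIT file.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample: attribute-less <div>s, each followed by a <head>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div><head>First chapter</head>
    <div><head>Section A</head><p>...</p></div>
    <div><head>Section B</head><p>...</p></div>
  </div>
  <div><head>Second chapter</head></div>
</body></text></TEI>"""


def outline(element, depth=0):
    """Yield (depth, head text) for every <div> nested below element."""
    for child in element:
        if child.tag == f"{{{TEI_NS}}}div":
            head = child.find(f"{{{TEI_NS}}}head")
            label = "".join(head.itertext()).strip() if head is not None else ""
            yield depth + 1, label
            yield from outline(child, depth + 1)
        else:
            # Descend through non-div elements without counting them as levels.
            yield from outline(child, depth)


body = ET.fromstring(SAMPLE).find(f".//{{{TEI_NS}}}body")
for level, label in outline(body):
    print("  " * (level - 1) + f"level{level}: {label}")
```

The depth is derived purely from how many <div> ancestors an element has, so
the same function works whether or not @type or @n are present.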
So, in sum, since nothing currently seems to depend on the @(sub)type or @n
attributes, I suggest encoding traditional nomenclature and numbering
there, for use in unexpected ways by others who will try to use SARIT data
in the future. In other words, I think I agree with your own feeling that
Sanskrit technical terms should not be *relied upon* for encoding text
divisions, but it would be nice if they weren't totally discarded (and
section numbering, of whatever provenance, is nice too). This way, no
grand concessions need to be made for such information in the TEI
guidelines, either.
Just my two cents, as I'm gradually learning about all these things. I'm
very interested to hear responses. And cheers for all the good work!
Tyler
On Wed, Oct 10, 2018 at 1:08 PM Patrick McAllister wrote:
Dear list members,
I’m currently thinking about the encoding of sections (mainly tei:div elements) for the documents in the SARIT collection. Many of those documents are outside my field of expertise, and so I’m having a hard time with this.
We have quite a large variety of section @types that either directly use Sanskrit terminology or use terminology that might (IMO) be based on it. They are:
adhikaraṇa adhikāra adhyāya canto closing commentary commentary1 commentary2 conclusion kārika kārikā pariccheda prakāśa pāda samuddeśa sutra sūtra sūtra_with_bhāṣya
It’s obviously a bit of a mess. I was wondering if anyone has a principled approach to these naming schemes in general? I at least have only very limited knowledge of the traditional sectioning of texts in Sanskrit literature as a whole, and am not sure what the aim should be here.
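One way to get an overview of such variation is to tally the @type values actually present and group spellings that differ only in diacritics. What follows is a minimal Python sketch, assuming TEI-namespaced files; the sample markup and the helper names `strip_diacritics` and `type_variants` are hypothetical, and folding diacritics is meant only for spotting candidates, not for renaming anything.

```python
import unicodedata
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample with two pairs of near-identical @type spellings.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div type="sūtra"/><div type="sutra"/><div type="kārikā"/>
  <div type="kārika"/><div type="sūtra"/><div/>
</body></text></TEI>"""


def strip_diacritics(value):
    """Fold e.g. 'sūtra' and 'sutra' to the same key (for comparison only)."""
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def type_variants(root):
    """Group the @type values found on <div>s by their diacritic-free key."""
    counts = Counter(
        div.get("type") for div in root.iter(f"{{{TEI_NS}}}div") if div.get("type")
    )
    groups = defaultdict(dict)
    for value, count in counts.items():
        groups[strip_diacritics(value)][value] = count
    # Keep only keys under which more than one spelling occurs.
    return {key: spellings for key, spellings in groups.items() if len(spellings) > 1}


variants = type_variants(ET.fromstring(SAMPLE))
for key, spellings in variants.items():
    print(key, spellings)
```

A report like this only flags spellings for review; whether two values really name the same kind of section remains an editorial decision.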
Currently the SARIT guidelines ( https://github.com/sarit/SARIT-corpus/blob/a355567b5bc61032e1d7fa39392f5e06d... ) are rather cautious about the usage of these terms:
######
Sections of the text
════════════════════
There are many Sanskrit words for sections of a text: sarga, adhyāya, aṅka, pariccheda, ucchvāsa, etc. The encoding of these sections should meet several requirements:
• the XML document itself should be valid;
• the structure of the XML document reflects the logical structure of the text;
• a standard reference system should be able to use the structure of the XML document as a proxy for the structure of the text;
• texts in the corpus are broadly consistent in the encoding strategy used for these sections.
These considerations lead us to recommend the use of div for all "parts," "sections," and "divisions" in the text, whatever their Sanskrit name is, and at whatever depth they occur. Do /not/ use the numbered divisions available in earlier versions of the TEI Guidelines (`<div1 xmlns="http://www.tei-c.org/ns/Examples"/>', `<div2 xmlns="http://www.tei-c.org/ns/Examples"/>', etc.).
Before encoding a text, you should figure out a strategy for representing all of the relevant levels of the text as div elements. Some Mīmāṃsā texts, for example, are organized according to the hierarchical organization of the Mīmāṃsā Sūtras into adhyāyas, pādas, and adhikaraṇas. The first div beneath the body element (body/div) will thus correspond to an adhyāya, the first div below this to a pāda (body/div/div), and the first div below this (body/div/div/div) to an adhikaraṇa. If this schema is applied consistently, there is no need for assigning a type to the div elements themselves (e.g., `<div xmlns="http://www.tei-c.org/ns/Examples" type="adhyāya"/>', `<div xmlns="http://www.tei-c.org/ns/Examples" type="pāda"/>', `<div xmlns="http://www.tei-c.org/ns/Examples" type="adhikaraṇa"/>'), but these type attributes may be included in order to make the XML easier to read.
The strategy followed in the text can be, and should be, documented in the reference declaration (refsDecl), which is part of the encoding description (encodingDesc) in the TEI Header (see above).
1 Labelling sections
────────────────────
Sections of the text should be /identifiable/. In order to be identifiable to humans, sections generally carry a heading and/or a trailer (see below). In order to be identifiable to machines, sections carry a numeric /label/ that is represented by the n attribute of the corresponding div element. The value of n will usually be the serial number of the section: `<div xmlns="http://www.tei-c.org/ns/Examples" n="2"/>' represents the second division (adhyāya, pāda, adhikaraṇa, etc.), even if it's not actually the second div element within its parent element. For a division that comprises more than one such section, we simply put all of the corresponding numbers in the n attribute: `<div xmlns="http://www.tei-c.org/ns/Examples" n="2 3 4"/>'.
Divisions (divs) are block-level elements, and they constitute the text hierarchy, together with other block-level elements like paragraphs (p), verses (lg), and the “anonymous blocks” used for sūtras and the like (ab). The overall reference system of the text will therefore usually include the numbering of div elements at the upper levels and the numbering of lg or ab elements at the lower levels.
######
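The labelling scheme quoted above can be illustrated with a small Python sketch that builds references from the @n values of nested div, lg and ab elements. The sample markup, the function name `references`, the "." separator and the "-" used for multi-number labels are all assumptions made for illustration; the guidelines do not prescribe a reference syntax.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
DIV, LG, AB = (f"{{{TEI_NS}}}{name}" for name in ("div", "lg", "ab"))

# Hypothetical sample: numbered divs at the upper levels, ab/lg below them.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div n="1">
    <div n="1"><ab n="1">...</ab><ab n="2">...</ab></div>
    <div n="2 3 4"><lg n="1"><l>...</l></lg></div>
  </div>
</body></text></TEI>"""


def references(element, prefix=()):
    """Yield (reference, tag) for each labelled div, lg and ab, in document order."""
    for child in element:
        if child.tag in (DIV, LG, AB) and child.get("n"):
            # n="2 3 4" marks one division covering several sections;
            # joining the numbers with "-" is a choice made for this sketch.
            label = "-".join(child.get("n").split())
            path = prefix + (label,)
            yield ".".join(path), child.tag.split("}")[1]
            if child.tag == DIV:
                yield from references(child, path)
        else:
            # Unlabelled elements are passed through without adding a level.
            yield from references(child, prefix)


body = ET.fromstring(SAMPLE).find(f".//{{{TEI_NS}}}body")
for ref, tag in references(body):
    print(ref, tag)
```

Because the reference is taken from @n rather than from an element's position, a division labelled n="2" keeps that label even if it is not the second div within its parent.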
I can at least say that for most practical purposes the @type on the div is ignored, at the moment: the level of any given div is inferred from the number of parent div-s, and the typesetting/display on the various platforms of SARIT depends on several factors about a div’s content rather than on the @type.
But my question is more theoretical: should we generally encourage the use of Sanskrit technical terms for types of a text’s division or not? And if you think we should encourage them, would it make sense (or even be possible) to somehow specify guidelines for their usage that would apply to many texts?
In SARIT at least, there are some cases where we decided against the use of Sanskrit terms, mainly because their meaning is actually not so clear, e.g., in the markup of tei:quote with @type="lemma" instead of something like @type="pratīka".
Hoping for some help,
--
Patrick McAllister
long-term email: pma@rdorte.org
_______________________________________________
Indic-texts mailing list
Indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts