Sanskrit names for division types
Dear list members,
I’m currently thinking about the encoding of sections (mainly tei:div elements) for the documents in the SARIT collection. Many of those documents are outside my field of expertise, and so I’m having a hard time with this.
We have quite a large variety of section @types that either directly use Sanskrit terminology or use terminology that might (IMO) be based on it. They are:
adhikaraṇa adhikāra adhyāya canto closing commentary commentary1 commentary2 conclusion kārika kārikā pariccheda prakāśa pāda samuddeśa sutra sūtra sūtra_with_bhāṣya
It’s obviously a bit of a mess. I was wondering if anyone has a principled approach to these naming schemes in general? I at least have only very limited knowledge of the traditional sectioning of texts in Sanskrit literature as a whole, and am not sure what the aim should be here.
Currently the SARIT guidelines (https://github.com/sarit/SARIT-corpus/blob/a355567b5bc61032e1d7fa39392f5e06d...) are rather cautious about the usage of these terms:
######
Sections of the text
════════════════════
There are many Sanskrit words for sections of a text: sarga, adhyāya, aṅka, pariccheda, ucchvāsa, etc. The encoding of these sections should meet several requirements:
• the XML document itself should be valid;
• the structure of the XML document reflects the logical structure of the text;
• a standard reference system should be able to use the structure of the XML document as a proxy for the structure of the text;
• texts in the corpus are broadly consistent in the encoding strategy used for these sections.
These considerations lead us to recommend the use of div for all "parts," "sections," and "divisions" in the text, whatever their Sanskrit name is, and at whatever depth they occur. Do /not/ use the numbered divisions available in earlier versions of the TEI Guidelines (`<div1 xmlns="http://www.tei-c.org/ns/Examples"/>', `<div2 xmlns="http://www.tei-c.org/ns/Examples"/>', etc.).
Before encoding a text, you should figure out a strategy for representing all of the relevant levels of the text as div elements. Some Mīmāṃsā texts, for example, are organized according to the hierarchical organization of the Mīmāṃsā Sūtras into adhyāyas, pādas, and adhikaraṇas. The first div beneath the body element (body/div) will thus correspond to an adhyāya, the first div below this to a pāda (body/div/div), and the first div below this (body/div/div/div) to an adhikaraṇa. If this schema is applied consistently, there is no need for assigning a type to the div elements themselves (e.g., `<div xmlns="http://www.tei-c.org/ns/Examples" type="adhyāya"/>', `<div xmlns="http://www.tei-c.org/ns/Examples" type="pāda"/>', `<div xmlns="http://www.tei-c.org/ns/Examples" type="adhikaraṇa"/>'), but these type attributes may be included in order to make the XML easier to read.
The strategy followed in the text can be, and should be, documented in the reference declaration (refsDecl), which is part of the encoding description (encodingDesc) in the TEI Header (see above).
1 Labelling sections
────────────────────
Sections of the text should be /identifiable/. In order to be identifiable to humans, sections generally carry a heading and/or a trailer (see below). In order to be identifiable to machines, sections carry a numeric /label/ that is represented by the n attribute of the corresponding div element. The value of n will usually be the serial number of the section: `<div xmlns="http://www.tei-c.org/ns/Examples" n="2"/>' represents the second division (adhyāya, pāda, adhikaraṇa, etc.), even if it's not actually the second div element within its parent element. For division that comprises more than one such section, we simply put all of the corresponding numbers in the n attribute: `<div xmlns="http://www.tei-c.org/ns/Examples" n="2 3 4"/>'.
Divisions (divs) are block-level elements, and they constitute the text hierarchy, together with other block-level elements like paragraphs (p), verses (lg), and the “anonymous blocks” used for sūtras and the like (ab). The overall reference system of the text will therefore usually include the numbering of div elements at the upper levels and the numbering of lg or ab elements at the lower levels.
######
I can at least say that for most practical purposes the @type on the div is ignored, at the moment: the level of any given div is inferred from the number of parent div-s, and the typesetting/display on the various platforms of SARIT depends on several factors about a div’s content rather than on the @type.
But my question is more theoretical: should we generally encourage the use of Sanskrit technical terms for types of a text’s division or not? And if you think we should encourage them, would it make sense (or even be possible) to somehow specify guidelines for their usage that would apply to many texts?
In SARIT at least, there are some cases where we decided against the use of Sanskrit terms, mainly because their meaning is actually not so clear, e.g., in the markup of tei:quote with @type="lemma" instead of something like @type="pratīka".
Hoping for some help,
-- Patrick McAllister
long-term email: pma@rdorte.org
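The reference system described in the quoted guidelines (div numbering at the upper levels, lg or ab numbering at the lower levels) can be illustrated with a small Python sketch. The sample document, the `references` and `label` helpers, and the choice of "." and "+" as separators are assumptions made for illustration; they are not taken from the SARIT guidelines or tooling.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature text: adhyāya / pāda / adhikaraṇa, then <ab> blocks.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div n="1">
    <div n="1">
      <div n="1"><ab n="1">first</ab><ab n="2">second</ab></div>
      <div n="2 3 4"><ab n="1">third</ab></div>
    </div>
  </div>
</body>"""


def label(element):
    """Return the @n label; several serial numbers are joined with '+'
    (an arbitrary choice made for this sketch)."""
    return "+".join(element.get("n", "?").split())


def references(element, prefix=()):
    """Yield one dotted reference per <ab> or <lg>, built from the @n
    values of the enclosing <div> elements."""
    for child in element:
        if child.tag == TEI + "div":
            yield from references(child, prefix + (label(child),))
        elif child.tag in (TEI + "ab", TEI + "lg"):
            yield ".".join(prefix + (label(child),))


refs = list(references(ET.fromstring(SAMPLE)))
print(refs)
```

Nothing in this sketch consults @type: the position of each div in the hierarchy and its @n value are enough to build the reference.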
Hi Patrick,
Thanks for the interesting question. In the past few months, I've been
experimenting with writing my own XSL stylesheets for extracting text and
structural information from the SARIT XML files for some natural language
processing purposes. My personal wish, as an only half-savvy user of these
things, is that the @(sub)type and @n attributes had indeed been used more,
if only as human-readable labels, i.e., regardless of whether the SARIT
system itself needed them. That is, I find the way the <div>s are encoded
in the Nyāyamañjarī file ...
<div n="1" type="level1" subtype="volume">
<div n="1" type="level2" subtype="āhnika">
<div n="2" type="level2" subtype="āhnika">
<div n="3" type="level2" subtype="āhnika">
etc.
...to be really clear and easy to work with for my purposes. By contrast,
I find the way the <div>s are encoded in the Pramāṇavārttikālaṅkāra file
(mostly just simple <div> elements, i.e., no attributes, with immediately
following <head> elements) much harder to make sense of or to use on a
large scale.
Fortunately, with considerable effort spent in understanding XML and XSLT,
I seem to be having gradual success with automatically reading out the
nesting structure of <div>s using XSL transforms (I'm sure there are
probably better ways, too — suggestions welcome!), but the point is that I
was surprised that I needed to do so.
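For readers facing the same task, here is a minimal Python sketch (rather than XSLT) of reading out the nesting structure of bare <div> elements that carry only a <head>. The sample document and the `outline` helper are hypothetical; real SARIT files are much larger and may place headings differently.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature document: bare <div>s with <head>s, no attributes.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div><head>Chapter A</head>
    <div><head>Section A.1</head><p>text</p></div>
    <div><head>Section A.2</head><p>text</p></div>
  </div>
  <div><head>Chapter B</head></div>
</body>"""


def outline(element, depth=0):
    """Yield (depth, position, heading) for every div below `element`.

    `depth` is the number of div levels, `position` the serial number of
    the div among its sibling divs; both are derived from the tree alone.
    """
    position = 0
    for child in element:
        if child.tag != TEI + "div":
            continue
        position += 1
        head = child.find(TEI + "head")
        title = "".join(head.itertext()).strip() if head is not None else ""
        yield depth + 1, position, title
        yield from outline(child, depth + 1)


rows = list(outline(ET.fromstring(SAMPLE)))
for depth, position, title in rows:
    print("  " * (depth - 1) + f"{position}. {title}")
```

The depth and serial number come out of the traversal itself, so the sketch works on files without @type or @n; what it cannot recover is the traditional name of each level.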
So, in sum, since nothing currently seems to depend on the @(sub)type or @n
attributes, I suggest encoding traditional nomenclature and numbering
there, for use in unexpected ways by others who will try to use SARIT data
in the future. Which means that I think I agree with your own feeling, that
Sanskrit technical terms should not be *relied upon* for encoding text
divisions, but it would be nice if they weren't totally discarded (and
section numbering, of whatever provenance, is nice too); in this way, no
grand concessions need to be made for such information in the TEI
guidelines, either.
Just my two cents, as I'm gradually learning about all these things. I'm
very interested to hear responses. And cheers for all the good work!
Tyler
On Wed, Oct 10, 2018 at 1:08 PM Patrick McAllister wrote:
[...]
_______________________________________________
Indic-texts mailing list
Indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts
On Thu, 18 Oct 2018 at 15:06, Tyler Neill wrote:
<div n="1" type="level1" subtype="volume">
<div n="1" type="level2" subtype="āhnika">
<div n="2" type="level2" subtype="āhnika">
<div n="3" type="level2" subtype="āhnika">
etc.
I agree with Tyler's view that this is clear and conveys all the information one wants without having to extend TEI in local ways.
Best,
Dominik
It is not really necessary to include level1, level2 in every div. One could describe this structure, namely, that level1 is volume, level 2 is Ahnika, etc., in the header in machine readable form.
Yours,
Peter
******************************
Peter M. Scharf, President
The Sanskrit Library
scharf@sanskritlibrary.org
http://sanskritlibrary.org
******************************
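A minimal Python sketch of this idea, assuming the level names are declared once and then applied by depth. The `LEVEL_NAMES` mapping is hard-coded here as a stand-in for whatever header declaration is eventually chosen; it, the sample document, and the `label_divs` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Stand-in for a declaration made once per document (assumption).
LEVEL_NAMES = {1: "volume", 2: "āhnika"}

# Hypothetical miniature body with numbered but untyped divs.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div n="1">
    <div n="1"/>
    <div n="2"/>
    <div n="3"/>
  </div>
</body>"""


def label_divs(element, depth=1):
    """Yield (level name, @n) for each div, naming levels by depth.

    Depths without a declared name fall back to a generic 'levelN'.
    """
    for child in element:
        if child.tag == TEI + "div":
            yield LEVEL_NAMES.get(depth, f"level{depth}"), child.get("n")
            yield from label_divs(child, depth + 1)


labels = list(label_divs(ET.fromstring(SAMPLE)))
print(labels)
```

With a single declaration per document, the individual div elements need only @n; the trade-off is that a reader looking at one div in isolation no longer sees its name.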
On 2 Nov 2018, at 2:57 AM, Dominik Wujastyk wrote:
[...]
Agreed, Peter. To rephrase: I would be happy with whatever makes it
possible for various mid-level users (whether of XML/XSLT or of the
particular work or literary genre) to understand and manipulate the
material within a few minutes. The @n and @subtype attributes seem to be
doing the majority of that user-friendliness work here; perhaps the @type
attribute goes too far, as you say. But whether a few more or a few less, I
think the basic feedback to Patrick's question remains the same: Such
attributes need not be required by the guidelines, but they can still
serve to conveniently store annotations concerning traditional structural
information.
On Fri, Nov 2, 2018 at 7:17 AM Peter Scharf wrote:
[...]
On Fri, Nov 02 2018, Tyler Neill wrote:
[...]
Thank you for your answers so far: I’m convinced now that it’s useful to have these attributes. What I’m not yet sure about is whether their proposed mark up is satisfactory. I have doubts in two areas:
1) Ease of use/accessibility
Something like this was suggested:
<div n="1" type="level1" subtype="volume">
  <div n="1" type="level2" subtype="āhnika"> </div>
  <div n="1" type="level2" subtype="āhnika"> </div>
</div>
Does the value of the @type represent something that’s in the book (as a printed, material thing) or the text (as an abstract thing)? If not, and the attribute values “level1” or “level2” only indicate how many “div-s” down the current element is, then it seems redundant. (I think also Peter tends towards this opinion.)
Granted, with their presence it’s obvious at first sight where you are in the document, but probably, after some exposure to xml technologies, you’ll find it easier and more reliable to query the document itself for this information, rather than rely on these indicators.
2) @type/@subtype
More problematical, IMO, is the relation of @type/@subtype. @subtype is described like this:
“subtype: provides a sub-categorization of the element, if needed” (at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.typed.html)
I think the relation of these two attributes is quite literal. The Guidelines have examples like “sentence/declarative”, “phrase/preposition”, “word/noun”, etc.
So, with this expectation, what the @type/@subtype we’re discussing would tell us is that a “volume” div is a subtype of a “level1” div, and that an “āhnika” div is a subtype of a “level2” div. This seems to be mixing things of different categories. Additionally, the value of the @type would change depending on whether the book has volumes, parts, or only chapters. But I suppose that “āhnika” could be used in all three, and would thus clearly not be a subtype of a “levelX” div.
Something like this should really be sufficient:
<div n="1" type="āhnika"> </div>
(And it would solve my two problems.)
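To illustrate why querying the document can be more reliable than a hand-written level indicator, here is a minimal Python sketch that compares each div's declared "levelN" value with its actual depth. The sample document (which contains a deliberate mistake) and the `mismatched_levels` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample; the second āhnika is deliberately mislabelled.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div n="1" type="level1" subtype="volume">
    <div n="1" type="level2" subtype="āhnika"/>
    <div n="2" type="level3" subtype="āhnika"/>
  </div>
</body>"""


def mismatched_levels(element, depth=1):
    """Yield (actual depth, declared @type, @n) for every div whose
    'levelN' label disagrees with its position in the tree."""
    for child in element:
        if child.tag == TEI + "div":
            declared = child.get("type")
            if declared != f"level{depth}":
                yield depth, declared, child.get("n")
            yield from mismatched_levels(child, depth + 1)


problems = list(mismatched_levels(ET.fromstring(SAMPLE)))
print(problems)
```

The depth computed from the tree cannot drift out of step with the markup, whereas a typed-in "levelN" value can.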
On Fri, Nov 2, 2018 at 7:17 AM Peter Scharf wrote:
It is not really necessary to include level1, level2 in every div. One could describe this structure, namely, that level1 is volume, level 2 is Ahnika, etc., in the header in machine readable form.
This is a good suggestion, too. I don’t know how to make these things machine-readable in TEI, but I’ve found myself using the @ana attribute frequently to link interpretations with elements, e.g.:
<div n="1" ana="#āhnika"> </div>
And somewhere else (in the document or separately):
<interp xml:id="āhnika">Section of text manageable in a day.</interp>
This would also allow you to attach multiple interpretations:
<div n="1" ana="#āhnika #pariccheda-as-defined-by-dominik"> </div>
(For @ana, see http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.analytic.....)
But it is not machine-readable (unless we introduce some restrictions on the content of the tei:interp elements).
About tei:interp, the TEI Guidelines say (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html#AISP):
“The same analysis may be expressed with the interp element instead of the span element; this element provides attributes for recording an interpretive category and its value, as well as the identity of the interpreter, but does not itself indicate which passage of text is being interpreted; the same interpretive structures can thus be associated with many passages of the text.”
For SARIT, I think it would be very useful to have a central and visible list of such tei:interp elements that one could link the @ana against, in various texts. Any volunteers? Or other ideas of how to coordinate the types/analysis across multiple texts?
-- Patrick
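A minimal Python sketch of how such @ana pointers could be resolved against a list of tei:interp elements. The sample document is hypothetical, and the placement of the interp element within it is illustrative only and has not been checked against the TEI schema; the `resolve` helper is likewise an assumption.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: one defined interpretation, one undefined pointer.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <interpGrp>
    <interp xml:id="āhnika">Section of text manageable in a day.</interp>
  </interpGrp>
  <body>
    <div n="1" ana="#āhnika #pariccheda-as-defined-by-dominik"/>
  </body>
</text>"""

root = ET.fromstring(SAMPLE)

# Index every interp element by its xml:id.
interps = {
    el.get(XML_ID): "".join(el.itertext()).strip()
    for el in root.iter(TEI + "interp")
}


def resolve(div):
    """Return (pointer, description) pairs for a div's @ana pointers.

    Pointers with no matching interp are paired with None.
    """
    pairs = []
    for pointer in div.get("ana", "").split():
        key = pointer[1:] if pointer.startswith("#") else pointer
        pairs.append((pointer, interps.get(key)))
    return pairs


first_div = next(root.iter(TEI + "div"))
resolved = resolve(first_div)
print(resolved)
```

The unresolved second pointer comes back as None, which is the kind of gap a shared, central list of interp elements would be meant to close.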
The encoding of divs shown by Patrick in his first example is exactly what I use and what I had in mind, i.e.
<div n="1" type="Ahnika">
As for how to describe levels in a machine-readable way, I had in mind the tagsDecl element in the teiHeader:
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-tagsDecl.html
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html#HD57
Note that the latter link includes the following description: "The tagsDecl element is used to record the following information about the tagging used within a particular document: … any comment relating to the usage of particular elements not specified elsewhere in the header."
While the description of this element mentions specifying the use of elements, I’m not sure whether there is some equivalent for describing the use of attributes.
Yours,
Peter
******************************
Peter M. Scharf, President
The Sanskrit Library
scharf@sanskritlibrary.org
http://sanskritlibrary.org
******************************
On 2 Nov 2018, at 8:15 PM, Patrick McAllister wrote:
[...]