Hi Patrick,

Thanks for the interesting question. In the past few months, I've been experimenting with writing my own XSL stylesheets for extracting text and structural information from the SARIT XML files for some natural language processing purposes. My personal wish, as an only half-savvy user of these things, is that the @(sub)type and @n attributes had indeed been used more, if only as human-readable labels, i.e., regardless of whether the SARIT system itself needed them. That is, I find the way the <div>s are encoded in the Nyāyamañjarī file ...

etc.

...to be really clear and easy to work with for my purposes. By contrast, I find the way the <div>s are encoded in the Pramāṇavārttikālaṅkāra file (mostly plain <div> elements with no attributes, each immediately followed by a <head> element) much harder to make sense of or to use at scale. Fortunately, after considerable effort spent understanding XML and XSLT, I seem to be having gradual success in automatically reading out the nesting structure of the <div>s with XSL transforms (there are probably better ways, too; suggestions welcome!), but the point is that I was surprised that I needed to do so.
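
To make the idea concrete, here is a minimal sketch of that kind of structure read-out, written in Python rather than XSLT. It is not my actual stylesheet: the sample document and its headings are made up, and the function name outline is just a placeholder. It assumes the file uses the standard TEI namespace and that each <div> carries at most one direct <head>.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the sample content below is invented for illustration.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <head>First chapter</head>
      <div><head>First section</head><p>...</p></div>
      <div><head>Second section</head><p>...</p></div>
    </div>
    <div>
      <head>Second chapter</head>
    </div>
  </body></text>
</TEI>"""


def outline(parent, depth=1):
    """Return (depth, heading) pairs for all divs below parent, in document order."""
    rows = []
    for div in parent.findall("tei:div", NS):  # direct child divs only
        head = div.find("tei:head", NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        rows.append((depth, title))
        rows.extend(outline(div, depth + 1))  # recurse into nested divs
    return rows


body = ET.fromstring(SAMPLE).find(".//tei:body", NS)
for depth, title in outline(body):
    print("  " * (depth - 1) + title)
```

The nesting depth is recovered purely from position here, which is exactly the information that a @type or @n attribute would otherwise state directly.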

So, in sum, since nothing currently seems to depend on the @(sub)type or @n attributes, I suggest encoding traditional nomenclature and numbering there, for use in unexpected ways by others who will try to use SARIT data in the future. This means that I think I agree with your own feeling that Sanskrit technical terms should not be relied upon for encoding text divisions, but it would be nice if they weren't totally discarded (and section numbering, of whatever provenance, is nice too). In this way, no grand concessions need to be made for such information in the TEI guidelines, either.

Just my two cents, as I'm gradually learning about all these things. I'm very interested to hear responses. And cheers for all the good work!

Tyler

On Wed, Oct 10, 2018 at 1:08 PM Patrick McAllister <pma@rdorte.org> wrote:

Dear list members,

I’m currently thinking about the encoding of sections (mainly tei:div
elements) for the documents in the SARIT collection. Many of those
documents are outside my field of expertise, and so I’m having a hard
time with this.

We have quite a large variety of section @types that either directly use
Sanskrit terminology or use terminology that might (IMO) be based on it.
They are:

adhikaraṇa
adhikāra
adhyāya
canto
closing
commentary
commentary1
commentary2
conclusion
kārika
kārikā
pariccheda
prakāśa
pāda
samuddeśa
sutra
sūtra
sūtra_with_bhāṣya

It’s obviously a bit of a mess. I was wondering if anyone has a
principled approach to these naming schemes in general? I at least have
only very limited knowledge of the traditional sectioning of texts in
Sanskrit literature as a whole, and am not sure what the aim should be
here.

Currently the SARIT guidelines
(https://github.com/sarit/SARIT-corpus/blob/a355567b5bc61032e1d7fa39392f5e06d679be5c/schemas/odd/sarit-guidelines.xml#L996)
are rather cautious about the usage of these terms:

######

Sections of the text
════════════════════

There are many Sanskrit words for sections of a text: sarga, adhyāya,
aṅka, pariccheda, ucchvāsa, etc. The encoding of these sections should
meet several requirements:

• the XML document itself should be valid;

• the structure of the XML document reflects the logical structure of
the text;

• a standard reference system should be able to use the structure of
the XML document as a proxy for the structure of the text;

• texts in the corpus are broadly consistent in the encoding strategy
used for these sections.

These considerations lead us to recommend the use of div for all
"parts," "sections," and "divisions" in the text, whatever their
Sanskrit name is, and at whatever depth they occur. Do /not/ use the
numbered divisions available in earlier versions of the TEI Guidelines
(`<div1 xmlns="http://www.tei-c.org/ns/Examples"/>', `<div2
xmlns="http://www.tei-c.org/ns/Examples"/>', etc.).

Before encoding a text, you should figure out a strategy for
representing all of the relevant levels of the text as div
elements. Some Mīmāṃsā texts, for example, are organized according to
the hierarchical organization of the Mīmāṃsā Sūtras into adhyāyas,
pādas, and adhikaraṇas. The first div beneath the body element
(body/div) will thus correspond to an adhyāya, the first div below
this to a pāda (body/div/div), and the first div below this
(body/div/div/div) to an adhikaraṇa. If this schema is applied
consistently, there is no need for assigning a type to the div
elements themselves (e.g., `<div
xmlns="http://www.tei-c.org/ns/Examples" type="adhyāya"/>', `<div
xmlns="http://www.tei-c.org/ns/Examples" type="pāda"/>', `<div
xmlns="http://www.tei-c.org/ns/Examples" type="adhikaraṇa"/>'), but
these type attributes may be included in order to make the XML easier
to read.

The strategy followed in the text can be, and should be, documented in
the reference declaration (refsDecl), which is part of the encoding
description (encodingDesc) in the TEI Header (see above).

1 Labelling sections
────────────────────

Sections of the text should be /identifiable/. In order to be
identifiable to humans, sections generally carry a heading and/or a
trailer (see below). In order to be identifiable to machines, sections
carry a numeric /label/ that is represented by the n attribute of the
corresponding div element. The value of n will usually be the serial
number of the section: `<div xmlns="http://www.tei-c.org/ns/Examples"
n="2"/>' represents the second division (adhyāya, pāda, adhikaraṇa,
etc.), even if it's not actually the second div element within its
parent element. For a division that comprises more than one such
section, we simply put all of the corresponding numbers in the n
attribute: `<div xmlns="http://www.tei-c.org/ns/Examples" n="2 3
4"/>'.

Divisions (divs) are block-level elements, and they constitute the
text hierarchy, together with other block-level elements like
paragraphs (p), verses (lg), and the “anonymous blocks” used for
sūtras and the like (ab). The overall reference system of the text
will therefore usually include the numbering of div elements at the
upper levels and the numbering of lg or ab elements at the lower
levels.

######
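
To illustrate the labelling rule quoted above, here is a minimal Python sketch that reads the n attributes into a reference path for each div. It assumes the standard TEI namespace and purely numeric, whitespace-separated values in n; the sample fragment and the function name labels are hypothetical and not part of the guidelines.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment: the second inner div stands for sections 2, 3 and 4 at once.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div n="1">
    <div n="1"/>
    <div n="2 3 4"/>
  </div>
  <div n="2">
    <div n="1"/>
  </div>
</body>"""


def labels(parent, prefix=()):
    """Yield one path per div: the n values of its ancestor divs, then its own."""
    for div in parent.findall("tei:div", NS):
        # n may hold several numbers; a missing n gives an empty tuple.
        own = tuple(int(tok) for tok in div.get("n", "").split())
        path = prefix + (own,)
        yield path
        yield from labels(div, path)


body = ET.fromstring(SAMPLE)
paths = list(labels(body))
```

Each path holds one tuple per level, so a div that covers several sections keeps all of its numbers together rather than being forced into a single value.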

I can at least say that for most practical purposes the @type on the div
is ignored, at the moment: the level of any given div is inferred from
the number of parent div-s, and the typesetting/display on the various
platforms of SARIT depends on several factors about a div’s content
rather than on the @type.
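
As a rough illustration of inferring the level from the number of parent divs, here is a minimal Python sketch. It is not the code the SARIT platforms use: the sample fragment and the function name div_levels are hypothetical, and it assumes the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented fragment: @type is present on some divs and absent on others.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="canto">
    <div>
      <div type="commentary"/>
    </div>
  </div>
  <div/>
</body>"""


def div_levels(root):
    """Return (level, type) for every div, where level = 1 + number of ancestor divs."""
    # ElementTree elements have no parent pointer, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    levels = []
    for div in root.iter(TEI + "div"):
        level = 1
        node = parent_of.get(div)
        while node is not None:
            if node.tag == TEI + "div":
                level += 1
            node = parent_of.get(node)
        levels.append((level, div.get("type")))
    return levels


root = ET.fromstring(SAMPLE)
for level, div_type in div_levels(root):
    print(level, div_type)
```

The level comes out the same whether or not a div has a @type, which is why the attribute can be left out without affecting display.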

But my question is more theoretical: should we generally encourage the
use of Sanskrit technical terms for types of a text’s division or not?
And if you think we should encourage them, would it make sense (or even
be possible) to somehow specify guidelines for their usage that would
apply to many texts?

In SARIT at least, there are some cases where we decided against the use
of Sanskrit terms, mainly because their meaning is actually not so
clear, e.g., in the markup of tei:quote with @type="lemma" instead of
something like @type="pratīka".

Hoping for some help,

--
Patrick McAllister
long-term email: pma@rdorte.org
_______________________________________________
Indic-texts mailing list
Indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts