Re: [Indic-texts] Jaina-Prosopography

29 May 2018

      I would like to add a word about where TEI is useful in creating a prosopographic database.  TEI is for marking up text, but the TEI-Ms guidelines include specific provisions for transcribing manuscripts and for including catalogue information in the teiHeader.  These include elements for authors, scribes, other people, places, and dates, that is, just the information that would be useful in a prosopographic database.  Manuscripts catalogued using these TEI elements are easily mined for this information.  The Sanskrit Library has now catalogued about 2,000 manuscripts using a template we developed in accordance with the TEI-Ms guidelines.  We hope someday to mine this information, add it to Pandit, and link the catalogue entries with Pandit.  I think this is the kind of thing that TEI is good for.  Once such information is extracted from text and put in a database, there is no need for TEI; it is quite clumsy to mark up a database in TEI with table, row and cell elements.  Similarly, using TEI to mark up texts with person and place elements can also contribute to enriching a prosopographic database.
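
      As a rough illustration of what such mining might look like, here is a minimal Python sketch using only the standard library.  The sample record, the identifier MS-0001, the choice of persName with role="scribe", and the function name mine_ms_desc are all assumptions made up for this example; they are not the Sanskrit Library's actual template or data.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; all TEI elements live in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A hypothetical catalogue entry, invented for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical catalogue entry</title></titleStmt>
      <publicationStmt><p>Illustration only</p></publicationStmt>
      <sourceDesc>
        <msDesc>
          <msIdentifier><idno>MS-0001</idno></msIdentifier>
          <msContents>
            <msItem>
              <author>Example Author</author>
              <title>Example Title</title>
            </msItem>
          </msContents>
          <history>
            <origin>
              Copied by <persName role="scribe">Example Scribe</persName>
              at <origPlace>Example Town</origPlace>
              in <origDate when="1750">1750</origDate>.
            </origin>
          </history>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def text_of(el):
    """Return the element's text content with whitespace normalized."""
    return " ".join("".join(el.itertext()).split())


def mine_ms_desc(xml_text):
    """Collect people, places and dates from every msDesc in a TEI document."""
    root = ET.fromstring(xml_text)
    records = []
    for ms in root.iterfind(".//tei:msDesc", TEI_NS):
        idno = ms.find("tei:msIdentifier/tei:idno", TEI_NS)
        records.append({
            "id": text_of(idno) if idno is not None else None,
            "authors": [text_of(a) for a in ms.iterfind(".//tei:author", TEI_NS)],
            # Assumes scribes are tagged as persName with role="scribe".
            "scribes": [text_of(p) for p in
                        ms.iterfind(".//tei:persName[@role='scribe']", TEI_NS)],
            "places": [text_of(p) for p in ms.iterfind(".//tei:origPlace", TEI_NS)],
            # Prefer the normalized @when value, falling back to the text.
            "dates": [d.get("when") or text_of(d) for d in
                      ms.iterfind(".//tei:origDate", TEI_NS)],
        })
    return records


if __name__ == "__main__":
    for record in mine_ms_desc(SAMPLE):
        print(record)
```

      Each resulting record is a plain dictionary that could then be loaded into a database table or matched against existing entries, at which point the TEI markup has done its job.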

Yours,
Peter

******************************
Peter M. Scharf, President
The Sanskrit Library
scharf@sanskritlibrary.org
http://sanskritlibrary.org
******************************
...
On 28 May. 2018, at 11:24 AM, Andrew Ollett <andrew.ollett@gmail.com> wrote:
Dear list members,
I have been following both the Jaina-Prosopography and PANDiT projects with great interest and optimism. There are two general questions that have arisen which I think need to be separated: the possibility of sharing data between projects, and the use of TEI as a data format. There is luckily no disagreement over the fact that data should be published in a free and accessible way. But the really essential thing is not just to publish the data, but to publish it in a format that can be queried and retrieved programmatically. This is precisely what "Linked Open Data" is supposed to do, and there has been a huge amount of work in neighboring fields like Classics to build resources that are linked and open in precisely this way. For example, this <http://dlib.nyu.edu/awdl/isaw/isaw-papers/7/> collection of papers (now a bit dated) about "current practice in linked open data for the ancient world" and the SNAP:DRGN <http://snapdrgn.net/> project (also a bit dated). I think Gabriel Bodard is consulting on the Jaina-Prosopography project and from Peter's description the data will be published in accordance with LOD standards. Making the data available in other formats, such as CSV or TEI, is a nice gesture, and may be useful for certain users, but because CSV and TEI documents are just documents, and there aren't tools for extracting relations from huge amounts of CSV or TEI data (well, there probably are for CSV...), they are about as useful as plain text files.
One of the great benefits of the LOD approach is that projects can share data despite having different data models. In order for one project to use another's data, there will inevitably be some work of mapping the ontology of one onto the ontology of another (something that PANDiT has dealt with over successive imports of data from other sources). But we are not in the situation we were in previously, where the data of one project is essentially useless to another without a massive investment of time and money.
Now to come back to TEI: there are projects that use TEI as the basic data format for prosopographic data, such as Syriaca.org <http://syriaca.org/>. But TEI is meant to encode text data, and it is not particularly good at representing relations between entities in the kind of well-defined ontologies that prosopographic databases need. Syriaca has essentially had to define their ontology, and controlled vocabularies, and then find ways of representing those vocabularies in TEI. There's a lot that can go wrong there. Neither of the databases we're talking about uses TEI as its basic data format, for good reasons. We might want them to publish their data as TEI, as an exchange format, but as I noted above, it's not clear what any of us would do with (in the case of PANDiT) 50,848 TEI documents.
prayojanam anuddiśya na mando ’pi pravartate. What is it exactly that we want from the published data? What do we want to do with it? How do we want to share it, query it, connect it? We now have these amazing resources, and we should try to use them often, and use them creatively. I think that LOD standards would help in a lot of respects (e.g., being able to get relevant biodata for a given person just from the PANDiT ID and put it on a website programmatically) but I am very curious about what specific purposes would be served by publishing the data in TEI format.
Andrew
_______________________________________________
indic-texts mailing list
indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts
