Re: [Indic-texts] Jaina-Prosopography

28 May 2018

Dear list members,

I have been following both the Jaina-Prosopography and PANDiT projects with
great interest and optimism. There are two general questions that have
arisen which I think need to be separated: the possibility of sharing data
between projects, and the use of TEI as a data format. There is luckily no
disagreement over the fact that data should be published in a free and
accessible way. But the really essential thing is not just to publish the
data, but to publish it in a format that can be queried and retrieved
programmatically. This is precisely what "Linked Open Data" is supposed to
do, and there has been a huge amount of work in neighboring fields like
Classics to build resources that are linked and open in precisely this way.
See, for example, this <http://dlib.nyu.edu/awdl/isaw/isaw-papers/7/> collection
of papers (now a bit dated) about "current practice in linked open data for
the ancient world" and the SNAP:DRGN <http://snapdrgn.net> project (also a
bit dated). I think Gabriel Bodard is consulting on the Jaina-Prosopography
project and from Peter's description the data will be published in
accordance with LOD standards. Making the data available in other formats,
such as CSV or TEI, is a nice gesture, and may be useful for certain
users, but because CSV and TEI documents are just documents, and there
aren't tools for extracting relations from huge amounts of CSV or TEI data
(well, there probably are for CSV...), they are about as useful as plain
text files.

One of the great benefits of the LOD approach is that projects can share
data despite having different data models. In order for one project to use
another's data, there will inevitably be some work of mapping the ontology
of one onto the ontology of another (something that PANDiT has dealt with
over successive imports of data from other sources). But we are no longer in
the situation we were in previously, where the data of one project was
essentially useless to another without a massive investment of time and
money.
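
To make the mapping step concrete, here is a minimal sketch in Python. It
assumes each project's data can be read as (subject, predicate, object)
triples. The predicate names and identifiers are invented for illustration
and are not taken from either project's actual vocabulary.

```python
# Hypothetical predicate names; neither project's real ontology is shown here.
PREDICATE_MAP = {
    "jp:teacherOf": "pandit:hasStudent",
    "jp:authorOf": "pandit:wrote",
}


def map_triples(triples, predicate_map):
    """Rewrite predicates that have a known equivalent.

    Returns (mapped, unmapped) so that statements with no equivalent
    in the target ontology can be reviewed by hand instead of dropped.
    """
    mapped, unmapped = [], []
    for subj, pred, obj in triples:
        if pred in predicate_map:
            mapped.append((subj, predicate_map[pred], obj))
        else:
            unmapped.append((subj, pred, obj))
    return mapped, unmapped


# Invented sample data, not real records.
source = [
    ("jp:person/1", "jp:teacherOf", "jp:person/2"),
    ("jp:person/1", "jp:memberOf", "jp:lineage/7"),
]
mapped, unmapped = map_triples(source, PREDICATE_MAP)
```

The second return value is the point of the exercise: whatever lands in
`unmapped` is the part of the ontology work that still has to be done by a
person.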

Now to come back to TEI: there are projects that use TEI as the basic data
format for prosopographic data, such as Syriaca.org. But TEI is meant to
encode text data, and it is not particularly good at representing relations
between entities in the kind of well-defined ontologies that prosopographic
databases need. Syriaca has essentially had to define its own ontology and
controlled vocabularies, and then find ways of representing those
vocabularies in TEI. There's a lot that can go wrong there. Neither of the
databases we're talking about uses TEI as its basic data format, for good
reasons. We might want them to publish their data as TEI, as an exchange
format, but as I noted above, it's not clear what any of us would do with
(in the case of PANDiT) 50,848 TEI documents.
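
As an illustration of the extraction work involved, here is a sketch in
Python that pulls relations back out of a TEI document. It assumes the
relations were encoded with TEI's <relation> element and its name, active,
and passive attributes, which is only one of several ways a project might do
it. The sample is a made-up fragment, not output from either database.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment for illustration; a real TEI file would also have a teiHeader.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <listRelation>
      <relation name="teacherOf" active="#p1" passive="#p2"/>
    </listRelation>
  </body></text>
</TEI>"""


def relations_from_tei(xml_text):
    """Return (active, name, passive) for every <relation> in the document."""
    root = ET.fromstring(xml_text)
    return [
        (rel.get("active"), rel.get("name"), rel.get("passive"))
        for rel in root.iterfind(".//tei:relation", TEI_NS)
    ]


relations = relations_from_tei(SAMPLE)
```

Every consumer of the TEI files would need to know which encoding convention
was chosen before a function like this returns anything useful, and would
have to run it over each document in turn.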

prayojanam anuddiśya na mando ’pi pravartate ("not even a fool acts without
a purpose in view"). What is it exactly that we
want from the published data? What do we want to do with it? How do we want
to share it, query it, connect it? We now have these amazing resources, and
we should try to use them often, and use them creatively. I think that LOD
standards would help in a lot of respects (e.g., being able to get relevant
biodata for a given person just from the PANDiT ID and put it on a website
programmatically) but I am very curious about what specific purposes would
be served by publishing the data in TEI format.
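
A minimal sketch, in Python, of the kind of lookup I have in mind: given an
ID, collect every statement about that entity and format it for display. The
IDs, predicates, and values are placeholders, not real PANDiT data, and the
in-memory list stands in for whatever endpoint a project would actually
expose.

```python
from collections import defaultdict

# Invented sample triples; nothing here is a real identifier or record.
TRIPLES = [
    ("pandit:P1", "name", "Example Author"),
    ("pandit:P1", "wrote", "pandit:W1"),
    ("pandit:W1", "title", "Example Work"),
    ("pandit:P2", "name", "Another Author"),
]


def biodata(entity_id, triples):
    """Group every statement whose subject is entity_id by predicate."""
    record = defaultdict(list)
    for subj, pred, obj in triples:
        if subj == entity_id:
            record[pred].append(obj)
    return dict(record)


def as_lines(record):
    """Format a record as sorted 'predicate: value' lines for display."""
    return [f"{pred}: {value}"
            for pred in sorted(record)
            for value in record[pred]]


person = biodata("pandit:P1", TRIPLES)
lines = as_lines(person)
```

The same two functions work for a person, a work, or a manuscript, because
nothing in them depends on what kind of entity the ID names.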

Andrew

Andrew Ollett