TEI does include a mechanism for pointing to external data entities in complex ways (Guidelines chapter 16); it can be done, although this is clumsy and not the answer one would wish for in a large project. 
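
For concreteness, a minimal sketch of the kind of pointing chapter 16 allows, using the TEI xpath() pointer scheme to reach into an external file (Python standard library only; the file name and identifiers are invented for illustration):

    import xml.etree.ElementTree as ET

    # A <ptr> that reaches into an external authority file by XPath;
    # "persons.xml" and the xml:id are invented for this example.
    fragment = """
    <p xmlns="http://www.tei-c.org/ns/1.0">
      the teacher of <persName>Hemacandra</persName>
      <ptr target="persons.xml#xpath(//person[@xml:id='p123'])"/>
    </p>
    """

    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    for ptr in ET.fromstring(fragment).findall(".//tei:ptr", ns):
        print(ptr.get("target"))

Workable, but every consumer of the data has to understand and resolve that pointer syntax, which is part of what makes it clumsy at scale.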

Also, TEI has an extraordinary depth of documentary awareness that nobody with serious scholarly engagement would want to relinquish.  Just one example: Guidelines chapter 21.  The ability to express degrees of certainty is central to the scholarly endeavour. 
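
To make that concrete, here is a minimal sketch of the chapter 21 machinery, the global @cert attribute plus the <certainty> element (Python standard library only; the name, identifier, and degree are invented for illustration):

    import xml.etree.ElementTree as ET

    # A reading asserted with medium confidence, quantified by <certainty>.
    fragment = """
    <div xmlns="http://www.tei-c.org/ns/1.0">
      <persName xml:id="p1" cert="medium">Gunaratna</persName>
      <certainty target="#p1" locus="value" degree="0.6"/>
    </div>
    """

    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    for c in ET.fromstring(fragment).findall("tei:certainty", ns):
        print(c.get("target"), c.get("locus"), c.get("degree"))

Any export format that cannot carry this kind of qualification loses something scholars care about.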

The basic idea of the data triple doesn't - as far as I can see - provide anything like the granularity that one would look for in a set of relations.  Everything is linked by "is" as if that were an unproblematic or universal form of predication.  (I hasten to say that I don't understand data triples and semantic ontologies very well, and I could well be wrong about what can be done.)
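
For what it is worth, here is the basic shape of a triple as I understand it, sketched in Python with the rdflib library (the vocabulary, identifiers, and data are invented for illustration); the middle term, the predicate, is where any granularity would have to live:

    from rdflib import Graph

    # Two invented statements in Turtle syntax; each line is one triple
    # of the form: subject, predicate, object.
    data = """
    @prefix ex:     <https://example.org/vocab/> .
    @prefix person: <https://example.org/person/> .

    person:123 ex:studentOf      person:456 .
    person:123 ex:composedWorkIn "12th century" .
    """

    g = Graph()
    g.parse(data=data, format="turtle")
    for s, p, o in g:
        print(s, p, o)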

The tension between TEI and semantically linked data - or in my old-fashioned language, between documents and databases - is very much a current discussion in the TEI world.  See, e.g., https://journals.openedition.org/jtei/1191?lang=en, https://journals.openedition.org/jtei/1480#tocto1n2, https://hcmc.uvic.ca/tei2017/abstracts/t_141_ore_ontologiesconceptualmodels.html, http://www.1890s.ca/PDFs/Crossing%20the%20Stile.pdf, and much more.

My view may be summarized as "jam today."  The Pandit project already exists and is already rather wonderful.  It has achieved a critical mass of data that makes it already a discovery tool.  Today.  It has received a lot of expert curation and funding over many years.  It is not to be ignored or discarded.  Data entered into Pandit will not be lost in future.  Yes, Pandit needs to grow in important new directions.  Among the most important is that it should develop transparent import-export mechanisms.  And - yes - it needs to be able to write out data in a form that maintains the semantic ontology that it embodies, and that can be used by others.  But it seems inexplicable to me to ignore Pandit and start a prosopographical project in competition with it.

Best,
Dominik


On Mon, 28 May 2018 at 10:24, Andrew Ollett <andrew.ollett@gmail.com> wrote:
Dear list members,

I have been following both the Jaina-Prosopography and PANDiT projects with great interest and optimism. There are two general questions that have arisen which I think need to be separated: the possibility of sharing data between projects, and the use of TEI as a data format. There is luckily no disagreement over the fact that data should be published in a free and accessible way. But the really essential thing is not just to publish the data, but to publish it in a format that can be queried and retrieved programmatically. This is precisely what "Linked Open Data" is supposed to do, and there has been a huge amount of work in neighboring fields like Classics to build resources that are linked and open in precisely this way. See, for example, the collection of papers (now a bit dated) about "current practice in linked open data for the ancient world" and the SNAP:DRGN project (also a bit dated). I think Gabriel Bodard is consulting on the Jaina-Prosopography project, and from Peter's description the data will be published in accordance with LOD standards. Making the data available in other formats, such as CSV or TEI, is a nice gesture, and may be useful for certain users, but because CSV and TEI documents are just documents, and there aren't tools for extracting relations from huge amounts of CSV or TEI data (well, there probably are for CSV...), they are about as useful as plain text files.
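
To illustrate what "queried and retrieved programmatically" means in practice, here is a minimal sketch in Python with the rdflib library; the vocabulary, identifiers, and data are invented for illustration and are not drawn from either project:

    from rdflib import Graph

    # Invented sample data: two people and a teacher-student relation.
    data = """
    @prefix ex:     <https://example.org/vocab/> .
    @prefix person: <https://example.org/person/> .

    person:1 ex:name "Hemacandra" ; ex:studentOf person:2 .
    person:2 ex:name "Devacandra" .
    """

    g = Graph()
    g.parse(data=data, format="turtle")

    # A SPARQL query over the graph: every student-teacher pair, by name.
    q = """
    PREFIX ex: <https://example.org/vocab/>
    SELECT ?student ?teacher WHERE {
      ?s ex:studentOf ?t .
      ?s ex:name ?student .
      ?t ex:name ?teacher .
    }
    """
    for row in g.query(q):
        print(row.student, "studied with", row.teacher)

The same query could in principle be run against a published endpoint rather than a local file, which is the point: the data becomes something you ask questions of, not something you read through.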

One of the great benefits of the LOD approach is that projects can share data despite having different data models. In order for one project to use another's data, there will inevitably be some work of mapping the ontology of one onto the ontology of another (something that PANDiT has dealt with over successive imports of data from other sources). But we are not in the situation we were in previously, where the data of one project is essentially useless to another without a massive investment of time and money.
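
As a rough sketch of what that mapping work looks like (again rdflib; the two vocabularies are invented stand-ins, not the projects' actual ontologies), a handful of alignment statements can let one query range over both datasets:

    from rdflib import Graph

    # Alignment statements: project A's "pupilOf" is declared equivalent to
    # project B's "studentOf", and A's "Pandita" a subclass of B's "Person".
    data = """
    @prefix a:    <https://example.org/projectA/> .
    @prefix b:    <https://example.org/projectB/> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    a:pupilOf  owl:equivalentProperty  b:studentOf .
    a:Pandita  rdfs:subClassOf         b:Person .
    """

    g = Graph()
    g.parse(data=data, format="turtle")
    print(len(g), "alignment triples loaded")

The intellectual work of deciding that two terms really do mean the same thing remains, but once made, the decision is expressed in a few machine-readable lines rather than a bespoke conversion script.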

Now to come back to TEI: there are projects that use TEI as the basic data format for prosopographic data, such as Syriaca.org. But TEI is meant to encode text data, and it is not particularly good at representing relations between entities in the kind of well-defined ontologies that prosopographic databases need. Syriaca has essentially had to define its own ontology and controlled vocabularies, and then find ways of representing those vocabularies in TEI. There's a lot that can go wrong there. Neither of the databases we're talking about uses TEI as its basic data format, for good reasons. We might want them to publish their data as TEI, as an exchange format, but as I noted above, it's not clear what any of us would do with (in the case of PANDiT) 50,848 TEI documents.
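
For anyone who has not seen it done, here is roughly what encoding such relations in TEI looks like (Python standard library only; the @name value and identifiers are invented, not Syriaca's actual vocabulary). The ontology ends up living in attribute values, which is exactly where things can quietly go wrong:

    import xml.etree.ElementTree as ET

    # Prosopographic relations via TEI's <listRelation>/<relation>;
    # @name is drawn from a project-defined controlled vocabulary.
    fragment = """
    <listPerson xmlns="http://www.tei-c.org/ns/1.0">
      <person xml:id="person1"><persName>Hemacandra</persName></person>
      <person xml:id="person2"><persName>Devacandra</persName></person>
      <listRelation>
        <relation name="studentOf" active="#person1" passive="#person2"/>
      </listRelation>
    </listPerson>
    """

    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    for r in ET.fromstring(fragment).findall(".//tei:relation", ns):
        print(r.get("active"), r.get("name"), r.get("passive"))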

prayojanam anuddiśya na mando ’pi pravartate ("not even a fool acts without a purpose"). What is it exactly that we want from the published data? What do we want to do with it? How do we want to share it, query it, connect it? We now have these amazing resources, and we should try to use them often, and use them creatively. I think that LOD standards would help in a lot of respects (e.g., being able to get relevant biodata for a given person just from the PANDiT ID and put it on a website programmatically) but I am very curious about what specific purposes would be served by publishing the data in TEI format.
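
To make that parenthetical example concrete: what I have in mind would look roughly like the sketch below, assuming a per-entity JSON-LD (or similar) endpoint keyed on the PANDiT ID. The URL, the ID, and the field names are entirely hypothetical; no such endpoint exists as far as I know.

    import json
    import urllib.request

    # Hypothetical endpoint and identifier, purely for illustration.
    PANDIT_ID = 12345
    url = f"https://example.org/pandit/entity/{PANDIT_ID}.jsonld"

    with urllib.request.urlopen(url) as resp:
        record = json.load(resp)

    # Pull a couple of fields to display on one's own site.
    print(record.get("name"), record.get("dates"))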
 
Andrew
_______________________________________________
indic-texts mailing list
indic-texts@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/indic-texts