[tei-council] datatypes coming back to bite us

29 Mar 2015

      As you may recall, on Friday's conference call we agreed I'd try
to get Schematron checks for the default value problems Lou pointed
out in FR 546[1] ready for inclusion in the next release. The cases
we'd like to check are

 * If the attribute being defined is required, a default value should
   not be defined.

 * If the attribute being defined has a closed list of values, then
   the default value should be among the enumerated list.
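For anyone who wants to poke at the logic outside Schematron, here is
a rough Python sketch of the two checks in their most naive form. It
is only an illustration, not the proposed rule. It assumes a required
attribute is marked with usage="req" on <attDef> and that the
elements are in the TEI namespace; check_att_def is a made-up name.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # assumed namespace of the ODD elements


def check_att_def(att_def):
    """Return a list of problem messages for one <attDef> element."""
    problems = []
    default = att_def.find(TEI + "defaultVal")
    if default is None:
        return problems  # no default value, nothing to check

    # Check 1: a required attribute should not define a default value.
    # (Assumes "required" is expressed as usage="req".)
    if att_def.get("usage") == "req":
        problems.append("required attribute should not have a default value")

    # Check 2: with a closed value list, the default should be one of
    # the enumerated values. This naive version compares the whole
    # string content of <defaultVal> to each @ident.
    val_list = att_def.find(TEI + "valList")
    if val_list is not None and val_list.get("type") == "closed":
        idents = {item.get("ident") for item in val_list.findall(TEI + "valItem")}
        if "".join(default.itertext()) not in idents:
            problems.append("default value is not in the closed value list")
    return problems
```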

That second one causes a problem; I've hit a predictable snag. We
have explicitly (and unwisely IMHO) permitted whitespace in the value
of an attribute defined as a member of an enumerated set of values.
Thus it is perfectly legal in TEI to have
  <valItem ident="Bono"/>
  <valItem ident="Enya"/>
  <valItem ident="Diana"/>
  <valItem ident="Prince"/>
  <valItem ident="Madonna"/>
  <valItem ident="Pat Benetar"/>
  <valItem ident="Diana Prince"/>
  <valItem ident="Michael Jackson"/>
As I've argued before, this makes it difficult to distinguish several
single-word values from one multi-word value. E.g., presume the
above list is associated with
   <datatype minOccurs="1" maxOccurs="3">
     <rng:ref name="data.enumerated"/>
   </datatype>
Then is the value "Diana Prince" one value or two?[2] This problem
comes up because we actually have a <defaultVal> that holds multiple
singleton values: namely, the default for the optional @type of
<metDecl> is "met real". (AFAIK we don't have any @ident of
<valItem> that contains whitespace.)

Now, since the attribute allows 1-3 occurrences of its closed value
list of "met", "real", and "rhyme", we (humans) know that the default
value must be two singleton values, "met" and "real". But the
Schematron check is a different story. If I write the test to match
the datatype, I would use
      string( child::defaultVal ) = child::valList/valItem/@ident
But that means "met real" doesn't match. If I write the test to
presume whitespace-separated singletons, I would use
      tokenize( normalize-space( child::defaultVal ),' ')
      =
      child::valList/valItem/@ident
(And remember, these are XPath 2 expressions, so the '=' operator
operates on sequences, not single values.) The problem with writing
the test the second way is that a user who takes advantage of the
fact that we allow spaces in enumerated lists will get a spurious
error message when she tries
              <attDef ident="Q1" mode="add">
                <defaultVal>one two</defaultVal>
                <valList type="closed">
                  <valItem ident="one two"/>
                  <valItem ident="three four"/>
                </valList>
              </attDef>
(And, I'd have to do better than normalize-space() if we wanted to
take possible leading or trailing space into account.)
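To make the mismatch concrete, here is a rough Python emulation of
the two tests (not the Schematron itself). The function names are
made up for illustration; tokenize() is meant to mimic
tokenize(normalize-space(.), ' '), and the second test uses any() to
mimic the existential behaviour of '=' on sequences.

```python
import re


def tokenize(value):
    """Mimic XPath tokenize(normalize-space(value), ' ')."""
    normalized = re.sub(r"[ \t\r\n]+", " ", value).strip(" ")
    return normalized.split(" ") if normalized else []


def string_test(default_val, idents):
    """First method: the whole default value must equal some @ident."""
    return default_val in idents


def token_test(default_val, idents):
    """Second method: true if any whitespace-separated token equals some @ident."""
    return any(token in idents for token in tokenize(default_val))


met_decl = {"met", "real", "rhyme"}
q1 = {"one two", "three four"}

print(string_test("met real", met_decl))  # False: the string method rejects it
print(token_test("met real", met_decl))   # True
print(string_test("one two", q1))         # True
print(token_test("one two", q1))          # False: the spurious error
```

Note that because the comparison is existential, the second test as
written would also accept a default in which only one of the tokens
is in the list.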

I'm really not sure what to do about this. Possibilities include:

 1) Use the first method (string), and change the build process so
    that the error on "met real" is ignored.

 2) Change the definition of <metDecl> so that the default value of
    @type is discussed in prose, not given in <defaultVal>.

 3) Change the test to use the first method (string) iff maxOccurs is
    1 (or unspecified), and to use the second method (tokenize) iff
    maxOccurs is 2+ (or "unbounded").
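A minimal sketch of how (3) might dispatch, again in Python rather
than Schematron. The function name and the way maxOccurs is passed in
are hypothetical. One assumption to flag: in the tokenized branch
this sketch requires every token to be in the list, which is stricter
than the '=' comparison on sequences quoted earlier.

```python
import re


def default_is_valid(default_val, idents, max_occurs=None):
    """Option 3: choose the comparison based on maxOccurs.

    max_occurs is the attribute value as a string ("1", "3",
    "unbounded") or None when maxOccurs is not specified.
    """
    multiple = max_occurs == "unbounded" or (
        max_occurs is not None and int(max_occurs) >= 2
    )
    if not multiple:
        # maxOccurs is 1 or unspecified: compare the whole string.
        return default_val in idents

    # maxOccurs is 2+ or "unbounded": compare whitespace-separated tokens.
    normalized = re.sub(r"[ \t\r\n]+", " ", default_val).strip(" ")
    tokens = normalized.split(" ") if normalized else []
    # Assumption: every token must match (not just one of them).
    return bool(tokens) and all(token in idents for token in tokens)
```

With this, "met real" passes for <metDecl> when maxOccurs is "3", and
"one two" passes for the Q1 example when maxOccurs is absent. A
multi-word @ident combined with maxOccurs of 2 or more is one of the
harder cases this does not handle.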

Other suggestions?

I think (1) and (2) are pretty much cop-outs, but not crazy. I think
either may be acceptable as a temporary hack, but neither is a good
long-term solution. I think (3) does the easy cases right, but I
haven't thought through the harder cases. (That said, probably no one
should be using the harder cases.)

Notes
-----
[1] https://sourceforge.net/p/tei/feature-requests/546/?page=0
[2] To complicate things we've (very unwisely IMHO) defined @ident of
    <valItem> and the content of <defaultVal> as rng:string and
    <rng:text>, respectively, as opposed to rng:token (or better yet,
    something that does not allow whitespace like data.word). That
    said
    a) our ODD processing software silently converts the value of @ident
       of <valItem> from rng:string to rng:token
    b) even using rng:token doesn't really fix this problem
    c) no matter what you do, you probably don't get the results you
       expect 
    Let me know if you want evidence of (c).

Syd Bauman