datatypes coming back to bite us

I've implemented (3). I presume it's OK to back it out during the freeze, so poke at it and speak up if there are objections.

As you may recall, on our conference call on Friday we agreed that I'd try to get Schematron checks for the default value problems Lou pointed out in FR 546[1] ready for inclusion in the next release. The cases that we'd like to check are:
* If the attribute being defined is required, a default value should not be defined.
* If the attribute being defined has a closed list of values, then the default value should be among the enumerated list.
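The first of these checks is mechanical. A minimal sketch in Python (not the Schematron itself), assuming that a required attribute is marked with usage="req" on <attDef> and that the elements are in the TEI namespace; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

# Assumption: the ODD elements live in the TEI namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def required_with_default(att_def):
    """Return True if a required attribute also declares a default value."""
    is_required = att_def.get("usage") == "req"
    has_default = att_def.find(TEI_NS + "defaultVal") is not None
    return is_required and has_default


# Hypothetical sample: a required attribute that (wrongly) has a default.
BAD = """
<attDef xmlns="http://www.tei-c.org/ns/1.0" ident="Q0" usage="req">
  <defaultVal>one</defaultVal>
</attDef>
""".strip()

# Hypothetical sample: an optional attribute with a default, which is fine.
GOOD = BAD.replace('usage="req"', 'usage="opt"')

bad_flagged = required_with_default(ET.fromstring(BAD))
good_flagged = required_with_default(ET.fromstring(GOOD))
```

Here bad_flagged is the case the check should report, and good_flagged is the case it should leave alone.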
That second one causes a problem; I've hit a predictable snag. We have explicitly (and unwisely IMHO) permitted whitespace in the value of an attribute defined as a member of an enumerated set of values. Thus it is perfectly legal in TEI to have

    <valItem ident="Bono"/>
    <valItem ident="Enya"/>
    <valItem ident="Diana"/>
    <valItem ident="Prince"/>
    <valItem ident="Madonna"/>
    <valItem ident="Pat Benetar"/>
    <valItem ident="Diana Prince"/>
    <valItem ident="Michael Jackson"/>

As I've argued before, this makes it difficult to differentiate multiple singleton values from a single value that itself contains whitespace. E.g., presume the above list is associated with

    <datatype minOccurs="1" maxOccurs="3">
      <rng:ref name="data.enumerated"/>
    </datatype>

Then is the value "Diana Prince" one value or two?[2]

This problem comes up because we actually have a <defaultVal> that has multiple singleton values: namely, the default for the optional @type of <metDecl> is "met real". (AFAIK we don't have any @ident of <valItem> that contains whitespace.)
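To make the ambiguity concrete, here is a small Python sketch (an illustration only, not part of any TEI tooling) that enumerates every way a value can be read as a sequence of items from the closed list. The function name is hypothetical, and Python's str.split() is used as an approximation of XML whitespace splitting:

```python
IDENTS = [
    "Bono", "Enya", "Diana", "Prince", "Madonna",
    "Pat Benetar", "Diana Prince", "Michael Jackson",
]


def readings(value, idents, max_occurs):
    """Return every way to read `value` as up to `max_occurs` list items."""
    words = value.split()  # approximation of XML whitespace splitting
    allowed = set(idents)
    results = []

    def extend(start, items):
        if start == len(words):
            results.append(items)  # consumed every word: a complete reading
            return
        if len(items) == max_occurs:
            return  # words remain, but no more items are allowed
        for end in range(start + 1, len(words) + 1):
            candidate = " ".join(words[start:end])
            if candidate in allowed:
                extend(end, items + [candidate])

    extend(0, [])
    return results


ambiguous = readings("Diana Prince", IDENTS, 3)
unambiguous = readings("Diana Prince", IDENTS, 1)
```

With maxOccurs of 3 the value has two readings (two items, or one item containing a space); with maxOccurs of 1 only the single-item reading survives.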
Now, since the attribute allows 1-3 occurrences of its closed value list of "met", "real", and "rhyme", we (humans) know that the default value must be two singleton values, "met" and "real". But the Schematron check is a different story. If I write the test to match the datatype, I would use

    string( child::defaultVal ) = child::valList/valItem/@ident

But that means "met real" doesn't match. If I write the test to presume whitespace-separated singletons, I would use

    tokenize( normalize-space( child::defaultVal ),' ') = child::valList/valItem/@ident

(And remember, these are XPath 2 expressions, so the '=' operator operates on sequences, not single values.) The problem with writing the test the second way is that a user who takes advantage of the fact that we allow spaces in enumerated lists will get a false error message when she tries

    <attDef ident="Q1" mode="add">
      <defaultVal>one two</defaultVal>
      <valList type="closed">
        <valItem ident="one two"/>
        <valItem ident="three four"/>
      </valList>
    </attDef>

(And, I'd have to do better than normalize-space() if we wanted to take possible leading or trailing space into account.)
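The behaviour of the two tests can be modelled outside Schematron. The following Python sketch is only an approximation of the XPath expressions above; the function names are hypothetical. It also models one further point, on the understanding that an XPath 2 general comparison with '=' is true as soon as any pair of items matches: under that reading the tokenized test accepts a default in which only one token is a legal value, so an "every token" variant is shown for contrast.

```python
import re


def xml_tokens(text):
    """Split on runs of XML whitespace (space, tab, CR, LF), dropping empties."""
    return [tok for tok in re.split(r"[ \t\r\n]+", text) if tok]


def string_test(default_val, idents):
    """Models: string(child::defaultVal) = child::valList/valItem/@ident"""
    return default_val in idents


def tokenize_any_test(default_val, idents):
    """Models the tokenized test with '=': true if ANY token is a legal value."""
    return any(tok in idents for tok in xml_tokens(default_val))


def tokenize_every_test(default_val, idents):
    """Stricter variant (an assumption, not the quoted test): EVERY token must match."""
    tokens = xml_tokens(default_val)
    return bool(tokens) and all(tok in idents for tok in tokens)


MET = ["met", "real", "rhyme"]      # closed list for @type of <metDecl>
Q1 = ["one two", "three four"]      # closed list from the Q1 example

met_by_string = string_test("met real", MET)
met_by_tokens = tokenize_any_test("met real", MET)
q1_by_string = string_test("one two", Q1)
q1_by_tokens = tokenize_any_test("one two", Q1)
```

The string test rejects "met real" and accepts "one two"; the tokenized test does the reverse, which is exactly the snag described above.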
I'm really not sure what to do about this. Possibilities include:
1) Use the first method (string), and change the build process so that the error on "met real" is ignored.
2) Change the definition of <metDecl> so that the default value of @type is discussed in prose, not given in <defaultVal>.
3) Change the test to use the first method (string) iff maxOccurs is 1 (or unspecified), and to use the second method (tokenize) iff maxOccurs is 2+ (or "unbounded").
Other suggestions?
I think (1) and (2) are pretty much cop-outs, but not crazy. I think either may be acceptable as a temporary hack, but neither is a good long-term solution. I think (3) does the easy cases right, but I haven't thought through the harder cases. (That said, probably no one should be using the harder cases.)
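As a rough model of what (3) amounts to, here is a Python sketch; it is not the Schematron that would go into the build, and the function names are hypothetical. It assumes maxOccurs arrives as the attribute's string value (or None when unspecified), and it requires every token to be a legal value, which is stricter than a bare sequence comparison with '=' and is an assumption of this sketch:

```python
import re


def xml_tokens(text):
    """Split on runs of XML whitespace (space, tab, CR, LF), dropping empties."""
    return [tok for tok in re.split(r"[ \t\r\n]+", text) if tok]


def default_ok(default_val, idents, max_occurs=None):
    """Option (3): pick the comparison method based on maxOccurs.

    max_occurs is the raw attribute value: None (unspecified), a digit
    string such as "3", or "unbounded".
    """
    if max_occurs is None or max_occurs == "1":
        # Single-valued attribute: compare the whole string.
        return default_val in idents
    # Multi-valued attribute: treat the default as whitespace-separated items.
    tokens = xml_tokens(default_val)
    return bool(tokens) and all(tok in idents for tok in tokens)


easy_multi = default_ok("met real", ["met", "real", "rhyme"], "3")
easy_single = default_ok("one two", ["one two", "three four"])
hard_case = default_ok("one two", ["one two", "three four"], "2")
```

The first two calls are the easy cases and both pass. The third is one of the harder cases: a multi-valued attribute whose legal values themselves contain spaces, where this approach still reports a false error.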
Notes
-----
[1] https://sourceforge.net/p/tei/feature-requests/546/?page=0
[2] To complicate things we've (very unwisely IMHO) defined @ident of <valItem> and the content of <defaultVal> as rng:string and <rng:text>, respectively, as opposed to rng:token (or better yet, something that does not allow whitespace, like data.word). That said,
    a) our ODD processing software silently converts the value of @ident of <valItem> from rng:string to rng:token;
    b) even using rng:token doesn't really fix this problem;
    c) no matter what you do, you probably don't get the results you expect.
    Let me know if you want evidence of (c).
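A small Python sketch of point (b), assuming the usual RELAX NG behaviour in which the built-in "string" datatype compares values exactly while "token" first strips leading and trailing whitespace and collapses internal runs to a single space. The function names are hypothetical:

```python
import re


def as_rng_string(value):
    """Model of the 'string' datatype: the value is compared as-is."""
    return value


def as_rng_token(value):
    """Model of the 'token' datatype: whitespace is normalized before comparison."""
    return " ".join(tok for tok in re.split(r"[ \t\r\n]+", value) if tok)


raw = "  Diana \n Prince "
string_matches = as_rng_string(raw) == "Diana Prince"
token_matches = as_rng_token(raw) == "Diana Prince"
still_has_space = " " in as_rng_token(raw)
```

Normalizing makes sloppy spacing match, but the normalized value still contains a space, so a token-typed value can still be read as one item or two.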
Participants (1): Syd Bauman