New subject: [tei-council] stand-alone Schematron extraction improvements

16 Oct 2016

There are 3 different processes (that I am immediately aware of) for
getting the ISO Schematron constraints from an ODD:

  1) teitoschematron
  2) generate a RELAX NG schema
  3) extract the constraints

My homework assignment, i.e. this post, is supposed to be about
(3). However, it is worth noting that, through a process similar to
the first few steps Martin outlined in homework #1, (1) just calls the
same routine as (3). (However, it does not merge a customization ODD
with P5 source ODD before processing. It probably should.)

So here goes. The big-picture extraction process is *really* simple. A
single stylesheet, Stylesheets/odds/extract-isosch.xsl, is run with an
ODD file as input; it produces ISO Schematron as output.  To be most
useful, the input should be the "merged" ODD file created when a
customization is applied to the P5 source. The output is a (mostly)
usable file with the (hopefully) correct ISO Schematron code.

But the devil is in the details ... this is pretty complicated stuff.

"Why is it complicated?" you ask; "wouldn't just yanking everything
that is in the ISO Schematron namespace out and copying it over do the
trick?" Turns out that won't work, for several reasons, two of which
I'll mention here. First, because the rules of <constraintSpec> say
that if it is in (e.g.) an <elementSpec>, you don't have to specify a
@context on an <sch:rule>, rather the context is assumed to be the
element being defined; so this extraction process has to build an
<sch:rule> with the right @context. Second, because some of the
constraints are not expressed in <constraintSpec>, but rather are
expressed by @validUntil.
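
To make the first point concrete, here is a minimal sketch. The constraint itself, its @ident, and the generated @id are invented for illustration, and the real output of extract-isosch.xsl may well differ in detail. Given something like this in the ODD, where no <sch:rule> (and so no @context) is supplied:

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             ident="div" module="textstructure" mode="change">
  <constraintSpec ident="div-needs-head" scheme="isoschematron">
    <constraint>
      <!-- a bare assert: the context is implied by the enclosing elementSpec -->
      <sch:assert test="tei:head">A division should have a heading.</sch:assert>
    </constraint>
  </constraintSpec>
</elementSpec>
```

the extraction has to produce something along these lines, with the @context built from the element being defined:

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             id="hypothetical-generated-id">
  <!-- the rule and its context are generated, not copied -->
  <sch:rule context="tei:div">
    <sch:assert test="tei:head">A division should have a heading.</sch:assert>
  </sch:rule>
</sch:pattern>
```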

But that said, the basic underlying process is, indeed, to find all
the stuff in the ISO Schematron namespace and copy it to the
output. And therein lies the first problem.

In a TEI ODD, the <schemaSpec> element is (perhaps unwisely)
repeatable.[1] The `roma` commandline tool only processes the first
<schemaSpec> that is encountered (in document order). I think
odd2odd.xsl does the same thing (unless a different <schemaSpec> is
specified via the 'selectedSchema' parameter).[2] This leads to a
potential mismatch: if a single ODD file defines two schemas, the
Schematron from both of them will be extracted by extract-isosch.xsl,
whereas the rest of the schema and custom documentation will be built
only from the first schema defined.
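
For illustration, a hypothetical ODD of the problematic sort might be shaped like this. The @ident values and both constraints are made up, and most of what a real ODD would contain is omitted:

```xml
<body xmlns="http://www.tei-c.org/ns/1.0"
      xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- first <schemaSpec> in document order: the one the schema is built from -->
  <schemaSpec ident="first_schema" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="core"/>
    <moduleRef key="header"/>
    <moduleRef key="textstructure"/>
    <elementSpec ident="p" module="core" mode="change">
      <constraintSpec ident="p-not-empty" scheme="isoschematron">
        <constraint>
          <sch:assert test="normalize-space(.) != ''">A paragraph should not be empty.</sch:assert>
        </constraint>
      </constraintSpec>
    </elementSpec>
  </schemaSpec>
  <!-- second <schemaSpec>: skipped by default when building the schema,
       but its Schematron is still found in the unmerged file -->
  <schemaSpec ident="second_schema" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="core"/>
    <moduleRef key="header"/>
    <moduleRef key="textstructure"/>
    <elementSpec ident="head" module="core" mode="change">
      <constraintSpec ident="head-not-empty" scheme="isoschematron">
        <constraint>
          <sch:assert test="normalize-space(.) != ''">A heading should not be empty.</sch:assert>
        </constraint>
      </constraintSpec>
    </elementSpec>
  </schemaSpec>
</body>
```

Run extract-isosch.xsl on a file like this as-is, and both constraints end up in the Schematron output, although only first_schema is used for everything else.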

This is not as big a problem as it sounds. First, nobody (that I
know of) actually puts two <schemaSpec>s in a single file. Second, the
extract-isosch.xsl program is typically run on an ODD that has already
been "merged". So even if the input to the merge process had two or
more <schemaSpec>s, the output has only one.

So what does extract-isosch.xsl do? A lot, actually. Here's an
overview with some details thrown in.

 1) On matching the root of the input ODD, the entire input tree is
    processed in two passes. This template is found circa line 129.

 2) Pass #1 "decorates" the input tree with extra attributes about the
    namespaces. That is, pass #1 (which is mode "NSdecoration") is an
    identity transform EXCEPT that each <attDef>, <elementSpec>, and
    <schemaSpec> gets two new attributes:

    @nsu (namespace URI) = the URI of the namespace of the construct
       being defined; the default is the TEI namespace for elements
       and no namespace for attributes.

    @nsp (namespace prefix) = a prefix for use with the @nsu. An
       intelligent one is chosen heuristically if possible; if not, one
       is just created out of thin air.

 3) The results of pass #1 are then processed in pass #2, which starts
    out in mode "schematron-extraction".[3] A skeleton of the output
    schema is spit out with 6 major sections:
    * namespaces, declared
    * namespaces, implicit
    * keys (including problematic ones)
    * the constraints themselves
    * deprecations
    * paramLists

 4) Namespaces, declared. All <sch:ns> elements (except those inside
    an <egXML>) that are in the right language and do not have a
    prefix of "xsl" are copied over to the output. (At the moment, I
    cannot think of why the XSLT namespace is exempted, or moreover,
    why it is found by testing the prefix, not the URI.)

 5) Namespaces, implicit. All distinct namespaces that we calculated
    in pass 1 are converted to <sch:ns> elements and copied into the
    output. Again, the XSLT namespace is exempted, this time by URI.[4]

 6) Keys. All <xsl:key> elements (except those inside an <egXML>) that
    are in the right language are processed. (Said processing is that
    they are copied over to the output.)

 7) A warning message is generated if any <sch:key> elements are
    encountered. (Because there is no <sch:key> element. :-)

 8) Constraints. Every <constraint> (except those inside an <egXML>)
    that is in the right language, and is in a <constraintSpec
    scheme=isoschematron>[5] is processed as follows.

    a) if there is a child <sch:pattern>, the children (of the
       <constraint>) are processed

    b) if there is a child <sch:rule>, a wrapper <sch:pattern> (with a
       generated @id) is output and children (of the constraint) are
       processed and the output of that is put within the
       <sch:pattern>.

    c) if there is a child <sch:assert> or <sch:report>, both a wrapper
       <sch:pattern> (with a generated @id) and its child <sch:rule>
       (with a generated @context) are generated, and children are
       processed within.

 9) Deprecations. The code is only set up (at the moment) to handle
    cases of @validUntil that actually occur in the _Guidelines_ at
    the time said code was written. (E.g., there is code to handle
    elementSpec//attDef/@validUntil, but not constraintSpec/
    @validUntil, because even though it is valid, there are no
    <constraintSpec>s with @validUntil in the GLs.) The list is as
    follows; note that @validUntil inside an <egXML> is always
    ignored.
    * elementSpec//attDef/@validUntil
    * classSpec//attDef/@validUntil
    * elementSpec/@validUntil
    * elementSpec//valItem/@validUntil[6]

 10) Deprecation, elementSpec//attDef. An <sch:pattern> with a child
     <sch:rule> whose @context is set to the element being defined by
     the <elementSpec> is output. (Luckily, we put an attribute on
     that <elementSpec> back in pass 1 that gives us its namespace
     prefix.) That <sch:rule> has a child <sch:report> that fires
     whenever the attribute being specified is found (again, we can
     use the namespace prefix we ascertained in pass 1).[7]

 11) Deprecation, classSpec//attDef. Same idea as above, except that
     developing the @context is harder because it might be any element
     that is a member of the class. Note that this code only searches
     for elements that are members of the class being defined -- it
     does NOT look for elements that are members of a class that is a
     member of the class being defined, or members of a class that is
     a member of a class that is a member of the class being
     defined.[8]

 12) Deprecation, elementSpec. An sch:pattern/sch:rule/sch:report is
     output. The <sch:rule> has a @context set to the element being
     defined. (Again, using the namespace prefix inserted in pass 1.)
     The report has a @test set to the string (not the function)
     "true()". (Thus, when interpreted by Schematron, it will be the
     function "true()".) So the <sch:report> fires whenever the
     condition on its parent sch:rule/@context is met.

 13) Deprecation, elementSpec//valItem. Similar to the above, but
     instead of a @test of "true()", the test is for the specific
     value being defined.

 14) Parameter lists. I really don't understand this well, in large
     part because I've never seen any output generated from it. (Even
     building tei_simple does not fire this code.) But it looks to me
     like we will get an <sch:pattern> (with a nice @id) that has a
     child <sch:rule> that fires on <param> elements that are inside a
     <model> whose @behaviour is the one being currently defined. That
     <sch:rule> will have a child <sch:assert> that tests that the
     @name (of the <param>) matches one of the names being defined in
     this <paramList>.
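
As a rough illustration of how the pass 1 decoration (step 2) feeds the element deprecation handling (step 12), here is a sketch. The element name, module, date, @id, and message text are all made up, and I am assuming that @nsu and @nsp land on the <elementSpec> as plain unqualified attributes. After pass 1, a deprecated <elementSpec> might look like:

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="oldThing" module="hypothetical"
             validUntil="2017-12-31"
             nsu="http://www.tei-c.org/ns/1.0" nsp="tei">
  <!-- description, classes, content model, etc. omitted -->
</elementSpec>
```

and pass 2 would then emit something shaped like this for it:

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             id="hypothetical-deprecation-id">
  <!-- context = @nsp + ':' + @ident from the decorated elementSpec -->
  <sch:rule context="tei:oldThing">
    <!-- test is always true, so the report fires on every tei:oldThing -->
    <sch:report test="true()">The oldThing element is deprecated and may be withdrawn after 2017-12-31.</sch:report>
  </sch:rule>
</sch:pattern>
```

The wording of the real message generated by the stylesheet is surely different; the shape (pattern, rule, always-true report) is the point.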

Worth noting that in cases where the <constraint> is inside an
<attDef>, the value of the resulting sch:rule/@context is an
attribute. I do not see why this should be a problem, nor have I found
anything in ISO 19757-3:2006 that suggests it should be a
problem. However, I know at least 1 Schematron processor fails to work
correctly in this case.
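
A sketch of such a case follows. The constraint is invented, and the exact form of the generated @context is my assumption; the point is only that it selects an attribute node rather than an element. An ODD fragment like:

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             ident="div" module="textstructure" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <constraintSpec ident="div-type-not-blank" scheme="isoschematron">
        <constraint>
          <!-- here "." is the attribute itself, not the div -->
          <sch:assert test="normalize-space(.) != ''">The type attribute should not be blank.</sch:assert>
        </constraint>
      </constraintSpec>
    </attDef>
  </attList>
</elementSpec>
```

would give a rule along these lines:

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron"
             id="hypothetical-generated-id">
  <!-- the context node is an attribute -->
  <sch:rule context="tei:div/@type">
    <sch:assert test="normalize-space(.) != ''">The type attribute should not be blank.</sch:assert>
  </sch:rule>
</sch:pattern>
```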

Notes
-----
 [1] Bizarrely, it is model.divPart; I would have thought that, if
     anything, it would be in model.divLike.

 [2] This means, I think, that the work of extracting the @ident of
     the first <schemaSpec> is performed twice when you run `roma`.

 [3] I have to admit, I don't fully understand why that template,
     matching "/" in mode "schematron-extraction", fires. The content
     of the variable that is selected by the <apply-templates> is all
     of the nodes that are children of the document node, not the
     document node itself. I also am not quite sure why we pass the
     <TEI> element in as a parameter, rather than just set a variable
     from within the template.

 [4] The result of the two steps to generate <sch:ns> namespace
     declarations for the Schematron may well be that a single
     namespace URI may be bound to two occurrences of the same prefix,
     or to two or more different prefixes. I do not see why this
     should be a problem, nor have I found anything in ISO
     19757-3:2006 that suggests it should be a problem. However, I am
     aware of at least 1 Schematron processor that gets very upset by
     occurrences of <sch:ns> that have either the same @prefix or the
     same @uri as another occurrence of <sch:ns>.

 [5] Uh-oh. That should now be changed to process those with
     scheme=schematron, too.

 [6] The complete set of cases that exist in the _Guidelines_ is:
       1 classSpec/attList/attDef/defaultVal/@validUntil
       2 elementSpec/@validUntil
       4 elementSpec/attList/attDef/@validUntil
       4 elementSpec/attList/attDef/defaultVal/@validUntil
      28 macroSpec/@validUntil
     Martin & I don't think there is anything to be done for the
     <defaultVal> cases (after all, a processor would not know if the
     value was specified or defaulted, which is why we don't like them
     in the first place! :-) So the only case that is not handled but
     perhaps should be is the most common: <macroSpec>.

 [7] I just noticed that although we test for the attribute correctly
     using the namespace prefix, the message includes just the
     attribute name -- it does not include the namespace prefix. Since
     in 99% of all cases, and in 100% of actual current cases, there
     is no namespace prefix, this doesn't matter much. Nonetheless, it
     should probably be fixed.

 [8] I just noticed that both the definition of and reference to
     $fqgis (which stands for "fully qualified generic identifiers", I
     believe) are separated with union operators (aka "or bars",
     '|'). That is probably an error that won't cause a problem,
     because (I am guessing) by the time it is referenced it is
     already a single string, so the @separator has no effect. Also
     the @test of this report does not use the namespace prefix;
     probably should.
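
To illustrate note [4]: between them, the two namespace steps could yield declarations like the following. The prefixes here are hypothetical, and the patterns are left out:

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- copied over from an <sch:ns> in the ODD (step 4) -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- generated from the pass 1 decoration (step 5): same prefix, same URI -->
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- or the same URI bound to a different, invented prefix -->
  <sch:ns prefix="t" uri="http://www.tei-c.org/ns/1.0"/>
  <!-- patterns omitted -->
</sch:schema>
```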

-- 
 Syd Bauman, EMT-Paramedic
 Senior XML Programmer/Analyst
 Northeastern University Women Writers Project
 s.bauman@northeastern.edu or
 Syd_Bauman@alumni.Brown.edu

co-operative stylesheet education session #1 homework #2
