
I very much like the idea of an attribute on which the Unicode standard notation would be used. I do not like the idea of just using NCRs, because (as far as an XML processor is concerned) that's the same as using the characters themselves, and one should not have characters from the PUA in a document prepared for interchange.

Going with an attribute raises a few questions:

* What would it be named?
* What would be used to separate multiple characters? (I think I lean towards whitespace, meaning one would have to use "U+0020" if one really wanted a space.)
* Would it be mutually exclusive with content?

Using whitespace to separate would allow us to set up a datatype for "U\+[0-9A-F]{4,6}", and then have the attribute be minOccurs=1 maxOccurs=unbounded of that datatype, meaning it would be whitespace separated. Yay!

As for being mutually exclusive with content, there are advantages and disadvantages. Allowing <mapping usn="U+00B5">µ</mapping> gives the user both an extra place to make a mistake (by having "[" by accident) and an extra way of validating that everything is alright.

Note-to-self
------------
A proof-of-concept for testing that @usn matches content (done in XSLT, not Schematron, and does not work on multiple characters):

--------- begin xslt snippet ---------
<xsl:template match="mapping">
  <!-- hex digits from @usn, e.g. "00B5" from "U+00B5" -->
  <xsl:variable name="ucp" select="substring-after( @usn, '+')"/>
  <!-- code point(s) of the element content, as integers -->
  <xsl:variable name="ccp" select="string-to-codepoints(.)"/>
  <!-- the content code point as hex digits, with any spaces stripped -->
  <xsl:variable name="ccph" select="wf:decimal2hexDigits( $ccp ) ! translate(., ' ', '') => string-join()"/>
  <xsl:message select="'Debug: ucp='||$ucp||', ccp='||$ccp||', ccph='||$ccph"/>
  <xsl:choose>
    <!-- compare after dropping leading zeroes from the @usn digits -->
    <xsl:when test="replace( $ucp, '^0+','') eq $ccph">
      <xsl:value-of select="'Yay! '||@usn||'='||.||' '"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="'Sigh '||@usn||'≠'||.||' '"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
--------- end xslt snippet ---------

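The proof-of-concept above handles only a single character. The following is a minimal sketch, not a tested implementation, of how the same check might be extended to a whitespace-separated @usn with several code points. Everything in the ex: namespace is hypothetical and introduced here only for illustration; the sketch also assumes that <mapping> is in no namespace (set xpath-default-namespace if it is) and that the tokens follow the "U\+[0-9A-F]{4,6}" pattern proposed above. It avoids the wf:decimal2hexDigits helper by converting the @usn tokens to integers rather than converting the content to hex.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/ns/usn-sketch"
    exclude-result-prefixes="xs ex">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: upper-case hex digits (e.g. '00B5') to an integer (181).
       Callers must pass only [0-9A-F]; other characters would count as 0. -->
  <xsl:function name="ex:hex-to-int" as="xs:integer">
    <xsl:param name="hex" as="xs:string"/>
    <xsl:variable name="len" select="string-length($hex)"/>
    <xsl:sequence select="
      if ($len eq 0) then 0
      else 16 * ex:hex-to-int(substring($hex, 1, $len - 1))
           + string-length(substring-before('0123456789ABCDEF',
                                            substring($hex, $len)))"/>
  </xsl:function>

  <xsl:template match="mapping[@usn]">
    <!-- Split @usn on whitespace -->
    <xsl:variable name="tokens" as="xs:string*"
        select="tokenize(normalize-space(@usn), ' ')"/>
    <!-- Tokens that do not match the proposed datatype pattern -->
    <xsl:variable name="bad" as="xs:string*"
        select="$tokens[not(matches(., '^U\+[0-9A-F]{4,6}$'))]"/>
    <!-- Code points claimed by @usn (well-formed tokens only) -->
    <xsl:variable name="expected" as="xs:integer*"
        select="for $t in $tokens[matches(., '^U\+[0-9A-F]{4,6}$')]
                return ex:hex-to-int(substring-after($t, '+'))"/>
    <!-- Code points actually present in the content -->
    <xsl:variable name="actual" as="xs:integer*"
        select="string-to-codepoints(string(.))"/>
    <xsl:choose>
      <xsl:when test="exists($bad)">
        <xsl:value-of select="concat('Malformed @usn: ', @usn, '&#10;')"/>
      </xsl:when>
      <xsl:when test="deep-equal($expected, $actual)">
        <xsl:value-of select="concat('Yay! ', @usn, ' = ', ., '&#10;')"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="concat('Sigh ', @usn, ' ≠ ', ., '&#10;')"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Suppress all other text so only the check results are output -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

With this sketch, <mapping usn="U+00B5 U+0041">µA</mapping> should take the "Yay!" branch, since both sides come out as the code point sequence (181, 65). A lower-case token such as "u+00b5" would be reported as malformed, because it does not match the proposed pattern.
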
This is something I've wondered about in the past myself. I think, though, that if the content of the element is to be allowed to include "straight character content", then it shouldn't also allow the standard Unicode representation; I'd rather have an attribute where that could be provided. Either that or, in place of U+00B5, we should insist on &#xB5; (i.e. a numeric character reference), which will sit more easily in a context where character content also appears.