
There are 21 examples of <mapping> in the _Guidelines_, and 4 different methods of representing the content thereof (although three of these methods are variants of the same idea). The content model of <mapping> is macro.xtext, thus it can have only character data or <g>. Of the 21, the desired mapping is represented in the content as:

* straight character content in 15 (1 of those is multi-character, the rest are single characters);
* numeric character references[1] in 3 (1 of those is multi-character, the others are single characters);
* a <g> element in 1; and
* the standard Unicode representation[2] in 2, i.e. they match "U\+[0-9A-F]{4,6}".

The first three are variations on the same theme: the *content* of <mapping> *is* the character to which the character or glyph being defined should be mapped (whether expressed as character data, NCR(s), <g>(s), or some combination thereof). The last is entirely different. I think we can all agree that <mapping>U+00B5</mapping> means that the character or glyph currently being defined should be mapped to the single character µ = &#xB5; = &#181; = <g ref="#micro"/>, NOT to the sequence of 6 characters "U+00B5". (Note that in the examples in the _Guidelines_, the Unicode standard notational convention is used for all <mapping type="PUA">, and for no other types of <mapping>.)

But, as far as I know, *nowhere* in the _Guidelines_ is this use of Unicode standard notation mentioned. Although I may be wrong about this, I do not think it would ever make sense to map a <char> onto a literal string "U+0123" (or "V=3456"). Furthermore, I think use of Unicode notation is a *really* good idea, *especially* if the character being represented is in the PUA; and I suspect that the intention of whoever wrote this in the past was just that: Unicode standard convention should be used rather than actually putting PUA characters in a document instance.
Thus I think the right solution is to:

a) Have a _suggested values include_ list for mapping/@type (after all, we have it in the prose, sort of).
b) Add some prose to 5.2 (#D25-20) that explains that in general "U+0123" notation may be used instead of "&#x0123;" or "&#291;", and that if the mapping is to the PUA area, this is the (vastly) preferred method.
c) If anyone thinks it important to be able to map to absolutely arbitrary strings, then pick some value of @type (say "exact") for which use of "U+0123" would actually map to the string "U+0123".

Notes
-----
[1] Of course, they occur with "&amp;" instead of an actual ampersand if you are reading the source.
[2] Per the Unicode Standard Appendix A, "Notational Conventions".
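The reading argued for above, in which content written in Unicode standard notation names a character rather than being a literal string, can be sketched roughly as follows. This is only an illustration in Python, not anything taken from the _Guidelines_; the function name `resolve_mapping` is hypothetical, and the only thing borrowed from the message is the pattern "U\+[0-9A-F]{4,6}".

```python
import re

# Pattern quoted in the message: "U+" followed by 4 to 6 uppercase hex digits.
USN = re.compile(r"U\+[0-9A-F]{4,6}")


def resolve_mapping(content: str) -> str:
    """Return the character(s) that a <mapping> content string stands for.

    Hypothetical helper: if the whole content is Unicode standard notation,
    decode it to the single character it names; otherwise the content
    itself is taken to be the target.
    """
    if USN.fullmatch(content):
        # The pattern also admits values above U+10FFFF; chr() raises
        # ValueError for those, which seems acceptable for a sketch.
        return chr(int(content[2:], 16))
    return content


# "U+00B5" resolves to MICRO SIGN, not to a six-character string.
print(resolve_mapping("U+00B5") == "\u00b5")
# Straight character content is left alone.
print(resolve_mapping("\u00b5") == "\u00b5")
# A PUA target can be named without the PUA character appearing in the source.
print(len(resolve_mapping("U+E000")))
```

Under option (c), a processor would presumably skip the decoding step when @type has the agreed value (say "exact") and return the content unchanged.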

Hi Syd,

This is something I've wondered about in the past myself. I think, though, that if the content of the element is to be allowed to include "straight character content", then it shouldn't also allow the standard Unicode representation; I'd rather have an attribute where that could be provided. Either that, or in place of U+00B5 we should insist on &#xB5; (i.e. a numeric character reference), which will sit more easily in a context where character content also appears.

Cheers,
Martin

On 2020-02-19 6:09 a.m., Syd Bauman wrote:

I very much like the idea of an attribute on which the Unicode standard notation would be used. I do not like the idea of just using NCRs, because (as far as an XML processor is concerned) that's the same as using characters, and one should not have characters from the PUA in a document prepared for interchange.

If we went with an attribute, it brings up a few questions:

* What would it be named?
* What would be used to separate multiple characters? (I think I lean towards whitespace, meaning one would have to use "U+0020" if one really wanted a space.)
* Would it be mutually exclusive with content?

Using whitespace to separate would allow us to set up a datatype for "U\+[0-9A-F]{4,6}", and then have the attribute be minOccurs=1 maxOccurs=unbounded of that datatype, meaning it would be whitespace separated. Yay!

As for being mutually exclusive with content, there are advantages and disadvantages. Allowing <mapping usn="U+00B5">µ</mapping> gives the user both an extra place to make a mistake (by having "[" by accident) and an extra way of validating that everything is alright.

Note-to-self
------------
A proof-of-concept for testing that @usn matches content (done in XSLT, not Schematron, and does not work on multiple characters):

--------- begin xslt snippet ---------
<xsl:template match="mapping">
  <!-- hex digits of the code point given on @usn (everything after the "+") -->
  <xsl:variable name="ucp" select="substring-after( @usn, '+')"/>
  <!-- decimal code point of the content -->
  <xsl:variable name="ccp" select="string-to-codepoints(.)"/>
  <!-- the same as hex digits; wf:decimal2hexDigits() is defined elsewhere -->
  <xsl:variable name="ccph" select="wf:decimal2hexDigits( $ccp ) ! translate(., ' ', '') => string-join()"/>
  <xsl:message select="'Debug: ucp='||$ucp||', ccp='||$ccp||', ccph='||$ccph"/>
  <xsl:choose>
    <!-- compare after dropping leading zeroes from the @usn digits -->
    <xsl:when test="replace( $ucp, '^0+','') eq $ccph">
      <xsl:value-of select="'Yay! '||@usn||'='||.||' '"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="'Sigh '||@usn||'≠'||.||' '"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
--------- end xslt snippet ---------
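For comparison, here is a rough sketch of the same check extended to multiple characters, assuming the whitespace-separated form of @usn floated above. It is written in Python rather than XSLT, and the function names `usn_to_text` and `usn_matches_content` are hypothetical. It compares decoded characters rather than hex strings, so leading zeroes need no special handling.

```python
import re

# One token of Unicode standard notation, per the proposed datatype.
USN_TOKEN = re.compile(r"U\+[0-9A-F]{4,6}")


def usn_to_text(usn: str) -> str:
    """Decode a whitespace-separated list of U+XXXX tokens to a string."""
    chars = []
    for token in usn.split():
        if not USN_TOKEN.fullmatch(token):
            raise ValueError(f"not Unicode standard notation: {token!r}")
        chars.append(chr(int(token[2:], 16)))
    return "".join(chars)


def usn_matches_content(usn: str, content: str) -> bool:
    """True when the @usn value and the element content name the same characters."""
    return usn_to_text(usn) == content


# MICRO SIGN on both sides.
print(usn_matches_content("U+00B5", "\u00b5"))
# GREEK SMALL LETTER MU in the content: the kind of slip the check would catch.
print(usn_matches_content("U+00B5", "\u03bc"))
# Two tokens: "a" followed by COMBINING ACUTE ACCENT.
print(usn_matches_content("U+0061 U+0301", "a\u0301"))
```

With whitespace as the separator, a literal space in the target has to be written "U+0020", as noted above; `usn_to_text("U+0061 U+0020 U+0062")` decodes to "a b".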

Hi Syd,

Good point on PUAs. I wasn't thinking clearly there. So the documentation would have to clarify that if you're using PUAs, you must use the attribute rather than the text content.

Do we need a separator for the standard notation? Can't we just do "U+2000U+2001" and so on?

What would it mean for the attribute value to be different from the text content? That's potentially so confusing that I think they probably should be mutually exclusive; you can add another <mapping> element if you want to provide a different kind of mapping.

Cheers,
Martin

On 2020-02-19 9:18 a.m., Syd Bauman wrote:
--
------------------------------------------
Martin Holmes
UVic Humanities Computing and Media Centre
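On the separator question raised in the message above: since "U" is not a hexadecimal digit, a run such as "U+2000U+2001" can in principle be split without ambiguity. The sketch below (Python; the function name `split_usn` is hypothetical) accepts both the concatenated and the whitespace-separated forms. It illustrates feasibility only and is not a recommendation for either form.

```python
import re

USN_TOKEN = re.compile(r"U\+[0-9A-F]{4,6}")


def split_usn(value: str) -> list:
    """Split a run of U+XXXX tokens, with or without whitespace between them."""
    tokens = USN_TOKEN.findall(value)
    # Anything the pattern did not consume (stray digits, lowercase hex,
    # other text) makes the value invalid.
    if not tokens or "".join(tokens) != "".join(value.split()):
        raise ValueError(f"not a sequence of U+XXXX tokens: {value!r}")
    return tokens


print(split_usn("U+2000U+2001"))
print(split_usn("U+2000 U+2001"))
```

One trade-off is that a schema datatype built as a whitespace-separated list of "U\+[0-9A-F]{4,6}" would not accept the concatenated form; that would need a single pattern such as "(U\+[0-9A-F]{4,6})+" instead.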