another freakin' inconsistency
I am writing here instead of on a Github ticket as this is either a Guidelines issue (if we decide to encode consistently) or a Stylesheets issue (if we decide to present consistently). Sigh.

Right now there appear to be roughly 3700 apostrophes in P5.[1] Of those, 3490 are encoded with U+0027 and 241 with U+2019.[2,3]

Of course, many of these are inside <egXML>, and thus maybe should be left alone; and some are inside comments (or maybe PIs), and we probably don’t care at all how they are encoded. So, refining the search to exclude those, there are only 2853 apostrophes,[4] of which 2689 are U+0027 and 190 are U+2019.[5,6]

Those that are encoded as U+0027 show up as U+0027 in the HTML output of the Guidelines. I think that is a sin.

The question is, what do we want to do about this? The following are (roughly) in my order of preference:

1. Change all the apostrophes to U+0027 in the encoded files, and change the Stylesheets to produce U+2019 on output. Update TCW20 or whatever to match. (I.e. to tell people to use U+0027, not U+2019.)
2. Leave the inconsistent encoding, and don’t bother to update the documentation, but change the Stylesheets so the output is more readable.
3. Change all the apostrophes to U+2019 in the encoded files and update documentation to match. (I.e. to tell people to use U+2019, not U+0027.) Also need to add some checks to the build process to prevent U+0027 from slipping in.
4. Change all the apostrophes to U+0027 in the encoded files, but do not change the Stylesheets, thus leaving our output ugly.
5. Leave things alone.

commands used
[1] egrep -c "[^=]['’][a-z]" P5/p5.xml
[2] egrep -c "[^=]'[a-z]" P5/p5.xml
[3] egrep -c "[^=]’[a-z]" P5/p5.xml
[4] xsel -t -m "//t:*/text()" -v "." -n P5/p5.xml | egrep -c "['’][a-z]"
[5] xsel -t -m "//t:*/text()" -v "." -n P5/p5.xml | egrep -c "'[a-z]"
[6] xsel -t -m "//t:*/text()" -v "." -n P5/p5.xml | egrep -c "’[a-z]"
[7] Note: xsel is an alias for an xmlstarlet sel command that binds “t:” to the TEI namespace. Thus the above 3 commands ignore elements in any other namespace, particularly the TEI Examples namespace.
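For anyone wanting to reproduce the counts in [4] to [6] without xmlstarlet, here is a minimal Python sketch of the same idea. It is not the command used above: the function names and the SAMPLE document are made up for illustration, and it relies on the standard-library ElementTree parser, which drops comments and PIs by default. One difference worth knowing: egrep -c counts matching lines, whereas this sketch counts individual matches, so its totals on P5/p5.xml would not necessarily agree with the figures quoted above.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample; the real input would be P5/p5.xml.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <p>It's the editor's choice; don\u2019t mix them.</p>
  <!-- a comment's apostrophe is ignored -->
  <egXML xmlns="http://www.tei-c.org/ns/Examples"><p>example's text</p></egXML>
</TEI>"""


def tei_text_nodes(root):
    """Yield the text-node children of every TEI-namespace element,
    roughly what the XPath //t:*/text() selects."""
    for elem in root.iter():
        if not elem.tag.startswith("{" + TEI_NS + "}"):
            continue  # e.g. anything in the TEI Examples namespace
        if elem.text:
            yield elem.text
        for child in elem:
            # Text that follows a child element still belongs to elem.
            if child.tail:
                yield child.tail


def count_apostrophes(xml_source):
    """Count apostrophes followed by a lowercase letter, per code point."""
    root = ET.fromstring(xml_source)
    straight = curly = 0
    for text in tei_text_nodes(root):
        straight += len(re.findall(r"'[a-z]", text))
        curly += len(re.findall("\u2019[a-z]", text))
    return {"U+0027": straight, "U+2019": curly}


if __name__ == "__main__":
    print(count_apostrophes(SAMPLE))
```

On the sample, the two straight apostrophes and one curly apostrophe in the TEI paragraph are counted, while the comment and the content of the Examples-namespace element are skipped.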
Hi Syd,

I would go with option 3, and add a Schematron rule to prevent the use of the straight apostrophe outside of code contexts. Straight apostrophes are definitely ugly, and especially so when they're mixed with curly ones.

I would use XSLT to fix the existing ones rather than a text search-and-replace, because then you can be much more sensitive to context.

I don't think we should change the Stylesheets because typographical issues like this aren't really a transformation issue. And we'd have to work through all the possible outputs, which seems like a nightmare to me.

Cheers,
Martin

On 2022-11-14 07:10, Bauman, Syd wrote:
_______________________________________________
Tei-council mailing list
Tei-council@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/tei-council
--
------------------------------------------
Martin Holmes
UVic Humanities Computing and Media Centre

I acknowledge and respect the lək̓ʷəŋən peoples on whose traditional territory the university stands and the Songhees, Esquimalt and WSÁNEĆ peoples whose historical relationships with the land continue to this day.
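To make the context-sensitive approach in the reply above concrete, here is a rough Python sketch. It is not the XSLT or Schematron Martin proposes, only an illustration of the same two steps: convert word-internal straight apostrophes in prose, then report any straight apostrophes left outside code contexts. Everything named here is an assumption for illustration, in particular the CODE_ELEMENTS list (a guess at what "code contexts" might include) and the rule that only an apostrophe between a letter and a lowercase letter is safe to convert automatically.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumption: a guessed list of "code context" elements, not an agreed one.
CODE_ELEMENTS = {"code", "eg", "gi", "att", "val", "ident"}
# Assumption: only letter + ' + lowercase letter (don't, it's) is converted.
WORD_APOSTROPHE = re.compile(r"(?<=[A-Za-z])'(?=[a-z])")

# Hypothetical sample input.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <p>Don't change <code>x = 'a'</code> here; it's code.</p>
</TEI>"""


def local_name(elem):
    return elem.tag.rsplit("}", 1)[-1]


def is_prose(elem):
    """True for TEI-namespace elements that are not code contexts."""
    return (elem.tag.startswith("{" + TEI_NS + "}")
            and local_name(elem) not in CODE_ELEMENTS)


def prose_slots(elem, in_prose=True):
    """Yield (node, attribute) pairs for each text node owned by a prose
    element; descendants of code or non-TEI elements are skipped."""
    in_prose = in_prose and is_prose(elem)
    if in_prose:
        yield elem, "text"
    for child in elem:
        yield from prose_slots(child, in_prose)
        if in_prose:
            yield child, "tail"  # text after the child belongs to elem


def curl_apostrophes(root):
    """Replace word-internal U+0027 with U+2019 in prose text nodes."""
    for node, attr in prose_slots(root):
        value = getattr(node, attr)
        if value:
            setattr(node, attr, WORD_APOSTROPHE.sub("\u2019", value))


def remaining_straight(root):
    """List (element name, slot) for prose text still containing U+0027."""
    return [(local_name(node), attr) for node, attr in prose_slots(root)
            if "'" in (getattr(node, attr) or "")]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    curl_apostrophes(root)
    print(remaining_straight(root))
```

The check deliberately flags every straight apostrophe left in prose, including ones the conservative fix would not touch (for example a straight single quotation mark), so that those cases get a human decision. Note also that ElementTree discards comments and PIs when parsing, so this sketch is unsuitable for rewriting the source files as-is.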