labels:
* TEI: Guidelines & Documentation
* Priority: Low
* CouncilResponsibility

I had intended to submit a ticket on this topic, but after a bit of research realize I don't know what the ticket should say. So I'm asking for guidance from y'all -- what should we do about the following issue?

IIRC, in general a <desc> should be 1 sentence that does not start with a capital letter, but does end in a full stop. By my count there are 10,475 tei:desc elements in the _Guidelines_. (That's not counting the 228 that are inside <egXML>; they are teix:desc elements that probably should be addressed, too.) Of those, 7689 end in a full stop,[1,2] and the other 2786 do not. That's a lot.

Of course some of them should not end in a full stop (e.g., those that end in ".)" or those that should not be a <desc> at all, like the gloss of the value "group" of the @org attribute of <attList>). But most, I suspect, should.

What to do? I submit that
a) 2786 is way too many for manual intervention
b) there are a bunch that end in screwy characters that probably need to be looked at (e.g., '>', superscript digits, '*')
c) I am utterly unqualified to deal with non-English cases

One thought is to divvy them up by language, so that (for example) I only have to deal with English, and ask for help for the other languages (perhaps ask Peter or Martina to deal with the German cases, Martin the Japanese, Raff the Italian, Alejandro the Spanish). Each language would be much more manageable on its own, I suspect. At least, the English certainly would be. For English there are only 9 categories of final character:

   1424  .
    404  [a-z]
     15  )
      2  [superscript digit]
      2  [A-Z]
      1  5
      1  …
      1  "
      1  ;

After some spot-checking, I would not be surprised if it made sense to automatically insert a period after all the [a-z] and ')' cases, leaving only 8 cases (in 6 categories) to be examined by hand.

Thoughts?
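For anyone who wants to reproduce a tally like the one above, here is a minimal sketch in Python. It is not the method used to produce the counts in this message. The function names (`category`, `final_char`, `tally`, `needs_period`) and the sample document are made up for illustration, and it assumes that the "final character" is the last character of the element's full text content after trailing whitespace is stripped, and that language is taken from the nearest xml:lang.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

TEI_DESC = "{http://www.tei-c.org/ns/1.0}desc"  # tei:desc only, not teix:desc
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
FULL_STOPS = {"\u002E", "\u3002", "\uFF0E"}
SUPERSCRIPT_DIGITS = set("\u2070\u00B9\u00B2\u00B3\u2074\u2075\u2076\u2077\u2078\u2079")


def category(ch):
    """Bucket a final character (assumed buckets, modelled on the table above)."""
    if not ch:
        return "(empty)"
    if ch in FULL_STOPS:
        return "."
    if re.fullmatch(r"[a-z]", ch):
        return "[a-z]"
    if re.fullmatch(r"[A-Z]", ch):
        return "[A-Z]"
    if ch in SUPERSCRIPT_DIGITS:
        return "[superscript digit]"
    return ch


def final_char(elem):
    """Last character of the element's text content, ignoring trailing whitespace."""
    text = "".join(elem.itertext()).rstrip()
    return text[-1:]


def tally(root, lang=None):
    """Count tei:desc elements by final-character category, optionally for one language."""
    counts = Counter()
    stack = [(root, None)]
    while stack:
        elem, inherited = stack.pop()
        current = elem.get(XML_LANG, inherited)  # xml:lang is inherited from ancestors
        if elem.tag == TEI_DESC and (lang is None or current == lang):
            counts[category(final_char(elem))] += 1
        stack.extend((child, current) for child in elem)
    return counts


def needs_period(ch):
    """True for the cases proposed for automatic insertion: [a-z] and ')'."""
    return category(ch) in {"[a-z]", ")"}


# Hypothetical sample, not taken from the Guidelines source.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:teix="http://www.tei-c.org/ns/Examples">
  <desc xml:lang="en">contains a single sentence.</desc>
  <desc xml:lang="en">lacks a final full stop</desc>
  <desc xml:lang="en">ends in a parenthesis (like this)</desc>
  <desc xml:lang="ja">説明。</desc>
  <teix:egXML><teix:desc xml:lang="en">inside an example</teix:desc></teix:egXML>
</div>"""

if __name__ == "__main__":
    print(tally(ET.fromstring(SAMPLE), lang="en"))
```

Note that the choices made in `final_char` (stripping whitespace, including text from child elements) directly affect the counts, so two tallies built on different choices will not agree.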
In case you want to see them (clicking on a column heading sorts):
* All of them: https://bauman.zapto.org/~syd/temp/4TEICouncil/last_of_the_descs/lotd.xhtml
* Only those that end in something other than full stop: https://bauman.zapto.org/~syd/temp/4TEICouncil/last_of_the_descs/lotdnfsonly...
* Only those that are in English: https://bauman.zapto.org/~syd/temp/4TEICouncil/last_of_the_descs/lotend.xhtm...
* Only those that are in English and do not end in full stop: https://bauman.zapto.org/~syd/temp/4TEICouncil/last_of_the_descs/lotendnfson...

Notes
-----
[1] I get bizarrely different results with different counting methods, not sure why.
[2] 1002 are U+3002, IDEOGRAPHIC FULL STOP, mostly in zh-TW, but 57 in ja
    1370 are U+FF0E, FULLWIDTH FULL STOP, most if not all in ja
    5317 are U+002E, FULL STOP, in lots of other languages
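The breakdown in note [2] could be checked with something like the following sketch. It assumes the final characters have already been collected as (language, character) pairs; the function name `full_stop_breakdown` and the sample pairs are hypothetical. It relies only on Python's standard `unicodedata` module to look up each character's name and general category.

```python
import unicodedata
from collections import Counter


def full_stop_breakdown(endings):
    """Count full-stop characters by code point, Unicode name, and language.

    `endings` is an iterable of (lang, final_char) pairs. A character counts
    as a full stop here if it is punctuation (category Po) and its Unicode
    name contains "FULL STOP"; that test is an assumption of this sketch.
    """
    counts = Counter()
    for lang, ch in endings:
        if not ch:
            continue
        name = unicodedata.name(ch, "")
        if unicodedata.category(ch) == "Po" and "FULL STOP" in name:
            counts[(f"U+{ord(ch):04X}", name, lang)] += 1
    return counts


if __name__ == "__main__":
    # Made-up input, only to show the shape of the result.
    pairs = [("en", "."), ("ja", "\u3002"), ("ja", "\uFF0E"), ("zh-TW", "\u3002"), ("en", "a")]
    for key, n in sorted(full_stop_breakdown(pairs).items()):
        print(n, *key)
```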
participants (1)
-
Syd Bauman