Hi Syd: I suddenly noticed that I didn't notice you'd replied 7 days ago!

That is pretty ridiculous of me, but has everything to do with the launch of spring semester.

In answer to your question, I am teaching a class called "Text Analysis" which is something like my old class called "Coding and Data Visualization", in that I think of it as my XQuery class. I'm also cramming a few weeks of Python into it, which is new teaching ground for me, though not new coding ground. Unlike those who have taught this course in the past as an entirely Python course, my course will involve an intersection of structured markup, regex conversion of so-called "plain-text" to XML at scale, and pull-processing of lightly marked XML into stuff like TSVs, maybe JSON (ugh, but maybe yeah), and visualizing the data in various ways such as networks and SVG plots. And I want to explore a little NLTK and topic modeling, and just play around with Pythonic things on whatever documents we've heaped up around us by March / April. It's my first time with this class, so we'll see where it goes!

Anyway, I wouldn't mind taking a peek at your XSLT to see if there are any suggestions for making the pile of XML output on the other side a little less hideous! But I am swamped--this is the sort of thing I'd enjoy looking at during a real old-fashioned face-to-face TEI session before we all wander over to the restaurant for dinner!

Cheers,

Elisa

On Thu, Jan 14, 2021 at 7:48 AM Bauman, Syd <s.bauman@northeastern.edu> wrote:

Hi Elisa — Yes, I mentioned XPath, and I am not surprised it whetted your appetite even at 01:00 in the freakin’ morning! 🙂

However, it’s not the code[0] that’s the problem, it’s the data. If you ask GitHub for all the PRs,[1] convert to XML,[2] grab all the PR numbers,[3] and ask for the reviewers for each of those,[4] and convert them to XML,[2] you now have a large pile of ugly XML data.[5] I have not figured out a consistent way to extract reviewer user names (which are encoded as <string key="login"> elements) from that pile. (Note that reviewers can appear both in the main set of PR info and the separate little array of review info.)
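For anyone who would rather poke at the pile from Python, here is a minimal sketch of one way to pull reviewer logins out of both places. It assumes the XML is plain json-to-xml() output (not the later, more XML-like versions), that reviewers in the main PR info sit in an array keyed "requested_reviewers", and that the separate review info keeps them in an array keyed "users". The sample data, the logins, and the function name reviewer_logins are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the elements produced by the XPath 3.1 json-to-xml() function.
NS = {"fn": "http://www.w3.org/2005/xpath-functions"}

# Hypothetical, heavily trimmed stand-in for a PR list after json-to-xml().
PULLS_XML = """
<array xmlns="http://www.w3.org/2005/xpath-functions">
  <map>
    <number key="number">101</number>
    <map key="user"><string key="login">author-a</string></map>
    <array key="requested_reviewers">
      <map><string key="login">reviewer-x</string></map>
      <map><string key="login">reviewer-y</string></map>
    </array>
  </map>
  <map>
    <number key="number">102</number>
    <map key="user"><string key="login">author-b</string></map>
    <array key="requested_reviewers"/>
  </map>
</array>
"""

# Hypothetical stand-in for one per-PR reviewer response after json-to-xml().
REVIEWERS_XML = """
<map xmlns="http://www.w3.org/2005/xpath-functions">
  <array key="users">
    <map><string key="login">reviewer-x</string></map>
    <map><string key="login">reviewer-z</string></map>
  </array>
  <array key="teams"/>
</map>
"""


def reviewer_logins(xml_text, array_key):
    """Collect login strings found directly inside maps of the named array."""
    root = ET.fromstring(xml_text)
    # Only logins under the named array match, so PR authors
    # (map key="user") are left out. The key names are assumptions.
    path = f".//fn:array[@key='{array_key}']/fn:map/fn:string[@key='login']"
    return {el.text for el in root.findall(path, NS)}


# Merge the two sources into one set of reviewer user names.
all_reviewers = (
    reviewer_logins(PULLS_XML, "requested_reviewers")
    | reviewer_logins(REVIEWERS_XML, "users")
)
print(sorted(all_reviewers))
```

Anchoring on the enclosing array's key, rather than grabbing every login string, is the design choice here: it keeps authors and other users out of the result, at the cost of having to know which keys hold reviewers.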

Just to save anyone interested the trouble, I have put the pile of data I got yesterday in [6].

P.S. Elisa: what are you teaching this semester?

Notes

[0] Which is XSLT, not Python. I have not written Python in over a decade. And I was bad at it.

[1] https://api.github.com/repos/TEIC/TEI/pulls plus https://api.github.com/repos/TEIC/Stylesheets/pulls

[2] I am presuming just json-to-xml() here, but I have actually taken an extra step after that to give me slightly more XML-like data. (And then another step to give me real XML data, but a) that is not part of this discussion, and b) it is not finished.)

[3] …/fn:array/fn:map/fn:number[@key eq 'number']; and be careful, because you do NOT want the …/fn:array/fn:map/fn:map/fn:number[@key eq 'number'] values.

[4] https://api.github.com/repos/TEIC/(TEI|Stylesheets)/pulls/$N/requested_reviewers.

[5] You want to save this data on disk, rather than ask for it again every time you want to play, because GitHub limits you to 60 requests per hour unless you pay them.

[6] http://bauman.zapto.org/~syd/temp/4TEICouncil/GitHub_PR_data_2021-01-14.tgz
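Notes [3] and [5] can be sketched in Python too. This is only an illustration under assumptions: it takes plain json-to-xml() output as input, the sample XML (including the nested "milestone" map) is invented, and pr_numbers, cached, and fake_fetch are hypothetical names. The fetch step is passed in as a function so that nothing here actually contacts GitHub.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of the elements produced by the XPath 3.1 json-to-xml() function.
NS = {"fn": "http://www.w3.org/2005/xpath-functions"}

# Hypothetical sample: each PR map has its own number plus a nested one.
SAMPLE = """
<array xmlns="http://www.w3.org/2005/xpath-functions">
  <map>
    <number key="number">101</number>
    <map key="milestone"><number key="number">7</number></map>
  </map>
  <map>
    <number key="number">102</number>
    <map key="milestone"><number key="number">7</number></map>
  </map>
</array>
"""


def pr_numbers(xml_text):
    """Return the PR numbers, ignoring 'number' values nested one map deeper."""
    root = ET.fromstring(xml_text)  # root is the outer fn:array
    # A child-only path: fn:map/fn:map/fn:number does not match.
    hits = root.findall("fn:map/fn:number[@key='number']", NS)
    return [int(el.text) for el in hits]


def cached(path, fetch):
    """Return the text saved at path, calling fetch() only if it is missing."""
    path = Path(path)
    if not path.exists():
        path.write_text(fetch(), encoding="utf-8")
    return path.read_text(encoding="utf-8")


calls = []


def fake_fetch():
    """Stand-in for a real request; records how often it is called."""
    calls.append(1)
    return SAMPLE


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pulls.xml"
    first = cached(target, fake_fetch)   # file missing: fetches and saves
    second = cached(target, fake_fetch)  # file present: read from disk

print(pr_numbers(first), len(calls))
```

In real use the fetch function would wrap whatever makes the request, and the cache file would live somewhere more permanent than a temporary directory.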

You mentioned XPath? I'd be curious to see the rest of the code. I've just been writing some Python myself to address the Box.com API to rescue some old project metadata from a big set of nested file directories and I'm half shocked that I got my code to work. I do wonder whether Python might make it easier to address the GitHub API. Fair warning: classes start up again for me on 1/19, but I'd be happy to take a look anyway! ;-)

_______________________________________________
Tei-council mailing list
Tei-council@lists.tei-c.org
http://lists.lists.tei-c.org/mailman/listinfo/tei-council