As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.
It’s come to my attention that Fukuzawa Yukichi’s (and others’) early Meiji (1868-1912) journal, Meiroku zasshi 明六雑誌, is available online not just as PDF (which I knew about) but also as a fully tagged XML corpus from NINJAL (and oh my god, it has lemmas). All right!
I recently met up with Mark Ravina at the Association for Asian Studies, and he brought this to my attention. We are doing a lot of brainstorming about what we can do with this as a proof-of-concept project before moving on to other early Meiji documents. We have big ideas like training OCR to recognize the difference between the katakana ニ and the kanji 二, for example; Meiji documents generally break OCR for various reasons like this, because they’re so different from contemporary Japanese. It’s like asking Acrobat to handle a medieval manuscript, in some ways.
But to start, we want to run the contents of Meiroku zasshi through tools like MALLET and Voyant, just to see how they do with non-Western languages (I don’t expect any problems, but we’ll see) and what we get out of it. I’d also be interested in going back to the Stanford CoreNLP API and seeing what kind of linguistic analysis we can do there. (First, I have to think of a methodology. :O)
In order to do this, we need text in which the words are separated by spaces. I’ve written about this elsewhere, but to sum up, Japanese is not written with spaces between words, so tools intended for Western languages think it’s all one big word. I can’t find any easy way to do this splitting; I’m working on an application that both strips ruby from Aozora bunko texts AND splits words with a space, but it’s coming slowly. How to get this with Meiroku zasshi in a quick and dirty way that lets us just play with the data?
So today after work, I’m going to use Python’s ElementTree library for XML to take the contents of the word tags from the corpus and just spit them into a text file delimited by spaces. Quick and dirty! I’ve been meaning to do this for weeks, but since it’s a “day of DH,” I thought I’d use the opportunity to motivate myself. Then, we can play.
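Here is a minimal sketch of what that quick-and-dirty extraction might look like. I am assuming the word elements are called "w" and carry no XML namespace; the actual tag names in the NINJAL corpus may well differ, so treat `word_tag` and the sample string as placeholders rather than a description of the real files.

```python
import xml.etree.ElementTree as ET


def words_to_text(xml_string, word_tag="w"):
    """Join the text of every word element with single spaces.

    word_tag is a hypothetical tag name; check the corpus documentation
    for the real one (and for any namespace prefix it needs).
    """
    root = ET.fromstring(xml_string)
    words = []
    for elem in root.iter(word_tag):
        # itertext() also picks up text inside nested child elements.
        text = "".join(elem.itertext()).strip()
        if text:
            words.append(text)
    return " ".join(words)


# Invented sample, not taken from the corpus.
sample = "<text><s><w>学問</w><w>ノ</w><w>自由</w></s></text>"
print(words_to_text(sample))  # 学問 ノ 自由
```

For a real file, the same function could be fed the contents of the corpus file and its return value written out to a .txt file for MALLET or Voyant.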
Exciting stuff, this corpus. Unfortunately most of NINJAL’s other amazing corpora are available only on CD-ROMs that work on old versions of Windows. Sigh. But I’ll work with what I’ve got.
So that’s your update from the world of Japanese text analysis.
If you are working in a functional, stateless language, but can still get away with for loops in a more conventional way thanks to for-each functions – should you still favor recursion over explicit for loops? Discuss.
Now that I am, as the title implies, “getting there,” I want to reflect a little on the learning process that has been XSLT. In my last post I glossed over what makes it (and functional programming languages generally) distinctive and, for people who are used to procedural languages, unintuitive and hard to grasp at first. This will be a post with several simple points, but that’s quite in keeping with the theme.
The major shift in thinking that needs to happen when working with XSLT, in my opinion, is one of trusting the computer more than we are accustomed to. It all stems from letting go of telling the computer how exactly to figure out when to execute sections of code, and letting it make the decisions for us.
I made a comment recently: “I know I’m getting more comfortable with XSLT because suddenly I’m trying to use recursion everywhere I can, and avoiding the for-loop like a crutch.” As others I talked to put it, this is idiomatic XSLT.* In other words, it’s one of the mental leaps that you (and I) have to make in order to start writing elegant and functional code (no pun intended) using this language.
What is recursion? In this case, to oversimplify, it’s how XSLT loops.** In a procedural language – C++, Java, most languages other than Lisp dialects to be honest – recursion is clunky and wasteful; telling the computer to specifically “do this for the number of times I tell you, or until this thing reaches this state” is how you get things done. This means that the languages have state, too – you can change the value of variables. This is important for having counters that are the backbone of those loops. If there were no variable to increment or change in another way, the loop would either never execute (such as a while), only execute once, or loop endlessly. None of these things are very helpful.
So how do you get away with a counter-based loop, at least of the “for each thing in this set” variety, in a stateless language (all variables are permanent, aka constants) that discourages use of for-each loops in the first place? There are two parts to the answer.
The first is much simpler: xsl:apply-templates or xsl:call-template. This involves the trust that I introduced above. With a procedural language it’s hard to trust the computer to take care of things without your telling it exactly how to do it (keep doing this thing until a condition is met) because you’ve had to become so used to it. It might have been hard to get used to having to explain the proverbial peanut butter sandwich recipe in excruciating detail for the sandwich to get made. Now, XSLT is forcing you to go back to the higher level of trust, where you can tell the computer “do this for all X” without telling it how it’s going to do that.
xsl:apply-templates simply means, “for all X, do Y.” (The Y is in the template.) It’s unsettling and worrying, at least for me at first, to just leave this up to the computer. There’s no guarantee that templates will ever be executed, or that they will be executed in order. How can I trust that this is going to turn out okay? Yet, with judicious application of xsl:apply-templates (like, where you want the results to be), it will happen.
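To make “for all X, do Y” a bit more concrete, here is a rough Python analogue, not XSLT itself. The handler functions, the `TEMPLATES` table, and the sample document are all invented for illustration. One simplification to flag: this sketch silently skips elements with no matching handler, whereas a real XSLT processor falls back on its built-in template rules.

```python
import xml.etree.ElementTree as ET


def render_title(elem):
    # The "Y" for <title> elements: shout the title.
    return (elem.text or "").upper()


def render_para(elem):
    # The "Y" for <para> elements: pass the text through unchanged.
    return elem.text or ""


# Hypothetical "templates": one handler per element name.
TEMPLATES = {"title": render_title, "para": render_para}


def apply_templates(parent):
    """For every child with a matching handler, run that handler.

    The caller never says how to walk the children or in what order;
    it only declares what should happen to each kind of element.
    """
    return [TEMPLATES[child.tag](child) for child in parent
            if child.tag in TEMPLATES]


doc = ET.fromstring(
    "<doc><title>hello</title><para>world</para><note>skip</note></doc>"
)
print(apply_templates(doc))  # ['HELLO', 'world']
```

The point of the sketch is only the shape: the handlers say what to do with a title or a paragraph, and a single generic call decides when each one runs.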
Second, the recursive aspect. Keep calling the template until there are no more things left – whether that’s a counter, or a set of stuff. But how to get a counter without being able to change the variable? With each xsl:apply-templates (or call-template), do so with xsl:with-param, and adjust the parameter as needed. Call it with the rest of the set but not the thing that is being modified in the current template. When it runs out of stuff, that is when results are returned. Again, it takes the explicit instruction – xsl:for-each is very heavy-handed – and turns it into “if there’s anything left, keep on doing this.” It may seem from my description that there’s no real difference between these two, and in their end result, there isn’t. But this is a big leap, and moving from instinctively reaching for xsl:for-each to xsl:apply-templates is conceptually profound. It is getting XSLT.
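The counter-without-mutation idea can be sketched in Python as well. This is only an analogy for calling a template with xsl:with-param: the function name and sample data are made up, and Python (unlike many functional languages) does not optimize deep recursion, so this is about the shape of the idea rather than a recommendation for long lists.

```python
def number_items(items, counter=1):
    """Label items 1., 2., 3., ... without ever reassigning a variable."""
    if not items:
        # Nothing left: this is where the recursion stops.
        return []
    head, rest = items[0], items[1:]
    # "Call the template again" with the rest of the set and an adjusted
    # parameter, instead of incrementing a counter in place.
    return [f"{counter}. {head}"] + number_items(rest, counter + 1)


print(number_items(["alpha", "beta", "gamma"]))
# ['1. alpha', '2. beta', '3. gamma']
```

Each call sees its own fixed value of `counter`; the “change” only ever happens in the hand-off to the next call.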
Finally, a note on the brevity and simplicity of XSLT. I’ve noticed that once I’ve found a good, relatively elegant solution to what I’m trying to do (they can’t always be!), suddenly my code becomes very short and very simple. It’s not hard to write and I don’t type for a long time. It’s the thinking and planning that takes up the time. Obviously this is true for programming just about anything, but I find myself doing a whole lot less typing this summer than usual (compared to languages I’ve used such as C, C++, Java, Python).
It’s both satisfying and disappointing at the same time: getting a template that recursively creates arbitrary nested menus makes me want to jump up and high five myself; the fact that it’s only about four lines and incredibly simple makes me wonder if any of it was that hard to begin with. But this isn’t limited to XSLT or even programming: the 90-page thesis seems like more work than the 40-page thesis, but if the shorter one is talking about more profound ideas and/or is simply better written, the length and time comparison falls apart. The time spent typing and the length of the output don’t tell us as much as we’re used to assuming.
That’s what I have to say about what I’ve been doing this summer, as far as learning XSLT goes. I still can’t say I like it. The syntax is maddening. I haven’t been in this long enough to judge whether it’s the best choice for getting something done within a lot of constraints. But at the very least I’ve finally had that brain shift again, the one I had with Lisp so long ago, to a different approach to problem-solving entirely. And that feeling is profoundly gratifying.
Speaking of a good feeling, I’ve been able to have extended chats with multiple people about XSLT on the U of M School of Information mailing list this summer after someone posted asking for help with it. It’s a good thing I replied despite thinking “I’m not an expert, so I probably don’t have much to offer.” Talking with the questioner and the others who replied-all on our emails was really enlightening: I got feedback, heard others’ questions about how the language works (questions that I hadn’t articulated very well myself), and gave my own feedback in return. There’s nothing like teaching to help you learn. I would not have been able to write this post before talking to my fellow students and figuring it out together. (Or, you would have read a very unclear and aimless post.)
(Very last, I’d like to recommend the O’Reilly book XSLT Cookbook for using this language regularly after getting acquainted with it. If I were continuing on with an XSLT project after this internship, or working on adding more to this one, I’d be using this book for suggestions.)
* Thank you all for reminding me that this word exists.
** XSLT 2.0 includes not only the xsl:for-each loop, but also (through XPath 2.0) the for expression. These do have their appropriate uses and I do use them quite a lot, because my application doesn’t give me a huge number of chances for recursion. I’m being dramatic to make a point.
Cross-posted from the iSchools & Digital Humanities intern blog.
My internship this summer gives me the opportunity to get acquainted with and even use some XSLT – misleadingly called the “stylesheets” of XML.
XSLT has been hard to wrap my head around, not least because “stylesheets” and “used to format XML” make me think of CSS, not – well, functional programming. It’s been a good many years since I got to play around in Lisp, let alone make something with it, and this has brought me back to those two great semesters of AI electives that introduced me to this way of thinking. It took a few weeks to get into it, but once I “got” how Lisp worked at a more intuitive level, I remember my impression: I am thinking in a different way. It wasn’t just about programming, it was about problem solving, and about a way of looking at the world.
Diving into a functional programming language again has got me thinking about that experience. Learning how XSLT works has of course made me remember a time when Lisp made sense, because XSLT is functional programming. If I had been introduced to it in that way at the outset, it would have clicked much sooner. Now it makes me yearn a little for the time when I didn’t just know that I was working in a different way, but when that way came to make sense and I was able to start going somewhere with it.
But when I learned Lisp in the context of an AI (artificial intelligence) class, I didn’t learn it as “functional programming” then either. I wasn’t introduced to it in the context of lambda calculus, which I came to find much later – last semester – in a natural language processing course. I knew it was different, but I didn’t know how on a bigger picture level.
Now that I have that bigger picture, I am appreciating this way of thinking more and more.
Why is functional programming “hard”? Why is it something that I had to get used to for a time before it clicked? I have an answer this time – because I have been doing imperative programming for so long, because that’s how I was introduced to programming (I didn’t attend MIT after all), and because that has become the natural and intuitive way for me to solve a problem. But imperative programming isn’t a more natural way of thinking about things. It’s a different way. Obviously, these two ways of approaching problems have different applications, but the elegant and concise ways of approaching problems that functional programming offers are perhaps even more appealing to me now.
Because I am not an expert, I write this not to make a profound statement about how to approach problem solving, but to share a great article about where functional programming came from, why it’s so appealing, and the things it makes possible. I give you,
Take the 10 minutes to read this and enjoy!