As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.
What am I working on these days? Well, one thing is working with the Taiyō magazine corpus (1895-1925, selected articles) from NINJAL, released on CD about 10 years ago but currently being prepared for web release. In addition, I should note that Taiyō has been reproduced digitally as a paid resource through JKBooks (on the JapanKnowledge+ platform).
Taiyō was a general-interest magazine spanning Meiji through Taishō periods in Japan, with articles on all topics as well as fiction, and innovative for its time in 1895 with the use of lithography to reproduce pages of photographs. (And let me tell you, they were random at the time: battleships, various nations’ viceroys, stuff like that. I’m not making this up.) Unfortunately, the text-only nature of my project doesn’t reflect the cool printing technology and visual nature of the magazine, but I was wondering, what can I do with just the text of the articles and metadata kindly provided by NINJAL (including genre by NDL classification and style of writing).
Because I’m working on another project (under wraps and in very beginning stages at the moment) involving periodicals in the Japanese empire, I was already thinking about this question. I hit upon something very basic but an important topic: what language did Japanese publications use to talk about Japan at the time? With “Japan” in the early 20th century, we can think of both a nation and an empire, with blurred and constantly shifting boundaries. Over the span of Taiyō‘s publication, Japan annexed both Korea and Taiwan, increased hostilities with China, and battled (and defeated) Russia in the Russo-Japanese War (thus gaining some territories there). There was a lot going on to keep Japan’s borders in flux, and make Japanese question the limits and definition of their “nation.”
Especially because of the discourse in the early 20th century of naichi 内地 (inner lands or “home islands”, referring to the archipelago of Japan we know today) and gaichi 外地 (outer lands or “colonies”, referring to Korea/Taiwan), which are both subsumed under the name of Japan, I’m really interested in how those terms were being used, other terms that might have been used as well, and what qualities and relationships were associated with them. How did Japanese define these areas and how did it change over time? While I can’t get in the minds of people in the imperial period, I can take a look at one of its most popular magazines, intended for a broad audience, to see at least the public, print discourse of the nation and empire.
How to work with it, though? That’s where I’m still just beginning. It’s a daunting project in some ways. For example, I am not a linguist, let alone a Japanese linguist. I haven’t specialized in this period in the past, so keywords for territories will take some research on my part (for example, there were multiple names for Taiwan at the time in addition to the gaichi reference). Moreover, the corpus is 1.2 GB in UTF-8 text (which I converted from sentence-tokenized XML to word-tokenized, non-tagged text). It breaks Voyant Server and Topic Modeling Tool on my machine with 12 GB RAM when attempting to analyze the whole thing at once. Of course, I could split it up, but then that raises another methodological question: how and why to split it up? What divisions should I use: years, genres, authors, etc.? Right now I have it in text files by article, but could combine those articles in any number of ways.
I am also stymied by methodologies for analysis, but my plan at the moment is to start by doing some basic visualizations of the articles, in different groupings, as an exploration of what kind of things people talked about in Taiyō over time. Are they even talking about the nation? When they talk about naichi what kinds of things do they associate with those territories, as opposed to gaichi? Is the distinction changing, and is it even a reliable distinction?
As a Price Lab Fellow this year at Penn, I hope to explore these questions and start to nail down what I want to analyze in more detail over time in Taiyō — and hopefully gain some insight into the language of empire in Japan 1895-1925.
In addition I’ll be presenting about this at a workshop at the University of Chicago in November, so if you’re in the area please attend and help me figure all this out!
I just got off the phone with a researcher this morning who is interested in looking at sentiment analysis on a corpus of fiction, specifically by having some native speakers of Japanese (I think) tag adjectives as positive or negative, then look at the overall shape of the corpus with those tags in mind.
A while back, I wrote a paper about geoparsing and sentiment analysis for a class, describing a project I worked on. Talking to this researcher made me think back to this project – which I’m actually currently trying to rewrite in Python and then make work on some Japanese, rather than Victorian English, texts – and my own definition of sentiment analysis for humanistic inquiry.*
How is my definition of sentiment analysis different? How about I start with the methodology? What I did was look for salient adjectives, which I searched for by looking at most “salient” nouns (not necessarily the most frequent, but I need to refine my heuristics) and then the adjectives that appeared next to them. I also used Wordnet to look for words related to these adjectives and nouns to expand my search beyond just those specific words to ones with similar meaning that I might have missed (in particular, I looked at hypernyms (broader terms) and synonyms of nouns, and synonyms of adjectives).
My method of sentiment analysis ends up looking more like automatic summarization than a positive-negative sentiment analysis we more frequently encounter, even in humanistic work such as Matt Jockers’s recent research. I argue, of course, that my method is somewhat more meaningful. I consider all adjectives to be sentiment words, because they carry subjective judgment (even something that’s kind of green might be described by someone else as also kind of blue). And I’m more interested in the character of subjective judgment than whether it should be able to be considered ‘objectively’ as positive or negative (something I don’t think is really possible in humanistic inquiry, and even in business applications). In other words, if we have to pick out the most representative feelings of people about what they’re experiencing, what are they feeling about that experience?
After all, can you really say that weather is good or bad, that there being a lot of farm fields is good or bad? I looked at 19th-century British women’s travel narratives of “exotic” places, and I found that their sentiment was often just observations about trains and the landscape and the people. They didn’t talk about whether they were feeling positively or negatively about those things; rather, they gave us their subjective judgment of what those things were like.
My take on sentiment analysis, then, is clearly that we need to introduce human judgment to the end of the process, perhaps gathering these representative phrases and adjectives (I lean toward phrases or even whole sentences) and then deciding what we can about them. I don’t even think a human interlocutor could put down a verdict of positive or negative on these observations and judgments – sentiments – that the women had about their experiences and environments. If not even a human could do it, and humans write and train the algorithms, how can the computer do it?
Is there even a point? Does it matter if it’s possible or not? We should be looking for something else entirely.
(I really need to get cracking on this project. Stay tuned for the revised methodology and heuristics, because I hope to write more and share code here as I go along.)
* I’m also trying to write a more extensive and revised paper on this, meant for the new incarnation of LLC.
I recently wrote a small script to perform a couple of functions for pre-processing Aozora Bunko texts (text files of public domain, modern Japanese literature and non-fiction) to be used with Western-oriented text analysis tools, such as Voyant, other TAPoR tools, and MALLET. Whereas Japanese text analysis software focuses largely on linguistics (tagging parts of speech, lemmatizing, etc.), Western tools open up possibilities for visualization, concordances, topic modeling, and other various modes of analysis.
Why do these Aozora texts need to be processed? Well, a couple of issues.
- They contain ruby, which are basically glosses of Chinese characters that give their pronunciation. These can be straightforward pronunciation help, or actually different words that give added meaning and context. While I have my issues with removing ruby, it’s impossible to do straightforward tool-based analysis without removing it, and many people who want to do this kind of analysis want it to be removed.
- The Aozora files are not exactly plain text: they’re HTML. The HTML tags and Aozora metadata (telling where the text came from, for example) need to be removed before analysis can be performed.
- There are no spaces between words in Japanese, but Western text analysis tools identify words by looking at where there are spaces. Without inserting spaces, it looks like each line is one big word. So I needed to insert spaces between the Japanese words.
How did I do it? My approach, because of my background and expertise, was to create a Python script that used a couple of helpful libraries, including BeautifulSoup for ruby removal based on HTML tags, and TinySegmenter for inserting spaces between words. My script requires you to have these packages installed, but it’s not a big deal to do so. You then run the script in a command line prompt. The way it works is to look for all .html files in a directory, load them and run the pre-processing, then output each processed file with the same filename, .txt ending, a plain text UTF-8 encoded file.
The first step in the script is to remove the ruby. Helpfully, the ruby is contained in several HTML tags. I had BeautifulSoup traverse the file and remove all elements contained within these tags; it removes both the tags and content.
Next, I used a very simple regular expression to remove everything in brackets – i.e. the HTML tags. This is kind of quick and dirty, and won’t work on every file in the universe, but in Aozora texts everything inside a bracket is an HTML tag, so it’s not a problem here.
Finally, I used TinySegmenter on the resulting HTML-free text to split the text into words. Luckily for me, it returns an array of words – basically, each word is a separate element in a list like [‘word1’, ‘word2’, … ‘wordn’] for n words. This makes my life easy for two reasons. First, I simply joined the array with a space between each word, creating one long string (the outputted text) with spaces between each element in the array (words). Second, it made it easy to just remove the part of the array that contains Aozora metadata before creating that string. Again, this is quick and dirty, but from examining the files I noted that the metadata always comes at the end of the file and begins with the word 底本 (‘source text’). Remove that word and everything after it, and then you have a metadata-free file.
Write this resulting text into a plain text file, and you have a non-ruby, non-HTML, metadata-free, whitespace-delimited Aozora text! Although you have to still download all the Aozora files individually and then do what you will with the resulting individual text files, it’s an easy way to pre-process this text and get it ready for tool-based (and also your-own-program-based) text analysis.
I plan to put the script on GitHub for your perusal and use (and of course modification) but for now, check it out on my Japanese Text Analysis research guide at Penn.
It’s come to my attention that Fukuzawa Yukichi’s (and others’) early Meiji (1868-1912) journal, Meiroku zasshi 明六雑誌, is available online not just as PDF (which I knew about) but also as a fully tagged XML corpus from NINJAL (and oh my god, it has lemmas). All right!
I recently met up with Mark Ravina at Association for Asian Studies, who brought this to my attention, and we are doing a lot of brainstorming about what we can do with this as a proof-of-concept project, and then move on to other early Meiji documents. We have big ideas like training OCR to recognize the difference between the katakana and kanji 二, for example; Meiji documents generally break OCR for various reasons like this, because they’re so different from contemporary Japanese. It’s like asking Acrobat to handle a medieval manuscript, in some ways.
But to start, we want to run the contents of Meiroku zasshi through tools like MALLET and Voyant, just to see how they do with non-Western languages (don’t expect any problems, but we’ll see) and what we get out of it. I’d also be interested in going back to the Stanford Core NLP API and seeing what kind of linguistic analysis we can do there. (First, I have to think of a methodology. :O)
In order to do this, we need whitespace-delimited text with words separated by spaces. I’ve written about this elsewhere, but to sum up, Japanese is not separated by spaces, so tools intended for Western languages think it’s all one big word. There are currently no easy ways I can find to do this splitting; I’m currently working on an application that both strips ruby from Aozora bunko texts AND splits words with a space, but it’s coming slowly. How to get this with Meiroku zasshi in a quick and dirty way that lets us just play with the data?
So today after work, I’m going to use Python’s eTree library for XML to take the contents of the word tags from the corpus and just spit them into a text file delimited by spaces. Quick and dirty! I’ve been meaning to do this for weeks, but since it’s a “day of DH,” I thought I’d use the opportunity to motivate myself. Then, we can play.
Exciting stuff, this corpus. Unfortunately most of NINJAL’s other amazing corpora are available only on CD-ROMs that work on old versions of Windows. Sigh. But I’ll work with what I’ve got.
So that’s your update from the world of Japanese text analysis.
I’ve been looking (okay, not looking, wishing) for a Japanese tokenizer for a while now, and today I decided to sit down and do some research into what’s out there. It didn’t take long – things have improved recently.
First off – so what is tokenization? Basically, it’s separating sentences by words, or documents by sentences, or any text by some unit, to be able to chunk that text into parts and analyze them (or do other things with them). When you tokenize a document by word, like a web page, you enable searching: this is how Google finds individual words in documents. You can also find keywords from a document this way, by writing an algorithm to choose the most meaningful nouns, for example. It’s also the first step in more involved linguistic analysis like part-of-speech tagging (thing, marking individual words as nouns, verbs, and so on) and lemmatizing (paring words down to their stems, such as removing plural markers and un-conjugating verbs).
This gives you a taste of why tokenization is so fundamental and important for text analysis. It’s what lets you break up an otherwise unintelligible (to the computer) string of characters into units that the computer can attempt to analyze. It can index them, search them, categorize them, group them, visualize them, and so on. Without this, you’re stuck with “words” that are entire sentences or documents, that the computer thinks are individual units based on the fact that they’re one long string of characters.
Usually, the way you tokenize is to break up “words” based on spaces (or sentences based on punctuation rules, etc., although that doesn’t always work). (I put “words” in quotes because you can really make any kind of unit you want, the computer doesn’t understand what words are, and in the end it doesn’t matter. I’m using “words” as an example here.) However, for languages like Japanese and Chinese (and to a lesser extent Korean) that don’t use spaces to delimit all words (for example, in Korean particles are attached to nouns with no space in between, like saying “athome” instead of “at home”), you run into problems quickly. How to break up texts into words when there’s no easy way to distinguish between them?
The question of tokenizing Japanese may be a linguistic debate. I don’t know enough about linguistics to begin to participate in it, if it is. But I’ll quickly say that you can break up Japanese based on linguistic rules and dictionary rules – understanding which character compounds are nouns, which verb conjugations go with which verb stems (as opposed to being particles in between words), then breaking up common particles into their own units. This appears to be how these tools are doing it. For my own purposes, I’m not as interested in linguistic patterns as I am in noun and verb usage (the meaning rather than the kind) so linguistic nitpicking won’t be my area anyway.
Moving on to the tools. I put them through the wringer: Higuchi Ichiyō’s Ame no yoru, the first two lines, from Aozora bunko.
One, kuromoji, is the tokenizer behind Solr and Lucene. It does a fairly good job, although with Ichiyō’s uncommon word usage and conjugation, it faltered and couldn’t figure out that 高やか is one word; rather it divided it into 高 や か. It gives the base form, reading, and pronunciation, but nothing else. However, in the version that ships with Solr/Lucene, it lemmatizes. Would that ever make me happy. (That’s, again, reducing a word to its base form, making it easy to count all instances of both “people” and “person” for example, if you’re just after meaning.) I would kill for this feature to be integrated with the below tool.
The other, U-Tokenizer, did significantly better, but its major drawback is that it’s done in the form of an HTTP request, meaning that you can’t put in entire documents (well, maybe you could? how much can you pass in an HTTP request?). If it were downloadable code with an API, I would be very happy (kuromoji is downloadable and has a command line interface). U-Tokenizer figured out that 高やか is one word, and also provides a list of “keywords,” which as far as I can tell is a bunch of salient nouns. I used it for a very short piece of text, so I can’t comment on how many keywords it would come up with for an entire document. The documentation on this is sparse, and it’s not open source, so it’s impossible to know what it’s doing. Still, it’s a fantastic tool, and also seems to work decently for Chinese and Korean.
Each of these tools has its strengths, and both are quite usable for modern and contemporary Japanese. (I really was cruel to feed them Ichiyō.) However, there is a major trial involved in using them with freely-available corpora like Aozora bunko. Guess what? Preprocessing ruby.
Aozora texts contain ruby marked up within the documents. I have my issues with stripping out ruby from documents that heavily use them (like Meiji writers, for example) because they add so much meaning to the text, but let’s say for argument’s sake that we’re not interested in the ruby. Now, it’s time to cut it all out. If I were a regular expressions wizard (or even had basic competency with them) I could probably strip this out easily, but it’s still time consuming. Download text, strip out ruby and other metadata, save as plain text. (Aozora texts are XHTML, NOT “plain text” as they’re often touted to be.) Repeat. For topic modeling using a tool like MALLET, you’re going to want to have hundreds of documents at the end of it. For example, you might be downloading all Meiji novels from Aozora and dividing them into chunks or chapters. Even the complete works of Natsume Sōseki aren’t enough without cutting them down into chapters or even paragraphs to make enough documents to use a topic modeling tool effectively. Possibly, run all these through a part-of-speech tagger like KH Coder. This is going to take a significant amount of time.
Then again, preprocessing is an essential and extremely time-consuming part of almost any text analysis project. I went through a moderate amount of work just removing Project Gutenberg metadata and dividing into chapters a set of travel narratives that I downloaded in plain text, thankfully not in HTML or XML. It made for easy processing. With something that’s not already real plain text, with a lot of metadata, and with a lot of ruby, it’s going to take much more time and effort, which is more typical of a project like this. The digital humanities are a lot of manual labor, despite the glamorous image and the idea that computers can do a lot of manual labor for us. They are a little finicky with what they’ll accept. (Granted, I’ll be using a computer script to strip out the XHTML and ruby tags, but it’s going to take work for me to write it in the first place.)
In conclusion? Text analysis, despite exciting available tools, is still hard and time consuming. There is a lot of potential here, but I also see myself going through some trials to get to the fun part, the experimentation. Still, stay tuned, especially for some follow-up posts on these tools and KH Coder as I become more familiar with them. And, I promise to stop being difficult and giving them Ichiyō’s Meiji-style bungo.
Oh snap – I just fixed this by turning on caching in the Cocoon sitemap. Thanks Brian Pytlik Zillig for pointing out that this is where that functionality is useful! And note to self (and all of us): asking questions when you’re torn between solutions can lead to a third solution that does much better than either of the ones you came up with.
With programming or web design, “clean and elegant” is a satisfaction for me second only to “it’s working by god it’s finally doing what it’s supposed to.” So what am I to do when I’ve got a perfectly clean and elegant solution – one that involves zero data entry and only takes up a handful of lines in my XSLT stylesheets – that crunches browser speed so hard that it takes nearly a minute to load the homepage of my application?
I’ve got a choice here: Two XML files (one for each problem area) that list all of the data that I’d otherwise dynamically be grabbing out of all files sitting in a certain directory. This is time-consuming and not very elegant (although it certainly could be worse). The worst part is that it requires explicit maintenance on the part of the user. Wouldn’t it be nice to be able to give my application to any person who has a directory of XML files without any need for them to hand-customize it, even just a small part?
On the other hand, I can’t expect Web users to sit there and wait at least 30 seconds for TokenX to dynamically generate its list of texts, an action that would take a split second if it were only loading the data out of an XML file. I already have all the site menu data stored in XML for retrieval, meaning that modifications need only take place once and that nested menus can be easily entered without having to worry about the algorithm I’m using to make them appear nested on the screen in the final product.
You can tell from reading my thought process here what the solution is going to be. It’s too bad, because aiming for elegance often ends up leading you to better performance at the same time. Practicality vs. idealism: the eternal question to which we already know the answer.
If you are working in a functional, stateless language, but can still get away with for loops in a more conventional way thanks to for-each functions – should you still favor recursion over explicit for loops? Discuss.
Now that I am, as the title implies, “getting there,” I want to reflect a little on the learning process that has been XSLT. In my last post I glossed over what makes it (and functional programming languages generally) distinctive and, for people who are used to procedural languages, unintuitive and hard to grasp at first. This will be a post with several simple points, but that’s quite in keeping with the theme.
The major shift in thinking that needs to happen when working with XSLT, in my opinion, is one of trusting the computer more than we are accustomed to. It all stems from letting go of telling the computer how exactly to figure out when to execute sections of code, and letting it make the decisions for us.
I made a comment recently: “I know I’m getting more comfortable with XSLT because suddenly I’m trying to use recursion everywhere I can, and avoiding the for-loop like a crutch.” As others I talked to put it, this is idiomatic XSLT.*; In other words, it’s one of the mental leaps that you (and I) have to make in order to start writing elegant and functional code (no pun intended) using this language.
What is recursion? In this case, to oversimplify, it’s how XSLT loops.** In a procedural language – C++, Java, most languages other than Lisp dialects to be honest – recursion is clunky and wasteful; telling the computer to specifically “do this for the number of times I tell you, or until this thing reaches this state” is how you get things done. This means that the languages have state, too – you can change the value of variables. This is important for having counters that are the backbone of those loops. If there were no variable to increment or change in another way, the loop would either never execute (such as a while), only execute once, or loop endlessly. None of these things are very helpful.
So how do you get away with counter-based loop, at least of the “for each thing in this set” variety, with a stateless language (all variables are permanent, aka constants) that discourages use of for-each loops in the first place?
The first is much simpler: xsl:apply-templates or xsl:call-template. This involves the trust that I introduced above. With a procedural language it’s hard to trust the computer to take care of things without your telling it exactly how to do it (keep doing this thing until a condition is met) because you’ve had to become so used to it. It might have been hard to get used to having to explain the proverbial peanut butter sandwich recipe in excruciating detail for the sandwich to get made. Now, XSLT is forcing you to go back to the higher level of trust, where you can tell the computer “do this for all X” without telling it how it’s going to do that.
xsl:apply-templates simply means, “for all X, do Y.” (The Y is in the template.) It’s unsettling and worrying, at least for me at first, to just leave this up to the computer. There’s no guarantee that templates will ever be executed, or that they will be executed in order. How can I trust that this is going to turn out okay? Yet, with judicious application of xsl:apply-templates (like, where you want the results to be), it will happen.
Second, the recursive aspect. Keep calling the template until there are no more things left – whether that’s a counter, or a set of stuff. But how to get a counter without being able to change the variable? With each xsl:apply-templates (or call-template), do so with xsl:with-param, and adjust the parameter as needed. Call it with the rest of the set but not the thing that is being modified in the current template. When it runs out of stuff, that is when results are returned. Again, it takes the explicit instruction – xsl:for-each is very heavy-handed – and turns it into “if there’s anything left, keep on doing this.” It may seem from my description that there’s no real difference between these two, and in their end result, there isn’t. But this is a big leap, and moving from instinctively reaching for xsl:for-each to xsl:apply-templates is conceptually profound. It is getting XSLT.
Finally, a note on the brevity and simplicity of XSLT. I’ve noticed that once I’ve found a good, relatively elegant solution to what I’m trying to do (they can’t always be!), suddenly my code becomes very short and very simple. It’s not hard to write and I don’t type for a long time. It’s the thinking and planning that takes up the time. Obviously this is true for programming just about anything, but I find myself doing a whole lot less typing this summer than usual (compared to languages I’ve used such as C, C++, Java, Python).
It’s both satisfying and disappointing at the same time: getting a template that recursively creates arbitrary nested menus wants to make me jump up and high five myself; the fact that it’s only about four lines and incredibly simple makes me wonder if any of it was that hard to begin with. But this isn’t limited to XSLT or even programming: the 90-page thesis seems like more work than the 40-page thesis, but if the shorter one is talking about more profound ideas and/or is simply more well-written, the length and time comparison falls apart. The time spent typing and the length of the output doesn’t tell us as much as we’re used to assuming.
That’s what I have to say about what I’ve been doing this summer, as far as learning XSLT goes. I still can’t say I like it. The syntax is maddening. I haven’t been in this long enough to judge whether it’s the best choice for getting something done within a lot of constraints. But at the very least I’ve finally had that brain shift again, the one I had with Lisp so long ago, to a different approach to problem-solving entirely. And that feeling is profoundly gratifying.
Speaking of a good feeling, I’ve been able to have extended chats with multiple people about XSLT on the U of M School of Information mailing list this summer after someone posted asking for help with it. It’s a good thing I replied despite thinking “I’m not an expert, so I probably don’t have much to offer.” Talking with the questioner and the others who replied-all on our emails was really enlightening, both by getting feedback, hearing others’ questions about how the language works (questions that I hadn’t articulated very well), and also giving my own feedback. There’s nothing like teaching to help you learn. I would not have been able to write this post before talking to my fellow students and figuring it out together. (Or, you would have read a very unclear and aimless post.)
(Very last, I’d like to recommend the O’Reilly book XSLT Cookbook for using this language regularly after getting acquainted with it. If I were continuing on with an XSLT project after this internship, or working on adding more to this one, I’d be using this book for suggestions.)
* Thank you all for reminding me that this word exists.
** XSLT now includes not only the for-each loop, but also the xs:for tag. These do have their appropriate uses and I do use them quite a lot, because my application doesn’t give me a huge number of chances for recursion. I’m being dramatic to make a point.
Cross-posted from the iSchools & Digital Humanities intern blog.
I’ve been here at CDRH (The Center for Digital Research in the Humanities) at the University of Nebraska-Lincoln since early May, and the time went by so quickly that I’m only writing about what I’m doing a few weeks before my internship ends! But I’m in the thick of things now, in terms of my main work, so this may be the perfect time to introduce it.
My job this summer is (mostly) to update TokenX for the Willa Cather Archive (you can find it from the “Text Analysis” link at http://cather.unl.edu). I’m updating it in two senses:
- Redesigning the basic interface. This means starting from scratch with a list of functions that TokenX performs, organizing them for user access, figuring out which categories will form the interface (and what to put under each), and then making a visual mockup of where all this will go.
- The exciting part: Implementing as much of TokenX with the new interface as I can in the time I’m here. Why is it exciting?
- TokenX is written in XSLT, which tends to be mentioned in a cursory way as “stylesheets for XML” as though it’s like CSS. It’s not. It’s a functional programming language with syntax devised by a sadist. XSLT has a steep learning curve and I have had 9 weeks this summer to try my best to get over it before I leave CDRH. I’m doing my best and it’s going better than I ever imagined.
- I’m also getting to learn how XSLT is often used to generate Web sites (which is what I’m doing): using Apache Cocoon. Another technology that I had no idea existed before this summer, and which is coming to feel surprisingly comfortable at this point.
- I have never said so many times in a span of only two months, “I’m glad I had that four years of computer science curriculum in college. It’s paying off!” Given that I never went into software development after graduating, and haven’t done any non-trivial programming in quite a long time, I had mostly dismissed my education as something that could end up being so relevant to my work now. And honestly, I miss using it. This is awesome.
I’m getting toward the end of implementing all of the functionality of TokenX in XSLT for the new Web site, hooking that up with the XSLT that then generates the HTML that delivers it to the user. (To be more concrete for those interested in Cocoon, I’m writing a sitemap that first processes the requested XML file with the appropriate stylesheet for transformation results, then passing those results on to a series of formatting stylesheets that eventually produce the final HTML product.) And I’m about midway through the process of doing from Web prototype to basic site styling to more polished end result. I’ve got 2.5 weeks left now, and I’m on track to having it up and running pretty soon! I’ll keep you updated with comments about the process – both XSLT, and crafting the site with HTML5 and CSS – and maybe some screenshots.
* TokenX can be, and is, used for more than this collection at CDRH. Notably it’s used at the Walt Whitman Archive in basically the same way as Cather. But we have to start somewhere for now, and expand as we can later.