Category Archives: project

Pre-processing Japanese literature for text analysis

I recently wrote a small script to perform a couple of functions for pre-processing Aozora Bunko texts (text files of public domain, modern Japanese literature and non-fiction) so they can be used with Western-oriented text analysis tools, such as Voyant, other TAPoR tools, and MALLET. Whereas Japanese text analysis software focuses largely on linguistics (tagging parts of speech, lemmatizing, etc.), Western tools open up possibilities for visualization, concordances, topic modeling, and various other modes of analysis.

Why do these Aozora texts need to be processed? Well, a couple of issues.

  1. They contain ruby, which is basically a gloss on Chinese characters giving their pronunciation. Ruby can be straightforward pronunciation help, or an entirely different word that adds meaning and context. While I have my issues with removing ruby, it’s impossible to do straightforward tool-based analysis without removing it, and many people who want to do this kind of analysis want it removed.
  2. The Aozora files are not exactly plain text: they’re HTML. The HTML tags and Aozora metadata (telling where the text came from, for example) need to be removed before analysis can be performed.
  3. There are no spaces between words in Japanese, but Western text analysis tools identify words by looking at where there are spaces. Without inserting spaces, it looks like each line is one big word. So I needed to insert spaces between the Japanese words.

How did I do it? My approach, because of my background and expertise, was to create a Python script that uses a couple of helpful libraries, including BeautifulSoup for ruby removal based on HTML tags, and TinySegmenter for inserting spaces between words. My script requires you to have these packages installed, but it’s not a big deal to do so. You then run the script at a command line prompt. It looks for all .html files in a directory, loads them and runs the pre-processing, then outputs each processed file under the same filename but with a .txt ending, as a plain text UTF-8 encoded file.
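To make that flow concrete, here’s a minimal sketch of the outer loop – not my actual script, just the general shape. The directory name and the Shift_JIS input encoding are assumptions (Aozora HTML files are commonly Shift_JIS), and preprocess() is a stand-in for the steps described below.

```python
# A minimal sketch of the outer loop, not the actual script.
import glob
import os

def preprocess(html_text):
    # Stand-in for the steps described below: ruby removal, tag stripping,
    # word segmentation, and metadata trimming.
    return html_text

for path in glob.glob('aozora/*.html'):                # assumed directory name
    with open(path, encoding='shift_jis', errors='ignore') as f:  # assumed input encoding
        html_text = f.read()
    out_path = os.path.splitext(path)[0] + '.txt'      # same filename, .txt ending
    with open(out_path, 'w', encoding='utf-8') as out:  # plain text UTF-8 output
        out.write(preprocess(html_text))
```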

The first step in the script is to remove the ruby. Helpfully, the ruby is contained in several HTML tags. I had BeautifulSoup traverse the file and remove all elements contained within these tags; it removes both the tags and content.
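In BeautifulSoup terms, that step might look something like this – a sketch that assumes the glosses live in the <rt> and <rp> tags, which is the usual ruby markup; decompose() removes each tag along with its contents while leaving the base text of the word in place.

```python
from bs4 import BeautifulSoup

def strip_ruby(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    # Assumes the gloss text sits in <rt> (the reading) and <rp> (the fallback
    # parentheses); decompose() removes the tags and their contents.
    for tag in soup.find_all(['rt', 'rp']):
        tag.decompose()
    return str(soup)
```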

Next, I used a very simple regular expression to remove everything in angle brackets – i.e., the HTML tags. This is kind of quick and dirty, and won’t work on every file in the universe, but in Aozora texts everything inside angle brackets is an HTML tag, so it’s not a problem here.
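Something like the following, as a sketch:

```python
import re

def strip_tags(html_text):
    # Quick and dirty, as noted above: in Aozora files everything inside
    # angle brackets is an HTML tag, so deleting <...> spans leaves just text.
    return re.sub(r'<[^>]*>', '', html_text)
```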

Finally, I used TinySegmenter on the resulting HTML-free text to split the text into words. Luckily for me, it returns an array of words – basically, each word is a separate element in a list like [‘word1’, ‘word2’, … ‘wordn’] for n words. This makes my life easy for two reasons. First, I simply joined the array with a space between each word, creating one long string (the outputted text) with spaces between each element in the array (words). Second, it made it easy to just remove the part of the array that contains Aozora metadata before creating that string. Again, this is quick and dirty, but from examining the files I noted that the metadata always comes at the end of the file and begins with the word 底本 (‘source text’). Remove that word and everything after it, and then you have a metadata-free file.
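A rough sketch of this step, assuming the tinysegmenter package’s TinySegmenter().tokenize() interface and assuming the segmenter yields 底本 as its own token:

```python
from tinysegmenter import TinySegmenter  # pip install tinysegmenter

def segment_and_trim(text):
    words = TinySegmenter().tokenize(text)  # a list like ['word1', 'word2', ...]
    # The Aozora source metadata runs from 底本 to the end of the file,
    # so cut the list off there before joining.
    if '底本' in words:
        words = words[:words.index('底本')]
    return ' '.join(words)
```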

Write this resulting text into a plain text file, and you have a non-ruby, non-HTML, metadata-free, whitespace-delimited Aozora text! You still have to download all the Aozora files individually and then do what you will with the resulting individual text files, but it’s an easy way to pre-process this text and get it ready for tool-based (and also your-own-program-based) text analysis.

I plan to put the script on GitHub for your perusal and use (and of course modification) but for now, check it out on my Japanese Text Analysis research guide at Penn.

#dayofDH Meiroku zasshi 明六雑誌 project

It’s come to my attention that Fukuzawa Yukichi’s (and others’) early Meiji (1868-1912) journal, Meiroku zasshi 明六雑誌, is available online not just as PDF (which I knew about) but also as a fully tagged XML corpus from NINJAL (and oh my god, it has lemmas). All right!


I recently met up with Mark Ravina, who brought this to my attention, at the Association for Asian Studies meeting, and we are doing a lot of brainstorming about what we can do with this as a proof-of-concept project before moving on to other early Meiji documents. We have big ideas, like training OCR to recognize the difference between the katakana ニ and the kanji 二, for example; Meiji documents generally break OCR for reasons like this, because they’re so different from contemporary Japanese. It’s like asking Acrobat to handle a medieval manuscript, in some ways.

But to start, we want to run the contents of Meiroku zasshi through tools like MALLET and Voyant, just to see how they do with non-Western languages (I don’t expect any problems, but we’ll see) and what we get out of it. I’d also be interested in going back to the Stanford CoreNLP API and seeing what kind of linguistic analysis we can do there. (First, I have to think of a methodology. :O)

In order to do this, we need whitespace-delimited text, with words separated by spaces. I’ve written about this elsewhere, but to sum up, Japanese is not written with spaces between words, so tools intended for Western languages think it’s all one big word. There are no easy ways I can find to do this splitting; I’m currently working on an application that both strips ruby from Aozora bunko texts AND splits words with a space, but it’s coming along slowly. How do we get this for Meiroku zasshi in a quick and dirty way that lets us just play with the data?

So today after work, I’m going to use Python’s ElementTree library for XML to take the contents of the word tags from the corpus and just spit them into a text file delimited by spaces. Quick and dirty! I’ve been meaning to do this for weeks, but since it’s a “day of DH,” I thought I’d use the opportunity to motivate myself. Then, we can play.
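Roughly what I have in mind – a sketch only, with the filename and the element name written in as assumptions (the real NINJAL markup may use different tags, and possibly namespaces):

```python
import xml.etree.ElementTree as ET

# Hypothetical file and element names; the actual corpus schema may differ.
tree = ET.parse('meiroku_01.xml')
words = [w.text for w in tree.iter('word') if w.text]

with open('meiroku_01.txt', 'w', encoding='utf-8') as out:
    out.write(' '.join(words))
```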

Exciting stuff, this corpus. Unfortunately most of NINJAL’s other amazing corpora are available only on CD-ROMs that work on old versions of Windows. Sigh. But I’ll work with what I’ve got.

So that’s your update from the world of Japanese text analysis.

ruins – the past, the real, the monumental, the personal

Did I ever tell you about one of my favorite buildings in the world? It’s a public housing project named Kaigan-dori Danchi 海岸通り団地 (not to be confused with the type of projects one finds in the US; it was perfectly desirable housing in its time). This particular danchi (“community housing,” or a – generally public – housing project) was located smack in the middle of the richest section of Yokohama, between Kannai and Minato Mirai, perhaps one of the wealthiest areas of the Tokyo region. Here it is in all its dirty, dirty glory, with Landmark Tower in the background.

Yes. This is Kaigan-dori Danchi, one of the grossest “ruins” (haikyo 廃墟) I had ever seen. Or, I thought it was a ruin. You know, an abandoned building. Because it looked too much like a shell to be anything else.

Then I got a message on Flickr.

In it, the sender wrote that he grew up in Kaigan-dori Danchi and now lives in New York City. He let me know that yes, it’s still inhabited, and thanked me for putting so many photos of it on Flickr. (Yes, I went for a photo shoot of this complex, more than once – hey, it was on my walk home from school!) He felt nostalgic at seeing his boyhood home and was interested to see what it looked like now.

In other words, what I’d felt vaguely strange about as some kind of ruins voyeurism – the same kind of ruins porn that takes hold of nearly everyone who wants to take photos of Detroit, for example – turned out to be a two-way street. It wasn’t pure voyeurism; it was a way to connect with someone who had a direct experience of the past of this place, a place that was still alive and had a memory and a history, rather than being some monstrosity out of time – as I’d been thinking of it. I saw it as a monument, not an artifact.

So this was in 2008, a half year after I’d become obsessed with Japanese urban exploration photography, which at the time was enjoying a boom in Japan in the form of guidebooks, a glossy monthly magazine, calendars, DVDs, tours, photo books, and more. (Shortly thereafter, and I CALLED IT, came the public housing complex boom. I do have some of the photo books related to this boom too, because there’s nothing I love more than a good danchi.)

While preparing a presentation I gave on the topic for my Japanese class at IUC that year, I did some research into websites about ruins in Japan (all in Japanese, of course). These were fascinating: some of them were just about the photography, but others were about reconnecting with the past, posting pictures of old schools and letting former classmates write in the guestbooks of the sites. There was a mixi (like MySpace) group for the Shime Coal Mine (the only landmark of the first town I’d lived in in Japan). The photo books, on the other hand, profoundly decontextualized their objects and presented them as aesthetic monuments, much the way I’d first viewed Kaigan-dori Danchi.

So I wonder, with ruins porn a genre in the United States and Europe as well, do we have the same yearning for a concrete, real past that some of these sites and photographers exhibit, and not just vague nostalgia for the ruins of something that never existed? How much of ruins photography and guidebooks are about the site in context – the end point of a history – and how much is just about “hey I found this thing”? How much of this past is invented, never existed, purely fantasy, and how much of it is real, at least in the minds of those who remember it?

These are answers I don’t yet have, but I’ve just begun on this project. In the meantime, I’m happy to share Kaigan-dori Danchi with you.

fans, collectors, and archives

In the course of my research, I’ve been studying the connection between the first “complete works” anthology of the writer Ihara Saikaku, his canonization, and the collectors and fans who created the anthology – a very archival anthology. (I say this because, among other things, it records the contemporary provenance of the texts that make it up: every title page names the collector who contributed that text to the project!)

It’s struck me throughout this project that the role of fans – which these people were – their connection with collectors, and the overlap between the two are of crucial importance in preserving texts, in creating and maintaining archives, and in creating resources that make study or access possible in the first place. They do the hard work of searching, finding, discovering, buying, arranging, preserving, and, if we’re lucky, disseminating – through reprinting or, now, through making digital resources.

As I’ve become more acquainted with digital humanities and the range of projects out there, I can’t help but notice the role of collectors and fans here too. It’s not so much in the realm of academic projects as in the number of Web sites out there that provide images or other surrogates for documents and objects that would otherwise be inaccessible. These are people who have built up personal collections over years, and who have created what would otherwise be called authoritative guides and resources without qualification – but who are not academics. They occupy a gray area, combining expertise with a lack of academic affiliation or degree, yet they are the ones who have provided massive amounts of information and documentation – including digital copies of otherwise-inaccessible primary sources.

I think we can see this in action with fandoms surrounding contemporary media, in particular – just look at how much information is available on Wikipedia about current video games and TV shows. Check out the Unofficial Elder Scrolls Pages and other similar wikis. (Note that UESP began as a Web site, not a wiki; it’s a little time capsule that reflects how fan pages have moved from individual labors of love to collective ones, with the spread of wikis for fan sites. A history of the site itself – “much of the early details and dates are vague as there are no records available anymore” – can be found here.)

I’m not a researcher of contemporary media or fan culture, but I can’t help but notice this and how little it’s talked about in the context of digital humanities, creating digital resources, and looking at the preservation of information over time.

Without collectors like Awashima Kangetsu and fans like Ozaki Kōyō and Ōhashi Otowa, we might not have Ihara Saikaku here today – and yet he is now among the most famous Japanese authors, read in survey courses as the representative Edo (1600-1867) author. He was unknown at the time, an underground obsession of a handful of literary youths. It was their collecting work and their dedication (and connections to a major publisher) that produced The Complete Works of Saikaku in 1894, a two-volume set reprinted from those fans’ combined collections of used books. Who will we be saying this about in a hundred years?

For my readers out there who have their feet more in fandom and fan culture than I do, what do you think?

mishima__bot 三島由紀夫

The Internet never ceases to amaze me. Thing found on Twitter today:

@Mishima__bot 三島由紀夫

Here is its description, in rough translation:

Born January 14, 1925 (Taishō 14); died by suicide on November 25, 1970 (Shōwa 45). Representative works include the novels Confessions of a Mask, Forbidden Colors, The Sound of Waves, The Temple of the Golden Pavilion, Kyōko’s House, The Sailor Who Fell from Grace with the Sea, and others. This bot excerpts all manner of famous lines from his essays, dialogues, and representative works, with The Temple of the Golden Pavilion foremost among them. From a total of 50 referenced works, it tweets more than 3,000 quote patterns.

Yup. It’s a “bot” that posts famous quotes from his works. To Twitter. Mishima Yukio lives! I highly recommend following it, if only for the uncanny tweets you will find in your feed. I almost want to set this thing to send me texts when it tweets, but it posts too often.

What I’m doing this summer at CDRH: overview

I’ve been here at CDRH (The Center for Digital Research in the Humanities) at the University of Nebraska-Lincoln since early May, and the time went by so quickly that I’m only writing about what I’m doing a few weeks before my internship ends! But I’m in the thick of things now, in terms of my main work, so this may be the perfect time to introduce it.

My job this summer is (mostly) to update TokenX for the Willa Cather Archive (you can find it from the “Text Analysis” link at http://cather.unl.edu). I’m updating it in three senses:

  1. Redesigning the basic interface. This means starting from scratch with a list of functions that TokenX performs, organizing them for user access, figuring out which categories will form the interface (and what to put under each), and then making a visual mockup of where all this will go.
  2. Taking this interface redesign to a new Web site for TokenX at the Cather Archive.* The site redesign mostly involves adapting the new interface for the Web. Concretely, I’m building everything from the ground up with HTML5 and styles separated into CSS (and aiming for modularity: I’m using multiple stylesheets that operate at different levels of functionality – the color scheme, for example, is separated from the rest of the site so it can be modified or switched out easily). The goal is to avoid JavaScript completely, and I think we’re going to make it happen. We’re also aiming for text rather than images (for example, in menus) and keeping the site as basic and functional as possible. After all, this is an academic site, and too much flash will make people suspicious. 😀
  3. The exciting part: Implementing as much of TokenX with the new interface as I can in the time I’m here. Why is it exciting?
    • TokenX is written in XSLT, which tends to be mentioned in a cursory way as “stylesheets for XML” as though it’s like CSS. It’s not. It’s a functional programming language with syntax devised by a sadist. XSLT has a steep learning curve and I have had 9 weeks this summer to try my best to get over it before I leave CDRH. I’m doing my best and it’s going better than I ever imagined.
    • I’m also getting to learn how XSLT is often used to generate Web sites (which is what I’m doing): using Apache Cocoon. Another technology that I had no idea existed before this summer, and which is coming to feel surprisingly comfortable at this point.
    • I have never said so many times in a span of only two months, “I’m glad I had those four years of computer science curriculum in college. It’s paying off!” Given that I never went into software development after graduating, and haven’t done any non-trivial programming in quite a long time, I had mostly dismissed the idea that my education could ever be this relevant to my work. And honestly, I miss using it. This is awesome.

I’m getting toward the end of implementing all of the functionality of TokenX in XSLT for the new Web site, and hooking that up with the XSLT that then generates the HTML that delivers it to the user. (To be more concrete for those interested in Cocoon, I’m writing a sitemap that first processes the requested XML file with the appropriate stylesheet for transformation results, then passes those results on to a series of formatting stylesheets that eventually produce the final HTML product.) And I’m about midway through the process of going from Web prototype to basic site styling to a more polished end result. I’ve got 2.5 weeks left now, and I’m on track to have it up and running pretty soon! I’ll keep you updated with comments about the process – both the XSLT, and crafting the site with HTML5 and CSS – and maybe some screenshots.

* TokenX can be, and is, used for more than this collection at CDRH. Notably it’s used at the Walt Whitman Archive in basically the same way as Cather. But we have to start somewhere for now, and expand as we can later.