
Pre-processing Japanese literature for text analysis

I recently wrote a small script to perform a couple of functions for pre-processing Aozora Bunko texts (text files of public domain, modern Japanese literature and non-fiction) so they can be used with Western-oriented text analysis tools, such as Voyant, other TAPoR tools, and MALLET. Whereas Japanese text analysis software focuses largely on linguistics (tagging parts of speech, lemmatizing, etc.), Western tools open up possibilities for visualization, concordances, topic modeling, and various other modes of analysis.

Why do these Aozora texts need to be processed? Well, a few issues.

  1. They contain ruby: short glosses of Chinese characters that give their pronunciation. These can be straightforward pronunciation help, or entirely different words that add meaning and context. While I have my issues with removing ruby, it’s impossible to do straightforward tool-based analysis without removing it, and many people who want to do this kind of analysis want it removed.
  2. The Aozora files are not exactly plain text: they’re HTML. The HTML tags and Aozora metadata (telling where the text came from, for example) need to be removed before analysis can be performed.
  3. There are no spaces between words in Japanese, but Western text analysis tools identify words by looking at where there are spaces. Without inserting spaces, it looks like each line is one big word. So I needed to insert spaces between the Japanese words.

How did I do it? My approach, because of my background and expertise, was to create a Python script that uses a couple of helpful libraries: BeautifulSoup for ruby removal based on HTML tags, and TinySegmenter for inserting spaces between words. The script requires you to have these packages installed, but that’s not a big deal to do. You then run the script at a command line prompt. It looks for all .html files in a directory, loads each one and runs the pre-processing, then outputs each processed file under the same filename with a .txt ending, as a plain text UTF-8 encoded file.
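To give a sense of the flow, here is a minimal sketch of that outer loop. The preprocess() helper is just a placeholder for the steps described below, and I am assuming the usual Aozora encoding (their HTML files are generally Shift-JIS, so they need decoding on the way in):

    import glob
    import os

    for path in glob.glob('*.html'):
        # Aozora HTML files are generally Shift-JIS encoded
        with open(path, encoding='shift_jis', errors='replace') as f:
            html = f.read()
        text = preprocess(html)  # placeholder: the steps described below
        # Same filename, .txt ending, plain text, UTF-8
        with open(os.path.splitext(path)[0] + '.txt', 'w', encoding='utf-8') as f:
            f.write(text)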

The first step in the script is to remove the ruby. Helpfully, the ruby is contained in several HTML tags. I had BeautifulSoup traverse the file and remove all elements contained within these tags; it removes both the tags and content.
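As a sketch, assuming Aozora’s usual ruby markup, where the gloss sits in <rt> tags and <rp> holds the fallback parentheses (the function name is my own):

    from bs4 import BeautifulSoup

    def strip_ruby(html):
        soup = BeautifulSoup(html, 'html.parser')
        # Remove the gloss (<rt>) and fallback parentheses (<rp>) entirely,
        # tags and contents both; the base text stays put, and any leftover
        # <ruby> tags get caught by the tag-stripping step below.
        for tag in soup.find_all(['rt', 'rp']):
            tag.decompose()
        return str(soup)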

Next, I used a very simple regular expression to remove everything in angle brackets – i.e., the HTML tags. This is quick and dirty, and won’t work on every file in the universe, but in Aozora texts everything inside angle brackets is an HTML tag, so it’s not a problem here.
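In code, something along these lines (again, the function name is mine):

    import re

    def strip_tags(text):
        # Quick and dirty: in Aozora files, anything between angle brackets
        # is an HTML tag, so one substitution clears them all.
        return re.sub(r'<[^>]*>', '', text)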

Finally, I used TinySegmenter on the resulting HTML-free text to split it into words. Luckily for me, it returns an array of words – basically, each word is a separate element in a list like ['word1', 'word2', …, 'wordN'] for N words. This makes my life easy for two reasons. First, I simply joined the array with a space between each element, creating one long string (the output text) with spaces between the words. Second, it made it easy to remove the part of the array that contains the Aozora metadata before creating that string. Again, this is quick and dirty, but from examining the files I noted that the metadata always comes at the end of the file and begins with the word 底本 (‘source text’). Remove that word and everything after it, and you have a metadata-free file.
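A sketch of this step, assuming the common Python port of TinySegmenter (whose tokenize() returns the word list described above) and assuming 底本 comes out of the segmenter as a single token; the function name is mine:

    import tinysegmenter

    def segment(text):
        segmenter = tinysegmenter.TinySegmenter()
        words = segmenter.tokenize(text)
        # The Aozora colophon starts with 底本 ('source text'): drop that
        # word and everything after it, then join the words on spaces.
        if '底本' in words:
            words = words[:words.index('底本')]
        return ' '.join(words)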

Write the resulting text into a plain text file, and you have a non-ruby, non-HTML, metadata-free, whitespace-delimited Aozora text! Although you still have to download all the Aozora files individually and then do what you will with the resulting text files, it’s an easy way to pre-process this text and get it ready for tool-based (and also your-own-program-based) text analysis.
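Chained together, the three helpers sketched above fill in the preprocess() placeholder from the first snippet:

    def preprocess(html):
        return segment(strip_tags(strip_ruby(html)))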

I plan to put the script on GitHub for your perusal and use (and of course modification) but for now, check it out on my Japanese Text Analysis research guide at Penn.

footnotes: paper books still do it better

Obviously, being in librarianship (training) and at the School of Information, I hear lots of things about the “death of the book” and the rise of e-books. Many are mainstream media articles that border on the downright silly; others are from tech leaders with interesting speculations; still others are inflamed (but sometimes reasonable) discussions on our school listserv.

To summarize, I hear the following: 1) you can’t get rid of paper books (because you’ll have to pry them out of my cold dead hands), because there’s just something special or nostalgic about them that I can’t put my finger on; or 2) market forces will soon drive out paper books for good, because there just won’t be enough demand once everyone is on board with e-readers. Get used to it, suckers.

You’d think I have a strong investment in one or the other, but I don’t. Both sides sound vaguely ridiculous to me. I say this as someone with a used book collection far larger than a sane person should have. It takes over large regions of my apartment, and yet at least half of it is still in storage elsewhere. I don’t buy a lot of e-books.

Here’s my very strong opinion on the issue: I use both at the same time, and I want to keep doing so. I have a strong preference for e-books (because I have a Kindle now and find it so easy to read on) for pretty much everything that suits the medium, because frankly, they won’t take up physical space and create even more of a nuisance for me than my book collection already does. Why carry around some big trade paperback if you just want to read, and your Project Gutenberg edition is free anyway, for God’s sake. I love having an e-book option and I spend most of my time angsting not over the change, but over the fact that a lot of the selection still sucks, the quality often sucks, and I can’t get enough of what I want on the Kindle.

But why do I still maintain that I need print books, if I love getting an electronic version so much? Because e-books can’t do what paper books do best for me: serve as reference books that demand looking in multiple places at once. I don’t rank “the tactile sensation of paper and the smell of a new (or old, ugh) book” among the positive qualities that paper books offer. Incidentally, I am frustrated that people never seem to think about the tactile qualities of e-book readers, tablets, and computers – at least, no one’s talking about them. Reading PDFs on my 27″ desktop monitor has a certain physical quality that I really enjoy (that big screen where I can read about three side by side and move them around!), and my Kindle has some awesome tactile qualities that I really love. (Offering roughly the same reading experience, page for page, as a small paperback book is particularly great, because it is nostalgia central for me.)

So, because of this inability to conveniently keep multiple pages “marked” (often with fingers, right?) to flip back and forth between easily, or even look at them semi-simultaneously (I know I’m not the only one who kind of keeps both sections of the book half-open when I’m looking back and forth), I cannot give up paper books. This is a key feature for at least 50% of what I read, and it’s so important that if an e-book does it poorly, I am not going to put up with it.

Most of my experience is with PDFs on the computer and books on the Kindle, and I haven’t found any electronic format that does footnotes well – I’m talking about endnotes here too. The Kindle tries and fails pretty miserably. The process is so slow that it comes nowhere near mimicking a flip back and forth between the endnote section and the page you’re reading. PDF hotlinks are pretty much as bad, or worse.

So what I use my Kindle largely for, right now, is reading some stuff that I don’t have to read too hard (news, fiction, short or light non-fiction), and for previewing books that I have to buy in physical format.

The ones I can’t buy on the Kindle (even though yes, a version is available): Reference-style books. Any book with a lot of footnotes. Programming or technical books. (Seriously, who wants to try to view code examples on a page that small?) Any book that needs to be larger format to be readable. Books with a lot of pictures. (Duh.)

Books for “school” (i.e. related to my dissertation or other research) fall into this category too: I fill them with post-it notes and frequently have to flip between sections when I’m writing, keeping track of many pages that I’m using all at once, referring to earlier or later sections, and following the abundant footnotes. There’s no way I can look at this stuff on an e-reader, or on a computer.

Yes, a PDF viewer on my large monitor that would let me keep pages from the same book open in windows right next to each other would be helpful, but as far as I know this doesn’t exist. Tabs wouldn’t cut it. “Flipping” between footnote links and a page, or between tabs, is just too slow. E-books don’t give me the speed that paper books do.

Honestly, I would be a happy camper if someone were to solve this ergonomic problem and let me buy more e-books to free up valuable apartment space. O’Reilly books are a particular offender. But I’m not holding my breath here; like being a PPC user, am I relegated to a shrinking and soon-to-be obsolete “user” or “consumer” base here? I hope not.