As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.
I recently wrote a small script to perform a couple of functions for pre-processing Aozora Bunko texts (text files of public domain, modern Japanese literature and non-fiction) to be used with Western-oriented text analysis tools, such as Voyant, other TAPoR tools, and MALLET. Whereas Japanese text analysis software focuses largely on linguistics (tagging parts of speech, lemmatizing, etc.), Western tools open up possibilities for visualization, concordances, topic modeling, and other various modes of analysis.
Why do these Aozora texts need to be processed? Well, a couple of issues.
- They contain ruby, which are basically glosses of Chinese characters that give their pronunciation. These can be straightforward pronunciation help, or actually different words that give added meaning and context. While I have my issues with removing ruby, it’s impossible to do straightforward tool-based analysis without removing it, and many people who want to do this kind of analysis want it to be removed.
- The Aozora files are not exactly plain text: they’re HTML. The HTML tags and Aozora metadata (telling where the text came from, for example) need to be removed before analysis can be performed.
- There are no spaces between words in Japanese, but Western text analysis tools identify words by looking at where there are spaces. Without inserting spaces, it looks like each line is one big word. So I needed to insert spaces between the Japanese words.
How did I do it? My approach, because of my background and expertise, was to create a Python script that used a couple of helpful libraries, including BeautifulSoup for ruby removal based on HTML tags, and TinySegmenter for inserting spaces between words. My script requires you to have these packages installed, but it’s not a big deal to do so. You then run the script in a command line prompt. The way it works is to look for all .html files in a directory, load them and run the pre-processing, then output each processed file with the same filename, .txt ending, a plain text UTF-8 encoded file.
The first step in the script is to remove the ruby. Helpfully, the ruby is contained in several HTML tags. I had BeautifulSoup traverse the file and remove all elements contained within these tags; it removes both the tags and content.
Next, I used a very simple regular expression to remove everything in brackets – i.e. the HTML tags. This is kind of quick and dirty, and won’t work on every file in the universe, but in Aozora texts everything inside a bracket is an HTML tag, so it’s not a problem here.
Finally, I used TinySegmenter on the resulting HTML-free text to split the text into words. Luckily for me, it returns an array of words – basically, each word is a separate element in a list like [‘word1’, ‘word2’, … ‘wordn’] for n words. This makes my life easy for two reasons. First, I simply joined the array with a space between each word, creating one long string (the outputted text) with spaces between each element in the array (words). Second, it made it easy to just remove the part of the array that contains Aozora metadata before creating that string. Again, this is quick and dirty, but from examining the files I noted that the metadata always comes at the end of the file and begins with the word 底本 (‘source text’). Remove that word and everything after it, and then you have a metadata-free file.
Write this resulting text into a plain text file, and you have a non-ruby, non-HTML, metadata-free, whitespace-delimited Aozora text! Although you have to still download all the Aozora files individually and then do what you will with the resulting individual text files, it’s an easy way to pre-process this text and get it ready for tool-based (and also your-own-program-based) text analysis.
I plan to put the script on GitHub for your perusal and use (and of course modification) but for now, check it out on my Japanese Text Analysis research guide at Penn.
It’s come to my attention that Fukuzawa Yukichi’s (and others’) early Meiji (1868-1912) journal, Meiroku zasshi 明六雑誌, is available online not just as PDF (which I knew about) but also as a fully tagged XML corpus from NINJAL (and oh my god, it has lemmas). All right!
I recently met up with Mark Ravina at Association for Asian Studies, who brought this to my attention, and we are doing a lot of brainstorming about what we can do with this as a proof-of-concept project, and then move on to other early Meiji documents. We have big ideas like training OCR to recognize the difference between the katakana and kanji 二, for example; Meiji documents generally break OCR for various reasons like this, because they’re so different from contemporary Japanese. It’s like asking Acrobat to handle a medieval manuscript, in some ways.
But to start, we want to run the contents of Meiroku zasshi through tools like MALLET and Voyant, just to see how they do with non-Western languages (don’t expect any problems, but we’ll see) and what we get out of it. I’d also be interested in going back to the Stanford Core NLP API and seeing what kind of linguistic analysis we can do there. (First, I have to think of a methodology. :O)
In order to do this, we need whitespace-delimited text with words separated by spaces. I’ve written about this elsewhere, but to sum up, Japanese is not separated by spaces, so tools intended for Western languages think it’s all one big word. There are currently no easy ways I can find to do this splitting; I’m currently working on an application that both strips ruby from Aozora bunko texts AND splits words with a space, but it’s coming slowly. How to get this with Meiroku zasshi in a quick and dirty way that lets us just play with the data?
So today after work, I’m going to use Python’s eTree library for XML to take the contents of the word tags from the corpus and just spit them into a text file delimited by spaces. Quick and dirty! I’ve been meaning to do this for weeks, but since it’s a “day of DH,” I thought I’d use the opportunity to motivate myself. Then, we can play.
Exciting stuff, this corpus. Unfortunately most of NINJAL’s other amazing corpora are available only on CD-ROMs that work on old versions of Windows. Sigh. But I’ll work with what I’ve got.
So that’s your update from the world of Japanese text analysis.