Tag Archives: language

#dayofDH Japanese apps workshop for new Penn students

Today, we’re having a day in the library for prospective and new Penn students who will (hopefully) join our community in the fall. As part of the library presentations, I’ve been asked to talk about Japanese mobile apps, especially for language learning.

While I don’t necessarily consider this a DH thing, some people do, and it’s one way I integrate technology into my job – through workshops and research guides on various digital resources. (More on that later.)

I gave this workshop for librarians at the National Coordinating Council on Japanese Library Resources (NCC) workshop held before the Council on East Asian Libraries conference a few weeks ago, in March 2014. My focus was perhaps too basic for a savvy crowd that uses foreign languages frequently in their work: I covered the procedure for setting up international keyboards on Android and iOS devices, dictionaries, news apps, language learning assistance, and Aozora bunko readers. However, I did manage to impart some lesser-known information: how to set up the Japanese and other foreign-language dictionaries that are built into iOS devices, for free. I got some thanks on that one. Also noted was the Aozora 2 Kindle PDF-maker.

Today, I’ll focus more on language learning and the basics of setting up international keyboards. I’ve been surprised at the number of people who don’t know how to do this, but not everyone uses foreign languages on their devices regularly, and on top of that, not everyone loves to poke around deep in the settings of their computer or device. And keyboard switching on Android can be especially tricky, with apps like Simeji. So perhaps covering the basics is a good idea after all.

I don’t have a huge amount of contact with undergrads compared to the reference librarians here, and my workshops tend to be focused on graduate students and faculty with Japanese language skills. So I look forward to working with a new community of pre-undergrads and seeing what their needs and desires are from the library.

the first-world internet

I heard an interesting presentation today, but it concluded with a very developed-world, class-based interpretation of the Internet that I simply can’t agree with.

Although it’s true that more students are coming from abroad to study in the US (attributed in the presentation partially to budgetary issues in public schools in the US, another issue entirely), the idea of ‘globalization’, I’d argue, is really a concept based in the developed world. Yes, we have more students studying ‘cross-border’ topics, and interested in the world outside of the US. American students are coming into more contact with international students thanks to their presence in American universities, and perhaps gaining more cultural competency through this interaction. ‘Global studies’ are now a thing.

But this presentation talked at the end about the global power of the Internet, and globalization generally, about being able to reach across borders and communicate unimpeded. It doesn’t just have the potential to break down barriers, but already actively does so, this presenter posited. It doesn’t just encourage dissent but is already a channel for dissent, and an opportunity available to all.

International students in the US may be experiencing this power of the Internet, yes. But at home? Students from nations such as China and Saudi Arabia may not have experienced the Internet in this way, and may not be able to experience it back home in the same way as they can in the West, in Korea, in Japan, in other developed countries. (And I realize that’s a problematic term in itself.) Moreover, not all American students have experienced this Internet either. The students we find in universities generally already have opportunities not available to everyone, including their access to technology and the Internet.

There’s also the inherent assumption that this global access – and ‘global studies’ in general – takes place in English. While many students abroad are studying English, not all have this opportunity; moreover, their access to the educational opportunities of the developed world is limited to those opportunities they can reach in English. Many undergraduates and even graduate students in the US limit themselves to the kind of global studies that can take place without foreign language competency. I realize that many do attempt foreign language study: while the vast majority of undergraduates I encounter who are interested in Japan and Korea cannot read materials in their focus countries’ languages, they are often enrolled in language classes and doing their best. However, there are many more who are not. They do not come to the world – they expect the world to come to them.

And there are many, many students around the world who do not have access to the English Internet, or cross-border collaboration in English through the opportunities the Internet potentially affords (or doesn’t, depending on the country). They may not even have reliable access to electricity, let alone a data connection. This is changing, but not at the speed that the kind of thinking I encountered today assumes.

Related to this, another presentation talked about the power of MOOCs and online learning experiences in general. And yes, while I generally agree that there is much potential here, the vast majority of MOOCs currently available require English, a reliable connection, and reliable electricity. They are by and large taken by educated, English-speaking adult men. There is potential, but that is not the same as actual opportunity.

Overall, I think we need to question what we are saying when we talk about the power of the global Internet, and distinguish between potential and reality. Moreover, we need to distinguish exactly the groups we are talking about when we talk about globalization, global studies, and cross-border/cross-cultural communication. Even without the assumption of a developed-world, upper-class Internet, we need to recognize that by and large, our work is still conducted in silos, especially in the humanities. Science researchers in Japan may be doing English-language collaboration with international colleagues, but humanities researchers largely cannot communicate in English and cross-language research in those fields is rare. I can’t speak for countries other than Japan and the US, really, but despite the close mutual interest in areas such as Japanese literature and history, there is little collaboration between the two countries – despite the potential, as with digitizing rare materials and pooling resources to create common-interest digital archives, for example.

Even those international students often conduct their American educations in language and culture silos. Even the ones with reliable Internet access use country-based chat and social media, although resources such as Facebook are gaining in popularity. We go with what is most comfortable for us, what comes to us; that doesn’t apply only to Americans. Our channels of communication are those that allow us the path of least resistance. Even if Twitter and Facebook weren’t blocked in China, would they prove as popular as Sina Weibo and other Chinese technologies? Do Americans know what Line is, or are they going to continue using WhatsApp?

If we find that English, money, and understanding of American cultural norms are major barriers to our communication, we might find other ways. Yes, that developed-world Internet may hold a lot of potential, but its global promise may not go in a direction that points toward us in America anyway.

Japanese tokenization – tools and trials

I’ve been looking (okay, not looking, wishing) for a Japanese tokenizer for a while now, and today I decided to sit down and do some research into what’s out there. It didn’t take long – things have improved recently.

I found two tools quickly: kuromoji Japanese morphological analyzer and the U-Tokenizer CJK Tokenizer API.

First off – so what is tokenization? Basically, it’s splitting sentences into words, or documents into sentences, or any text into some unit, so that you can chunk that text into parts and analyze them (or do other things with them). When you tokenize a document by word – a web page, for example – you enable searching: this is how Google finds individual words in documents. You can also extract keywords from a document this way, by writing an algorithm to choose the most meaningful nouns, for example. It’s also the first step in more involved linguistic analysis like part-of-speech tagging (think: marking individual words as nouns, verbs, and so on) and lemmatizing (paring words down to their stems, such as removing plural markers and un-conjugating verbs).

This gives you a taste of why tokenization is so fundamental and important for text analysis. It’s what lets you break up an otherwise unintelligible (to the computer) string of characters into units that the computer can attempt to analyze: it can index them, search them, categorize them, group them, visualize them, and so on. Without this, you’re stuck with “words” that are entire sentences or documents, which the computer treats as individual units simply because they’re one long string of characters.

Usually, the way you tokenize is to break up “words” based on spaces (or sentences based on punctuation rules, etc., although that doesn’t always work). (I put “words” in quotes because you can really make any kind of unit you want; the computer doesn’t understand what words are, and in the end it doesn’t matter. I’m just using “words” as an example here.) However, for languages like Japanese and Chinese (and to a lesser extent Korean) that don’t use spaces to delimit all words (in Korean, for example, particles are attached to nouns with no space in between, like saying “athome” instead of “at home”), you run into problems quickly. How do you break up texts into words when there’s no easy way to distinguish between them?
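To make that concrete, here’s a minimal sketch of naive whitespace tokenization – in Java, since the tools discussed below are Java libraries. The example sentences are my own; the point is just that the Japanese sentence comes back as a single “token” because it has no spaces to split on.

```java
import java.util.Arrays;

public class WhitespaceTokenizerDemo {
    public static void main(String[] args) {
        String english = "I bought two books at the station.";
        String japanese = "駅で本を二冊買った。"; // same idea, but no spaces between words

        // Splitting on whitespace works tolerably well for English...
        System.out.println(Arrays.toString(english.split("\\s+")));
        // [I, bought, two, books, at, the, station.]

        // ...but the Japanese sentence comes back as one giant "word."
        System.out.println(Arrays.toString(japanese.split("\\s+")));
        // [駅で本を二冊買った。]
    }
}
```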

The question of tokenizing Japanese may be a linguistic debate. I don’t know enough about linguistics to begin to participate in it, if it is. But I’ll quickly say that you can break up Japanese based on linguistic rules and dictionary rules – understanding which character compounds are nouns, which verb conjugations go with which verb stems (as opposed to being particles between words), and then breaking common particles into their own units. This appears to be how these tools are doing it. For my own purposes, I’m not as interested in linguistic patterns as I am in noun and verb usage (the meaning rather than the kind), so linguistic nitpicking won’t be my area anyway.

Moving on to the tools. I put them through the wringer: the first two lines of Higuchi Ichiyō’s Ame no yoru, from Aozora bunko.

One of the two, kuromoji, is the tokenizer behind Solr and Lucene. It does a fairly good job, although with Ichiyō’s uncommon word usage and conjugation it faltered and couldn’t figure out that 高やか is one word; rather, it divided it into 高 や か. It gives the base form, reading, and pronunciation, but nothing else. However, in the version that ships with Solr/Lucene, it lemmatizes. Would that ever make me happy. (That’s, again, reducing a word to its base form, making it easy to count all instances of both “people” and “person,” for example, if you’re just after meaning.) I would kill for this feature to be integrated with the tool below.
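For reference, here’s a minimal sketch of calling the standalone kuromoji library, based on the 0.7.x API as I understand it (class and method names may differ in other releases and in the Lucene/Solr-bundled version); the sample sentence is my own.

```java
import java.util.List;

import org.atilika.kuromoji.Token;
import org.atilika.kuromoji.Tokenizer;

public class KuromojiDemo {
    public static void main(String[] args) {
        // Build a tokenizer with the default dictionary.
        Tokenizer tokenizer = Tokenizer.builder().build();

        List<Token> tokens = tokenizer.tokenize("雨の夜に一人で本を読んだ。");
        for (Token token : tokens) {
            // Surface form plus the full feature string
            // (part of speech, base form, reading, pronunciation, ...).
            System.out.println(token.getSurfaceForm() + "\t" + token.getAllFeatures());
        }
    }
}
```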

The other, U-Tokenizer, did significantly better, but its major drawback is that it’s done in the form of an HTTP request, meaning that you can’t put in entire documents (well, maybe you could? how much can you pass in an HTTP request?). If it were downloadable code with an API, I would be very happy (kuromoji is downloadable and has a command line interface). U-Tokenizer figured out that 高やか is one word, and also provides a list of “keywords,” which as far as I can tell is a bunch of salient nouns. I used it for a very short piece of text, so I can’t comment on how many keywords it would come up with for an entire document. The documentation on this is sparse, and it’s not open source, so it’s impossible to know what it’s doing. Still, it’s a fantastic tool, and also seems to work decently for Chinese and Korean.

Each of these tools has its strengths, and both are quite usable for modern and contemporary Japanese. (I really was cruel to feed them Ichiyō.) However, there is a major trial involved in using them with freely available corpora like Aozora bunko. Guess what? Preprocessing ruby – the furigana reading glosses.

Aozora texts contain ruby marked up within the documents. I have my issues with stripping out ruby from documents that use it heavily (Meiji writers, for example), because it adds so much meaning to the text, but let’s say for argument’s sake that we’re not interested in the ruby. Now it’s time to cut it all out. If I were a regular expressions wizard (or even had basic competency with them) I could probably strip this out easily, but it’s still time-consuming. Download the text, strip out the ruby and other metadata, save as plain text. (Aozora texts are XHTML, NOT “plain text” as they’re often touted to be.) Repeat. For topic modeling with a tool like MALLET, you’re going to want hundreds of documents at the end of it. For example, you might be downloading all Meiji novels from Aozora and dividing them into chunks or chapters. Even the complete works of Natsume Sōseki aren’t enough without cutting them down into chapters or even paragraphs to make enough documents to use a topic modeling tool effectively. Possibly, run all of these through a part-of-speech tagger like KH Coder. This is going to take a significant amount of time.
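To give a flavor of that cleanup step, here’s a rough sketch – not the actual script I’ll end up writing – that strips ruby and other markup out of an Aozora-style XHTML snippet and then cuts the plain text into fixed-size chunks, so that a single novel becomes many small “documents” for a topic modeling tool. It assumes Aozora’s usual <ruby><rb>…</rb><rp>（</rp><rt>…</rt><rp>）</rp></ruby> markup; real files would also need their header and footer notes removed.

```java
import java.util.ArrayList;
import java.util.List;

public class AozoraCleanup {

    // Remove ruby annotations, keeping only the base text inside <rb>...</rb>,
    // then drop any remaining tags.
    static String stripRubyAndTags(String xhtml) {
        String text = xhtml
                .replaceAll("<rp>.*?</rp>", "")   // the （ ） brackets around readings
                .replaceAll("<rt>.*?</rt>", "")   // the readings themselves
                .replaceAll("</?rb>", "")
                .replaceAll("</?ruby>", "");
        return text.replaceAll("<[^>]+>", "");    // everything else: <br />, spans, etc.
    }

    // Split plain text into fixed-size chunks so one long text becomes
    // many small "documents" for topic modeling.
    static List<String> chunk(String text, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            chunks.add(text.substring(start, Math.min(start + chunkSize, text.length())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String sample = "<ruby><rb>高</rb><rp>（</rp><rt>たか</rt><rp>）</rp></ruby>やかに雨の音がする。";
        String plain = stripRubyAndTags(sample);
        System.out.println(plain);            // 高やかに雨の音がする。
        System.out.println(chunk(plain, 5));  // [高やかに雨, の音がする, 。]
    }
}
```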

Then again, preprocessing is an essential and extremely time-consuming part of almost any text analysis project. I went through a moderate amount of work just removing the Project Gutenberg metadata and dividing into chapters a set of travel narratives that I had downloaded in plain text – thankfully not in HTML or XML, which made for easy processing. With something that’s not really plain text to begin with, with a lot of metadata, and with a lot of ruby, it’s going to take much more time and effort, which is more typical of a project like this. The digital humanities involve a lot of manual labor, despite the glamorous image and the idea that computers can do much of that labor for us. Computers are a little finicky about what they’ll accept. (Granted, I’ll be using a script to strip out the XHTML and ruby tags, but it’s going to take work for me to write it in the first place.)

In conclusion? Text analysis, despite the exciting tools available, is still hard and time-consuming. There is a lot of potential here, but I also see myself going through some trials to get to the fun part – the experimentation. Still, stay tuned, especially for some follow-up posts on these tools and KH Coder as I become more familiar with them. And I promise to stop being difficult and feeding them Ichiyō’s Meiji-style bungo.

android slashdot reader: 和英コメントで言語学び! (learning languages through Japanese and English comments!)

Now that I have an Android phone and have found some pretty great things on the Android Market for getting ahold of Japanese content, I would like to start sharing with you all what I’ve been using and whether it’s worth downloading.

First up is my favorite new find: Slashdot Reader. Yeah. It’s an RSS feed reader for Slashdot. Why so great?

Well, can you imagine my reaction when I looked at its listing and saw screenshots of posts from slashdot.org and slashdot.jp showing up all mixed in together? Then I read the description: “just a feed reader, nothing more” – for both Japanese and English Slashdot.

It’s like I found an app made by my doppelganger. Really.

If you don’t want both languages, you can toggle between showing both, Japanese only, or English only.

Because it’s a feed reader, you only get the headlines and leads from Slashdot, but you can easily click through to the full story – and therein lies an amazing language-learning tool that somehow never occurred to me.

All of these years, I could have been learning Japanese through Slashdot comments! That’s right. Of course it’s not textbook Japanese. I already know how trolls (荒らし) talk after just a minute or two of reading. How nerds talk. (They always use が and never けど, although they do use ね sparingly for emphasis. A certain language teacher from several years ago, who forbade us from using けど in class for an entire semester, would be proud.) And how random users talk.

I also know how they’re basically saying exactly the same things that commenters do on Slashdot in English, only they’re saying it in Japanese. (open source != free as in beer, anyone? I seriously just read this. 無償 is free as in beer, and note that it’s not the same as the widely-used word for “free” 無料 – so I just learned something new about software licensing.) So if you’re a Slashdot reader, this is going to help you immensely. It’s all about context.

Yes, so there are people out there who would disparage the idea of learning language from internet comments. But I counter that with: it’s real language! And this is a specific forum where you know what is coming: some nerdspeak, some posturing, some trolls, some reasonable people, talking about a rather limited set of topics. So you are going to learn voices, not just “Japanese.” You are going to learn what people say in a certain situation, and also what not to say. I can’t think of anything more helpful than that!

And here you go: Slashdot Reader for Android (this takes you to Android Market).

podcast: 学問のススメ

New podcast for me, and best podcast ever in my opinion. Especially for learning! And you have to love the title. It is…

学問のススメ – “Special Edition” (ラジオ版) (Gakumon no susume – radio)

(Why is the title great? Because Fukuzawa Yukichi, who was all about enlightenment, published a book in the 1870s called “Gakumon no susume.” In other words (and in my words), “The furthering of knowledge/study.” This podcast takes the same name, somewhat sincerely and somewhat tongue-in-cheek.)

The mission: Learn things that you either didn’t learn in school, or that you forgot. The format: Interviews with a variety of interesting people, from authors to reporters to photographers to athletes. Why? It’s interesting, it exposes you to many different fields, and for Japanese learners it’s a great source of vocabulary on a range of topics and of practice listening to “real Japanese” – fast-paced, at native speed, and aimed at the average layperson. And best of all, it’s free!

You can download individual episodes, or subscribe via the Japanese iTunes store. (You don’t need an account or Japanese credit card to download episodes that are free – just change your location to “Japan” at the bottom of your iTunes Store home screen and search for the title. If you’d like to find more podcasts, just browse away by topic!)

Click here for 学問のススメ!