Category Archives: Japan

Using Collections – Virtually

I heard a remark the other day that struck a chord – or, churned my butter, a bit.

The gist of it was, “we should make digital facsimiles of our library materials (especially rare materials) and put them online, so they spark library use when people visit to see them in person, after becoming aware of them thanks to the digitized versions.”

Now, at Penn, we have digitized a couple of Japanese collections: Japanese Juvenile Fiction Collection (aka Tatsukawa Bunko series), Japanese Naval Collection (in process, focused on Renshū Kantai training fleet materials), and a miscellaneous collection of Japanese rare books in general.* These materials have been used both in person (thanks to publicizing them, pre- and post-digitization, on library news sites, blogs, and social media as well as word-of-mouth), and also digitally by researchers who cannot travel to Penn. In fact, a graduate student in Australia used our juvenile fiction collection for part of his dissertation; another student in Wisconsin plans to use facsimiles of our naval materials once they’re complete; and faculty at the University of Montana have used our digital facsimile of the Meiji-period journal Hōbunkai-sui (or Hōbunkai-shi).

These researchers, due to distance and budget, will likely never be able to visit Penn in person to use the collections. On top of that, some items – like the juvenile fiction and lengthy government documents related to the Imperial Navy – don’t lend themselves to use in a reading room. These aren’t artifacts to look over one page at a time, but research materials that will be read extensively (rather than “intensively,” a distinction we book history folks make). Thus, digital access is the only use these researchers can make of our materials.

The digitization of Japanese collections at Penn has invited use and a kind of library visit by virtue of being available for researchers worldwide, not just those who are at Penn (who could easily view them in person and don’t “need” a digital facsimile), or who can visit the library to “smell” the books (as the person I paraphrased put it). I think it’s more important to be able to read, research, and use these documents than to smell or witness the material artifact. Of course, there are cases in which one would want to do that, but by and large, our researchers care more about the content and visual aspects of the materials – things that can be captured and conveyed in digital images – rather than touching or handling them.

Isn’t this use, just as visiting the library in person is use? Shouldn’t we be tracking visits to our digital collections, downloads, and qualitative stories about their use in research, just as we do a gate count and track circulation? I think so. As we think about the present and future of libraries, and people make comments about their not being needed because libraries are on our smartphones (like libraries of fake news, right?), we must make the argument for providing content both physically and virtually. Who do people think is providing the content for their digital libraries? Physical libraries, of course! Those collections exist in the real world and come from somewhere, with significant investments of money, time, and labor involved – and moreover, it is the skilled and knowledgeable labor of professionals that is required.

On top of all of this, I feel it is most important to own up to what we can and cannot “control” online: our collections, by virtue of being able to be released at all, are largely in the public domain. Let’s not put CC licenses on them except for CC-0 (which is explicitly marking materials as public domain), pretending we can control the images when we have no legal right to (but users largely don’t know that). Let’s allow for free remixing and use without citing the digital library/archive it came from, without getting upset about posts on Tumblr. When you release public domain materials on the web (or through other services online), you are giving up your exclusive right to control the circumstances under which people use them – and as a cultural heritage institution, it is your role to perform this service for the world.

But not only should we provide this service, we should take credit for it: take credit for use, visits, and for people getting to do whatever they want with our collections. That is really meaningful and impactful use.

* Many thanks to Michael Williams for his great blog posts about our collections!

Taiyō project: first steps with data

As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.


how to make Japanese udon/soba broth

After packing my copy of Japanese Cooking Contemporary and Traditional (an awesome vegan cookbook) a little hastily before my upcoming move, I scoured the Internet for the basic udon/soba noodle broth recipe. To my surprise, it is not on the Oracle. So here I’ll provide the standard Japanese recipe for soba/udon broth for posterity.

You need:

  • 4 cups any kind of broth (for example konbu-dashi, katsuo-dashi, or chicken, or fake-chicken as I used last night)
  • 2-5 tablespoons light (usu-kuchi) soy sauce as desired for saltiness
  • 1 tablespoon or so ryōri-shu (cooking sake, the cheap kind)
  • 1 teaspoon mirin or sugar if you don’t have mirin handy
  • 1/2 teaspoon salt

Simmer all this together for about 5 minutes and pour over the noodles. Adjust all the salty and sugary elements based on your taste. This makes enough for 2 servings.

You’re welcome, Internet!

Pre-processing Japanese literature for text analysis

I recently wrote a small script to perform a couple of functions for pre-processing Aozora Bunko texts (text files of public domain, modern Japanese literature and non-fiction) to be used with Western-oriented text analysis tools, such as Voyant, other TAPoR tools, and MALLET. Whereas Japanese text analysis software focuses largely on linguistics (tagging parts of speech, lemmatizing, etc.), Western tools open up possibilities for visualization, concordances, topic modeling, and other various modes of analysis.

Why do these Aozora texts need to be processed? Well, a couple of issues.

  1. They contain ruby, which are basically glosses of Chinese characters that give their pronunciation. These can be straightforward pronunciation help, or actually different words that give added meaning and context. While I have my issues with removing ruby, it’s impossible to do straightforward tool-based analysis without removing it, and many people who want to do this kind of analysis want it to be removed.
  2. The Aozora files are not exactly plain text: they’re HTML. The HTML tags and Aozora metadata (telling where the text came from, for example) need to be removed before analysis can be performed.
  3. There are no spaces between words in Japanese, but Western text analysis tools identify words by looking at where there are spaces. Without inserting spaces, it looks like each line is one big word. So I needed to insert spaces between the Japanese words.

How did I do it? My approach, because of my background and expertise, was to create a Python script that used a couple of helpful libraries, including BeautifulSoup for ruby removal based on HTML tags, and TinySegmenter for inserting spaces between words. My script requires you to have these packages installed, but it’s not a big deal to do so. You then run the script in a command line prompt. The way it works is to look for all .html files in a directory, load them and run the pre-processing, then write each processed file out under the same filename with a .txt ending, as a plain text, UTF-8 encoded file.
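To make that outer loop concrete, here is a minimal sketch of the directory handling only. It is not the original script: the function name `convert_directory`, the `process` callback, and the `input_encoding` default are all assumptions for illustration, and you should set the encoding to whatever your downloaded files actually use.

```python
from pathlib import Path
from typing import Callable, List


def convert_directory(
    directory: str,
    process: Callable[[str], str],
    input_encoding: str = "shift_jis",  # assumption: change to match your files
) -> List[Path]:
    """Run `process` on every .html file in `directory`; save UTF-8 .txt files."""
    written = []
    for html_path in sorted(Path(directory).glob("*.html")):
        # Undecodable bytes are replaced rather than raising an error.
        raw = html_path.read_text(encoding=input_encoding, errors="replace")
        cleaned = process(raw)
        # Same filename, .txt ending, written next to the source file.
        txt_path = html_path.with_suffix(".txt")
        txt_path.write_text(cleaned, encoding="utf-8")
        written.append(txt_path)
    return written
```

Here `process` stands in for the whole pre-processing chain (ruby removal, tag stripping, segmentation), so the loop itself stays independent of how those steps are implemented.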

The first step in the script is to remove the ruby. Helpfully, the ruby is contained in several HTML tags. I had BeautifulSoup traverse the file and remove all elements contained within these tags; it removes both the tags and content.
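As a rough illustration of this step, the sketch below drops ruby glosses using only Python's standard library `html.parser`, swapped in here for BeautifulSoup so the example has no dependencies. It assumes the readings live in `<rt>` and `<rp>` elements, which the post does not actually specify; `RubyStripper` and `strip_ruby` are hypothetical names.

```python
from html.parser import HTMLParser

RUBY_GLOSS_TAGS = {"rt", "rp"}  # assumption: the tags that hold the readings


class RubyStripper(HTMLParser):
    """Re-emit tags and text, skipping gloss elements (and comments/doctype)."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside an <rt> or <rp> element

    def handle_starttag(self, tag, attrs):
        if tag in RUBY_GLOSS_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied through once.
        if self.skip_depth == 0 and tag not in RUBY_GLOSS_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in RUBY_GLOSS_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_ruby(html: str) -> str:
    """Return `html` with gloss elements (tags and their content) removed."""
    parser = RubyStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The base text inside `<ruby>` and `<rb>` is kept, and only the gloss elements disappear along with their content. The remaining tags are left for the next step to remove.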

Next, I used a very simple regular expression to remove everything in angle brackets – i.e. the HTML tags. This is kind of quick and dirty, and won’t work on every file in the universe, but in Aozora texts everything inside angle brackets is an HTML tag, so it’s not a problem here.
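One possible form of such an expression is sketched below. The exact pattern in the original script is not given in this post, so `TAG_PATTERN` and `strip_tags` are illustrative assumptions.

```python
import re

# Matches "<", then any run of characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(text: str) -> str:
    """Delete every <...> span, leaving only the text between tags."""
    return TAG_PATTERN.sub("", text)
```

As noted, this would mangle any text that uses a literal `<` or `>` for something other than a tag, which is why it is only safe on files where that never happens.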

Finally, I used TinySegmenter on the resulting HTML-free text to split the text into words. Luckily for me, it returns an array of words – basically, each word is a separate element in a list like [‘word1’, ‘word2’, … ‘wordn’] for n words. This makes my life easy for two reasons. First, I simply joined the array with a space between each word, creating one long string (the outputted text) with spaces between each element in the array (words). Second, it made it easy to just remove the part of the array that contains Aozora metadata before creating that string. Again, this is quick and dirty, but from examining the files I noted that the metadata always comes at the end of the file and begins with the word 底本 (‘source text’). Remove that word and everything after it, and then you have a metadata-free file.
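The list handling described here can be sketched as follows. To keep the example runnable without extra packages, it starts from a list of tokens that has already been produced; `tokens_to_text` and `METADATA_MARKER` are hypothetical names, and the sketch assumes the segmenter emits 底本 as a token of its own.

```python
from typing import List

METADATA_MARKER = "底本"  # 'source text': first word of the trailing metadata


def tokens_to_text(tokens: List[str], marker: str = METADATA_MARKER) -> str:
    """Cut the token list at the first `marker`, then join with spaces."""
    if marker in tokens:
        # Keep everything before the marker; drop the marker and what follows.
        tokens = tokens[: tokens.index(marker)]
    return " ".join(tokens)
```

In the actual pipeline the token list would come from the segmenter, for example `tinysegmenter.TinySegmenter().tokenize(text)` if you use the `tinysegmenter` package. If 底本 also appears in the body of a text, cutting at the first occurrence would truncate it, so searching from the end of the list may be safer.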

Write this resulting text into a plain text file, and you have a non-ruby, non-HTML, metadata-free, whitespace-delimited Aozora text! Although you still have to download all the Aozora files individually and then do what you will with the resulting individual text files, it’s an easy way to pre-process this text and get it ready for tool-based (and also your-own-program-based) text analysis.

I plan to put the script on GitHub for your perusal and use (and of course modification) but for now, check it out on my Japanese Text Analysis research guide at Penn.

#dayofDH Japanese digital resource research guides

Another “digital” thing I’ve been doing that relates to the “humanities” (but is it even remotely DH? I don’t know), is the creation of research guides for digital resources in Japanese studies of all kinds, with a focus on Japanese-language free websites and databases, and open-access publications.

So far, I’ve been working hard on creating guides for electronic Japanese studies resources, and mobile apps easily accessible in the US for both Android and iOS that relate to Japanese research or language study. The digital resources guide covers everything from general digital archives and citation indexes to literature, art, history, pop culture, and kuzushiji resources (for reading handwritten pre- and early modern documents). They range from text and image databases to dictionaries and even YouTube videos and online courseware for learning classical Japanese and how to read manuscripts.

This has been a real challenge, as you can imagine. Creating lists of stuff is one thing (and is one thing I’ve done for Japanese text analysis resources), but actually curating them and creating the equivalent of annotated bibliographies is quite another. It’s been a huge amount of research and writing – both in discovery of sources, and also in investigating and evaluating them, then describing them in plain terms to my community. I spent hours on end surfing the App and Play Stores and downloading/trying countless awful free apps – so you don’t have to!

It’s especially hard to find digital resources in ways other than word of mouth. I find that I end up linking to other librarians’ LibGuides (i.e. research guides) often because they’ve done such a fantastic job curating their own lists already. I wonder sometimes if we’re all just duplicating each other’s efforts! The NCC has a database of research guides, yes, but would it be better if we all collaboratively edited just one? Would it get overwhelming? Would there be serious disagreements about how to organize, whether to include paid resources (and which ones), and where to file things?

The answer to all these questions is probably yes, which creates problems. Logistically, we can’t have every Japanese librarian in the English-speaking world editing the same guide anyway. So it’s hard to say what the solution is – keep working in our silos? Specialize and tell our students and faculty to Google “LibGuide Japanese” + topic? (Which is what I’ve done in the past with art and art history.) Search the master NCC database? Some combination is probably the right path.

Until then, I will keep working on accumulating as many kuzushiji resources as I can for Penn’s reading group, and updating my mobile app guide if I ever find a decent まとめ!

#dayofDH Meiroku zasshi 明六雑誌 project

It’s come to my attention that Fukuzawa Yukichi’s (and others’) early Meiji (1868-1912) journal, Meiroku zasshi 明六雑誌, is available online not just as PDF (which I knew about) but also as a fully tagged XML corpus from NINJAL (and oh my god, it has lemmas). All right!


I recently met up with Mark Ravina at the Association for Asian Studies conference, who brought this to my attention, and we are doing a lot of brainstorming about what we can do with this as a proof-of-concept project, and then move on to other early Meiji documents. We have big ideas like training OCR to recognize the difference between the katakana ニ and the kanji 二, for example; Meiji documents generally break OCR for various reasons like this, because they’re so different from contemporary Japanese. It’s like asking Acrobat to handle a medieval manuscript, in some ways.

But to start, we want to run the contents of Meiroku zasshi through tools like MALLET and Voyant, just to see how they do with non-Western languages (I don’t expect any problems, but we’ll see) and what we get out of it. I’d also be interested in going back to the Stanford CoreNLP API and seeing what kind of linguistic analysis we can do there. (First, I have to think of a methodology.  :O)

In order to do this, we need whitespace-delimited text. I’ve written about this elsewhere, but to sum up, Japanese words are not separated by spaces, so tools intended for Western languages think each line is one big word. There are currently no easy ways I can find to do this splitting; I’m currently working on an application that both strips ruby from Aozora bunko texts AND splits words with a space, but it’s coming slowly. How to get this with Meiroku zasshi in a quick and dirty way that lets us just play with the data?

So today after work, I’m going to use Python’s ElementTree (etree) library for XML to take the contents of the word tags from the corpus and just spit them into a text file delimited by spaces. Quick and dirty! I’ve been meaning to do this for weeks, but since it’s a “day of DH,” I thought I’d use the opportunity to motivate myself. Then, we can play.
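A minimal sketch of that plan, using the standard library's `xml.etree.ElementTree`, might look like the following. The element name `w` is purely a placeholder assumption, since I have not described the corpus's actual tag names here, and `words_from_xml` is a hypothetical helper. If the files use XML namespaces, the tag name would need to include them.

```python
import xml.etree.ElementTree as ET


def words_from_xml(xml_text: str, word_tag: str = "w") -> str:
    """Collect the text of every `word_tag` element, joined by single spaces."""
    root = ET.fromstring(xml_text)
    # itertext() also picks up text nested inside child elements of a word.
    words = ("".join(el.itertext()).strip() for el in root.iter(word_tag))
    return " ".join(w for w in words if w)  # skip empty word elements
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`, and the returned string can be written straight to a UTF-8 text file.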

Exciting stuff, this corpus. Unfortunately most of NINJAL’s other amazing corpora are available only on CD-ROMs that work on old versions of Windows. Sigh. But I’ll work with what I’ve got.

So that’s your update from the world of Japanese text analysis.

#dayofDH Japanese apps workshop for new Penn students

Today, we’re having a day in the library for prospective and new Penn students who will (hopefully) join our community in the fall. As part of the library presentations, I’ve been asked to talk about Japanese mobile apps, especially for language learning.

While I don’t consider this a necessarily DH thing, some people do, and it’s a way that I integrate technology into my job – through workshops and research guides on various digital resources. (More on that later.)

I did this workshop for librarians at the National Coordinating Council on Japanese Library Resources (NCC)’s workshop before the Council on East Asian Libraries conference a few weeks ago in March 2014. My focus was perhaps too basic for a savvy crowd that uses foreign languages frequently in their work: I covered the procedure for setting up international keyboards on Android and iOS devices, dictionaries, news apps, language learning assistance, and Aozora bunko readers. However, I did manage to impart some lesser known information: how to set up Japanese and other language dictionaries that are built into iOS devices for free. I got some thanks on that one. Also noted was the Aozora 2 Kindle PDF-maker.

Today, I’ll focus more on language learning and the basics of setting up international keyboards. I’ve been surprised at the number of people who don’t know how to do this, but not everyone uses foreign languages on their devices regularly, and on top of that, not everyone loves to poke around deep in the settings of their computer or device. And keyboard switching on Android can be especially tricky, with apps like Simeji. So perhaps covering the basics is a good idea after all.

I don’t have a huge amount of contact with undergrads compared to the reference librarians here, and my workshops tend to be focused on graduate students and faculty with Japanese language skills. So I look forward to working with a new community of pre-undergrads and seeing what their needs and desires are from the library.

ruins – the past, the real, the monumental, the personal

Did I ever tell you about one of my favorite buildings in the world? It’s a public housing project named Kaigan-dori Danchi 海岸通り団地 (not to be confused with the type of projects one finds in the US, it was perfectly desirable housing in its time). This particular danchi (“community housing” or – generally public – housing project) was located smack in the middle of the richest section of Yokohama, between Kannai and Minato Mirai, perhaps one of the richest areas of the Tokyo region. Here it is in all its dirty, dirty glory, with Landmark Tower in the background.

Yes. This is Kaigan-dori Danchi, one of the grossest “ruins” (haikyo 廃墟) I had ever seen. Or, I thought it was a ruin. You know, an abandoned building. Because it looked too much like a shell to be anything else.

Then I got a message on Flickr.

In it, the sender wrote that he grew up in Kaigan-dori Danchi and now lives in New York City. He advised me that yes, it’s still inhabited, and thanked me for putting so many photos of it on Flickr. (Yes, I went for a photo shoot of this complex, more than once – hey, it was on my walk home from school!) He felt nostalgic at seeing his boyhood home and was interested to see what it looked like now.

In other words, what I’d felt vaguely strange about as some kind of ruins voyeurism – the same kind of ruins porn that takes hold of nearly everyone who wants to take photos of Detroit, for example – turned out to be a two-way street. It wasn’t pure voyeurism; it was a way to connect with someone who had a direct experience of the past of this place, a place that was still alive and had a memory and a history, rather than being some monstrosity out of time – as I’d been thinking of it. I saw it as a monument, not an artifact.

So this was in 2008, a half year after I’d become obsessed with Japanese urban exploration photography, which was enjoying a boom in the form of guidebooks, a glossy monthly magazine, calendars, DVDs, tours, photo books, and more, in Japan at the time. (Shortly thereafter, and I CALLED IT, came the public housing complex boom. I do have some of the photo books related to this boom too, because there’s nothing I love more than a good danchi.)

As part of the research for a presentation I gave on the topic for my Japanese class at IUC that year, I’d done some research into websites about ruins in Japan (all in Japanese of course). These were fascinating: some of them were just about the photography, but others were about reconnecting with the past, posting pictures of old schools and letting former classmates write on the guestbooks of the sites. There was a mixi (like myspace) group for the Shime Coal Mine (the only landmark of the first town I’d lived in in Japan). The photo books, on the other hand, profoundly decontextualized their objects and presented them as aesthetic monuments, much the way I’d first viewed Kaigan-dori Danchi.

So I wonder, with ruins porn a genre in the United States and Europe as well, do we have the same yearning for a concrete, real past that some of these sites and photographers exhibit, and not just vague nostalgia for the ruins of something that never existed? How much of ruins photography and guidebooks is about the site in context – the end point of a history – and how much is just about “hey I found this thing”? How much of this past is invented, never existed, purely fantasy, and how much of it is real, at least in the minds of those who remember it?

These are questions I can’t yet answer, but I’ve just begun on this project. In the meantime, I’m happy to share Kaigan-dori Danchi with you.

politics and anthologizing

In this past year, I’ve spent a lot of time thinking about how the form of the anthologies I study (literary individual author anthologies in Japan at the turn of the 20th century) impacts possibilities of reading and interpretation. I’ve also commented at a couple of conferences that the narratives of who these authors “belong” to have been shaped and guided in these anthologies, and have written that taking works out of their original contexts fundamentally erases a part of their meaning (in terms of the ways readers encounter them) and simultaneously alters the work in terms of its received meaning.

After doing some reading this morning, I realized that one thing links these various threads in anthologies, and it’s a word I wasn’t using: politics.

I want to talk specifically about the example of Higuchi Ichiyō. For much of her career, she wrote for the magazine Bungakukai (among others) which was a driver of the first Romantic movement in Japan. In her anthologies, of course her serial works from that magazine are included as whole pieces, as though they were wholes from the outset, which has its own implications for reading. But the other piece of this is that just as the editors were writing the Bungakukai coterie social and ideological connections out of her career in their prefaces, they simultaneously erased this connection – this fundamental supplier of meaning – from her works by taking them out of their original Romantic context.

The first readers of Ichiyō’s works would have seen them embedded in theory and poetry heavily influenced by western Romanticism, including translations of English works and illustrations of faded ruins and statuary. The readers of her individual anthology, as well as reprints in wider circulation magazines such as Bungei kurabu before her death, would have encountered a very different context: in the magazines, other “modern” mainstream Japanese literature (presented as unaffiliated with any coterie or group other than the influential publishers of the magazines), and in the anthology, Ichiyō’s own works as a cohesive and self-contained whole. No longer would her work be infused, by virtue of proximity, with the politics of literature at the time she wrote in the early-to-mid 1890s. She becomes depoliticized, ironically despite the heavily social and what I would call political themes of her work: that is, the plight of the lower class and the inequity of Japanese society at the turn of the 20th century.

Especially in her second anthology, published in 1912, Ichiyō becomes a timeless woman writer, an elegant author of prose and poetry whose works are infused with tragedy – just as her poverty-stricken life was, to paraphrase the editors of the two volumes. Yet it is not a structural tragedy that pervades society, as it is in her work, but a personal, elegant, and heart-wrenching individual tragedy, one that makes her work even more poignant without necessarily having political implications. I can’t speak to the Romantic movement’s attitude toward this kind of theme found in Bungakukai, not being as familiar with its politics as I should be, but I can say that Kitamura Tōkoku – the founder of Bungakukai – basically started his career with the publication of Soshū no shi, a piece of “new-form” poetry about a prisoner, written at the height of his political involvement in the late 1880s.

So there is an association, simply by virtue of publishing in the same venues, between Ichiyō’s politics and those of Tōkoku, and the literary politics of the Romantic movement vis-à-vis the multitude of other ideologies of writing that existed at the time. Yet in her anthologies, this politics disappears and her context is lost entirely, in favor of a new context of Ichiyō alone, her works as something that stand alone without interference from the outside world. It is a profound depoliticization and something to think about in considering other anthologies as well: early ones in Japan, current ones, and those found elsewhere in the world.