Category Archives: libraries

Using Collections – Virtually

I heard a remark the other day that struck a chord – or, churned my butter a bit.

The gist of it was, “we should make digital facsimiles of our library materials (especially rare materials) and put them online, so that people become aware of them through the digitized versions and are then drawn to visit the library to see them in person, sparking library use.”

Now, at Penn, we have digitized a couple of Japanese collections: the Japanese Juvenile Fiction Collection (aka the Tatsukawa Bunko series), the Japanese Naval Collection (in process, focused on Renshū Kantai training fleet materials), and a miscellaneous collection of Japanese rare books in general.* These materials have been used both in person (thanks to publicizing them, pre- and post-digitization, on library news sites, blogs, and social media, as well as by word of mouth) and digitally by researchers who cannot travel to Penn. In fact, a graduate student in Australia used our juvenile fiction collection for part of his dissertation; another student in Wisconsin plans to use facsimiles of our naval materials once they’re complete; and faculty at the University of Montana have used our digital facsimile of the Meiji-period journal Hōbunkai-sui (or Hōbunkai-shi).

These researchers, due to distance and budget, will likely never be able to visit Penn in person to use the collections. On top of that, some items – like the juvenile fiction and lengthy government documents related to the Imperial Navy – don’t lend themselves to use in a reading room. These aren’t artifacts to look over one page at a time, but research materials that will be read extensively (rather than “intensively,” a distinction we book history folks make). Thus, digital facsimiles are the only use these researchers can make of our materials.

The digitized Japanese collections at Penn have invited use – a kind of library visit – by virtue of being available to researchers worldwide, not just to those who are at Penn (who could easily view them in person and don’t “need” a digital facsimile), or who can visit the library to “smell” the books (as the person I paraphrased put it). I think it’s more important to be able to read, research, and use these documents than to smell or witness the material artifact. Of course, there are cases in which one would want to do that, but by and large, our researchers care more about the content and visual aspects of the materials – things that can be captured and conveyed in digital images – than about touching or handling them.

Isn’t this use, just as much as visiting the library in person is? Shouldn’t we be tracking visits to our digital collections, downloads, and qualitative stories about their use in research, just as we do a gate count and track circulation? I think so. As we think about the present and future of libraries, and people make comments about their not being needed because libraries are on our smartphones (like libraries of fake news, right?), we must make the argument for providing content both physically and virtually. Who do people think is providing the content for their digital libraries? Physical libraries, of course! Those collections exist in the real world and come from somewhere, with significant investments of money, time, and labor involved – and moreover, it is the skilled and knowledgeable labor of professionals that is required.

On top of all of this, I feel it is most important to own up to what we can and cannot “control” online: our collections, by virtue of being able to be released at all, are largely in the public domain. Let’s not put CC licenses on them other than CC0 (which explicitly marks materials as public domain), pretending we can control the images when we have no legal right to (even if users largely don’t know that). Let’s allow for free remixing and use without citing the digital library/archive they came from, without getting upset about posts on Tumblr. When you release public domain materials on the web (or through other services online), you are giving up your exclusive right to control the circumstances under which people use them – and as a cultural heritage institution, it is your role to perform this service for the world.

But not only should we provide this service, we should take credit for it: take credit for use, visits, and for people getting to do whatever they want with our collections. That is really meaningful and impactful use.

* Many thanks to Michael Williams for his great blog posts about our collections!

WORD LAB: a room with a whiteboard

Several years ago, I attended Digital Humanities 2011 at Stanford and had the opportunity to meet with Franco Moretti. When Franco asked what I was interested in, I admitted that I badly wanted to see the Literary Lab I’d heard so much about, and seen so much interesting research come out of. He laughed and said he’d show it to me, but that I shouldn’t get too excited.

Why? Because Literary Lab is a windowless conference room in the middle of the English department at Stanford. Literary Lab is a room with a whiteboard.

I couldn’t have been more excited, to Franco’s amusement.

A room with a whiteboard. A room dedicated to talking about projects, to collaborating, to bringing a laptop and getting research done, and to sharing and brainstorming via drawing and notes up on a wall, not on a piece of paper or a shared document. It was an important moment for me.

When I was in graduate school, I’d tossed around a number of projects with colleagues, and gotten excited about a lot of them. But they always petered out, lost momentum, and disappeared. This is surely due to busy schedules and competing projects – not least the dissertation – but I think it’s also partly due to logistics.

Much as our work has gone online, and despite these being digital projects – just like Literary Lab’s research – a physical space is still hugely important. A space to talk, a space to brainstorm and draw and write, a space to work together: a space to keep things going.

I had been turning this over in my head ever since I met with Franco, but never had the opportunity to put my idea into action. Then I came to Penn, and met a like-minded colleague who got just as excited about the idea of dedicated space and collective work on projects as I was.

Our boss thought the idea of a room with a whiteboard was funny, just as Franco had thought my low standards were kind of silly. But you know what? You don’t need a budget to create ideas and momentum. You don’t need a budget to stimulate discussion and cross-disciplinary cooperation. You just need space and time, and willing participants who can make use of it. We made a proposal, got the go-ahead, and took advantage of a new room in our Kislak Center at Penn that was free for an hour and a half a week. It was enough: the Vitale II lab is a room with a whiteboard. It even has giant TVs to hook up a laptop.

Thus, WORD LAB was born: a text-analysis interest group that just needed space to meet, and people to populate it. We recruited hard, mailing every department and discipline list we could think of, and got a mind-boggling 15+ people at the first meeting, plus the organizers and some interested library staff, from across the university. The room was full.

That was the beginning of September 2014. WORD LAB is still going strong, with more formal presentations every other week, interspersed with journal club/coding tutorials/etc. in OPEN LAB on the other weeks. We get a regular attendance of at least 7-10 people a week, and the faces keep changing. It’s a group of Asianists, an Islamic law scholar, Annenberg School of Communication researchers, political scientists, psychologists, and librarians, some belonging to more than one group. We’ve had presentations from Penn staff, other regional university researchers, and upcoming Skype presentations from Chicago and Northeastern.

A room with a whiteboard has turned into a budding cross-disciplinary, cross-professional text analysis interest community at Penn.

Keep up on WORD LAB:
@upennwordlab on Twitter
WORD LAB on Facebook

#dayofDH Japanese digital resource research guides

Another “digital” thing I’ve been doing that relates to the “humanities” (but is it even remotely DH? I don’t know) is the creation of research guides for digital resources in Japanese studies of all kinds, with a focus on free Japanese-language websites and databases, and open-access publications.

So far, I’ve been working hard on creating guides for electronic Japanese studies resources, and mobile apps easily accessible in the US for both Android and iOS that relate to Japanese research or language study. The digital resources guide covers everything from general digital archives and citation indexes to literature, art, history, pop culture, and kuzushiji resources (for reading handwritten pre- and early modern documents). They range from text and image databases to dictionaries and even YouTube videos and online courseware for learning classical Japanese and how to read manuscripts.

This has been a real challenge, as you can imagine. Creating lists of stuff is one thing (and is one thing I’ve done for Japanese text analysis resources), but actually curating them and creating the equivalent of annotated bibliographies is quite another. It’s been a huge amount of research and writing – both in discovery of sources, and also in investigating and evaluating them, then describing them in plain terms to my community. I spent hours on end surfing the App and Play Stores and downloading/trying countless awful free apps – so you don’t have to!

It’s especially hard to find digital resources in ways other than word of mouth. I find that I end up linking to other librarians’ LibGuides (i.e. research guides) often because they’ve done such a fantastic job curating their own lists already. I wonder sometimes if we’re all just duplicating each other’s efforts! The NCC has a database of research guides, yes, but would it be better if we all collaboratively edited just one? Would it get overwhelming? Would there be serious disagreements about how to organize, whether to include paid resources (and which ones), and where to file things?

The answer to all these questions is probably yes, which creates problems. Logistically, we can’t have every Japanese librarian in the English-speaking world editing the same guide anyway. So it’s hard to say what the solution is – keep working in our silos? Specialize and tell our students and faculty to Google “LibGuide Japanese” + topic? (Which is what I’ve done in the past with art and art history.) Search the master NCC database? Some combination is probably the right path.

Until then, I will keep working on accumulating as many kuzushiji resources as I can for Penn’s reading group, and updating my mobile app guide if I ever find a decent まとめ (roundup)!

#dayofDH Japanese apps workshop for new Penn students

Today, we’re having a day in the library for prospective and new Penn students who will (hopefully) join our community in the fall. As part of the library presentations, I’ve been asked to talk about Japanese mobile apps, especially for language learning.

While I don’t necessarily consider this a DH thing, some people do, and it’s a way that I integrate technology into my job – through workshops and research guides on various digital resources. (More on that later.)

I did this workshop for librarians at the North American Coordinating Council on Japanese Library Resources (NCC)’s workshop before the Council on East Asian Libraries conference a few weeks ago in March 2014. My focus was perhaps too basic for a savvy crowd that uses foreign languages frequently in their work: I covered the procedure for setting up international keyboards on Android and iOS devices, dictionaries, news apps, language learning assistance, and Aozora bunko readers. However, I did manage to impart some lesser-known information: how to set up the Japanese and other language dictionaries that are built into iOS devices for free. I got some thanks on that one. Also noted was the Aozora 2 Kindle PDF-maker.

Today, I’ll focus more on language learning and the basics of setting up international keyboards. I’ve been surprised at the number of people who don’t know how to do this, but not everyone uses foreign languages on their devices regularly, and on top of that, not everyone loves to poke around deep in the settings of their computer or device. And keyboard switching on Android can be especially tricky, with apps like Simeji. So perhaps covering the basics is a good idea after all.

I don’t have a huge amount of contact with undergrads compared to the reference librarians here, and my workshops tend to be focused on graduate students and faculty with Japanese language skills. So I look forward to working with a new community of pre-undergrads and seeing what their needs and desires are from the library.

#DayofDH Good morning and self introduction

Cross-posted from Day of DH Wasting Gold Paper

I’m up early on this Day of DH 2014. So much to do!

I thought I’d introduce myself to you all, so you have an idea of my background. I’m not your typical DH practitioner – I’m not in the academy (in a traditional way) and I’m also not working with Western-language materials. My concerns don’t always apply to English-language text or European medieval manuscripts. So, if you looked in Asia I’d be less remarkable, but here in the English-language DH world I don’t run across many people like myself.

Anyway, good morning; I’m Molly, the Japanese Studies Librarian at the University of Pennsylvania, also managing the Korean collection. That means that I take care of everything – from collection development to reference and instruction – that has to do with Japan/Korea, or is in Japanese/Korean, at the library and beyond.


Let’s start off with my background. I went to college at the University of Pittsburgh for Computer Science and History (Asian history, of course) and studied Japanese there for 4 years. I fully intended at the outset to become a software developer, but somewhere along the line, I decided to apply my skills somewhere outside that traditional path: librarianship. And so off I went (with a two-year hiatus in between) to graduate school for a PhD in Asian studies (Japanese literature and book history) and an MSI in Library Science at the University of Michigan. Along the way, I interned at the University of Nebraska-Lincoln’s Center for Digital Research in the Humanities (CDRH), redesigning the website for, and rewriting part of the XSLT code of, a text analysis app for the Cather Archive.

After Michigan, I spent a year as a postdoc at Harvard’s Reischauer Institute, working half-time on my humanities research and half-time on a digital archive (the Digital Archive of Japan’s 2011 Disasters, or JDArchive). Then, in July 2013, I made my first big step into librarianship here at Penn, and have been happily practicing in my chosen profession since then. I’m still new, and there is a lot to learn, but I’m loving every minute.

I admit, finding ways to integrate my CS and humanities background has been a huge challenge. I was most of the way through graduate school when someone recommended going into DH (which didn’t exactly happen – there aren’t a lot of non-postdoc or non-teaching jobs out there now). My dissertation project, a very close-reading-based analysis of five case studies of single books as objects and in terms of their publishing and reception, did not lend itself at all to a digital methodology, other than using digital archives to get ahold of their prefaces and keyword-searchable newspaper databases to find their advertisements and reviews. I used a citation index that goes back to the Meiji period (1868-1912) to find sources. Indeed, most of my research involved browsing physical issues of early 20th-century magazines in the basement of a library in Japan, and looking at the books themselves in addition to the discourse surrounding them. I simply couldn’t think of anything to do that would be “digital.”

So my research in that area – plus what I’m working on now – has continued to be non-DH, although if you’re the kind of person who includes anything “new media” in the definition of DH, it may be a little. (I am not that person.) Why do I still call myself a DH practitioner, and why do I bother participating in the community even now?

Well, despite working full time, I’m still committed to figuring out how to apply my skills to new, more DH-style projects, even as I don’t want my other traditional humanities research to die out either. It’s a balancing act. How to find the time and energy to learn new skills and just plain old carve out space to practice ones I already have?

I have a couple of opportunities. One is my copious non-work free time. (Ha. Ha.) The other is my involvement in the open and focused lab sessions of Vitale II, the digital lab (okay, it’s a room with a whiteboard and a camera) at the Kislak Center for special collections in Van Pelt Library. I have a top-secret brainstorming session with a buddy today about how we can make even more social, mental, and temporal space for DH work in the library on a topically focused basis. I’m jealous of the Literary Lab; that should speak for itself. In any case, I also ran into a fellow Japanese studies DH aspirant at the Association for Asian Studies conference a few weeks ago, and he and I are plotting with each other as well.

So there are time and social connections to be made, and collaboration that can take place despite all odds. But it’s still a huge challenge. I can do my DH work at 5:30 am, in the evening (when I have no brainpower left), or early on the weekends. I have many other things competing for my time, not least two other research articles I’m working on. I could also be doing my real work at any of those times without the need to explain.

Yet I do it. It’s because I love making things, because I love bringing my interests together and working on something that involves a different part of my brain from reading and writing. I’m excited about the strange and wonderful things that can come from experimental analysis that, even if they aren’t usable, can make me think more broadly and weirdly.

More to follow. よろしくお願いします! (I look forward to working with you!)

Japanese tokenization – tools and trials

I’ve been looking (okay, not looking, wishing) for a Japanese tokenizer for a while now, and today I decided to sit down and do some research into what’s out there. It didn’t take long – things have improved recently.

I found two tools quickly: kuromoji Japanese morphological analyzer and the U-Tokenizer CJK Tokenizer API.

First off – so what is tokenization? Basically, it’s separating sentences by words, or documents by sentences, or any text by some unit, in order to chunk that text into parts and analyze them (or do other things with them). When you tokenize a document by word, like a web page, you enable searching: this is how Google finds individual words in documents. You can also find keywords in a document this way, by writing an algorithm to choose the most meaningful nouns, for example. It’s also the first step in more involved linguistic analysis like part-of-speech tagging (that is, marking individual words as nouns, verbs, and so on) and lemmatizing (paring words down to their stems, such as removing plural markers and un-conjugating verbs).

This gives you a taste of why tokenization is so fundamental and important for text analysis. It’s what lets you break up an otherwise unintelligible (to the computer) string of characters into units that the computer can attempt to analyze. It can index them, search them, categorize them, group them, visualize them, and so on. Without this, you’re stuck with “words” that are entire sentences or documents, that the computer thinks are individual units based on the fact that they’re one long string of characters.

Usually, the way you tokenize is to break up “words” based on spaces (or sentences based on punctuation rules, etc., although that doesn’t always work). (I put “words” in quotes because you can really make any kind of unit you want; the computer doesn’t understand what words are, and in the end it doesn’t matter. I’m using “words” as an example here.) However, for languages like Japanese and Chinese (and to a lesser extent Korean) that don’t use spaces to delimit all words (for example, in Korean, particles are attached to nouns with no space in between, like saying “athome” instead of “at home”), you run into problems quickly. How do you break up texts into words when there’s no easy way to distinguish between them?
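
To make the problem concrete, here is a minimal Python sketch (my own illustration, not taken from either of the tools below): splitting on whitespace works passably as a first pass for English, but hands back an entire Japanese sentence as one giant “word.”

    # Minimal illustration of why whitespace tokenization fails for Japanese.
    english = "I read the book at home"
    japanese = "私は家でその本を読んだ"  # roughly, "I read the book at home"

    print(english.split())   # ['I', 'read', 'the', 'book', 'at', 'home']
    print(japanese.split())  # ['私は家でその本を読んだ'] -- a single "word"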

The question of tokenizing Japanese may be a linguistic debate. I don’t know enough about linguistics to begin to participate in it, if it is. But I’ll quickly say that you can break up Japanese based on linguistic rules and dictionary rules – understanding which character compounds are nouns, which verb conjugations go with which verb stems (as opposed to being particles in between words), then breaking up common particles into their own units. This appears to be how these tools are doing it. For my own purposes, I’m not as interested in linguistic patterns as I am in noun and verb usage (the meaning rather than the kind) so linguistic nitpicking won’t be my area anyway.

Moving on to the tools. I put them through the wringer: the first two lines of Higuchi Ichiyō’s Ame no yoru, from Aozora bunko.

One, kuromoji, is the tokenizer behind Solr and Lucene. It does a fairly good job, although with Ichiyō’s uncommon word usage and conjugation, it faltered and couldn’t figure out that 高やか is one word; rather it divided it into 高 や か.  It gives the base form, reading, and pronunciation, but nothing else. However, in the version that ships with Solr/Lucene, it lemmatizes. Would that ever make me happy. (That’s, again, reducing a word to its base form, making it easy to count all instances of both “people” and “person” for example, if you’re just after meaning.) I would kill for this feature to be integrated with the below tool.
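
(For the curious: kuromoji itself is a Java library, so the sketch below uses the janome tokenizer – a Python stand-in that draws on the same IPADIC dictionary family, not kuromoji itself – to show the kind of per-token output I mean: surface form, part of speech, base (dictionary) form, and reading. It assumes janome is installed via pip install janome.)

    # Sketch only: janome as a Python stand-in for kuromoji (a Java library).
    # Assumes: pip install janome
    from janome.tokenizer import Tokenizer

    t = Tokenizer()
    for token in t.tokenize("本を読んだ"):
        # surface form, part of speech, base (dictionary) form, reading
        print(token.surface, token.part_of_speech, token.base_form, token.reading)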

The other, U-Tokenizer, did significantly better, but its major drawback is that it’s done in the form of an HTTP request, meaning that you can’t put in entire documents (well, maybe you could? how much can you pass in an HTTP request?). If it were downloadable code with an API, I would be very happy (kuromoji is downloadable and has a command line interface). U-Tokenizer figured out that 高やか is one word, and also provides a list of “keywords,” which as far as I can tell is a bunch of salient nouns. I used it for a very short piece of text, so I can’t comment on how many keywords it would come up with for an entire document. The documentation on this is sparse, and it’s not open source, so it’s impossible to know what it’s doing. Still, it’s a fantastic tool, and also seems to work decently for Chinese and Korean.
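
(To show what I mean by “done in the form of an HTTP request,” here is a generic sketch. The endpoint URL and parameter names are placeholders I’m making up for illustration – they are not U-Tokenizer’s actual interface – and the practical limit on how much text you can send is whatever the server accepts in a single request.)

    # Generic sketch of a tokenizer exposed over HTTP.
    # NOTE: the endpoint and parameters below are hypothetical placeholders,
    # not U-Tokenizer's real API.
    import urllib.parse
    import urllib.request

    params = urllib.parse.urlencode({"text": "桜の咲く季節になった", "lang": "ja"})
    url = "https://example.com/tokenize?" + params  # placeholder endpoint

    with urllib.request.urlopen(url) as resp:
        print(resp.read().decode("utf-8"))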

Each of these tools has its strengths, and both are quite usable for modern and contemporary Japanese. (I really was cruel to feed them Ichiyō.) However, there is a major trial involved in using them with freely available corpora like Aozora bunko. Guess what? Preprocessing ruby – the small furigana reading glosses printed alongside characters.

Aozora texts contain ruby marked up within the documents. I have my issues with stripping out ruby from documents that use it heavily (those by Meiji writers, for example), because it adds so much meaning to the text, but let’s say for argument’s sake that we’re not interested in the ruby. Now, it’s time to cut it all out. If I were a regular expressions wizard (or even had basic competency with them) I could probably strip this out easily, but it’s still time consuming. Download text, strip out ruby and other metadata, save as plain text. (Aozora texts are XHTML, NOT “plain text” as they’re often touted to be.) Repeat. For topic modeling using a tool like MALLET, you’re going to want hundreds of documents at the end of it. For example, you might be downloading all Meiji novels from Aozora and dividing them into chunks or chapters. Even the complete works of Natsume Sōseki aren’t enough without cutting them down into chapters or even paragraphs to make enough documents to use a topic modeling tool effectively. Possibly, run all of these through a part-of-speech tagger like KH Coder. This is going to take a significant amount of time.

Then again, preprocessing is an essential and extremely time-consuming part of almost any text analysis project. I went through a moderate amount of work just removing Project Gutenberg metadata and dividing into chapters a set of travel narratives that I downloaded in plain text, thankfully not in HTML or XML. It made for easy processing. With something that’s not already real plain text, with a lot of metadata, and with a lot of ruby, it’s going to take much more time and effort, which is more typical of a project like this. The digital humanities are a lot of manual labor, despite the glamorous image and the idea that computers can do a lot of manual labor for us. They are a little finicky with what they’ll accept. (Granted, I’ll be using a computer script to strip out the XHTML and ruby tags, but it’s going to take work for me to write it in the first place.)
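
(For the record, a very rough sketch of the kind of script I have in mind is below. Assumptions: it uses simple regular expressions rather than a real HTML parser, it targets the <ruby>/<rt>/<rp> markup in Aozora’s Shift_JIS-encoded XHTML files, and the filename and chunk size are placeholders, not a finished workflow.)

    # Rough sketch: strip ruby and other markup from an Aozora bunko XHTML file,
    # then split the text into fixed-size chunks for a tool like MALLET.
    # The filename and chunk size are placeholders for illustration.
    import re

    with open("aozora_text.html", encoding="shift_jis") as f:
        html = f.read()

    # Drop ruby readings (<rt>...</rt>) and ruby parentheses (<rp>...</rp>),
    # keeping the base text inside <ruby>...</ruby>.
    text = re.sub(r"<rt>.*?</rt>|<rp>.*?</rp>", "", html)
    # Strip all remaining tags (crude, but enough for a first pass).
    text = re.sub(r"<[^>]+>", "", text)

    # Split into ~1000-character chunks so the topic model sees many "documents."
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    for n, chunk in enumerate(chunks):
        with open(f"chunk_{n:04d}.txt", "w", encoding="utf-8") as out:
            out.write(chunk)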

In conclusion? Text analysis, despite exciting available tools, is still hard and time consuming. There is a lot of potential here, but I also see myself going through some trials to get to the fun part, the experimentation. Still, stay tuned, especially for some follow-up posts on these tools and KH Coder as I become more familiar with them. And, I promise to stop being difficult and giving them Ichiyō’s Meiji-style bungo.

New issue of D-Lib magazine

D-Lib magazine has just published their most recent issue, available at http://www.dlib.org

This looks to be a great issue, with a number of fascinating articles on dissertations and theses in institutional repositories, using Wikipedia to increase awareness of digital collections, MOOCs, and automatic ordering of items based on reading lists.

Please check it out! All articles are available in full-text on the site.

NDL makes public the Historical Recordings Collection digital archive

On March 15, 2013, the National Diet Library made public its new digital archive of historical recordings. In partnership with a number of groups, including NHK, it has digitized and made available recordings from SP (78 rpm) records dating from 1900 to the 1950s, in order to preserve them and prevent them from being lost.

As time goes on, they plan to hold approximately 50,000 recordings in the archive. Although many recordings can be accessed via the Internet, some are only available for listening at the NDL itself due to copyright restrictions.

You can also access an NDL article on the digitization of recordings, entitled 音の歴史を残す (Preserving the History of Sound; PDF link).

The archive is the Historical Recordings Collection, accessible at http://rekion.dl.ndl.go.jp/

Free Information Literacy Book

This is belated news, but the School of Information class SI641 (University of Michigan) has published a book, Everything You Always Wanted to Know About Information Literacy But Were Afraid to Google, ed. Kristin Fontichiaro, online. The book can be found at Smashwords (https://www.smashwords.com/books/view/266557).

It covers settings ranging from K-12 to higher education and specialized contexts (including archives and special academic libraries), along with thoughts on creating content and methodologies.

Briefly, from the book regarding SI641’s content and objectives (this is a core course in the LIS curriculum, and I took it too!):

This course introduces theories and best practices for integrating library-user instruction with faculty partnerships. Instructional roles are presented within the wider context of meeting institutional learning goals. Students acquire explicit knowledge, skills, and competencies needed to design, develop, integrate, and assess curriculum and instruction in a variety of information settings, including educational and public organizations. The integral relationship between technology and information literacy is examined. Students are given opportunities to partner with professional mentors in schools, academic libraries, museums, and in other educational institutions.

Please check out the book!