what can you do with a million (non-digitized) books?

As I grow into a scholar with a foot in literature and a foot in information science, I have a stake in asking and answering that newly liberated question: What can you do with a million books? What do you do with a million books?

It’s a question that has been asked a lot in the past few years, and what’s more, so many answers are beginning to be offered in concrete terms rather than as speculation. It’s an exciting and promising time for literature and for other humanistic fields. Digital humanities are here, and we finally have both the ready means and ready material to start interrogating texts in ways that were logistically not possible before now.

It’s a question that I’d like to offer my own answers to, in the form of experiments and projects, as so many others are now doing. But there is always another question nagging at me when I look, with real enthusiasm, at the kind of work that is being done to take humanistic inquiry to an unprecedented scale.

At first I asked the question that made me feel like an outsider, despite sharing the same desires and the same curiosity as those whose web sites I visit, and whose articles I read. I asked, why is this happening in the same departments, in the same fields? Why does it seem that this is limited not to a discipline, but to a time, to a place?

To be blunt, the vast majority of projects are dealing with texts in English or French, or more broadly in European languages, with the classics, and with texts from the early modern period through the early 20th century. Why did I read an article today whose very title asked “what is the place of digital humanities in English departments?”

Yes, that’s it. I love to look through the NINES-reviewed sites at what kinds of projects are being taken up. When I first discovered NINES, which is run by some of my intellectual heroes, I had a sense of excitement: I, too, study the “long 19th century,” and I, too, am trying to use what we can do with computing to discover new information and patterns in what I study. I have three projects sitting on a back burner, of visualizations and databases of material that would present what I study in a novel way and more importantly, would provide a tool for others to use to further their own research.

But I quickly learned – and this is by no means a slight to NINES – that only projects covering the British and American long 19th century can come under the NINES umbrella.

The digital humanities community, as much as you can call it that (but it is more of a cross-institutional, interdisciplinary community than average), is largely based in Roman languages too. I see a lot of English, American Studies, Classics, New Media. It took me quite a while to put the pieces together: why is it that, coming from a dual computer science and humanities background, pursuing the field of book history (a community unto itself), and committed to exploring new ways of looking at texts – and helping others do the same – I still feel like I am on the outside looking in?

A part of the answer is this: the surge in digitized (and accessible – think, pre-1923, after all) books is largely in the languages of these fields. Google may have digitized books of all languages – and trust me, having been at an institution that supplied many of them while the digitization was going on, I have seen the holes in the stacks in every part of our libraries – but of course, for any language, the access is going to be quite limited for books that are not in the public domain. On top of that, however, there is a technological issue as well. Pre-WWII Japanese texts are enough to stump any OCR that I’ve seen. We can scan them, but they will not have changed in any qualitative way: they are still images on a screen, images on the page of a book.

In order for books to be process-able (for lack of a better word), the computer needs the text in a format it can read. Plain text is lauded by Project Gutenberg as being both computer-readable and human-readable, and to some extent I agree (on the lauding, that is). The computer can process that text, identify patterns and re-present it in a non-linear way, or as a visualization, or any other kind of application you can imagine. The end products, whether plain text or statistics or an interactive map, are human-readable as well, which is essential for the work of the humanities: interpretation and inference.
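To make "process-able" concrete, here is a minimal sketch in Python of the simplest kind of pattern-finding that plain text allows. It counts overlapping character sequences rather than words, on the assumption that unsegmented Japanese text has no spaces to split on. The function name `char_ngrams` and the sample sentence (the well-known opening of Natsume Sōseki's "I Am a Cat") are my own illustrative choices, not part of any particular project or tool.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in text, ignoring whitespace.

    Hypothetical helper for illustration: character n-grams stand in for
    words because the text is assumed to have no word boundaries.
    """
    chars = "".join(text.split())  # drop spaces and line breaks
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


# Sample input: once the text exists as characters rather than as a page
# image, even a few lines of code can start looking for patterns in it.
sample = "吾輩は猫である。名前はまだ無い。"

single = char_ngrams(sample, n=1)   # frequency of each character
pairs = char_ngrams(sample, n=2)    # frequency of each adjacent pair

print(single.most_common(3))
print(len(pairs), "distinct character pairs")
```

None of this works on a scanned page image: the counting step needs the characters themselves, which is exactly what OCR has to supply first.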

For Japanese texts, by and large, this isn’t possible.

So when I try to think of what I could do to contribute to this growing field, to add value to my work and create something that is usable and re-usable, larger than the sum of its parts, this is the wall that I run up against very quickly. I have many ideas about interesting things that could be done with texts from the era that I study, and of very productive, small, immediate tasks that could be applied to texts as a stage of my dissertation.

If only they were available! They’re not.

HathiTrust has scanned quite a few Japanese-language books, and I tested their searchability. Not bad. But for almost all of the in-copyright works, that is limited to “search only” without even a snippet. Your results are limited to a list of information in the format of “1 result on page 183.”

There are some large digitized products out there, at institutional prices, such as Yomidasu, which covers all of the Yomiuri Shinbun newspaper from its inception. But although it is searchable, the end result is still an image on a screen. Scans of newspaper pages from the 1880s are a start, but texts in other languages are way ahead of this. Japanese-language products are stuck in the 1990s.

So what do I do with a million books out there that I’d love to work with, but can’t?

I can spend a lot of time wishing I had the money to hire a few students to encode thousands of texts with TEI and make a database of them and make them searchable, modifiable, available for display in non-linear and quantitative ways. But wishing does not produce a large corpus of texts available for study, and that is where we are going to keep stopping until an entity large enough, with broad enough vision and scope, and deep enough pockets, starts the kind of work that not only Google has done, but also countless digitization projects around the world. (In fact, I am working at a center that has digitized a large number of texts of all kinds, and unlike Google, has done so in a way that will provide a lot of useful metadata and structure to the finished product.)

I write this out of frustration, but not only at a lack of material thwarting my project ideas. I also write this as a scholar of literature and book history whose field is usually preceded by “area studies” or “Japanese.” I do Japanese book history. Book history unqualified is reserved for Western Europe. (I was surprised to find that even colonial and later United States is halfway to area studies in terms of isolation within that field, when I did a broad literature review a few years ago.) I forget this sometimes, as I don’t see “Japanese studies” or “area studies” as a hugely productive way to talk about intellectual work.

In fact, I usually try to leave “Japan” for the end of the description. It’s become a habit after years of talking to people who do what I do, but who are not in “area studies,” and who throw up an immediate barrier when I specify the country: Oh, I don’t know anything about Japan. Or, more realistically, just the “Oh.” OH. The conversation comes to an end. We outside of Western Europe are still a footnote, or a single book chapter on our “area,” in book history. I get so far into my own work and my own interest in the field, thinking about what exactly it is that I do among all of the sub-fields of book history that are out there, that I lose perspective and forget that this conversation stopper will inevitably happen.

I don’t want to ask, is there a “Japanese” digital humanities too? Because I don’t think that’s the case, and I hope that my optimism isn’t misplaced when I say that this is too good of a community to throw up that barrier. But I wonder too, because when I ask that question, I realize I can’t quite answer it myself. We – outside of Western European languages – are splintered ourselves.

So part of my frustration is cultural in a sense, and the other part of it is practical. But I think they are interrelated.

And to end on a cultural/practical note of a wholly different sort: e-books, e-journals, everything, are largely unavailable for Japanese-language materials. Here’s how I get a journal article: I fill out a form and mail it to a library with the journal, they photocopy it and mail it back to me. Nothing (well, almost nothing) is available electronically. I bought a Kindle recently in the vain hope that Japanese publishers will wake up sometime this century and start selling e-books, which they currently do not. When that day comes, I will be ready to buy a million digitized $3 novels on my Kindle so I don’t have to pay some friend in Japan to pick up a bunch of them from the train station bookstore for me. Fingers crossed.

(Note: I did not mention Chinese language works here because I honestly don’t know the situation, but in my experience, Chinese-language materials have been much more widely available in digital form for a long time now, and continue to be. China appears to be on top of this.)

2 thoughts on “what can you do with a million (non-digitized) books?”

  1. Funny you should mention digital humanities. I just came across the term yesterday (probably not for the first time), and thought to myself, “really?”. This seems like the kind of term the Japanese would have invented. It sounds kind of awkward, don’t you think? What the hell is “digital humanities”? Is it the humanities-style study of those things digital? Or is it the digitization (digitalization?) of materials studied by humanities scholars, such as historical documents, artworks, and pieces of literature?

    Anyway… Yes, there very much is a Japanese field of digital humanities. Prof Akama Ryō of Ritsumeikan is among the leaders in the field, at least as it applies to ukiyo-e prints and Edo period woodblock-printed books.

    Over the last two days, he introduced me and my fellow Freer/Sackler summer interns to the idea of digital humanities, the history of the field in Japan, to the project he and his team have been engaging in (digitizing collections of prints & Edo pd books at a handful of institutions across Europe and the US), and the hands-on details of how the digitization is done, since that’s what we will be doing this summer – digitizing a collection of Edo pd books in the Freer collection.

    Your point about readable, searchable text is an excellent one and an important one. Some sites already allow you to do this, and it’s marvelous. (NDL, I think?) But, yeah, it is indeed much more difficult for Japanese books and especially for pre-WWII Japanese books. As we’re an art museum working on an art project and treating the books as art objects, I am sorry to say that we too will be digitizing these books only as photos, not as any kind of searchable full-text.

    But I’m excited to be involved in the process, and would love to exchange notes with you on “digital humanities” and various aspects of this field and process and project.

  2. I know so little about this topic but am fascinated by it. For centuries human beings have relied on sharing information through physical pieces of paper, literally tree pulp or papyrus, and now we are shifting to information transfer occurring in ways we cannot see. (If I loan you a book, I see the book physically leave my hands and go to yours; if I send you a pdf, as an average PC user, I don’t think about the physical transfer of data from my PC to yours through a fiber-optic cable.) The consequences of this shift are manifesting themselves in so many interesting ways, from research in the humanities as you’ve discussed to the way contracts are interpreted in the courtroom. Since you’re in the field, please keep me updated so I can continue to be informed!
