Category Archives: literature

Pre-processing Japanese literature for text analysis

I recently wrote a small script to perform a couple of functions for pre-processing Aozora Bunko texts (text files of public domain, modern Japanese literature and non-fiction) to be used with Western-oriented text analysis tools, such as Voyant, other TAPoR tools, and MALLET. Whereas Japanese text analysis software focuses largely on linguistics (tagging parts of speech, lemmatizing, etc.), Western tools open up possibilities for visualization, concordances, topic modeling, and other various modes of analysis.

Why do these Aozora texts need to be processed? Well, a couple of issues.

  1. They contain ruby, which are basically glosses of Chinese characters that give their pronunciation. These can be straightforward pronunciation help, or actually different words that give added meaning and context. While I have my issues with removing ruby, it’s impossible to do straightforward tool-based analysis without removing it, and many people who want to do this kind of analysis want it to be removed.
  2. The Aozora files are not exactly plain text: they’re HTML. The HTML tags and Aozora metadata (telling where the text came from, for example) need to be removed before analysis can be performed.
  3. There are no spaces between words in Japanese, but Western text analysis tools identify words by looking at where there are spaces. Without inserting spaces, it looks like each line is one big word. So I needed to insert spaces between the Japanese words.

How did I do it? My approach, because of my background and expertise, was to create a Python script that used a couple of helpful libraries, including BeautifulSoup for ruby removal based on HTML tags, and TinySegmenter for inserting spaces between words. My script requires you to have these packages installed, but it’s not a big deal to do so. You then run the script in a command line prompt. It looks for all .html files in a directory, loads each one and runs the pre-processing, then writes each processed file out under the same filename with a .txt ending, as a plain text UTF-8 encoded file.
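The directory loop described above might look roughly like the sketch below. It uses only the standard library. The function name `convert_directory`, the `process` callback, and the default input encoding (Shift_JIS) are assumptions made for illustration, not details taken from the actual script.

```python
import tempfile
from pathlib import Path


def convert_directory(src_dir, process, encoding="shift_jis"):
    """Run `process` on every .html file in src_dir and write a UTF-8 .txt next to it.

    `encoding` is the assumed encoding of the input files; change it if
    your downloads are encoded differently.
    """
    written = []
    for html_path in sorted(Path(src_dir).glob("*.html")):
        raw = html_path.read_text(encoding=encoding, errors="replace")
        out_path = html_path.with_suffix(".txt")  # same filename, .txt ending
        out_path.write_text(process(raw), encoding="utf-8")
        written.append(out_path)
    return written


# Small demonstration in a throwaway directory. The "processing" here is a
# placeholder (str.strip) standing in for the real pre-processing steps.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample.html").write_text("<p>雨の夜</p>\n", encoding="shift_jis")
    outputs = convert_directory(tmp, process=str.strip)
    results = [(p.name, p.read_text(encoding="utf-8")) for p in outputs]

print(results)
```

Passing the processing step in as a function keeps the file handling separate from the text cleaning, so each cleaning step can be tried on its own.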

The first step in the script is to remove the ruby. Helpfully, the ruby is contained in several HTML tags. I had BeautifulSoup traverse the file and remove all elements contained within these tags; it removes both the tags and content.
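As a rough illustration of this step, here is a sketch that swaps BeautifulSoup for a standard-library regular expression, so it is not the script's actual method. It assumes the glosses sit in `<rt>` (the reading) and `<rp>` (the fallback parentheses) elements; the function name is hypothetical.

```python
import re

# Assumption: ruby glosses are wrapped in <rt>...</rt> and <rp>...</rp>.
# The pattern removes those elements together with their content.
RUBY_GLOSS = re.compile(r"<(rt|rp)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE)


def strip_ruby_glosses(html):
    """Delete ruby gloss elements (tags and content), leaving the base text."""
    return RUBY_GLOSS.sub("", html)


sample = "<ruby><rb>雨</rb><rp>（</rp><rt>あめ</rt><rp>）</rp></ruby>の夜"
print(strip_ruby_glosses(sample))
```

The base characters stay wrapped in their remaining tags at this point; those tags go away in the next step.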

Next, I used a very simple regular expression to remove everything in angle brackets – i.e. the HTML tags. This is kind of quick and dirty, and won’t work on every file in the universe, but in Aozora texts everything inside angle brackets is an HTML tag, so it’s not a problem here.
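A minimal sketch of that kind of expression follows. The exact pattern in the script may differ, and the function name is made up for the example.

```python
import re

# Anything from "<" up to the next ">" is treated as a tag and removed.
TAG = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Remove everything enclosed in angle brackets."""
    return TAG.sub("", text)


print(strip_tags("<ruby><rb>雨</rb></ruby>の夜<br />"))
```

As noted, this is only safe when every angle bracket in the file belongs to a tag; a literal "<" in the text would confuse it.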

Finally, I used TinySegmenter on the resulting HTML-free text to split the text into words. Luckily for me, it returns an array of words – basically, each word is a separate element in a list like ['word1', 'word2', ... 'wordn'] for n words. This makes my life easy for two reasons. First, I simply joined the array with a space between each word, creating one long string (the outputted text) with spaces between each element in the array (words). Second, it made it easy to just remove the part of the array that contains Aozora metadata before creating that string. Again, this is quick and dirty, but from examining the files I noted that the metadata always comes at the end of the file and begins with the word 底本 (‘source text’). Remove that word and everything after it, and then you have a metadata-free file.
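The list handling can be sketched as below. To keep the example self-contained, a hand-written token list stands in for TinySegmenter's output. The function names are hypothetical, and cutting at the first occurrence of 底本 is an assumption of this sketch.

```python
def drop_metadata(tokens, marker="底本"):
    """Return the tokens that come before the first `marker` token.

    Assumption: the metadata block starts at the first token equal to
    `marker` and runs to the end of the file.
    """
    if marker in tokens:
        return tokens[:tokens.index(marker)]
    return list(tokens)


def to_spaced_text(tokens):
    """Join the remaining tokens with single spaces."""
    return " ".join(drop_metadata(tokens))


# Stand-in for a segmenter's output: a plain list of word strings.
tokens = ["雨", "の", "夜", "底本", "：", "全集"]
print(to_spaced_text(tokens))
```

If 底本 ever appeared as a word in the body of a text, this would cut too early, so it is worth spot-checking the output files.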

Write this resulting text into a plain text file, and you have a non-ruby, non-HTML, metadata-free, whitespace-delimited Aozora text! Although you have to still download all the Aozora files individually and then do what you will with the resulting individual text files, it’s an easy way to pre-process this text and get it ready for tool-based (and also your-own-program-based) text analysis.

I plan to put the script on GitHub for your perusal and use (and of course modification) but for now, check it out on my Japanese Text Analysis research guide at Penn.

#dayofDH Meiroku zasshi 明六雑誌 project

It’s come to my attention that Fukuzawa Yukichi’s (and others’) early Meiji (1868-1912) journal, Meiroku zasshi 明六雑誌, is available online not just as PDF (which I knew about) but also as a fully tagged XML corpus from NINJAL (and oh my god, it has lemmas). All right!


I recently met up with Mark Ravina at Association for Asian Studies, who brought this to my attention, and we are doing a lot of brainstorming about what we can do with this as a proof-of-concept project, and then move on to other early Meiji documents. We have big ideas like training OCR to recognize the difference between the katakana and kanji 二, for example; Meiji documents generally break OCR for various reasons like this, because they’re so different from contemporary Japanese. It’s like asking Acrobat to handle a medieval manuscript, in some ways.

But to start, we want to run the contents of Meiroku zasshi through tools like MALLET and Voyant, just to see how they do with non-Western languages (I don’t expect any problems, but we’ll see) and what we get out of it. I’d also be interested in going back to the Stanford CoreNLP API and seeing what kind of linguistic analysis we can do there. (First, I have to think of a methodology.  :O)

In order to do this, we need whitespace-delimited text with words separated by spaces. I’ve written about this elsewhere, but to sum up, Japanese is not separated by spaces, so tools intended for Western languages think it’s all one big word. There are currently no easy ways I can find to do this splitting; I’m currently working on an application that both strips ruby from Aozora bunko texts AND splits words with a space, but it’s coming slowly. How to get this with Meiroku zasshi in a quick and dirty way that lets us just play with the data?

So today after work, I’m going to use Python’s ElementTree library for XML to take the contents of the word tags from the corpus and just spit them into a text file delimited by spaces. Quick and dirty! I’ve been meaning to do this for weeks, but since it’s a “day of DH,” I thought I’d use the opportunity to motivate myself. Then, we can play.
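Here is a rough sketch of that plan using the standard library's `xml.etree.ElementTree`. The element name `w` and the sample XML are placeholders invented for the example; the real corpus will use its own tag names, so `word_tag` would need to be changed to match.

```python
import xml.etree.ElementTree as ET


def words_from_xml(xml_text, word_tag="w"):
    """Collect the text of every `word_tag` element and join with spaces.

    `word_tag` is a hypothetical name; check the corpus for the real one.
    """
    root = ET.fromstring(xml_text)
    words = ["".join(el.itertext()).strip() for el in root.iter(word_tag)]
    return " ".join(w for w in words if w)


# Invented sample, not taken from the corpus.
sample_xml = "<text><s><w>学問</w><w>の</w><w>すすめ</w></s></text>"
print(words_from_xml(sample_xml))
```

Writing the returned string to a UTF-8 text file would then give the whitespace-delimited input that tools like MALLET and Voyant expect.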

Exciting stuff, this corpus. Unfortunately most of NINJAL’s other amazing corpora are available only on CD-ROMs that work on old versions of Windows. Sigh. But I’ll work with what I’ve got.

So that’s your update from the world of Japanese text analysis.

#dayofDH Japanese apps workshop for new Penn students

Today, we’re having a day in the library for prospective and new Penn students who will (hopefully) join our community in the fall. As part of the library presentations, I’ve been asked to talk about Japanese mobile apps, especially for language learning.

While I don’t consider this a necessarily DH thing, some people do, and it’s a way that I integrate technology into my job – through workshops and research guides on various digital resources. (More on that later.)

I did this workshop for librarians at the National Coordinating Council on Japanese Library Resources (NCC)’s workshop before the Council on East Asian Libraries conference a few weeks ago in March 2014. My focus was perhaps too basic for a savvy crowd that uses foreign languages frequently in their work: I covered the procedure for setting up international keyboards on Android and iOS devices, dictionaries, news apps, language learning assistance, and Aozora bunko readers. However, I did manage to impart some lesser known information: how to set up Japanese and other language dictionaries that are built into iOS devices for free. I got some thanks on that one. Also noted was the Aozora 2 Kindle PDF-maker.

Today, I’ll focus more on language learning and the basics of setting up international keyboards. I’ve been surprised at the number of people who don’t know how to do this, but not everyone uses foreign languages on their devices regularly, and on top of that, not everyone loves to poke around deep in the settings of their computer or device. And keyboard switching on Android can be especially tricky, with apps like Simeji. So perhaps covering the basics is a good idea after all.

I don’t have a huge amount of contact with undergrads compared to the reference librarians here, and my workshops tend to be focused on graduate students and faculty with Japanese language skills. So I look forward to working with a new community of pre-undergrads and seeing what their needs and desires are from the library.

politics and anthologizing

In this past year, I’ve spent a lot of time thinking about how the form of the anthologies I study (literary individual author anthologies in Japan at the turn of the 20th century) impacts possibilities of reading and interpretation. I’ve also commented at a couple of conferences that the narratives of who these authors “belong” to have been shaped and guided in these anthologies, and have written that taking works out of their original contexts fundamentally erases a part of their meaning (in terms of the ways readers encounter them) and simultaneously alters the work in terms of its received meaning.

After doing some reading this morning, I realized that one thing links these various threads in anthologies, and it’s a word I wasn’t using: politics.

I want to talk specifically about the example of Higuchi Ichiyō. For much of her career, she wrote for the magazine Bungakukai (among others) which was a driver of the first Romantic movement in Japan. In her anthologies, of course her serial works from that magazine are included as whole pieces, as though they were wholes from the outset, which has its own implications for reading. But the other piece of this is that just as the editors were writing the Bungakukai coterie social and ideological connections out of her career in their prefaces, they simultaneously erased this connection – this fundamental supplier of meaning – from her works by taking them out of their original Romantic context.

The first readers of Ichiyō’s works would have seen them embedded in theory and poetry heavily influenced by western Romanticism, including translations of English works and illustrations of faded ruins and statuary. The readers of her individual anthology, as well as reprints in wider circulation magazines such as Bungei kurabu before her death, would have encountered a very different context: in the magazines, other “modern” mainstream Japanese literature (presented as unaffiliated with any coterie or group other than the influential publishers of the magazines), and in the anthology, Ichiyō’s own works as a cohesive and self-contained whole. No longer would her work be infused, by virtue of proximity, with the politics of literature at the time she wrote in the early-to-mid 1890s. She becomes depoliticized, ironically despite the heavily social and what I would call political themes of her work: that is, the plight of the lower class and the inequity of Japanese society at the turn of the 20th century.

Especially in her second anthology, published in 1912, Ichiyō becomes a timeless woman writer, an elegant author of prose and poetry whose works are infused with tragedy – just as her poverty-stricken life was, to paraphrase the editors of the two volumes. Yet it is not a structural tragedy that pervades society, as it is in her work, but a personal, elegant, and heart-wrenching individual tragedy, one that makes her work even more poignant without necessarily having political implications. I can’t speak to the Romantic movement’s attitude toward this kind of theme found in Bungakukai, not being as familiar with its politics as I should be, but I can say that Kitamura Tōkoku – the founder of Bungakukai – basically started his career with the publication of Soshū no shi, a piece of “new-form” poetry about a prisoner, written at the height of his political involvement in the late 1880s.

So there is an association, simply by virtue of publishing in the same venues, between Ichiyō’s politics and those of Tōkoku, and the literary politics of the Romantic movement vis-à-vis the multitude of other ideologies of writing that existed at the time. Yet in her anthologies, this politics disappears and her context is lost entirely, in favor of a new context of Ichiyō alone, her works as something that stand alone without interference from the outside world. It is a profound depoliticization and something to think about in considering other anthologies as well, both early ones in Japan, current ones, and those found elsewhere in the world.

Japanese tokenization – tools and trials

I’ve been looking (okay, not looking, wishing) for a Japanese tokenizer for a while now, and today I decided to sit down and do some research into what’s out there. It didn’t take long – things have improved recently.

I found two tools quickly: kuromoji Japanese morphological analyzer and the U-Tokenizer CJK Tokenizer API.

First off – so what is tokenization? Basically, it’s separating sentences by words, or documents by sentences, or any text by some unit, to be able to chunk that text into parts and analyze them (or do other things with them). When you tokenize a document by word, like a web page, you enable searching: this is how Google finds individual words in documents. You can also find keywords from a document this way, by writing an algorithm to choose the most meaningful nouns, for example. It’s also the first step in more involved linguistic analysis like part-of-speech tagging (think, marking individual words as nouns, verbs, and so on) and lemmatizing (paring words down to their stems, such as removing plural markers and un-conjugating verbs).

This gives you a taste of why tokenization is so fundamental and important for text analysis. It’s what lets you break up an otherwise unintelligible (to the computer) string of characters into units that the computer can attempt to analyze. It can index them, search them, categorize them, group them, visualize them, and so on. Without this, you’re stuck with “words” that are entire sentences or documents, that the computer thinks are individual units based on the fact that they’re one long string of characters.

Usually, the way you tokenize is to break up “words” based on spaces (or sentences based on punctuation rules, etc., although that doesn’t always work). (I put “words” in quotes because you can really make any kind of unit you want, the computer doesn’t understand what words are, and in the end it doesn’t matter. I’m using “words” as an example here.) However, for languages like Japanese and Chinese (and to a lesser extent Korean) that don’t use spaces to delimit all words (for example, in Korean particles are attached to nouns with no space in between, like saying “athome” instead of “at home”), you run into problems quickly. How to break up texts into words when there’s no easy way to distinguish between them?
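To make the problem concrete, here is a small sketch of space-based tokenization applied to an English phrase and to a Japanese one. The function name and the sample sentences are made up for the illustration.

```python
def whitespace_tokens(text):
    """Split text into "words" wherever there is whitespace."""
    return text.split()


english = whitespace_tokens("It was a rainy night")
japanese = whitespace_tokens("雨の夜でした")

print(len(english), english)
print(len(japanese), japanese)
```

The English phrase comes back as five units, while the Japanese one comes back as a single unit, which is exactly the "one big word" problem.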

The question of tokenizing Japanese may be a linguistic debate. I don’t know enough about linguistics to begin to participate in it, if it is. But I’ll quickly say that you can break up Japanese based on linguistic rules and dictionary rules – understanding which character compounds are nouns, which verb conjugations go with which verb stems (as opposed to being particles in between words), then breaking up common particles into their own units. This appears to be how these tools are doing it. For my own purposes, I’m not as interested in linguistic patterns as I am in noun and verb usage (the meaning rather than the kind) so linguistic nitpicking won’t be my area anyway.

Moving on to the tools. I put them through the wringer: Higuchi Ichiyō’s Ame no yoru, the first two lines, from Aozora bunko.

One, kuromoji, is the tokenizer behind Solr and Lucene. It does a fairly good job, although with Ichiyō’s uncommon word usage and conjugation, it faltered and couldn’t figure out that 高やか is one word; rather it divided it into 高 や か.  It gives the base form, reading, and pronunciation, but nothing else. However, in the version that ships with Solr/Lucene, it lemmatizes. Would that ever make me happy. (That’s, again, reducing a word to its base form, making it easy to count all instances of both “people” and “person” for example, if you’re just after meaning.) I would kill for this feature to be integrated with the below tool.

The other, U-Tokenizer, did significantly better, but its major drawback is that it’s done in the form of an HTTP request, meaning that you can’t put in entire documents (well, maybe you could? how much can you pass in an HTTP request?). If it were downloadable code with an API, I would be very happy (kuromoji is downloadable and has a command line interface). U-Tokenizer figured out that 高やか is one word, and also provides a list of “keywords,” which as far as I can tell is a bunch of salient nouns. I used it for a very short piece of text, so I can’t comment on how many keywords it would come up with for an entire document. The documentation on this is sparse, and it’s not open source, so it’s impossible to know what it’s doing. Still, it’s a fantastic tool, and also seems to work decently for Chinese and Korean.

Each of these tools has its strengths, and both are quite usable for modern and contemporary Japanese. (I really was cruel to feed them Ichiyō.) However, there is a major trial involved in using them with freely-available corpora like Aozora bunko. Guess what? Preprocessing ruby.

Aozora texts contain ruby marked up within the documents. I have my issues with stripping out ruby from documents that heavily use them (like Meiji writers, for example) because they add so much meaning to the text, but let’s say for argument’s sake that we’re not interested in the ruby. Now, it’s time to cut it all out. If I were a regular expressions wizard (or even had basic competency with them) I could probably strip this out easily, but it’s still time consuming. Download text, strip out ruby and other metadata, save as plain text. (Aozora texts are XHTML, NOT “plain text” as they’re often touted to be.) Repeat. For topic modeling using a tool like MALLET, you’re going to want to have hundreds of documents at the end of it. For example, you might be downloading all Meiji novels from Aozora and dividing them into chunks or chapters. Even the complete works of Natsume Sōseki aren’t enough without cutting them down into chapters or even paragraphs to make enough documents to use a topic modeling tool effectively. Possibly, run all these through a part-of-speech tagger like KH Coder. This is going to take a significant amount of time.
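Once a text is whitespace-delimited, cutting it into smaller documents can be sketched as below. The function name and the default chunk size are arbitrary choices for the example, and splitting by a fixed number of tokens is only one option alongside chapters or paragraphs.

```python
def chunk_tokens(spaced_text, size=500):
    """Split whitespace-delimited text into chunks of at most `size` tokens."""
    tokens = spaced_text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


chunks = chunk_tokens("雨 の 夜 の 物語 です", size=4)
print(chunks)
```

Each chunk could then be saved as its own file, turning one long novel into many short documents for topic modeling.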

Then again, preprocessing is an essential and extremely time-consuming part of almost any text analysis project. I went through a moderate amount of work just removing Project Gutenberg metadata and dividing into chapters a set of travel narratives that I downloaded in plain text, thankfully not in HTML or XML. It made for easy processing. With something that’s not already real plain text, with a lot of metadata, and with a lot of ruby, it’s going to take much more time and effort, which is more typical of a project like this. The digital humanities are a lot of manual labor, despite the glamorous image and the idea that computers can do a lot of manual labor for us. Computers are a little finicky with what they’ll accept. (Granted, I’ll be using a computer script to strip out the XHTML and ruby tags, but it’s going to take work for me to write it in the first place.)

In conclusion? Text analysis, despite exciting available tools, is still hard and time consuming. There is a lot of potential here, but I also see myself going through some trials to get to the fun part, the experimentation. Still, stay tuned, especially for some follow-up posts on these tools and KH Coder as I become more familiar with them. And, I promise to stop being difficult and giving them Ichiyō’s Meiji-style bungo.

Introducing Waseda bungaku #2 早稲田文学第二次

Waseda bungaku, the literary magazine of Waseda University (Tokyo Senmon Gakkō until 1902), was originally published in the 1880s by famed writer and theater critic (and professor) Tsubouchi Shōyō, and ceased publication in the 1890s. It was started up again by his successors, explicitly in his honor and in that of the original magazine, in 1906, and went until 1927. This, as opposed to the first run (dai ichi-ji) is known now as the second series or run, dai ni-ji. It’s since gone through a number of changes in ji and is on dai-jūji (#10) in its current form – it’s still a running literary magazine today.

I’m particularly interested in this second run of the magazine because of its content, as well as its clear intent to do honor to the original, influential mid-Meiji (1868-1912) periodical. As I’ve touched on in previous posts, it’s highly nostalgic, with articles not only on current novels but on earlier Meiji works, and memories of the writers regarding their literary and social groups from their youths in the 1880s and early 1890s. There were some special Meiji literature issues (特別号) that came out in expanded form and cost significantly more than the typical issue, but even the other issues are full of memories, not just current concerns.

The publisher of the magazine, Tōkyōdō, is also of interest to me, and I’m currently starting to try to look into the relationship of this commercial publisher and the academic interest group behind Waseda bungaku. Surprisingly to me, there is quite a lot published (in a relative sense, and relative to my expectations) on both Waseda University, and also Tōkyōdō itself. (Including great titles like A Stroll Through 100 Years of Tōkyōdō History.) I’m fast checking these books out and they’re becoming a growing mountain on my office bookshelves, with a significant amount of space taken up by four volumes of the 9-volume set 100 Years of Waseda University History.

Why am I so interested in this publishing history? Well, I recently received the 1929 Meiji bungaku kenkyū, which is ostensibly (according to catalog records, anyway) a reprint edition of the special Meiji literature issues of Waseda bungaku. However, when I examined the two-volume set itself, it’s a set of rebound issues – original covers and advertisements and all, bound up in hardcovers. Even the preface refers to new binding (新装) specifically, rather than a new printing or a collection. It’s extremely explicit that it’s a literal collection of old magazine issues.

The fact that Tōkyōdō seems to have rebound its overstock in 1929, two years after the journal ceased, and sold it at relatively low prices (5 yen for the set) is interesting enough, but what is even better is the fact that the advertisements are not from 1925, when the first issues included were originally published, but from 1927. Even more interesting, they’re Meiji-focused, largely for the series Meiji bungaku meicho zenshū, a collection of “famous writers” of Meiji literature (which I’ve posted on previously). These are obviously reprinted issues of the magazine from 1927, two years after their original publication date, and have had current advertisements related to the content of the issues (remember, “special Meiji literature” issues) inserted into them instead of the original 1925 ads for things like books written by the journal editors on Western philosophers. (By “original” I’m referring actually to a reproduction I have of these same issues with 1925 ads, but am not actually sure if these are from “originals” as in first printings, or if these are also later printings that have been reproduced.)

So this indicates that not only are these overstock that Tōkyōdō wanted to try to sell off in a repackaged format (“as a resource for future Meiji scholars” rather than “old issues of a literary magazine from four years ago”), but they were later printings than the 1925 original first printings. This means that there was enough interest in and demand for the Meiji special issues, whether at the time or after the fact, for them to be reissued by a commercial publisher whose goal is to make money off of them. There must have been such demand that the publisher saw profit in it.

This brings me back to previous posts about interest in Meiji, Meiji nostalgia, and Meiji and Meiji literature themselves as “things” to be studied, as fields, newly invented post-Meiji and specifically in the late 1920s. (Even if this isn’t the first appearance of the phrase “Meiji literature,” I’d still argue that as a “thing,” it really came into being at this time in terms of being popular, published, studied, and talked about.) There is obviously a market and demand for things Meiji at this time, testified to by both the reissued magazines and their rebinding, packaging, and marketing to “scholars.” I’m still on the fence about what the interest in Meiji actually meant – was it really scholarly work as these collections advertise themselves, or was it something about grasping onto recently lived past and lost youth? Or perhaps both?

Meiji nostalgia: the 1910s-1920s

I’m always struck by the nostalgia for the Meiji period (1868-1912) that I find even before the end of Meiji, but especially in what ramps up in the 1910s-late 1920s, in particular with the reprinting of literary coterie Ken’yūsha’s Garakuta bunko (late 1880s) in 1927, the re-publication of Waseda bungaku’s special Meiji articles and issues in the form of Meiji bungaku kenkyū in 1929, and the publication of Meiji bungaku meicho zenshū (The Complete Collection of Famous Meiji Literary Writers) from 1926. It’s something about this late-20s flurry of Meiji activity, plus what precedes it in the literary journal Waseda bungaku, that fascinates the part of me that is interested in archives and social memory.*

Why social memory? Well, Waseda bungaku, the literary journal of Waseda University (started by Tsubouchi Shōyō in the 1880s-1890s, then on hiatus until 1906, restarting in that year – late Meiji), contains a huge number of articles written by surviving members of Meiji literary groups about their memories and their friends, long or recently dead, and their reminiscences of the early days of those groups and associated publications. Shimazaki Tōson writes of the founding and early period of literary magazine Bungakukai and its coterie in the early 1890s, Kōda Rohan writes of the death and life of Awashima Kangetsu, and Emi Suiin writes volumes about Ken’yūsha and its early and late history.

In fact, Suiin not only wrote these lengthy articles, he also penned the book Meiji bundanshi – jiko chūshin (A History of the Meiji Literary World – Focused on Myself) in 1927, and another, Ken’yūsha to Kōyō (Ken’yūsha and [Ozaki] Kōyō) in the same year. These are focused entirely on his memories of his life in the Meiji literary world, including big shot Ozaki Kōyō, Ken’yūsha’s founder and one of the most popular and influential writers of the mid-Meiji period (d. 1902). His books, coincidentally – or perhaps not – came out in the very same year as a reproduction of Ken’yūsha’s first literary magazine, Garakuta bunko, reprinted by an individual (Kaneyama Fumio) with the express purpose of providing more material to Meiji literary scholars interested in that coterie’s activities, for whom the archives were dwindling if they existed at all. Likewise, in 1927 an article appeared in Waseda bungaku on Ken’yūsha’s somewhat later Edo murasaki magazine, testifying to renewed (if perhaps not sustained) interest in that coterie’s publications and, importantly, that specific time period of the early Meiji 20s (late 1880s-early 1890s).

Just two years later, in 1929, a publication came out that commemorated the 27th anniversary of Ozaki Kōyō’s death with a special society pamphlet, for lack of a better word (kaishi 会誌). Why it’s the 27th anniversary is anyone’s guess (or, if I’m missing something culturally significant, please fill me in!).

I recently received a fascinating set of books for my library that collects the “Meiji issues” (Meiji bungaku gō) of Waseda bungaku from 1925-1927, and was published in 1929. It appears to be bound volumes of individual, original Waseda bungaku issues, although there is a discrepancy between those and the reproduction of the “originals” that also arrived – the ads are different, and the ones in the “1925” issues all date from 1927 or later. Leaving this fascinating publishing story aside for the time being, let’s take a look at the preface. Just as with the Garakuta bunko reprints, the editor (Honma Hisao) of Waseda bungaku and these volumes claims that there is a dearth of material for those studying “Meiji literature” and in order to help future scholars, it is a mission of “a magazine with a tradition stretching back into the Meiji period” (i.e., Waseda bungaku) to collect its issues in a gappon 合本 and re-release them to the public.

As Michael Williams pointed out to me, this isn’t even primary sources on Meiji literature – it contains Taisho and Showa writing on Meiji. But I think there’s a particular draw, an almost-primary-source quality, because the articles are by and large written by other Meiji big shots (if not the deceased Kōyō himself) such as Rohan and Tōson and Suiin, and they’re about those Meiji memories and Meiji experiences. They’re social memories of Meiji, giving the reader a direct connection to events and literature of the past through the firsthand experiences of the writers.

So is it really about a lack of Meiji sources? Possibly, but unlikely. Meiji literature was being reprinted and recirculated both in single-volume form as well as in zenshū, or “complete” literary collections, of various kinds. I think it’s more a mixture of nostalgia and fear of the experiences and memories of the period disappearing, perhaps along with the fires that accompanied the 1923 Great Kanto Earthquake, and along with those who were dying, like Awashima Kangetsu had only a few years before. It was a time when the original Ken’yūsha members were old and dying off, when major Meiji figures were disappearing and no longer accessible – and no longer surrounded by others who could also remember the time of their youth.

I have one other tidbit to add to the Meiji nostalgia boom of the late 20s. The series I referenced above, Meiji bungaku meicho zenshū, was published in 12 volumes from 1926-1927 and there are publisher advertising leaflets for it stuffed into the books that make up Meiji bungaku kenkyū (the Meiji re-issues of Waseda bungaku that have been discussed). One is nearly poster-sized. The books that make up the series, save for Kōyō’s Irozange and Rohan’s Fūryūbutsu, are largely forgotten now, and it even includes one translation by Morita Shiken. Yet it’s a “scholarly resource” including explications, criticism, photographs, and illustrations – not exactly nostalgic. But I’d argue that it’s the context in which I find those leaflets that makes them intimate parts of the fabric of Meiji social memory: they’re reprints of the very books that the writers of the nostalgic essays would have read in their youths, and supply the means to remember Meiji through direct experience in 1927, 15 years after the end of the period in 1912.

All of this Meiji-related publishing activity, I see as a flurry of nostalgia for and fear of the loss of Meiji memories, of Meiji experiences, and ultimately of the memories of the writers’ and publishers’ very youth itself. These actions bind up inextricably the institutions of archives (personal and official), publication (private and commercial), remembering (individually and socially), and commemorating – creating the very idea of “Meiji” and “Meiji literature,” an idea that can never be severed, at least in the late 1920s, from the memory and social fabric of those Meiji survivors still living.


* Actually, I came to my dissertation research topic – literary anthologies of the recently deceased – through a course entitled “Archives and Institutions of Social Memory.”

my disagreement with authorship attribution

I’m torn: I’m very interested in stylometry, but I have issues with the fundamental questions that are asked in this field, in particular authorship attribution.

In my research, I’ve thought and written quite a bit about authorship. My dissertation looked at changing concepts of authorship – the singular, cohesive, Romantic genius author as established in collected editions in Japan at the turn of the 20th century – and also at actual practices of writing and authorship that preceded and accompanied these developments. My conclusion about authorship was that it is a kind of performance, embedded in and never preceding the text, and is not coextensive with the historical writer(s) behind the performance – pseudonymous, collective, anonymous, or otherwise.

These performances are necessarily contextualized by space, time, society, culture, literary trends, place of publication, audience. They are more or less without meaning if one doesn’t take context into account, even if not all relevant contexts at once. For a performance takes place within a historic, cultural, and literary moment, and does not exist independently of it. I see that place of performance as both the text and its place of publication, its material manifestation; and it is a performance that is inextricably linked to reader reception.

I also don’t see these performances as necessarily creating a unified authorial identity or unified author-function across space, time, and texts. This may sound extremely counterintuitive given that many performances of authorship share appellations and can be “attributed” to the “same” author, and I recognize that my argument is downright bizarre at times. I blame it on having spent too much time thinking about the implications of this topic. But in a way, our linking of these performances after the fact is artificial, and these different author-functions are, for me, so linked to the time and place of both publication and reading – whether contemporary or not – that they can be seen as separate as well. This is why I concluded that collected literary anthologies are constructing – inventing – an entirely artificial “author” out of the works associated, after the fact, with a historical, individual writer, whose identity and name may not have coincided with that of the authorial performance at all in the first place.

So, that said, let me get to my disagreement with authorship attribution. It’s fundamentally asking the wrong question of authors and authorship: who “really” wrote this text? My argument is that the hand of the historical writer “behind” the authorial performance is a moot point; what matters is the name, or lack of a name, attributed to the text when it is published, republished, read, and reread over time. It’s the performance that takes place at the site of the text, coinciding with and following the creation of the text, deeply associated with and embedded in the text, and located within reception rather than intention. It takes place at a different site than the hand of the historical writer holding a pen or the mind creating an idea. And so the search for the “real” identity of the author is beside the point; what is happening here is really “writership attribution,” which is something separate from authorship.

A colleague recently asked me, too, what the greater goal of authorship attribution is – what is it beyond finding out the person behind the text? What is it besides deciding that the entity constructed with the name Shakespeare “really” wrote an unattributed or mysterious text? And I couldn’t answer this question, which brings me to my second fundamental problem with authorship attribution. I don’t see an overarching research question guiding methodology, besides the narrow goal of establishing writership of a text. This could be my own ignorance, and I’d be happy to be corrected on it.

I’m interested to hear your thoughts!

digital surrogates and utility

As someone who studies the history of the book, often as an object in itself, my research tends to require that I go look at books in person. However, I use the Kindai Digital Library quite regularly as a way to survey what exists (although I fully realize how incomplete Kindai is), and indeed, I would never have found my research topic without being able to preview books using this digital library.

The point is, I previewed the books using Kindai, and then got on a plane to Japan to actually study the books for my research. I had to locate a physical copy and literally get my hands on it, in order to understand how it was made, what impression it would make on readers, and its intended audience. (For example, how well-made is it? Does it have color illustrations or text? What’s the quality of the paper like? Does it feel or look cheap? How is the binding? None of these questions can be answered from the black-and-white copy in Kindai.)

The history of the Kindai Digital Library is interesting: it’s a digitization project undertaken by the National Diet Library and based on the same collection as the Maruzen Meiji Microfilm: books microfilmed and owned by the NDL. Neither covers the entire collection of Meiji books that the NDL owns; it’s not clear (to me, anyway) whether Kindai and Maruzen are coextensive; and the NDL’s collection does not contain every book published in the Meiji period. So, yes, it has limitations – it’s not every book from the Meiji period, and it’s scanned microfilm in black-and-white, not grayscale.

But the Kindai Digital Library, unlike the Maruzen microfilm collection, is being added to continuously, and out-of-copyright books from the Taisho and Showa periods (1912-1989) are also being scanned and included in the collection. The newer books are themselves being digitized, rather than having microfilm as an intermediate step. Check out the difference between these two books by Wakamatsu Shizuko, published in 1897 (color) and 1894 (black and white):

Sure, there is a big impressionistic difference in seeing a full-color cover illustration versus a black-and-white scan of what used to be a color cover. But you can see from these images that it’s very difficult to judge the quality and condition of the book from the monochrome image, whereas the higher-quality color image captures things like discolorations on paper and the quality of the cloth binding (not pictured here).

This makes all the difference for someone doing my kind of research: if I had scanned copies of the anthologies I study that are as good as the color book above, it’s likely that I could still do decent research – if incomplete – without going to Japan to look at these books in person. With the higher-quality color image, the digital surrogate has become a usable surrogate for me, a reasonable facsimile if you will. It provides me with enough information to be able to draw conclusions about more than just the content of the book.

This matters for more than book historians, however. One reason that Kindai Digital Library is so great is that it provides digital surrogates of the full text of books, not just their covers. Every page that is available is scanned, either from microfilm or from the book itself, and provided for viewing online – and, if you have the patience, as a PDF download a few pages at a time. Yet compare these images, again from the 1897 and 1894 books introduced above. Click to view the full size so you can see the quality of the text in each. They are both at 25% zoom in Kindai’s page viewer.

Here, you can appreciate the difficulty of reading the monochrome text – and this is an exceptionally clear one. The books whose excerpts I have read (with difficulty) on Kindai are typically much lower quality, and many characters are difficult to make out. Zooming in doesn’t help, because the quality of the image itself is relatively low.

On the other hand, you have the newer additions with higher-quality surrogates such as this color book. Of course, it’s not necessary to have color pages to read a text that was originally printed in black and white, but the inclusion of values other than straight black or white increases readability by allowing for a higher-quality image. It also allows for clearer text when zooming out and viewing at, say, 33% (a percentage where the monochrome text would look terrible).

As you can see, the point is that the newer Kindai texts are more usable than the older ones, not just prettier. They express the idea that there is a point where a digital surrogate becomes a usable surrogate, where it becomes “good enough” to live up to its name. Of course, “usable” depends on the purpose, but I think we can agree that if “reading” is the purpose, these new scans are far closer to the goal than the old ones.

Kindai should be commended for this commitment to higher quality in new additions to the library; I only wish there were the resources to re-digitize everything in the library at this standard.

Why is this important? It’s not just because it would be an even more convenient resource for me and my colleagues, an even more usable one. It’s because of the very real danger of losing some of these books. There are few, if any, copies of many of them left outside of the NDL’s collection, and many of them can no longer be viewed at the NDL in any format other than microfilm. It’s not clear to me whether the originals are being protected from the public, or if NDL actually only owns the microfilm, with the original lost to time at some point. Regardless, for many books, the Kindai scan (or NDL microfilm, its source) is the only copy of the book available. If it’s not even fully readable – the most basic level of utility beyond knowing from search results that it exists – then we have failed in our task of preservation, and in our task of creating a digital surrogate in the first place. A surrogate can’t take the place of the original if it can’t mimic it in the most basic ways. Given the fragility of Meiji and Taisho (and early Showa) sources, it’s crucial that we make available the highest-quality digital surrogates we can, and as soon as possible, before we no longer can.

*The first few editions of The Complete Works of Higuchi Ichiyo, which feature prominently in my dissertation, are a case of this. I never found a physical copy of the very first edition, actually, even outside of NDL.