
World-building and retelling and digital literary studies

I’ve been thinking, ever since listening to Alicia Peaker’s amazing WORD LAB presentation about studying environmental words in fiction, about the creative writing process and DH. Specifically: the kind of surface reading that constitutes a lot of digital literary studies, and the lack of attention to things that we writers would foreground as central to our fiction, to what’s behind it, and to the story we are trying to tell.

I see a lot of work about plot, about identifying the percentage of dialogue spoken by each gender (or just character counts and numbers of appearances by gender), and about sentiment words. In other words, an inordinate amount of attention is paid to the language that addresses humans and their feelings and actions within a story, most often within a corpus of novels.

However. As I write, and as I listen to other professional or amateur writers (I am in the latter group) talk about writing, what comes up very often is world building. And I just read an article about the question of retelling existing stories. Other than Donald Sturgeon’s and Scott Enderle’s recent work on text reuse (in premodern Chinese texts (and see the paywalled article too) and Star Wars fanfic, respectively), I don’t see much addressing the latter, which is hugely important. Most writing, maybe all writing, does not come from scratch, sui generis out of a writer’s mind. (That’s not even to get into the roles of all the other people involved in both published fiction and fanfic communities, if we’re going to talk about Scott’s work, for example.) We’re also missing the community that surrounds writing and publishing (and fanfic is published online, even if not in a traditional model), and reception, of course. And that’s probably not a criticism that originates with me.

When it comes to the structure of fiction, though, I think we’re paying alarmingly little attention to some of the elements that matter most to writers, and thus to what constitutes what they write. Even putting authorial intent aside (that is, trying to understand what is “behind” a novel), this is also on the surface of the novel: the world is what makes up the environment in which the story is told. Anyone could tell you that interactions with the environment, and the shape of the environment as the infrastructure of what can happen within it, are just as fundamental to a work of fiction as the ostensible “plot” and “characters.” (And, to bring up another element of good writing, there’s the question of whether you can even separate them: these two books on character arcs and crafting emotional fiction come to mind as examples of how writers are told not to consider these things even remotely in isolation.)

And then there is the question of where stories come from in the first place. I’m not just talking about straight-up adaptation, although that’s a project I’d love to somehow make work between Meiji Japanese-language fiction and 19th-century English or French novels, pairs that we may not realize are connected even now. (Many Meiji works are either what we’d now call “plagiarism” of foreign novels, or adaptations subtle enough that something like Ozaki Kōyō’s Konjiki yasha was not “discovered” to have been an adaptation until very recently.) How do we understand how writers generate their stories? Where do they take the elements from that are important to them, that influence them, that they want to retell in some manner? Are there projects I’m not aware of (aside from the two I mentioned) that go deep not just into straight-up, obvious word or phrasing reuse, but … well, into the adaptation and reuse of structures, devices, or elements?

These absolutely fundamental elements of fiction writing have not, I think, been ignored in traditional literary criticism (see, for example, the term “intertextuality”), but I don’t feel like they’ve made it into digital literary projects, at least not the most well-known and most-discussed ones. But if I am wrong, and there are projects on world building, environmental elements, or intertextuality that I am missing, please let me know in the comments or via email (sendmailto@ this website) so I can check them out!

Japanese tokenization – tools and trials

I’ve been looking (okay, not looking, wishing) for a Japanese tokenizer for a while now, and today I decided to sit down and do some research into what’s out there. It didn’t take long – things have improved recently.

I found two tools quickly: kuromoji Japanese morphological analyzer and the U-Tokenizer CJK Tokenizer API.

First off – so what is tokenization? Basically, it’s separating sentences into words, or documents into sentences, or any text into some unit, so that you can chunk the text into parts and analyze them (or do other things with them). When you tokenize a document like a web page by word, you enable searching: this is how Google finds individual words in documents. You can also extract keywords from a document this way, by writing an algorithm to choose the most meaningful nouns, for example. It’s also the first step in more involved linguistic analysis like part-of-speech tagging (that is, marking individual words as nouns, verbs, and so on) and lemmatizing (paring words down to their stems, such as removing plural markers and un-conjugating verbs).
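To make that concrete, here’s a toy Python sketch (my own example sentence, not tied to any particular tool): whitespace tokenization plus simple term counting, which is the bare-bones version of what indexing and keyword extraction build on.

```python
from collections import Counter

# Toy illustration only (no particular tool assumed): whitespace tokenization
# plus term counting, the simplest version of indexing and keyword extraction.
text = "the cat sat on the mat and the dog sat on the cat"
tokens = text.split()            # break the string into word-like units on spaces
counts = Counter(tokens)         # term frequencies, usable for search or crude keywords

print(tokens[:5])                # ['the', 'cat', 'sat', 'on', 'the']
print(counts.most_common(3))     # [('the', 4), ('cat', 2), ('sat', 2)]
```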

This gives you a taste of why tokenization is so fundamental and important for text analysis. It’s what lets you break up an otherwise unintelligible (to the computer) string of characters into units that the computer can attempt to analyze: it can index them, search them, categorize them, group them, visualize them, and so on. Without this, you’re stuck with “words” that are entire sentences or documents, which the computer treats as individual units simply because they’re one long string of characters.

Usually, the way you tokenize is to break up “words” based on spaces (or sentences based on punctuation rules, etc., although that doesn’t always work). (I put “words” in quotes because you can really make any kind of unit you want; the computer doesn’t understand what words are, and in the end it doesn’t matter. I’m just using “words” as an example here.) However, for languages like Japanese and Chinese (and to a lesser extent Korean) that don’t use spaces to delimit all words (in Korean, for example, particles are attached to nouns with no space in between, like writing “athome” instead of “at home”), you run into problems quickly. How do you break up a text into words when there’s no easy way to distinguish between them?
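A tiny sketch makes the problem obvious (the example sentences are mine):

```python
# Example sentences of my own, just to show where space-based splitting breaks down.
english = "I read a book at home"
japanese = "私は家で本を読んだ"   # the same sentence in Japanese, with no spaces between words

print(english.split())    # ['I', 'read', 'a', 'book', 'at', 'home']: six usable tokens
print(japanese.split())   # ['私は家で本を読んだ']: one giant "token", so a morphological analyzer is needed
```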

The question of how to tokenize Japanese may be a linguistic debate in itself; I don’t know enough about linguistics to begin to participate in it, if it is. But I’ll quickly say that you can break up Japanese based on linguistic and dictionary rules: understanding which character compounds are nouns, which verb conjugations go with which verb stems (as opposed to being particles in between words), and then breaking common particles into their own units. This appears to be how these tools are doing it. For my own purposes, I’m not as interested in linguistic patterns as I am in noun and verb usage (the meaning rather than the kind), so linguistic nitpicking won’t be my area anyway.

Moving on to the tools. I put them through the wringer with the first two lines of Higuchi Ichiyō’s Ame no yoru, from Aozora bunko.

One, kuromoji, is the tokenizer behind Solr and Lucene. It does a fairly good job, although with Ichiyō’s uncommon word usage and conjugation it faltered: it couldn’t figure out that 高やか is one word and instead divided it into 高 や か. It gives the base form, reading, and pronunciation, but nothing else. However, in the version that ships with Solr/Lucene, it lemmatizes. Would that ever make me happy. (That is, again, reducing a word to its base form, which makes it easy to count all instances of both “people” and “person,” for example, if you’re just after meaning.) I would kill for this feature to be integrated with the tool below.

The other, U-Tokenizer, did significantly better, but its major drawback is that it works via an HTTP request, meaning that you can’t put in entire documents (well, maybe you could? how much can you pass in an HTTP request?). If it were downloadable code with an API, I would be very happy (kuromoji is downloadable and has a command-line interface). U-Tokenizer figured out that 高やか is one word, and it also provides a list of “keywords,” which as far as I can tell is a bunch of salient nouns. I used it on a very short piece of text, so I can’t comment on how many keywords it would come up with for an entire document. The documentation is sparse, and the tool isn’t open source, so it’s impossible to know exactly what it’s doing. Still, it’s a fantastic tool, and it also seems to work decently for Chinese and Korean.
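One workaround for the length limit, if you did want to run whole documents through it, would be to send the text in pieces. The sketch below is purely hypothetical: the endpoint URL and the parameter name are placeholders rather than U-Tokenizer’s actual API, which you would need to take from its documentation.

```python
import requests

# Hypothetical sketch: the endpoint URL and the "text" parameter below are
# placeholders, not U-Tokenizer's documented API; take the real ones from its docs.
# The point is only the workaround: send a long document in smaller chunks.
API_URL = "https://example.com/tokenize"   # placeholder endpoint
CHUNK_SIZE = 1000                          # characters per request; adjust to whatever the API allows

def tokenize_document(text):
    tokens = []
    # In practice, split on sentence boundaries (。) instead of raw character
    # counts, so that no word is cut in half across two requests.
    for i in range(0, len(text), CHUNK_SIZE):
        chunk = text[i:i + CHUNK_SIZE]
        resp = requests.post(API_URL, data={"text": chunk})
        resp.raise_for_status()
        tokens.extend(resp.json())         # assumes the service returns a JSON list of tokens
    return tokens
```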

Each of these tools has its strengths, and both are quite usable for modern and contemporary Japanese. (I really was cruel to feed them Ichiyō.) However, there is a major trial involved in using them with freely available corpora like Aozora bunko. Guess what? Preprocessing ruby.

Aozora texts contain ruby marked up within the documents. I have my issues with stripping ruby out of documents that use it heavily (Meiji writers, for example), because it adds so much meaning to the text, but let’s say for argument’s sake that we’re not interested in the ruby. Now it’s time to cut it all out. If I were a regular-expressions wizard (or even had basic competency with them) I could probably strip it out easily, but it would still be time consuming. Download a text, strip out the ruby and other metadata, save as plain text. (Aozora texts are XHTML, NOT “plain text” as they’re often touted to be.) Repeat. For topic modeling with a tool like MALLET, you’re going to want hundreds of documents at the end of it. For example, you might download all the Meiji novels on Aozora and divide them into chunks or chapters. Even the complete works of Natsume Sōseki aren’t enough without cutting them down into chapters or even paragraphs to make enough documents to use a topic modeling tool effectively. Then, possibly, run all of these through a part-of-speech tagger like KH Coder. This is going to take a significant amount of time.
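For what it’s worth, the core of that stripping script might look something like the sketch below, assuming the standard XHTML ruby markup; the page header, footer, and Aozora’s annotation blocks would still need their own handling.

```python
import re

# Minimal sketch, assuming the standard Aozora XHTML ruby markup
# (<ruby><rb>漢字</rb><rp>（</rp><rt>かんじ</rt><rp>）</rp></ruby>).
# A real script would also need to cut the page header, footer, and annotation block.
def strip_ruby(xhtml):
    text = re.sub(r"<rt>.*?</rt>", "", xhtml)   # remove the reading glosses
    text = re.sub(r"<rp>.*?</rp>", "", text)    # remove the fallback parentheses
    text = re.sub(r"<[^>]+>", "", text)         # drop all remaining tags, keep their contents
    return text

sample = "雨の<ruby><rb>夜</rb><rp>（</rp><rt>よ</rt><rp>）</rp></ruby>に"
print(strip_ruby(sample))   # 雨の夜に
```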

Then again, preprocessing is an essential and extremely time-consuming part of almost any text analysis project. I went through a moderate amount of work just removing Project Gutenberg metadata and dividing a set of travel narratives into chapters, and those I had downloaded in plain text, thankfully not in HTML or XML, which made for easy processing. With something that’s not already real plain text, with a lot of metadata, and with a lot of ruby, it’s going to take much more time and effort, and that is more typical of a project like this. The digital humanities involve a lot of manual labor, despite the glamorous image and the idea that computers can do a lot of that labor for us. Computers are a little finicky about what they’ll accept. (Granted, I’ll be using a script to strip out the XHTML and ruby tags, but it’s going to take work for me to write it in the first place.)
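By way of comparison, the Gutenberg part of that preprocessing can be sketched in a few lines; the exact marker text and chapter-heading style differ from file to file, so this is a starting point rather than a drop-in solution.

```python
import re

# Rough sketch of the Gutenberg side of that preprocessing: cut the boilerplate
# header and footer, then split the body into chapters. The marker wording and
# the chapter-heading pattern vary between files, so treat both regexes as
# starting points rather than a universal solution.
def split_gutenberg_chapters(raw):
    body = re.split(r"\*\*\* ?START OF.*?\*\*\*", raw, maxsplit=1)[-1]
    body = re.split(r"\*\*\* ?END OF.*?\*\*\*", body, maxsplit=1)[0]
    # The first piece is front matter before the first heading; drop empty pieces.
    chapters = re.split(r"\n\s*CHAPTER\s+[IVXLC\d]+.*\n", body)
    return [c.strip() for c in chapters if c.strip()]
```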

In conclusion? Text analysis, despite the exciting tools available, is still hard and time consuming. There is a lot of potential here, but I also see myself going through some trials to get to the fun part, the experimentation. Still, stay tuned, especially for some follow-up posts on these tools and KH Coder as I become more familiar with them. And I promise to stop being difficult and feeding them Ichiyō’s Meiji-style bungo.