In Spring 2018, I taught the grad/undergrad seminar “East Asian Digital Humanities” via Penn’s East Asian Languages & Civilizations (EALC) department, listed as an elective for the undergrad DH minor at the SAS Price Lab. This course focused on digital humanities (DH) research and projects in an East Asian context. It was taught in English to make the course accessible across East Asian studies, which includes at least three languages and geographic regions, but required intermediate to advanced knowledge of at least one East Asian language. Because all of the students came from humanities departments (6 from History and 3 from EALC) they were well versed in East Asian studies but completely new to digital methods and strategies for humanities research.
I’ve been thinking, ever since listening to Alicia Peaker’s amazing WORD LAB presentation about studying environmental words in fiction, about the creative writing process and DH. Specifically: the kind of surface reading that constitutes a lot of digital literary studies, and the lack of attention to things that we writers would foreground as very important to our fiction and what’s behind it, and the story we are trying to tell.
I see a lot of work about plot, about identifying percentage of dialogue spoken by gender (or just character count/number of appearances by gender), about sentiment words. In other words, an inordinate amount of attention paid to the language that addresses humans and their feelings and actions within a story, most often within a corpus of novels.
However. As I write, and as I listen to other professional or amateur writers (I am in the latter group) talk about writing, what comes up very often is world building. And I just read an article about the question of retelling existing stories. Other than Donald Sturgeon’s and Scott Enderle’s recent work on text reuse (in the premodern Chinese and Star Wars fanfic contexts, respectively; see the paywalled article too), I don’t see much addressing the latter, which is hugely important. Most writing, maybe all writing, does not come from scratch, sui generis out of a writer’s mind. (That’s not to even get into issues of all the rest of the people involved in both published fiction and fanfic communities, if we’re going to talk about Scott’s work for example.) We’re missing the community that surrounds writing and publishing (and fanfic is published online, even if not in a traditional model), and reception, of course. And that’s probably not a criticism that originates with me.
When it comes to the structure of fiction, though, I think we’re alarmingly not paying attention to some of the most important elements for writers, and thus what constitutes what they write. Putting authorial intent aside — for example, trying to understand what is “behind” a novel — this is also on the surface of the novel in that the world is what makes up the environment in which the story is told. Anyone could tell you that interactions with the environment, and the shape of the environment as the infrastructure of what can happen within it, are just as fundamental to a work of fiction as the ostensible “plot” and “characters.” (And, to bring up another element of good writing, there’s the question of whether you can even separate them: these two books on character arcs and crafting emotional fiction come to mind as examples of how writers are told not to consider these things even remotely in isolation.)
And then there is the question of where stories come from in the first place. I’m not just talking about straight-up adaptation, although that’s a project I’d love to somehow make work between Meiji Japanese-language fiction and 19th-century English or French novels, that we may not realize are connected even now. (Many Meiji works are either what we’d now call “plagiarism” of foreign novels, or adaptations that are subtle enough that something like Ozaki Kōyō’s Konjiki yasha was not “discovered” to have been an adaptation until very recently.) How do we understand how writers generate their stories? Where are they taking the elements from that are important to them, that influence them, that they want to retell in some manner? Are there projects I’m not aware of (aside from the two I mentioned) that are going deep not into just straight-up obvious word or phrasing reuse, but … well, structure, device, or element adaptation and reuse?
These absolutely fundamental elements of fiction writing are not, I think, something that’s been ignored in traditional literary criticism (see, for example, the term “intertextuality”) but I don’t feel like they’ve made it into digital literary projects, at least not the most well-known and -discussed projects. But if I am wrong, and there are projects on world building, environmental elements, or intertextuality that I am missing, please let me know in the comments or via email (sendmailto@ this website) so I can check them out!
I’m teaching a workshop on Japanese text mining this week and am getting all kinds of interesting practical questions that I don’t know the answer to. Today, I was asked if it’s possible to batch convert .docx files to .txt in Windows.
I don’t know Windows, but I do know Mac OS, so I discovered that one can use textutil in the terminal to do this. Just run this line to convert .docx -> .txt:
textutil -convert txt /path/to/DOCX/files/*.docx
You can convert to a bunch of different formats, including txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive. It puts the files in the same directory as the source files. That’s it: enjoy!
* Note: This worked fine with UTF-8 files using Japanese, so I assume it just works with UTF-8 in general. YMMV.
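Since the original question was about Windows, where textutil is not available, here is a rough cross-platform sketch in Python. It is not what I used in the workshop. It relies only on the fact that a .docx file is a zip archive whose main text lives in word/document.xml, and it ignores formatting, footnotes, headers, and anything else outside the body paragraphs. The function names docx_to_text and convert_folder are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # Each run of text inside a paragraph sits in a <w:t> element
        paragraphs.append("".join(t.text or "" for t in para.iter(W + "t")))
    return "\n".join(paragraphs)


def convert_folder(folder):
    """Write a UTF-8 .txt file next to every .docx file in `folder`."""
    written = []
    for docx_path in sorted(Path(folder).glob("*.docx")):
        txt_path = docx_path.with_suffix(".txt")
        txt_path.write_text(docx_to_text(docx_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling convert_folder on a directory of .docx files mirrors the textutil behavior of putting the output next to the source files. For documents with tables, text boxes, or tracked changes, a dedicated library would likely do a better job than this sketch.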
I heard a remark the other day that struck a chord – or, churned my butter, a bit.
The gist of it was, “we should make digital facsimiles of our library materials (especially rare materials) and put them online, so they spark library use when people visit to see them in person, after becoming aware of them thanks to the digitized versions.”
Now, at Penn, we have digitized a couple of Japanese collections: Japanese Juvenile Fiction Collection (aka Tatsukawa Bunko series), Japanese Naval Collection (in-process, focused on Renshū Kantai training fleet materials), and a miscellaneous collection of Japanese rare books in general.* These materials have been used both in person (thanks to publicizing them, pre- and post-digitization, on library news sites, blogs, and social media as well as word-of-mouth), and also digitally by researchers who cannot travel to Penn. In fact, a graduate student in Australia used our juvenile fiction collection for part of his dissertation; another student in Wisconsin plans to use facsimiles of our naval materials once they’re complete; and faculty at University of Montana have used our digital facsimile of Meiji-period journal Hōbunkai-sui (or Hōbunkai-shi).
These researchers, due to distance and budget, will likely never be able to visit Penn in person to use the collections. On top of that, some items – like the juvenile fiction and lengthy government documents related to the Imperial Navy – don’t lend themselves to using in a reading room. These aren’t artifacts to look over one page at a time, but research materials that will be read extensively (rather than “intensively,” a distinction we book history folks make). Thus, this is the only use they can make of our materials.
The digitization of Japanese collections at Penn has invited use and a kind of library visit by virtue of being available for researchers worldwide, not just those who are at Penn (who could easily view them in person and don’t “need” a digital facsimile), or who can visit the library to “smell” the books (as the person I paraphrased put it). I think it’s more important to be able to read, research, and use these documents than to smell or witness the material artifact. Of course, there are cases in which one would want to do that, but by and large, our researchers care more about the content and visual aspects of the materials – things that can be captured and conveyed in digital images – rather than touching or handling them.
Isn’t this use, just as visiting the library in person is use? Shouldn’t we be tracking visits to our digital collections, downloads, and qualitative stories about their use in research, just as we do a gate count and track circulation? I think so. As we think about the present and future of libraries, and people make comments about their not being needed because libraries are on our smartphones (like libraries of fake news, right?), we must make the argument for providing content both physically and virtually. Who do people think is providing the content for their digital libraries? Physical libraries, of course! Those collections exist in the real world and come from somewhere, with significant investments of money, time, and labor involved – and moreover, it is the skilled and knowledgeable labor of professionals that is required.
On top of all of this, I feel it is most important to own up to what we can and cannot “control” online: our collections, by virtue of being able to be released at all, are largely in the public domain. Let’s not put CC licenses on them except for CC-0 (which is explicitly marking materials as public domain), pretending we can control the images when we have no legal right to (but users largely don’t know that). Let’s allow for free remixing and use without citing the digital library/archive it came from, without getting upset about posts on Tumblr. When you release public domain materials on the web (or through other services online), you are giving up your exclusive right to control the circumstances under which people use it – and as a cultural heritage institution, it is your role to perform this service for the world.
But not only should we provide this service, we should take credit for it: take credit for use, visits, and for people getting to do whatever they want with our collections. That is really meaningful and impactful use.
* Many thanks to Michael Williams for his great blog posts about our collections!
As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.
What am I working on these days? Well, one thing is working with the Taiyō magazine corpus (1895-1925, selected articles) from NINJAL, released on CD about 10 years ago but currently being prepared for web release. In addition, I should note that Taiyō has been reproduced digitally as a paid resource through JKBooks (on the JapanKnowledge+ platform).
Taiyō was a general-interest magazine spanning Meiji through Taishō periods in Japan, with articles on all topics as well as fiction, and innovative for its time in 1895 with the use of lithography to reproduce pages of photographs. (And let me tell you, they were random at the time: battleships, various nations’ viceroys, stuff like that. I’m not making this up.) Unfortunately, the text-only nature of my project doesn’t reflect the cool printing technology and visual nature of the magazine, but I was wondering: what can I do with just the text of the articles and the metadata kindly provided by NINJAL (including genre by NDL classification and style of writing)?
Because I’m working on another project (under wraps and in very beginning stages at the moment) involving periodicals in the Japanese empire, I was already thinking about this question. I hit upon something very basic but an important topic: what language did Japanese publications use to talk about Japan at the time? With “Japan” in the early 20th century, we can think of both a nation and an empire, with blurred and constantly shifting boundaries. Over the span of Taiyō’s publication, Japan annexed both Korea and Taiwan, increased hostilities with China, and battled (and defeated) Russia in the Russo-Japanese War (thus gaining some territories there). There was a lot going on to keep Japan’s borders in flux, and make Japanese question the limits and definition of their “nation.”
Especially because of the discourse in the early 20th century of naichi 内地 (inner lands or “home islands”, referring to the archipelago of Japan we know today) and gaichi 外地 (outer lands or “colonies”, referring to Korea/Taiwan), which are both subsumed under the name of Japan, I’m really interested in how those terms were being used, other terms that might have been used as well, and what qualities and relationships were associated with them. How did Japanese define these areas and how did it change over time? While I can’t get in the minds of people in the imperial period, I can take a look at one of its most popular magazines, intended for a broad audience, to see at least the public, print discourse of the nation and empire.
How to work with it, though? That’s where I’m still just beginning. It’s a daunting project in some ways. For example, I am not a linguist, let alone a Japanese linguist. I haven’t specialized in this period in the past, so keywords for territories will take some research on my part (for example, there were multiple names for Taiwan at the time in addition to the gaichi reference). Moreover, the corpus is 1.2 GB in UTF-8 text (which I converted from sentence-tokenized XML to word-tokenized, non-tagged text). It breaks Voyant Server and Topic Modeling Tool on my machine with 12 GB RAM when attempting to analyze the whole thing at once. Of course, I could split it up, but then that raises another methodological question: how and why to split it up? What divisions should I use: years, genres, authors, etc.? Right now I have it in text files by article, but could combine those articles in any number of ways.
I am also stymied by methodologies for analysis, but my plan at the moment is to start by doing some basic visualizations of the articles, in different groupings, as an exploration of what kind of things people talked about in Taiyō over time. Are they even talking about the nation? When they talk about naichi what kinds of things do they associate with those territories, as opposed to gaichi? Is the distinction changing, and is it even a reliable distinction?
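To make that first exploration concrete, here is a minimal sketch of counting naichi and gaichi by grouping. It is only an illustration under several assumptions: each article is already whitespace-tokenized, the tokenizer kept 内地 and 外地 as single tokens, and each article comes with a group label such as its year. The function names and the (label, text) input shape are hypothetical, not the NINJAL format.

```python
from collections import Counter, defaultdict

# Terms of interest; other names for the territories could be added later
TERMS = ("内地", "外地")


def term_counts_by_group(articles, terms=TERMS):
    """Count term occurrences and total tokens per group.

    articles: iterable of (group_label, whitespace_tokenized_text) pairs.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for group, text in articles:
        tokens = text.split()
        totals[group] += len(tokens)
        for tok in tokens:
            if tok in terms:
                counts[group][tok] += 1
    return counts, totals


def per_10k(counts, totals, term):
    """Relative frequency of `term` per 10,000 tokens in each group."""
    return {
        group: 10000 * counts[group][term] / totals[group]
        for group in sorted(totals)
        if totals[group]
    }
```

Swapping the group label from year to genre or author is just a matter of what goes in the first element of each pair, which is one way to try several divisions of the corpus without committing to one. Because articles are read one at a time, nothing requires loading the whole 1.2 GB at once.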
As a Price Lab Fellow this year at Penn, I hope to explore these questions and start to nail down what I want to analyze in more detail over time in Taiyō — and hopefully gain some insight into the language of empire in Japan 1895-1925.
In addition I’ll be presenting about this at a workshop at the University of Chicago in November, so if you’re in the area please attend and help me figure all this out!
I just got off the phone with a researcher this morning who is interested in looking at sentiment analysis on a corpus of fiction, specifically by having some native speakers of Japanese (I think) tag adjectives as positive or negative, then look at the overall shape of the corpus with those tags in mind.
A while back, I wrote a paper about geoparsing and sentiment analysis for a class, describing a project I worked on. Talking to this researcher made me think back to this project – which I’m actually currently trying to rewrite in Python and then make work on some Japanese, rather than Victorian English, texts – and my own definition of sentiment analysis for humanistic inquiry.*
How is my definition of sentiment analysis different? How about I start with the methodology? What I did was look for salient adjectives, which I searched for by looking at the most “salient” nouns (not necessarily the most frequent, but I need to refine my heuristics) and then the adjectives that appeared next to them. I also used WordNet to look for words related to these adjectives and nouns, to expand my search beyond just those specific words to ones with similar meaning that I might have missed. In particular, I looked at hypernyms (broader terms) and synonyms of nouns, and synonyms of adjectives.
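The adjacency step might be sketched roughly as follows. This is not the original project code: it assumes the text has already been part-of-speech tagged by some tagger into (word, tag) pairs, that the tags are named "ADJ" and "NOUN", and that the list of salient nouns has been chosen separately. The WordNet expansion is left out entirely.

```python
from collections import Counter, defaultdict


def adjectives_near_nouns(tagged_tokens, salient_nouns):
    """Collect adjectives directly before or after each salient noun.

    tagged_tokens: list of (word, tag) pairs; the tag names "ADJ" and
    "NOUN" are an assumption about the tagger's output.
    Returns {noun: Counter({adjective: count})}.
    """
    salient = {noun.lower() for noun in salient_nouns}
    found = defaultdict(Counter)
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag != "NOUN" or word.lower() not in salient:
            continue
        # Look one token to the left and one to the right of the noun
        for j in (i - 1, i + 1):
            if 0 <= j < len(tagged_tokens) and tagged_tokens[j][1] == "ADJ":
                found[word.lower()][tagged_tokens[j][0].lower()] += 1
    return found
```

The result keeps the adjectives grouped by noun rather than scoring them, which fits the idea of gathering representative descriptions for a human to read instead of assigning a positive or negative label.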
My method of sentiment analysis ends up looking more like automatic summarization than the positive-negative sentiment analysis that we more frequently encounter, even in humanistic work such as Matt Jockers’s recent research. I argue, of course, that my method is somewhat more meaningful. I consider all adjectives to be sentiment words, because they carry subjective judgment (even something that’s kind of green might be described by someone else as also kind of blue). And I’m more interested in the character of subjective judgment than whether it should be able to be considered ‘objectively’ as positive or negative (something I don’t think is really possible in humanistic inquiry, and even in business applications). In other words, if we have to pick out the most representative feelings of people about what they’re experiencing, what are they feeling about that experience?
After all, can you really say that weather is good or bad, that there being a lot of farm fields is good or bad? I looked at 19th-century British women’s travel narratives of “exotic” places, and I found that their sentiment was often just observations about trains and the landscape and the people. They didn’t talk about whether they were feeling positively or negatively about those things; rather, they gave us their subjective judgment of what those things were like.
My take on sentiment analysis, then, is clearly that we need to introduce human judgment to the end of the process, perhaps gathering these representative phrases and adjectives (I lean toward phrases or even whole sentences) and then deciding what we can about them. I don’t even think a human interlocutor could put down a verdict of positive or negative on these observations and judgments – sentiments – that the women had about their experiences and environments. If not even a human could do it, and humans write and train the algorithms, how can the computer do it?
Is there even a point? Does it matter if it’s possible or not? We should be looking for something else entirely.
(I really need to get cracking on this project. Stay tuned for the revised methodology and heuristics, because I hope to write more and share code here as I go along.)
* I’m also trying to write a more extensive and revised paper on this, meant for the new incarnation of LLC.
Several years ago, I attended Digital Humanities 2011 at Stanford and had the opportunity to meet with Franco Moretti. When Franco asked what I was interested in, I admitted that I badly wanted to see the Literary Lab I’d heard so much about, and seen so much interesting research come out of. He laughed and said he’d show it to me, but that I shouldn’t get too excited.
Why? Because Literary Lab is a windowless conference room in the middle of the English department at Stanford. Literary Lab is a room with a whiteboard.
I couldn’t have been more excited, to Franco’s amusement.
A room with a whiteboard. A room dedicated to talking about projects, to collaborating, to bringing a laptop and getting research done, and to sharing and brainstorming via drawing and notes up on a wall, not on a piece of paper or a shared document. It was an important moment for me.
When I was in graduate school, I’d tossed around a number of projects with colleagues, and gotten excited about a lot of them. But they always petered out, lost momentum, and disappeared. This is surely due to busy schedules and competing projects – not least the dissertation – but I think it’s also partly due to logistics.
Much as our work has gone online, and despite these being digital projects – just like Literary Lab’s research – a physical space is still hugely important. A space to talk, a space to brainstorm and draw and write, a space to work together: a space to keep things going.
I had been turning this over in my head ever since I met with Franco, but never had the opportunity to put my idea into action. Then I came to Penn, and met a like-minded colleague who got just as excited about the idea of dedicated space and collective work on projects as I was.
Our boss thought the idea of a room with a whiteboard was funny, just as Franco had thought my low standards were kind of silly. But you know what? You don’t need a budget to create ideas and momentum. You don’t need a budget to stimulate discussion and cross-disciplinary cooperation. You just need space and time, and willing participants who can make use of it. We made a proposal, got the go-ahead, and took advantage of a new room in our Kislak Center at Penn that was free for an hour and a half a week. It was enough: the Vitale II lab is a room with a whiteboard. It even has giant TVs to hook up a laptop.
Thus, WORD LAB was born: a text-analysis interest group that just needed space to meet, and people to populate it. We recruited hard, mailing every department and discipline list we could think of, and got a mind-boggling 15+ people at the first meeting, plus the organizers and some interested library staff, from across the university. The room was full.
That was the beginning of September 2014. WORD LAB is still going strong, with more formal presentations every other week, interspersed with journal club/coding tutorials/etc. in OPEN LAB on the other weeks. We get a regular attendance of at least 7-10 people a week, and the faces keep changing. It’s a group of Asianists, an Islamic law scholar, Annenberg School for Communication researchers, political scientists, psychologists, and librarians, some belonging to more than one group. We’ve had presentations from Penn staff and from researchers at other universities in the region, and we have upcoming Skype presentations from Chicago and Northeastern.
A room with a whiteboard has turned into a budding cross-disciplinary, cross-professional text analysis interest community at Penn.
I recently wrote a small script to perform a couple of functions for pre-processing Aozora Bunko texts (text files of public domain, modern Japanese literature and non-fiction) to be used with Western-oriented text analysis tools, such as Voyant, other TAPoR tools, and MALLET. Whereas Japanese text analysis software focuses largely on linguistics (tagging parts of speech, lemmatizing, etc.), Western tools open up possibilities for visualization, concordances, topic modeling, and other various modes of analysis.
Why do these Aozora texts need to be processed? Well, a couple of issues.
- They contain ruby, which are basically glosses of Chinese characters that give their pronunciation. These can be straightforward pronunciation help, or actually different words that give added meaning and context. While I have my issues with removing ruby, it’s impossible to do straightforward tool-based analysis without removing it, and many people who want to do this kind of analysis want it to be removed.
- The Aozora files are not exactly plain text: they’re HTML. The HTML tags and Aozora metadata (telling where the text came from, for example) need to be removed before analysis can be performed.
- There are no spaces between words in Japanese, but Western text analysis tools identify words by looking at where there are spaces. Without inserting spaces, it looks like each line is one big word. So I needed to insert spaces between the Japanese words.
How did I do it? My approach, because of my background and expertise, was to create a Python script that used a couple of helpful libraries, including BeautifulSoup for ruby removal based on HTML tags, and TinySegmenter for inserting spaces between words. My script requires you to have these packages installed, but it’s not a big deal to do so. You then run the script in a command line prompt. The way it works is to look for all .html files in a directory, load them and run the pre-processing, then output each processed file under the same filename with a .txt extension, as a plain-text, UTF-8 encoded file.
The first step in the script is to remove the ruby. Helpfully, the ruby is contained in several HTML tags. I had BeautifulSoup traverse the file and remove all elements contained within these tags; it removes both the tags and content.
Next, I used a very simple regular expression to remove everything in angle brackets – i.e. the HTML tags. This is kind of quick and dirty, and won’t work on every file in the universe, but in Aozora texts everything inside an angle bracket is an HTML tag, so it’s not a problem here.
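These first two steps can be approximated with the standard library alone. To be clear, this is a stand-in and not the script itself, which used BeautifulSoup for the ruby step; here both steps are done with regular expressions. It also assumes the ruby glosses sit in rt and rp elements, which is how Aozora XHTML files usually mark them up, and it does not unescape HTML entities.

```python
import re

# Assumed ruby markup:
#   <ruby><rb>漢字</rb><rp>（</rp><rt>かんじ</rt><rp>）</rp></ruby>
# The gloss and its parentheses are in <rt> and <rp>; the base text is in <rb>.
RUBY_GLOSS = re.compile(r"<(rt|rp)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)

# Quick and dirty: anything between angle brackets is treated as a tag
ANY_TAG = re.compile(r"<[^>]+>")


def strip_ruby_and_tags(html):
    """Remove ruby glosses (tags and content), then all remaining tags."""
    without_gloss = RUBY_GLOSS.sub("", html)
    return ANY_TAG.sub("", without_gloss)
```

The order matters: the glosses have to go while their tags are still present, since once the tags are stripped there is no way to tell a gloss from the main text.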
Finally, I used TinySegmenter on the resulting HTML-free text to split the text into words. Luckily for me, it returns an array of words – basically, each word is a separate element in a list like [‘word1’, ‘word2’, … ‘wordn’] for n words. This makes my life easy for two reasons. First, I simply joined the array with a space between each word, creating one long string (the outputted text) with spaces between each element in the array (words). Second, it made it easy to just remove the part of the array that contains Aozora metadata before creating that string. Again, this is quick and dirty, but from examining the files I noted that the metadata always comes at the end of the file and begins with the word 底本 (‘source text’). Remove that word and everything after it, and then you have a metadata-free file.
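In code, the list handling described here could look something like the following minimal sketch. It takes the list of words as input, however it was produced, and assumes the segmenter keeps 底本 as a single token. The function name is made up for illustration.

```python
def join_without_metadata(tokens, marker="底本"):
    """Drop the metadata marker and everything after it, then join with spaces.

    tokens: list of words, e.g. the list a segmenter returns.
    """
    if marker in tokens:
        # Cut at the first occurrence of the marker
        tokens = tokens[:tokens.index(marker)]
    # Skip tokens that are only whitespace (such as newlines)
    return " ".join(tok for tok in tokens if tok.strip())
```

One caveat that follows from the quick-and-dirty approach: if 底本 happens to appear in the body of a text, this cuts the text off early, so it is worth spot-checking the output lengths.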
Write this resulting text into a plain text file, and you have a non-ruby, non-HTML, metadata-free, whitespace-delimited Aozora text! Although you still have to download all the Aozora files individually and then do what you will with the resulting individual text files, it’s an easy way to pre-process this text and get it ready for tool-based (and also your-own-program-based) text analysis.
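For the directory-level part (find every .html file, process it, write a .txt file with the same name), a rough sketch follows. The process argument stands for whatever cleaning function you want to apply, and the default input encoding of Shift_JIS is an assumption about how the downloaded Aozora files are encoded, so check your own files. The function name is hypothetical.

```python
from pathlib import Path


def preprocess_directory(directory, process, input_encoding="shift_jis"):
    """Run `process` on every .html file in `directory`.

    Writes each result as a UTF-8 .txt file with the same base filename
    and returns the list of output paths.
    """
    outputs = []
    for html_path in sorted(Path(directory).glob("*.html")):
        # errors="replace" keeps going if a byte cannot be decoded
        raw = html_path.read_text(encoding=input_encoding, errors="replace")
        out_path = html_path.with_suffix(".txt")
        out_path.write_text(process(raw), encoding="utf-8")
        outputs.append(out_path)
    return outputs
```

Passing the cleaning step in as a function keeps the file handling separate from the text processing, so the same loop works if you later change how ruby or metadata are handled.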
I plan to put the script on GitHub for your perusal and use (and of course modification) but for now, check it out on my Japanese Text Analysis research guide at Penn.
Another “digital” thing I’ve been doing that relates to the “humanities” (but is it even remotely DH? I don’t know) is the creation of research guides for digital resources in Japanese studies of all kinds, with a focus on Japanese-language free websites and databases, and open-access publications.
So far, I’ve been working hard on creating guides for electronic Japanese studies resources, and mobile apps easily accessible in the US for both Android and iOS that relate to Japanese research or language study. The digital resources guide covers everything from general digital archives and citation indexes to literature, art, history, pop culture, and kuzushiji resources (for reading handwritten pre- and early modern documents). They range from text and image databases to dictionaries and even YouTube videos and online courseware for learning classical Japanese and how to read manuscripts.
This has been a real challenge, as you can imagine. Creating lists of stuff is one thing (and is one thing I’ve done for Japanese text analysis resources), but actually curating them and creating the equivalent of annotated bibliographies is quite another. It’s been a huge amount of research and writing – both in discovery of sources, and also in investigating and evaluating them, then describing them in plain terms to my community. I spent hours on end surfing the App and Play Stores and downloading/trying countless awful free apps – so you don’t have to!
It’s especially hard to find digital resources in ways other than word of mouth. I find that I end up linking to other librarians’ LibGuides (i.e. research guides) often because they’ve done such a fantastic job curating their own lists already. I wonder sometimes if we’re all just duplicating each other’s efforts! The NCC has a database of research guides, yes, but would it be better if we all collaboratively edited just one? Would it get overwhelming? Would there be serious disagreements about how to organize, whether to include paid resources (and which ones), and where to file things?
The answer to all these questions is probably yes, which creates problems. Logistically, we can’t have every Japanese librarian in the English-speaking world editing the same guide anyway. So it’s hard to say what the solution is – keep working in our silos? Specialize and tell our students and faculty to Google “LibGuide Japanese” + topic? (Which is what I’ve done in the past with art and art history.) Search the master NCC database? Some combination is probably the right path.
Until then, I will keep working on accumulating as many kuzushiji resources as I can for Penn’s reading group, and updating my mobile app guide if I ever find a decent まとめ!