As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.
What am I working on these days? Well, one thing is working with the Taiyō magazine corpus (1895-1925, selected articles) from NINJAL, released on CD about 10 years ago but currently being prepared for web release. In addition, I should note that Taiyō has been reproduced digitally as a paid resource through JKBooks (on the JapanKnowledge+ platform).
Taiyō was a general-interest magazine spanning the Meiji through Taishō periods in Japan, with articles on all topics as well as fiction. It was innovative for its time in 1895 in using lithography to reproduce pages of photographs. (And let me tell you, they were random at the time: battleships, various nations’ viceroys, stuff like that. I’m not making this up.) Unfortunately, the text-only nature of my project doesn’t reflect the cool printing technology and visual nature of the magazine, but I was wondering: what can I do with just the text of the articles and the metadata kindly provided by NINJAL (including genre by NDL classification and style of writing)?
Because I’m working on another project (under wraps and in very beginning stages at the moment) involving periodicals in the Japanese empire, I was already thinking about this question. I hit upon a very basic but important one: what language did Japanese publications use to talk about Japan at the time? With “Japan” in the early 20th century, we can think of both a nation and an empire, with blurred and constantly shifting boundaries. Over the span of Taiyō’s publication, Japan annexed both Korea and Taiwan, increased hostilities with China, and battled (and defeated) Russia in the Russo-Japanese War (thus gaining some territories there). There was a lot going on to keep Japan’s borders in flux, and to make Japanese people question the limits and definition of their “nation.”
Because of the early 20th-century discourse of naichi 内地 (inner lands or “home islands”, referring to the archipelago of Japan we know today) and gaichi 外地 (outer lands or “colonies”, referring to Korea/Taiwan), both of which are subsumed under the name of Japan, I’m especially interested in how those terms were being used, what other terms might have been used as well, and what qualities and relationships were associated with them. How did Japanese people define these areas, and how did that change over time? While I can’t get into the minds of people in the imperial period, I can take a look at one of its most popular magazines, intended for a broad audience, to see at least the public, print discourse of the nation and empire.
How to work with it, though? That’s where I’m still just beginning. It’s a daunting project in some ways. For example, I am not a linguist, let alone a Japanese linguist. I haven’t specialized in this period in the past, so keywords for territories will take some research on my part (for example, there were multiple names for Taiwan at the time in addition to the gaichi reference). Moreover, the corpus is 1.2 GB in UTF-8 text (which I converted from sentence-tokenized XML to word-tokenized, non-tagged text). It breaks Voyant Server and Topic Modeling Tool on my machine with 12 GB RAM when attempting to analyze the whole thing at once. Of course, I could split it up, but then that raises another methodological question: how and why to split it up? What divisions should I use: years, genres, authors, etc.? Right now I have it in text files by article, but could combine those articles in any number of ways.
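To make the "combine those articles in any number of ways" idea concrete, here is a minimal sketch of regrouping per-article text files by some metadata field. Everything about the metadata layout is an assumption for illustration: the function name `combine_articles`, the `path` entry, and field names like `year` or `genre` are hypothetical and do not reflect how the NINJAL metadata is actually structured.

```python
from collections import defaultdict
from pathlib import Path


def combine_articles(metadata, key, out_dir):
    """Concatenate article files into one output file per value of `key`.

    `metadata` is an iterable of dicts. Each dict needs a "path" entry
    pointing at one article's text file, plus whatever grouping fields
    you want to split on (hypothetical names such as "year" or "genre").
    Returns a dict mapping each group value (as a string) to its file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # First pass: decide which article files belong to which group.
    groups = defaultdict(list)
    for record in metadata:
        groups[str(record[key])].append(Path(record["path"]))

    # Second pass: write one combined file per group, reading a single
    # article at a time so the full corpus never has to fit in memory.
    written = {}
    for value, paths in groups.items():
        target = out_dir / f"{key}_{value}.txt"
        with target.open("w", encoding="utf-8") as out:
            for path in sorted(paths):
                text = path.read_text(encoding="utf-8")
                out.write(text.rstrip("\n") + "\n")
        written[value] = target
    return written
```

Calling `combine_articles(records, "year", "by_year")` would produce one file per year, and swapping the key to `"genre"` would regroup the same articles without touching the originals. That keeps the how-to-split question cheap to revisit, though it does nothing to answer which split is methodologically sound.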
I am also stymied by methodologies for analysis, but my plan at the moment is to start by doing some basic visualizations of the articles, in different groupings, as an exploration of what kind of things people talked about in Taiyō over time. Are they even talking about the nation? When they talk about naichi what kinds of things do they associate with those territories, as opposed to gaichi? Is the distinction changing, and is it even a reliable distinction?
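As a first, very rough pass at "are they even talking about the nation?", one option is simply to count how often naichi and gaichi appear in each grouping. The sketch below assumes the word-tokenized text has tokens separated by whitespace, which may not match the actual files, and the names `TERMS` and `counts_by_group` are made up for this example. It only matches exact tokens, so a compound that the tokenizer kept whole (say, 内地人) would not be counted.

```python
from collections import Counter, defaultdict

# Keywords to track; an assumption for illustration, and surely incomplete
# (other period names for these territories would need to be added).
TERMS = ("内地", "外地")


def counts_by_group(articles, terms=TERMS):
    """Count keyword tokens per group.

    `articles` is an iterable of (group, text) pairs, where `group` is
    any hashable label (a year, a genre, ...) and `text` is assumed to
    be whitespace-separated tokens. The special key "_tokens" holds the
    total token count for the group, for computing relative frequency.
    """
    wanted = set(terms)
    totals = defaultdict(Counter)
    for group, text in articles:
        for token in text.split():
            totals[group]["_tokens"] += 1
            if token in wanted:
                totals[group][token] += 1
    return totals


def per_10k(counter, term):
    """Occurrences of `term` per 10,000 tokens in one group's counter."""
    total = counter["_tokens"]
    return counter[term] * 10_000 / total if total else 0.0
```

Raw counts will mostly track how much text each year or genre contains, which is why the total token count is kept alongside them. A rate per 10,000 tokens is only one possible normalization, and it says nothing yet about what the terms are associated with.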
As a Price Lab Fellow this year at Penn, I hope to explore these questions and start to nail down what I want to analyze in more detail over time in Taiyō, and hopefully gain some insight into the language of empire in Japan from 1895 to 1925.
In addition, I’ll be presenting on this at a workshop at the University of Chicago in November, so if you’re in the area please attend and help me figure all this out!