I recently read J. Scott Miller’s Adaptations of Western Literature in Meiji Japan (New York: Palgrave, 2001) and am full of Thoughts on Meiji writers, literature, zeitgeist, continuity, and adaptation. Let me express some of them here.
I’m teaching a workshop on Japanese text mining this week and am getting all kinds of interesting practical questions that I don’t know the answer to. Today, I was asked if it’s possible to batch convert .docx files to .txt in Windows.
I don’t know Windows, but I do know Mac OS, and there I discovered that one can use textutil in the terminal to do this. Just run this line to convert .docx -> .txt:
textutil -convert txt /path/to/DOCX/files/*.docx
You can convert to a bunch of different formats, including txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive. It puts the files in the same directory as the source files. That’s it: enjoy!
* Note: This worked fine with UTF-8 files using Japanese, so I assume it just works with UTF-8 in general. YMMV.
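If your .docx files are nested in subfolders, the single glob above only reaches one directory. A small Python wrapper can walk the whole tree and call textutil on each file – this is just a sketch assuming macOS with textutil available, and the root path is a placeholder:

```python
import os
import subprocess

def build_textutil_cmd(docx_path, fmt="txt"):
    """Build the textutil argument list for converting one file."""
    return ["textutil", "-convert", fmt, docx_path]

def convert_tree(root):
    """Walk root and convert every .docx found; textutil writes the
    output next to each source file by default."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".docx"):
                subprocess.run(build_textutil_cmd(os.path.join(dirpath, name)),
                               check=True)

# convert_tree("/path/to/DOCX/files")  # uncomment and run on macOS
```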
I’ve been meaning to write about my writing process for quite a while now and am surprised, looking back through my blog archives, that I have not yet addressed it.
This post could alternately be titled “How NaNoWriMo Enabled Me to Write My Dissertation in Three and a Half Months” or “The Importance of NaNoWriMo for Academic Writing.” Or just “Do NaNoWriMo at Least Once, People.”
NaNoWriMo stands for “National Novel Writing Month” and has been going since the turn of the twenty-first century. I’ve done it myself since 2002, most years. No, I don’t have a published novel, and in fact I only finished two of them in that time. (And the first one didn’t even “win” — the only criterion for winning is having a file containing 50,000 words — because it came in at about 40,000 words when it was done. Oh well. It was my first finished work, and my best, so I’m cool with it. In fact, I’m still working on revising that work and trying to cut a version of it into a 10,000-word short story.) But man, what I got out of it.
NaNoWriMo taught me how to write. I don’t mean how to write well, or grammar or mechanics or plot or anything like that. It taught me how to put words on the page. And, after all, that is the first step to writing something. You have to just start making words.
I heard a remark the other day that struck a chord – or churned my butter a bit.
The gist of it was, “we should make digital facsimiles of our library materials (especially rare materials) and put them online, so they spark library use when people visit to see them in person, after becoming aware of them thanks to the digitized versions.”
Now, at Penn, we have digitized a couple of Japanese collections: Japanese Juvenile Fiction Collection (aka Tatsukawa Bunko series), Japanese Naval Collection (in-process, focused on Renshū Kantai training fleet materials), and a miscellaneous collection of Japanese rare books in general.* These materials have been used both in person (thanks to publicizing them, pre- and post-digitization, on library news sites, blogs, and social media as well as word-of-mouth), and also digitally by researchers who cannot travel to Penn. In fact, a graduate student in Australia used our juvenile fiction collection for part of his dissertation; another student in Wisconsin plans to use facsimiles of our naval materials once they’re complete; and faculty at University of Montana have used our digital facsimile of Meiji-period journal Hōbunkai-sui (or Hōbunkai-shi).
These researchers, due to distance and budget, will likely never be able to visit Penn in person to use the collections. On top of that, some items – like the juvenile fiction and lengthy government documents related to the Imperial Navy – don’t lend themselves to use in a reading room. These aren’t artifacts to look over one page at a time, but research materials that will be read extensively (rather than “intensively,” a distinction we book history folks make). Thus, digital facsimiles are the only use these researchers can make of our materials.
The digitization of Japanese collections at Penn has invited use and a kind of library visit by virtue of being available for researchers worldwide, not just those who are at Penn (who could easily view them in person and don’t “need” a digital facsimile), or who can visit the library to “smell” the books (as the person I paraphrased put it). I think it’s more important to be able to read, research, and use these documents than to smell or witness the material artifact. Of course, there are cases in which one would want to do that, but by and large, our researchers care more about the content and visual aspects of the materials – things that can be captured and conveyed in digital images – rather than touching or handling them.
Isn’t this use, just as much as visiting the library in person is? Shouldn’t we be tracking visits to our digital collections, downloads, and qualitative stories about their use in research, just as we do a gate count and track circulation? I think so. As we think about the present and future of libraries, and people make comments about their not being needed because libraries are on our smartphones (like libraries of fake news, right?), we must make the argument for providing content both physically and virtually. Who do people think is providing the content for their digital libraries? Physical libraries, of course! Those collections exist in the real world and come from somewhere, with significant investments of money, time, and labor involved – and moreover, it is the skilled and knowledgeable labor of professionals that is required.
On top of all of this, I feel it is most important to own up to what we can and cannot “control” online: our collections, by virtue of being able to be released at all, are largely in the public domain. Let’s not put CC licenses on them except for CC-0 (which is explicitly marking materials as public domain), pretending we can control the images when we have no legal right to (but users largely don’t know that). Let’s allow for free remixing and use without citing the digital library/archive it came from, without getting upset about posts on Tumblr. When you release public domain materials on the web (or through other services online), you are giving up your exclusive right to control the circumstances under which people use it – and as a cultural heritage institution, it is your role to perform this service for the world.
But not only should we provide this service, we should take credit for it: take credit for use, visits, and for people getting to do whatever they want with our collections. That is really meaningful and impactful use.
* Many thanks to Michael Williams for his great blog posts about our collections!
As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.
What am I working on these days? Well, one thing is working with the Taiyō magazine corpus (1895-1925, selected articles) from NINJAL, released on CD about 10 years ago but currently being prepared for web release. In addition, I should note that Taiyō has been reproduced digitally as a paid resource through JKBooks (on the JapanKnowledge+ platform).
Taiyō was a general-interest magazine spanning Meiji through Taishō periods in Japan, with articles on all topics as well as fiction, and innovative for its time in 1895 with the use of lithography to reproduce pages of photographs. (And let me tell you, they were random at the time: battleships, various nations’ viceroys, stuff like that. I’m not making this up.) Unfortunately, the text-only nature of my project doesn’t reflect the cool printing technology and visual nature of the magazine, but I was wondering, what can I do with just the text of the articles and metadata kindly provided by NINJAL (including genre by NDL classification and style of writing).
Because I’m working on another project (under wraps and in very beginning stages at the moment) involving periodicals in the Japanese empire, I was already thinking about this question. I hit upon a very basic but important topic: what language did Japanese publications use to talk about Japan at the time? With “Japan” in the early 20th century, we can think of both a nation and an empire, with blurred and constantly shifting boundaries. Over the span of Taiyō‘s publication, Japan annexed both Korea and Taiwan, increased hostilities with China, and battled (and defeated) Russia in the Russo-Japanese War (thus gaining some territories there). There was a lot going on to keep Japan’s borders in flux, and to make Japanese readers question the limits and definition of their “nation.”
Especially because of the discourse in the early 20th century of naichi 内地 (inner lands or “home islands”, referring to the archipelago of Japan we know today) and gaichi 外地 (outer lands or “colonies”, referring to Korea/Taiwan), which are both subsumed under the name of Japan, I’m really interested in how those terms were being used, other terms that might have been used as well, and what qualities and relationships were associated with them. How did Japanese define these areas and how did it change over time? While I can’t get in the minds of people in the imperial period, I can take a look at one of its most popular magazines, intended for a broad audience, to see at least the public, print discourse of the nation and empire.
How to work with it, though? That’s where I’m still just beginning. It’s a daunting project in some ways. For example, I am not a linguist, let alone a Japanese linguist. I haven’t specialized in this period in the past, so keywords for territories will take some research on my part (for example, there were multiple names for Taiwan at the time in addition to the gaichi reference). Moreover, the corpus is 1.2 GB in UTF-8 text (which I converted from sentence-tokenized XML to word-tokenized, non-tagged text). It breaks Voyant Server and Topic Modeling Tool on my machine with 12 GB RAM when attempting to analyze the whole thing at once. Of course, I could split it up, but then that raises another methodological question: how and why to split it up? What divisions should I use: years, genres, authors, etc.? Right now I have it in text files by article, but could combine those articles in any number of ways.
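Since the corpus is already one text file per article, splitting it up is mostly a question of grouping filenames by some metadata field. A minimal sketch of that step, with made-up filenames and a toy dict standing in for the real NINJAL metadata:

```python
from collections import defaultdict

def bin_articles(article_meta, by="year"):
    """Group article filenames into bins by one metadata field
    (e.g. year, genre, author)."""
    bins = defaultdict(list)
    for fname, meta in article_meta.items():
        bins[meta[by]].append(fname)
    return dict(bins)

# Toy stand-in for the real per-article metadata:
meta = {
    "t1895_001.txt": {"year": 1895, "genre": "fiction"},
    "t1895_002.txt": {"year": 1895, "genre": "politics"},
    "t1901_001.txt": {"year": 1901, "genre": "fiction"},
}
```

Each bin’s files could then be concatenated into a single text small enough for Voyant or the Topic Modeling Tool, so the methodological question (years vs. genres vs. authors) becomes just a change of the `by` argument.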
I am also stymied by methodologies for analysis, but my plan at the moment is to start by doing some basic visualizations of the articles, in different groupings, as an exploration of what kind of things people talked about in Taiyō over time. Are they even talking about the nation? When they talk about naichi what kinds of things do they associate with those territories, as opposed to gaichi? Is the distinction changing, and is it even a reliable distinction?
As a Price Lab Fellow this year at Penn, I hope to explore these questions and start to nail down what I want to analyze in more detail over time in Taiyō — and hopefully gain some insight into the language of empire in Japan 1895-1925.
In addition I’ll be presenting about this at a workshop at the University of Chicago in November, so if you’re in the area please attend and help me figure all this out!
While my research diary has stalled out because I haven’t been researching (other than some administrative tasks like collecting and organizing article PDFs, and typing notes into Mendeley), I have made some progress on updating my website.
Specifically, I have switched over to using Jekyll, software that converts Markdown/HTML and Sass/CSS into static web pages. Why did I want to do this? Because I want a consistent header and footer (navigation and that blurb at the bottom of every page) across the whole site, but don’t want to manually edit every single file each time I update one of those, or change the site structure/design. I also didn’t want to use PHP, because then all my files would be .php and, on top of that, it feels messier. I like static HTML a lot.
I’m just writing down my notes here for others who might want to use it too. I’ve only found tutorials that talk about how to publish your site to GitHub Pages, but I have my own hosting. I also already had a full static site coded in HTML and CSS, so I didn’t want to start all over again with Markdown. (Markdown is just a different markup language from HTML; from what I can tell, you can’t get nearly the flexibility or semantic markup into your Markdown documents that you can with HTML, so I’m sticking with the latter.) I wondered: all these tutorials show you how to do it from scratch, but will it be difficult to convert an existing HTML/CSS site into a Jekyll-powered site?
The answer is: no. It’s really really easy. Just copy and paste from your old site into some broken-up files in the Jekyll directory, serve, and go.
I recommend following the beginning of this tutorial by Tania Rascia. This will help you get Jekyll installed and set up.
Then, if you want a website — not a blog — what you want to do is just start making “index.html”, “about.html”, folders with more .html files (or .md if you prefer), etc., in your Jekyll folder. These will all be generated as regular .html pages in the _site directory when you start the server, and will be updated as long as the server is running. It’ll all be structured how you set it up in the Jekyll folder. For my site, that means I have folders like “projects” and “guides” in addition to top-level pages (such as “index.html”).
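Concretely, each of those pages just gets a small YAML front matter block pointing at a shared layout (the layout name and page title here are placeholders for whatever you call yours):

```html
---
layout: default
title: Projects
---
<!-- Everything below the front matter is the page body,
     pasted straight from the old static .html file. -->
<h1>Projects</h1>
```

The shared header and footer then live once in _layouts/default.html, wrapped around the {{ content }} tag:

```html
<!-- _layouts/default.html -->
<html>
<head><!-- shared head, including the CSS link --></head>
<body>
  <nav><!-- shared navigation --></nav>
  {{ content }}
  <footer><!-- that blurb at the bottom of every page --></footer>
</body>
</html>
```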
Finally, start your server and generate all those static pages. Put your CSS file wherever the link in your head element points to on your web server. (I reference it by its full URL, starting with http://, because my site has multiple folders, and if I just put “mollydesjardin.css” the non-top-level files would not know where to find it.) Then upload all the files from _site to your server and voilà, you have your static website.
I do not “get” Git well enough yet to follow some more complicated instructions I found for automatically pushing my site to my hosting. What I’m doing – probably the simplest, if slightly cumbersome, solution – is to manually SFTP those files to my web server as I modify them. Obviously, I do not have to upload and overwrite every file every time; I just select the ones I created or modified in the _site directory and upload those.
Hope this is helpful for someone starting out with Jekyll, converting an existing HTML/CSS site.
Lately, I feel like I’m stuck in short-term thinking. While I hear “be in the moment” is a good thing, I’m overly in the moment. I’m having a hard time thinking long-term and planning out projects, let alone sticking to any kind of plan. Not that I have one.
A review of my dissertation recently went online, and of course some reactions to my sharing that were “what have you published in journals?” and “are you turning it into a book?” I graduated three years ago, and the dissertation was finished six months prior to that and handed in. This summer, I’ll be looking at four years of being “done” without much to show for the intervening time.
Of course, it’s hard to show something when you have a full-time job that doesn’t include research as a professional component. But if I want to do it for myself — and I do — that means that I need to come up with a non-job way to motivate myself and stay on track.
That brings me to the title of this post. My mother recently had a “meeting with herself” at the end of the work week to check in on what she meant to do and what actually happened. It sounds remarkably productive to me as a way to keep yourself 1) kind of on track, and 2) in touch with your own habits and aspirations. It’s easy to lose touch with those things in the weekly grind.
I decided I will have a weekend meeting with myself every week, and as part of that, write a narrative of what I did. I’ll write it before reviewing my list of aspirations for the previous week; then, when I compare the two, I won’t necessarily beat myself up over “not meeting goals,” but will use the comparison as an opportunity to refine my aspirations based on how I actually work (or don’t). As a part of that – to hold myself accountable and also to start a dialogue with others – I’ll be writing a cleaned-up version of that research diary once a week here. Don’t expect detailed notes, but do expect a diary of my process and the kinds of activities I engage in when doing research and writing.
I hope this can be helpful to a beginning researcher and spark some conversation with more experienced ones. While this is a personal journey of a sort, it is public, and I welcome your comments.
I just got off the phone with a researcher this morning who is interested in looking at sentiment analysis on a corpus of fiction, specifically by having some native speakers of Japanese (I think) tag adjectives as positive or negative, then look at the overall shape of the corpus with those tags in mind.
A while back, I wrote a paper about geoparsing and sentiment analysis for a class, describing a project I worked on. Talking to this researcher made me think back to this project – which I’m actually currently trying to rewrite in Python and then make work on some Japanese, rather than Victorian English, texts – and my own definition of sentiment analysis for humanistic inquiry.*
How is my definition of sentiment analysis different? How about I start with the methodology? What I did was look for salient adjectives, which I found by identifying the most “salient” nouns (not necessarily the most frequent ones – I still need to refine my heuristics) and then the adjectives that appeared next to them. I also used WordNet to find words related to these adjectives and nouns, expanding my search beyond those specific words to ones with similar meaning that I might have missed (in particular, I looked at hypernyms (broader terms) and synonyms of nouns, and synonyms of adjectives).
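A rough sketch of that pipeline in Python, using hand-tagged toy tokens and a tiny thesaurus dict standing in for WordNet (a real run would use a POS tagger and NLTK’s WordNet interface; the data and the frequency-based salience heuristic here are simplified stand-ins):

```python
from collections import Counter

def salient_nouns(tagged, top_n=2):
    """Rank nouns by frequency (a stand-in for a real salience heuristic)."""
    counts = Counter(tok for tok, pos in tagged if pos == "NOUN")
    return [tok for tok, _ in counts.most_common(top_n)]

def adjacent_adjectives(tagged, nouns):
    """Collect adjectives appearing directly before a salient noun."""
    adjs = set()
    for (tok, pos), (nxt, nxt_pos) in zip(tagged, tagged[1:]):
        if pos == "ADJ" and nxt_pos == "NOUN" and nxt in nouns:
            adjs.add(tok)
    return adjs

def expand(words, thesaurus):
    """Expand a word set with related terms (toy version of the
    WordNet synonym/hypernym lookup)."""
    out = set(words)
    for w in words:
        out.update(thesaurus.get(w, ()))
    return out
```

The result is a set of representative adjectives (and their near-relatives) attached to what a text talks about most, rather than a positive/negative score.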
My method of sentiment analysis ends up looking more like automatic summarization than the positive-negative sentiment analysis we more frequently encounter, even in humanistic work such as Matt Jockers’s recent research. I argue, of course, that my method is somewhat more meaningful. I consider all adjectives to be sentiment words, because they carry subjective judgment (even something that’s kind of green might be described by someone else as also kind of blue). And I’m more interested in the character of subjective judgment than in whether it can be considered ‘objectively’ positive or negative (something I don’t think is really possible in humanistic inquiry, or even in business applications). In other words, if we have to pick out the most representative feelings of people about what they’re experiencing, what are they feeling about that experience?
After all, can you really say that weather is good or bad, that there being a lot of farm fields is good or bad? I looked at 19th-century British women’s travel narratives of “exotic” places, and I found that their sentiment was often just observations about trains and the landscape and the people. They didn’t talk about whether they were feeling positively or negatively about those things; rather, they gave us their subjective judgment of what those things were like.
My take on sentiment analysis, then, is clearly that we need to introduce human judgment to the end of the process, perhaps gathering these representative phrases and adjectives (I lean toward phrases or even whole sentences) and then deciding what we can about them. I don’t even think a human interlocutor could put down a verdict of positive or negative on these observations and judgments – sentiments – that the women had about their experiences and environments. If not even a human could do it, and humans write and train the algorithms, how can the computer do it?
Is there even a point? Does it matter if it’s possible or not? We should be looking for something else entirely.
(I really need to get cracking on this project. Stay tuned for the revised methodology and heuristics, because I hope to write more and share code here as I go along.)
* I’m also trying to write a more extensive and revised paper on this, meant for the new incarnation of LLC.
After packing my copy of Japanese Cooking: Contemporary and Traditional (an awesome vegan cookbook) a little hastily before my upcoming move, I scoured the Internet for the basic udon/soba noodle broth recipe. To my surprise, it is not on the Oracle. So here I’ll provide the standard Japanese recipe for soba/udon broth for posterity.
- 4 cups any kind of broth (for example konbu-dashi, katsuo-dashi, or chicken, or fake-chicken as I used last night)
- 2-5 tablespoons light (usu-kuchi) soy sauce as desired for saltiness
- 1 tablespoon or so ryōri-shu (cooking sake, the cheap kind)
- 1 teaspoon mirin or sugar if you don’t have mirin handy
- 1/2 teaspoon salt
Simmer all this together for about 5 minutes and pour over the noodles. Adjust all the salty and sugary elements based on your taste. This makes enough for 2 servings.
You’re welcome, Internet!