Category Archives: academia

Convert docs with OS X terminal

I’m teaching a workshop on Japanese text mining this week and am getting all kinds of interesting practical questions that I don’t know the answer to. Today, I was asked if it’s possible to batch convert .docx files to .txt in Windows.

I don’t know Windows, but I do know Mac OS, so I discovered that one can use textutil in the terminal to do this. Just run this line to convert .docx -> .txt:

textutil -convert txt /path/to/DOCX/files/*.docx

You can convert to a bunch of different formats, including txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive. It puts the files in the same directory as the source files. That’s it: enjoy!

* Note: This worked fine with UTF-8 files using Japanese, so I assume it just works with UTF-8 in general. YMMV.
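
For the Windows side of the original question, where textutil is not available, here is a minimal sketch in Python that uses only the standard library. It assumes each .docx is an ordinary Word file (a zip archive containing word/document.xml) and pulls out only paragraph text from the main body; tabs, manual line breaks, footnotes, headers, and text boxes are not handled. The function names docx_to_text and convert_folder are my own, made up for illustration, and this is not a replacement for textutil's full conversion.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: reads only word/document.xml and ignores
    footnotes, headers, tabs, and manual line breaks.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # Join every text run (w:t element) found in this paragraph.
        paragraphs.append("".join(node.text or "" for node in para.iter(W + "t")))
    return "\n".join(paragraphs)


def convert_folder(folder):
    """Write a UTF-8 .txt file next to every .docx file in `folder`."""
    written = []
    for docx_path in sorted(Path(folder).glob("*.docx")):
        txt_path = docx_path.with_suffix(".txt")
        txt_path.write_text(docx_to_text(docx_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling convert_folder on a directory of .docx files mirrors what the textutil line does: the .txt files land in the same directory as the sources. Writing the output explicitly as UTF-8 is a deliberate choice here, since the workshop material was Japanese.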

Writing Process: NaNoWriMo and Me

I’ve been meaning to write about my writing process for quite a while now and am surprised, looking back through my blog archives, that I have not yet addressed it.

This post could alternately be titled “How NaNoWriMo Enabled Me to Write My Dissertation in Three and a Half Months” or “The Importance of NaNoWriMo for Academic Writing.” Or just “Do NaNoWriMo at Least Once, People.”

NaNoWriMo stands for “National Novel Writing Month” and has been going since the turn of the twenty-first century. I’ve done it myself since 2002, most years. No, I don’t have a published novel, and in fact I only finished two of them in that time. (And the first one didn’t even “win,” since the only criterion for winning is having a file containing 50,000 words and it came in at about 40,000 words when it was done. Oh well. My best and first finished work, so I’m cool with it. In fact, I’m still working on revising that work and trying to cut a version of it into a 10,000-word short story.) But man, what I got out of it.

NaNoWriMo taught me how to write. I don’t mean how to write well, or grammar or mechanics or plot or anything like that. It taught me how to put words on the page. And, after all, that is the first step to writing something. You have to just start making words.

Using Collections – Virtually

I heard a remark the other day that struck a chord – or, churned my butter, a bit.

The gist of it was, “we should make digital facsimiles of our library materials (especially rare materials) and put them online, so they spark library use when people visit to see them in person, after becoming aware of them thanks to the digitized versions.”

Now, at Penn, we have digitized a couple of Japanese collections: Japanese Juvenile Fiction Collection (aka Tatsukawa Bunko series), Japanese Naval Collection (in-process, focused on Renshū Kantai training fleet materials), and a miscellaneous collection of Japanese rare books in general.* These materials have been used both in person (thanks to publicizing them, pre- and post-digitization, on library news sites, blogs, and social media as well as word-of-mouth), and also digitally by researchers who cannot travel to Penn. In fact, a graduate student in Australia used our juvenile fiction collection for part of his dissertation; another student in Wisconsin plans to use facsimiles of our naval materials once they’re complete; and faculty at University of Montana have used our digital facsimile of Meiji-period journal Hōbunkai-sui (or Hōbunkai-shi).

These researchers, due to distance and budget, will likely never be able to visit Penn in person to use the collections. On top of that, some items – like the juvenile fiction and lengthy government documents related to the Imperial Navy – don’t lend themselves to using in a reading room. These aren’t artifacts to look over one page at a time, but research materials that will be read extensively (rather than “intensively,” a distinction we book history folks make). Thus, this is the only use they can make of our materials.

The digitization of Japanese collections at Penn has invited use and a kind of library visit by virtue of being available for researchers worldwide, not just those who are at Penn (who could easily view them in person and don’t “need” a digital facsimile), or who can visit the library to “smell” the books (as the person I paraphrased put it). I think it’s more important to be able to read, research, and use these documents than to smell or witness the material artifact. Of course, there are cases in which one would want to do that, but by and large, our researchers care more about the content and visual aspects of the materials – things that can be captured and conveyed in digital images – than about touching or handling them.

Isn’t this use, just as visiting the library in person is use? Shouldn’t we be tracking visits to our digital collections, downloads, and qualitative stories about their use in research, just as we do a gate count and track circulation? I think so. As we think about the present and future of libraries, and people make comments about their not being needed because libraries are on our smartphones (like libraries of fake news, right?), we must make the argument for providing content both physically and virtually. Who do people think is providing the content for their digital libraries? Physical libraries, of course! Those collections exist in the real world and come from somewhere, with significant investments of money, time, and labor involved – and moreover, it is the skilled and knowledgeable labor of professionals that is required.

On top of all of this, I feel it is most important to own up to what we can and cannot “control” online: our collections, by virtue of being able to be released at all, are largely in the public domain. Let’s not put CC licenses on them except for CC-0 (which is explicitly marking materials as public domain), pretending we can control the images when we have no legal right to (but users largely don’t know that). Let’s allow for free remixing and use without citing the digital library/archive it came from, without getting upset about posts on Tumblr. When you release public domain materials on the web (or through other services online), you are giving up your exclusive right to control the circumstances under which people use it – and as a cultural heritage institution, it is your role to perform this service for the world.

But not only should we provide this service, we should take credit for it: take credit for use, visits, and for people getting to do whatever they want with our collections. That is really meaningful and impactful use.

* Many thanks to Michael Williams for his great blog posts about our collections!

Taiyō project: first steps with data

As I begin working on my project involving Taiyō magazine, I thought I’d document what I’m doing so others can see the process of cleaning the data I’ve gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in “plain text,” what’s there to clean? Oh, you have no idea.

research diary go

Lately, I feel like I’m stuck in short-term thinking. While I hear “be in the moment” is a good thing, I’m overly in the moment. I’m having a hard time thinking long-term and planning out projects, let alone sticking to any kind of plan. Not that I have one.

A review of my dissertation recently went online, and of course some reactions to my sharing that were “what have you published in journals?” and “are you turning it into a book?” I graduated three years ago, and the dissertation was finished six months prior to that and handed in. This summer, I’ll be looking at four years of being “done” without much to show for the intervening time.

Of course, it’s hard to show something when you have a full-time job that doesn’t include research as a professional component. But if I want to do it for myself — and I do — that means that I need to come up with a non-job way to motivate myself and stay on track.

That brings me to the title of this post. My mother recently had a “meeting with herself” at the end of the work week to check in on what she meant to do and what actually happened. It sounds remarkably productive to me as a way to keep yourself 1) kind of on track, and 2) in touch with your own habits and aspirations. It’s easy to lose touch with those things in the weekly grind.

I decided I will have a weekend meeting with myself every week, and as a part of that, write a narrative of what I did. I’ll write it before I review my list of aspirations for the previous week. Then, when I compare the two, I won’t necessarily beat myself up over “not meeting goals” but will rather use it as an opportunity to refine my aspirations based on how I actually work (or don’t). As a part of that, to hold myself accountable and also to start a dialogue with others, I’ll be writing a cleaned-up version of that research diary once a week here. Don’t expect detailed notes, but do expect a diary of my process and the kinds of activities I engage in when doing research and writing.

I hope this can be helpful to a beginning researcher and spark some conversation with more experienced ones. While this is a personal journey of a sort, it is public, and I welcome your comments.

thinking about ‘sentiment analysis’

I just got off the phone with a researcher this morning who is interested in looking at sentiment analysis on a corpus of fiction, specifically by having some native speakers of Japanese (I think) tag adjectives as positive or negative, then look at the overall shape of the corpus with those tags in mind.

A while back, I wrote a paper about geoparsing and sentiment analysis for a class, describing a project I worked on. Talking to this researcher made me think back to this project – which I’m actually currently trying to rewrite in Python and then make work on some Japanese, rather than Victorian English, texts – and my own definition of sentiment analysis for humanistic inquiry.*

How is my definition of sentiment analysis different? How about I start with the methodology? What I did was look for salient adjectives, which I found by identifying the most “salient” nouns (not necessarily the most frequent, but I need to refine my heuristics) and then the adjectives that appeared next to them. I also used WordNet to look for words related to these adjectives and nouns, expanding my search beyond those specific words to ones with similar meaning that I might have missed. In particular, I looked at hypernyms (broader terms) and synonyms of nouns, and synonyms of adjectives.
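
To make the heuristic above concrete, here is a rough sketch in Python. It is not the code from the project described here. It assumes the text has already been tokenized and part-of-speech tagged into (word, tag) pairs with tags such as "NOUN" and "ADJ", it uses plain frequency as a stand-in for salience (which the method above explicitly wants to improve on), and it replaces the WordNet lookup with a small hypothetical dictionary called related. The function name and parameters are invented for illustration.

```python
from collections import Counter


def adjectives_near_salient_nouns(tagged_tokens, top_n=2, related=None):
    """Count (adjective, noun) pairs where the adjective sits next to a salient noun.

    tagged_tokens: list of (word, tag) pairs, tags like "NOUN" and "ADJ" (assumed).
    top_n: how many of the most frequent nouns to treat as "salient"
           (frequency is only a placeholder for a better salience heuristic).
    related: hypothetical dict mapping a noun to related nouns, standing in
             for synonyms and hypernyms that would come from WordNet.
    """
    related = related or {}

    # Step 1: pick the "salient" nouns (here: simply the most frequent ones).
    noun_counts = Counter(w.lower() for w, t in tagged_tokens if t == "NOUN")
    salient = {noun for noun, _ in noun_counts.most_common(top_n)}

    # Step 2: widen the net with related nouns.
    for noun in list(salient):
        salient.update(related.get(noun, ()))

    # Step 3: collect adjectives immediately before or after each salient noun.
    pairs = Counter()
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag != "NOUN" or word.lower() not in salient:
            continue
        for j in (i - 1, i + 1):
            if 0 <= j < len(tagged_tokens) and tagged_tokens[j][1] == "ADJ":
                pairs[(tagged_tokens[j][0].lower(), word.lower())] += 1
    return pairs
```

The result is a tally of adjective and noun pairings rather than a positive or negative score. A human reader would then interpret the pairings, or the sentences they come from.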

My method of sentiment analysis ends up looking more like automatic summarization than the positive-negative sentiment analysis we more frequently encounter, even in humanistic work such as Matt Jockers’s recent research. I argue, of course, that my method is somewhat more meaningful. I consider all adjectives to be sentiment words, because they carry subjective judgment (even something that’s kind of green might be described by someone else as also kind of blue). And I’m more interested in the character of subjective judgment than whether it should be able to be considered ‘objectively’ as positive or negative (something I don’t think is really possible in humanistic inquiry, or even in business applications). In other words, if we have to pick out the most representative feelings of people about what they’re experiencing, what are they feeling about that experience?

After all, can you really say that weather is good or bad, that there being a lot of farm fields is good or bad? I looked at 19th-century British women’s travel narratives of “exotic” places, and I found that their sentiment was often just observations about trains and the landscape and the people. They didn’t talk about whether they were feeling positively or negatively about those things; rather, they gave us their subjective judgment of what those things were like.

My take on sentiment analysis, then, is clearly that we need to introduce human judgment to the end of the process, perhaps gathering these representative phrases and adjectives (I lean toward phrases or even whole sentences) and then deciding what we can about them. I don’t even think a human interlocutor could put down a verdict of positive or negative on these observations and judgments – sentiments – that the women had about their experiences and environments. If not even a human could do it, and humans write and train the algorithms, how can the computer do it?

Is there even a point? Does it matter if it’s possible or not? We should be looking for something else entirely.

(I really need to get cracking on this project. Stay tuned for the revised methodology and heuristics, because I hope to write more and share code here as I go along.)

* I’m also trying to write a more extensive and revised paper on this, meant for the new incarnation of LLC.

WORD LAB: a room with a whiteboard

Several years ago, I attended Digital Humanities 2011 at Stanford and had the opportunity to meet with Franco Moretti. When Franco asked what I was interested in, I admitted that I badly wanted to see the Literary Lab I’d heard so much about, and seen so much interesting research come out of. He laughed and said he’d show it to me, but that I shouldn’t get too excited.

Why? Because Literary Lab is a windowless conference room in the middle of the English department at Stanford. Literary Lab is a room with a whiteboard.

I couldn’t have been more excited, to Franco’s amusement.

A room with a whiteboard. A room dedicated to talking about projects, to collaborating, to bringing a laptop and getting research done, and to sharing and brainstorming via drawing and notes up on a wall, not on a piece of paper or a shared document. It was an important moment for me.

When I was in graduate school, I’d tossed around a number of projects with colleagues, and gotten excited about a lot of them. But they always petered out, lost momentum, and disappeared. This is surely due to busy schedules and competing projects – not least the dissertation – but I think it’s also partly due to logistics.

Much as our work has gone online, and despite these being digital projects – just like Literary Lab’s research – a physical space is still hugely important. A space to talk, a space to brainstorm and draw and write, a space to work together: a space to keep things going.

I had been turning this over in my head ever since I met with Franco, but never had the opportunity to put my idea into action. Then I came to Penn, and met a like-minded colleague who got just as excited about the idea of dedicated space and collective work on projects as I was.

Our boss thought the idea of a room with a whiteboard was funny, just as Franco had thought my low standards were kind of silly. But you know what? You don’t need a budget to create ideas and momentum. You don’t need a budget to stimulate discussion and cross-disciplinary cooperation. You just need space and time, and willing participants who can make use of it. We made a proposal, got the go-ahead, and took advantage of a new room in our Kislak Center at Penn that was free for an hour and a half a week. It was enough: the Vitale II lab is a room with a whiteboard. It even has giant TVs to hook up a laptop.

Thus, WORD LAB was born: a text-analysis interest group that just needed space to meet, and people to populate it. We recruited hard, mailing every department and discipline list we could think of, and got a mind-boggling 15+ people at the first meeting, plus the organizers and some interested library staff, from across the university. The room was full.

That was the beginning of September 2014. WORD LAB is still going strong, with more formal presentations every other week, interspersed with journal club/coding tutorials/etc. in OPEN LAB on the other weeks. We get a regular attendance of at least 7-10 people a week, and the faces keep changing. It’s a group of Asianists, an Islamic law scholar, Annenberg School for Communication researchers, political scientists, psychologists, and librarians, some belonging to more than one group. We’ve had presentations from Penn staff and from researchers at other regional universities, and we have upcoming Skype presentations from Chicago and Northeastern.

A room with a whiteboard has turned into a budding cross-disciplinary, cross-professional text analysis interest community at Penn.

Keep up on WORD LAB:
@upennwordlab on Twitter
WORD LAB on Facebook

academic death squad

Are you interested in joining a supportive academic community online? A place to share ideas, brainstorming, motivation and inspiration, and if you’re comfortable, your drafts and freewriting and blogging for critique? If so, Academic Death Squad may be for you.

This is a Google group that I believe can be accessed publicly, although I’ve had some issues with signing up with non-Gmail addresses, and you appear to have to be logged in to Google to view the group’s page. Just put in a request to join and I’ll approve you. Or, if that doesn’t work, email me at mdesjardin (at) gmail.com.

Link: [Academic Death Squad]

I’m trying to get as many disciplines and geographic/chronological areas involved as possible, so all are welcome. And I especially would love to have diversity in careers, mixing in tenure-track faculty, adjuncts, grad students, staff broadly interpreted, librarians, museum curators, and independent scholars – and any other career path you can think of. Many of us not in grad student or faculty land have very little institutional support for academic research, so let’s support each other virtually.

In fact, one member has already posted a publication-ready article draft for last-minute comments, so we even have a little activity already!

Best regards and best wishes for this group. Please email me or comment on this post if you have questions, concerns, or suggestions.

よろしくお願いいたします!

*footnote: The name was originally based on a group I ran called “Creative Death Squad,” but the real origin is an amazing t-shirt I used to own in Pittsburgh that read “412 Vegan Death Squad” and had a picture of a skull with a carrot driven through it. I hope the name connotes badass-ness, serious commitment to our research, and some casual levity. Take it as you will.

arsenal of research: organizing citations, PDFs, notes, brainstorming, and drafts

Post title courtesy of the tyrannical Brian Vivier.

Although I post about the content of my research quite a bit (when I do post), I thought I’d take a step back and talk about the research process today. I’m going to write about a very specific aspect: the ways in which the computer helps me organize and engage in my research.

Obviously, there are things like databases and library catalogs, which are a topic for another day. Many people I talk to don’t know the first thing about WorldCat, so it needs to be addressed! But let’s pretend I already have my sources. Now what do I do?

When I read, I’m very traditional. I take notes with pen and paper when I have a book or a photocopied source. In fact, I used to print out PDFs too, and highlight and write in the margins. Well, that turned out to be a terrible idea. Your highlights and margin notes are not very accessible when you’re coming back to the document later to brainstorm, outline, or write.

My lesson learned – learned after many difficult situations – was to take notes like I’m never going to see the source again. My advisor recommended I do this with primary sources, but if you take long notes that involve mostly direct quotes from the sources, there’s no need to buy the book or really even check it out again. There’s no need to keep binders and binders of printed-out PDFs. So that’s the kind of note-taking I do with pen and paper, first.

The next step is to get them into the computer, because I want them to be 1) stored somewhere safe (I do daily external HD backups, plus sync; more on that later), 2) searchable, and 3) copy-and-pasteable. But where to keep them? How to organize?

I have gone through several pieces of software trying to figure this out, and I’ve settled on Mendeley. I first used Scrivener even for note-taking, which is a great program, but bad for citation management. I then tried Zotero, but that turned out to be bad for PDF management. What I really wanted was a good database that would save my citations, any PDFs I happened to have (I’m currently digitizing all of my sources from my dissertation so they don’t get lost or damaged, and so I can free up my filing cabinet for other things), and ideally let me take notes and even annotate or highlight the PDFs.

Well, despite Mendeley being owned by the devil (Elsevier), it’s free and it actually does everything I need with only a few minor nitpicks, and does it in a way that makes me supremely happy. (My nitpicks are no nested bulleted lists in the notes, and no shortcut keys for bold/italics in the notes.) If you have a PDF attached to your citation and it has OCR, Mendeley’s search function will search not only your citations, notes, and annotations, but also inside the PDFs. It can be overkill at times, but it’s pretty amazing.

So step two of my research organization process is the painstaking, mindless, thankless task of typing my pen-and-paper notes into Mendeley under the appropriate citation. It’s boring but worth it. As I mentioned above, it searches all my notes, and I can copy and paste them into Scrivener, which I will address next. As I type my notes, at the very least I copy and paste them into brainstorming documents as appropriate (usually full quotes), and if I’m up to it, I do some free-writing to brainstorm how the source informs my topic and what I could write about related to it. This usually brings up new ideas I didn’t know I had.

What happens after I get all the notes typed in, PDFs organized and annotated if I have them? I next move over to Scrivener. I’ve been using it for over five years, for both research and creative writing, and can’t sing its praises enough. It’s a word processor that creates a database for your project, where you can store your reference materials, brainstorming ideas, notes, and draft. And more, if you can think of other areas you need to record notes in. Unlike in the old Scrivener I first started with, you can now add footnotes and comments that port straight to MS Word when you compile your document for it, making the transition to final draft in Word very easy. (Sadly, publishers seem to prefer things that are not Scrivener databases when reviewing.) The typical things I store are the draft itself (of course), a research diary of brainstorming that I update periodically, brainstorming specifically about sources and particular concepts or points, and also under the “Notes” section the comments and suggestions and draft corrections I receive from others. So I keep my full writing process, except for mind mapping/concept mapping (another post), all in one place. It’s amazing.

I’m extremely happy with these two pieces of software; my only complaint is that neither of them does all of what I want, and I have to use two different things complementarily. Well, the situation is still significantly better than several years ago, when I used Mendeley Alpha and it deleted my entire library of citations multiple times. Yikes. Now its syncing works perfectly and I haven’t had a library failure yet. (Fingers crossed).

Next posts will include mind mapping software, how I take notes, how to effectively find and import source citations, and how I deal with multiple languages in my citations.