I’m teaching a workshop on Japanese text mining this week and am getting all kinds of interesting practical questions that I don’t know the answers to. Today, I was asked if it’s possible to batch convert .docx files to .txt in Windows.
I don’t know Windows, but I do know Mac OS, and it turns out that you can use textutil in the Terminal to do this there. Just run this line to convert .docx -> .txt:
textutil -convert txt /path/to/DOCX/files/*.docx
You can convert to a bunch of different formats: txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive. By default, textutil puts the converted files in the same directory as the source files, with the new extension. That’s it: enjoy!
* Note: This worked fine with UTF-8 files using Japanese, so I assume it just works with UTF-8 in general. YMMV.
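If you would rather keep the converted files in a separate folder and spell out the encoding instead of relying on the default, something like the following should work. This is only a sketch: the function name convert_docx_dir and the two directory arguments are made up for illustration, and it leans on textutil's -output and -encoding options, so it is Mac OS only.

```shell
#!/bin/sh
# Sketch: convert every .docx in a source directory to UTF-8 .txt files
# written into a separate output directory.
# convert_docx_dir and its two arguments are hypothetical names, not part of textutil.

convert_docx_dir() {
    src_dir=$1
    out_dir=$2

    # Create the output directory if it does not exist yet
    mkdir -p "$out_dir" || return 1

    for f in "$src_dir"/*.docx; do
        # If nothing matched, the pattern stays unexpanded; skip it
        [ -e "$f" ] || continue

        # Strip the directory and the .docx extension to get the base name
        name=$(basename "$f" .docx)

        # -output names one destination file, so convert one file per call;
        # -encoding sets the encoding of the plain-text output
        textutil -convert txt -encoding UTF-8 \
            -output "$out_dir/$name.txt" "$f"
    done
}

# Example call (paths are placeholders):
# convert_docx_dir /path/to/DOCX/files /path/to/TXT/files
```

Looping one file at a time is slower than handing textutil the whole *.docx glob, but it lets each output file go wherever you want, and the quoting keeps file names with spaces intact.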