A report written for SI561 with Prof. Dragomir Radev in late spring 2011, on an experimental project dealing with sentiment phrase identification (in a way that also involves automatic summarization of a kind) and geoparsing. Read on to find out why it failed but was interesting nonetheless!
Download the Java source and be ready to tear your hair out at my uncommented code!
This project attempts to divide text narratives by location, and to find representative or key phrases within them. The corpus consists of 19th-century British travel narratives. The methodology uses a number of heuristics, most based on syntax, to identify when the narrator has arrived or is departing a location, and to identify that location. A simple methodology uses frequent nouns and sentences which contain both a frequent noun and an adjective to choose representative phrases, here used as a kind of sentiment analysis. The results of the methodology and remaining issues are discussed.
The purpose of this project is first to divide text narratives into blocks of text associated with a location (geoparsing), and then to choose representative descriptive or sentiment phrases from the original text. By doing so, my goal is to automatically detect travel and assign locations to their respective sections of text, and to perform a kind of qualitative sentiment analysis. By sentiment, I mean specifically phrases that indicate the author’s impression or opinion of the places in which she travels.
My focus was on travel narratives from the 19th century, written by British women traveling to locales that would have been considered “exotic” at the time and were possibly colonies of the British empire. I aimed to assess both how long they spent in various places during the travel, as opposed to time traveling, and what their impressions were of the places and the journey itself, without having to read through each long narrative. I found that adapting the concepts of geoparsing and sentiment analysis to 19th-century travel narratives would be a productive approach to these humanistic questions.
Initially, I took as my corpus all works by Isabella L. Bird, Mary Kingsley, and Ida Pfeiffer that were freely available in plain text from Project Gutenberg. As I developed the software that I used for this project, it became obvious that the tools I used could not handle parsing and tagging large texts and all would have to be manually divided by chapters in the pre-processing phase, which also included the removal of Project Gutenberg metadata, non-travel descriptive chapters, and notes. Given the time constraints, I limited my corpus to a representative sample of one work by Isabella Bird (A Lady’s Life in the Rocky Mountains), Mary Kingsley (Travels in West Africa), and Ida Pfeiffer (A Woman’s Journey Round the World). I chose these because I wanted to test my heuristics on a variety of writing styles, rather than risk skewed results by sticking to one style.
As I intended to create a tool that could be used on any plain text from the same genre and time period with a reasonable success rate, I avoided using a training set as this would have added a significant amount of time and possibly biased the heuristics toward a specific set of texts. I did use Isabella Bird’s Among the Tibetans (chapter 1) as a development text for identifying and developing heuristics from syntactic patterns and typical verb usage for arriving or leaving a destination.
My test set consisted of the first chapter of each of the above listed works. This was chosen due to time constraints, as evaluating the success of the tool involves significant human effort in dividing the texts and making subjective judgments about representative phrases.
I developed my application in Java and used two Java APIs: one for text annotation (part of speech and lemma) and one for WordNet database access. My code is included with this report, but I did not include the two APIs, which are freely available for download.
Stanford CoreNLP is a Java API that includes a variety of NLP tools. For this project, I used the part-of-speech tagger (including tokenization and sentence splitting) and lemmatizer in the Stanford CoreNLP package. Although there is a named entity recognition component, I chose not to use this feature. My reasoning was that due to the nature of the texts in the corpus, location spellings are, more often than not, devised phonetically by the author and are thus idiosyncratic, or are archaic spellings. The named entity recognition tool is also resource-intensive, and the benefits did not outweigh the drawback of greater memory usage.
For WordNet access, I used the MIT Java Wordnet Interface (JWI) to retrieve related words for both verbs and nouns that I used in my text analysis.
The first phase of this project was to process the travel narratives in order to break them into sub-texts based on location. I approached this problem by tagging the texts for parts of speech, then lemmatizing all verbs in the texts. I then analyzed each text sentence by sentence for patterns that indicate either leaving a place (thus beginning a “non location” block of text that indicates a journey) or entering a place (thus beginning a block of text tagged with that location). For all texts, I saved these sub-texts as new files with the place or “journey” in the file name.
Many of the patterns I used to trigger arrival or departure were purely based on parts of speech, but I also created arrival and departure verb lists from an analysis of the development text. One step in the geoparsing process in my Java application is to broaden the arrival verb list by finding the entailments and hypernyms of the verbs that are initially read from a text file that I created. This final arrival verb list was used in triggering location text blocks. A full list of the patterns that trigger arrival and departure is included as Appendix A.
The patterns I used to indicate travel and identify the location did not include any named entity recognition tool. I evaluated all of the place name gazetteers that were freely available online, but because of the alternate place names, alternate or historical spellings, and often vague or overly specific mentions of places, they did not meet the need of accurately recognizing and tagging locations in these texts. Because some of the locations covered in the narratives now use almost completely different schemes for place names (such as Japan) as compared with 19th-century standards, or are rendered in the author’s phonetic impression of what the place name might have been, it would be difficult even with a 19th-century place name lexicon to accurately tag these as locations. Instead, I attempted to develop a method for location recognition based on syntactic patterns and movement verb hypernyms that could possibly work on any number of texts in 19th- or 20th-century English. I was able to do this because the purposes of my project do not require resolving these place names to canonical forms or specific map coordinates. My methodology described here would not be able to accomplish these tasks.
After dividing the texts based on assumed location, I moved on to evaluate which locations (including the non-location “journey” texts) were longest relative to the total word count of the original text. This provided information on which areas the author tended to focus on, or whether the individual locations visited were discussed less than the travel itself (the “journey”). For the purpose of this project, I counted all individual “journey” texts as a single block and compared that word count against the full original text. Because the test set was limited to only the first chapter from three texts, these measurements can be taken as a rough judge of how accurate the parser was, but cannot serve as definitive results.
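The length comparison described above can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code: the class name `LocationShares`, the `{label, text}` array layout, and the whitespace-based word count are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LocationShares {
    /** Label assumed here for non-location travel blocks. */
    static final String JOURNEY = "JOURNEY";

    /** Counts whitespace-separated tokens; a simplification of real tokenization. */
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Sums word counts per label, so all JOURNEY blocks collapse into one entry,
     * then converts each total to a percentage of the full text.
     * Each block is a two-element array: {label, text}.
     */
    static Map<String, Double> shares(String[][] blocks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (String[] block : blocks) {
            int n = wordCount(block[1]);
            counts.merge(block[0], n, Integer::sum);
            total += n;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy blocks invented for illustration only.
        String[][] blocks = {
            {JOURNEY, "we left at dawn"},
            {"LEH", "the town is small and very quiet"},
            {JOURNEY, "we rode on again"}
        };
        shares(blocks).forEach((label, pct) -> System.out.printf("%s: %.1f%%%n", label, pct));
    }
}
```

Merging by label before dividing is what makes the separate journey segments count as a single block against the full text.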
The sentiment analysis aspect of this project posed significant difficulty. Rather than evaluating the texts as positive or negative, the goals of the project required identifying sentiment qualitatively. I chose to interpret “sentiment” as feeling about an experience, and thus the author’s description of significant entities in the text. My process was to first identify the most frequently mentioned nouns, as well as their related words, and to then scan through the text to find sentences that contained both one of these nouns and an adjective or adverb. This process is quite simple and does not try to isolate a phrase from a sentence, and thus leaves much room for improvement.
My methodology for finding the key nouns was to first measure the frequency of all nouns in the text (excluding pronouns), to determine the mean frequency of a noun (rounded down to the nearest integer), and choose nouns whose frequency was at least 3 above the mean as the set of nouns I would use. (This number is arbitrary, but I arrived at mean + 3 after experimentation on the development text as the number that would create a long enough list of nouns that also wouldn’t trigger a sentiment phrase on every sentence in the text.) Thus, the set of key nouns is specific to each text as it is analyzed.
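A minimal sketch of the mean + 3 cutoff, under stated assumptions: the input is taken to be a list of noun lemmas with pronouns already removed, and "mean frequency" is read as total noun tokens divided by the number of distinct nouns. The class and method names (`KeyNouns`, `select`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class KeyNouns {
    /** Offset above the mean; the report settled on 3 after experimentation. */
    static final int OFFSET = 3;

    /** Returns the nouns whose frequency is at least floor(mean frequency) + OFFSET. */
    static Set<String> select(List<String> nouns) {
        Map<String, Integer> freq = new HashMap<>();
        for (String noun : nouns) {
            freq.merge(noun, 1, Integer::sum);
        }
        Set<String> keys = new TreeSet<>();
        if (freq.isEmpty()) {
            return keys;
        }
        // Integer division rounds the mean down, as described in the text.
        int mean = nouns.size() / freq.size();
        int cutoff = mean + OFFSET;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= cutoff) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // Toy input invented for illustration only.
        List<String> nouns = Arrays.asList(
            "horse", "horse", "horse", "horse", "horse", "horse",
            "snow", "snow", "letter");
        System.out.println(select(nouns));
    }
}
```

Because the cutoff is derived from each text's own frequency distribution, the resulting noun set differs from text to text.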
With this preparation, as I parsed the text for location blocks I assessed each sentence for sentiment. If a sentence contained both a key noun and an adjective or adverb, I stored that sentence in a hashtable with its location. (If the location already existed, I simply appended this sentence to the existing text.)
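The per-sentence check can be sketched as below. This is a hedged illustration rather than the original implementation: it assumes each sentence arrives as parallel arrays of lemmas and Penn Treebank tags (the tag set used by the Stanford tagger), uses a `LinkedHashMap` in place of the hashtable mentioned above, and the class name `SentimentCollector` is made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class SentimentCollector {
    private final Set<String> keyNouns;
    private final Map<String, StringBuilder> byLocation = new LinkedHashMap<>();

    SentimentCollector(Set<String> keyNouns) {
        this.keyNouns = keyNouns;
    }

    /** True if the sentence has a key noun (NN*) and an adjective (JJ*) or adverb (RB*). */
    static boolean qualifies(String[] lemmas, String[] tags, Set<String> keyNouns) {
        boolean hasKeyNoun = false;
        boolean hasModifier = false;
        for (int i = 0; i < tags.length; i++) {
            if (tags[i].startsWith("NN") && keyNouns.contains(lemmas[i].toLowerCase())) {
                hasKeyNoun = true;
            }
            if (tags[i].startsWith("JJ") || tags[i].startsWith("RB")) {
                hasModifier = true;
            }
        }
        return hasKeyNoun && hasModifier;
    }

    /** Stores a qualifying sentence under its location, appending if the location exists. */
    void consider(String location, String sentence, String[] lemmas, String[] tags) {
        if (!qualifies(lemmas, tags, keyNouns)) {
            return;
        }
        StringBuilder text = byLocation.computeIfAbsent(location, k -> new StringBuilder());
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(sentence);
    }

    String phrasesFor(String location) {
        StringBuilder text = byLocation.get(location);
        return text == null ? "" : text.toString();
    }

    public static void main(String[] args) {
        // Toy sentence invented for illustration only.
        SentimentCollector c = new SentimentCollector(new HashSet<>(Arrays.asList("horse")));
        c.consider("LEH", "The horse was tired.",
            new String[] {"the", "horse", "be", "tired", "."},
            new String[] {"DT", "NN", "VBD", "JJ", "."});
        System.out.println(c.phrasesFor("LEH"));
    }
}
```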
It is clear from the quantitative results that the rule for departure was invoked much too often, and rules for arrival not enough. The following are results for each test text, with a rule count below.
Travels in West Africa
| Section | Word Count | Percentage of Total |
|---|---|---|
| FREE TOWN ITS CAPITAL | 64 | 1% |
| SAN PAUL DE LOANDA | 111 | 1.5% |
| THE GOLD COAST | 86 | 1% |
- Rule 1 was used 4 times.
- Rule 2 was used 0 times.
- Rule 3 was used 1 time.
- Rule 4 was used 12 times.
- Rule 5 was used 35 times. (arrival verb rule)
- Rule 6 was used 66 times. (departure verb rule)
A Woman's Journey Round the World
| Section | Word Count | Percentage of Total |
|---|---|---|
| THE FOURTH OF JULY | 496 | 7% |
- Rule 1 was used 2 times.
- Rule 2 was used 0 times.
- Rule 3 was used 1 time.
- Rule 4 was used 8 times.
- Rule 5 was used 41 times. (arrival verb rule)
- Rule 6 was used 99 times. (departure verb rule)
A Lady's Life in the Rocky Mountains
| Section | Word Count | Percentage of Total |
|---|---|---|
- Rule 1 was used 2 times.
- Rule 2 was used 0 times.
- Rule 3 was used 0 times.
- Rule 4 was used 8 times.
- Rule 5 was used 21 times. (arrival verb rule)
- Rule 6 was used 37 times. (departure verb rule)
As seen from the tables, the results show low recall, with non-location (JOURNEY) blocks of text making up the majority. These results similarly show relatively low precision; in other words, from the locations that it did identify (other than JOURNEY), three were either no location at all (a space in the filename) or not actually locations. In the initial development before all rules were in place, many more proper nouns were identified as arrival locations that were not actually place names. For example, the rules identified the location as the name of a horse or person, or even “icy water” when its part of speech was mis-tagged as proper noun. Thus, some of this low precision could be helped by compiling a gazetteer specific to this genre and time period, but some of it is inevitable due to tagging error.
Looking toward further revision, the departure rule clearly needs to be revised so that it is triggered less often. This may enable other rules to be triggered for the same sentence and possibly provide us with a more detailed look at the places that are named in the texts. In addition, it is necessary to re-think rules 2 and 3, which are both based solely on syntax, and come up with new rules to replace them, as they were rarely triggered in any of these three cases.
The results of the sentiment analysis process are much harder to judge, as they involve subjective judgment which would likely vary even between two people reading the same text. I have attached all of the sentences identified as sentiment phrases in Appendix B, but will discuss some preliminary conclusions here.
In terms of quantitative results, the sentiment analysis had very high recall, which suggests that the cutoff for noun frequency should be made much higher. At the same time, that runs the risk of eliminating nouns that are more information-bearing. It is a fundamental question about what constitutes the “key” nouns of a text – are they the ones that a reader feels are important, even if only mentioned once, or the ones most frequently written by the author? My methodology gives precedence to the words used most often by the author, in order to discover as much of the information present in the text as possible.
The key nouns used to identify phrases are taken from a list of the most frequent, then lengthened with their related words in order to take these initial nouns as frequent topics with the related words as additional referents. Because of this, many of the sentiment phrases taken from the three test texts are more mundane than a human might have chosen. For example, because “letters” is a frequent noun in A Lady’s Life in the Rocky Mountains, there are many references to the author’s letters from her previous book, the letters she sends home (which make up the book in question), and letters she receives. Still, it may be beside the point to evaluate the phrases based on whether they deviate from what an imaginary human would have done. Rather, as an answer to the humanistic question of “what impressed these authors about the places they visited?” it provides an interesting challenge to my own assumptions about the texts. When beginning this project, I assumed that the representative sentiment phrases would be about landscapes, people, customs, or levels of comfort. There are of course many phrases that fit into these categories, but also many about train cars, times, animals (especially horses), crops, and temperatures.
Despite the value of the results for my reading of the texts, they can clearly be improved upon as well. There is a great deal of repetition and it would be useful to make lists of similar sentences or phrases and automatically choose several as most descriptive or emotional based on some combination of syntactic rules, WordNet synsets, and introducing probabilities. At the same time, some nouns may bear more information than others and so should be ranked accordingly, although this would clearly require human effort and a training set. In addition, I did not include the hyponyms of the noun hypernyms that I gathered from WordNet, which would expand the noun list considerably. Isolating “descriptive phrases” rather than including entire sentences would also be a logical next step, as reading through the mini-narrative that results from the sentiment phrases currently is much closer to automatic summarization than sentiment analysis.
This project is a first step toward detecting location for subsections of text documents, but there are many issues remaining to be resolved.
The first major issue is partially addressed in the set of syntactic rules used to identify arrival in a place. The authors included in my corpus often implied rather than stated that they traveled to a new place, and simply mentioned the name of the place without any travel verb near it. For example, the development text contained the phrase “Leh is ...” to indicate that she had arrived in the city of Leh. Because this was a frequent pattern in the development text I included this in my heuristics. However, this is a crude way of getting at the place name and needs to be refined further. It is difficult to say how close any rules can get to identifying travel when it is only implied by an author, and unfortunately that is often the way that we write our narratives. Still, it should be a goal to get as close to this as possible.
Place names are also an issue in that there are a number of false positives, even when the part of speech is correct. Named entity recognition would become more useful with the development of a 19th-century place name gazetteer, although this would require research and time. An intermediate step is to integrate GeoNames’ gazetteer with my existing application and use the Stanford CoreNLP named entity recognition tool along with it to try to identify more place names. This would allow me to develop more heuristics specifically related to detecting a place name, which I have not done.
A second issue is the travel sections between locations, referred to in this report as “JOURNEY.” These sections, even if they had lower recall (and thus didn’t take up the majority of the results), are opaque. Between two places, where do the travelers pass through? This information is often mentioned in passing in the middle of a more complex sentence, and difficult to isolate. A partial solution is to refine the departure and arrival rules to cut down on the number of sections labeled JOURNEY when they are not a travel sequence, and to use the sentiment phrases from the JOURNEY sections to supply more detail about the travel.
Refining the departure rule could lead to a large improvement in accuracy, especially because it is often triggered by a false positive. “Departure verbs” are often used in a non-literal sense, but trigger JOURNEY nonetheless. Filtering matches by part of speech or syntactic context, or lowering the probability of triggering the rule, could all reduce the number of false positives.
Another major issue is the inability to deal with time. On a simple level, here I use the lemmas of verbs in order to make consistent matches, but beyond that, it is also very difficult to understand the time of the current text based only on syntactic rules. If a character starts to tell a story, it will potentially be blocked as a new segment of text at a new location. This is especially a challenge because the narrators of these texts speak in past tense, so any story they tell in past tense will be indistinguishable from a linear narrative going from past to future. At the same time, this also brings up the theoretical question of how we should read texts. This issue is a priority if preserving chronology is important for the purposes of the reader/programmer, but less so if chronology isn’t important.
Finally, there are several resource issues. First, I would like to greatly expand on the development text set to develop better heuristics, and also increase the size of the test set for more reliable results. A second technical issue persists that makes it difficult to process even moderate amounts of text. The Stanford CoreNLP set of tools is so memory-intensive that it cannot process an entire text (around 4,000 – 7,000 words) at a time, and the texts need to be divided. (Here, I divided them logically according to chapter marker.) Without making the application more efficient, its use will be very limited since the pre-processing of texts takes much more time and effort if done by hand.
Although this project is in its beginning stages and the results were unsatisfactory, it lays the groundwork for continuing to work on both geoparsing and the question of how and whether sentiment analysis can be used for humanistic questions. It also suggests the integration of automatic summarization into this problem.
This problem has the potential to allow us to re-think our approach to texts, in particular which phrases are important and representative of the author’s impressions, and also the question of how to attribute text to a location. Should a block of text be marked with the location it discusses, or the location where it was written? (This is an especially relevant question because many 19th-century travel narratives take the form of letters with the location in which they were written marked clearly, even though the content of the letters is about something else.) For the humanities, this doesn’t just provide us with information, but also with new questions.
At the same time, the information provided by this kind of application has use elsewhere. A more refined process could come closer to the goal of automatically dividing text by place and providing a few representative sentences from each. I envision this taking form as a simple visualization over an historical map dating from the time of the narrative, implemented as a Web site for general access. Both historical maps and these texts are freely available online, but bringing them together as an interactive experience could help make them more accessible to a range of people, including students, who might not see them otherwise.
Note: arrival becomes false if the location name found in the rule is the same as the current location.
- Rule 1: Sentence has verb with lemma “be” in the first 5 words, preceded by at least one pronoun. The entire phrase before “be” is taken to be the place name.
- Rule 2: Sentence begins with a one- or two-word proper noun phrase followed by a comma.
- Rule 3: Sentence contains “arrive” and is followed by a place name phrase (proper nouns, determiners, prepositions).
- Rule 4: Sentence contains “come” and is followed by a place name phrase (proper nouns, determiners, prepositions).
- Rule 5: Sentence contains an arrival verb. (See the verb list.) Arrival is true and location changes only if there is a place name phrase (combination of proper nouns, determiners, and prepositions) following the verb.
- Rule 6: Sentence contains a departure verb. (See the block list.) Location changes to “JOURNEY.”
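As an illustration of how Rules 5 and 6 and the note about unchanged locations might fit together, here is a small sketch. It is not the project's code: it assumes lemmas and Penn Treebank tags as parallel arrays, lets the first matching verb in a sentence decide the outcome, treats `TO` as a preposition, and keeps only the proper nouns of the place name phrase as the label. All names (`ArrivalDepartureRules`, `nextLocation`, `placeNameAfter`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ArrivalDepartureRules {
    static final String JOURNEY = "JOURNEY";

    /** Returns the location in effect after this sentence, or the current one if no rule fires. */
    static String nextLocation(String current, String[] words, String[] lemmas, String[] tags,
                               Set<String> arrivalVerbs, Set<String> departureVerbs) {
        for (int i = 0; i < lemmas.length; i++) {
            if (!tags[i].startsWith("VB")) {
                continue;
            }
            if (arrivalVerbs.contains(lemmas[i])) {
                // Rule 5: arrival only counts if a place name phrase follows the verb
                // and names a location different from the current one.
                String place = placeNameAfter(words, tags, i + 1);
                if (!place.isEmpty() && !place.equalsIgnoreCase(current)) {
                    return place;
                }
            } else if (departureVerbs.contains(lemmas[i])) {
                // Rule 6: any departure verb switches the location to JOURNEY.
                return JOURNEY;
            }
        }
        return current;
    }

    /**
     * Scans the run of proper nouns, determiners, and prepositions starting at
     * the given index and returns its proper nouns, upper-cased and joined by spaces.
     */
    static String placeNameAfter(String[] words, String[] tags, int start) {
        StringBuilder name = new StringBuilder();
        for (int i = start; i < words.length; i++) {
            String tag = tags[i];
            if (tag.startsWith("NNP")) {
                if (name.length() > 0) {
                    name.append(' ');
                }
                name.append(words[i].toUpperCase());
            } else if (!(tag.equals("DT") || tag.equals("IN") || tag.equals("TO"))) {
                break;
            }
        }
        return name.toString();
    }

    public static void main(String[] args) {
        // Toy verb lists and sentence invented for illustration only.
        Set<String> arrive = new HashSet<>(Arrays.asList("arrive", "reach"));
        Set<String> depart = new HashSet<>(Arrays.asList("leave", "depart"));
        String next = nextLocation(JOURNEY,
            new String[] {"We", "arrived", "at", "Leh"},
            new String[] {"we", "arrive", "at", "Leh"},
            new String[] {"PRP", "VBD", "IN", "NNP"},
            arrive, depart);
        System.out.println(next);
    }
}
```

In this sketch a departure verb fires unconditionally, which mirrors why Rule 6 dominates the rule counts: it has no place name requirement to filter out non-literal uses.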