I first became familiar with the Literary Lab when I took a class on literary text mining in R with Mark Algee-Hewitt last winter. From discussing the philosophies behind the digital humanities to constructing cluster dendrograms (plus lots of other cool graphs) of Poe’s short stories, I loved the class and was excited to start working at the lab!
The first project I contributed to was the Identity project, which investigates the discourse on race in American literary texts from the late 18th century until the mid-20th century. Much of my time was devoted to reading novels and replacing references to black characters’ names with fixed tags. I personally read and tagged The Sound and the Fury, Their Eyes Were Watching God, Our Nig, The Adventures of Huckleberry Finn, The Conjure Woman, Westward Ho!, Three Lives, and Uncle Remus: His Songs and Sayings. These character-tagged texts would later be used in a collocate analysis.
The tagging seemed simple enough, but I soon ran into problems with pronouns. Should a pronoun reference receive the same tag as a named reference? And if so, should reflexive and personal pronouns get distinct tags? A single fixed tag would streamline future analyses, but multiple tags could yield more revelatory results. After conferring with the principal researchers, we decided to vary the tags by pronoun type. As I read through the texts, I noticed that sentences containing reflexive pronouns sometimes revealed how characters conceived of their own identities, often in relation to race. The collocate analysis the principal researchers ran produced some surprising results: for example, "value" was a common word among the collocates of reflexive pronouns.
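To give a sense of what the tagging scheme looked like in practice, here's a minimal sketch (in Python rather than R, and with made-up tag names and a hypothetical `tag_named_references` helper, not the Lab's actual pipeline). Named references can be swapped automatically, while pronouns had to be tagged by hand while reading, since only a reader can tell which character a pronoun points to:

```python
import re

# Hypothetical tagging scheme: one fixed tag for named references,
# and separate tags distinguishing personal from reflexive pronouns.
NAME_TAG = "<CHAR_B>"        # named reference to a Black character
PERS_TAG = "<CHAR_B_PERS>"   # personal pronoun (he, she, him, her)
REFL_TAG = "<CHAR_B_REFL>"   # reflexive pronoun (himself, herself)

def tag_named_references(text: str, names: list[str]) -> str:
    """Replace each whole-word occurrence of a character name
    with the fixed name tag."""
    for name in names:
        text = re.sub(rf"\b{re.escape(name)}\b", NAME_TAG, text)
    return text

sample = "Janie looked at her reflection."
print(tag_named_references(sample, ["Janie"]))
# The pronoun "her" would then be tagged PERS_TAG by hand.
```

With all references collapsed into a handful of tags like these, a collocate analysis can treat every tagged reference as a single "word" and count what appears near it.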
From another project, I learned first-hand that projects often run into obstacles early on, and that those obstacles are valuable: they can clarify goals and suggest new approaches. About a month before I started at the lab, I had been thinking a lot about nebulous genre words such as "postmodern". I considered doing an empirical content analysis of texts generally considered postmodern to see what their defining characteristics are according to a computer. I ran into a challenge right away: how could I construct a corpus free of bias? The other members of the lab and I came up with a set of 50 novels we considered postmodern, and I then verified the label by finding peer-reviewed articles describing each novel as postmodern. Still, we were hand-picking a set of novels to fill a corpus of arbitrary size. We did, however, decide to build a control corpus: a random set of novels drawn from the lab's 20th-century corpus with the same date distribution as the postmodern corpus. That built-in randomness should control for bias better.
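The date-matched sampling behind the control corpus is simple to sketch. This is an illustration in Python (the Lab's work was in R), with a hypothetical `sample_control_corpus` function and toy novel titles; the idea is just to draw as many random novels per publication year as the postmodern corpus contains:

```python
import random
from collections import Counter

def sample_control_corpus(postmodern_years, pool, seed=0):
    """Draw a random control set from `pool` (title -> year) whose
    publication-year distribution matches `postmodern_years`.
    Assumes the pool has enough novels for every needed year."""
    rng = random.Random(seed)
    need = Counter(postmodern_years)   # how many novels per year
    by_year = {}
    for title, year in pool.items():
        by_year.setdefault(year, []).append(title)
    control = []
    for year, count in need.items():
        control.extend(rng.sample(by_year[year], count))
    return control

pool = {"Novel A": 1973, "Novel B": 1973, "Novel C": 1985, "Novel D": 1985}
print(sample_control_corpus([1973, 1985], pool))
```

Matching the date distribution matters because word usage drifts over the century; without it, differences between the corpora could simply reflect when the novels were written.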
I planned to work in R, using topic models and keywords in context. At the start, I ran a trial topic model on a subset of the corpus and found it a little hard to make sense of (if only it would clearly assign names to topics!). I soon realized that my goals were constrained by my tools. Topic models and keywords in context are useful for understanding thematic postmodernism, that is, the specific words and topics that make up postmodern texts. Yet postmodern novels often experiment with form, from the extensive commentary of Pale Fire to the extensive footnotes of Infinite Jest. A text file of Infinite Jest in which the footnotes blend into the main text could give us distorted results about the novel's most prominent words, yet it's hard to dispute that the footnotes are essential to the experience of reading it. For making sense of stylistically postmodern characteristics, things often rendered best in print form, my best bet was reading the texts themselves or knowing their forms beforehand. After all, there's no one way of being experimental in form, and such experimentation is not necessarily easy for a computer to recognize.
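Of the two tools, keywords in context is the simpler one to illustrate. A bare-bones version (sketched here in Python for illustration; my actual work used R, and this `kwic` function is a hypothetical stand-in for library implementations) just shows each hit of a keyword with a few words of surrounding context:

```python
def kwic(text, keyword, window=3):
    """Keywords in context: list each occurrence of `keyword`
    with `window` words of context on either side."""
    words = text.lower().split()
    hits = []
    for i, w in enumerate(words):
        if w.strip(".,;!?\"'") == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append(f"{left} [{keyword}] {right}")
    return hits

for line in kwic("He saw himself in the mirror and hated himself deeply",
                 "himself", window=2):
    print(line)
```

Even a tool this simple shows its limits for my question: it can tell you how a word is used, but nothing about footnotes, commentary, or any other formal experiment.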
Going forward, I’m thinking of trying out more methods in R, such as clustering or most distinctive words. I’m also considering revamping my corpus to include only the novels with the highest number of “postmodern” affirmations in peer-reviewed articles; I suspect this could reduce bias in the corpus. And I’ll read up more on postmodern literature and theory so that I can better interpret my results in R against widely held views. Digital tools can help navigate wide-ranging literary questions, but it’s really a solid knowledge of literature that makes all the data visualizations and topic models meaningful. I’ll continue to learn!
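The "most distinctive words" idea can also be sketched simply. One common approach (illustrated here in Python with a hypothetical `distinctive_words` function; real analyses often use log-likelihood tests instead) ranks words by how much more frequent they are in the postmodern corpus than in the control corpus:

```python
from collections import Counter

def distinctive_words(target_tokens, reference_tokens):
    """Rank words by the ratio of their relative frequency in the
    target corpus to that in the reference corpus (add-one smoothed
    so unseen reference words don't divide by zero)."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    t_total, r_total = sum(t.values()), sum(r.values())
    scores = {
        w: (c / t_total) / ((r[w] + 1) / (r_total + 1))
        for w, c in t.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy corpora standing in for the postmodern and control sets.
print(distinctive_words("footnote footnote narrator".split(),
                        "narrator narrator plot".split()))
```

A ranking like this would surface vocabulary that sets the postmodern corpus apart from the date-matched control, which is exactly where the control corpus earns its keep.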
Sarah Thomas is a sophomore at Stanford majoring in English. In her spare time she hosts a radio show called Life Aquatic on KZSU.