The humanities are used to feeling embattled, and, consequently, to making excuses for their existence. We should be allowed to study literature, one such line of argument goes, because reading and writing make students better employees or better citizens or more empathic human beings — because literature has some merit that is not merely aesthetic. Take, for instance, a widely publicized article in Science affirming that “Reading Literary Fiction Improves Theory of Mind”: the psychologist authors, while acknowledging the “difficulty in precisely quantifying literariness,” casually reify that “literariness” in their effort to demonstrate its virtues, concluding that reading “prize-winning” texts seems more beneficial, psychologically healthier, than reading “popular” fiction. The fact that literary scholars would be likely to reject the precise terms of their distinction — which places Dashiell Hammett and Robert Heinlein in the same category as Danielle Steel and Christian novelist William Paul Young, and which ignores variables like translation, publication date, and author demographics — was trumped, when the article was picked up in the popular media, by the suggestion that we should be grateful that these scientists have taken an interest in our object of study.
In a sense, I am. Among literary study’s dizzying series of 21st-century “turns,” the empirical one strikes me as the most intellectually stimulating and the likeliest to last, providing a means for literary scholars to generate and test falsifiable hypotheses without abandoning the interpretive rigor that characterizes our discipline. And digital humanities, of course, is the star of this new empiricism; the question is — why? What is it about large-scale quantitative analysis that strikes us as so invigorating and so promising? This question may sound like a softball, but I think a prolonged discussion would reveal that many of us in DH actually disagree on the answer. I have one, of course, that I think is right — but before I get to it, I want to explore one especially common wrong answer.
A few weeks ago, an article started showing up in my Twitter feed and my email inbox. The title was admirably straightforward, “The emotional arcs of stories are dominated by six basic shapes,” and the authors were all mathematicians or computer scientists, mostly based in the Computational Story Lab at the University of Vermont. Having generated “emotional arcs” by assigning sentiment scores within a 10,000-word sliding window (using a tool delightfully named the Hedonometer), the team then classified the arcs into six major “modes” using singular value decomposition (SVD). Because the team was able to duplicate these “modes” as clusters via both hierarchical clustering and unsupervised machine learning, and because randomly shuffled “word salad” versions of the texts did not generate similar arcs, the authors feel justified in their conclusion that “these specific arcs are uniquely compelling as stories written by and for homo narrativus.”
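To make the sliding-window step concrete, here is a minimal sketch in Python. It is not the authors' code: the tiny word list, the function name `emotional_arc`, and the window and step sizes are all assumptions invented for illustration, standing in for the far larger lexicon and the 10,000-word window described in the paper.

```python
import math
from typing import Dict, List

# Hypothetical toy lexicon (word -> happiness score); an assumption for
# illustration only, not the word list used by the paper's tool.
TOY_SCORES: Dict[str, float] = {
    "joy": 8.0, "love": 8.5, "happy": 8.3,
    "grief": 2.0, "death": 1.5, "sad": 2.4,
}


def emotional_arc(words: List[str], window: int, step: int,
                  scores: Dict[str, float] = TOY_SCORES) -> List[float]:
    """Return the mean score of the rated words in each sliding window."""
    arc: List[float] = []
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, step):
        rated = [scores[w] for w in words[start:start + window] if w in scores]
        # A window with no rated words has no defined sentiment.
        arc.append(sum(rated) / len(rated) if rated else math.nan)
    return arc


# A toy "text" that starts cheerful and ends grim.
text = "joy love happy the the grief death sad".split()
arc = emotional_arc(text, window=3, step=1)
```

Nothing in a procedure like this checks whether its input is a story: any sequence of words, narrative or not, comes out the other end as an arc.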
Well — ok. This is extraordinarily similar to Matt Jockers’s Syuzhet project (right down to the identification of 6-7 basic plot types), which the authors never mention by name in the body of the article; but perhaps the conventions of citation are sufficiently different for computer scientists that this does not read to them as unethical. It’s a huge and essentializing claim, which makes many humanities scholars uneasy; but what is DH for if not to push us toward more ambitious arguments? There may be issues with generalizing about all stories from what is at best a corpus limited by time and nationality (Gutenberg, for which I have a deep and abiding love, is nonetheless very Eurocentric); but it’s still better, surely, to put forward a suggestion that future research can refine than to make the kind of contentless claim — “both affirms and subverts,” “both endorses and undermines” — we used to see in the bad old days of anti-empiricist criticism, the kind of claim that aims for nuance but flounders in vagueness. Right?
Right, I think, in principle. But then one looks at the authors’ lists of the top 5 texts associated with the various arcs, and notices something curious. Here’s A Christmas Carol, sure, and A Hero of Our Time; here’s The Adventures of Tom Sawyer and something called Tarzan the Terrible; here’s — Fundamental Principles of the Metaphysic of Morals. Kant? A one-off mistake, perhaps? But wait, it’s Notes on Nursing by Florence Nightingale, and Lucretius’s On the Nature of Things; The Economic Consequences of the Peace and Cookery and Dining in Imperial Rome, which is quite literally a book of recipes — and all these, mind you, drawn from the top five most representative examples of each story arc (although one finds plenty more nonfictional works when one examines the hierarchical tree in the appendix). Kant, for instance, is apparently the fifth most perfect instance of the “Man in a Hole” narrative template (a designation the authors borrowed from Kurt Vonnegut — possibly by way of Jockers again, who cites the same Vonnegut talk); this has a certain hilarious aptness, but is ultimately hard to fathom.
A significant subset of the “stories” analyzed by the authors, then, appear not even to have been narratives, let alone fictional ones. But there’s a more subtle problem with some of the fictional works as well. One might see the name of Balzac or Poe and assume that we are here at least dealing with appropriate narrative texts. But one of those Balzac “narratives” is The Human Comedy: Introductions and Appendix, while Poe’s collected Works make an appearance; there are also anthologies of works that aren’t even by a single author, like Fifty Famous Stories Retold or Humour, Wit, and Satire of the Seventeenth Century. (The authors identify A Primary Reader as “among the most categorical tragedies [they] found,” which would surely alarm the well-intentioned elementary school teacher who composed this story collection for her first-grade charges.) If you try to track an “emotional arc” across one of these anthologies, you may well get a pattern that seems recognizable, but it will be invalidated by the presence of discrete narratives within the text: “The Murders in the Rue Morgue” and “The Balloon-Hoax,” for instance, have distinct and unrelated arcs, but would show up by the authors’ methods as mere moments in a larger “narrative” that does not, in fact, exist — not in authorial intention, and not in readerly experience.
Here you may object, reasonably, that I’m piling on, dismantling an analysis that isn’t necessarily worth the trouble; after all, digital humanists know that we’re supposed to clean up our corpora and retain metadata on texts’ genres, whereas the authors of this paper seem not to have recognized that these steps were important. And it’s true, they surely didn’t realize; but this is exactly my point. For researchers with any sort of background in literary studies, the results obtained here would have provided valuable feedback about the accuracy of the “Hedonometer” tool: if your sentiment analysis algorithm is finding emotional arcs in Kant and cookbooks, it is effectively broken. Precisely because the authors didn’t begin by conceptualizing fiction and non-fiction, or narrative and non-narrative, as discrete categories within “the literary,” the discovery that the two showed the same characteristics under the “Hedonometer” did not register as problematic or even surprising, as, objectively, it should have.
The same goes for countless other morsels of expert knowledge that would, in a successful DH project, provide a sanity check on the plot arc results. Does the sentiment analysis tool adequately deal with irony? If not, can we assume that this washes out in a large data set, or do we need to exercise special care with particular authors or genres? Might focalization choices have predictable effects on textual sentiment that would add more nuance to the Hedonometer’s judgments? How do these story templates, if they prove accurate, interact with more prescriptive theories of plot structure (for instance, the Aristotelian model that the authors mention in their introduction)? These are not follow-up questions, building on the data generated by this research project; they are integral to determining that data’s validity and meaning. If the authors so thoroughly miss an opportunity to say something worthwhile about the humanities, it is because they ignore the concepts and tacit knowledge internal to literary study. Rather than operationalizing literary concepts (character, canonicity, personification, poetic meter) with the help of big(gish) data, they build their “emotional arcs” on shaky conceptual foundations, making even the basic information those arcs purport to provide practically unusable.
I’ve been focusing my criticism on the applied mathematicians who wrote this paper, which may seem to suggest that I believe we humanists would never do this; we’d never let our love for an analytical tool blind us to the weak results it produced, or elide humanistic knowledge in the interest of arresting visualizations. It might be more accurate, though, to say that when we do this, it can’t necessarily be explained by mere ignorance. Rather, DH projects that treat the humanities as a mere data set — rather than a robust mode of inquiry with protocols of its own — are making a calculated decision to use science as a legitimizing tool rather than a truly investigative one. The implication, I think, is that your average non-digital humanist is simply a subpar scientist, and that one can therefore learn more by applying quantitative tools to raw texts than by generating hypotheses and writing programs on the basis of humanistic knowledge. Contrary to many non-DH practitioners who seem to think that this devaluation is DH’s ultimate agenda, I’d argue that nothing could be more fatal to the future of the digital humanities. What a truly empirical humanities would suggest, after all, is that literature (for instance) is worth investigating for its own sake, beyond the potential for monetizing a particular plot structure or trading on cultural capital; as a complex form of human behavior, literature deserves the same conceptual rigor and expertise that we would have no qualms about bringing to, say, the study of traffic planning or coral colonies. (Would a mathematician consider analyzing coral reef distribution without consulting a marine biologist, or indeed even reading any active marine biologists?) To allow papers like this one to represent DH — as, flattered and hopeful, we often do — is to mistake undisciplinary for interdisciplinary research, or indeed to forget that we have a discipline at all. It’s short-term visibility, bought with long-term extinction.
 The authors critique Jockers’s project, linking to his work in a footnote without explicitly mentioning him, when they claim that “other work” on sentiment analysis within texts has confused “the emotional arc and the plot of a story.” Given that the authors label their own emotional arcs with plot-type designations like “Rags-to-riches,” “Cinderella,” “Tragedy,” and “Icarus,” one might suspect that they themselves are not entirely scrupulous about this distinction. Ben Schmidt, in his blog post responding to the paper, also notes the similarity to Jockers’s work.
 Again, Schmidt noticed this too, in a blog post I became aware of after writing this one. He points out, correctly, that the authors’ means of separating fiction from non-fiction — length and download count — are “*terrible* inputs into a fiction/nonfiction classifier.”
 As David McClure pointed out to me in conversation, there is another explanation for this phenomenon: it could be that these non-narrative texts do in fact have emotional arcs that the tool is picking up accurately, and that the authors have discovered something about the affective structure of nonfiction. This is certainly possible, but I think one could only make the argument if the authors had derived their six “emotional arcs” from a corpus that they knew to contain only narratives; in that case, finding evidence of narrative structures in a non-narrative text would suggest that the text did in fact have some narrative component. Because the non-narratives were baked into the corpus from the beginning, though, their arcs presumably influenced the outcome of the SVD analysis, making it hard to argue that these arcs reflect anything distinctive about stories per se.