Finding needles in 34 million haystacks

Erik Fredner; Nov 9, 2019

We are working on a new collaboration with the Smithsonian Institution about the histories of fame and celebrity in the United States. To ground ourselves in public discourse surrounding these topics, we began by analyzing ProQuest's Historical Newspapers corpus. Working with datasets this large---tens of millions of records, billions of tokens---posed a number of technical and intellectual challenges, which are precisely the sort of subjects for which *Techne* was created. This post is the first in a series on the challenges of scale posed by this project.

Our main goal in working with the newspapers is to understand more about how they represent---and, in so doing, create---concepts of fame and celebrity. In the contemporary moment, we might think of something like the style section of the New York Times, magazines, or tabloids as contributing to the print production of celebrity. Of course, newspapers seem quite peripheral to contemporary celebrity: Future historians will likely be trawling through Instagram and YouTube archives to understand its modern formation. But this project is interested in the history of celebrity in the U.S. as formulated in periods when newspapers played a more central role. Perhaps too often, that history is written around major figures like P.T. Barnum, whose particular fame can be incautiously treated as structurally similar to fame generally. While Barnum is no doubt important, we know that the history of celebrity is stranger and more multifaceted than that---not least because most famous people didn't run circuses.

First, the scale of the data: The Historical Newspapers corpus advertises more than 55 million pages of content. Across those "pages"---which vary greatly in size---we have at least 16.5 billion words. In order to get a handle on huge numbers like these, we often use "a Proust" as an informal unit of measure in the Literary Lab. Proust's In Search of Lost Time---which takes up 6 volumes and about 10 inches of my shelf in the Modern Library translation---totals over 1 million words. ProQuest's Historical Newspapers collection contains at least 14,000 Prousts.

One does not simply click on a folder containing 16.5 billion words. Even overpowered computers would struggle to list that many files. Worse still, Stanford's supercomputer, Sherlock, caps not only the amount of disk space one can use (which is perfectly reasonable), but also the total number of files a given user can store (also reasonable, but, for our purposes, annoying). As happens often enough in text analysis, the absolute amount of data in gigabytes is less problematic than the massive number of small files to be analyzed.

In this case, the unusually high number of unusually small files led to an unusual corpus structure. ProQuest's newspapers come compressed as a large number of tarballs (which are compressed files similar to .zip archives), each of which contains about 25,000 tiny XML files. Each of these XML files in turn contains a single article or other text from a single issue of a single newspaper. Besides articles, these texts include advertisements, poems, and so on. The American subset of the ProQuest corpus we have been working with contains 34 million such files spread across hundreds of tarballs.

Usually, our corpora exist as directories of text files with some form of associated metadata stored in a tabular format like CSV. For instance, the corpus from Pamphlet 8 (Algee-Hewitt and McGurl) is enormous by literary critical standards---containing several hundred novels from across the twentieth century---but it is computationally simple to work with compared with ProQuest. Sometimes differences in degree amount to differences in kind, and that is certainly the case when we compare that by-now-familiar scale of several hundred novels to ProQuest's thousands of Prousts.

Because we could not decompress these files on disk without either a) crashing Sherlock, b) timing out, or c) exceeding our file quota, we had to figure out how to work with them while they were still compressed. I hadn't done this before, and had to develop a new workflow for it, which I will be publishing on GitHub for the second post in this series. It turns out that working with compressed files is trivially easy, and one of the major conclusions of this work for me so far may be a shift in how I imagine storing corpora on disk. Once they get up to even a medium size, it may make sense to work with text files in compressed form (especially if your hard drive, like mine, is overcrowded with old text files and bad results you never got around to deleting). To access a file, you merely need an extra column in your metadata table indicating the archive where it is located. But I'm getting ahead of myself.
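
To make the "extra column" idea concrete, here is a minimal sketch using only Python's standard library. It is not the workflow promised for the second post: the column names `archive` and `member`, and both function names, are hypothetical choices made for illustration.

```python
import csv
import tarfile


def load_metadata(csv_path):
    """Read the metadata table into a list of dicts, one per article."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def read_article(row):
    """Return the raw bytes of the article described by one metadata row.

    Assumes two hypothetical columns: `archive` (path to the tarball) and
    `member` (the file's name inside that tarball).
    """
    # Mode "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(row["archive"], mode="r:*") as archive:
        handle = archive.extractfile(row["member"])
        if handle is None:  # the member is a directory or other non-file entry
            raise ValueError(f"{row['member']} is not a regular file")
        return handle.read()
```

One caveat on the design: finding a single member in a compressed tarball still means scanning through the archive, so when many files are needed it is cheaper to group the metadata rows by archive and open each tarball once.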

We considered multiple options to solve this problem of scale. Again, we did not lack for storage space on Sherlock, but rather we were bumping up against the system's limit for the total number of files in our workspace. One method we considered and decided against would have extracted all of the individual files in an archive into a single large text file. All of the articles, advertisements, obituaries, etc., which are currently stored as individual XML files with ProQuest's rich metadata, would be added to a single text document, and could be parsed from there. That would have solved our problem of not being allowed to have millions of tiny files extracted on disk. These aggregate files would have been more similar in size to the text files we ordinarily work with, maxing out at a few dozen megabytes. But as an approach, dumping them all into big text files seemed, to use a term of art, dumb.

We soon learned that Python has a built-in module for working with compressed files that does exactly what we needed. Using this module, you can iteratively read compressed files into memory, manipulate them, and move on without decompressing the archive. To me, this seemed like a bit of wizardry: it's as if you could scan a piece of paper in a filing cabinet without ever opening the drawer.
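
The module is not named above; the standard library's `tarfile` fits the description, so the sketch below assumes it. It walks one tarball, parses each XML file in memory, and moves on without writing anything to disk. The `Title` tag is a hypothetical stand-in, since ProQuest's actual schema is not shown in this post.

```python
import tarfile
import xml.etree.ElementTree as ET


def iter_articles(archive_path):
    """Yield (member name, parsed XML root) for each XML file in a tarball."""
    # Mode "r:*" detects the compression; nothing is extracted to disk.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".xml"):
                continue  # skip directories and non-XML files
            handle = archive.extractfile(member)
            try:
                root = ET.parse(handle).getroot()
            except ET.ParseError:
                continue  # skip malformed XML rather than abort the whole run
            yield member.name, root


def extract_title(root):
    """Return the text of the first `Title` element (hypothetical tag name)."""
    return root.findtext(".//Title")
```

Because `iter_articles` is a generator, only one article is held in memory at a time, which is what makes this pattern workable for archives holding tens of thousands of files.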

Working with files at this scale had another unexpected advantage in that it required us to be a bit more efficient with our programming than usual. (After all, English Ph.D.s are rarely evaluated on their code's runtime...) Because text files are so small, suboptimal programming is usually not a serious problem. After all, the absolute amount of time required to run inefficient code is usually less than the amount of time it might take for a frankly mediocre programmer like myself to optimize it. Far better programmers than I have taken up a line from Donald Knuth as a mantra in this area: "...premature optimization is the root of all evil (or at least most of it) in programming." That was certainly not the case here. The difference between functional-but-bad and actually decent code might be a week of processing time when working with this many files.

Combining lightly optimized code with Slingshot, Ryan Heuser's custom Python wrapper for the Message Passing Interface (MPI) on Sherlock, we were able to extract complete metadata from all 34 million files in just a few hours. MPI, as David McClure has already discussed on Techne, distributes computational tasks across a variety of different cores on different machines. After the processing is complete, we stream all of the individual results back together. In this case, we split the metadata extraction process across 200 of Sherlock's cores, returning 200 individual datasets that can be combined into a massive metadata table that describes the whole corpus. Using that gigantic table, which pandas handles with what can only be described as grace under pressure, we can then subset the articles by any number of factors given by the metadata: Give me all articles published south of the Mason-Dixon line between 1850 and 1860. Let me see all of the advertisements published by the New York Times during the Civil Rights Era. Give me every obituary containing the phrase "robber baron."
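
The split-and-combine logic can be sketched without MPI at all. The example below is dependency-free and illustrative only: it does not use Slingshot or MPI, whose interfaces are not shown in this post, and plain lists of dicts stand in for the pandas table. The function names and the `year` and `object_type` columns are hypothetical.

```python
def shard(items, rank, size):
    """Return the items that worker `rank` (0-based) of `size` workers should process."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    # A strided slice gives every worker a near-equal share with no overlap.
    return items[rank::size]


def combine(partial_tables):
    """Concatenate the per-worker lists of metadata rows into one table."""
    combined = []
    for table in partial_tables:
        combined.extend(table)
    return combined


def select(rows, start_year, end_year, object_type=None):
    """Keep rows dated within [start_year, end_year], optionally of one type.

    `year` and `object_type` are hypothetical column names.
    """
    return [
        row for row in rows
        if start_year <= row["year"] <= end_year
        and (object_type is None or row["object_type"] == object_type)
    ]
```

With 200 workers, worker 0 would take tarballs 0, 200, 400, and so on. Once the partial tables are combined, a request like "every obituary from 1850 to 1860" becomes `select(rows, 1850, 1860, "Obituary")`.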

Our first step in analysis had to involve subsetting the corpus, since it is too large to be easily (or usefully) manipulated in its entirety. The most obvious thing to do relative to our research question would have been to look for famous people in our texts. But as Ryan Cordell has pointed out, messy optical character recognition (OCR) causes researchers to unduly miss (and, more rarely, falsely identify) documents containing instances of their keywords. His titular example is typical: Only a literary scholar accustomed to reading bad OCR might recognize "Q i-jtb the Raven" as a corruption of the Poe line "quoth the raven." ProQuest suffers from the same sort of OCR problems. OCR accuracy varies depending on the quality of the initial images, the text processing algorithms used, the age of the paper and the type, etc. Some articles have very high rates of accuracy, with more than 95% of their tokens appearing in a validation dictionary, while others are much lower. Stock reports, perhaps unsurprisingly given that they contain tables, graphs, and other unusual formatting, fare worst of all.
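
One simple way to compute an accuracy figure of this kind is the share of an article's tokens that appear in a word list. The sketch below is an assumption about method, not a description of ours: the tokenizer (runs of letters, lowercased) and the tiny dictionary are both chosen for illustration.

```python
import re

# Runs of ASCII letters, with an optional internal apostrophe (e.g. "didn't").
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def dictionary_hit_rate(text, dictionary):
    """Return the fraction of tokens in `text` found in `dictionary`.

    `dictionary` is a set of lowercase words. Returns 0.0 for text with no tokens.
    """
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


# A toy validation dictionary; a real one would hold many thousands of words.
words = {"quoth", "the", "raven"}
clean_score = dictionary_hit_rate("Quoth the Raven", words)
garbled_score = dictionary_hit_rate("Q i-jtb the Raven", words)
```

Here the clean line scores 1.0, while the garbled one scores 0.4, because only "the" and "raven" survive out of five tokens.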

As this project is interested in documenting the history of American celebrity, we wanted to begin with people's names, specifically names that frequently recurred across the corpus. Of course, names are among the likeliest words to have serious OCR errors: many rare names do not exist in validation dictionaries, and we would need to successfully capture two well-formed names in a row to get an unambiguous hit, or potentially even more if we are looking at someone conventionally called by three or more names like Harriet Beecher Stowe. Worse still, we would be over-counting people with names likely to be in the validation dictionaries ("John Smith"), and under-counting people with names less likely to be there ("Olaudah Equiano").

Although the bodies of the articles have the usual OCR imperfections Cordell addresses, we discovered that the titles are all but perfect. They appear to have been entered by hand. We don't know the names of the people who did this tremendous amount of work, but we want to express our gratitude to each and every one of them. Given the excellent reliability of the titles, we used natural language processing techniques to identify named entities that appeared there. Of course, newspaper titles have a grammar and style of their own---"Headless body found in topless bar"---and the accuracy of the named entity recognition may have suffered some as a result of the unusual dependency parsing.

Fortunately, our results appear to err in the direction of over-inclusivity. It seems likely that the number of names missed in this process is lower than the number of false positives it returned; the named entity recognition process favors tagging text as a possible person even in what appear to us to be unlikely instances. For example, the two most common "people" in the titles were "N.Y." and "N.J.," which must refer in almost all cases to New York and New Jersey.

As a first approximation, we assumed that if a person appears in article titles with high relative frequency, they are more likely to be a celebrity. No doubt this overrates some celebrities and underrates others. But it is a start with the data we have. We totaled up the number of instances of unique names, and then manually filtered the results for actually existing historical persons. We did this by hand as a group. It's not always the case that you can read through a table containing tens of thousands of rows, but dividing that important if repetitious work up is one of the many advantages of the Lab's group research model. Reading through the results led to a series of provocative historical questions about the nature of celebrity. Many of the identifications we tasked ourselves with were simple: If the words "Malcolm X" appear in an article headline written after Malcolm Little's birth year, that unambiguously refers to a specific person, or to an entity named for that person. After all, the existence of "Malcolm X School" is hardly evidence against Malcolm X's fame.
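
The tallying step might look something like the following sketch. It assumes the named entity recognition has already produced a list of person names for each title. Counting a name at most once per title, and dividing by the total number of titles, are illustrative choices rather than a record of what we did.

```python
from collections import Counter


def rank_names(names_per_title):
    """Rank names by the number of titles in which they appear.

    `names_per_title` is an iterable of lists of name strings, one list per
    title. Returns (name, title count, share of all titles) tuples, most
    frequent first.
    """
    counts = Counter()
    n_titles = 0
    for names in names_per_title:
        n_titles += 1
        counts.update(set(names))  # count each name at most once per title
    return [(name, count, count / n_titles) for name, count in counts.most_common()]
```

The ranked list is only the input to the manual pass described above: a string like "N.Y." would rank near the top and still have to be thrown out by a human reader.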

Other cases were more ambiguous. "Sam Jones" appeared a large number of times in titles, but there are several historical persons named Sam Jones who might have appeared in newspaper titles at overlapping times. The next step in such cases is to read passages from the articles containing "Sam Jones" to identify which Sam they refer to. We also ran into cases that veered into questions of ontology. For instance, what to make of the possible celebrity of metaphorical or fictional persons? "Uncle Sam" and "John Bull"---fictional personifications of the U.S. and England, respectively---both rank highly. "Jim Crow" is a minstrelsy character, as well as a depraved legal regime. If we take his novel seriously, "Don Quixote" demands to be considered a real person. Though these names are all frequent, are they celebrities? Our manual coding scheme accounted for this ambiguity by adopting a provisional distinction between a historically specific person and a real person. "Muhammad Ali" is both specific and real. "Uncle Sam" is specific but not real. "Joe Shmoe" is neither specific nor real. We would not go to the mat for this distinction at a philosophy conference, but it's good enough to deal with these edge cases, of which there were very few. My personal favorite of these ambiguous cases was John P. Grier, the too-human name of a famous racehorse.
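
The two-part distinction is easy to represent as a pair of flags on each coded name. The class and field names below are hypothetical, and only the three examples coded explicitly above are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedName:
    """One manually coded name from the titles (hypothetical structure)."""
    name: str
    specific: bool  # refers to one historically specific figure
    real: bool      # that figure was an actually existing person


coded = [
    CodedName("Muhammad Ali", specific=True, real=True),
    CodedName("Uncle Sam", specific=True, real=False),
    CodedName("Joe Shmoe", specific=False, real=False),
]

# Keep only names that are both specific and real.
candidates = [entry.name for entry in coded if entry.specific and entry.real]
```

Keeping the two flags separate, rather than collapsing them into a single yes/no column, leaves room to come back later to the specific-but-not-real cases like "Uncle Sam."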

Using the data we have generated, we will be able to extract sub-corpora of articles that mention specific real people who appear frequently in titles, as well as those frequently cited in histories of American celebrity. We have P.T. Barnum's articles, but also more frequently referenced yet less well known individuals like Kelly Miller, Lillian Russell, and Marian Anderson. The next step in our research is to learn about the forgotten histories, processes, and conceptualizations of celebrity in these documents, and how they differ from one celebrity to the next.