Last week I looked at some of the clusters of words that fluctuate together across narrative time in the Lab’s corpus of ~27k American novels. A lot of these are pretty semantically “legible,” in the sense that it’s not hard
A hierarchical cluster of words across narrative time
I wanted to pick back up quickly with that list of the 500 most “non-uniform” words at the end of the last post about word distributions across narrative time in the American novel corpus. Before, I just put these into
Distributions of words across narrative time in 27,266 novels
Over the course of the last few months here at the Literary Lab, I’ve been working on a little project that looks at the distributions of individual words inside of novels, when averaged out across lots and lots of texts.
Counting words in HathiTrust with Python and MPI
In recent months we’ve been working on a couple of projects here in the Lab that make use of the Extracted Features data from HathiTrust. To help kick off the Lab’s new Techne series, I wanted to take a look at some of the programming patterns we’ve been using that make it easier to work with these kinds of large data sets – namely the “Message Passing Interface” (MPI), a set of semantics for spreading out programs in large computing grids.