Counting words in HathiTrust with Python and MPI

In recent months we’ve been working on a couple of projects here in the Lab that make use of the Extracted Features data from HathiTrust. To help kick off the Lab’s new Techne series, I wanted to take a look at some of the programming patterns we’ve been using that make it easier to work with these kinds of large data sets, namely the “Message Passing Interface” (MPI), a standard for coordinating programs that are spread out across large computing grids.

Virtual Readers

Often, the most exciting moment of a Lab project occurs when our research takes an unexpected direction: we thought we were doing ‘a’, but it turns out that all along we’ve been doing ‘b’ (or, more often, should have been doing ‘b’). The realization that we’ve discovered something unexpected, the ability to be guided by the research and its results: these are what differentiate a Lab project from the traditional pursuits of the humanities.