How many novels have been published in English? (An Attempt)
Erik Fredner; Mar 14, 2017
Not for the first time, I find myself wanting to know how big the field of the novel is. Granted, finding the precise number of novels published in English is impossible. And even if we had an exact figure, the number of published novels doesn't directly address the question of the genre's cultural extent since it wouldn't account for self-publishing, personal writing shared among friends, fan fiction, etc. Nevertheless, having an approximate answer to this question seems useful for two reasons: First, I genuinely didn't know at what order of magnitude the field of the novel operates. Is the number of novels in the tens or hundreds of millions? Or is it shockingly modest---maybe just a few hundred thousand? Second, this question is worth asking because the order of magnitude matters. We know that we only study a tiny portion of the novel field, and that what we do study is deliberately nonrepresentative. Knowing the scope of our reading in comparison to the field as a whole gives us a better sense of how circumscribed our claims about "the novel" are. Asking about the "representativeness" of our samples connotes a quantitative humility.
So, an attempt: According to Bowker's Books in Print, there were 2,714,409 new books printed in English in 2015. Of these, just 221,597 (8.2%) were classified as fiction. This alone surprised me---I had always assumed that fiction controlled a significantly larger portion of publishing considering how much of the global conversation about books is driven by it. But, based on a Nielsen report, the ratio of fiction releases to sales is not one-to-one; even though only about 8% of publishing is fiction, the category accounts for 23% of all book sales. (Also worth noting here is another surprise from that report, at least from the perspective of someone cloistered in an English department: just 47% of Americans buy books of any kind in any format, and a huge number of them were adult coloring books last year.)
Bowker's 2015 ratio (8.2% of publishing categorized as "fiction") does not seem to have been too far outside of the norm for the last six years:
As this chart shows, the data Bowker collects on book publication has varied dramatically over the last sixteen years, starting with that sharp uptick in 2009-10 before declining again. It's hard to know whether that spike accurately reflects a year of unprecedented book publication, or if instead it measures a change in Bowker's counting methodology. After all, 2009-10 seems like it would have been a bad time economically to quintuple your book printing. But it also came near the beginning of on-demand printing---those physical reprints of out-of-copyright texts by no-name publishers, sometimes literally just printing off scanned page images from Archive.org and gluing them together. This could have greatly inflated the number of "new" books being printed, but it's hard to tell what percentage of the texts from those years and after fall into that category.
Thankfully for our purposes here, the absolute variance in Bowker's data does not particularly matter since what we need is not a count of books but rather a ratio of fiction to total print production. During this period, fiction was never more than 16.3% (2004) nor less than 2.5% (2010) of a given year's printing. On average, about 11% of books published in a given year were fiction. Without the outlier years, that dips slightly to 10.6%.
That gives us a ratio of fiction to total print production within the contemporary book market. Roughly 1 in 10 books printed will be categorized "fiction," a set that contains a range of materials, including novel-length literary fiction, novellas, short story collections, young adult novels, romance, science fiction, fantasy, translations, etc.
To the best of my knowledge we lack a reliable means of estimating historical fiction/nonfiction print ratios. So, my first major assumption will be to map the average contemporary ratio of fiction to total print production from Bowker onto a measure of total print production. Clearly this will produce a *very* rough result. But if we can assume that the contemporary moment reflects an average or lower-than-average ratio of fiction to nonfiction print production, then using the current ratio will point us toward a larger goal of this exercise: estimating the number of novels in English without overshooting the mark.
Given the contemporary ratio of fiction to nonfiction in print, we now need to know how many printed books we ought to be considering. If you filter Google Books for English language works today, the search engine returns an estimated 189 million books, roughly 146% of Google's 2010 estimate of total extant volumes globally (that is, a 46% increase). Of course, this too underestimates the field since it presumably only references the collections Google has access to. Applying the ratio of fiction production we derived from Bowker to this measure of total published output would leave us with about 18.9 million books in English that Bowker would categorize as fiction.
"Books" turns out to be a key word in that last sentence. Bowker's "Books in Print" is precisely what it sounds like: a database about books published, not works written. Because it privileges the book-as-product, the dataset does not allow us to easily differentiate between new novels and new versions of an extant novel. I use "versions" rather than "editions" or "printings" advisedly since Bowker tracks paper-and-ink books, on-demand printing, digital copies of trade books, Kindle Direct Publishing (Amazon's self-publishing wing), etc. Localization also plays a major role in counting: The Color Purple and The Colour Purple count as two books, though I'm quite sure we would think of them as one novel. Worse, there is no way to readily decide from the metadata whether two different editions of a book titled The Portrait of a Lady both contain the same Henry James novel. We could infer sameness from a set of fuzzy title-author matches, but that becomes immediately ambiguous: Could we set parameters to reliably find that an item titled *The Portrait of a Lady: A Novel* is the same as Portrait of a Lady? Or, as a more genuine question of literary history rather than one about metadata, should we count *The Portrait of a Lady: New York Edition* as the same novel as The Portrait of a Lady?
To suss out the contents of the fiction category we need a sample of the total texts. I copied the first 500 records (the max allowed by Bowker) from the 2015 fiction works, sorted by the date the record was last updated. Of the sorting options, this seemed to offer the greatest degree of randomness, though a truly random sample from the 222,686 records would of course have been much better.
Reading through those records, I only recognized a few by title: Blood Meridian (McCarthy), Plainsong (Haruf), The Diaries of Adam and Eve (Twain), The Savage Detectives, and *2666*. The fact that these last two are by Bolaño shows one clear limit of relying on date edited as a randomizing field.
To give you a sense of the range, here are a few other titles from that group:
- The Book that Proves Time Travel Happens
- Rio de Janeiro! #5
- Chicken and Pickle: Get a Baby
- Everything is Teeth
A huge number of books on the list were movie and television tie-ins from franchises like Star Wars, The Princess Diaries, The Minions, The Avengers, Walking Dead, Shrek, Madagascar, and Doctor Who, among others. But there were also a few other titles that had been classed as fiction, but seem to be about fiction rather than fiction themselves:
- Japanese Science Fiction: Views of a Changing Society
- The Transhuman Antihero: Split-Natured Protagonists in Speculative Fiction from Mary Shelley to Richard Morgan
- The Angel and the Cad: Love, Loss and Scandal in Regency England
Of those 500 items, 211 (42%) were duplicate entries referencing the same work (e.g. *Colour Purple* / Color Purple). Duplicate entries seem to primarily be the result of localizations and book type (hardcover vs. softcover vs. ebook). If we subtract those duplicates and the titles like the ones above about fiction rather than fictional works themselves, that leaves us with 285 possibilities. Of those, if we keep only the items with a top-level BISAC code of Fiction, we're left with 128 possible novels. This seems reasonable if we're interested in the novel as distinct from "juvenile fiction" (275 of the items in the sample).
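However the duplicates in this sample were actually identified, the kind of fuzzy title matching discussed earlier can be sketched in a few lines of Python. This is a minimal illustration and not a method run against Books in Print: the function names, the spelling map, the list of "generic" subtitles, and the 0.9 threshold are all assumptions made up for the example, and it presumes the two records have already been matched on author.

```python
import re
from difflib import SequenceMatcher

# Hypothetical map of localized spellings; a real one would be far longer.
SPELLING_VARIANTS = {"colour": "color"}

# Subtitles assumed to mark a version of a work rather than a new work.
GENERIC_SUBTITLES = {"a novel"}


def normalize_title(title: str) -> str:
    """Lowercase, drop generic subtitles and a leading article, unify spelling."""
    main, _, subtitle = title.lower().partition(":")
    subtitle = subtitle.strip()
    if subtitle and subtitle not in GENERIC_SUBTITLES:
        main = f"{main} {subtitle}"  # keep subtitles that may signal a distinct text
    words = re.findall(r"[a-z0-9]+", main)
    if words and words[0] in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(SPELLING_VARIANTS.get(word, word) for word in words)


def same_work(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Guess whether two titles by the same author name the same work."""
    matcher = SequenceMatcher(None, normalize_title(title_a), normalize_title(title_b))
    return matcher.ratio() >= threshold


print(same_work("The Color Purple", "The Colour Purple"))
print(same_work("The Portrait of a Lady: A Novel", "Portrait of a Lady"))
print(same_work("The Portrait of a Lady: New York Edition", "The Portrait of a Lady"))
```

Under these assumptions the two spellings of Walker's title and the "A Novel" subtitle collapse into one work, while the New York Edition is kept apart. That last outcome is a choice baked into the subtitle list, not an answer to the literary-historical question.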
The genre breakdown of that group in this sample is as follows:
A few items to note about this chart. BISAC Fiction includes a wide range of categories not represented in this sample. This includes non-novel fiction like short stories, anthologies, classics (as in Greek tragedy, not Penguin Classics), etc. So if we assume that the ratio of BISAC Fiction to Bowker Fiction holds over the set, there would still be some percentage of non-novel Fictions, though they do seem to be rare. The subcategory also includes many forms of genre fiction, which, taken together, outweigh so-called General fiction, frequently the label used for literary fiction. Notably, the BISAC Code Fiction / Literary did not appear once in the sample, though some literary fictions did (e.g. *Blood Meridian* was categorized as Western, 2666 as General, etc.).
Based on this sample, roughly 25% of everything Bowker categorized as Fiction could possibly be a novel. That takes the initial figure of 18.9 million possible fictional works down to 4.8 million. If cutting that proportion to get from "books" to "works" seems outlandish, consider this: According to Bowker, Zora Neale Hurston, James Joyce, and Henry James published a combined 123 books of fiction in 2015, 55 years after the youngest of them died. Over the time period that Bowker's full database covers, Henry James has been listed as the author of 14,829 books of fiction. I repeat: Based on the way Bowker counts, Henry James "authored" 14,829 books since the 1990s. Compared to the 23 novels included in the Library of America's complete printing of James's novels, the ratio of unique book-length works by James to books attributed to James is about 0.16%.
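For readers who want to retrace the arithmetic, the chain of multipliers so far can be written out as a short Python sketch. Every input is a rounded figure quoted in this post; the variable names are only illustrative, and the 10% share is the "roughly 1 in 10" ratio, not a number Books in Print reports directly.

```python
# Chain of rough multipliers from the post. Inputs are the rounded
# figures quoted in the text, not fresh queries against any database.

GOOGLE_BOOKS_ENGLISH = 189_000_000  # Google Books estimate for English
FICTION_SHARE = 0.10                # "roughly 1 in 10" books is fiction

SAMPLE_SIZE = 500                   # Books in Print records sampled
DUPLICATES = 211                    # records repeating another work
BISAC_FICTION = 128                 # non-duplicates with top-level BISAC Fiction

JAMES_NOVELS = 23                   # novels in the Library of America set
JAMES_RECORDS = 14_829              # fiction records attributed to James

fiction_books = GOOGLE_BOOKS_ENGLISH * FICTION_SHARE
duplicate_share = DUPLICATES / SAMPLE_SIZE
novel_share = BISAC_FICTION / SAMPLE_SIZE
possible_novels = fiction_books * novel_share
james_ratio = JAMES_NOVELS / JAMES_RECORDS

print(f"Fiction books in English: {fiction_books / 1e6:.1f} million")
print(f"Duplicate share of sample: {duplicate_share:.1%}")
print(f"Share of sample that could be novels: {novel_share:.1%}")
print(f"Possible novels: {possible_novels / 1e6:.1f} million")
print(f"James works-to-books ratio: {james_ratio:.2%}")
```

The 4.8 million figure comes from the unrounded sample share (128 of 500, or 25.6%), which the text rounds to "roughly 25%".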
Among novelists who have had large numbers of unique books printed, James seemed to me like he ought to rank quite highly considering his prolific output. I was curious who, if anyone, had more books to their name, so I looked up some of the highly ranked names from the Literary Lab's Popularity/Prestige project:
Predictably, Shakespeare appears at the top of the list. But quite a few of these placements surprised me: Twain has been printed far more than other 19th century novelists like Austen, Eliot, and Melville, who also wrote a fairly large number of books. Faulkner and Hemingway appear surprisingly far down the list, being published more like J.K. Rowling than Fitzgerald, Woolf, and Cather.
To get at the problem of the "unique novel" we need to cut aggressively against the 4.8 million printed novels proposed earlier to account for authors with many print runs, but here we run to the end of our rope. If we had the data, we could compare the book counts for a list of highly printed novelists against the number of novels each actually wrote, and subtract the overprinting from the total.
But running up against this blockade (or, rather, the inflection point between diminishing returns and increasingly dubious assumptions) allowed me to pause and reflect on what we have learned at this point. We're within an order of magnitude: the total number of novels in English is closer to 5 million than 500,000 or 50 million. We also know that the floor is in the hundreds of thousands since the Library of Congress holds more than 207,000 fiction items and the British Library returns over 390,000 books containing "fiction" anywhere in the object description. For "novel," those numbers are 139,000 and 66,000 respectively---surprisingly small considering the size of the corpora we have become accustomed to working with in the Lab.
I also reflected on whether and how this figure relates to my initial question, considering the data that is actually available. I started off this post by asking, "How many novels have been published in English?" Based on the data I wound up using in the attempt, I rewrote the initial question to see what this process actually "answered." This is as close as I got: "Based on a ratio of fiction to nonfiction production---derived in part from a sampling methodology that selected for works that are likely novels that is then applied to a rough measure of total publication and subselected against non-novel fictions---how many novels have been published in English, within an order of magnitude?" Self-flagellation by qualification.
Imprecise, presentist, and biased toward the published and the archived as it may be, what does having an order of magnitude tell us about the genre of the novel?
To answer that question, it helps to think about that number from the perspective of the reader who opened this essay. If you were something quite a bit more than voracious in your reading, and managed to get through a new novel every day for 50 years without letup, you would have read more than 18,000 "loose, baggy monsters," which is 8% of our lowest estimate and 0.3% of the highest. Literary critics, by contrast with this imagined reader, might know 200 novels quite well, giving them purchase on somewhere between 0.1% and 0.004% of the field. Even as we specialize by nation and century, the comprehensiveness of critics' reading only increases by portions of a percent.
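A quick way to check these proportions is to compute them directly. The sketch below is in Python and rests on two assumptions the post does not spell out: that the "lowest estimate" is the Library of Congress count of 207,000 fiction items, and that the "highest" is the 5 million order-of-magnitude figure. With those bounds the results land close to, though not exactly on, the rounded percentages quoted above.

```python
# Share of the field one reader can cover. The bounds are assumptions:
# the floor is the Library of Congress fiction count, the ceiling is the
# "closer to 5 million" order-of-magnitude figure.

FLOOR = 207_000
CEILING = 5_000_000

YEARS = 50
novels_read = 365 * YEARS   # one novel a day, ignoring leap days
CRITIC_NOVELS = 200         # novels a critic "might know quite well"


def coverage(novels: int, field: int) -> float:
    """Fraction of a field of the given size covered by this many novels."""
    return novels / field


for label, count in (("daily reader", novels_read), ("critic", CRITIC_NOVELS)):
    low_field = coverage(count, FLOOR)
    high_field = coverage(count, CEILING)
    print(f"{label}: {count:,} novels = {low_field:.3%} of the floor, "
          f"{high_field:.4%} of the ceiling")
```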
The question that emerges (and one that cannot be addressed in this space) is whether so little is, in fact, enough. The books that get read in literature departments may exist in a space marginal to the marketplace of books that Bowker tallies, but not in one entirely disconnected from it. When you're trying to understand a phenomenon like the novel from a sample that's both that small and deliberately nonrepresentative, does knowing its broadest dimension oblige us to ask about the other 99.9%?
 This methodology expands on the data set used in Figure 1 of Pamphlet 8, "Between Canon and Corpus" and continues an interrogation from the section on archival bias and representativeness broached in Pamphlet 11, "Canon/Archive." Lastly, it extends a few ideas from a post by Matthew Wilkens: https://mattwilkens.com/2009/10/14/how-many-novels-are-published-each-year/
Queries were performed at http://www.booksinprint.com/ in December of 2016 and should be reproducible with small variance from changes in the database since.
Query structure: Date range 2015-01-01 to 2015-12-31; Language: English. The same date and language filters were applied to each search (changing the years as needed), and adding or removing the Fiction book type.
 Page 32 of this report.
 Unfortunately, the Nielsen report doesn't disaggregate the category of "Americans" to help us better understand this alarming statistic. For instance, does "Americans" include young children who likely don't buy anything?
 See, as an example of the on-demand publishing form, the "Paige M. Gutenberg" machine in the Harvard bookstore: http://www.harvard.com/clubs_services/books_on_demand/
 Notably, the fiction/print ratio in the first five years of this dataset is significantly higher than in the last five years. Hard to say if this reflects a shift in the publishing industry or Bowker's counting.
 Query performed in November, 2016. As of now on the Google frontend, you cannot filter the Google Books corpus for works tagged by language without a character-based query, so this number is an estimate of books in English containing the word "the." As the most common word in English, "the" should put us within the margin of error, though there exist English books without it (e.g. Gilbert Adair's 1995 translation of Georges Perec's A Void, a novel written entirely without the letter "e" both in Perec's French and Adair's English).
 This is quite crude, though changing to an average ratio over the short period of time covered by Bowker does not seem better.
 For those not familiar, in his later years James edited many of his novels for release in a set known as The New York Edition. Many novels underwent fairly substantial changes during the editing process (frequently for the worse in the view of some James scholars).
 It should be pointed out here that this chart represents all books printed and documented by Bowker in all languages for a given author, not just English.
 Randall Munroe of xkcd has also addressed this question from the perspective of reading a subset of authors rather than a genre like the novel, finding that if you read 16 hours a day, you could keep up with "400 living Isaac Asimovs:" https://what-if.xkcd.com/76/
 Or: Wouldn't Sisyphus have wanted to know how high the hill was?