Empirical Data on Over-Length Books

by Frank Fischer and Ubbo Veentjer

(Little note upfront: The technical part of this posting involves Docker, RDF→XSLT→JSON, Elasticsearch and Kibana. More on this at the bottom of this page or without further ado on our GitHub repo. Our solution is the result of nothing more than a 4-hour hackathon, so don’t expect anything polished.)

As a preliminary for a study of over-length books (novels, primarily) we wanted to gather more empirical data. There is still this “List of longest novels” in the English Wikipedia, but this list is problematic, because it’s obviously canon-driven and uses completely different measures to define the extent of books. We also gathered some evidence on our own, a list with novels of more than a thousand pages, on another blog (the explanatory text there is in German, watch out). Our list is sorted chronologically, since this is one aspect we have to take into account when planning to pen something like a “History of the Over-Length Novel”.

Using the number of pages as measurement is, of course, part of the problem. It would be better to work with number of letters or words per work, but this is not (yet?) part of bibliographic metadata. So, number of pages. How do we access them, large-scale?

German National Library Enters the Stage

Let’s start with this panoramic view of the German Library in Leipzig, predecessor and now part of the German National Library (DNB) – please also note the book towers on the left side:

Panoramic view of the German national Library, source: Wikimedia Commons.
Source: Wikimedia Commons (CC BY-SA 3.0)

Really nice, yes, but it’s another kind of picture of the actual German National Library that we’re going to show you, one that is much more to the point. When we came across the data-service page of the German National Library, we were almost enthusiastic. They offer different kinds of sets in different formats (all licenced under CC0 1.0!), from which we chose the “DNBTitel.rdf.gz” one comprising the records for all books/items archived at the DNB library (the dump was generated on March 10, 2016, and is 1,5 GB in size, which makes for an uncompressed 21,3 GB). There’s no SPARQL endpoint (yet?) by which users could query the catalogue data directly (but you can register for access via OAI and SRU). So, as said before, we decided to download the RDF file and started to build our own query thing.

Now, the other picture of the German National Library we wanted to show you is this:

HDT matrix front view: subjects meet objects.

This is the matrix view of the RDF file we downloaded, a 3D scatter plot of triples generated with HDT-it!. Each predicate there has a different colour. In this front view, subjects meet objects, but we can see a pink predicate gleaming through, we’re looking at ISBD element P1053 (“has extent”), the thing we’ll be looking at in the following:

HDT matrix front view: pink predicate gleaming through.

Some Results

Btw, we started this little project as part of a spontaneous 4-hour hackathon last July, at the R&D department of the Göttingen State and University Library. Now, just about a year later, we decided to wrap it up a bit and show you how we made it work.

First and foremost, you have to be aware of the history of the German National Library to know what you can expect from any query result. The gist of it is: They were a bit late compared to other national libraries in Europe and began collecting “all German and German-language publications from 1913, foreign publications about Germany, translations of German works, and the works of German-speaking emigrants published abroad between 1933 and 1945” (quoting the official “About us” page).

Before we bore you with how we did it, let’s go for some results. As a proof of concept, let’s see which authors are the ones with the most books in the catalogue (dcterms:creator), let’s generate a top 25:

Bar chart: 25 authors with most books in the German National Library.

In the bar chart above, the authors are identified by their GND records. Very well then, let’s resolve them:

DNB Identifier Number of Records Author
118540238 5792 Goethe, Johann Wolfgang von
118617443 3881 Steiner, Rudolf
118577166 3402 Mann, Thomas
11855042X 3336 Hesse, Hermann
11856515X 3227 Konsalik, Heinz G.
118637479 2995 Zweig, Stefan
118542257 2840 Grimm, Jacob
118578537 2732 Marx, Karl
118542265 2721 Grimm, Wilhelm
118530380 2504 Engels, Friedrich
12002179X 2478 Schaal, Eric
118607626 2405 Schiller, Friedrich
118618725 2133 Storm, Theodor
118514768 2132 Brecht, Bertolt
118559230 2100 Kafka, Franz
118559206 2086 Kästner, Erich
118818651 2080 May, Karl
118613723 2070 Shakespeare, William
11856109X 1828 Keller, Gottfried
118512676 1818 Böll, Heinrich
118601024 1794 Rilke, Rainer Maria
118587943 1781 Nietzsche, Friedrich
118534262 1755 Fontane, Theodor
118533436 1677 Fischer, Marie Louise
118520628 1674 Christie, Agatha

Looks plausible, in a way. And interesting enough for an interpretation (which, for the time being, we won’t deliver). One thing becomes clearer now, though, we should really talk about “items”, rather than “books”. Photographer Eric Schaal, for example, didn’t get into this top 25 by writing more than two thousand books. To be honest, we made a top 25 just to get at least some women into this board of men, ranking 24th and 25th. And stating the obvious, Goethe also didn’t write almost six thousand books since 1913, what really puts weight on the authors is the substantial number of re-editions, of course.

Number of Books/Items in the Catalogue

We’ve got 11.373.862 items altogether (some didn’t make it into the Elasticsearch index since we didn’t really address error handling or validation; the regexps in our XSLT weren’t perfect either, things we can improve next time we’re not high-speed hackathoning). Anyway, in 5.874.504 cases we successfully parsed the isbd:P1053 element (= ”has extent”) into a usable number of pages, summing up to a total number of 969.846.170 pages. The max number of pages in this set is 2.711.111, which is obviously the result of a metadata apocalypse, somebody must have slipped on the 1 key (this is the book that’s said to have more than two million pages) (ed. 05/27/16: n° of pages has been corrected today, they notified us via Twitter, nice!).

You can glean from our XSLT file that we’re only using extent information if we found a number succeeded by “ S.” (= ”pages”) in the isbd:P1053 string. So we’re not using at least 5.499.358 items out of the 11.373.862. Either they had no information on the book extent/number of pages or we didn’t parse it because we just used a very basic pattern. But with this simple method we still managed to cover 51,65% of all the books/items stored in the German National Library. We can sure improve our data extraction, but for the time being we’re good with what we have. After all, we’re still speaking about almost six million book records.

Now onto some more meaningful stuff. Let’s take five major publishers and compare them just by looking at the extent of their books. We held a little powwow and, full of intentional bias, chose Aufbau, Eichborn, Hanser, Rowohlt, Suhrkamp. Let’s have a look at the number of items per publisher in the catalogue:

Bar chart: 5 publishers, number of items.

This is an interesting perspective, also if this comparison makes not much sense. For all we know, there could be thousands of re-editions involved. Well, okay.

Average Number of Pages per Book per Publisher

Bar chart: 5 publishers, average number of pages.

So obviously, the average Suhrkamp book beats the average Rowohlt book beats the average Eichborn book when considering the number of pages. This comparison is intriguing, but it shouldn’t be taken too literally, we’re dealing with a certain amount of incorrect metadata as we’ll see in the next set of lists.

Longest Books per Publisher

This brings us a bit closer to our goal. But these lists also show the problems of erroneous data and the need for additional metadata. If we want to look into over-length novels, we’ll obviously need another indicator, one of which is not provided by the current DNB dataset. And once again, the lists clarify that it’s more appropriate to speak of “item” than of “book”, but now let’s start with the rankings:

Aufbau

DNB Identifier Number of Pages Author: Title (Year)
988488205 1359 S. Vikram Chandra: Der Pate von Bombay (2009)
1001932447 1291 S. Lew Tolstoi: Krieg und Frieden (2010)
576696420 1286 S. Alexej Tolstoi: Der Leidensweg (2. Aufl., 1955)
576696439 1286 S. Alexej Tolstoi: Der Leidensweg (3. Aufl., 1959)
1011565994 1243 S. Hans Fallada: Wolf unter Wölfen (2011)
98848823X 1227 S. Lew Tolstoi: Anna Karenina (2008)
945188846 1211 S. Friedrich Gorenstein: Der Platz (1995)
988488272 1200 S. Fjodor Dostojewski: Die Brüder Karamasow (2008)
949346470 1183 S. Lew Tolstoi: Anna Karenina (1996)
451896025 1174 S. G. W. F. Hegel: Ästhetik (1955)

Eichborn

DNB Identifier Number of Pages Author: Title (Year)
946561486 1814 S. (wrong number of pages)
967526825 1222 S. Leo Tolstoi: Krieg und Frieden (2003)
950298603 1081 S. Rolf Vollmann: Die wunderbaren Falschmünzer (1997)
988571005 991 S. Daniel Schwartz: Schnee in Samarkand (2008)
979687187 954 S. Paul Verhaeghen: Omega minor (2006)
974540919 855 S. David M. Crowe: Oskar Schindler (2005)
979691044 852 S. Laurence Sterne: Leben und Ansichten von Tristram Shandy, Gentleman (2006)
840181604 841 S. Fred Denger: Der grosse Boss (1984)
860929477 841 S. Fred Denger: Der grosse Boss (6. Aufl., 1985)
870140876 841 S. Fred Denger: Der grosse Boss (5. Aufl., 1985)

Hanser

DNB Identifier Number of Pages Author: Title (Year)
1022146394 4587 S. (wrong number of pages)
99886398X 1810 S. Walter Doberenz; Thomas Gewinnus: Visual C# 2010 (2010)
98401098X 1806 S. Walter Doberenz; Thomas Gewinnus: Borland Delphi 7 (2007)
998863955 1802 S. Walter Doberenz; Thomas Gewinnus: Visual Basic 2010 (2010)
760043035 1672 S. Joseph von Eichendorff: Werke (4. Aufl., 1971)
451062442 1606 S. Joseph von Eichendorff: Werke (2. Aufl., 1959)
451062817 1590 S. Joseph von Eichendorff: Werke (1. Aufl., 1955)
453424740 1511 S. Eduard Mörike: Sämtliche Werke (3. Aufl., 1964)
780204875 1511 S. Eduard Mörike: Sämtliche Werke (5. Aufl., 1976)
970961294 1469 S. Uwe Bünning; Jörg Krause: Windows XP Professional (3. Aufl., 2004)

Rowohlt

DNB Identifier Number of Pages Author: Title (Year)
944325807 2253 S. Klaus Harpprecht: Thomas Mann (1995)
945394659 2253 S. Klaus Harpprecht: Thomas Mann (16.–30. Tsd., 1995)
967713358 2026 S. Karl Corino: Robert Musil (2003)
101781905X 1723 S. Péter Nádas: Parallelgeschichten (2012)
1028105657 1723 S. Péter Nádas: Parallelgeschichten (Taschenbuch, 2013)
1008548022 1719 S. Rolf Hochhuth: Essayistische Prosa und Gedichte (2011)
575594950 1671 S. Robert Musil: Der Mann ohne Eigenschaften (1952)
961281588 1642 S. Rolf Hochhuth: Alle Erzählungen, Gedichte und Romane (2001)
457661054 1632 S. Robert Musil: Der Mann ohne Eigenschaften (1970)
575594969 1632 S. Robert Musil: Der Mann ohne Eigenschaften (23.–29. Tsd., 1960)

Suhrkamp

DNB Identifier Number of Pages Author: Title (Year)
946102384 3980 S. (wrong number of pages)
945262094 2909 S. (wrong number of pages)
991420225 2569 S. Amos Oz: Die Romane (2009)
988814668 2085 S. E. M. Cioran: Werke (2008)
986493635 1909 S. Marguerite Duras: Die Romane (2008)
986531766 1840 S. Thomas Bernhard: Die Romane (2008)
991398939 1838 S. Hermann Hesse: Die Erzählungen und Märchen (2009)
988840758 1782 S. Max Frisch: Romane, Erzählungen, Tagebücher (2008)
998413925 1782 S. Bertolt Brecht: Prosa (2013)
1008349852 1735 S. Alejo Carpentier: Die Romane (2011)

We won’t comment these lists now, although they make for some interesting discussions. You can obviously do much more with what we built here and we’ll certainly get back to this later. So let’s close this blog post with a short note on how we built this bridge from the freely available DNB catalogue data to the results shown above.

German National Library Goes Elasticsearch

The somewhat weird original idea we had when initiating the hackathon was this: We wanted to know how much the German National Library weighs on books, and we wanted to find out by the number of pages of all the books it stores which we then would have multiplied by the average weight of a book page. Well, you can do the maths yourself now, you can find the total number of pages we counted above and then extrapolate, don’t forget to divide this by two, it will be a good enough approximation.

What saves us time now is that we already described our mechanism on GitHub where we also provide all the info you need to rebuild our machine. We basically reorganised the whole thing as a Docker project, which will create a container running Elasticsearch/Kibana. The repo also features shell scripts for downloading the current version of the German National Library title catalogue. Some selected data fields from every book in that catalogue are then transformed into JSON and pushed to the Elasticsearch instance. After that you will be able to query the DNB catalogue data with Elasticsearch to create nice outputs with Kibana. As should be clear from the text above, the data fields we’re focusing on are mainly the number of pages per book and some book metadata (author, title, year, publisher, etc.) for identification.

That’s it for now. And lest we forget, special thanks to Max Brodhun and Carsten Thiel for writing the XSLT and helping with the shell scripting, “as quick as boiled asparagus”, so to speak, it was only because of them that we could go on with what we actually wanted to do. The next Uludağ is on us! ;)

Share on Twitter, Facebook, Google+
Prev Next

Comments