DBpedia and World Literature

by Robert Jäschke and Frank Fischer

After our blog post “Wikidata Meets World Literature” some might have mumbled into their teacups, “OK, yeah, but what about DBpedia?” Accordingly, let us add some technological diversity to our initial experiment. DBpedia is a great project that follows its very own approach; it is also a few years older than Wikidata. More on the differences between the two projects can be found on Quora.

So, DBpedia and World Literature. Let’s kick off by translating/simplifying the original SPARQL query to match the DBpedia ontology:

SELECT ?s ?desc ?authorlabel
WHERE {
  ?s rdf:type dbo:Book .
  ?s dbo:author ?author .
  ?s rdfs:label ?desc FILTER (lang(?desc) = "en") .
  ?author rdfs:label ?authorlabel FILTER (lang(?authorlabel) = "en") .
}

This query returns a list of resources that have been assigned the class Book in the DBpedia ontology, along with their authors. (Yes, just click, this will auto-execute the query and show you the results.)
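For readers who would rather script the query than click, here is a minimal sketch in Python. It only builds the request URL and sends nothing. It assumes that the public endpoint lives at https://dbpedia.org/sparql and accepts the `query` and `format` parameters; the name `build_query_url` and the `LIMIT 20` are our own additions for illustration.

```python
from urllib.parse import urlencode

# Assumption: address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# The book/author query, with explicit prefixes so it does not
# depend on the endpoint's predefined namespaces.
BOOK_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?s ?desc ?authorlabel
WHERE {
  ?s rdf:type dbo:Book .
  ?s dbo:author ?author .
  ?s rdfs:label ?desc FILTER (lang(?desc) = "en") .
  ?author rdfs:label ?authorlabel FILTER (lang(?authorlabel) = "en") .
}
LIMIT 20
"""


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query,
        # Media type of the SPARQL 1.1 JSON results format.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_query_url(BOOK_QUERY)
```

Passing `url` to `urllib.request.urlopen` and decoding the response with `json.load` should then yield the result rows, provided the endpoint is reachable.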

In our original post we ranked books by the number of Wikipedia language versions in which they appear, with “One Thousand and One Nights” taking away all the glory (for the time being, that is). We can try something similar in DBpedia by counting the number of labels per book: for each Wikipedia language edition from which DBpedia extracted a resource, the resource’s label is stored using the rdfs:label property. This is our query:

SELECT ?s ?desc ?authorlabel (COUNT(DISTINCT ?label) AS ?labelcount)
WHERE {
  ?s rdf:type dbo:Book .
  ?s rdfs:label ?label .
  ?s dbo:author ?author .
  ?s rdfs:label ?desc FILTER (lang(?desc) = "en") .
  ?author rdfs:label ?authorlabel FILTER (lang(?authorlabel) = "en") .
}
GROUP BY ?s ?desc ?authorlabel
ORDER BY DESC(?labelcount)

Unfortunately, the output is a bit disappointing:

s | desc | authorlabel | labelcount
:The_Adventures_of_Tom_Sawyer | “The Adventures of Tom Sawyer”@en | “Mark Twain”@en | 12
:Strange_Case_of_Dr_Jekyll_and_Mr_Hyde | “Strange Case of Dr Jekyll and Mr Hyde”@en | “Robert Louis Stevenson”@en | 12
:Fifty_Shades_of_Grey | “Fifty Shades of Grey”@en | “E. L. James”@en | 12
:Gray’s_Anatomy | “Gray’s Anatomy”@en | “Henry Gray”@en | 12
:Sense_and_Sensibility | “Sense and Sensibility”@en | “Jane Austen”@en | 12
:A_Brief_History_of_Time | “A Brief History of Time”@en | “Stephen Hawking”@en | 12
:Mein_Kampf | “Mein Kampf”@en | “Adolf Hitler”@en | 12
:Crime_and_Punishment | “Crime and Punishment”@en | “Fyodor Dostoyevsky”@en | 12
:Catching_Fire | “Catching Fire”@en | “Suzanne Collins”@en | 12
:David_Copperfield | “David Copperfield”@en | “Charles Dickens”@en | 12
:Mansfield_Park | “Mansfield Park”@en | “Jane Austen”@en | 12
:Les_Liaisons_dangereuses | “Les Liaisons dangereuses”@en | “Pierre Choderlos de Laclos”@en | 12
:Murder_on_the_Orient_Express | “Murder on the Orient Express”@en | “Agatha Christie”@en | 12
:The_Sign_of_the_Four | “The Sign of the Four”@en | “Arthur Conan Doyle”@en | 12
:The_Republic_(Plato) | “The Republic (Plato)”@en | “Plato”@en | 12
:The_Communist_Manifesto | “The Communist Manifesto”@en | “Friedrich Engels”@en | 12
:From_the_Earth_to_the_Moon | “From the Earth to the Moon”@en | “Jules Verne”@en | 12
:Buddenbrooks | “Buddenbrooks”@en | “Thomas Mann”@en | 12
:The_Brothers_Karamazov | “The Brothers Karamazov”@en | “Fyodor Dostoyevsky”@en | 12
:The_Trial | “The Trial”@en | “Franz Kafka”@en | 12

It seems that books have been extracted from only 12 language editions (ar, de, en, es, fr, it, ja, nl, pl, pt, ru, zh). Or only these 12 language editions feature a page template for books. Or … well, let’s stop speculating and have a look: DBpedia features a page with statistics about the extracted data, and its “Cross-Language Instance Overlap” table shows that there are, for instance, 268 books appearing in 16 language editions. However, not all datasets from all language versions are available at the public SPARQL endpoint. The list of loaded datasets shows that a “labels” dataset is currently loaded for exactly the 12 language versions mentioned above, which is why no book can have more than 12 labels. For the same reason, other properties that could serve as a proxy for the number of language editions, such as dbo:abstract or owl:sameAs, show the same behaviour. (For an overview of potential properties take a look at the DBpedia entry on “The Adventures of Tom Sawyer”.)
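To see for oneself which language editions contribute labels to a given book, one can collect the language tags from the result of a simple label query. The sketch below assumes the result arrives in the standard SPARQL JSON results format, where each literal carries its tag under the `xml:lang` key. The function name `label_languages` is ours, and `SAMPLE` is a hand-written three-row excerpt for illustration, not actual endpoint output.

```python
def label_languages(results, variable="label"):
    """Return the sorted, distinct language tags bound to one variable."""
    tags = set()
    for binding in results["results"]["bindings"]:
        value = binding.get(variable)
        # Plain literals without a language tag have no "xml:lang" key.
        if value is not None and "xml:lang" in value:
            tags.add(value["xml:lang"])
    return sorted(tags)


# Hand-written sample in the SPARQL JSON results layout (illustrative only).
SAMPLE = {
    "head": {"vars": ["label"]},
    "results": {"bindings": [
        {"label": {"type": "literal", "xml:lang": "en",
                   "value": "The Adventures of Tom Sawyer"}},
        {"label": {"type": "literal", "xml:lang": "de",
                   "value": "Die Abenteuer des Tom Sawyer"}},
        {"label": {"type": "literal", "xml:lang": "fr",
                   "value": "Les Aventures de Tom Sawyer"}},
    ]},
}

languages = label_languages(SAMPLE)
print(languages)  # → ['de', 'en', 'fr']
```

Run against the full label list of a book from the public endpoint, the length of the returned list should correspond to the label count reported by the query above.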

When looking for alternatives we found the PageRank dataset by Andreas Thalhammer. Fortunately, it is deployed on the official DBpedia SPARQL endpoint and so, instead of counting the number of language editions, we can easily use the PageRank of each page within the English Wikipedia as a measure of importance:

PREFIX vrank: <http://purl.org/voc/vrank#>

SELECT ?s (SAMPLE(?desc) AS ?label) (GROUP_CONCAT(?authorlabel; separator=", ") AS ?author) (MAX(?v) AS ?rank)
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
  ?s rdf:type dbo:Book .
  ?s vrank:hasRank/vrank:rankValue ?v .
  ?s dbo:author ?author .
  ?s rdfs:label ?desc FILTER (lang(?desc) = "en") .
  ?author rdfs:label ?authorlabel FILTER (lang(?authorlabel) = "en") .
}
GROUP BY ?s
ORDER BY DESC(?rank)

We modified the query so it groups authors of multi-author books into one row. Despite some Wikipedia-related distortions, the result is much more meaningful:

s | label | author | rank
:The_World_Factbook | “The World Factbook”@en | “Central Intelligence Agency” | 146.277
:Systema_Naturae | “Systema Naturae”@en | “Carl Linnaeus” | 68.6522
:Natural_History_(Pliny) | “Natural History (Pliny)”@en | “Pliny the Elder” | 64.9683
:On_the_Origin_of_Species | “On the Origin of Species”@en | “Charles Darwin” | 56.7624
:The_Rolling_Stone_Album_Guide | “The Rolling Stone Album Guide”@en | “Anthony DeCurtis, Dave Marsh” | 47.9486
:Don_Quixote | “Don Quixote”@en | “Miguel de Cervantes” | 45.3332
:All_Music_Guide_to_Jazz | “All Music Guide to Jazz”@en | “Vladimir Bogdanov (editor), Stephen Thomas Erlewine” | 44.641
:Alice’s_Adventures_in_Wonderland | “Alice’s Adventures in Wonderland”@en | “Lewis Carroll” | 42.4669
:All_Music_Guide_to_the_Blues | “All Music Guide to the Blues”@en | “Vladimir Bogdanov, Stephen Thomas Erlewine” | 40.9857
:Records_of_the_Grand_Historian | “Records of the Grand Historian”@en | “Sima Qian” | 40.262
:Nineteen_Eighty-Four | “Nineteen Eighty-Four”@en | “George Orwell” | 39.9243
:The_Republic_(Plato) | “The Republic (Plato)”@en | “Plato” | 38.9124
:The_Wealth_of_Nations | “The Wealth of Nations”@en | “Adam Smith” | 37.4529
:Euclid’s_Elements | “Euclid’s Elements”@en | “Euclid” | 36.0581
:Paradise_Lost | “Paradise Lost”@en | “John Milton” | 32.8596
:Moby-Dick | “Moby-Dick”@en | “Herman Melville” | 32.632
:Encyclopædia_Iranica | “Encyclopædia Iranica”@en | “Ehsan Yarshater” | 30.9694
:Dracula | “Dracula”@en | “Bram Stoker” | 29.6592
:Histories_(Herodotus) | “Histories (Herodotus)”@en | “Herodotus” | 29.4831
:The_Hobbit | “The Hobbit”@en | “J. R. R. Tolkien” | 29.4576

These are the top 20 books, ranked by their PageRank within the English Wikipedia. One thing still blocking the view a bit is, of course, the presence of popular encyclopedic and reference works. The many, many in-links earned by the “World Factbook” or the “Rolling Stone Album Guide” presumably say little about literary quality. Separating the literary works from this set of (all kinds of) books would be the obvious next step.
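As a footnote on the PageRank query: without grouping, a book with several authors would come back as several rows, one per author. The following Python sketch mirrors what the grouping, author concatenation and descending sort do to such rows. The function name `group_authors` and the tuple layout are our own assumptions; the three sample rows reuse values from the table above.

```python
from collections import defaultdict


def group_authors(rows):
    """Collapse (book, author, rank) rows into one row per book.

    Authors are joined with ", " and the highest rank per book is kept;
    the result is sorted by rank, highest first.
    """
    authors = defaultdict(list)
    ranks = {}
    for book, author, rank in rows:
        authors[book].append(author)
        ranks[book] = max(rank, ranks.get(book, rank))
    grouped = [(book, ", ".join(names), ranks[book])
               for book, names in authors.items()]
    return sorted(grouped, key=lambda row: row[2], reverse=True)


# One row per (book, author) pair, as an ungrouped query would return them.
rows = [
    ("Don Quixote", "Miguel de Cervantes", 45.3332),
    ("The Rolling Stone Album Guide", "Anthony DeCurtis", 47.9486),
    ("The Rolling Stone Album Guide", "Dave Marsh", 47.9486),
]

top = group_authors(rows)
```

Here `top` holds two rows, with both authors of the album guide joined into a single string and that book ranked ahead of “Don Quixote”.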
