In the first post of this 2-part series we introduced six features of Wikipedia pages on writers that might help us figure out if those pages are, in fact, about (literary) writers. Today we want to discuss which of these features are actually suited to reliably identify writers in Wikipedia. Our goal is to build a set of contributors to world literature that we can further investigate (and keep updated, as Wikipedia evolves). Instead of providing a conclusive answer here, this post will concentrate on observations that could eventually help to implement an approach for the automated identification of literary writers in Wikipedia.

First Sentence

The first sentence on the Wikipedia Page of John Irving

This feature is available throughout all language versions.
The automated analysis of sentences requires some sophisticated natural language processing which should also be available for different languages. The quality of such approaches highly depends on the chosen language.
The results are quite ambiguous, since even within one language version there is no standardised way to describe writers.
DBpedia has implemented a framework for fact extraction which could potentially extract this information.

The table of contents on the Wikipedia page of John Irving

At first sight, a “Bibliography” section seems to be the prevalent way to list original works of an author in the English Wikipedia. But not all writers have one. Also, the name of the section is not consistent (e.g., in the article on John Milton it is called “Works”). “Bibliography” can also refer to a list of secondary literature (see the article on Gabriel García Márquez).
Albeit its ambiguity, finding the “Bibliography” headline in the wiki source code is easy. Finding equivalent headings in other language versions would require some effort, though.
It is unclear if and how this feature is used in other language versions.
The section structure of articles is currently not provided by DBpedia.

Occupation Property

The infobox on the Wikipedia page of John Irving

This property often contains several (ambiguous) values without further distinctions (“writer”, “novelist”, “poet”, etc.).
Values of this property are highly language-dependent, it is rather difficult to make use of it in a cross-language environment.
Not all language editions permit infoboxes for writers (among others, the German edition).
If available, the infobox properties are extracted by DBpedia. They can be found in the infobox properties dataset and in the mapping-based properties dataset. The latest dataset contains this information in the infobox properties dataset:

<http://dbpedia.org/resource/John_Irving> <http://dbpedia.org/property/occupation> "Novelist"@en .
<http://dbpedia.org/resource/John_Irving> <http://dbpedia.org/property/occupation> "Screenwriter"@en .

List of Works

The "John Irving" template on the Wikipedia page of John Irving

In our example page on John Irving, this feature was implemented using a specific John Irving template. Other popular/canonised writers also have that kind of template, yet in general, this feature is not very widespread.
There are categories for such templates, like Category:Novelist navigational boxes, making it easy to check which writers have such a template.
Apparently, there are only two other language editions with such a template, Romanian and Farsi.

Writer Template

The source code of the "writer" template of the Wikipedia page of John Irving

Not all language editions have or use such template. E.g., the German Wikipedia does not permit the writer template and, thus, its page on Johann Wolfgang von Goethe lacks an infobox.
Each page can be assigned to exactly one template, which means that some persons do not have a writer template, although we’d clearly like to have them in our set. This includes Franz Kafka (neutral person template) and Umberto Eco (tied to a philosopher template).
The association of Wikipedia pages to templates has been extracted by DBpedia in the instance types dataset. The value is provided in the type property. For John Irving, it contains the following information:

<http://dbpedia.org/resource/John_Irving> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Writer> .

Conclusion

Summarising the above findings in a table, we get the following result:

feature	language editions	usage within language editions	DBpedia dataset	effort	ambiguity
first sentence	all	all	-	high	high
table of contents	all	some	-	low	medium
occupation property	some	many	infobox properties	medium	high
list of works	few	few	-	low	low
categories	all	many	article categories and skos categories	high	high
writer template	some	many	instance types	low	low

Given the large variety of properties the different features have, it is quite difficult to devise an approach to identify writers on Wikipedia that works across different language versions. This 2-part blog post was a preliminary introduction to get a sense of the problem. We proposed and applied a solution that basically works across different language versions in a still unpublished paper (on which we will keep you posted) and will introduce a pragmatic and manageable set that can be used for a variety of purposes in one of our next blog posts.

weltliteratur.net

A Black Market for the Digital Humanities

Contributors to World Literature – Identifying Writers in Wikipedia, Part II

First Sentence

Table of Contents

Occupation Property

List of Works

Categories

Writer Template

Conclusion

Robert Jäschke

Frank Fischer