How to Identify Key Passages in Literary Works? Using Algorithms and Machine Learning!

by Frederik Arnold and Robert Jäschke

Context

In the DFG-funded project What matters? Key Passages in Literary Works (which is part of the special priority programme Computational Literary Studies), we set out to identify and characterize key passages in literary works.

We understand key passages as passages that are particularly important to expert readers when interpreting texts. In a mixed-methods approach, we investigate empirically which textual characteristics of literary genres can be revealed through patterns of citation and quotation.

Corpus

Our main corpus consists of two literary works, Die Judenbuche by Annette von Droste-Hülshoff and Michael Kohlhaas by Heinrich von Kleist, together with 44 and 49 scholarly articles, respectively. Fortunately, we could build on the previous work of the ArguLIT project and their annotation of all direct quotations.

Automatic Identification of Quotations

Scholarly texts contain different types of quotations, ranging from verbatim quotes of single words to longer quotations spanning multiple sentences, as well as indirect quotations in the form of summarizations or re-narrations. In the first phase of the project, we focused on the automatic identification and linking of direct quotations, starting with quotations of five or more words. In Lotte and Annette: A Framework for Finding and Exploring Key Passages in Literary Works [1], we outline the current landscape for text reuse detection and the development of our tool Quid. Although a number of tools already exist, we found that all of them had limitations for our specific use case. We evaluated Quid on our corpus and compared it to the existing tools.

                   Die Judenbuche              Michael Kohlhaas
Approach           Precision  Recall  F1       Precision  Recall  F1
BLAST              0.59       0.61    0.60     0.37       0.59    0.45
Copyfind           0.85       0.75    0.79     0.76       0.79    0.78
SimilarityTexter   0.91       0.64    0.76     0.83       0.74    0.79
Textmatcher        0.69       0.37    0.48     0.68       0.42    0.52
Quid               0.82       0.90    0.86     0.70       0.90    0.78

Table 1. Comparison of different approaches for text reuse detection with an evaluation on our corpus.
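
To make the task of detecting direct quotations of five or more words more concrete, here is a minimal sketch of one common technique: indexing word n-grams of the literary text and extending matches found in the scholarly text. This is an illustration under simplifying assumptions (exact matches only, naive tokenization), not a description of how Quid or any of the compared tools works internally. The function names and the example sentences are invented for this sketch.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?\"'()[]„“”»«")
        if token:
            tokens.append(token)
    return tokens


def find_shared_runs(source_tokens, target_tokens, min_len=5):
    """Return (source_start, target_start, length) for shared token runs.

    Only runs of at least min_len identical tokens are reported.
    """
    # Index every n-gram of length min_len in the source (literary) text.
    index = {}
    for i in range(len(source_tokens) - min_len + 1):
        key = tuple(source_tokens[i:i + min_len])
        index.setdefault(key, []).append(i)

    matches = []
    j = 0
    while j <= len(target_tokens) - min_len:
        starts = index.get(tuple(target_tokens[j:j + min_len]))
        if not starts:
            j += 1
            continue
        # Simplification: take the first seed and extend it to the right.
        i = starts[0]
        length = min_len
        while (i + length < len(source_tokens)
               and j + length < len(target_tokens)
               and source_tokens[i + length] == target_tokens[j + length]):
            length += 1
        matches.append((i, j, length))
        j += length
    return matches


# Invented example sentences, not taken from the corpus.
literary = "the old beech tree stood at the edge of the village for many years"
scholarly = 'The critic notes that the tree "stood at the edge of the village" and moves on.'
print(find_shared_runs(tokenize(literary), tokenize(scholarly)))
```

Real quotations often contain omissions, spelling variants, or inserted words, which exact n-gram matching of this kind does not handle.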

Considerably more difficult to identify are quotations that are shorter than five words. In A Novel Approach for Identification and Linking of Short Quotations in Scholarly Texts and Literary Works [2], we develop and compare two approaches to tackle this challenge, ProQuo and ProQuoLM.

For ProQuo, we use the (page) references for long quotations as examples to distinguish (page) references for short quotations from other text in parentheses, such as references to the Bible or to other literary works. We then relate short quotes to their source in the literary work by determining the relationships between the quotes and the references. We also use the positions of long quotes as guides to link short quotations to the correct passage of the literary work.
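
The idea of using long quotes as guides can be illustrated with a small sketch. It assumes that a short quote may occur several times in the literary work and simply picks the occurrence closest to a long quotation that has already been linked. This is a simplified illustration, not the actual ProQuo procedure; the function names and the `long_quote_offsets` input are hypothetical.

```python
def find_all(text, phrase):
    """Return the start offsets of every occurrence of phrase in text."""
    offsets = []
    start = text.find(phrase)
    while start != -1:
        offsets.append(start)
        start = text.find(phrase, start + 1)
    return offsets


def link_short_quote(literary_text, short_quote, long_quote_offsets):
    """Pick the occurrence of short_quote closest to a linked long quotation.

    long_quote_offsets: character offsets in literary_text of long quotations
    that appear near the short quote in the scholarly text (hypothetical input).
    Returns the chosen offset, or None if the phrase does not occur.
    """
    candidates = find_all(literary_text, short_quote)
    if not candidates:
        return None
    if not long_quote_offsets:
        return candidates[0]  # no guide available: fall back to first occurrence
    return min(
        candidates,
        key=lambda c: min(abs(c - anchor) for anchor in long_quote_offsets),
    )


# Invented toy text: the phrase occurs at offsets 3 and 16.
text = "aa the beech bb the beech cc"
print(link_short_quote(text, "the beech", [20]))
```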

For our second approach, ProQuoLM, we fine-tune a German BERT for classification. First, we identify potential short quotes, and then use the fine-tuned model to filter them.
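
The two steps, finding candidates and then filtering them, might look roughly like the following sketch. It assumes that candidates are strings between quotation marks with fewer than five words and that they must occur verbatim in the literary text; both are assumptions made for illustration. The `classifier` argument is a hypothetical stand-in for the fine-tuned model, here replaced by a trivial function so that the example runs without any model.

```python
import re

# Matches text between common German or English quotation marks.
QUOTE_PATTERN = re.compile(r'[„“»"]([^„“”»«"]+)[“”«"]')


def short_quote_candidates(scholarly_text, max_words=4):
    """Return quoted strings that contain at most max_words words."""
    candidates = []
    for match in QUOTE_PATTERN.finditer(scholarly_text):
        quote = match.group(1).strip()
        if 0 < len(quote.split()) <= max_words:
            candidates.append(quote)
    return candidates


def filter_candidates(candidates, literary_text, classifier):
    """Keep candidates that occur in the literary text and pass the classifier.

    classifier: any callable returning True or False for a candidate string
    (a placeholder for a fine-tuned model, which is not included here).
    """
    return [c for c in candidates if c in literary_text and classifier(c)]


# Invented example sentences, not taken from the corpus.
scholarly = ('He calls him „a poor dog“ and speaks of "justice", '
             'but „this is a much longer quotation here“ does not count.')
literary = "justice was done to a poor dog"

candidates = short_quote_candidates(scholarly)
# Dummy classifier for demonstration only: accept strings longer than 7 characters.
print(filter_candidates(candidates, literary, lambda quote: len(quote) > 7))
```

A real classifier would likely also look at the context around the candidate, since a single quoted word can just as well be a term, a title, or scare quotes.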

                   Die Judenbuche              Michael Kohlhaas
Approach           Precision  Recall  F1       Precision  Recall  F1
Baseline           0.65       0.78    0.71     0.59       0.75    0.66
ProQuo             0.87       0.72    0.79     0.87       0.66    0.75
ProQuoLM           0.88       0.75    0.81     0.87       0.69    0.77

Table 2. Evaluation results of our two approaches compared to a baseline which always links a quotation from the scholarly work to the first matching instance in the literary work.
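
For readers who want to see what such a baseline and its evaluation could look like, here is a minimal sketch. The baseline links a quotation to its first occurrence in the literary text; precision, recall, and F1 are then computed by comparing predicted links with gold links. Representing a link as a `(quote_id, offset)` pair is an assumption of this sketch, and the toy data is invented.

```python
def link_first_match(literary_text, quote):
    """Baseline: link a quote to its first occurrence, or None if absent."""
    offset = literary_text.find(quote)
    return offset if offset != -1 else None


def precision_recall_f1(predicted, gold):
    """Compare predicted and gold links, each a set of (quote_id, offset) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy data: "the beech" occurs at offsets 3 and 16.
text = "aa the beech bb the beech cc"
predicted = {("q1", link_first_match(text, "the beech")),
             ("q2", link_first_match(text, "the beech"))}
gold = {("q1", 3), ("q2", 16), ("q3", 20)}
print(precision_recall_f1(predicted, gold))
```

In the toy data, the baseline links the second quote to the wrong occurrence, which is exactly the kind of error that arises when a short phrase appears more than once in the literary work.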

QuidEx – Visualization and Exploration

We created QuidEx, a website for visualization and exploration of the results, which is shown in this screenshot:

[Figure: screenshot of the QuidEx website]

On the left, there’s a heatmap that displays the distribution of quoted passages across the entire literary work. The darker an area, the more frequently it has been quoted, suggesting its significance. Right beside the heatmap is the literary work itself. The shade of gray indicates how many scholarly works quote any part of a key passage, so it remains constant for the entire key passage. The font size is adjusted based on how often a minimal segment is quoted. At the bottom, alongside the literary text, there’s a list of all scholarly works.
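
The counts behind such a view can be sketched as follows. The sketch assumes that each quotation is given as a character span in the literary text and treats a "minimal segment" as a maximal stretch of text with the same quotation count; that reading and all names are assumptions made for illustration, not a description of the QuidEx code.

```python
def quote_counts(text_length, quoted_spans):
    """Count, for each character position, how many quotations cover it.

    quoted_spans: (start, end) offsets into the literary text, end exclusive.
    """
    counts = [0] * text_length
    for start, end in quoted_spans:
        for pos in range(max(start, 0), min(end, text_length)):
            counts[pos] += 1
    return counts


def minimal_segments(counts):
    """Group adjacent positions with equal, non-zero counts.

    Returns a list of (start, end, count) tuples, end exclusive.
    """
    segments = []
    start = 0
    for pos in range(1, len(counts) + 1):
        if pos == len(counts) or counts[pos] != counts[start]:
            if counts[start] > 0:
                segments.append((start, pos, counts[start]))
            start = pos
    return segments


# Invented example: two overlapping quotations in a text of 10 characters.
counts = quote_counts(10, [(0, 4), (2, 6)])
print(counts)
print(minimal_segments(counts))
```

The per-segment count could then be mapped to a font size, and the number of distinct scholarly works per passage to a shade of gray.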

The source code of the website is available as a white-label version which facilitates adoption by others, for example, for the comparison of intertextual relations of literary texts.

Summary and Outlook

In the first phase of the project, we developed tools to identify, link, visualize, and explore direct quotations of all lengths. In August 2023, the project went into its second phase, titled Is Expert Knowledge Key? Scholarly Interpretations as Resource for the Analysis of Literary Texts in Computational Literary Studies. One important task we are currently working on is the identification and linking of indirect quotations, that is, summarizations and re-narrations.

For daily key passages, follow us on Bluesky or try Quid online with our web interface.

  1. Lotte and Annette have since been renamed to Quid and QuidEx, respectively. 

  2. Accepted at JCLS 2023 and soon to be published. 
