Authorship Verification on the Testimonium Flavianum: Data and Challenges
February 27, 2025
Authorship Verification is a special case of a branch of research known as Authorship Attribution. The goal of Authorship Attribution is to quantify the writing style of an author in such a way that, by comparing it with the style of a text of unknown authorship, one can assign that text to an author. In the usual case of Authorship Attribution, a set of candidate authors is given together with their respective writing styles (typically extracted beforehand by analysing a set of texts by each author), and one must assign an unknown text to one of these authors. To do so, you compare the writing style of the text with the writing style of each author and assign the text to the author whose style is most similar.
The special case of Authorship Verification occurs when we don't have a set of candidate authors but just a single author, together with a list of texts written by this author. An unknown text must now be classified as either written by this author or not. The reliability of Authorship Attribution rests on a relative judgement: the writing style of the unknown document is more similar to that of author A than to those of the other candidates. As the set of candidate authors shrinks, this conclusion becomes less and less reliable, and in the extreme case of Authorship Verification, where there is no other candidate to compare against, completely different approaches are needed.
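The difference between the two settings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: writing styles are represented as sparse feature vectors (dicts), cosine similarity stands in for whatever similarity measure is actually used, and the names `attribute`, `verify` and the `threshold` parameter are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors given as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(unknown, profiles):
    """Closed-set attribution: return the candidate with the most similar style."""
    return max(profiles, key=lambda author: cosine(unknown, profiles[author]))


def verify(unknown, profile, threshold):
    """Verification: a single profile, so an absolute threshold is required.

    The threshold is an assumption of this sketch; choosing it well is
    exactly the hard part of the verification setting.
    """
    return cosine(unknown, profile) >= threshold


# Toy style vectors (made-up feature weights, for illustration only).
profiles = {
    "author_a": {"f1": 0.9, "f2": 0.1},
    "author_b": {"f1": 0.1, "f2": 0.9},
}
unknown = {"f1": 0.8, "f2": 0.2}

best_candidate = attribute(unknown, profiles)
accepted = verify(unknown, profiles["author_a"], threshold=0.5)
```

With several candidates only the ranking of the similarities matters. With a single author the answer depends entirely on the threshold, and without comparison texts there is no obvious way to calibrate it.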
In our analysis of the Jewish Antiquities and the Testimonium Flavianum we face exactly this problem. We are given a set of texts by the author Flavius Josephus (the Jewish Antiquities without the Testimonium Flavianum) and want to quantify whether Josephus also wrote the text in question (the Testimonium Flavianum). There is no comparison corpus of texts by other authors against which we could compare. Indeed, artificially generating such a comparison corpus would probably undermine the reliability of our findings, as it is not trivial to create an authentic comparison corpus for a set of ancient texts that covers the same topics, belongs to the same genre and originates from the same period.
To tackle the Authorship Verification problem we need a way to quantify the writing style of an author. In the field of Natural Language Processing many different writing style features have been proposed, but one kind that is especially promising for Authorship Verification is so-called character n-grams, as Keselj et al. and Koppel and Winter outline (see the reference section for details).
The set of character n-grams of a text is the set of all sequences of n consecutive characters that appear in this text. For example, the Testimonium Flavianum starts with "At that time there". This phrase would be transformed into the following character 4-grams (meaning we choose n=4):
['At t', 't th', ' tha', 'that', 'hat ', 'at t', 't ti', ' tim', 'time', 'ime ', 'me t', 'e th', ' the', 'ther', 'here']
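A minimal Python sketch of this extraction step is shown here. The function name `char_ngrams` is our own choice for illustration, and the sketch keeps the n-grams in order of appearance, including repeats and whitespace, rather than reducing them to a set.

```python
def char_ngrams(text, n):
    """Return all character n-grams of `text` in order of appearance.

    A window of width n slides over the text one character at a time,
    so a text of length m yields m - n + 1 n-grams (none if m < n).
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = char_ngrams("At that time there", 4)
print(grams)
```

Wrapping the result in `set(...)` gives the set of distinct n-grams. Keeping the list instead preserves the counts, which are needed once frequencies are compared.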
The correct choice of the parameter n is essential for the result of the analysis. Generally, values between 3 and 6 seem to yield good results. As detailed in Determining If Two Documents Are Written by the Same Author by Koppel and Winter, n=4 is especially effective for English texts. The big advantage of character n-grams is their simplicity. They can be extracted without much effort in every language and have been shown to be as effective as, if not better than, alternative features (e.g. function words or part-of-speech n-grams).
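To get a feeling for how n changes the resulting features, the following sketch builds a relative-frequency profile for several values of n. It is an illustration only: the name `ngram_profile` and the optional `top` cut-off are assumptions of this sketch and not the exact procedure of any of the cited papers.

```python
from collections import Counter


def ngram_profile(text, n, top=None):
    """Map each character n-gram to its relative frequency in `text`.

    If `top` is given, only the `top` most frequent n-grams are kept
    (their frequencies are still relative to all n-grams in the text).
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(top)}


sample = "At that time there"
for n in range(3, 7):
    profile = ngram_profile(sample, n)
    print(n, len(profile))
```

On a real corpus the number of distinct n-grams grows quickly with n, so larger values of n give more specific but much sparser profiles.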
In the following blog post we first want to extract the character n-grams for different values of n and see whether we can gain any insights from these simple analyses. In later steps we want to employ more sophisticated Authorship Verification techniques taken from competitions.
Todos
- double-check writing and grammar
References
- Vlado Keselj et al.: n-gram-based author profiles
- Moshe Koppel and Yaron Winter: Determining If Two Documents Are Written by the Same Author
- Mirco Kocher: Text clustering with styles
- Efstathios Stamatatos: Authorship Verification: A Review of Recent Advances