Authorship Verification on the Testimonium Flavianum: Data exploration with character n-grams
February 27, 2025
- procedure of character n-gram extraction
  - space free
  - tf-idf values
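
A minimal sketch of the extraction step above. Assumptions: plain tf-idf (relative term frequency times log(N/df)), not any particular library's smoothed variant; the names `char_ngrams`, `tfidf` and the `keep_spaces` flag are made up for illustration and are not taken from the project code.

```python
import math
from collections import Counter

def char_ngrams(text, n, keep_spaces=False):
    """Return the list of character n-grams of `text`."""
    if not keep_spaces:
        # "space free": drop all whitespace before sliding the window
        text = "".join(text.split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs, n=3, keep_spaces=False):
    """Return one {ngram: weight} dict per document (unsmoothed tf-idf)."""
    counts = [Counter(char_ngrams(d, n, keep_spaces)) for d in docs]
    # document frequency: in how many documents does each n-gram occur?
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for short texts
        weights.append({g: (k / total) * math.log(n_docs / df[g])
                        for g, k in c.items()})
    return weights
```

With this unsmoothed formula an n-gram that occurs in every passage gets weight 0, so only n-grams that separate passages contribute.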
- 18th book
  - preprocessing to same length
- ensemble of character n-grams
- n = [3..9]
  - hardly any change as n varies (see GIF in plots/baseline) => PC1 appears to be driven by "o" => PC2 appears to be driven by "i" => so what would happen with the space-including variant?
  - 105, 109 and 114 are among the few passages containing a standalone "O", as in “Equal to this determination of yours, O Petronius” (114); elements in cluster 1, such as section 4, contain a standalone "I", as in “To be sure, I have spoken about them in the second book of the Jewish War,”
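
A small sketch of how the range n = 3..9 above could be turned into one n-gram profile per n. This is only one reading of "ensemble" (one feature set per n, compared side by side); the function name `ngram_profiles` is hypothetical.

```python
from collections import Counter

def ngram_profiles(text, n_values=range(3, 10)):
    """Return {n: Counter of space-free character n-grams} for each n."""
    text = "".join(text.split())  # space-free variant
    profiles = {}
    for n in n_values:
        profiles[n] = Counter(text[i:i + n]
                              for i in range(len(text) - n + 1))
    return profiles
```

If a single pooled feature space is wanted instead, the per-n counters can simply be summed, since n-grams of different lengths never collide.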
- with spaces included the results do change, but they are no longer easy to explain
  - the first 2 PCA components explain only around 5% of the variance of the dataset => rather not use the space-including variant
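
For reference, a stdlib-only sketch of how the "share of variance explained by the first 2 components" figure can be computed. It uses power iteration with deflation on the covariance matrix, which is a swapped-in technique for illustration and not necessarily what produced the 5% above; for real tf-idf matrices a numerical library would be the sensible choice. The function name and parameters are hypothetical.

```python
import math
import random

def explained_variance_ratio(rows, k=2, iters=200, seed=0):
    """Approximate share of total variance captured by each of the first k PCs."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]  # centered data
    # sample covariance matrix (d x d)
    cov = [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(d)]
           for a in range(d)]
    total = sum(cov[j][j] for j in range(d))  # trace = total variance
    if total == 0:
        return [0.0] * min(k, d)

    def matvec(m, v):
        return [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]

    ratios = []
    for _component in range(min(k, d)):
        v = [rng.gauss(0, 1) for _ in range(d)]
        length = math.sqrt(sum(c * c for c in v))
        v = [c / length for c in v]  # random unit start vector
        for _step in range(iters):
            w = matvec(cov, v)
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            v = [c / norm for c in w]
        lam = sum(a * b for a, b in zip(v, matvec(cov, v)))  # Rayleigh quotient
        ratios.append(lam / total)
        # deflation: remove the component just found
        for a in range(d):
            for b in range(d):
                cov[a][b] -= lam * v[a] * v[b]
    return ratios
```

`sum(explained_variance_ratio(X, k=2))` is then the quantity to compare against the roughly 5% noted above. A fixed iteration count is a simplification; closely spaced eigenvalues may need more steps.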
- what changes do we see with different preprocessing?
  - remove unnecessary punctuation and keep upper/lowercase => results as with space_including
  - remove unnecessary punctuation and lowercase everything => results as with space_free + reasonable PCA
  - upper/lowercase
  - what changes do we see with word n-grams?
  - what changes when using a most-common n-gram profile?
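
The preprocessing variants above could be switched on and off roughly like this. Assumptions: "unnecessary punctuation" is taken to mean everything in `string.punctuation` plus curly quotes (the notes do not define the exact set), and the names `preprocess`, `word_ngrams` and their flags are made up for this sketch.

```python
import string

# Assumption: the punctuation set to strip; not specified in the notes.
PUNCT = string.punctuation + "“”‘’"

def preprocess(text, strip_punct=True, lowercase=False, keep_spaces=False):
    """Apply one combination of the preprocessing options to `text`."""
    if strip_punct:
        text = text.translate(str.maketrans("", "", PUNCT))
    if lowercase:
        text = text.lower()
    if keep_spaces:
        return " ".join(text.split())  # collapse whitespace runs to one space
    return "".join(text.split())       # space-free variant

def word_ngrams(text, n):
    """Return word n-grams over whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Keeping the options as independent flags makes it easy to rerun the same PCA for each combination and compare the plots.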
- English text vs. Greek text
- different clustering solutions?
Main thought
- if we can find a feature set in which the TF looks like a clear outlier => this might indicate that the TF is distinctive with respect to that feature set
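
One possible way to make "outlierish" measurable, as a sketch only: score every passage by its distance to the centroid of all passages in the chosen feature space and express that distance as a z-score. The scoring rule and the name `outlier_zscores` are assumptions for illustration, not an established part of this analysis.

```python
import math
import statistics

def outlier_zscores(vectors):
    """z-score of each vector's Euclidean distance to the overall centroid."""
    d = len(vectors[0])
    centroid = [statistics.fmean([v[j] for v in vectors]) for j in range(d)]
    dists = [math.dist(v, centroid) for v in vectors]
    mu = statistics.fmean(dists)
    sigma = statistics.pstdev(dists)
    # if all distances are equal there is no outlier; return zeros
    return [(x - mu) / sigma if sigma else 0.0 for x in dists]
```

The TF would count as outlierish for a feature set if its score is clearly larger than those of the other passages; the centroid here includes the TF itself, which slightly understates its distance.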