Authorship Verification on the Testimonium Flavianum: Data exploration with character n-grams
February 27, 2025
-
procedure of character n-gram extraction
- space free
- tf-idf values
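The extraction step above can be sketched in plain Python. This is a minimal sketch, not the project's actual code: the function names `char_ngrams` and `tfidf` are hypothetical, "space free" is assumed to mean that all whitespace is removed and the text is lowercased, and the weighting is the unsmoothed tf * log(N/df). Libraries such as scikit-learn use a smoothed idf by default, so absolute values would differ.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n):
    """Return space-free, lowercased character n-grams of `text`."""
    # Assumption: "space free" = strip all whitespace before slicing.
    squeezed = re.sub(r"\s+", "", text.lower())
    return [squeezed[i:i + n] for i in range(len(squeezed) - n + 1)]


def tfidf(docs, n):
    """Return one {ngram: tf-idf weight} dict per document."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            gram: (freq / total) * math.log(n_docs / df[gram])
            for gram, freq in c.items()
        })
    return weights


# Toy passages, for illustration only.
passages = ["O Petronius", "the Jewish War", "the second book"]
weights = tfidf(passages, n=3)
```

An n-gram that occurs in every passage gets weight 0 under this variant, which is one reason the resulting matrix is dominated by zeros and small values.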
-
18th book
- preprocessing to same length
-
ensemble of character n-grams
-
n = [3..9]
- hardly any change as n varies (see gif in plots/baseline) => PCA1 = o => PCA2 = i => so what would happen with space-including n-grams?
105, 109 and 114 are among the few passages that contain standalone o’s, as in “Equal to this determination of yours, O Petronius” (114). Elements in cluster 1, such as section 4, contain standalone i’s, as in “To be sure, I have spoken about them in the second book of the Jewish War,”
-
The results do change, but they are no longer easy to explain
- the first 2 PCA components explain only around 5% of the variance of the dataset => so rather not space including => this seems to be because fewer “collisions” happen once casing, periods and commas are kept
-
what changes do we see if we do different preprocessing?
- remove unnecessary punctuation and keep upper/lowercase => results as with space_including
- remove unnecessary punctuation and lowercase everything => results as with space_free + reasonable PCA
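The preprocessing variants above, and the "collision" effect behind them, can be illustrated by counting distinct n-grams per variant. This is a rough sketch under assumptions: `strip_punct` and `vocab` are hypothetical helpers, "unnecessary punctuation" is taken to mean every ASCII punctuation mark, and the punctuation-stripped variants are assumed to be space free.

```python
import re
import string


def strip_punct(text, lowercase):
    """Remove ASCII punctuation; optionally fold case."""
    # Assumption: all ASCII punctuation counts as "unnecessary".
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower() if lowercase else cleaned


def vocab(text, n, keep_spaces):
    """Set of distinct character n-grams, with or without whitespace."""
    if not keep_spaces:
        text = re.sub(r"\s+", "", text)
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Toy sentence: the same two words in different casing and punctuation.
sample = "The war. the War, THE war!"
variants = {
    "space_including, raw": vocab(sample, 3, keep_spaces=True),
    "no punctuation, cased": vocab(strip_punct(sample, False), 3, keep_spaces=False),
    "no punctuation, lowercase": vocab(strip_punct(sample, True), 3, keep_spaces=False),
}
for name, grams in variants.items():
    print(name, len(grams))
```

The more casing and punctuation are kept, the more distinct n-grams the same words produce, so fewer n-grams coincide across passages. That would be consistent with the lower explained variance noted above, though the toy sentence is no evidence for the real data.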
-
What changes do we see if we use word n-grams?
- the explained variance decreases even further and it is very hard to draw conclusions
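For comparison, word n-grams can be extracted as below. This is a sketch with an assumed tokenizer (runs of `\w` characters, lowercased); `word_ngrams` is a hypothetical name. It also hints at one plausible contributor to the drop in explained variance: in short passages almost every word n-gram occurs only once.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Return lowercased word n-grams as tuples of tokens."""
    # Assumption: a token is a run of letters, digits or underscores.
    tokens = re.findall(r"\w+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "I have spoken about them in the second book of the Jewish War"
counts = Counter(word_ngrams(text, 2))
# Bigrams that occur more than once in this sentence.
repeated = [gram for gram, freq in counts.items() if freq > 1]
print(len(counts), "distinct bigrams,", len(repeated), "repeated")
```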
-
is PCA actually a good dimension reduction technique here?
-
probably not, because the tf-idf matrix is quite sparse and PCA destroys this sparsity (centering the columns turns the zeros into non-zero values)
- the matrices are very sparse: nearly 97% of all values are 0 on almost all datasets => how do the results change if we use an SVD? => the clustering does not really change; the diagram just seems to be flipped for all datasets, which may simply reflect the sign ambiguity of the components
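The sparsity argument can be checked on a toy matrix. This sketch uses made-up values and the hypothetical helpers `sparsity` and `center_columns`; it only shows the effect of the column centering that PCA performs. A truncated SVD applied directly to the uncentered matrix (as scikit-learn's `TruncatedSVD` does) skips this step.

```python
def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


def center_columns(matrix):
    """Subtract each column mean, as PCA does before decomposition."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


# Hypothetical toy tf-idf matrix: 4 passages x 5 n-gram features.
X = [
    [0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4],
]
print(sparsity(X))                  # 16 of 20 entries are zero
print(sparsity(center_columns(X)))  # only the all-zero column stays zero
```

After centering, every column with at least one non-zero entry becomes fully dense, so a 97% sparse matrix would lose nearly all of its zeros.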
-
-
- English text vs Greek text
- different clustering solutions?
Main thought
- if we can find a feature set in which the TF looks like an outlier => this might indicate that this feature set is quite distinctive in the TF
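One way to make "looks like an outlier" measurable is to score each passage by its distance to the centroid of all passages. This is only a sketch of one possible criterion, not a method taken from these notes: `outlier_zscores` is a hypothetical helper, the points are invented, and Euclidean distance with a z-score is an assumed choice.

```python
import math
import statistics


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def outlier_zscores(points):
    """z-score of each point's distance to the centroid of all points."""
    centroid = [statistics.mean(col) for col in zip(*points)]
    dists = [distance(p, centroid) for p in points]
    mean, spread = statistics.mean(dists), statistics.pstdev(dists)
    if spread == 0:  # all points equally far from the centroid
        return [0.0 for _ in dists]
    return [(d - mean) / spread for d in dists]


# Hypothetical 2-D projections; the last point stands in for the TF.
projected = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (0.9, 0.8)]
scores = outlier_zscores(projected)
print(scores)
```

A feature set under which the TF consistently gets the highest score would be a candidate for the "distinctive" feature set described above.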
Later investigations
- What changes if we employ the most common n-gram profile?
- the whole Jewish Antiquities
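For the most common n-gram profile listed above, a possible starting point is to keep only the most frequent n-grams per text and compare the resulting sets. This is a sketch under assumptions: `profile` and `profile_overlap` are hypothetical names, the texts are made space free and lowercased, and the Jaccard index is just one of several ways to compare two profiles.

```python
import re
from collections import Counter


def profile(text, n, size):
    """Return the `size` most frequent space-free character n-grams."""
    squeezed = re.sub(r"\s+", "", text.lower())
    counts = Counter(squeezed[i:i + n] for i in range(len(squeezed) - n + 1))
    return [gram for gram, _ in counts.most_common(size)]


def profile_overlap(a, b):
    """Share of n-grams common to both profiles (Jaccard index)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy texts, for illustration only.
p1 = profile("the war of the Jews", 3, 5)
p2 = profile("the second book of the war", 3, 5)
print(profile_overlap(p1, p2))
```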