Authorship Verification on the Testimonium Flavianum: Data exploration with character n-grams

February 27, 2025

procedure of character n-gram extraction
- space free
- tf-idf values
18th book
- preprocessing to same length
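
The extraction steps above could be sketched roughly as follows, using only the standard library. The function names (space_free, same_length, char_ngrams, tfidf) are hypothetical. Three things here are assumptions, not confirmed by these notes: that "space free" also lowercases and drops punctuation, that "same length" means cutting every passage to the length of the shortest one, and that the textbook weighting tf * log(N / df) is used. The actual pipeline may differ on all three.

```python
import math
import re
from collections import Counter


def space_free(text):
    """Lowercase and drop every non-alphanumeric character (assumed meaning of 'space free')."""
    return re.sub(r"[\W_]+", "", text.lower())


def same_length(texts):
    """Cut every text to the length of the shortest one (assumed meaning of 'same length')."""
    shortest = min(len(t) for t in texts)
    return [t[:shortest] for t in texts]


def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs, n):
    """One {ngram: weight} dict per document, with weight = tf * log(N / df)."""
    cleaned = same_length([space_free(d) for d in docs])
    counts = [Counter(char_ngrams(d, n)) for d in cleaned]
    df = Counter(g for c in counts for g in c)  # number of documents containing each n-gram
    n_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({g: (k / total) * math.log(n_docs / df[g]) for g, k in c.items()})
    return weights


passages = [
    "Equal to this determination of yours, O Petronius",
    "To be sure, I have spoken about them in the second book",
]
# One weight table per n in the ensemble n = 3..9.
ensemble = {n: tfidf(passages, n) for n in range(3, 10)}
```

With this weighting an n-gram that occurs in every passage gets weight 0, so only n-grams that separate passages contribute. Because the regular expression is Unicode-aware, the same cleaning should also keep Greek letters.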
ensemble of character n-grams
- n = [3..9]
  1. varying n changes almost nothing (see gif in plots/baseline) => PCA1 corresponds to "o", PCA2 to "i" => so how would it look with the space-including variant?
  105, 109 and 114 are among the few passages that contain single o’s, as in “Equal to this determination of yours, O Petronius” (114); elements in cluster 1, such as section 4, contain single i’s, as in “To be sure, I have spoken about them in the second book of the Jewish War,”
  1. It does change, but the results are no longer easy to explain
    - the first 2 PCA components explain only around 5% of the variance of the dataset => rather not space including => this seems to be because fewer “collisions” happen, due to casing and punctuation (periods, commas)
  2. what changes do we see with different preprocessing?
    - remove unnecessary punctuation and keep upper/lowercase => same results as with space_including
    - remove unnecessary punctuation and lowercase everything => same results as space_free + reasonable PCA
  3. What changes do we see if we use word n-grams?
    - the explained variance decreases even more and it is very hard to draw any conclusion
  4. is PCA actually a good dimensionality reduction technique here?
    - probably not, because the tf-idf matrix is quite sparse and PCA destroys this sparsity
      - the matrices are very sparse: nearly 97% of all values are 0 on almost all datasets => how do results change if we use an SVD? => the clustering does not really change, the diagram just appears flipped for all datasets
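
A small illustration of why PCA and sparsity do not mix well: PCA subtracts each column mean before decomposing, and that centering turns most zeros into non-zero values. The matrix below is made-up toy data, not one of the datasets from these notes, and the helper names are hypothetical.

```python
def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)


def center_columns(matrix):
    """Subtract each column mean, as PCA does before decomposing."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


# Toy tf-idf-like matrix: 4 passages x 5 n-grams, mostly zeros (made-up values).
X = [
    [0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.0],
]

before = sparsity(X)                 # 17 of 20 entries are zero
after = sparsity(center_columns(X))  # only the two all-zero columns stay zero
print(before, after)
```

A truncated SVD skips the centering step and can therefore work on the sparse matrix directly (scikit-learn's TruncatedSVD, for instance, accepts sparse input). The flipped diagram may simply reflect that the sign of each component is arbitrary, so a mirrored plot carries the same information. This has not been checked against the actual plots.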
English text vs Greek text
different clustering solutions?

Main thought

if we can find a feature set in which the TF (Testimonium Flavianum) looks like an outlier => this might indicate that this feature set is quite distinctive in the TF
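
One possible way to make "outlierish" measurable is the cosine distance of each passage vector to the centroid of all other passages. This is only a sketch of the idea: the score, the function names and the vectors are assumptions for illustration, not the method used in these notes.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def outlier_scores(vectors):
    """Cosine distance of each vector to the centroid of all the other vectors."""
    scores = []
    for i, v in enumerate(vectors):
        others = [w for j, w in enumerate(vectors) if j != i]
        centroid = [sum(col) / len(others) for col in zip(*others)]
        scores.append(cosine_distance(v, centroid))
    return scores


# Made-up feature vectors: the last passage points in a different direction.
vectors = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [1.0, 1.0, 0.0],
    [0.1, 0.0, 1.0],
]
scores = outlier_scores(vectors)
most_outlying = scores.index(max(scores))
```

If the TF had the highest score for one feature set but not for others, that feature set would be a candidate for what makes the TF distinct. Each vector is left out of its own centroid so that a strong outlier does not pull the reference point toward itself.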

Later investigations

What changes when using a most-common n-gram profile?
whole Jewish Antiquities
