Authorship Verification on the Testimonium Flavianum: Data exploration with character n-grams

February 27, 2025

  • procedure of character n-gram extraction

    • space free
    • tf-idf values
  • 18th book

    • preprocessing to same length
  • ensemble of character n-grams

    • n = [3..9]

      1. Varying n changes almost nothing (see the GIF in plots/baseline) => PCA1 corresponds to “o” => PCA2 corresponds to “i” => so what happens if we use space-including n-grams instead?

      105, 109 and 114 are among the few passages that contain standalone o’s, as in “Equal to this determination of yours, O Petronius” (114). Elements in cluster 1, such as section 4, contain standalone i’s, as in “To be sure, I have spoken about them in the second book of the Jewish War,”

      1. The results do change, but they are no longer easy to explain

        • the first two PCA components explain only about 5% of the variance of the dataset => rather not space-including => this seems to be related to fewer “collisions” happening because of casing, periods and commas
      2. What changes do we see with different preprocessing?

        • remove unnecessary punctuation and keep upper/lowercase => results as with space_including
        • remove unnecessary punctuation and lowercase everything => results as with space_free + reasonable PCA
      3. What changes do we see if we use word n-grams?

        • the explained variance decreases even further, and it is very hard to draw any conclusion
      4. Is PCA actually a good dimensionality reduction technique here?

        • probably not, because the tf-idf matrix is still quite sparse and PCA destroys this sparsity (centering makes the matrix dense)

          • the matrices are very sparse: nearly 97% of all values are 0 on almost all datasets => how do the results change if we use an SVD? => the clustering does not really change; the plot just appears flipped for all datasets (the sign of a component is arbitrary, so a flip carries no information)
  • English text vs. Greek text
  • different clustering solutions?
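
The extraction and preprocessing variants listed above can be sketched roughly as follows. This is a minimal sketch in plain Python, not the code behind the plots. It assumes that "space free" means removing all whitespace before extraction, that "unnecessary punctuation" means every character that is not a letter, digit, underscore or whitespace, and that tf-idf is the raw count times log(N / df). All function names and the toy passages are hypothetical.

```python
import math
import re
from collections import Counter


def preprocess(text, space_free=True, lowercase=True, strip_punct=True):
    """Normalise a passage before n-gram extraction (all switches are assumptions)."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        # Assumption: drop everything that is not a word character or whitespace.
        text = re.sub(r"[^\w\s]", "", text)
    if space_free:
        text = re.sub(r"\s+", "", text)  # remove all whitespace
    else:
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs
    return text


def char_ngrams(text, n_values=range(3, 10)):
    """Count all character n-grams for every n in n_values (default 3..9)."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def tfidf_matrix(passages, **prep):
    """Return (vocab, rows) with rows[d][j] = count * log(N / df)."""
    profiles = [char_ngrams(preprocess(p, **prep)) for p in passages]
    df = Counter()
    for profile in profiles:
        df.update(profile.keys())  # document frequency of each n-gram
    vocab = sorted(df)
    n_docs = len(passages)
    rows = [[profile[g] * math.log(n_docs / df[g]) for g in vocab]
            for profile in profiles]
    return vocab, rows


def sparsity(rows):
    """Fraction of zero entries in the matrix."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for v in r if v == 0)
    return zeros / total


passages = [
    "Equal to this determination of yours, O Petronius",
    "To be sure, I have spoken about them in the second book of the Jewish War,",
]
vocab, rows = tfidf_matrix(passages, space_free=True)
print(len(vocab), round(sparsity(rows), 2))
```

Switching `space_free`, `lowercase` and `strip_punct` reproduces the preprocessing variants compared above, and `sparsity` gives the kind of zero-fraction figure quoted for the matrices.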

Main thought

  • if we can find a feature set in which the TF (Testimonium Flavianum) looks like an outlier => this might indicate that this feature set is quite distinctive for the TF
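
One way to make "looks like an outlier" measurable is sketched below. It is an illustration only, under assumptions that are not part of these notes: each passage is a feature vector (for example a tf-idf row), distance is the cosine distance to the centroid of all other passages, and the outlier score is a z-score of the target's distance against the distances of the remaining passages. The function names and the toy vectors are hypothetical.

```python
import math
from statistics import mean, stdev


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here: a zero vector counts as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def distances_to_rest(rows):
    """For each passage, cosine distance to the centroid of all other passages."""
    result = []
    for i, row in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        centroid = [mean(col) for col in zip(*others)]
        result.append(cosine_distance(row, centroid))
    return result


def outlier_z_score(rows, target_index):
    """z-score of the target's distance relative to the other passages' distances.

    Needs at least three passages so that the spread of the rest is defined.
    """
    dists = distances_to_rest(rows)
    rest = [d for j, d in enumerate(dists) if j != target_index]
    spread = stdev(rest)
    if spread == 0:
        return 0.0 if dists[target_index] == mean(rest) else float("inf")
    return (dists[target_index] - mean(rest)) / spread


# Toy vectors: the last one points in a different direction than the others.
toy_rows = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(outlier_z_score(toy_rows, target_index=3))
```

A large positive score for the TF under one feature set, but not under others, would be the kind of evidence the main thought is looking for. Whether a z-score is an appropriate test for this data is left open here.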

Later investigations

  • What changes when employing a most-common n-gram profile?
  • the whole Jewish Antiquities
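
A most-common n-gram profile could look roughly like the sketch below. It keeps only the most frequent n-grams over all passages and represents each passage by its relative frequencies of those n-grams. The profile size of 500, the single value n = 3 and all function names are assumptions made for illustration, not choices taken from these notes.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the character n-grams of one (already preprocessed) text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def most_common_profile(passages, n=3, size=500):
    """The `size` most frequent n-grams over all passages."""
    total = Counter()
    for p in passages:
        total.update(ngram_counts(p, n))
    return [gram for gram, _ in total.most_common(size)]


def relative_frequencies(text, profile, n=3):
    """Relative frequency of each profile n-gram within one text."""
    counts = ngram_counts(text, n)
    n_grams = sum(counts.values())
    return [counts[g] / n_grams if n_grams else 0.0 for g in profile]


toy_passages = ["abcabcabc", "abcxyz"]
profile = most_common_profile(toy_passages, n=3, size=2)
vectors = [relative_frequencies(p, profile, n=3) for p in toy_passages]
print(profile, vectors)
```

Compared with the full tf-idf matrix, such a profile is short and dense, which may make PCA more appropriate; this is a conjecture to check, not a result.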