Authorship Verification on the Testimonium Flavianum: Data exploration with character n-grams

February 27, 2025

  • procedure of character n-gram extraction

    • space free
    • tf-idf values
  • 18th book

    • preprocessing to same length
  • ensemble of character n-grams

    • n = [3..9]

      1. Hardly any change as n varies (see GIF in plots/baseline): PCA1 appears to track the character "o" and PCA2 the character "i". How would the picture look with space-including n-grams?

      Passages 105, 109 and 114 are among the few that contain standalone o's, as in “Equal to this determination of yours, O Petronius” (114). Elements in cluster 1, such as section 4, contain standalone i's, as in “To be sure, I have spoken about them in the second book of the Jewish War,”

      1. It does change, but the results are no longer as easy to explain

        • the first two principal components explain only around 5% of the variance in the dataset => rather not use space-including n-grams
      2. What changes do we see with different preprocessing?

        • remove unnecessary punctuation and keep upper/lowercase => same results as with space_including
        • remove unnecessary punctuation and lowercase everything => same results as space_free, plus a reasonable PCA
        • Upper/Lowercase
      3. What changes do we see if we use word n-grams?
      4. What changes if we employ a most-common n-gram profile?
  • English text vs. Greek text
  • different clustering solutions?
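
Sketch of the extraction step

A minimal sketch of the extraction procedure listed above (space-free vs. space-including character n-grams, n = 3..9, tf-idf values), using only the Python standard library. The function names (preprocess, char_ngrams, tfidf) are hypothetical and not taken from the project code. Two simplifications are assumptions: all punctuation is removed rather than only "unnecessary" punctuation, and the tf-idf variant (relative frequency times ln(N/df) + 1) is one common choice, not necessarily the one used for the plots.

```python
import math
import re
from collections import Counter


def preprocess(text, keep_spaces=False, lowercase=True):
    """Strip punctuation; optionally lowercase and drop whitespace.

    Assumption: *all* punctuation is removed. The regex is Unicode-aware,
    so Greek letters are kept as well as Latin ones.
    """
    if lowercase:
        text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    if keep_spaces:
        # space-including variant: collapse runs of whitespace to one space
        return re.sub(r"\s+", " ", text).strip()
    # space-free variant: drop whitespace entirely
    return re.sub(r"\s+", "", text)


def char_ngrams(text, n_values=range(3, 10)):
    """Count the character n-grams of `text` for every n in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def tfidf(docs_counts):
    """Turn per-passage n-gram counts into tf-idf dictionaries.

    tf  = count / total n-grams in the passage
    idf = ln(N / df) + 1   (one common variant; an assumption here)
    """
    n_docs = len(docs_counts)
    df = Counter()
    for counts in docs_counts:
        df.update(counts.keys())  # each passage contributes at most 1 per n-gram
    weighted = []
    for counts in docs_counts:
        total = sum(counts.values())
        weighted.append({
            gram: (c / total) * (math.log(n_docs / df[gram]) + 1.0)
            for gram, c in counts.items()
        })
    return weighted


passages = [
    "Equal to this determination of yours, O Petronius",
    "To be sure, I have spoken about them in the second book of the Jewish War,",
]
space_free = [char_ngrams(preprocess(p)) for p in passages]
vectors = tfidf(space_free)
```

Switching keep_spaces and lowercase in preprocess gives the preprocessing variants compared above, so the same pipeline can be rerun for each setting.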

Main thought

  • If we can find a feature set under which the TF looks like a clear outlier, this might indicate that this feature set is quite distinct in the TF
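
One possible way to make "looks like an outlier" measurable, given per-passage feature dictionaries (for example tf-idf weights of character n-grams): compute, for each passage, the cosine distance to the centroid of all other passages and express it as a z-score. This is only a sketch under assumptions; the leave-one-out centroid, the cosine distance and all function names are illustrative choices, not the method used in the project.

```python
import math
from statistics import mean, pstdev


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Feature-wise mean of sparse vectors (missing features count as 0)."""
    keys = set().union(*vectors)
    return {k: mean(v.get(k, 0.0) for v in vectors) for k in keys}


def outlier_zscores(vectors):
    """Leave-one-out distance of each vector to the centroid of the rest,
    standardised to z-scores across all passages."""
    dists = []
    for i, vec in enumerate(vectors):
        rest = vectors[:i] + vectors[i + 1:]
        dists.append(cosine_distance(vec, centroid(rest)))
    mu, sigma = mean(dists), pstdev(dists)
    if sigma == 0.0:
        return [0.0] * len(dists)
    return [(d - mu) / sigma for d in dists]


# Toy data: three similar passages and one that shares no features with them.
toy = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 0.9},
    {"a": 0.9, "b": 1.0},
    {"z": 1.0},
]
scores = outlier_zscores(toy)
```

A feature set would then be interesting if the TF passage receives a clearly higher score than the other passages of the same book; what counts as "clearly" would still need to be decided.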