BERT for Authorship Verification

A BERT-style sentence-embedding model fine-tuned to determine whether two texts were written by the same author.

Model Details

Base Model: sentence-transformers/all-MiniLM-L6-v2
Training Data: 50K text pairs from the swan07/authorship-verification dataset
Task: Authorship verification (binary classification)
Performance: 73.9% accuracy, 0.821 AUC (self-reported, on swan07/authorship-verification)

Usage

from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('swan07/bert-authorship-verification')

# Encode texts
text1 = "Your first text here"
text2 = "Your second text here"

emb1 = model.encode(text1)
emb2 = model.encode(text2)

# Calculate cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

# Predict
prediction = "Same Author" if similarity >= 0.5 else "Different Authors"
print(f"Prediction: {prediction}")
print(f"Similarity: {similarity:.3f}")
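
The 0.5 cutoff above is a fixed default. If labeled validation pairs are available, the cutoff can be tuned instead. The following is a minimal sketch of that idea, not the procedure used for this model: the helper names (accuracy_at, best_threshold) and the toy scores are hypothetical, and the scores stand in for cosine similarities produced by the model.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of pairs classified correctly when score >= threshold means 'same author'."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_threshold(scores, labels):
    """Return (threshold, accuracy) with the highest accuracy on the given pairs.

    Candidate thresholds are the observed scores; ties keep the lowest threshold.
    """
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: (accuracy_at(t, scores, labels), -t))
    return best, accuracy_at(best, scores, labels)


# Toy validation data, made up for illustration:
# 1 = same author, 0 = different authors.
val_scores = [0.91, 0.78, 0.64, 0.58, 0.47, 0.35]
val_labels = [1, 1, 1, 0, 0, 0]

threshold, acc = best_threshold(val_scores, val_labels)
print(f"Tuned threshold: {threshold:.2f} (accuracy {acc:.3f})")
print(f"Accuracy at 0.5: {accuracy_at(0.5, val_scores, val_labels):.3f}")
```

A threshold tuned this way should be checked on held-out pairs, since it can overfit a small validation set.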

Training

Trained on 50K pairs from the swan07/authorship-verification dataset using:

Learning rate: 2e-5
Batch size: 16
Epochs: 4
Loss: CosineSimilarityLoss
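
The settings above map onto the classic sentence-transformers fit API roughly as follows. This is a sketch under stated assumptions rather than the original training script: the (text1, text2, label) tuple layout, the 10% warmup, and the output_path value are not given on this card. CosineSimilarityLoss fits the cosine similarity of the two embeddings to the label, so labels are taken here to be 1.0 for same-author pairs and 0.0 otherwise.

```python
import math

# Hyperparameters as listed on this card.
NUM_PAIRS = 50_000
BATCH_SIZE = 16
EPOCHS = 4
LEARNING_RATE = 2e-5

# With these values, one epoch is 3,125 optimizer steps and the full run is 12,500.
steps_per_epoch = math.ceil(NUM_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def train(pairs, base_model="sentence-transformers/all-MiniLM-L6-v2",
          output_path="authorship-model"):
    """Fine-tune on an iterable of (text1, text2, label) tuples.

    The tuple layout, warmup fraction, and output_path are assumptions.
    """
    # Heavy imports stay local so the step arithmetic above runs without them.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[a, b], label=float(y)) for a, b, y in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=BATCH_SIZE)

    # Regresses cos(emb1, emb2) onto the label (mean squared error by default).
    loss = losses.CosineSimilarityLoss(model)

    model.fit(
        train_objectives=[(loader, loss)],
        epochs=EPOCHS,
        warmup_steps=int(0.1 * len(loader) * EPOCHS),  # assumed 10% warmup
        optimizer_params={"lr": LEARNING_RATE},
        output_path=output_path,
    )
    return model
```

Calling train(pairs) downloads the base model and requires sentence-transformers and PyTorch to be installed.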

Dataset

swan07/authorship-verification: 325K text pairs from 12 sources, including PAN competitions (2011-2020). This model was trained on a 50K-pair subset.

Citation

@article{manolache2021transferring,
  title={Transferring BERT-like Transformers' Knowledge for Authorship Verification},
  author={Manolache, Andrei and Brad, Florin and Burceanu, Elena and Barbalau, Antonio and Ionescu, Radu Tudor and Popescu, Marius},
  journal={arXiv preprint arXiv:2112.05125},
  year={2021}
}

Paper for swan07/bert-authorship-verification: Rethinking the Authorship Verification Experimental Setups (arXiv:2112.05125, published Dec 9, 2021)

Evaluation results (self-reported, on swan07/authorship-verification)

Accuracy: 0.739
AUC: 0.821
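
Accuracy depends on the similarity cutoff, while AUC is threshold-free: it is the probability that a randomly chosen same-author pair scores higher than a randomly chosen different-author pair. A small sketch of both metrics on made-up scores follows; the function names and data are illustrative and do not reproduce the numbers reported here.

```python
def accuracy(scores, labels, threshold=0.5):
    """Share of pairs where (score >= threshold) agrees with the label."""
    return sum((s >= threshold) == bool(y) for s, y in zip(scores, labels)) / len(scores)


def roc_auc(scores, labels):
    """Pairwise AUC: P(positive score > negative score), ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one pair of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up similarity scores; 1 = same author, 0 = different authors.
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

print(f"Accuracy @ 0.5: {accuracy(scores, labels):.3f}")
print(f"AUC: {roc_auc(scores, labels):.3f}")
```

The pairwise loop is quadratic in the number of pairs, so for a full evaluation set a library routine such as scikit-learn's roc_auc_score is the more practical choice.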
