BERT for Authorship Verification

A fine-tuned sentence-transformers model (a BERT-style MiniLM encoder) that estimates whether two texts were written by the same author.

Model Details

  • Base Model: sentence-transformers/all-MiniLM-L6-v2
  • Training Data: 50K text pairs from the swan07/authorship-verification dataset
  • Task: Authorship verification (binary classification)
  • Performance: 73.9% accuracy, 0.821 AUC
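
For reference, the two reported metrics can be computed from similarity scores and gold labels roughly as follows. This is a minimal sketch, not the evaluation script behind the numbers above; the function names, the toy scores, and the 0.5 threshold are assumptions made for illustration.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the 0/1 label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(scores, labels):
    """AUC as the probability that a random same-author pair outscores
    a random different-author pair (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pair of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = same author, 0 = different authors
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
print(f"accuracy={accuracy(scores, labels):.3f}  auc={roc_auc(scores, labels):.3f}")
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds. The pairwise loop is quadratic in the number of pairs, so it suits small checks rather than full evaluation sets.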

Usage

from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('swan07/bert-authorship-verification')

# Encode texts
text1 = "Your first text here"
text2 = "Your second text here"

emb1 = model.encode(text1)
emb2 = model.encode(text2)

# Calculate cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

# Predict
prediction = "Same Author" if similarity >= 0.5 else "Different Authors"
print(f"Prediction: {prediction}")
print(f"Similarity: {similarity:.3f}")
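
To score many pairs at once, the same cosine-similarity rule can be applied row by row to two embedding matrices. The sketch below is one way to do that; `rowwise_cosine` and `verify_pairs` are hypothetical helper names, and the small arrays are stand-ins so the example runs without downloading the model.

```python
import numpy as np


def rowwise_cosine(emb_a, emb_b, eps=1e-12):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    dots = np.sum(emb_a * emb_b, axis=1)
    norms = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    return dots / np.maximum(norms, eps)  # eps guards against zero vectors


def verify_pairs(emb_a, emb_b, threshold=0.5):
    """Return (similarities, same-author flags) for each row pair."""
    sims = rowwise_cosine(emb_a, emb_b)
    return sims, sims >= threshold


# Stand-in embeddings (hypothetical values, not model output).
# With the real model, pass lists of texts instead:
#   emb_a = model.encode(texts_a); emb_b = model.encode(texts_b)
emb_a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
emb_b = np.array([[2.0, 0.0], [1.0, -1.0], [1.0, 1.0]])

sims, same = verify_pairs(emb_a, emb_b)
for s, flag in zip(sims, same):
    print(f"{s:.3f}", "Same Author" if flag else "Different Authors")
```

Encoding the texts as lists lets the model batch them internally, which is usually faster than calling `encode` once per text.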

Training

Trained on 50K pairs from the swan07/authorship-verification dataset using:

  • Learning rate: 2e-5
  • Batch size: 16
  • Epochs: 4
  • Loss: CosineSimilarityLoss
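
In sentence-transformers, CosineSimilarityLoss by default takes the mean squared error between the cosine similarity of the two embeddings and the pair's label. The NumPy sketch below illustrates that objective on toy vectors. It shows the loss value only and is not the training script used for this model; the label convention (1.0 for same author, 0.0 for different authors) is an assumption.

```python
import numpy as np


def cosine_similarity_loss(emb_a, emb_b, labels):
    """Mean squared error between row-wise cosine similarity and labels."""
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    cos = np.sum(emb_a * emb_b, axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    )
    return float(np.mean((cos - labels) ** 2))


# Toy embeddings (hypothetical): first pair identical, second pair orthogonal
emb_a = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_b = np.array([[1.0, 0.0], [1.0, 0.0]])

print(cosine_similarity_loss(emb_a, emb_b, [1.0, 0.0]))  # labels agree with the geometry
print(cosine_similarity_loss(emb_a, emb_b, [0.0, 1.0]))  # labels contradict it
```

Under this objective, training pulls same-author embeddings toward cosine similarity 1 and pushes different-author embeddings toward the value used for negative labels.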

Dataset

swan07/authorship-verification: 325K text pairs from 12 sources, including PAN competitions (2011-2020).

Citation

@article{manolache2021transferring,
  title={Transferring BERT-like Transformers' Knowledge for Authorship Verification},
  author={Manolache, Andrei and Brad, Florin and Burceanu, Elena and Barbalau, Antonio and Ionescu, Radu Tudor and Popescu, Marius},
  journal={arXiv preprint arXiv:2112.05125},
  year={2021}
}
