---
language: en
license: mit
tags:
- authorship-verification
- sentence-transformers
- sentence-similarity
datasets:
- swan07/authorship-verification
metrics:
- accuracy
- auc
model-index:
- name: swan07/bert-authorship-verification
  results:
  - task:
      type: authorship-verification
      name: Authorship Verification
    dataset:
      name: swan07/authorship-verification
      type: authorship-verification
    metrics:
    - type: accuracy
      value: 0.739
      name: Accuracy
    - type: auc
      value: 0.821
      name: AUC
---

# BERT for Authorship Verification

Fine-tuned BERT-style sentence-embedding model that determines whether two texts were written by the same author.

## Model Details

- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Training Data**: 50K text pairs from the swan07/authorship-verification dataset
- **Task**: Authorship verification (binary classification)
- **Performance**: 73.9% accuracy, 0.821 AUC
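
For readers who want to see how figures like these are typically derived from similarity scores, here is a minimal NumPy sketch. It is not necessarily the evaluation code used for this model: the helper names `accuracy_at_threshold` and `roc_auc`, the 0.5 threshold, and the toy scores and labels are all assumptions made for illustration.

```python
import numpy as np

def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Fraction of pairs classified correctly when score >= threshold means 'same author'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    predictions = (scores >= threshold).astype(int)
    return float(np.mean(predictions == labels))

def roc_auc(scores, labels):
    """AUC as the chance that a random positive pair outscores a random negative pair (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative pair")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy similarity scores and labels (1 = same author), invented for illustration only.
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0]
print(accuracy_at_threshold(toy_scores, toy_labels))
print(roc_auc(toy_scores, toy_labels))
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two numbers are reported separately.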

## Usage

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('swan07/bert-authorship-verification')

# Encode texts
text1 = "Your first text here"
text2 = "Your second text here"

emb1 = model.encode(text1)
emb2 = model.encode(text2)

# Calculate cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

# Predict
prediction = "Same Author" if similarity >= 0.5 else "Different Authors"
print(f"Prediction: {prediction}")
print(f"Similarity: {similarity:.3f}")
```
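
To score many pairs at once, the cosine-similarity step from the snippet above can be vectorized. The sketch below uses small stand-in vectors instead of real embeddings so it runs without downloading the model; `pairwise_cosine` and `predict_same_author` are hypothetical helper names, and the 0.5 threshold is carried over from the example above rather than tuned.

```python
import numpy as np

def pairwise_cosine(emb_a, emb_b):
    """Row-wise cosine similarity between two (n_pairs, dim) arrays."""
    emb_a = np.atleast_2d(np.asarray(emb_a, dtype=float))
    emb_b = np.atleast_2d(np.asarray(emb_b, dtype=float))
    dots = np.sum(emb_a * emb_b, axis=1)
    norms = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    return dots / norms

def predict_same_author(similarities, threshold=0.5):
    """Boolean array: True where the pair is predicted to share an author."""
    return np.asarray(similarities) >= threshold

# Stand-in vectors, invented for illustration; real inputs would be model embeddings.
first_texts = np.array([[1.0, 0.0], [1.0, 1.0]])
second_texts = np.array([[1.0, 0.0], [-1.0, 1.0]])

similarities = pairwise_cosine(first_texts, second_texts)
print(similarities)
print(predict_same_author(similarities))
```

With the real model, passing a list of strings to `model.encode` returns a 2D array with one row per text, so two such arrays can be fed to `pairwise_cosine` directly.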

## Training

Trained on 50K pairs from the swan07/authorship-verification dataset using:
- Learning rate: 2e-5
- Batch size: 16
- Epochs: 4
- Loss: CosineSimilarityLoss
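
As a rough illustration of the objective, `CosineSimilarityLoss` in sentence-transformers by default takes the mean squared error between the cosine similarity of the two embeddings and the target label. The NumPy sketch below mirrors that computation on toy vectors. The label encoding (1.0 for same author, 0.0 for different authors) and the function name are assumptions, since the card does not state them.

```python
import numpy as np

def cosine_similarity_loss(emb_a, emb_b, labels):
    """Mean squared error between row-wise cosine similarity and the target labels."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    labels = np.asarray(labels, dtype=float)
    cos = np.sum(emb_a * emb_b, axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    )
    return float(np.mean((cos - labels) ** 2))

# Toy embeddings, invented for illustration: pair 0 is identical, pair 1 is orthogonal.
toy_a = np.array([[1.0, 0.0], [1.0, 0.0]])
toy_b = np.array([[1.0, 0.0], [0.0, 1.0]])

# Assumed label encoding: 1.0 = same author, 0.0 = different authors.
print(cosine_similarity_loss(toy_a, toy_b, [1.0, 0.0]))  # labels agree with the similarities
print(cosine_similarity_loss(toy_a, toy_b, [0.0, 1.0]))  # labels contradict the similarities
```

Minimizing this quantity pushes same-author pairs toward similarity 1 and different-author pairs toward similarity 0, which is what makes a fixed similarity threshold usable at inference time.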

## Dataset

[swan07/authorship-verification](https://huggingface.co/datasets/swan07/authorship-verification): 325K text pairs drawn from 12 sources, including the PAN competitions (2011-2020).

## Citation

```bibtex
@article{manolache2021transferring,
  title={Transferring BERT-like Transformers' Knowledge for Authorship Verification},
  author={Manolache, Andrei and Brad, Florin and Burceanu, Elena and Barbalau, Antonio and Ionescu, Radu Tudor and Popescu, Marius},
  journal={arXiv preprint arXiv:2112.05125},
  year={2021}
}
```

## Links

- **Live Demo**: [same-writer-detector.streamlit.app](https://same-writer-detector.streamlit.app/)
- **Code**: [github.com/swan-07/authorship-verification](https://github.com/swan-07/authorship-verification)
- **Dataset**: [huggingface.co/datasets/swan07/authorship-verification](https://huggingface.co/datasets/swan07/authorship-verification)