---
language: en
license: mit
tags:
- authorship-verification
- sentence-transformers
- sentence-similarity
datasets:
- swan07/authorship-verification
metrics:
- accuracy
- auc
model-index:
- name: swan07/bert-authorship-verification
  results:
  - task:
      type: authorship-verification
      name: Authorship Verification
    dataset:
      name: swan07/authorship-verification
      type: authorship-verification
    metrics:
    - type: accuracy
      value: 0.739
      name: Accuracy
    - type: auc
      value: 0.821
      name: AUC
---

# BERT for Authorship Verification

A sentence-transformers model fine-tuned to determine whether two texts were written by the same author.

## Model Details

- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Training Data**: 50K text pairs from the swan07/authorship-verification dataset
- **Task**: Authorship verification (binary classification)
- **Performance**: 73.9% accuracy, 0.821 AUC

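The card does not state how the accuracy and AUC figures above were evaluated. As a rough sketch, assuming each pair gets a similarity score and a 0/1 label (1 for same author), the two metrics could be computed as follows. The function names, the toy data, and the 0.5 threshold (borrowed from the usage example in this card) are illustrative assumptions, not the author's evaluation script.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def roc_auc(scores, labels):
    """Probability that a random same-author pair outscores a random
    different-author pair; ties count as half. Quadratic in the number of
    pairs, so only suitable for small evaluation sets."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one pair of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores and labels, made up for illustration only.
scores = [0.91, 0.62, 0.48, 0.55, 0.20, 0.35]
labels = [1, 1, 1, 0, 0, 0]
print(f"accuracy = {accuracy(scores, labels):.3f}")
print(f"AUC      = {roc_auc(scores, labels):.3f}")
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank the pairs, which is why the two numbers can differ noticeably.
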
## Usage

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('swan07/bert-authorship-verification')

# Encode texts
text1 = "Your first text here"
text2 = "Your second text here"

emb1 = model.encode(text1)
emb2 = model.encode(text2)

# Calculate cosine similarity
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

# Predict
prediction = "Same Author" if similarity >= 0.5 else "Different Authors"
print(f"Prediction: {prediction}")
print(f"Similarity: {similarity:.3f}")
```

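When many pairs need to be scored, the same cosine-similarity rule can be applied row by row to two embedding matrices. The sketch below is one way to do that with NumPy; `pairwise_cosine` and `predict_same_author` are hypothetical helper names, and the small hand-written vectors stand in for real embeddings so the block runs without downloading the model.

```python
import numpy as np


def pairwise_cosine(emb_a, emb_b):
    """Cosine similarity between row i of emb_a and row i of emb_b."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    dots = np.sum(emb_a * emb_b, axis=1)
    norms = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    return dots / norms


def predict_same_author(similarities, threshold=0.5):
    """Boolean array: True where a pair is predicted to share an author."""
    return np.asarray(similarities) >= threshold


# Stand-in 2-D vectors (not real model output), one row per text.
emb_a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
emb_b = np.array([[2.0, 0.0], [1.0, -1.0], [1.0, 1.0]])

sims = pairwise_cosine(emb_a, emb_b)
print(sims)
print(predict_same_author(sims))
```

With the real model, `model.encode(list_of_texts)` returns one embedding row per input text, so two such arrays of equal length can be passed directly to `pairwise_cosine`.
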
## Training

Trained on 50K pairs from the swan07/authorship-verification dataset using:
- Learning rate: 2e-5
- Batch size: 16
- Epochs: 4
- Loss: CosineSimilarityLoss

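The hyperparameters above map onto the classic `fit` training API of sentence-transformers roughly as sketched below. This is not the author's training script: the column names `text1`, `text2`, and `same`, the helper names `to_pairs` and `train`, and the output path are all hypothetical, and settings the card does not mention (warmup, scheduler, weight decay) are left at the library defaults.

```python
def to_pairs(rows):
    """Turn raw rows into (text_a, text_b, float_label) triples.

    The keys "text1", "text2" and "same" are hypothetical; adjust them to
    the dataset's real column names.
    """
    return [(r["text1"], r["text2"], float(r["same"])) for r in rows]


def train(rows, output_path="authorship-model"):
    """Fine-tune the base model with the hyperparameters listed in the card."""
    # Imported here so to_pairs() stays usable without these packages.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    examples = [InputExample(texts=[a, b], label=y) for a, b, y in to_pairs(rows)]
    loader = DataLoader(examples, shuffle=True, batch_size=16)  # batch size 16
    loss = losses.CosineSimilarityLoss(model)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=4,                       # 4 epochs
        optimizer_params={"lr": 2e-5},  # learning rate 2e-5
        output_path=output_path,
    )
    return model


# Two made-up rows, only to show the expected input shape.
rows = [
    {"text1": "First sample.", "text2": "Second sample.", "same": 1},
    {"text1": "Third sample.", "text2": "Fourth sample.", "same": 0},
]
print(to_pairs(rows))
```

`CosineSimilarityLoss` regresses the cosine similarity of the two embeddings onto the float label, so a label of 1.0 pulls same-author pairs together and 0.0 pushes different-author pairs toward orthogonality. Calling `train(rows)` downloads the base model and needs `sentence-transformers` and `torch` installed.
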
## Dataset

[swan07/authorship-verification](https://huggingface.co/datasets/swan07/authorship-verification) - 325K text pairs from 12 sources including PAN competitions (2011-2020).

## Citation

```bibtex
@article{manolache2021transferring,
  title={Transferring BERT-like Transformers' Knowledge for Authorship Verification},
  author={Manolache, Andrei and Brad, Florin and Burceanu, Elena and Barbalau, Antonio and Ionescu, Radu Tudor and Popescu, Marius},
  journal={arXiv preprint arXiv:2112.05125},
  year={2021}
}
```

## Links

- **Live Demo**: [same-writer-detector.streamlit.app](https://same-writer-detector.streamlit.app/)
- **Code**: [github.com/swan-07/authorship-verification](https://github.com/swan-07/authorship-verification)
- **Dataset**: [huggingface.co/datasets/swan07/authorship-verification](https://huggingface.co/datasets/swan07/authorship-verification)