# DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus

This repository hosts a fine-tuned DistilBERT model designed for the **authorship attribution** task on the Blog Authorship Corpus dataset. The model is optimized for identifying the author of a given blog post from a subset of top contributors.
## Model Details

- **Model Architecture:** DistilBERT Base (`distilbert-base-uncased`)
- **Task:** Authorship Attribution
- **Dataset:** Blog Authorship Corpus (top 10 authors selected)
- **Quantization:** Float16 (post-training)
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage

### Installation

```sh
pip install transformers torch
```
### Loading the Model

```python
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Load the fine-tuned model and tokenizer
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Move to GPU if available and switch to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Convert to half precision (float16) on GPU; fp16 inference is poorly
# supported on CPU, so keep float32 weights there
if device == "cuda":
    model.half()

# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"

# Tokenize input; token IDs and the attention mask are integer tensors,
# so they need no dtype conversion for fp16 inference
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J"
}

predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
```
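Beyond the top-1 label, the full probability distribution over authors is often more informative. A minimal sketch, reusing `model`, `inputs`, and `label_mapping` from above (the cast to float32 guards against fp16 logits):

```python
import torch
import torch.nn.functional as F

# Softmax over the logits yields a per-author probability distribution
with torch.no_grad():
    logits = model(**inputs).logits.float()  # float32 for a stable softmax
    probs = F.softmax(logits, dim=-1).squeeze(0)

# Rank the candidate authors by predicted probability
for idx in probs.argsort(descending=True).tolist():
    print(f"{label_mapping[idx]}: {probs[idx]:.3f}")
```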
## Performance Metrics

- **Accuracy:** ~78% on the validation split of the top-10-author subset
- **Precision/Recall/F1:** Per-class scores vary; the average F1 is approximately 0.75
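For reference, per-class and averaged scores of this kind can be computed with scikit-learn; `y_true` and `y_pred` below are placeholders standing in for the validation labels and the model's argmax predictions:

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholders: collect these from an evaluation loop over the validation set
y_true = [0, 1, 2, 1, 0, 3]
y_pred = [0, 1, 2, 0, 0, 3]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred, digits=3))
```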
## Fine-Tuning Details

### Dataset

The model is trained on a subset of the **Blog Authorship Corpus** containing blogs from the 10 most prolific authors. Each sample is a blog post with its corresponding author label.
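A rough sketch of how such a subset can be built with pandas (the file name `blogtext.csv` and the `author` column are assumptions, not necessarily the exact preprocessing used here):

```python
import pandas as pd

# Load the raw corpus (file and column names are assumptions)
df = pd.read_csv("blogtext.csv")

# Keep only posts written by the 10 most prolific authors
top_authors = df["author"].value_counts().nlargest(10).index
subset = df[df["author"].isin(top_authors)].copy()

# Map each author to an integer class label in [0, 9]
author_to_id = {author: i for i, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author"].map(author_to_id)
```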
### Training

- **Epochs:** 3
- **Batch size:** 8
- **Evaluation strategy:** per epoch
- **Learning rate:** 2e-5
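These hyperparameters map onto a standard `Trainer` setup. A minimal sketch, assuming tokenized `train_dataset` and `eval_dataset` objects already exist:

```python
from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=10
)

# Hyperparameters from the list above
training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: pre-tokenized Dataset objects
    eval_dataset=eval_dataset,
)
trainer.train()
```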
### Quantization

In addition to the float16 conversion shown in the usage example, PyTorch post-training dynamic quantization was applied to reduce model size and accelerate CPU inference:

```python
import torch

# Quantize the weights of every nn.Linear module to int8; activations
# are quantized on the fly at inference time. Apply this to the
# float32 model on CPU, not to the fp16 copy.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
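Dynamic quantization targets the `nn.Linear` modules because they hold most of DistilBERT's parameters and dominate inference time; embeddings and layer norms stay in floating point. A quick way to check the effect, reusing `model` and `quantized_model` from above:

```python
import os
import torch

def state_dict_size_mb(m: torch.nn.Module) -> float:
    """Serialize a model's state dict and report its size on disk."""
    torch.save(m.state_dict(), "tmp.pt")
    size_mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size_mb

print(f"original:  {state_dict_size_mb(model):.1f} MB")
print(f"quantized: {state_dict_size_mb(quantized_model):.1f} MB")
```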
## Repository Structure

```
.
├── model/              # Fine-tuned and quantized model files
├── tokenizer_config/   # Tokenizer configuration and vocabulary
├── model.safetensors   # Safetensors version of the model weights
└── README.md           # Documentation
```
## Limitations

- The model can only choose among the 10 authors seen during fine-tuning; it cannot attribute posts to authors outside this set.
- It may not generalize well to unseen authors or to blogs outside the dataset distribution.
- Quantization may slightly reduce prediction accuracy.

## Contributing

Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.