Upload 7 files
- README.md +116 -0
- config.json +48 -0
- model.safetensors +3 -0
- special_tokens_map.json +37 -0
- tokenizer.json +0 -0
- tokenizer_config.json +63 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,116 @@
# DistilBERT Fine-Tuned Model for Authorship Attribution on Blog Corpus

This repository hosts a fine-tuned DistilBERT model for the **authorship attribution** task on the Blog Authorship Corpus dataset. The model identifies the author of a given blog post from a subset of top contributors.

## Model Details

- **Model Architecture:** DistilBERT Base (distilbert-base-uncased)
- **Task:** Authorship Attribution
- **Dataset:** Blog Authorship Corpus (top 10 authors selected)
- **Quantization:** Float16 (post-training)
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch

# Load the fine-tuned model and tokenizer
model_path = "fine-tuned-model"
model = DistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Move to device, switch to evaluation mode, and cast to half precision
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
model.half()

# Example input
blog_post = "Today I went to the beach and had an amazing time with friends. The sunset was breathtaking!"

# Tokenize input (input_ids and attention_mask are integer tensors, so no dtype cast is needed)
inputs = tokenizer(blog_post, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Label mapping (example)
label_mapping = {
    0: "Author_A",
    1: "Author_B",
    2: "Author_C",
    3: "Author_D",
    4: "Author_E",
    5: "Author_F",
    6: "Author_G",
    7: "Author_H",
    8: "Author_I",
    9: "Author_J"
}

predicted_author = label_mapping[predicted_class]
print(f"Predicted Author: {predicted_author}")
```
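
The `label_mapping` above is only an illustrative placeholder. The class labels actually shipped with this checkpoint are the numeric author IDs stored in `config.json`, which can be read back from the loaded model's configuration. A minimal sketch, reusing `model` and `predicted_class` from the example above:

```python
# config.json carries an id2label mapping from class index (0-9) to the
# numeric author ID used in the Blog Authorship Corpus subset.
predicted_author_id = model.config.id2label[predicted_class]
print(f"Predicted author ID: {predicted_author_id}")
```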

## Performance Metrics

- **Accuracy:** ~78% (on the validation set of the top 10 authors)
- **Precision/Recall/F1:** Vary per class, average F1 around 0.75

## Fine-Tuning Details

### Dataset

The model is trained on a subset of the **Blog Authorship Corpus** containing blogs from the top 10 most prolific authors. Each sample is a blog post with its corresponding author label.
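
The preprocessing script is not included in this repository. The following is a minimal sketch of how such a top-10 subset could be built, assuming the corpus is available as a pandas DataFrame; the file name `blogs.csv` and the `text`/`author` column names are hypothetical:

```python
import pandas as pd

# Hypothetical input: one row per blog post, with "text" and "author" columns.
df = pd.read_csv("blogs.csv")

# Keep only posts written by the 10 most prolific authors.
top_authors = df["author"].value_counts().nlargest(10).index
subset = df[df["author"].isin(top_authors)].reset_index(drop=True)

# Map each author ID to a class index (0-9) for sequence classification.
label2id = {author: idx for idx, author in enumerate(sorted(top_authors))}
subset["label"] = subset["author"].map(label2id)
```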

### Training

- **Epochs:** 3
- **Batch size:** 8
- **Evaluation strategy:** Per epoch
- **Learning rate:** 2e-5
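
These hyperparameters correspond to a standard Hugging Face `Trainer` setup. The sketch below is not the exact training script: it assumes pre-tokenized `train_dataset` and `eval_dataset` objects (not included in this repository), requires `scikit-learn` for metrics, and the weighted F1 averaging is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        # Weighted averaging is an assumption; the card does not state the scheme.
        "f1": f1_score(labels, preds, average="weighted"),
    }

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    num_train_epochs=3,               # Epochs: 3
    per_device_train_batch_size=8,    # Batch size: 8
    per_device_eval_batch_size=8,
    eval_strategy="epoch",            # Evaluation strategy: per epoch
    learning_rate=2e-5,               # Learning rate: 2e-5
)

trainer = Trainer(
    model=model,                      # DistilBertForSequenceClassification with 10 labels
    args=training_args,
    train_dataset=train_dataset,      # assumed: tokenized training split
    eval_dataset=eval_dataset,        # assumed: tokenized validation split
    compute_metrics=compute_metrics,
)
trainer.train()
```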

### Quantization

Post-training dynamic quantization using PyTorch was applied to reduce model size and accelerate inference:

```python
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

Note that `quantize_dynamic` produces an int8-quantized copy of the model for CPU inference; the checkpoint uploaded in this repository is stored in float16 (see `torch_dtype` in `config.json`).

## Repository Structure

```
.
├── README.md                  # Documentation (this file)
├── config.json                # Model configuration (labels, float16 dtype)
├── model.safetensors          # Fine-tuned model weights
├── special_tokens_map.json    # Special token definitions
├── tokenizer.json             # Tokenizer data
├── tokenizer_config.json      # Tokenizer configuration
└── vocab.txt                  # WordPiece vocabulary
```

## Limitations

- The model is limited to the top 10 authors used in fine-tuning.
- It may not generalize well to unseen authors or to blogs outside the dataset distribution.
- Quantization may slightly affect prediction accuracy.

## Contributing

Contributions are welcome! If you find bugs or have suggestions for improvements, feel free to open an issue or submit a pull request.
config.json
ADDED
@@ -0,0 +1,48 @@
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": 913315,
    "1": 1853281,
    "2": 1904603,
    "3": 1955799,
    "4": 2752410,
    "5": 2781780,
    "6": 3019516,
    "7": 3122872,
    "8": 3346463,
    "9": 3428854
  },
  "initializer_range": 0.02,
  "label2id": {
    "1853281": 1,
    "1904603": 2,
    "1955799": 3,
    "2752410": 4,
    "2781780": 5,
    "3019516": 6,
    "3122872": 7,
    "3346463": 8,
    "3428854": 9,
    "913315": 0
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float16",
  "transformers_version": "4.51.3",
  "vocab_size": 30522
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:72818d55860500790b1f419b6eedc5dfa39629877d7d736805f9799b688b4b3e
size 133934740
special_tokens_map.json
ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,63 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_length": 256,
  "model_max_length": 512,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt
ADDED
The diff for this file is too large to render.