Upload 6 files

- README.md +108 -0
- config.json +25 -0
- model.safetensors +3 -0
- special_tokens_map.json +7 -0
- tokenizer_config.json +57 -0
- vocab.txt +0 -0

README.md
ADDED
@@ -0,0 +1,108 @@
# DistilBERT Quantized Model for IMDB Sentiment Analysis

This repository contains a quantized DistilBERT model fine-tuned for binary sentiment classification of IMDB movie reviews. Optimized for production deployment, the model maintains high accuracy while INT8 quantization keeps its size and CPU inference latency low.

## Model Details

- **Model Architecture:** DistilBERT Base Uncased
- **Task:** Binary Sentiment Analysis (Positive/Negative)
- **Dataset:** IMDB Movie Reviews (50K samples)
- **Quantization:** Dynamic Quantization (INT8)
- **Framework:** Hugging Face Transformers + PyTorch

## Usage

### Installation

```sh
pip install transformers torch scikit-learn pandas
```

### Loading the Model

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load the full-precision architecture, then re-apply dynamic quantization so
# the module structure matches the saved quantized state dict before loading it
model_path = "./quantized_sentiment_model.pth"
model = DistilBertForSequenceClassification.from_pretrained("./sentiment_model")
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
model.load_state_dict(torch.load(model_path))
model.eval()

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("./sentiment_model")

def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt",
                       padding=True, truncation=True,
                       max_length=128)

    with torch.no_grad():
        outputs = model(**inputs)

    prediction = torch.argmax(outputs.logits, dim=-1).item()
    return "Positive" if prediction == 1 else "Negative"

# Example usage
review = "This movie blew me away with its stunning visuals and gripping storyline."
print(predict_sentiment(review))  # Output: Positive
```
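
For scoring many reviews at once, a batched helper avoids per-call overhead. This is a minimal sketch reusing the `model` and `tokenizer` loaded above; the helper name and the batch size of 32 are illustrative choices, not part of this repository.

```python
def predict_sentiments(texts, batch_size=32):
    """Classify a list of reviews in batches; returns a list of labels."""
    labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        preds = torch.argmax(logits, dim=-1).tolist()
        labels.extend("Positive" if p == 1 else "Negative" for p in preds)
    return labels

print(predict_sentiments(["Loved every minute.", "A tedious, joyless slog."]))
```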

## 📊 Performance Metrics

| Metric                  | Value |
|-------------------------|-------|
| Accuracy                | 89.1% |
| F1 Score                | 89.0% |
| Inference Latency (CPU) | 12 ms |
| Model Size              | 67 MB |
## 🏋️ Training Details

### Dataset

- 50,000 IMDB movie reviews
- Balanced binary classes (50% positive, 50% negative)

### Hyperparameters

- Epochs: 5
- Batch Size: 24 (effective 48 with gradient accumulation)
- Learning Rate: 8e-6
- Warmup Ratio: 10%
- Weight Decay: 0.005
- Optimizer: AdamW with cosine LR schedule (see the `Trainer` sketch below)
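
Since `train.py` itself is not rendered in this diff, the following is only a sketch of how the listed hyperparameters might map onto the Hugging Face `Trainer` API; the dataset variables are placeholders for tokenized IMDB splits.

```python
from transformers import TrainingArguments, Trainer

# Hypothetical mapping of the hyperparameters above; a per-device batch size
# of 24 with 2 accumulation steps yields the effective batch size of 48.
training_args = TrainingArguments(
    output_dir="./sentiment_model",
    num_train_epochs=5,
    per_device_train_batch_size=24,
    gradient_accumulation_steps=2,
    learning_rate=8e-6,
    warmup_ratio=0.1,
    weight_decay=0.005,
    lr_scheduler_type="cosine",    # AdamW is the Trainer default optimizer
)

trainer = Trainer(
    model=model,                   # fresh DistilBertForSequenceClassification
    args=training_args,
    train_dataset=train_dataset,   # placeholder: tokenized IMDB train split
    eval_dataset=eval_dataset,     # placeholder: tokenized IMDB eval split
)
trainer.train()
```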

### Quantization

Dynamic post-training quantization was applied to all linear layers:

```python
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
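
The quantized module can then be saved as a plain state dict, which is presumably how `quantized_sentiment_model.pth` was produced (an assumption — the saving step is not shown in this upload):

```python
import os

# Persist only the weights; the architecture is rebuilt at load time
torch.save(quantized_model.state_dict(), "quantized_sentiment_model.pth")
size_mb = os.path.getsize("quantized_sentiment_model.pth") / 1e6
print(f"Size on disk: {size_mb:.1f} MB")  # should be close to the 67 MB above
```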

## 📁 Repository Structure

```
.
├── sentiment_model/                # Full-precision model files
│   ├── config.json
│   ├── pytorch_model.bin
│   └── tokenizer files...
├── quantized_sentiment_model.pth   # Quantized weights
├── imdb_train.csv                  # Sample training data
├── train.py                        # Training script
└── inference.py                    # Usage examples
```

## ⚠️ Limitations

- Accuracy may drop on reviews with:
  - Sarcasm or nuanced language
  - Domain-specific terminology (non-movie content)
- Maximum sequence length: 128 tokens
- English language only

config.json
ADDED

@@ -0,0 +1,25 @@
{
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float16",
  "transformers_version": "4.36.2",
  "vocab_size": 30522
}
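A quick way to sanity-check this config after downloading the repository is to load it with `transformers` — a minimal sketch, assuming the files sit in the current directory:

```python
from transformers import AutoConfig

# Reads config.json from the given directory
config = AutoConfig.from_pretrained(".")
print(config.model_type, config.n_layers, config.dim)  # distilbert 6 768
```
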
model.safetensors
ADDED

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e648b51b5b350bedf9493c6b176dd2b6372adc0b25f67040d229386cc068d8ab
size 133922428
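This is a Git LFS pointer, not the weights themselves; the actual file is about 134 MB (float16, per `torch_dtype` in config.json). After `git lfs pull`, the tensors can be read with the `safetensors` package — a hedged sketch, assuming `safetensors` is installed:

```python
from safetensors.torch import load_file

# Loads the real weight tensors once the LFS object has been fetched
state_dict = load_file("model.safetensors")
print(f"{len(state_dict)} tensors loaded")
```
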
special_tokens_map.json
ADDED

@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer_config.json
ADDED

@@ -0,0 +1,57 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
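A quick check that the tokenizer behaves as configured (lowercasing on, `model_max_length` of 512, BERT-style special tokens) — a minimal sketch, assuming the uploaded files sit in the current directory:

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained(".")
enc = tokenizer("An Unforgettable Film!")
print(enc["input_ids"])            # begins with 101 ([CLS]) and ends with 102 ([SEP])
print(tokenizer.model_max_length)  # 512
```
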
vocab.txt
ADDED
The diff for this file is too large to render.