Aryan7500 committed on
Commit 5e5973e · verified · 1 Parent(s): 49a6c61

Upload 6 files

Files changed (6)
  1. README.md +108 -0
  2. config.json +25 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +7 -0
  5. tokenizer_config.json +57 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,108 @@
+
+ # DistilBERT Quantized Model for IMDB Sentiment Analysis
+
+ This repository contains a quantized DistilBERT model fine-tuned for binary sentiment classification on IMDB movie reviews. Optimized for production deployment, it reports 89.1% accuracy at about 12 ms CPU inference latency (see Performance Metrics).
+
+ ## Model Details
+
+ - **Model Architecture:** DistilBERT Base Uncased
+ - **Task:** Binary Sentiment Analysis (Positive/Negative)
+ - **Dataset:** IMDB Movie Reviews (50K samples)
+ - **Quantization:** Dynamic Quantization (INT8)
+ - **Framework:** Hugging Face Transformers + PyTorch
+
+ ## Usage
+
+ ### Installation
+
+ ```sh
+ pip install transformers torch scikit-learn pandas
+ ```
+
+ ### Loading the Model
+
+ ```python
+ from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
+ import torch
+
+ # Rebuild the full-precision model, then quantize it the same way as at export
+ # time so that its layers match the keys in the quantized state dict.
+ model_path = "./quantized_sentiment_model.pth"
+ model = DistilBertForSequenceClassification.from_pretrained("./sentiment_model")
+ model = torch.quantization.quantize_dynamic(
+     model, {torch.nn.Linear}, dtype=torch.qint8
+ )
+ model.load_state_dict(torch.load(model_path, map_location="cpu"))
+ model.eval()
+
+ # Load tokenizer
+ tokenizer = DistilBertTokenizer.from_pretrained("./sentiment_model")
+
+ def predict_sentiment(text):
+     inputs = tokenizer(text, return_tensors="pt",
+                        padding=True, truncation=True,
+                        max_length=128)
+
+     with torch.no_grad():
+         outputs = model(**inputs)
+
+     # Take the argmax over the class dimension (one input, one prediction)
+     prediction = torch.argmax(outputs.logits, dim=-1).item()
+     return "Positive" if prediction == 1 else "Negative"
+
+ # Example usage
+ review = "This movie blew me away with its stunning visuals and gripping storyline."
+ print(predict_sentiment(review))  # Expected output: Positive
+ ```
+
+ ## 📊 Performance Metrics
+
+ | Metric                  | Value |
+ |-------------------------|-------|
+ | Accuracy                | 89.1% |
+ | F1 Score                | 89.0% |
+ | Inference Latency (CPU) | 12 ms |
+ | Model Size              | 67 MB |
+
+ ## 🏋️ Training Details
+
+ ### Dataset
+
+ - 50,000 IMDB movie reviews
+ - Balanced binary classes (50% positive, 50% negative)
+
+ ### Hyperparameters
+
+ - Epochs: 5
+ - Batch Size: 24 (effective 48 with gradient accumulation)
+ - Learning Rate: 8e-6
+ - Warmup Ratio: 10%
+ - Weight Decay: 0.005
+ - Optimizer: AdamW with Cosine LR Schedule
+
+ ### Quantization
+
+ Applied dynamic post-training quantization:
+
+ ```python
+ quantized_model = torch.quantization.quantize_dynamic(
+     model, {torch.nn.Linear}, dtype=torch.qint8
+ )
+ ```
+
+ ## 📁 Repository Structure
+
+ ```
+ .
+ ├── sentiment_model/               # Full-precision model files
+ │   ├── config.json
+ │   ├── pytorch_model.bin
+ │   └── tokenizer files...
+ ├── quantized_sentiment_model.pth  # Quantized weights
+ ├── imdb_train.csv                 # Sample training data
+ ├── train.py                       # Training script
+ └── inference.py                   # Usage examples
+ ```
+
+ ## ⚠️ Limitations
+
+ - Accuracy may drop on reviews with:
+   - Sarcasm or nuanced language
+   - Domain-specific terminology (non-movie content)
+ - Maximum sequence length: 128 tokens
+ - English language only
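The Quantization and Loading the Model sections of the README above depend on a detail that is easy to miss: `quantize_dynamic` swaps each `torch.nn.Linear` for a dynamically quantized module whose state dict uses different keys, so the saved weights should only load into a model that was quantized the same way. The following is a minimal sketch of that round trip. It uses a small stand-in network rather than DistilBERT; the `make_model` and `quantize` helpers and the layer sizes are hypothetical.

```python
import torch
from torch import nn


def make_model() -> nn.Module:
    # Hypothetical stand-in for the fine-tuned classifier (not DistilBERT).
    return nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))


def quantize(model: nn.Module) -> nn.Module:
    # Same call as in the README: INT8 dynamic quantization of Linear layers.
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )


torch.manual_seed(0)
quantized = quantize(make_model().eval())
state = quantized.state_dict()

# Loading into a plain float model fails because the keys do not match.
try:
    make_model().load_state_dict(state)
    float_load_ok = True
except RuntimeError:
    float_load_ok = False

# Quantize a fresh model first, then load the quantized weights into it.
restored = quantize(make_model().eval())
restored.load_state_dict(state)

x = torch.randn(4, 16)
with torch.no_grad():
    same_output = torch.allclose(quantized(x), restored(x))

print("float model accepted quantized weights:", float_load_ok)
print("restored model matches original:", same_output)
```

The same ordering (build, quantize, then `load_state_dict`) would apply to the DistilBERT checkpoint, assuming the `.pth` file was saved from a model quantized with this call.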
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "distilbert-base-uncased",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float16",
+   "transformers_version": "4.36.2",
+   "vocab_size": 30522
+ }
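As a rough cross-check of the configuration above, the parameter count and checkpoint size can be estimated from `dim`, `hidden_dim`, `n_layers`, `max_position_embeddings`, and `vocab_size`. The sketch below assumes the standard DistilBERT layout: word and position embeddings with a LayerNorm, six blocks of four attention projections plus a two-layer feed-forward network, and a `pre_classifier`/`classifier` head. The two-label head is inferred from the README's binary task rather than stated in the config, and `count_parameters` is a hypothetical helper.

```python
def linear(n_in: int, n_out: int) -> int:
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    # Scale and shift vectors.
    return 2 * dim


def count_parameters(cfg: dict, num_labels: int = 2) -> int:
    dim, hidden = cfg["dim"], cfg["hidden_dim"]
    embeddings = (
        cfg["vocab_size"] * dim
        + cfg["max_position_embeddings"] * dim
        + layer_norm(dim)
    )
    block = (
        4 * linear(dim, dim)      # q, k, v and output projections
        + layer_norm(dim)
        + linear(dim, hidden)     # feed-forward up-projection
        + linear(hidden, dim)     # feed-forward down-projection
        + layer_norm(dim)
    )
    head = linear(dim, dim) + linear(dim, num_labels)
    return embeddings + cfg["n_layers"] * block + head


config = {
    "dim": 768,
    "hidden_dim": 3072,
    "n_layers": 6,
    "max_position_embeddings": 512,
    "vocab_size": 30522,
}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

params = count_parameters(config)
fp16_bytes = params * BYTES_PER_PARAM["float16"]
print(f"{params:,} parameters, about {fp16_bytes / 1e6:.1f} MB in float16")
```

Under these assumptions the estimate is 66,955,010 parameters, or about 133.9 MB at two bytes each. That is close to the 133922428-byte `model.safetensors` pointer in this commit and consistent with `"torch_dtype": "float16"`.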
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e648b51b5b350bedf9493c6b176dd2b6372adc0b25f67040d229386cc068d8ab
+ size 133922428
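The three added lines above are a Git LFS pointer rather than the weights themselves: `oid` is the SHA-256 digest of the real file and `size` is its length in bytes. A small sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e648b51b5b350bedf9493c6b176dd2b6372adc0b25f67040d229386cc068d8ab
size 133922428
"""

pointer = parse_lfs_pointer(POINTER)
size_mb = pointer["size"] / 1e6
print(pointer["hash_algorithm"], f"{size_mb:.1f} MB")
```

After downloading the actual file, its SHA-256 digest could be compared against `pointer["digest"]` to confirm the download is intact.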
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff