SaiCharan7829 committed on
Commit 38b2bbe · verified · 1 Parent(s): ad5c50b

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,98 +1,220 @@
1
  ---
2
  language: en
3
- license: mit
4
  tags:
5
  - text-classification
 
 
6
  - distilbert
7
- - query-classification
8
  - pytorch
9
  datasets:
10
- - synthetic
11
  metrics:
12
  - accuracy
13
  - f1
14
  ---
15
 
16
- # Query Classification Model
 
 
17
 
18
  ## Model Description
19
 
20
- This is a fine-tuned DistilBERT-base-uncased model for classifying user queries into 4 categories: basic_actions, script_writing, information, conversation.
21
 
22
- ## Intended Uses & Limitations
23
 
24
- ### Intended Uses
25
- - Classifying user queries for routing to appropriate handlers
26
- - Chatbot query categorization
27
- - Automated response systems
28
 
29
- ### Limitations
30
- - Trained on synthetic data, may not generalize to all real-world prompts
31
- - Limited to 4 predefined categories
32
- - English language only
33
 
34
- ## Training Details
35
 
36
- ### Training Data
37
- - 640 synthetic queries (scaled from 32,500 target)
38
- - Augmented with synonyms, paraphrasing, room variations
39
- - Deduplicated and filtered (3-50 words)
40
- - Format: JSONL with "context" and "output" fields
41
- - Split: 28 train, 6 validation, 40 test
42
-
43
- ### Training Procedure
44
- - Base model: distilbert-base-uncased (66M parameters)
45
- - Task: Sequence Classification (4 classes)
46
- - Fine-tuning: 3 epochs
47
- - Learning rate: 2e-5
48
- - Batch size: 1 (gradient accumulation 4)
49
- - Optimizer: AdamW
50
-
51
- ### Training Logs
52
- - Epoch 1: Eval Loss 1.37, Accuracy 0.17, F1 0.05
53
- - Epoch 2: Eval Loss 1.35, Accuracy 0.67, F1 0.54
54
- - Epoch 3: Eval Loss 1.32, Accuracy 0.50, F1 0.33
55
 
56
  ## Performance
57
 
58
- | Metric | Value |
59
- |--------|-------|
60
- | Accuracy | 67% |
61
- | F1 Score | 54% |
62
 
63
- ## How to Use
64
 
65
  ```python
66
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
67
  import torch
68
 
69
- tokenizer = AutoTokenizer.from_pretrained("SaiCharan7829/query_classification-distilBERT-66M")
70
- model = AutoModelForSequenceClassification.from_pretrained("SaiCharan7829/query_classification-distilBERT-66M")
71
-
72
- categories = ["basic_actions", "script_writing", "information", "conversation"]
73
 
74
- prompt = "Turn on the lights in the living room"
 
 
75
 
76
- inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True, max_length=512)
77
  with torch.no_grad():
78
  outputs = model(**inputs)
79
- predicted_class = torch.argmax(outputs.logits).item()
80
- print(f"Predicted Category: {categories[predicted_class]}")
81
  ```
82
 
83
- ## Model Files
84
 
85
- - `model.safetensors`: Model weights
86
- - `config.json`: Model configuration
87
- - `tokenizer.json`: Tokenizer files
88
- - `vocab.txt`: Vocabulary
89
- - `special_tokens_map.json`: Special tokens
90
- - `training_args.bin`: Training arguments
91
 
92
- ## Dataset
93
 
94
- The synthetic dataset is included as `train_data.jsonl`, `val_data.jsonl`, `test_data.jsonl`.
95
 
96
  ## License
97
 
98
- MIT License
1
  ---
2
  language: en
3
+ license: apache-2.0
4
  tags:
5
  - text-classification
6
+ - intent-classification
7
+ - task-routing
8
  - distilbert
 
9
  - pytorch
10
  datasets:
11
+ - custom
12
  metrics:
13
  - accuracy
14
  - f1
15
+ model-index:
16
+ - name: query_classification-distilBERT-66M
17
+ results:
18
+ - task:
19
+ type: text-classification
20
+ name: Intent Classification
21
+ metrics:
22
+ - type: accuracy
23
+ value: 98.03
24
+ name: Test Accuracy
25
+ - type: f1
26
+ value: 98.03
27
+ name: F1 Score (Weighted)
28
  ---
29
 
30
+ # DistilBERT Task Router - Query Classification Model (V5)
31
+
32
+ A high-performance intent classification model based on DistilBERT, fine-tuned to classify user queries into 5 categories with **98.03% accuracy** on a challenging test set of 7,320 samples.
33
 
34
  ## Model Description
35
 
36
+ - **Base Model:** distilbert-base-uncased (66M parameters)
37
+ - **Task:** Multi-class text classification (5 categories)
38
+ - **Language:** English
39
+ - **Training Data:** 58,560 samples (custom generated)
40
+ - **Test Accuracy:** **98.03%** ✓
41
+ - **Inference Speed:** ~3ms average latency
42
 
43
+ ## Categories
44
 
45
+ This model classifies text into 5 intent categories:
 
 
 
46
 
47
+ 1. **basic_actions** - One-time, immediate commands
48
+ - Examples: "Turn on the lights", "Set temperature to 22 degrees", "Play music"
 
 
49
 
50
+ 2. **automator** - Recurring, scheduled, or conditional automations
51
+ - Examples: "Turn on lights every day at 6pm", "AC on if temperature > 28", "Every morning at 8am, start coffee"
52
+
53
+ 3. **information** - Educational, factual, or informational queries
54
+ - Examples: "What is quantum computing?", "How does photosynthesis work?", "What's the weather?"
55
 
56
+ 4. **conversation** - Social interactions and casual chat
57
+ - Examples: "Hello", "How are you?", "Good morning", "Nice to meet you"
58
+
59
+ 5. **irrelevant** - Abusive, meaningless, or off-topic content
60
+ - Examples: "asdfghjkl", "You're stupid", "Random gibberish"
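+ For task routing, each category maps naturally onto a handler. A minimal dispatch sketch (the handler functions and their behavior are illustrative placeholders, not part of this model):

```python
# Minimal routing sketch: map each predicted category to a handler.
# All handler functions here are hypothetical placeholders.

def handle_basic_action(text):
    return f"executing: {text}"

def handle_automation(text):
    return f"scheduling: {text}"

def handle_information(text):
    return f"answering: {text}"

def handle_conversation(text):
    return f"chatting: {text}"

def handle_irrelevant(text):
    return "sorry, I can't help with that"

ROUTES = {
    "basic_actions": handle_basic_action,
    "automator": handle_automation,
    "information": handle_information,
    "conversation": handle_conversation,
    "irrelevant": handle_irrelevant,
}

def route(category, text):
    # Unknown labels fall back to the irrelevant handler.
    return ROUTES.get(category, handle_irrelevant)(text)

print(route("basic_actions", "Turn on the lights"))
# -> executing: Turn on the lights
```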
61
 
62
  ## Performance
63
 
64
+ ### Test Set Results (7,320 samples)
65
+
66
+ | Category | Precision | Recall | F1-Score | Support |
67
+ |----------------|-----------|---------|----------|---------|
68
+ | basic_actions | 95.92% | 100.00% | 97.92% | 1,833 |
69
+ | automator | 100.00% | 94.50% | 97.17% | 1,418 |
70
+ | information | 100.00% | 95.39% | 97.64% | 1,432 |
71
+ | conversation | 100.00% | 100.00% | 100.00% | 1,456 |
72
+ | irrelevant | 94.71% | 100.00% | 97.28% | 1,181 |
73
+ | **Overall** | **98.12%** | **98.03%** | **98.03%** | **7,320** |
74
+
75
+ ### Key Metrics
76
 
77
+ - **Accuracy:** 98.03%
78
+ - **F1 Score (Weighted):** 98.03%
79
+ - **F1 Score (Macro):** 98.00%
80
+ - **Error Rate:** 1.97% (144 errors / 7,320 samples)
81
+
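+ The aggregate figures follow directly from the per-class table: weighted F1 averages per-class F1 by support, macro F1 averages them equally. A quick check:

```python
# Recompute the aggregate F1 scores from the per-class table above.
f1 = {
    "basic_actions": 97.92, "automator": 97.17, "information": 97.64,
    "conversation": 100.00, "irrelevant": 97.28,
}
support = {
    "basic_actions": 1833, "automator": 1418, "information": 1432,
    "conversation": 1456, "irrelevant": 1181,
}
total = sum(support.values())  # 7320

# Weighted F1: support-weighted average of per-class F1
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
# Macro F1: unweighted average over classes
macro_f1 = sum(f1.values()) / len(f1)

print(f"Weighted F1: {weighted_f1:.2f}%")  # 98.03%
print(f"Macro F1: {macro_f1:.2f}%")        # 98.00%
```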
82
+ ### Latency
83
+
84
+ - **Average:** 2.91ms
85
+ - **Median:** 2.80ms
86
+ - **P95:** 3.36ms
87
+ - **P99:** 3.88ms
88
+
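+ Percentile latencies like the figures above can be reproduced with a simple timing harness; the workload below is a stand-in for a real `model(**inputs)` call:

```python
import statistics
import time

def measure_latency_ms(fn, n=1000):
    """Time n calls to fn and return average / median / p95 / p99 in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # cut points for percentiles 1..99
    return {
        "avg": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Stand-in workload; replace the lambda with a real forward pass.
stats = measure_latency_ms(lambda: sum(range(1000)))
print({k: round(v, 3) for k, v in stats.items()})
```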
89
+ ## Usage
90
+
91
+ ### Quick Start
92
 
93
  ```python
94
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
95
  import torch
96
 
97
+ # Load model and tokenizer
98
+ model_name = "SaiCharan7829/query_classification-distilBERT-66M"
99
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
100
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
101
 
102
+ # Prepare input
103
+ text = "Turn on the lights every evening at 6pm"
104
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
105
 
106
+ # Get prediction
107
  with torch.no_grad():
108
  outputs = model(**inputs)
109
+ logits = outputs.logits
110
+ predicted_class = torch.argmax(logits, dim=1).item()
111
+
112
+ # Categories mapping
113
+ categories = ["basic_actions", "automator", "information", "conversation", "irrelevant"]
114
+ print(f"Predicted category: {categories[predicted_class]}")
115
+ # Output: Predicted category: automator
116
  ```
117
 
118
+ ### With Confidence Scores
119
+
120
+ ```python
121
+ import torch.nn.functional as F
122
+
123
+ # Get probabilities
124
+ probs = F.softmax(logits, dim=1)[0]
125
+ confidence = probs[predicted_class].item()
126
+
127
+ print(f"Category: {categories[predicted_class]}")
128
+ print(f"Confidence: {confidence:.2%}")
129
 
130
+ # Show all probabilities
131
+ for i, category in enumerate(categories):
132
+ print(f"{category}: {probs[i].item():.2%}")
133
+ ```
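+ In a router it is common to fall back when the top probability is low. A self-contained sketch with a plain-Python softmax; the 0.5 threshold and the example logits are assumptions for illustration, not values from this model:

```python
import math

categories = ["basic_actions", "automator", "information", "conversation", "irrelevant"]

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_with_fallback(logits, threshold=0.5):
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        # Low confidence: route to the fallback category.
        return "irrelevant", probs[best]
    return categories[best], probs[best]

# Illustrative logits, not real model output.
label, conf = classify_with_fallback([4.1, 0.2, 0.3, 0.1, 0.0])
print(label)  # basic_actions
```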
 
 
134
 
135
+ ## Training Details
136
 
137
+ ### Training Hyperparameters
138
+
139
+ - **Epochs:** 30
140
+ - **Batch Size:** 64 (effective, with gradient accumulation)
141
+ - **Learning Rate:** 2e-5
142
+ - **Warmup Steps:** 500
143
+ - **Weight Decay:** 0.01
144
+ - **Label Smoothing:** 0.1
145
+ - **Learning Rate Schedule:** Cosine with warmup
146
+ - **Optimizer:** AdamW
147
+ - **Class Weights:** Applied (automator: 1.31x, basic_actions: 1.48x, irrelevant: 0.98x)
148
+
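+ The class weights listed above were tuned from error analysis. As a baseline, a common frequency-based starting point is the "balanced" heuristic (as in scikit-learn): `weight_i = n_samples / (n_classes * count_i)`. A sketch with toy counts, not the card's dataset:

```python
def balanced_weights(counts):
    """'Balanced' class-weight heuristic: n_samples / (n_classes * count_i)."""
    n = sum(counts.values())
    k = len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Toy counts (illustrative only): the minority classes get weights > 1.
weights = balanced_weights({"a": 100, "b": 50, "c": 50})
print(weights)
# -> {'a': 0.666..., 'b': 1.333..., 'c': 1.333...}
```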
149
+ ### Dataset
150
+
151
+ - **Training Samples:** 58,560
152
+ - **Validation Samples:** 7,320
153
+ - **Test Samples:** 7,320
154
+ - **Data Split:** 80% / 10% / 10%
155
+
156
+ **Distribution:**
157
+ - basic_actions: 24.4% (15,000 samples with 40% short commands)
158
+ - automator: 19.8%
159
+ - information: 19.7%
160
+ - conversation: 19.8%
161
+ - irrelevant: 16.4%
162
+
163
+ ### Training Infrastructure
164
+
165
+ - **Framework:** Transformers 4.x, PyTorch 2.x
166
+ - **Device:** Apple Silicon (MPS)
167
+ - **Precision:** FP32
168
+
169
+ ## Limitations & Biases
170
+
171
+ - The model is trained on English text only
172
+ - Performance may degrade on domain-specific jargon not seen during training
173
+ - Short ambiguous commands (1-2 words) may have lower confidence
174
+ - The "irrelevant" category includes abusive content, which may reflect biases in training data
175
+
176
+ ## Intended Use
177
+
178
+ This model is designed for:
179
+ - Smart home assistants and IoT platforms
180
+ - Chatbot intent classification
181
+ - Task routing and workflow automation
182
+ - Virtual assistant command parsing
183
+
184
+ **Not recommended for:**
185
+ - Sensitive content moderation (use dedicated safety models)
186
+ - Medical or legal decision-making
187
+ - Financial advice classification
188
+
189
+ ## Version History
190
+
191
+ ### v5 (Current) - November 2024
192
+ - **Accuracy:** 98.03% (test set)
193
+ - Major improvements to basic_actions recall (100%)
194
+ - Optimized class weights based on error analysis
195
+ - Enhanced dataset with better short command coverage
196
+
197
+ ### v4
198
+ - **Accuracy:** 94.86% (test set)
199
+ - Initial release with 72k training samples
200
+ - Identified issues with short command classification
201
+
202
+ ## Citation
203
+
204
+ ```bibtex
205
+ @misc{query_classification_distilbert_2024,
206
+ author = {SaiCharan7829},
207
+ title = {DistilBERT Task Router - Query Classification Model},
208
+ year = {2024},
209
+ publisher = {HuggingFace},
210
+ howpublished = {\url{https://huggingface.co/SaiCharan7829/query_classification-distilBERT-66M}}
211
+ }
212
+ ```
213
 
214
  ## License
215
 
216
+ Apache 2.0
217
+
218
+ ## Model Card Authors
219
+
220
+ SaiCharan7829
config.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "activation": "gelu",
3
+ "architectures": [
4
+ "DistilBertForSequenceClassification"
5
+ ],
6
+ "attention_dropout": 0.3,
7
+ "dim": 768,
8
+ "dropout": 0.3,
9
+ "dtype": "float32",
10
+ "hidden_dim": 3072,
11
+ "id2label": {
12
+ "0": "LABEL_0",
13
+ "1": "LABEL_1",
14
+ "2": "LABEL_2",
15
+ "3": "LABEL_3",
16
+ "4": "LABEL_4"
17
+ },
18
+ "initializer_range": 0.02,
19
+ "label2id": {
20
+ "LABEL_0": 0,
21
+ "LABEL_1": 1,
22
+ "LABEL_2": 2,
23
+ "LABEL_3": 3,
24
+ "LABEL_4": 4
25
+ },
26
+ "max_position_embeddings": 512,
27
+ "model_type": "distilbert",
28
+ "n_heads": 12,
29
+ "n_layers": 6,
30
+ "pad_token_id": 0,
31
+ "qa_dropout": 0.1,
32
+ "seq_classif_dropout": 0.3,
33
+ "sinusoidal_pos_embds": false,
34
+ "tie_weights_": true,
35
+ "transformers_version": "4.57.1",
36
+ "vocab_size": 30522
37
+ }
label_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "basic_actions": 0,
3
+ "automator": 1,
4
+ "information": 2,
5
+ "conversation": 3,
6
+ "irrelevant": 4
7
+ }
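+ Note that the shipped config.json maps class ids to generic `LABEL_0`..`LABEL_4`; the human-readable names live in label_map.json. Inverting it gives the `id2label` form that `transformers` configs use (a sketch; in practice you would load label_map.json from the repo rather than inline it):

```python
# label_map.json contents, inlined here for illustration.
label_map = {
    "basic_actions": 0,
    "automator": 1,
    "information": 2,
    "conversation": 3,
    "irrelevant": 4,
}

# Invert to id2label, e.g. for
# AutoModelForSequenceClassification.from_pretrained(..., id2label=id2label, label2id=label_map)
id2label = {i: name for name, i in label_map.items()}
print(id2label[1])  # automator
```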
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6e646c4db75cca299970ef495042c3ddfa322737f8fa3466b240384bedaabd3f
3
+ size 267841796
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "100": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "101": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "102": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "103": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": false,
45
+ "cls_token": "[CLS]",
46
+ "do_lower_case": true,
47
+ "extra_special_tokens": {},
48
+ "mask_token": "[MASK]",
49
+ "model_max_length": 512,
50
+ "pad_token": "[PAD]",
51
+ "sep_token": "[SEP]",
52
+ "strip_accents": null,
53
+ "tokenize_chinese_chars": true,
54
+ "tokenizer_class": "DistilBertTokenizer",
55
+ "unk_token": "[UNK]"
56
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:346d95449450462a9af6a37abd6886f382d3c25debf1065d0ae773f601d4a53c
3
+ size 5841
vocab.txt ADDED
The diff for this file is too large to render. See raw diff