- commoneval
- wildvoice
- voicebench
- fine-tuned
---

# Qwen2.5-0.5B Text Classification Model

This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using LoRA (Low-Rank Adaptation) for text classification tasks. It has been trained to classify text into three categories based on VoiceBench dataset patterns.

## 🎯 Model Description

The model has been trained to classify text into three distinct categories:

- **ifeval**: Instruction-following tasks with specific formatting requirements and step-by-step instructions
- **commoneval**: Factual questions and knowledge-based queries requiring direct answers
- **wildvoice**: Conversational, informal language patterns and natural dialogue

## 📊 Performance Results

### Overall Performance

- **Overall Accuracy**: **93.33%** (28/30 correct predictions)
- **Training Method**: LoRA (Low-Rank Adaptation)
- **Trainable Parameters**: 0.88% of total parameters (4,399,104 out of 498,431,872)

### Per-Category Performance

| Category | Accuracy | Correct/Total | Description |
|----------|----------|---------------|-------------|
| **ifeval** | **100%** | 10/10 | Perfect performance on instruction-following tasks |
| **commoneval** | **80%** | 8/10 | Good performance on factual questions |
| **wildvoice** | **100%** | 10/10 | Perfect performance on conversational text |

### Confusion Matrix

```
ifeval:
  -> ifeval: 10
commoneval:
  -> commoneval: 8
  -> unknown: 1
  -> wildvoice: 1
wildvoice:
  -> wildvoice: 10
```
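
The headline accuracy can be recomputed from the confusion matrix as a quick sanity check:

```python
# Confusion matrix from the evaluation above: true label -> predicted label -> count
confusion = {
    "ifeval":     {"ifeval": 10},
    "commoneval": {"commoneval": 8, "unknown": 1, "wildvoice": 1},
    "wildvoice":  {"wildvoice": 10},
}

# Correct predictions sit on the diagonal (predicted == true)
correct = sum(preds.get(true, 0) for true, preds in confusion.items())
total = sum(sum(preds.values()) for preds in confusion.values())
print(f"{correct}/{total} = {correct / total:.2%}")  # -> 28/30 = 93.33%
```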

## 🔬 Development Journey & Methods Tried

### Initial Challenges

We started with several approaches that didn't work well:

1. **GRPO (Group Relative Policy Optimization)**: Initial attempts with GRPO training showed poor performance
   - Loss decreased, but the model wasn't learning the classification task
   - The model generated irrelevant responses like "unknown", "txt", "com"
   - Overall accuracy: ~20%

2. **Full Fine-tuning**: Attempted full fine-tuning of larger models
   - CUDA out-of-memory issues with larger models
   - Numerical instability with certain model architectures
   - Poor convergence on the classification task

3. **Complex Prompt Formats**: Tried various complex prompt structures
   - "Classify this text as ifeval, commoneval, or wildvoice: ..."
   - The model struggled with complex instructions
   - It generated explanations instead of simple labels

### Breakthrough: Direct Classification Approach

The key breakthrough came with a **direct, simple approach**:

#### 1. **Simplified Prompt Format**

Instead of complex classification prompts, we used a simple format:

```
Text: {input_text}
Label: {expected_label}
```

#### 2. **LoRA (Low-Rank Adaptation)**

- Used the PEFT library for efficient fine-tuning
- Only trained 0.88% of the parameters
- Much more stable than full fine-tuning
- Faster training and inference

#### 3. **Focused Training Data**

Created clear, distinct examples for each category:

- **ifeval**: Instruction-following with specific formatting requirements
- **commoneval**: Factual questions requiring direct answers
- **wildvoice**: Conversational, informal language patterns

#### 4. **Optimal Hyperparameters**

- **Learning Rate**: 5e-4 (higher than initial attempts)
- **Batch Size**: 2 (smaller, for stability)
- **Max Length**: 128 (shorter sequences)
- **Training Steps**: 150
- **LoRA Rank**: 8 (focused learning)
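
The simplified prompt format is easy to capture in a pair of helpers; this is an illustrative sketch (the function names are not from the training code):

```python
def format_training_example(text: str, label: str) -> str:
    # Training examples pair the raw text with its expected label.
    return f"Text: {text}\nLabel: {label}"

def format_inference_prompt(text: str) -> str:
    # At inference time the label is left blank for the model to complete.
    return f"Text: {text}\nLabel:"

print(format_training_example("What is the capital of France?", "commoneval"))
```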

## 🚀 Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")
tokenizer = AutoTokenizer.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")

def classify_text(text):
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(generated[0], skip_special_tokens=True)
    prediction = response[len(prompt):].strip().lower()

    # Map the raw generation onto one of the three labels
    for label in ("ifeval", "commoneval", "wildvoice"):
        if label in prediction:
            return label
    return "unknown"

print(classify_text("Hey, how are you doing today?"))
# Output: wildvoice
```

### Advanced Usage with Confidence Scoring
```python
from collections import Counter

import torch

def classify_with_confidence(text, num_samples=5):
    predictions = []
    for _ in range(num_samples):
        prompt = f"Text: {text}\nLabel:"
        inputs = tokenizer(prompt, return_tensors="pt")

        with torch.no_grad():
            generated = model.generate(
                **inputs,
                max_new_tokens=15,
                do_sample=True,
                temperature=0.3,  # slightly higher for diversity
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )

        response = tokenizer.decode(generated[0], skip_special_tokens=True)
        prediction = response[len(prompt):].strip().lower()

        # Clean up the prediction
        if "ifeval" in prediction:
            prediction = "ifeval"
        elif "commoneval" in prediction:
            prediction = "commoneval"
        elif "wildvoice" in prediction:
            prediction = "wildvoice"
        else:
            prediction = "unknown"

        predictions.append(prediction)

    # Confidence = share of samples agreeing with the majority label
    counts = Counter(predictions)
    most_common = counts.most_common(1)[0]
    confidence = most_common[1] / len(predictions)

    return most_common[0], confidence

# Example with confidence
label, confidence = classify_with_confidence("Please follow these steps: 1) Read 2) Think 3) Write")
print(f"Prediction: {label}, Confidence: {confidence:.2%}")
```
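
The majority-vote confidence calculation can be exercised on its own with made-up sample predictions:

```python
from collections import Counter

# Hypothetical predictions from five sampled generations
predictions = ["wildvoice", "wildvoice", "commoneval", "wildvoice", "wildvoice"]

# Confidence is the fraction of samples agreeing with the majority label
counts = Counter(predictions)
label, votes = counts.most_common(1)[0]
confidence = votes / len(predictions)
print(label, confidence)  # -> wildvoice 0.8
```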

## 📋 Training Details

### Model Architecture

- **Base Model**: Qwen/Qwen2.5-0.5B-Instruct
- **Parameters**: 498,431,872 total, 4,399,104 trainable (0.88%)
- **Precision**: FP16 (mixed precision)
- **Device**: CUDA (GPU accelerated)
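
The 0.88% figure follows directly from the two parameter counts:

```python
total_params = 498_431_872
trainable_params = 4_399_104

# Fraction of the network actually updated by LoRA
fraction = trainable_params / total_params
print(f"{fraction:.2%} of parameters are trainable")  # -> 0.88% of parameters are trainable
```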

### Training Configuration
```python
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # rank
    lora_alpha=16,   # LoRA alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

# Training arguments (sequences were truncated to 128 tokens at tokenization time;
# TrainingArguments itself has no max_length parameter)
training_args = TrainingArguments(
    learning_rate=5e-4,
    per_device_train_batch_size=2,
    max_steps=150,
    fp16=True,
    gradient_accumulation_steps=1,
    warmup_steps=20,
    weight_decay=0.01,
    max_grad_norm=1.0
)
```

### Dataset

The model was trained on synthetic data representing three text categories:

- **60 total samples** (20 per category)
- **ifeval**: Instruction-following tasks with specific formatting requirements
- **commoneval**: Factual questions and knowledge-based queries
- **wildvoice**: Conversational, informal language patterns
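
A minimal sketch of what such synthetic data looks like in the `Text:`/`Label:` training format (the sample texts here are illustrative, not the actual training set):

```python
# One illustrative sample per category; the real set had 20 per category
samples = [
    ("Write your answer in exactly two sentences.", "ifeval"),
    ("What is the capital of France?", "commoneval"),
    ("Hey, how's it going today?", "wildvoice"),
]

# Render each pair into the simple prompt format used for training
training_texts = [f"Text: {text}\nLabel: {label}" for text, label in samples]
print(training_texts[0])
```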

## 🔍 Error Analysis

### Failed Predictions (2 out of 30)

1. **"What is 2 plus 2?"** → Predicted: `unknown` (Expected: `commoneval`)
   - Model generated: `#eval{1} Label: #eval{2} Label: #`
   - Issue: the model produced code-like syntax instead of a simple label

2. **"What is the opposite of hot?"** → Predicted: `wildvoice` (Expected: `commoneval`)
   - Model generated: `#wildvoice:comoneval:hot:yourresponse:whatis`
   - Issue: the model produced a complex response instead of a simple label

### Success Factors

- **Simple prompt format** was crucial for success
- **LoRA fine-tuning** provided stable training
- **Focused training data** with clear category distinctions
- **Appropriate hyperparameters** (learning rate, batch size, etc.)
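
The keyword-matching cleanup from the usage examples explains both failures: the first garbled generation contains none of the three label strings, while the second contains `wildvoice` verbatim (note the misspelled `comoneval`), hence the misclassification. A standalone check:

```python
def extract_label(prediction: str) -> str:
    # Same keyword matching used in the usage examples above
    prediction = prediction.strip().lower()
    for label in ("ifeval", "commoneval", "wildvoice"):
        if label in prediction:
            return label
    return "unknown"

# The two failure cases reported above:
print(extract_label("#eval{1} Label: #eval{2} Label: #"))             # -> unknown
print(extract_label("#wildvoice:comoneval:hot:yourresponse:whatis"))  # -> wildvoice
```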

## 🛠️ Technical Implementation

### Files Structure
```
merged_classification_model/
├── README.md                 # This file
├── config.json               # Model configuration
├── generation_config.json    # Generation settings
├── model.safetensors         # Model weights (988MB)
├── tokenizer.json            # Tokenizer vocabulary
├── tokenizer_config.json     # Tokenizer configuration
├── special_tokens_map.json   # Special tokens mapping
├── added_tokens.json         # Added tokens
├── merges.txt                # BPE merges
├── vocab.json                # Vocabulary
└── chat_template.jinja       # Chat template
```

### Dependencies
```bash
pip install "transformers>=4.56.0"
pip install "torch>=2.0.0"
pip install "peft>=0.17.0"
pip install "accelerate>=0.21.0"
```
## 🎯 Use Cases

This model is particularly useful for:

- **Text categorization** in educational platforms
- **Content filtering** based on text type
- **Dataset preprocessing** for machine learning pipelines
- **VoiceBench-style evaluation** systems
- **Instruction-following detection** in AI systems
- **Conversational vs. factual text separation**

## ⚠️ Limitations

1. **Synthetic Training Data**: The model was trained on synthetic data and may not generalize perfectly to all real-world text
2. **Three-Category Limitation**: It only classifies into the three predefined categories
3. **Prompt Sensitivity**: Performance may vary with different prompt formats
4. **Edge Cases**: Some edge cases (such as mathematical questions) may be misclassified
5. **Language**: Primarily trained on English text

## 🔮 Future Improvements

1. **Larger Training Dataset**: Use real VoiceBench data with proper audio transcription
2. **More Categories**: Expand to include additional text types
3. **Multilingual Support**: Train on multiple languages
4. **Confidence Calibration**: Improve confidence scoring
5. **Few-shot Learning**: Add support for few-shot classification

## 📚 Citation

```bibtex
@misc{qwen2.5-0.5b-text-classification,
  title={Qwen2.5-0.5B Text Classification Model for VoiceBench-style Evaluation},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/manbeast3b/qwen2.5-0.5b-text-classification}},
  note={Fine-tuned using LoRA on synthetic text classification data}
}
```

## 🤝 Contributing

Contributions are welcome! Please feel free to:

- Report issues with the model
- Suggest improvements
- Submit pull requests
- Share your use cases

## 📄 License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.

---

**Model Performance Summary:**

- ✅ **93.33% Overall Accuracy**
- ✅ **100% ifeval accuracy** (instruction-following)
- ✅ **100% wildvoice accuracy** (conversational)
- ✅ **80% commoneval accuracy** (factual questions)
- ✅ **Efficient LoRA fine-tuning** (0.88% trainable parameters)
- ✅ **Fast inference** with small model size
- ✅ **Easy to use** with simple API

*This model represents a successful application of LoRA fine-tuning for text classification, achieving high accuracy with minimal computational resources.*