---
license: apache-2.0
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-classification
- lora
- peft
- ifeval
- commoneval
- wildvoice
- voicebench
- fine-tuned
---

# Qwen2.5-0.5B Text Classification Model

This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using LoRA (Low-Rank Adaptation) for text classification tasks. The model has been specifically trained to classify text into three categories based on VoiceBench dataset patterns.

## 🎯 Model Description

The model has been trained to classify text into three distinct categories:
- **ifeval**: Instruction-following tasks with specific formatting requirements and step-by-step instructions
- **commoneval**: Factual questions and knowledge-based queries requiring direct answers
- **wildvoice**: Conversational, informal language patterns and natural dialogue

## 📊 Performance Results

### Overall Performance
- **Overall Accuracy**: **93.33%** (28/30 correct predictions)
- **Training Method**: LoRA (Low-Rank Adaptation)
- **Trainable Parameters**: 0.88% of total parameters (4,399,104 out of 498,431,872)

### Per-Category Performance
| Category | Accuracy | Correct/Total | Description |
|----------|----------|---------------|-------------|
| **ifeval** | **100%** | 10/10 | Perfect performance on instruction-following tasks |
| **commoneval** | **80%** | 8/10 | Good performance on factual questions |
| **wildvoice** | **100%** | 10/10 | Perfect performance on conversational text |

### Confusion Matrix
```
ifeval:
  -> ifeval: 10
commoneval:
  -> commoneval: 8
  -> unknown: 1
  -> wildvoice: 1
wildvoice:
  -> wildvoice: 10
```
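
The per-category and overall figures can be recomputed from the confusion matrix. The following is a minimal sketch: the `confusion` dictionary transcribes the counts above, and the helper names are illustrative rather than part of the released code.

```python
# Counts transcribed from the confusion matrix above:
# true label -> {predicted label: count}
confusion = {
    "ifeval": {"ifeval": 10},
    "commoneval": {"commoneval": 8, "unknown": 1, "wildvoice": 1},
    "wildvoice": {"wildvoice": 10},
}

def per_category_accuracy(matrix):
    """Fraction of samples of each true label that were predicted correctly."""
    return {
        true: preds.get(true, 0) / sum(preds.values())
        for true, preds in matrix.items()
    }

def overall_accuracy(matrix):
    """Correct predictions divided by all predictions."""
    correct = sum(preds.get(true, 0) for true, preds in matrix.items())
    total = sum(sum(preds.values()) for preds in matrix.values())
    return correct / total

print(per_category_accuracy(confusion))
print(f"{overall_accuracy(confusion):.2%}")  # 28/30 -> 93.33%
```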

## 🔬 Development Journey & Methods Tried

### Initial Challenges
We started with several approaches that didn't work well:

1. **GRPO (Group Relative Policy Optimization)**: Initial attempts with GRPO training performed poorly
   - Loss decreased, but the model did not learn the classification task
   - Model generated irrelevant responses such as "unknown", "txt", and "com"
   - Overall accuracy: ~20%

2. **Full Fine-tuning**: Attempted full fine-tuning of larger models
   - CUDA out of memory issues with larger models
   - Numerical instability with certain model architectures
   - Poor convergence on classification task

3. **Complex Prompt Formats**: Tried various complex prompt structures
   - "Classify this text as ifeval, commoneval, or wildvoice: ..."
   - Model struggled with complex instructions
   - Generated explanations instead of simple labels

### Breakthrough: Direct Classification Approach

The key breakthrough came with a **direct, simple approach**:

#### 1. **Simplified Prompt Format**
Instead of complex classification prompts, we used a simple format:
```
Text: {input_text}
Label: {expected_label}
```
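
As a rough illustration of how this template might be applied, the sketch below renders both a training string and an inference prompt. `format_example` is a hypothetical helper, not part of the released code, and whether an end-of-sequence token was appended during training is not stated here, so none is added.

```python
def format_example(text, label=None):
    """Render the 'Text: ... / Label: ...' template.

    With a label, returns a full training string. Without one, returns the
    inference prompt, which ends right after 'Label:' so the model completes it.
    """
    prompt = f"Text: {text}\nLabel:"
    return prompt if label is None else f"{prompt} {label}"

# Training-style string
print(format_example("What is the capital of France?", "commoneval"))

# Inference-style prompt
print(format_example("Hey, how are you doing today?"))
```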

#### 2. **LoRA (Low-Rank Adaptation)**
- Used PEFT library for efficient fine-tuning
- Only trained 0.88% of parameters
- Much more stable than full fine-tuning
- Faster training; once the adapter is merged, inference cost matches the base model

#### 3. **Focused Training Data**
Created clear, distinct examples for each category:
- **ifeval**: Instruction-following with specific formatting requirements
- **commoneval**: Factual questions requiring direct answers
- **wildvoice**: Conversational, informal language patterns

#### 4. **Final Hyperparameters**
- **Learning Rate**: 5e-4 (higher than initial attempts)
- **Batch Size**: 2 (smaller for stability)
- **Max Length**: 128 (shorter sequences)
- **Training Steps**: 150
- **LoRA Rank**: 8 (focused learning)

## 🚀 Usage

### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")
tokenizer = AutoTokenizer.from_pretrained("manbeast3b/qwen2.5-0.5b-text-classification")

def classify_text(text):
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=15,
            do_sample=True,
            temperature=0.1,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Test examples
print(classify_text("Follow these instructions exactly: Write 3 sentences about cats."))
# Output: ifeval

print(classify_text("What is the capital of France?"))
# Output: commoneval

print(classify_text("Hey, how are you doing today?"))
# Output: wildvoice
```

### Advanced Usage with Confidence Scoring
```python
from collections import Counter

LABELS = ("ifeval", "commoneval", "wildvoice")

def classify_with_confidence(text, num_samples=5):
    # Reuses `model`, `tokenizer`, and `torch` from the basic example above.
    prompt = f"Text: {text}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    predictions = []
    for _ in range(num_samples):
        with torch.no_grad():
            generated = model.generate(
                **inputs,
                max_new_tokens=15,
                do_sample=True,
                temperature=0.3,  # Slightly higher for diversity
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )

        # Decode only the newly generated tokens
        raw = tokenizer.decode(generated[0][prompt_len:], skip_special_tokens=True).lower()

        # Map the raw text to a known label, or 'unknown' if none appears
        predictions.append(next((label for label in LABELS if label in raw), "unknown"))

    # Confidence here is the share of samples that agree with the majority label
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)

# Example with confidence
label, confidence = classify_with_confidence("Please follow these steps: 1) Read 2) Think 3) Write")
print(f"Prediction: {label}, Confidence: {confidence:.2%}")
```

## 📈 Training Details

### Model Architecture
- **Base Model**: Qwen/Qwen2.5-0.5B-Instruct
- **Parameters**: 498,431,872 total, 4,399,104 trainable (0.88%)
- **Precision**: FP16 (mixed precision)
- **Device**: CUDA (GPU accelerated)

### Training Configuration
```python
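from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# LoRA Configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # Rank
    lora_alpha=16,  # LoRA alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

# Training Arguments
# Note: TrainingArguments has no `max_length` parameter. The 128-token limit
# has to be applied when tokenizing the training examples
# (e.g. tokenizer(..., truncation=True, max_length=128)).
training_args = TrainingArguments(
    learning_rate=5e-4,
    per_device_train_batch_size=2,
    max_steps=150,
    fp16=True,
    gradient_accumulation_steps=1,
    warmup_steps=20,
    weight_decay=0.01,
    max_grad_norm=1.0
)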
```
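
The trainable-parameter count reported above can be reproduced from the LoRA configuration. The sketch below is a worked check, not the training code: the layer dimensions are assumptions taken from the published Qwen2.5-0.5B configuration (they are not stated in this card), and `lora_params` is an illustrative helper.

```python
# Assumed dimensions from the published Qwen2.5-0.5B config (not stated in this card)
HIDDEN = 896          # hidden_size
INTERMEDIATE = 4864   # intermediate_size of the MLP
KV_DIM = 128          # num_key_value_heads (2) * head_dim (64)
NUM_LAYERS = 24       # num_hidden_layers
RANK = 8              # LoRA r from the configuration above

# (in_features, out_features) of each targeted linear layer in one decoder block
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
trainable = per_layer * NUM_LAYERS

print(trainable)                         # 4399104
print(f"{trainable / 498_431_872:.2%}")  # 0.88%
```

Under these assumed dimensions the result matches the 4,399,104 trainable parameters listed under Model Architecture.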

### Dataset
The model was trained on synthetic data representing three text categories:
- **60 total samples** (20 per category)
- **ifeval**: Instruction-following tasks with specific formatting requirements
- **commoneval**: Factual questions and knowledge-based queries
- **wildvoice**: Conversational, informal language patterns
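
Combining the dataset size with the training arguments gives a sense of how many passes over the data the run makes. This is a back-of-the-envelope sketch; it assumes a single GPU, which this card does not state explicitly.

```python
import math

num_samples = 60   # training set size stated above
batch_size = 2     # per_device_train_batch_size
grad_accum = 1     # gradient_accumulation_steps
max_steps = 150
num_devices = 1    # assumption: single GPU

samples_per_step = batch_size * grad_accum * num_devices
steps_per_epoch = math.ceil(num_samples / samples_per_step)
epochs = max_steps / steps_per_epoch

print(steps_per_epoch, epochs)  # 30 5.0
```

With these settings, each training example is seen about five times.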

## 🔍 Error Analysis

### Failed Predictions (2 out of 30)
1. **"What is 2 plus 2?"** → Predicted: `unknown` (Expected: `commoneval`)
   - Model generated: `#eval{1} Label: #eval{2} Label: #`
   - Issue: Model generated code-like syntax instead of a simple label, so no known label could be extracted

2. **"What is the opposite of hot?"** → Predicted: `wildvoice` (Expected: `commoneval`)
   - Model generated: `#wildvoice:comoneval:hot:yourresponse:whatis`
   - Issue: Model generated a compound string instead of a simple label; `commoneval` was misspelled as `comoneval`, so only `wildvoice` matched
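
Both failures can be reproduced with a small post-processing step that maps raw generations to labels. The sketch below is illustrative: `extract_label` is a hypothetical helper, and picking the first label that appears in the text is an assumption rather than the exact rule used in the evaluation.

```python
import re

LABELS = ("ifeval", "commoneval", "wildvoice")
_LABEL_RE = re.compile("|".join(LABELS))

def extract_label(raw_output):
    """Return the first known label found in the generated text, else 'unknown'."""
    match = _LABEL_RE.search(raw_output.lower())
    return match.group(0) if match else "unknown"

print(extract_label("#eval{1} Label: #eval{2} Label: #"))             # unknown
print(extract_label("#wildvoice:comoneval:hot:yourresponse:whatis"))  # wildvoice
print(extract_label("commoneval"))                                    # commoneval
```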

### Success Factors
- **Simple prompt format** was crucial for success
- **LoRA fine-tuning** provided stable training
- **Focused training data** with clear category distinctions
- **Appropriate hyperparameters** (learning rate, batch size, etc.)

## 🛠️ Technical Implementation

### File Structure
```
merged_classification_model/
├── README.md                    # This file
├── config.json                  # Model configuration
├── generation_config.json       # Generation settings
├── model.safetensors           # Model weights (988MB)
├── tokenizer.json              # Tokenizer vocabulary
├── tokenizer_config.json       # Tokenizer configuration
├── special_tokens_map.json     # Special tokens mapping
├── added_tokens.json           # Added tokens
├── merges.txt                  # BPE merges
├── vocab.json                  # Vocabulary
└── chat_template.jinja         # Chat template
```

### Dependencies
```bash
# Quote the specifiers so the shell does not treat ">" as output redirection
pip install "transformers>=4.56.0"
pip install "torch>=2.0.0"
pip install "peft>=0.17.0"
pip install "accelerate>=0.21.0"
```

## 🎯 Use Cases

This model is particularly useful for:
- **Text categorization** in educational platforms
- **Content filtering** based on text type
- **Dataset preprocessing** for machine learning pipelines
- **VoiceBench-style evaluation** systems
- **Instruction following detection** in AI systems
- **Conversational vs. factual text separation**

## ⚠️ Limitations

1. **Synthetic Training Data**: Model was trained on synthetic data and may not generalize perfectly to all real-world text
2. **Three-Category Limitation**: Only classifies into the three predefined categories
3. **Prompt Sensitivity**: Performance may vary with different prompt formats
4. **Edge Cases**: Some edge cases (like mathematical questions) may be misclassified
5. **Language**: Primarily trained on English text

## 🔮 Future Improvements

1. **Larger Training Dataset**: Use real VoiceBench data with proper audio transcription
2. **More Categories**: Expand to include additional text types
3. **Multilingual Support**: Train on multiple languages
4. **Confidence Calibration**: Improve confidence scoring
5. **Few-shot Learning**: Add support for few-shot classification

## 📚 Citation

```bibtex
@misc{qwen2.5-0.5b-text-classification,
  title={Qwen2.5-0.5B Text Classification Model for VoiceBench-style Evaluation},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/manbeast3b/qwen2.5-0.5b-text-classification}},
  note={Fine-tuned using LoRA on synthetic text classification data}
}
```

## 🤝 Contributing

Contributions are welcome! Please feel free to:
- Report issues with the model
- Suggest improvements
- Submit pull requests
- Share your use cases

## 📄 License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.

---

**Model Performance Summary:**
- ✅ **93.33% Overall Accuracy**
- ✅ **100% ifeval accuracy** (instruction-following)
- ✅ **100% wildvoice accuracy** (conversational)
- ✅ **80% commoneval accuracy** (factual questions)
- ✅ **Efficient LoRA fine-tuning** (0.88% trainable parameters)
- ✅ **Fast inference** with small model size
- ✅ **Easy to use** with simple API

*This model represents a successful application of LoRA fine-tuning for text classification, achieving high accuracy with minimal computational resources.*