# Turnlet BERT Multilingual - End-of-Utterance Detection

A lightweight, multilingual DistilBERT model fine-tuned for End-of-Utterance (EOU) detection in conversational AI systems. The model supports **English, Hindi, and Spanish**, reaching 96.4% overall validation accuracy with fast CPU inference.

## Model Description

- **Architecture**: DistilBERT (6 layers, 768 hidden dimensions)
- **Parameters**: ~134M (multilingual DistilBERT base; consistent with the 517 MB FP32 checkpoint)
- **Languages**: English, Hindi, Spanish
- **Task**: Binary sequence classification (EOU vs Non-EOU)
- **Training**: Knowledge distillation from a teacher model
- **Model Size**: 
  - PyTorch (safetensors): 517 MB
  - ONNX (optimized FP32): 517 MB
  - ONNX (quantized INT8): 132 MB (74% size reduction)

## Performance Metrics

### Validation Set Performance (Step 60500)

| Language | Accuracy | Samples |
|----------|----------|---------|
| **English** | 97.01% | 16,258 |
| **Hindi** | 96.89% | 12,103 |
| **Spanish** | 94.52% | 7,963 |
| **Overall** | 96.43% | 36,324 |

**Validation Metrics:**
- F1 Score: 0.9635
- Precision: 0.9491
- Recall: 0.9783

### TURNS-2K Benchmark

- **Accuracy**: 91.10%
- **F1 Score**: 0.9150
- **Precision**: 0.9796
- **Recall**: 0.8584
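
As a quick sanity check, both reported F1 scores are the harmonic mean of the listed precision and recall:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9491, 0.9783), 4))  # 0.9635 (validation set)
print(round(f1(0.9796, 0.8584), 4))  # 0.915  (TURNS-2K)
```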

## Model Variants

This repository includes three model formats:

1. **PyTorch (safetensors)**: `model.safetensors` - Full precision PyTorch model
2. **ONNX Optimized (FP32)**: `bert_model_optimized.onnx` - Optimized for inference, full precision
3. **ONNX Quantized (INT8)**: `bert_model_optimized_dynamic_int8.onnx` - **Recommended** for production

### Why Use the Quantized INT8 Model?

- ✅ **74% smaller** (132 MB vs 517 MB)
- ✅ **Faster inference** on CPU
- ✅ **Minimal accuracy loss** (<0.5%)
- ✅ **Lower memory footprint**
- ✅ **Better for deployment**

## Quick Start

### Interactive Demo (Easiest Way)

```bash
# Clone the model repository
git clone https://huggingface.co/your-username/turnlet-bert-multilingual-eou
cd turnlet-bert-multilingual-eou

# Install dependencies
pip install -r requirements.txt

# Run interactive mode (default - uses fast ONNX INT8)
python inference_example.py

# Or explicitly use interactive mode
python inference_example.py --interactive

# Use PyTorch instead of ONNX
python inference_example.py --interactive --pytorch

# Adjust threshold
python inference_example.py --interactive --threshold 0.9
```

The interactive mode allows you to:
- 🎮 Type text and get instant EOU predictions
- 🌍 Test in English, Hindi, or Spanish
- 📊 See confidence scores and inference times
- 📈 View visual confidence bars
- 💡 Type 'examples' to see sample inputs
- 🚪 Type 'quit' or 'exit' to stop

### One-off Prediction

```bash
# Single prediction with ONNX (fast)
python inference_example.py --text "Thanks for your help!"

# Test suite with multiple examples
python inference_example.py --test-suite
```

### Using PyTorch (in Python)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("your-username/turnlet-bert-multilingual-eou")
tokenizer = AutoTokenizer.from_pretrained("your-username/turnlet-bert-multilingual-eou")

# Predict
text = "Thanks for your help!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
is_eou = probs[0][1] > 0.5  # default threshold; the tuned optimum for this model is 0.86

print(f"EOU Probability: {probs[0][1]:.3f}")
print(f"Is EOU: {is_eou}")
```

### Using ONNX (Quantized INT8) - Recommended for Production

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/turnlet-bert-multilingual-eou")

# Create ONNX session
session = ort.InferenceSession("bert_model_optimized_dynamic_int8.onnx")

# Tokenize
text = "Thanks for your help!"
inputs = tokenizer(text, padding="max_length", max_length=128, truncation=True, return_tensors="np")

# Prepare ONNX inputs
ort_inputs = {
    'input_ids': inputs['input_ids'].astype(np.int64),
    'attention_mask': inputs['attention_mask'].astype(np.int64)
}

# Run inference
outputs = session.run(None, ort_inputs)
logits = outputs[0][0]

# Calculate probability (numerically stable softmax)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
is_eou = probs[1] > 0.5  # default threshold; the tuned optimum for this model is 0.86

print(f"EOU Probability: {probs[1]:.3f}")
print(f"Is EOU: {is_eou}")
```

## Use Cases

This model is designed for:

- 🗣️ **Voice Assistants**: Detect when the user has finished speaking
- 💬 **Chatbots**: Identify complete user intents
- 📞 **Call Centers**: Segment customer utterances in real time
- 🌍 **Multilingual Applications**: Support English, Hindi, and Spanish speakers
- ⚡ **Real-time Systems**: Fast inference with the quantized model
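
In a real-time voice pipeline, the EOU probability is typically combined with a silence timeout as a fallback. A minimal sketch (the `eou_probability` callback, threshold default, and timing values are illustrative, not part of this repository):

```python
def should_end_turn(transcript, silence_ms, eou_probability,
                    threshold=0.86, max_silence_ms=800.0):
    """End the turn if the model is confident, or fall back to a silence timeout.

    `eou_probability` is any callable mapping text -> probability, e.g. a
    wrapper around the ONNX session shown above.
    """
    if eou_probability(transcript) >= threshold:
        return True                        # model says the utterance is complete
    return silence_ms >= max_silence_ms    # safety net: a long pause ends the turn

# Illustration with a stubbed model:
def fake_model(text):
    return 0.95 if text.endswith(("!", "?", ".")) else 0.20

print(should_end_turn("Thanks for your help!", silence_ms=120, eou_probability=fake_model))  # True
print(should_end_turn("So what I wanted to", silence_ms=120, eou_probability=fake_model))    # False
```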

## Training Details

### Training Data

The model was trained using knowledge distillation on a multilingual dataset:

- **English**: 76,258 samples
- **Hindi**: 75,103 samples  
- **Spanish**: 75,963 samples
- **Total**: ~227K samples

### Training Configuration

- **Base Model**: DistilBERT multilingual
- **Method**: Knowledge distillation from Qwen-based teacher model
- **Epochs**: 8
- **Final Step**: 60,500
- **Optimization**: AdamW optimizer
- **Max Sequence Length**: 128 tokens

### Distillation Process

The model was created using sparse Mixture-of-Experts (MoE) based knowledge distillation:
1. Teacher model (Qwen-based) provides soft labels
2. Student model (DistilBERT) learns to mimic teacher predictions
3. Multi-stage training with progressive difficulty
4. Language-specific accuracy monitoring
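
The soft-label objective in step 2 is typically a temperature-scaled KL divergence mixed with the hard-label loss; a minimal NumPy sketch (the temperature and mixing weight below are illustrative, not the values used to train this model):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=2.0, alpha=0.5):
    """alpha * KL(teacher || student) at temperature T + (1 - alpha) * hard CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T  # KL, scaled by T^2
    hard = -np.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard

print(distillation_loss([0.3, 1.9], [0.1, 2.4], hard_label=1))
```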

## Evaluation

The model was evaluated on:

1. **Validation Set**: Balanced multilingual dataset
2. **TURNS-2K**: Standard benchmark for turn-taking detection
3. **Per-Language Metrics**: Individual language performance tracking

### Inference Speed

Approximate inference times (CPU, single sample):
- ONNX Optimized: ~70-120ms
- ONNX Quantized INT8: ~40-50ms

*Note: Actual speeds vary by hardware*
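
To reproduce numbers like these on your own hardware, a simple median-latency helper is enough (a sketch; pass it any predict callable, e.g. one wrapping the ONNX session above):

```python
import time

def median_latency_ms(predict, arg, warmup=5, repeats=50):
    """Median wall-clock latency of predict(arg) in milliseconds."""
    for _ in range(warmup):               # let caches settle before timing
        predict(arg)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        predict(arg)
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]

# Example with a trivial stand-in for a model call:
print(f"{median_latency_ms(len, 'Thanks for your help!'):.3f} ms")
```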

## Limitations

- Model performance is slightly lower on Spanish compared to English and Hindi
- Optimal threshold (0.86) may need adjustment for specific use cases
- Maximum sequence length is 128 tokens (longer texts will be truncated)
- Best performance on conversational, task-oriented dialogue
- May require fine-tuning for domain-specific applications
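
If you re-tune the threshold for your domain, a simple grid sweep over held-out scores works (a sketch; the toy labels and scores below are illustrative only, real tuning should use your own validation set):

```python
def best_threshold(labels, scores, grid=None):
    """Pick the decision threshold that maximizes F1 on held-out data."""
    grid = grid or [i / 100 for i in range(1, 100)]
    def f1_at(t):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(grid, key=f1_at)

labels = [1, 1, 1, 0, 0, 0]                      # toy EOU ground truth
scores = [0.97, 0.91, 0.88, 0.84, 0.40, 0.10]    # toy model probabilities
print(best_threshold(labels, scores))
```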

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{turnlet-bert-multilingual-eou,
  title={Turnlet BERT Multilingual: End-of-Utterance Detection},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  note={Knowledge-distilled DistilBERT for multilingual EOU detection}
}
```

## License

Please specify your license here (e.g., Apache 2.0, MIT, etc.)

## Model Card Contact

For questions or feedback, please open an issue in the repository.

---

**Model Version**: Step 60500  
**Last Updated**: November 2024  
**Framework**: PyTorch, ONNX Runtime  
**Languages**: English (en), Hindi (hi), Spanish (es)