# Turnlet BERT Multilingual - End-of-Utterance Detection
A lightweight, multilingual DistilBERT model fine-tuned for End-of-Utterance (EOU) detection in conversational AI systems. This model supports **English, Hindi, and Spanish** with high accuracy and fast inference.
## Model Description
- **Architecture**: DistilBERT (6 layers, 768 hidden dimensions)
- **Parameters**: ~135M (multilingual DistilBERT base, consistent with the 517 MB FP32 checkpoint)
- **Languages**: English, Hindi, Spanish
- **Task**: Binary sequence classification (EOU vs Non-EOU)
- **Training**: Knowledge distillation from teacher model
- **Model Size**:
  - PyTorch (safetensors): 517 MB
  - ONNX (optimized FP32): 517 MB
  - ONNX (quantized INT8): 132 MB (74% size reduction)
## Performance Metrics
### Validation Set Performance (Step 60500)
| Language | Accuracy | Samples |
|----------|----------|---------|
| **English** | 97.01% | 16,258 |
| **Hindi** | 96.89% | 12,103 |
| **Spanish** | 94.52% | 7,963 |
| **Overall** | 96.43% | 36,324 |
**Validation Metrics:**
- F1 Score: 0.9635
- Precision: 0.9491
- Recall: 0.9783
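For reference, the overall figure above is the sample-weighted mean of the per-language accuracies; this can be checked in a few lines:

```python
# Sanity check: overall accuracy = sample-weighted mean of per-language accuracies
langs = {
    "English": (0.9701, 16_258),
    "Hindi":   (0.9689, 12_103),
    "Spanish": (0.9452, 7_963),
}

total = sum(n for _, n in langs.values())
weighted = sum(acc * n for acc, n in langs.values()) / total
print(f"Overall accuracy: {weighted:.2%}")  # ~96.4%, matching the table
```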
### TURNS-2K Benchmark
- **Accuracy**: 91.10%
- **F1 Score**: 0.9150
- **Precision**: 0.9796
- **Recall**: 0.8584
## Model Variants
This repository includes three model formats:
1. **PyTorch (safetensors)**: `model.safetensors` - Full precision PyTorch model
2. **ONNX Optimized (FP32)**: `bert_model_optimized.onnx` - Optimized for inference, full precision
3. **ONNX Quantized (INT8)**: `bert_model_optimized_dynamic_int8.onnx` - **Recommended** for production
### Why Use the Quantized INT8 Model?
- ✅ **74% smaller** (132 MB vs 517 MB)
- ✅ **Faster inference** on CPU
- ✅ **Minimal accuracy loss** (<0.5%)
- ✅ **Lower memory footprint**
- ✅ **Better for deployment**
## Quick Start
### Interactive Demo (Easiest Way)
```bash
# Clone the model repository
git clone https://huggingface.co/your-username/turnlet-bert-multilingual-eou
cd turnlet-bert-multilingual-eou
# Install dependencies
pip install -r requirements.txt
# Run interactive mode (default - uses fast ONNX INT8)
python inference_example.py
# Or explicitly use interactive mode
python inference_example.py --interactive
# Use PyTorch instead of ONNX
python inference_example.py --interactive --pytorch
# Adjust threshold
python inference_example.py --interactive --threshold 0.9
```
The interactive mode allows you to:
- 🎮 Type text and get instant EOU predictions
- 🌐 Test in English, Hindi, or Spanish
- 📊 See confidence scores and inference times
- 📈 View visual confidence bars
- 💡 Type 'examples' to see sample inputs
- 🚪 Type 'quit' or 'exit' to stop
### One-off Prediction
```bash
# Single prediction with ONNX (fast)
python inference_example.py --text "Thanks for your help!"
# Test suite with multiple examples
python inference_example.py --test-suite
```
### Using PyTorch (in Python)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("your-username/turnlet-bert-multilingual-eou")
tokenizer = AutoTokenizer.from_pretrained("your-username/turnlet-bert-multilingual-eou")
# Predict
text = "Thanks for your help!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
is_eou = probs[0][1] > 0.5  # default 0.5 threshold; the tuned threshold for this model is 0.86
print(f"EOU Probability: {probs[0][1]:.3f}")
print(f"Is EOU: {is_eou}")
```
### Using ONNX (Quantized INT8) - Recommended for Production
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/turnlet-bert-multilingual-eou")
# Create ONNX session
session = ort.InferenceSession("bert_model_optimized_dynamic_int8.onnx")
# Tokenize
text = "Thanks for your help!"
inputs = tokenizer(text, padding="max_length", max_length=128, truncation=True, return_tensors="np")
# Prepare ONNX inputs
ort_inputs = {
'input_ids': inputs['input_ids'].astype(np.int64),
'attention_mask': inputs['attention_mask'].astype(np.int64)
}
# Run inference
outputs = session.run(None, ort_inputs)
logits = outputs[0][0]
# Softmax over the two logits (shift by the max for numerical stability)
shifted = logits - logits.max()
probs = np.exp(shifted) / np.sum(np.exp(shifted))
is_eou = probs[1] > 0.5  # default 0.5 threshold; the tuned threshold for this model is 0.86
print(f"EOU Probability: {probs[1]:.3f}")
print(f"Is EOU: {is_eou}")
```
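Both snippets end with the same post-processing: a softmax over the two logits, then a threshold on the EOU probability. A minimal helper that factors this out, with the threshold exposed so the tuned value of 0.86 can be swapped in (`eou_decision` is an illustrative name, not part of the repository):

```python
import numpy as np

def eou_decision(logits: np.ndarray, threshold: float = 0.5):
    """Turn a [non-EOU, EOU] logit pair into (probability, decision)."""
    z = logits - logits.max()              # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()    # softmax over the two classes
    p_eou = float(probs[1])
    return p_eou, p_eou > threshold

# Example with made-up logits (not real model output):
p, is_eou = eou_decision(np.array([-1.2, 2.3]), threshold=0.86)
print(f"EOU probability: {p:.3f}, decision: {is_eou}")
```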
## Use Cases
This model is designed for:
- 🗣️ **Voice Assistants**: Detect when the user has finished speaking
- 💬 **Chatbots**: Identify complete user intents
- 📞 **Call Centers**: Segment customer utterances in real time
- 🌐 **Multilingual Applications**: Support English, Hindi, and Spanish speakers
- ⚡ **Real-time Systems**: Fast inference with the quantized model
## Training Details
### Training Data
The model was trained using knowledge distillation on a multilingual dataset:
- **English**: 76,258 samples
- **Hindi**: 75,103 samples
- **Spanish**: 75,963 samples
- **Total**: ~227K samples
### Training Configuration
- **Base Model**: DistilBERT multilingual
- **Method**: Knowledge distillation from Qwen-based teacher model
- **Epochs**: 8
- **Final Step**: 60,500
- **Optimization**: AdamW optimizer
- **Max Sequence Length**: 128 tokens
### Distillation Process
The model was created using sparse Mixture-of-Experts (MoE) based knowledge distillation:
1. Teacher model (Qwen-based) provides soft labels
2. Student model (DistilBERT) learns to mimic teacher predictions
3. Multi-stage training with progressive difficulty
4. Language-specific accuracy monitoring
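The steps above follow the standard distillation objective: cross-entropy on hard labels blended with a temperature-scaled KL divergence against the teacher's soft labels. A minimal sketch, where the temperature `T=2.0` and mixing weight `alpha=0.5` are illustrative assumptions and the actual recipe (MoE teacher, progressive stages) is not reproduced:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=2.0, alpha=0.5):
    """alpha * CE(student, hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    ce = -np.log(softmax(student_logits)[hard_label])  # hard-label cross-entropy
    pt = softmax(teacher_logits, T)                    # teacher soft labels
    ps = softmax(student_logits, T)                    # student soft predictions
    kl = np.sum(pt * (np.log(pt) - np.log(ps)))        # KL(teacher || student)
    return alpha * ce + (1 - alpha) * T**2 * kl
```

The `T**2` factor keeps the gradient magnitude of the soft term comparable across temperatures, as in standard knowledge-distillation formulations.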
## Evaluation
The model was evaluated on:
1. **Validation Set**: Balanced multilingual dataset
2. **TURNS-2K**: Standard benchmark for turn-taking detection
3. **Per-Language Metrics**: Individual language performance tracking
### Inference Speed
Approximate inference times (CPU, single sample):
- ONNX Optimized: ~70-120ms
- ONNX Quantized INT8: ~40-50ms
*Note: Actual speeds vary by hardware*
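To reproduce rough timings on your own hardware, a small harness like the following can wrap any of the inference calls above (`time_inference` is a hypothetical helper, not part of the repository):

```python
import time
import statistics

def time_inference(fn, warmup=5, runs=50):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):                 # let caches and allocators settle
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Usage (assumes `session` and `ort_inputs` from the ONNX snippet above):
# ms = time_inference(lambda: session.run(None, ort_inputs))
# print(f"median latency: {ms:.1f} ms")
```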
## Limitations
- Model performance is slightly lower on Spanish compared to English and Hindi
- Optimal threshold (0.86) may need adjustment for specific use cases
- Maximum sequence length is 128 tokens (longer texts will be truncated)
- Best performance on conversational, task-oriented dialogue
- May require fine-tuning for domain-specific applications
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{turnlet-bert-multilingual-eou,
title={Turnlet BERT Multilingual: End-of-Utterance Detection},
author={Your Name},
year={2024},
publisher={Hugging Face},
note={Knowledge-distilled DistilBERT for multilingual EOU detection}
}
```
## License
Please specify your license here (e.g., Apache 2.0, MIT, etc.)
## Model Card Contact
For questions or feedback, please open an issue in the repository.
---
**Model Version**: Step 60500
**Last Updated**: November 2024
**Framework**: PyTorch, ONNX Runtime
**Languages**: English (en), Hindi (hi), Spanish (es)