# Turnlet BERT Multilingual - End-of-Utterance Detection
A lightweight, multilingual DistilBERT model fine-tuned for End-of-Utterance (EOU) detection in conversational AI systems. This model supports **English, Hindi, and Spanish** with high accuracy and fast inference.
## Model Description
- **Architecture**: DistilBERT (6 layers, 768 hidden dimensions)
- **Parameters**: ~135M (multilingual DistilBERT base, consistent with the 517 MB FP32 checkpoint)
- **Languages**: English, Hindi, Spanish
- **Task**: Binary sequence classification (EOU vs Non-EOU)
- **Training**: Knowledge distillation from teacher model
- **Model Size**:
  - PyTorch (safetensors): 517 MB
  - ONNX (optimized FP32): 517 MB
  - ONNX (quantized INT8): 132 MB (74% size reduction)
## Performance Metrics
### Validation Set Performance (Step 60500)
| Language | Accuracy | Samples |
|----------|----------|---------|
| **English** | 97.01% | 16,258 |
| **Hindi** | 96.89% | 12,103 |
| **Spanish** | 94.52% | 7,963 |
| **Overall** | 96.43% | 36,324 |
**Validation Metrics:**
- F1 Score: 0.9635
- Precision: 0.9491
- Recall: 0.9783
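For reference, the overall figure above is the sample-weighted mean of the per-language accuracies; this can be checked in a few lines:

```python
# Sanity check: overall accuracy = sample-weighted mean of per-language accuracies
langs = {
    "English": (0.9701, 16_258),
    "Hindi":   (0.9689, 12_103),
    "Spanish": (0.9452, 7_963),
}

total = sum(n for _, n in langs.values())
weighted = sum(acc * n for acc, n in langs.values()) / total
print(f"Overall accuracy: {weighted:.2%}")  # ~96.4%, matching the table
```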
### TURNS-2K Benchmark
- **Accuracy**: 91.10%
- **F1 Score**: 0.9150
- **Precision**: 0.9796
- **Recall**: 0.8584
## Model Variants
This repository includes three model formats:
1. **PyTorch (safetensors)**: `model.safetensors` - Full precision PyTorch model
2. **ONNX Optimized (FP32)**: `bert_model_optimized.onnx` - Optimized for inference, full precision
3. **ONNX Quantized (INT8)**: `bert_model_optimized_dynamic_int8.onnx` - **Recommended** for production
### Why Use the Quantized INT8 Model?
- ✅ **74% smaller** (132 MB vs 517 MB)
- ✅ **Faster inference** on CPU
- ✅ **Minimal accuracy loss** (<0.5%)
- ✅ **Lower memory footprint**
- ✅ **Better for deployment**
## Quick Start
### Interactive Demo (Easiest Way)
```bash
# Clone the model repository
git clone https://huggingface.co/your-username/turnlet-bert-multilingual-eou
cd turnlet-bert-multilingual-eou
# Install dependencies
pip install -r requirements.txt
# Run interactive mode (default - uses fast ONNX INT8)
python inference_example.py
# Or explicitly use interactive mode
python inference_example.py --interactive
# Use PyTorch instead of ONNX
python inference_example.py --interactive --pytorch
# Adjust threshold
python inference_example.py --interactive --threshold 0.9
```
The interactive mode allows you to:
- 🎮 Type text and get instant EOU predictions
- 🌐 Test in English, Hindi, or Spanish
- 📊 See confidence scores and inference times
- 📈 View visual confidence bars
- 💡 Type 'examples' to see sample inputs
- 🚪 Type 'quit' or 'exit' to stop
### One-off Prediction
```bash
# Single prediction with ONNX (fast)
python inference_example.py --text "Thanks for your help!"
# Test suite with multiple examples
python inference_example.py --test-suite
```
### Using PyTorch (in Python)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("your-username/turnlet-bert-multilingual-eou")
tokenizer = AutoTokenizer.from_pretrained("your-username/turnlet-bert-multilingual-eou")
# Predict
text = "Thanks for your help!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
is_eou = probs[0][1] > 0.5  # default 0.5 threshold; the tuned threshold for this model is 0.86
print(f"EOU Probability: {probs[0][1]:.3f}")
print(f"Is EOU: {is_eou}")
```
### Using ONNX (Quantized INT8) - Recommended for Production
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/turnlet-bert-multilingual-eou")
# Create ONNX session
session = ort.InferenceSession("bert_model_optimized_dynamic_int8.onnx")
# Tokenize
text = "Thanks for your help!"
inputs = tokenizer(text, padding="max_length", max_length=128, truncation=True, return_tensors="np")
# Prepare ONNX inputs
ort_inputs = {
'input_ids': inputs['input_ids'].astype(np.int64),
'attention_mask': inputs['attention_mask'].astype(np.int64)
}
# Run inference
outputs = session.run(None, ort_inputs)
logits = outputs[0][0]
# Softmax over the two logits (shift by the max for numerical stability)
shifted = logits - logits.max()
probs = np.exp(shifted) / np.sum(np.exp(shifted))
is_eou = probs[1] > 0.5  # default 0.5 threshold; the tuned threshold for this model is 0.86
print(f"EOU Probability: {probs[1]:.3f}")
print(f"Is EOU: {is_eou}")
```
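Both snippets end with the same post-processing: a softmax over the two logits, then a threshold on the EOU probability. A minimal helper that factors this out, with the threshold exposed so the tuned value of 0.86 can be swapped in (`eou_decision` is an illustrative name, not part of the repository):

```python
import numpy as np

def eou_decision(logits: np.ndarray, threshold: float = 0.5):
    """Turn a [non-EOU, EOU] logit pair into (probability, decision)."""
    z = logits - logits.max()              # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()    # softmax over the two classes
    p_eou = float(probs[1])
    return p_eou, p_eou > threshold

# Example with made-up logits (not real model output):
p, is_eou = eou_decision(np.array([-1.2, 2.3]), threshold=0.86)
print(f"EOU probability: {p:.3f}, decision: {is_eou}")
```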
## Use Cases
This model is designed for:
- 🗣️ **Voice Assistants**: Detect when the user has finished speaking
- 💬 **Chatbots**: Identify complete user intents
- 📞 **Call Centers**: Segment customer utterances in real time
- 🌐 **Multilingual Applications**: Support English, Hindi, and Spanish speakers
- ⚡ **Real-time Systems**: Fast inference with the quantized model
## Training Details
### Training Data
The model was trained using knowledge distillation on a multilingual dataset:
- **English**: 76,258 samples
- **Hindi**: 75,103 samples
- **Spanish**: 75,963 samples
- **Total**: ~227K samples
### Training Configuration
- **Base Model**: DistilBERT multilingual
- **Method**: Knowledge distillation from Qwen-based teacher model
- **Epochs**: 8
- **Final Step**: 60,500
- **Optimization**: AdamW optimizer
- **Max Sequence Length**: 128 tokens
### Distillation Process
The model was created using sparse Mixture-of-Experts (MoE) based knowledge distillation:
1. Teacher model (Qwen-based) provides soft labels
2. Student model (DistilBERT) learns to mimic teacher predictions
3. Multi-stage training with progressive difficulty
4. Language-specific accuracy monitoring
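The steps above follow the standard distillation objective: cross-entropy on hard labels blended with a temperature-scaled KL divergence against the teacher's soft labels. A minimal sketch, where the temperature `T=2.0` and mixing weight `alpha=0.5` are illustrative assumptions and the actual recipe (MoE teacher, progressive stages) is not reproduced:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=2.0, alpha=0.5):
    """alpha * CE(student, hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    ce = -np.log(softmax(student_logits)[hard_label])  # hard-label cross-entropy
    pt = softmax(teacher_logits, T)                    # teacher soft labels
    ps = softmax(student_logits, T)                    # student soft predictions
    kl = np.sum(pt * (np.log(pt) - np.log(ps)))        # KL(teacher || student)
    return alpha * ce + (1 - alpha) * T**2 * kl
```

The `T**2` factor keeps the gradient magnitude of the soft term comparable across temperatures, as in standard knowledge-distillation formulations.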
## Evaluation
The model was evaluated on:
1. **Validation Set**: Balanced multilingual dataset
2. **TURNS-2K**: Standard benchmark for turn-taking detection
3. **Per-Language Metrics**: Individual language performance tracking
### Inference Speed
Approximate inference times (CPU, single sample):
- ONNX Optimized: ~70-120ms
- ONNX Quantized INT8: ~40-50ms
*Note: Actual speeds vary by hardware*
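To reproduce rough timings on your own hardware, a small harness like the following can wrap any of the inference calls above (`time_inference` is a hypothetical helper, not part of the repository):

```python
import time
import statistics

def time_inference(fn, warmup=5, runs=50):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):                 # let caches and allocators settle
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Usage (assumes `session` and `ort_inputs` from the ONNX snippet above):
# ms = time_inference(lambda: session.run(None, ort_inputs))
# print(f"median latency: {ms:.1f} ms")
```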
## Limitations
- Model performance is slightly lower on Spanish compared to English and Hindi
- Optimal threshold (0.86) may need adjustment for specific use cases
- Maximum sequence length is 128 tokens (longer texts will be truncated)
- Best performance on conversational, task-oriented dialogue
- May require fine-tuning for domain-specific applications
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{turnlet-bert-multilingual-eou,
title={Turnlet BERT Multilingual: End-of-Utterance Detection},
author={Your Name},
year={2024},
publisher={Hugging Face},
note={Knowledge-distilled DistilBERT for multilingual EOU detection}
}
```
## License
Please specify your license here (e.g., Apache 2.0, MIT, etc.)
## Model Card Contact
For questions or feedback, please open an issue in the repository.
---
**Model Version**: Step 60500
**Last Updated**: November 2024
**Framework**: PyTorch, ONNX Runtime
**Languages**: English (en), Hindi (hi), Spanish (es)