---
tags:
- whisper
- speech
- audio
- litert
- tflite
- edge
- on-device
license: mit
base_model: openai/whisper-tiny
pipeline_tag: automatic-speech-recognition
---

# whisper-tiny - LiteRT

This is a [LiteRT](https://ai.google.dev/edge/litert) (formerly TensorFlow Lite) conversion of the [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) encoder for efficient on-device inference.

## Model Details

| Property | Value |
|----------|-------|
| **Original Model** | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) |
| **Format** | LiteRT (.tflite) |
| **File Size** | 31.4 MB |
| **Task** | Speech Recognition (Encoder Only) |
| **Max Sequence Length** | 3000 mel frames (30 s of audio) |
| **Output Dimension** | 384 |
| **Pooling Mode** | N/A (Encoder output) |

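
The 3000-frame input length and the 384-dimensional output follow from Whisper's fixed audio front end. The sketch below works through that arithmetic. It assumes the standard Whisper settings (16 kHz audio, 30-second windows, a 160-sample hop, 80 mel bins for whisper-tiny, and a stride-2 convolution in the encoder); the constant and function names are illustrative, and the resulting shapes are expectations to check against `interpreter.get_input_details()` rather than values read from this file.

```python
# Assumed Whisper front-end constants (not read from the .tflite file).
SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # fixed window length in seconds
HOP_LENGTH = 160       # samples between successive mel frames
N_MELS = 80            # mel bins used by whisper-tiny
CONV_STRIDE = 2        # the encoder's second convolution halves the time axis
HIDDEN_SIZE = 384      # "Output Dimension" of whisper-tiny


def expected_shapes(batch: int = 1) -> tuple[tuple[int, int, int], tuple[int, int, int]]:
    """Return the expected (input_shape, output_shape) for one 30-second window."""
    n_samples = SAMPLE_RATE * CHUNK_SECONDS      # 480000 samples
    n_frames = n_samples // HOP_LENGTH           # mel frames fed to the encoder
    n_positions = n_frames // CONV_STRIDE        # time steps in the encoder output
    return (batch, N_MELS, n_frames), (batch, n_positions, HIDDEN_SIZE)


input_shape, output_shape = expected_shapes()
print(input_shape, output_shape)
```

If the shapes reported by the interpreter differ (for example, a transposed layout), the converted model's own tensor details take precedence.
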
## Performance

Benchmarked on AMD CPU (WSL2):

| Metric | Value |
|--------|-------|
| **Inference Latency** | 144.7 ms |
| **Throughput** | 6.9 inferences/sec |
| **Cosine Similarity vs Original** | 1.0000 ✅ |

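
The benchmarking procedure behind these numbers is not documented here. As a rough way to take a comparable measurement on other hardware, the sketch below times a zero-argument callable after a few warm-up runs. The `benchmark` helper and its parameters are hypothetical, and the stand-in workload exists only so the example runs without the model file.

```python
import statistics
import time
from typing import Callable


def benchmark(run_once: Callable[[], object], warmup: int = 3, runs: int = 20) -> dict[str, float]:
    """Time a zero-argument callable; return mean latency (ms) and throughput (calls/sec)."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        run_once()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.fmean(timings_ms)
    return {"latency_ms": mean_ms, "throughput_per_sec": 1000.0 / mean_ms}


# Stand-in workload so the sketch runs without the model file.
# With the real model, set the input tensor once and pass `interpreter.invoke`.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{stats['latency_ms']:.3f} ms, {stats['throughput_per_sec']:.1f}/sec")
```

Throughput here is simply the reciprocal of mean latency, which matches the table: 1000 / 144.7 ms is about 6.9 inferences per second.
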
## Quick Start

```python
import numpy as np
from ai_edge_litert.interpreter import Interpreter
from transformers import WhisperProcessor
import librosa

# Load model
interpreter = Interpreter(model_path="openai_whisper-tiny_encoder.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Load processor (computes the log-mel input features)
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

def encode_audio(audio_path: str) -> np.ndarray:
    """Extract encoder features from audio file."""
    # Resample to the 16 kHz rate Whisper expects
    audio, _ = librosa.load(audio_path, sr=16000)
    input_features = processor(audio, sampling_rate=16000, return_tensors="np").input_features

    interpreter.set_tensor(input_details[0]["index"], input_features.astype(np.float32))
    interpreter.invoke()

    return interpreter.get_tensor(output_details[0]["index"])

# Example
# features = encode_audio("audio.wav")
```

**Note**: This file contains only the encoder. Full ASR also requires the Whisper decoder.

## Files

- `openai_whisper-tiny_encoder.tflite` - The LiteRT model file

## Conversion Details

- **Conversion Tool**: [ai-edge-torch](https://github.com/google-ai-edge/ai-edge-torch)
- **Conversion Date**: 2026-01-12
- **Source Framework**: PyTorch → LiteRT
- **Validation**: Cosine similarity 1.0000 vs original

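
The validation figure compares the converted model's output with the original model's output for the same input. The exact validation script is not part of this card; the sketch below only illustrates the metric itself on flattened vectors, using toy values in place of real encoder outputs. The `cosine_similarity` helper is written for this example and uses only the standard library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length flattened vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for flattened encoder outputs.
reference = [0.5, -1.0, 2.0, 0.25]
converted = [0.5, -1.0, 2.0, 0.25]
print(round(cosine_similarity(reference, converted), 4))
```

With NumPy arrays, such as the output of `interpreter.get_tensor(...)`, flatten first, for example with `array.ravel().tolist()`. A value of 1.0000 means the two outputs point in the same direction to four decimal places.
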
## Intended Use

- **Mobile Applications**: On-device speech processing (e.g. the encoder stage of an ASR pipeline)
- **Edge Devices**: IoT, embedded systems, Raspberry Pi
- **Offline Processing**: Privacy-preserving inference
- **Low-latency Applications**: Real-time processing

## Limitations

- Fixed input length (3000 mel frames, i.e. 30 s of audio per inference)
- CPU inference by default (a GPU delegate requires additional setup)
- Processor (feature extractor and tokenizer) loaded separately from the original model
- Float32 precision

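
Because the input length is fixed, recordings longer than 30 seconds have to be split before encoding. The sketch below shows one simple option, consecutive non-overlapping windows, and only computes the sample boundaries. The `chunk_bounds` helper is hypothetical, and the 16 kHz rate and 30-second window are the standard Whisper values assumed throughout this card; overlapping windows or silence-based splitting may work better in practice.

```python
SAMPLE_RATE = 16_000   # Hz, the rate Whisper expects
CHUNK_SECONDS = 30     # fixed encoder window


def chunk_bounds(
    num_samples: int,
    chunk_samples: int = SAMPLE_RATE * CHUNK_SECONDS,
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive non-overlapping windows."""
    if num_samples < 0 or chunk_samples <= 0:
        raise ValueError("num_samples must be >= 0 and chunk_samples must be > 0")
    return [
        (start, min(start + chunk_samples, num_samples))
        for start in range(0, num_samples, chunk_samples)
    ]


# 70 seconds of audio: two full windows and one 10-second remainder.
bounds = chunk_bounds(70 * SAMPLE_RATE)
print(bounds)
```

Each slice `audio[start:end]` can then be passed to a function like `encode_audio` adapted to take an array; the Whisper feature extractor pads clips shorter than 30 seconds by default.
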
## License

This model inherits the license from the original:
- **License**: MIT ([source](https://huggingface.co/openai/whisper-tiny))

## Citation

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Alec Radford and Jong Wook Kim and others},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv},
}
```

## Acknowledgments

- Original model by [openai](https://huggingface.co/openai)
- Conversion using [ai-edge-torch](https://github.com/google-ai-edge/ai-edge-torch)

---

*Converted by [Bombek1](https://huggingface.co/Bombek1)*