# TinyWave Interleaved Expressive 2B
**TinyWave Interleaved Expressive 2B** is a compact, expressive speech-to-speech and speech-text language model distilled from the 7B SPIRIT-LM teacher. It supports **interleaved audio and text inputs** and is trained on 50k hours of public data using a multi-level layer-aligned distillation framework.
Despite being 3× smaller than its teacher, the model retains **93–97%** of its accuracy on expressive benchmarks like StoryCloze and SALMon, and outperforms size-matched baselines. This model is ideal for **real-time multimodal agents**, **spoken dialogue systems**, and **low-resource deployment**.
> For more information, see the [TinyWave paper (arXiv:2506.23670)](https://arxiv.org/abs/2506.23670) and [project website](https://mohammadmahdinoori.github.io/tinywave-landing/).
## Usage
This model accepts interleaved speech and text inputs. It expects inputs to be encoded using SPIRIT-LM's **expressive speech tokenizer**.
### 1. Clone SPIRIT-LM and Install Requirements
```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```
---
### 2. Load Tokenizer
```python
from spiritlm.speech_tokenizer import spiritlm_expressive
speech_tokenizer = spiritlm_expressive()
```
---
### 3. Inference Code
```python
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_expressive
import torchaudio
import torch

# Load model and text tokenizer
MODEL_PATH = "tinywave/interleaved-expressive-2b-v3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Speech tokenizer (audio -> string of discrete units)
speech_tokenizer = spiritlm_expressive()

# Sampling settings shared by both entry points
GENERATION_KWARGS = dict(max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)

# Speech prompt
def get_inference(input_audio_path):
    audio, sample_rate = torchaudio.load(input_audio_path)
    # Assumption: the tokenizer expects 16 kHz mono audio, so downmix and resample if needed
    audio = audio.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        audio = torchaudio.functional.resample(audio, sample_rate, 16000)
    input_values = audio.reshape(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    string_tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(string_tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, **GENERATION_KWARGS)
    return tokenizer.decode(output[0])

# Text prompt: the trailing "[Speech]" marker asks the model to continue in speech
def get_inference_text(prompt):
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, **GENERATION_KWARGS)
    return tokenizer.decode(output[0])
```
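
The string returned by `get_inference` or `get_inference_text` can mix prompt text, special markers, and speech units. The sketch below keeps only the speech units so they can be inspected or passed on for decoding. It assumes the units are spelled `[Hu<id>]`, `[Pi<id>]`, and `[St<id>]` (HuBERT, pitch, style); that spelling and the helper name `extract_speech_units` are assumptions, so check them against the tokenizer vocabulary before relying on this.

```python
import re

# Assumed unit spelling: [Hu<id>] (HuBERT), [Pi<id>] (pitch), [St<id>] (style).
UNIT_PATTERN = re.compile(r"\[(?:Hu|Pi|St)\d+\]")


def extract_speech_units(generated: str) -> str:
    """Return the speech-unit tokens of a decoded generation, in order.

    Everything else (plain text, spaces, <s>, </s>, [Speech]) is dropped.
    """
    return "".join(UNIT_PATTERN.findall(generated))


if __name__ == "__main__":
    sample = "<s> Once upon a time [Speech][St3][Pi20][Hu99] [Hu12]</s>"
    print(extract_speech_units(sample))
```

An empty result suggests the model produced no speech units, which is worth checking before attempting to synthesize audio.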
---
### 4. Decoding to WAV (optional)
```python
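import numpy as np
from scipy.io.wavfile import write

def save_array_to_wav_int16(audio_array: np.ndarray, sampling_rate=16000, filename="output.wav"):
    # Peak-normalize to the int16 range; the floor avoids dividing by zero on silent audio
    peak = max(np.max(np.abs(audio_array)), 1e-9)
    scaled = np.int16(audio_array / peak * 32767)
    write(filename, sampling_rate, scaled)

# output_text is the string returned by get_inference or get_inference_text (step 3)
output_text = get_inference("speech.wav")
# Strip spaces and BOS/EOS markers before decoding the units to a waveform
decoded_audio = speech_tokenizer.decode(output_text.replace(" ", "").replace("<s>", "").replace("</s>", ""), speaker_id=2)
save_array_to_wav_int16(decoded_audio, filename="generated.wav")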
```
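
If SciPy is not available, the same peak normalization can be sketched with Python's standard-library `wave` module. The function name `write_wav_int16` is hypothetical, and the sketch assumes a one-dimensional sequence of float samples (for example `decoded_audio.tolist()`, if `decoded_audio` is a 1-D NumPy array). Unlike the `np.int16` cast, it rounds to the nearest integer instead of truncating.

```python
import struct
import wave


def write_wav_int16(samples, filename="output.wav", sampling_rate=16000):
    """Peak-normalize float samples and write them as 16-bit mono PCM.

    Returns the list of integer sample values that were written.
    """
    samples = list(samples)
    peak = max((abs(s) for s in samples), default=0.0)
    # Silent (all-zero) input stays silent instead of dividing by zero.
    scale = 32767.0 / peak if peak > 0 else 0.0
    pcm = [int(round(s * scale)) for s in samples]
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes per sample = 16 bits
        wav_file.setframerate(sampling_rate)
        # WAV stores samples as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return pcm
```

The default of 16000 Hz mirrors the default in `save_array_to_wav_int16`; pass a different `sampling_rate` if your decoder produces audio at another rate.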
---
## Inference Examples
### Speech Continuation
Input: `speech.wav` (spoken sentence)
Output: Expressive speech continuation in the same style and tone.
---
### Mixed Input: Text → Speech
Prompt:
```
"Once upon a time in a small village, a mysterious sound echoed through the forest. [Speech]"
```
Output: Expressive spoken continuation in WAV format.
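
A text prompt only switches to speech when it ends with the `[Speech]` marker, as in the prompt above. A small helper can add the marker exactly once; `build_speech_prompt` is a hypothetical name used for illustration, not part of the released code.

```python
SPEECH_MARKER = "[Speech]"


def build_speech_prompt(text: str) -> str:
    """Return the text prompt followed by a single trailing speech marker."""
    text = text.strip()
    if text.endswith(SPEECH_MARKER):
        # Already marked: avoid appending the marker twice.
        return text
    return f"{text} {SPEECH_MARKER}"


if __name__ == "__main__":
    print(build_speech_prompt("Once upon a time in a small village."))
```

If you build prompts this way, pass the result straight to `tokenizer(...)`, since `get_inference_text` in step 3 appends the marker itself.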
---
## Model Details
| Feature | Description |
| ------------------- | ------------------------------------------------- |
| Architecture | 2B parameter distilled transformer |
| Tokenizer | SPIRIT-LM Expressive (HuBERT + pitch/style) |
| Tasks | Speech continuation, mixed speech-text generation |
| Teacher Model | SPIRIT-LM-Expressive 7B |
| Distillation Method | Layer-aligned: hidden states, attention, logits |
| Input Types | Discrete HuBERT tokens and text |
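
For readers unfamiliar with layer-aligned distillation, a generic objective that combines the three signals listed in the table might look like the following. This is an illustrative form only, not the loss from the TinyWave paper: the layer mapping $g$, the squared-error terms, and the weights $\alpha, \beta, \gamma$ are all assumptions.

$$
\mathcal{L} = \alpha \sum_{i} \left\lVert h^{S}_{i} - h^{T}_{g(i)} \right\rVert_2^2
+ \beta \sum_{i} \left\lVert A^{S}_{i} - A^{T}_{g(i)} \right\rVert_2^2
+ \gamma \, \mathrm{KL}\!\left(p^{T} \,\Vert\, p^{S}\right)
$$

Here $h_i$ and $A_i$ are the hidden states and attention maps of layer $i$, $S$ and $T$ denote student and teacher, $g(i)$ maps each student layer to a teacher layer, and $p$ is the next-token distribution. If the student and teacher hidden sizes differ, the hidden-state term would also need a learned projection. See the paper for the method actually used.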
---
## Citation
```bibtex
@article{nouriborji2025tinywave,
title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
journal={arXiv preprint arXiv:2506.23670},
year={2025}
}
```
---
## Resources
* [Project Page](https://mohammadmahdinoori.github.io/tinywave-landing/)
* [Demo Samples](https://mohammadmahdinoori.github.io/tinywave-landing/#samples)
* [Training & Codebase](https://github.com/mohammadmahdinoori/TinyWave)