# TinyWave Interleaved Expressive 2B

**TinyWave Interleaved Expressive 2B** is a compact, expressive speech-to-speech and speech-text language model distilled from the 7B SPIRIT-LM teacher. It supports **interleaved audio and text inputs** and is trained on 50k hours of public data using a multi-level layer-aligned distillation framework.

Despite being 3× smaller than its teacher, the model retains **93–97%** of the teacher's accuracy on expressive benchmarks such as StoryCloze and SALMon, and it outperforms size-matched baselines. It is well suited to **real-time multimodal agents**, **spoken dialogue systems**, and **low-resource deployment**.

> 📖 For more information, see the [TinyWave paper (arXiv:2506.23670)](https://arxiv.org/abs/2506.23670) and [project website](https://mohammadmahdinoori.github.io/tinywave-landing/).

## 🔧 Usage

This model accepts interleaved speech and text inputs. It expects inputs to be encoded using SPIRIT-LM’s **expressive speech tokenizer**.

### 1. Clone SPIRIT-LM and Install Requirements


```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```

---

### 2. Load Tokenizer

```python
from spiritlm.speech_tokenizer import spiritlm_expressive
speech_tokenizer = spiritlm_expressive()
```

---

### 3. Inference Code

```python
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_expressive
import torchaudio
import torch

# Load model and tokenizer
MODEL_PATH = "tinywave/interleaved-expressive-2b-v3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Audio + Speech tokenizer
speech_tokenizer = spiritlm_expressive()

# Speech-based prompt: encode a WAV file into speech tokens and continue it
def get_inference(input_audio_path):
    audio, sample_rate = torchaudio.load(input_audio_path)
    # Downmix to mono and resample; assumes the tokenizer expects 16 kHz mono audio
    audio = audio.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        audio = torchaudio.functional.resample(audio, sample_rate, 16000)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    string_tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(string_tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])

# Text-based prompt: the trailing " [Speech]" marker asks the model to continue in speech
def get_inference_text(prompt):
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])
```
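
Both functions return the generated sequence as a single string, so it can help to separate text from speech tokens before going further. The sketch below splits such a string at modality markers. Only `[Speech]` appears in this card; the `[Text]` marker, the `[Hu…]` unit strings in the example, and the helper name `split_by_modality` are assumptions made for illustration, not part of the SPIRIT-LM API.

```python
import re

# "[Speech]" is used in this card; "[Text]" is an assumed counterpart.
MARKER = re.compile(r"\[(Text|Speech)\]")

def split_by_modality(generated: str):
    """Split a generated string into (modality, content) pairs.

    Content that precedes the first marker is labeled "unmarked".
    """
    # Drop BOS/EOS markers, as the decoding step in this card does.
    cleaned = generated.replace("<s>", "").replace("</s>", "")
    segments = []
    modality = "unmarked"
    cursor = 0
    for match in MARKER.finditer(cleaned):
        chunk = cleaned[cursor:match.start()].strip()
        if chunk:
            segments.append((modality, chunk))
        modality = match.group(1).lower()
        cursor = match.end()
    tail = cleaned[cursor:].strip()
    if tail:
        segments.append((modality, tail))
    return segments

# Hypothetical output string, for illustration only.
example = "<s> Once upon a time [Speech][Hu12][Hu7]</s>"
print(split_by_modality(example))
```

Under these assumptions, the content of a `speech` segment (with spaces removed) is what would be handed to `speech_tokenizer.decode`, while text segments can be shown to the user directly.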

---

### 4. Decoding to WAV (optional)

```python
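import numpy as np
from scipy.io.wavfile import write

def save_array_to_wav_int16(audio_array: np.ndarray, sampling_rate=16000, filename="output.wav"):
    # Peak-normalize to the int16 range; skip scaling for all-zero (silent) audio
    peak = np.max(np.abs(audio_array))
    if peak > 0:
        scaled = np.int16(audio_array / peak * 32767)
    else:
        scaled = np.zeros_like(audio_array, dtype=np.int16)
    write(filename, sampling_rate, scaled)

# Generated token string from the inference step ("speech.wav" is a placeholder path)
output_text = get_inference("speech.wav")

# Strip spaces and BOS/EOS markers before decoding the speech tokens to a waveform
decoded_audio = speech_tokenizer.decode(output_text.replace(" ", "").replace("<s>", "").replace("</s>", ""), speaker_id=2)
save_array_to_wav_int16(decoded_audio, filename="generated.wav")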
```

---

## 🗣️ Inference Examples

### 🎧 Speech Continuation

Input: `speech.wav` (spoken sentence)

Output: Expressive speech continuation in the same style and tone.

---

### 💬 Mixed Input: Text → Speech

Prompt:

```
"Once upon a time in a small village, a mysterious sound echoed through the forest. [Speech]"
```

Output: Expressive spoken continuation in WAV format.
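
Note that the prompt above already ends in `[Speech]`, while `get_inference_text` appends the marker itself, so passing the prompt through unchanged would duplicate it. A minimal sketch of a helper that leaves exactly one trailing marker follows; the name `with_speech_marker` is hypothetical and not part of any library.

```python
SPEECH_MARKER = "[Speech]"

def with_speech_marker(prompt: str) -> str:
    """Return the prompt ending in exactly one "[Speech]" marker."""
    text = prompt.strip()
    # Remove any markers already present at the end of the prompt.
    while text.endswith(SPEECH_MARKER):
        text = text[: -len(SPEECH_MARKER)].rstrip()
    if not text:
        return SPEECH_MARKER
    return f"{text} {SPEECH_MARKER}"

print(with_speech_marker("A mysterious sound echoed through the forest. [Speech]"))
```

A string built this way can be tokenized directly, in place of the `prompt + " [Speech]"` concatenation inside `get_inference_text`.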

---

## 🧠 Model Details

| Feature             | Description                                       |
| ------------------- | ------------------------------------------------- |
| Architecture        | 2B parameter distilled transformer                |
| Tokenizer           | SPIRIT-LM Expressive (HuBERT + pitch/style)       |
| Tasks               | Speech continuation, mixed speech-text generation |
| Teacher Model       | SPIRIT-LM-Expressive 7B                           |
| Distillation Method | Layer-aligned: hidden states, attention, logits   |
| Input Types         | Discrete HuBERT tokens and text                   |

---

## 📎 Citation

```bibtex
@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}
```

---

## 📂 Resources

* 🔗 [Project Page](https://mohammadmahdinoori.github.io/tinywave-landing/)
* 💬 [Demo Samples](https://mohammadmahdinoori.github.io/tinywave-landing/#samples)
* 🧠 [Training & Codebase](https://github.com/mohammadmahdinoori/TinyWave)