# TinyWave Interleaved Expressive 2B

**TinyWave Interleaved Expressive 2B** is a compact, expressive speech-to-speech and speech-text language model distilled from the 7B SPIRIT-LM teacher. It supports **interleaved audio and text inputs** and is trained on 50k hours of public data using a multi-level layer-aligned distillation framework.

Despite being 3× smaller than its teacher, the model retains **93–97%** of the teacher's accuracy on expressive benchmarks such as StoryCloze and SALMon, and it outperforms size-matched baselines. It is well suited to **real-time multimodal agents**, **spoken dialogue systems**, and **low-resource deployment**.

> 📖 For more information, see the [TinyWave paper (arXiv:2506.23670)](https://arxiv.org/abs/2506.23670) and the [project website](https://mohammadmahdinoori.github.io/tinywave-landing/).

## 🔧 Usage

This model accepts interleaved speech and text inputs. Inputs must be encoded with SPIRIT-LM's **expressive speech tokenizer**.

### 1. Clone SPIRIT-LM and Install Requirements


```bash
git clone https://github.com/facebookresearch/spiritlm
cd spiritlm
pip install -e '.[eval]'
```
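To confirm the package is importable, a minimal check (note: the expressive tokenizer also needs the SPIRIT-LM model checkpoints, which are distributed separately; see the spiritlm repository for access instructions):

```python
# Minimal import check; tokenizer checkpoints are downloaded separately
import spiritlm
print("spiritlm installed at:", spiritlm.__file__)
```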

---

### 2. Load Tokenizer

```python
from spiritlm.speech_tokenizer import spiritlm_expressive

# Expressive tokenizer: HuBERT content units plus pitch and style tokens
speech_tokenizer = spiritlm_expressive()
```
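As a quick sanity check, you can encode a short clip and inspect the resulting token string (the file path is a placeholder; token labels such as `[Hu..]`, `[Pi..]`, `[St..]` are illustrative):

```python
import torchaudio

# Placeholder path; SPIRIT-LM operates on 16 kHz mono audio
audio, sr = torchaudio.load("sample.wav")
input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()

# Returns a string of discrete units, e.g. "[St..][Pi..][Hu..]..."
print(speech_tokenizer.encode_string(input_values))
```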

---

### 3. Inference Code

```python
from transformers import LlamaForCausalLM, AutoTokenizer
from spiritlm.speech_tokenizer import spiritlm_expressive
import torchaudio
import torch

# Load model and text tokenizer
MODEL_PATH = "tinywave/interleaved-expressive-2b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = LlamaForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", torch_dtype=torch.bfloat16)

# Expressive speech tokenizer (reuse the instance from step 2 if already loaded)
speech_tokenizer = spiritlm_expressive()

# Speech prompt: encode the audio to unit tokens, then continue with the LM
def get_inference(input_audio_path):
    audio, _ = torchaudio.load(input_audio_path)
    input_values = audio.view(1, 1, -1).to(speech_tokenizer.hubert_model.device).float()
    string_tokens = speech_tokenizer.encode_string(input_values)
    input_ids = tokenizer(string_tokens, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])

# Text prompt: the trailing "[Speech]" marker steers the model to continue in speech tokens
def get_inference_text(prompt):
    input_ids = tokenizer(prompt + " [Speech]", return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, top_p=0.9, temperature=0.9, do_sample=True)
    return tokenizer.decode(output[0])
```
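Both helpers return a raw token string; step 4 shows how to render it back to audio. A minimal smoke test (file path and prompt are placeholders):

```python
speech_continuation = get_inference("speech.wav")
text_to_speech = get_inference_text("Once upon a time in a small village,")
print(text_to_speech[:200])  # the prompt followed by generated unit tokens
```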

---

### 4. Decoding to WAV (optional)

```python
import numpy as np
from scipy.io.wavfile import write

def save_array_to_wav_int16(audio_array: np.ndarray, sampling_rate=16000, filename="output.wav"):
    # Peak-normalize and convert to 16-bit PCM
    scaled = np.int16(audio_array / np.max(np.abs(audio_array)) * 32767)
    write(filename, sampling_rate, scaled)

# `output_text` is the string returned by get_inference(...) or get_inference_text(...)
output_text = get_inference("speech.wav")

# Strip the BOS/EOS markers and spaces, then vocode the unit string back to a waveform
decoded_audio = speech_tokenizer.decode(
    output_text.replace(" ", "").replace("<s>", "").replace("</s>", ""), speaker_id=2
)
save_array_to_wav_int16(decoded_audio, filename="generated.wav")
```

---

## 🗣️ Inference Examples

### 🎧 Speech Continuation

* **Input:** `speech.wav` (a spoken sentence)
* **Output:** an expressive speech continuation in the same style and tone

---

### 💬 Mixed Input: Text → Speech

Prompt:

```
"Once upon a time in a small village, a mysterious sound echoed through the forest. [Speech]"
```

**Output:** an expressive spoken continuation in WAV format, as sketched below.
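A sketch of this flow, reusing the helpers from steps 3 and 4; it assumes the generated speech tokens follow the `[Speech]` marker, so the text prefix is stripped before decoding:

```python
prompt = "Once upon a time in a small village, a mysterious sound echoed through the forest."
output_text = get_inference_text(prompt)  # the helper appends " [Speech]" itself

# Assumption: everything after the [Speech] marker is the decodable unit string
unit_string = output_text.split("[Speech]", 1)[-1]
unit_string = unit_string.replace(" ", "").replace("<s>", "").replace("</s>", "")

audio = speech_tokenizer.decode(unit_string, speaker_id=2)
save_array_to_wav_int16(audio, filename="story.wav")
```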

---

## 🧠 Model Details

| Feature             | Description                                       |
| ------------------- | ------------------------------------------------- |
| Architecture        | 2B parameter distilled transformer                |
| Tokenizer           | SPIRIT-LM Expressive (HuBERT + pitch/style)       |
| Tasks               | Speech continuation, mixed speech-text generation |
| Teacher Model       | SPIRIT-LM-Expressive 7B                           |
| Distillation Method | Layer-aligned: hidden states, attention, logits   |
| Input Types         | Discrete HuBERT tokens and text                   |

---

## 📎 Citation

```bibtex
@article{nouriborji2025tinywave,
  title={Efficient Interleaved Speech Modeling through Knowledge Distillation},
  author={Nouriborji, Mohammadmahdi and Rohanian, Morteza},
  journal={arXiv preprint arXiv:2506.23670},
  year={2025}
}
```

---

## 📂 Resources

* 🔗 [Project Page](https://mohammadmahdinoori.github.io/tinywave-landing/)
* 💬 [Demo Samples](https://mohammadmahdinoori.github.io/tinywave-landing/#samples)
* 🧠 [Training & Codebase](https://github.com/mohammadmahdinoori/TinyWave)