Add project page to model card #9
by nielsr (HF Staff) · opened

README.md CHANGED

@@ -1,153 +1,142 @@
---
library_name: transformers
---

# AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

* 2025-05-10 · a-m-team

<p align="center">
</p>

### 3) Writing
<div style="text-align: center;">
<img src="assets/writing.png" alt="sushi" width="90%">
</div>

## ⚡ Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "a-m-team/AM-Thinking-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "How can I find inner peace?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=49152
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

response = tokenizer.decode(output_ids, skip_special_tokens=True)
think_content = response.split("<think>")[1].split("</think>")[0]
answer_content = response.split("<answer>")[1].split("</answer>")[0]

print(f"user prompt: {prompt}")
print(f"model thinking: {think_content}")
print(f"model answer: {answer_content}")
```
> Note: We have included the system prompt in the tokenizer configuration, as it was used during both the SFT and RL stages. To ensure consistent output quality, we recommend including the same system prompt during actual usage; otherwise, the model's responses may be significantly affected.
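The quick-start snippet splits on the `<think>`/`<answer>` tags directly, which raises an `IndexError` whenever a tag pair is missing from the generation. A slightly more defensive parsing sketch (not part of the official example; the fallback behaviour is our assumption):

```python
import re

def parse_response(response: str):
    """Extract the <think> and <answer> blocks from a completed generation.

    Fallbacks are an assumption, not documented model behaviour: a missing
    think block yields an empty string, a missing answer block yields the
    raw decoded text.
    """
    def block(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        return m.group(1).strip() if m else None

    think = block("think") or ""
    answer = block("answer")
    return think, answer if answer is not None else response.strip()
```

With a well-formed response such as `"<think>steps</think><answer>42</answer>"` this yields `("steps", "42")`; a tag-free string is returned unchanged as the answer.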

### Quantized versions for compact devices
A series of quantized versions of the AM-Thinking-v1 model, for use with [llama.cpp](https://github.com/ggml-org/llama.cpp) and [Ollama](https://github.com/ollama/ollama), is available at [AM-Thinking-v1-gguf](https://huggingface.co/a-m-team/AM-Thinking-v1-gguf).

**Step 1 – Cold-start SFT.**
We begin with the open-sourced **Qwen 2.5-32B-Base** and run a broad supervised fine-tune on a blended training dataset of math, code, and open-domain chat. This endows the model with a "think-then-answer" behavioural pattern and equips it with an initial capacity for reasoning.

**Step 2 – Pass-rate-aware data curation.**
Before any RL, the SFT model is evaluated on every math- and code-oriented training query. For each item we log a pass rate; only those with **0 < pass rate < 1** are kept. In effect we discard problems the model already masters and those it utterly fails, concentrating learning on genuinely informative cases.
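In code, this curation rule reduces to a one-line filter. A minimal sketch (the query dicts and the `pass_rate` field are illustrative, not the team's actual data schema):

```python
def curate(queries):
    """Keep only queries the SFT model sometimes, but not always, solves.

    Each query is assumed (hypothetically) to carry a precomputed pass
    rate in [0, 1] over repeated sampled attempts.
    """
    return [q for q in queries if 0.0 < q["pass_rate"] < 1.0]

pool = [
    {"id": "mastered", "pass_rate": 1.0},   # always solved -> dropped
    {"id": "hopeless", "pass_rate": 0.0},   # never solved  -> dropped
    {"id": "useful",   "pass_rate": 0.4},   # sometimes     -> kept
]
kept = curate(pool)
```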

**Step 3 – Reinforcement learning.**
We adopt a two-stage GRPO scheme: Stage 1 trains only on math and code queries. Once it converges, Stage 2 starts by removing every query the model answered 100% correctly in Stage 1 and by adjusting key hyper-parameters such as maximum generation length and learning rate.

## ⚠️ Limitations

While AM-Thinking-v1 excels at pure language reasoning and open-domain chat, it has not yet been trained for structured function-calling or tool-use workflows, which restricts its usefulness in agent-style applications that must act on external systems.
Improving the model's ability to follow complex instructions is also an important direction for our future work.
In addition, our safety alignment is still at an early stage, so more rigorous red-teaming is required to reduce potential harms.
```
@
    url={https://arxiv.org/abs/2505.08311},
}
```

---
datasets:
- maitrix-org/Voila-Benchmark
- maitrix-org/Voila-million-voice
language:
- en
- zh
- fr
- de
- ja
- ko
library_name: transformers
license: mit
pipeline_tag: audio-text-to-text
---

<p align="center">
    <img src="https://voila.maitrix.org/static/images/logo.png" width="400"/><br/>
    <b>Voila: <span style="color:#ca00f9">Voi</span>ce-<span style="color:#ca00f9">La</span>nguage Foundation Models</b><br/><br/>
    🌐 <a href="https://voila.maitrix.org"><b>Project Page</b></a> | 🖥️ <a href="https://github.com/maitrix-org/Voila">GitHub</a> | 🤗 <a href="https://huggingface.co/collections/maitrix-org/voila-67e0d96962c19f221fc73fa5">Hugging Face</a> | 📄 <a href="http://arxiv.org/abs/2505.02707">Paper</a> | 🎧 <a href="https://huggingface.co/spaces/maitrix-org/Voila-demo">Online Demo</a> | 🏠 <a href="https://maitrix.org">Maitrix.org</a>
</p>

Voila is a new family of large voice-language foundation models aiming to lift human-AI interaction experiences to the next level. Breaking away from the constraints of traditional voice AI systems (high latency, loss of vocal nuances, mechanical responses), Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, surpassing average human response times. Combining advanced voice and language modeling, Voila offers customizable, persona-driven engagements and excels in a range of audio tasks from ASR and TTS to speech translation across six languages. With the online [web demo](https://huggingface.co/spaces/maitrix-org/Voila-demo), Voila invites you to explore a transformative, natural dialogue experience between human and AI.

# ✨ Highlights
- ⭐ High-fidelity, low-latency, real-time streaming audio processing
- ⭐ Effective integration of voice and language modeling capabilities
- ⭐ Millions of pre-built and custom voices, fast voice switching during conversation
- ⭐ Unified model for various audio tasks

# 🎥 Video Demo
[Watch the video demo](https://www.youtube.com/watch?v=J27M9-g5KL0)

# 🔥 Latest News!!

* April 28, 2025: We've released the inference code and model weights of Voila.

# ⚙️ Foundation Models

| Model | Description | Download Link |
|--------|-----------|-----------------|
|Voila-base|Voila base model|https://huggingface.co/maitrix-org/Voila-base|
|Voila-Chat|End-to-end audio chat model|https://huggingface.co/maitrix-org/Voila-chat|
|Voila-Autonomous (preview)|Full-duplex audio chat model|https://huggingface.co/maitrix-org/Voila-autonomous-preview|
|Voila-Audio-alpha|Empowering LLM with raw audio input|https://huggingface.co/maitrix-org/Voila-audio-alpha|
|Voila-Tokenizer|Audio tokenizer|https://huggingface.co/maitrix-org/Voila-Tokenizer|

## Usage
### CLI demo
```shell
for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
    # Text chat
    python infer.py \
        --model-name ${model_name} \
        --instruction "" \
        --input-text "Hello" \
        --task-type chat_tito
    # Voice chat
    python infer.py \
        --model-name ${model_name} \
        --instruction "" \
        --input-audio "examples/test1.mp3" \
        --task-type chat_aiao
done

# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto
```

### Gradio demo
```shell
python gradio_demo.py
```

For more information, please refer to the [code repository](https://github.com/maitrix-org/Voila).

# 📁 Datasets
We publish the following two datasets: the Voila Benchmark and the Voila Voice Library. The Voila Benchmark is a novel speech evaluation benchmark, while the Voila Voice Library provides millions of pre-built and customizable voices.

| Dataset | Description | Download Link |
|--------|-----------|-----------------|
|Voila Benchmark| Speech evaluation benchmark | https://huggingface.co/datasets/maitrix-org/Voila-Benchmark |
|Voila Voice Library| Millions of pre-built voices | https://huggingface.co/datasets/maitrix-org/Voila-million-voice |

# 📊 Benchmark
## 1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k. We compare our results with SpeechGPT and Moshi.

| Model | Voila Benchmark |
|-------|----------------|
|SpeechGPT| 13.29|
|Moshi | 11.45 |
|**Voila** | **30.56** |

_(higher is better)_

For detailed scores of the Voila Benchmark on each specific domain, please refer to our paper (Section 5.1, "Evaluation of Voila Benchmark").
## 2. Evaluation of ASR
As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate ASR and TTS performance.
For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as our metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. When the LibriSpeech training split is also used, Voila achieves an impressive WER of 2.7%.

| Model | LibriSpeech test-clean (WER) |
|-------|-----------------------|
|Whisper large v2|2.7|
|Whisper large v3|2.2|
|FastConformer|3.6|
|VoxtLM |2.7|
|Moshi |5.7|
|**Voila (w/o LibriSpeech train split)** |**4.8**|
|**Voila (with LibriSpeech train split)**|**2.7**|

_(lower is better)_
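WER is the word-level edit distance between the hypothesis transcript and the reference, divided by the reference length. A minimal reference implementation for intuition (not the evaluation script behind the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic programming over the edit-distance table.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion (d[j] + 1), insertion (d[j-1] + 1), substitution/match (prev + cost)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

A perfect transcript scores 0.0; one substituted word in a three-word reference scores 1/3.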

## 3. Evaluation of TTS
For TTS, we follow the evaluation protocol proposed in Vall-E: the generated audio is transcribed with HuBERT-Large and the transcript is scored against the input text.
Voila once again leads with a WER of 3.2% (2.8% when using the LibriSpeech training data).

| Model | LibriSpeech test-clean (WER) |
|-------|-----------------------|
|YourTTS |7.7|
|Vall-E|5.9|
|Moshi|4.7|
|**Voila (w/o LibriSpeech train split)** |**3.2**|
|**Voila (with LibriSpeech train split)** |**2.8**|

_(lower is better)_

# 📝 Citation
If you find our work helpful, please cite us.

```
@article{voila2025,
  author = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  year = {2025}
}
```