shiyemin2 committed on
Commit 0947cf2 · verified · 1 Parent(s): 2104b60

Update README.md

Files changed (1): README.md (+142 -195)

README.md CHANGED
---
library_name: transformers
license: mit
datasets:
- maitrix-org/Voila-Benchmark
- maitrix-org/Voila-million-voice
language:
- en
- zh
- fr
- de
- ja
- ko
---

<p align="center">
    <img src="https://voila.maitrix.org/static/images/logo.png" width="400"/><br/>
    <b>Voila: <span style="color:#ca00f9">Voi</span>ce-<span style="color:#ca00f9">La</span>nguage Foundation Models</b><br/><br/>
    💜 <a href="https://voila.maitrix.org/"><b>Voila</b></a> &nbsp;&nbsp;｜&nbsp;&nbsp; 🖥️ <a href="https://github.com/maitrix-org/Voila">GitHub</a> &nbsp;&nbsp;｜&nbsp;&nbsp; 🤗 <a href="https://huggingface.co/collections/maitrix-org/voila-67e0d96962c19f221fc73fa5">Hugging Face</a> &nbsp;&nbsp;｜&nbsp;&nbsp; 📑 <a href="">Paper (Coming soon)</a> &nbsp;&nbsp;｜&nbsp;&nbsp; 🌐 <a href="https://voila-demo.maitrix.org">Demo</a>
</p>

Voila is a groundbreaking family of large audio-language foundation models that redefines human-AI interaction. Breaking away from the constraints of traditional voice AI systems (high latency, loss of vocal nuances, mechanical responses), Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, faster than the average human response time. Combining advanced voice and language modeling, Voila offers customizable, persona-driven engagement and excels at a range of audio tasks, from ASR and TTS to speech translation across six languages. Try the online [web demo](https://voila-demo.maitrix.org/) to explore a transformative, natural dialogue experience between humans and AI.

# ✨ Highlights
- ⭐ High-fidelity, low-latency, real-time streaming audio processing
- ⭐ Effective integration of voice and language modeling capabilities
- ⭐ Millions of pre-built and custom voices, with fast voice switching during conversation
- ⭐ Unified model for various audio tasks

# 🎥 Video Demo
<div align="center">
    <video width="60%" controls>
        <source src="https://voila.maitrix.org/static/videos/voila-demo.mp4" type="video/mp4">
        Your browser does not support the video tag.
    </video>
</div>

# 🔥 Latest News!!

* Mar 25, 2025: 👋 We've released the inference code and model weights of Voila.

# ⚙️ Foundation Models

| Model | Description | Download Link |
|-------|-------------|---------------|
| Voila-base | Voila base model | https://huggingface.co/maitrix-org/Voila-base |
| Voila-Chat | End-to-end audio chat model | https://huggingface.co/maitrix-org/Voila-chat |
| Voila-Autonomous (preview) | Full-duplex audio chat model | https://huggingface.co/maitrix-org/Voila-autonomous-preview |
| Voila-Audio-alpha | Empowering an LLM with raw audio input | https://huggingface.co/maitrix-org/Voila-audio-alpha |
| Voila-Tokenizer | Audio tokenizer | https://huggingface.co/maitrix-org/Voila-Tokenizer |
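
If you want to fetch any of these checkpoints programmatically before running the demos below, a minimal sketch with `huggingface_hub` (the repo IDs come from the table above; which repos you actually need depends on the task):

```python
# Minimal sketch: download Voila checkpoints from the Hub.
# snapshot_download and its repo_id argument are standard huggingface_hub
# API; the particular repos chosen here are illustrative.
from huggingface_hub import snapshot_download

for repo_id in ("maitrix-org/Voila-chat", "maitrix-org/Voila-Tokenizer"):
    local_dir = snapshot_download(repo_id=repo_id)  # returns the local cache path
    print(f"{repo_id} -> {local_dir}")
```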

## Usage

### CLI demo
```shell
for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
    # Text chat
    python infer.py \
        --model-name ${model_name} \
        --instruction "" \
        --input-text "Hello" \
        --task-type chat_tito
    # Voice chat
    python infer.py \
        --model-name ${model_name} \
        --instruction "" \
        --input-audio "examples/test1.mp3" \
        --task-type chat_aiao
done

# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto
```

### Gradio demo
```shell
python gradio_demo.py
```

For more information, please refer to the [code repository](https://github.com/maitrix-org/Voila).

# 📁 Datasets
We publish two datasets: the Voila Benchmark and the Voila Voice Library. The Voila Benchmark is a novel speech evaluation benchmark, while the Voila Voice Library provides millions of pre-built and customizable voices.

| Dataset | Description | Download Link |
|---------|-------------|---------------|
| Voila Benchmark | Speech evaluation benchmark | https://huggingface.co/datasets/maitrix-org/Voila-Benchmark |
| Voila Voice Library | Millions of pre-built voices | https://huggingface.co/datasets/maitrix-org/Voila-million-voice |
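
Both datasets can be loaded with the `datasets` library. A minimal sketch; the split name `"test"` is an assumption, so check the dataset cards for the actual configurations and splits:

```python
# Minimal sketch: loading the published datasets with the datasets library.
# The split name "test" is an assumption; see the dataset cards for the
# real configurations.
from datasets import load_dataset

bench = load_dataset("maitrix-org/Voila-Benchmark", split="test")
print(bench[0])
```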

# 📊 Benchmark

## 1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k. We compare our results with SpeechGPT and Moshi.

| Model | Voila Benchmark |
|-------|-----------------|
| SpeechGPT | 13.29 |
| Moshi | 11.45 |
| **Voila** | **30.56** |

_(higher is better)_

For detailed scores of the Voila Benchmark on each specific domain, please refer to our paper (Section 5.1, "Evaluation of Voila Benchmark").

## 2. Evaluation of ASR
As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate ASR and TTS performance.
For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as our metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. When the LibriSpeech train split is included in training, Voila achieves a WER of 2.7%.

| Model | LibriSpeech test-clean WER (%) |
|-------|--------------------------------|
| Whisper large v2 | 2.7 |
| Whisper large v3 | 2.2 |
| FastConformer | 3.6 |
| VoxtLM | 2.7 |
| Moshi | 5.7 |
| **Voila (w/o LibriSpeech train split)** | **4.8** |
| **Voila (with LibriSpeech train split)** | **2.7** |

_(lower is better)_
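
For reference, the WER metric used in these tables can be computed with the `jiwer` package. A minimal sketch with illustrative strings; in the actual evaluation, the hypothesis would be Voila's ASR transcript of a LibriSpeech test-clean utterance:

```python
# Minimal sketch: computing Word Error Rate with jiwer.
# The strings below are illustrative, not LibriSpeech transcripts.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # 1 substitution / 9 words ~= 0.111
```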

## 3. Evaluation of TTS
For TTS, we follow the evaluation protocol proposed for Vall-E, which involves transcribing the generated audio with HuBERT-Large and scoring the transcript against the input text.
Voila once again leads with a WER of 3.2% (2.8% when using LibriSpeech training data).

| Model | LibriSpeech test-clean WER (%) |
|-------|--------------------------------|
| YourTTS | 7.7 |
| Vall-E | 5.9 |
| Moshi | 4.7 |
| **Voila (w/o LibriSpeech train split)** | **3.2** |
| **Voila (with LibriSpeech train split)** | **2.8** |

_(lower is better)_
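
A sketch of this Vall-E-style protocol: transcribe the generated speech with a HuBERT-Large CTC model, then score the transcript against the input text. The `facebook/hubert-large-ls960-ft` checkpoint and the 16 kHz sampling rate are our assumptions; the exact setup in the paper may differ.

```python
# Sketch of a Vall-E-style TTS evaluation: transcribe generated speech
# with HuBERT-Large, then compute WER against the input text.
# Checkpoint choice and 16 kHz sampling rate are assumptions.
import torch
import jiwer
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

def tts_wer(input_text, generated_audio, sampling_rate=16000):
    """WER of the ASR transcript of `generated_audio` against `input_text`."""
    inputs = processor(generated_audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    pred_ids = torch.argmax(logits, dim=-1)
    transcript = processor.batch_decode(pred_ids)[0]
    # The CTC model emits uppercase text without punctuation, so normalize both sides.
    return jiwer.wer(input_text.lower(), transcript.lower())
```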

# 📝 Citation
If you find our work helpful, please cite us.

```bibtex
@article{voila2025,
    author = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
    title = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
    eprint = {},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    year = {2025}
}
```