VITA-MLLM/VITA-Audio-Balance

@@ -1,17 +1,23 @@
 ---
-license: apache-2.0
 language:
 - zh
 - en
-base_model:
-- Qwen/Qwen2.5-7B-Instruct
 ---
 ## ACCEPTABLE USE POLICY
@@ -30,7 +36,7 @@ Tencent endeavors to promote safe and fair use of its tools and features, includ
 10. To generate or disseminate personally identifiable information with the purpose of harming others;
 11. To generate or disseminate information (including images, code, posts, articles), and place the information in any public context (including through the use of bot-generated tweets), without expressly and conspicuously identifying that the information and/or content is machine generated;
 12. To impersonate another individual without consent, authorization, or legal right;
-13. To make high-stakes automated decisions in domains that affect an individual’s safety, rights or wellbeing (e.g., law enforcement, migration, medicine/health, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance);
 14. In a manner that violates or disrespects the social ethics and moral standards of other countries or regions;
 15. To perform, facilitate, threaten, incite, plan, promote or encourage violent extremism or terrorism;
 16. For any use intended to discriminate against or harm individuals or groups based on protected characteristics or categories, online or offline social behavior or known or predicted personal or personality characteristics;

 ---
+base_model:
+- Qwen/Qwen2.5-7B-Instruct
 language:
 - zh
 - en
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-to-audio
 ---
+## Paper title and link
+The model was presented in the paper [VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model](https://huggingface.co/papers/2505.03739).
+## Paper abstract
+The abstract of the paper is as follows:
+With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
 ## ACCEPTABLE USE POLICY
 10. To generate or disseminate personally identifiable information with the purpose of harming others;
 11. To generate or disseminate information (including images, code, posts, articles), and place the information in any public context (including through the use of bot-generated tweets), without expressly and conspicuously identifying that the information and/or content is machine generated;
 12. To impersonate another individual without consent, authorization, or legal right;
+13. To make high-stakes automated decisions in domains that affect an individual’s safety, rights or wellbeing (e.g., law enforcement, migration, medicine/health, management of critical infrastructure, safety components of products, essential services, credit, employment, housing, education, social scoring, or insurance);\
 14. In a manner that violates or disrespects the social ethics and moral standards of other countries or regions;
 15. To perform, facilitate, threaten, incite, plan, promote or encourage violent extremism or terrorism;
 16. For any use intended to discriminate against or harm individuals or groups based on protected characteristics or categories, online or offline social behavior or known or predicted personal or personality characteristics;