|
|
---
license: apache-2.0
base_model:
- kyutai/moshika-pytorch-bf16
pipeline_tag: audio-to-audio
---
|
|
|
|
|
Muchi is a fine-tuned speech-text foundation model and full-duplex spoken dialogue framework, based on the original Moshi model.
|
|
|
|
|
## How to use
|
|
|
|
|
You can try out this model with the original Moshi web UI: start the server from the Moshi repository and point it at this repo.
|
|
|
|
|
https://github.com/kyutai-labs/moshi
|
|
|
|
|
```bash
python -m moshi.server [--gradio-tunnel] [--hf-repo DavidBrowne17/Muchi]
```
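Passing `--hf-repo DavidBrowne17/Muchi` makes the server download this model's weights instead of the default Moshi checkpoint, and the optional `--gradio-tunnel` flag additionally prints a shareable public link. Once running, the server serves the Moshi web UI locally (typically at http://localhost:8998; check the server log for the exact URL).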
|
|
|
|
|
## Model Details
|
|
|
|
|
- **PyTorch version:** weights stored in bf16 precision
- **Model type:** multimodal speech-text foundation model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
|
|
|
|
|
## Model Description
|
|
|
|
|
Muchi is a refined version of the Moshi model, designed for smoother, more adaptable dialogue generation. Building upon Moshi's speech-to-speech generation foundation, Muchi enhances conversational coherence and reduces latency. Like Moshi, it uses a residual quantizer from a neural audio codec to generate speech tokens and models its own and the user's speech as parallel streams. This framework supports dynamic conversational flow without rigid speaker turns.
|
|
|
|
|
Muchi also implements the "Inner Monologue" method, predicting time-aligned text tokens before generating speech tokens. This approach enhances linguistic quality, supports streaming speech recognition, and improves text-to-speech output. Muchi achieves a practical latency of approximately 200ms, ensuring near real-time interaction.
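Beyond the web UI, the `moshi` Python package exposes these components directly. The sketch below is a minimal outline of streaming inference, adapted from the example in the kyutai-labs/moshi README; it assumes this repo keeps Moshi's weight-file layout (so that `loaders.MIMI_NAME` and `loaders.MOSHI_NAME` resolve inside `DavidBrowne17/Muchi`), and the loader API may differ across `moshi` versions.

```python
# Minimal sketch of streaming inference with the moshi package, adapted from
# the kyutai-labs/moshi README. Assumes this repo keeps Moshi's file layout;
# loader names may differ across moshi versions.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders, LMGen

repo = "DavidBrowne17/Muchi"  # assumption: same weight filenames as Moshi
mimi = loaders.get_mimi(hf_hub_download(repo, loaders.MIMI_NAME), device="cuda")
mimi.set_num_codebooks(8)  # Moshi-style models use 8 of Mimi's codebooks

moshi_lm = loaders.get_moshi_lm(hf_hub_download(repo, loaders.MOSHI_NAME), device="cuda")
lm_gen = LMGen(moshi_lm, temp=0.8, temp_text=0.7)

# Feed user audio one 80 ms frame at a time; each step yields a time-aligned
# text token plus audio tokens for Muchi's own stream, decoded back to audio.
wav = torch.randn(1, 1, mimi.sample_rate, device="cuda")  # stand-in user audio
frame_size = int(mimi.sample_rate / mimi.frame_rate)
out_chunks = []
with torch.no_grad(), mimi.streaming(1), lm_gen.streaming(1):
    for offset in range(0, wav.shape[-1], frame_size):
        codes = mimi.encode(wav[:, :, offset : offset + frame_size])
        tokens = lm_gen.step(codes)  # [B, 1 + 8, 1]; index 0 is the text token
        if tokens is not None:
            out_chunks.append(mimi.decode(tokens[:, 1:]))
out_wav = torch.cat(out_chunks, dim=-1)  # Muchi's spoken reply so far
```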
|
|
|
|
|
Key enhancements in Muchi:

- Reduced latency and smoother conversational flow.
- Enhanced adaptability in dialogue dynamics.
- Improved speech synthesis quality.
|
|
|
|
|
## Uses
|
|
|
|
|
### Direct Use
|
|
|
|
|
Muchi can be deployed as a conversational agent for:

- Casual conversation.
- Basic factual responses and advice.
- Roleplay scenarios.
- Low-latency interactive tasks.
|
|
|
|
|
### Downstream Use
|
|
|
|
|
- Components like the audio codec can be repurposed for training speech models or enhancing text-to-speech systems (see the sketch after this list).
- The fine-tuned architecture allows for domain-specific adaptations with additional training.
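As one illustration of reusing the codec, the following sketch uses Mimi on its own as a neural audio tokenizer. It follows the Mimi example in the kyutai-labs/moshi README and pulls the codec weights from the upstream repo (`loaders.DEFAULT_REPO`); treat the exact loader names as assumptions about your installed `moshi` version.

```python
# Minimal sketch: Mimi as a standalone neural audio codec (RVQ tokenizer).
# Based on the kyutai-labs/moshi README; loader names are assumed stable.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Mimi supports up to 32; Moshi-style models use 8

wav = torch.randn(1, 1, mimi.sample_rate)  # one second of audio, [B, C=1, T]
with torch.no_grad():
    codes = mimi.encode(wav)            # [B, K=8, T'] discrete tokens at 12.5 Hz
    reconstructed = mimi.decode(codes)  # back to a [B, 1, T] waveform
```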
|
|
|
|
|
### Out-of-Scope Use
|
|
|
|
|
Muchi is not intended for:

- Impersonating individuals.
- Malicious applications.
- Professional advice or critical decision-making.
|
|
|
|
|
## Bias, Risks, and Limitations
|
|
|
|
|
Muchi inherits safeguards from Moshi but may still exhibit biases due to the nature of its training data. While toxicity has been minimized, there are risks of over-representation from certain data domains. The model is trained to produce a consistent voice and is not designed for impersonation. Further testing is necessary to evaluate long-term sociotechnical impacts.
|
|
|
|
|
|