DavidBrowne17/Muchi

DavidBrowne17 committed on Mar 16, 2025

Commit c6f1bdc (verified) · 1 Parent(s): 559a416

Update README.md


Files changed (1)

README.md +11 -10


@@ -5,15 +5,15 @@ base_model:
 pipeline_tag: audio-to-audio
 ---
-Mochi is a finetuned speech-text foundation model and full-duplex spoken dialogue framework, based on the original Moshi model.
-How to use:
-You can use the original moshi ui to try out this model, just start the server pointed to this model
 https://github.com/kyutai-labs/moshi
-python -m moshi.server [--gradio-tunnel] [--hf-repo DavidBrowne17/Mochi]
 Model Details
@@ -27,11 +27,11 @@ License: apache 2.0
 Model Description
-Mochi is a refined version of the Moshi model, designed for smoother, more adaptable dialogue generation. Building upon Moshi’s speech-to-speech generation foundation, Mochi enhances conversational coherence and reduces latency. Like Moshi, it uses a residual quantizer from a neural audio codec to generate speech tokens and models its own and user speech into parallel streams. This framework supports dynamic conversational flow without rigid speaker turns.
-Mochi also implements the "Inner Monologue" method, predicting time-aligned text tokens before generating speech tokens. This approach enhances linguistic quality, supports streaming speech recognition, and improves text-to-speech output. Mochi achieves a practical latency of approximately 200ms, ensuring near real-time interaction.
-Key Enhancements in Mochi:
 Reduced latency and smoother conversational flow.
@@ -43,7 +43,7 @@ Uses
 Direct Use
-Mochi can be deployed as a conversational agent for:
 Casual conversation.
@@ -61,7 +61,7 @@ The finetuned architecture allows for domain-specific adaptations with additiona
 Out-of-Scope Use
-Mochi is not intended for:
 Impersonating individuals.
@@ -71,4 +71,5 @@ Professional advice or critical decision-making.
 Bias, Risks, and Limitations
-Mochi inherits safeguards from Moshi but may still exhibit biases due to the nature of its training data. While toxicity has been minimized, there are risks of over-representation from certain data domains. The model is trained to produce a consistent voice and is not designed for impersonation. Further testing is necessary to evaluate long-term sociotechnical impacts

 pipeline_tag: audio-to-audio
 ---
+Muchi is a finetuned speech-text foundation model and full-duplex spoken dialogue framework, based on the original Moshi model.
+How to use:
+You can use the original moshi ui to try out this model, just start the server pointed to this model
 https://github.com/kyutai-labs/moshi
+python -m moshi.server [--gradio-tunnel] [--hf-repo DavidBrowne17/Muchi]
 Model Details
 Model Description
+Muchi is a refined version of the Moshi model, designed for smoother, more adaptable dialogue generation. Building upon Moshi’s speech-to-speech generation foundation, Muchi enhances conversational coherence and reduces latency. Like Moshi, it uses a residual quantizer from a neural audio codec to generate speech tokens and models its own and user speech into parallel streams. This framework supports dynamic conversational flow without rigid speaker turns.
+Muchi also implements the "Inner Monologue" method, predicting time-aligned text tokens before generating speech tokens. This approach enhances linguistic quality, supports streaming speech recognition, and improves text-to-speech output. Muchi achieves a practical latency of approximately 200ms, ensuring near real-time interaction.
+Key Enhancements in Muchi:
 Reduced latency and smoother conversational flow.
 Direct Use
+Muchi can be deployed as a conversational agent for:
 Casual conversation.
 Out-of-Scope Use
+Muchi is not intended for:
 Impersonating individuals.
 Bias, Risks, and Limitations
+Muchi inherits safeguards from Moshi but may still exhibit biases due to the nature of its training data. While toxicity has been minimized, there are risks of over-representation from certain data domains. The model is trained to produce a consistent voice and is not designed for impersonation. Further testing is necessary to evaluate long-term sociotechnical impacts.
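
The launch command in the updated README can also be assembled programmatically. The following is a minimal sketch, not part of the commit: the helper name `build_server_command` is hypothetical, and it handles only the two options the README shows (`--gradio-tunnel` and `--hf-repo`). It assumes the `moshi` package from the linked repository is installed in the current Python environment.

```python
import sys
from typing import List, Optional


def build_server_command(
    hf_repo: Optional[str] = None, gradio_tunnel: bool = False
) -> List[str]:
    """Assemble the argv list for `python -m moshi.server`.

    Hypothetical helper: only the two options shown in the README are handled.
    """
    # Use the current interpreter so the command runs in the same environment.
    command = [sys.executable, "-m", "moshi.server"]
    if gradio_tunnel:
        command.append("--gradio-tunnel")
    if hf_repo is not None:
        command += ["--hf-repo", hf_repo]
    return command


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(build_server_command("DavidBrowne17/Muchi")))
```

To actually start the server, the returned list could be passed to `subprocess.run(command, check=True)`.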