|
|
---
license: apache-2.0
base_model:
- kyutai/moshika-pytorch-bf16
pipeline_tag: audio-to-audio
---
|
|
|
|
|
Muchi is a fine-tuned speech-text foundation model and full-duplex spoken dialogue framework, based on the original Moshi model.
|
|
|
|
|
## How to use
|
|
|
|
|
You can try out this model with the original Moshi web UI: start the server from the Moshi repository and point it at this repo.
|
|
|
|
|
https://github.com/kyutai-labs/moshi
|
|
|
|
|
```bash
python -m moshi.server [--gradio-tunnel] [--hf-repo DavidBrowne17/Muchi]
```
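Passing `--hf-repo DavidBrowne17/Muchi` makes the server download this model's weights instead of the default Moshi checkpoint, and the optional `--gradio-tunnel` flag additionally prints a shareable public link. Once running, the server serves the Moshi web UI locally (typically at http://localhost:8998; check the server log for the exact URL).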
|
|
|
|
|
## Model Details
|
|
|
|
|
- **PyTorch version:** weights stored in bf16 precision
- **Model type:** multimodal speech-text foundation model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
|
|
|
|
|
## Model Description
|
|
|
|
|
Muchi is a refined version of the Moshi model, designed for smoother, more adaptable dialogue generation. Building upon Moshi's speech-to-speech generation foundation, Muchi enhances conversational coherence and reduces latency. Like Moshi, it uses a residual quantizer from a neural audio codec to generate speech tokens and models its own and the user's speech as parallel streams. This framework supports dynamic conversational flow without rigid speaker turns.
|
|
|
|
|
Muchi also implements the "Inner Monologue" method, predicting time-aligned text tokens before generating speech tokens. This approach enhances linguistic quality, supports streaming speech recognition, and improves text-to-speech output. Muchi achieves a practical latency of approximately 200ms, ensuring near real-time interaction.
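Beyond the web UI, the `moshi` Python package exposes these components directly. The sketch below is a minimal outline of streaming inference, adapted from the example in the kyutai-labs/moshi README; it assumes this repo keeps Moshi's weight-file layout (so that `loaders.MIMI_NAME` and `loaders.MOSHI_NAME` resolve inside `DavidBrowne17/Muchi`), and the loader API may differ across `moshi` versions.

```python
# Minimal sketch of streaming inference with the moshi package, adapted from
# the kyutai-labs/moshi README. Assumes this repo keeps Moshi's file layout;
# loader names may differ across moshi versions.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders, LMGen

repo = "DavidBrowne17/Muchi"  # assumption: same weight filenames as Moshi
mimi = loaders.get_mimi(hf_hub_download(repo, loaders.MIMI_NAME), device="cuda")
mimi.set_num_codebooks(8)  # Moshi-style models use 8 of Mimi's codebooks

moshi_lm = loaders.get_moshi_lm(hf_hub_download(repo, loaders.MOSHI_NAME), device="cuda")
lm_gen = LMGen(moshi_lm, temp=0.8, temp_text=0.7)

# Feed user audio one 80 ms frame at a time; each step yields a time-aligned
# text token plus audio tokens for Muchi's own stream, decoded back to audio.
wav = torch.randn(1, 1, mimi.sample_rate, device="cuda")  # stand-in user audio
frame_size = int(mimi.sample_rate / mimi.frame_rate)
out_chunks = []
with torch.no_grad(), mimi.streaming(1), lm_gen.streaming(1):
    for offset in range(0, wav.shape[-1], frame_size):
        codes = mimi.encode(wav[:, :, offset : offset + frame_size])
        tokens = lm_gen.step(codes)  # [B, 1 + 8, 1]; index 0 is the text token
        if tokens is not None:
            out_chunks.append(mimi.decode(tokens[:, 1:]))
out_wav = torch.cat(out_chunks, dim=-1)  # Muchi's spoken reply so far
```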
|
|
|
|
|
Key enhancements in Muchi:

- Reduced latency and smoother conversational flow.
- Enhanced adaptability in dialogue dynamics.
- Improved speech synthesis quality.
|
|
|
|
|
## Uses
|
|
|
|
|
### Direct Use
|
|
|
|
|
Muchi can be deployed as a conversational agent for:

- Casual conversation.
- Basic factual responses and advice.
- Roleplay scenarios.
- Low-latency interactive tasks.
|
|
|
|
|
### Downstream Use
|
|
|
|
|
- Components like the audio codec can be repurposed for training speech models or enhancing text-to-speech systems (see the sketch after this list).
- The fine-tuned architecture allows for domain-specific adaptations with additional training.
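As one illustration of reusing the codec, the following sketch uses Mimi on its own as a neural audio tokenizer. It follows the Mimi example in the kyutai-labs/moshi README and pulls the codec weights from the upstream repo (`loaders.DEFAULT_REPO`); treat the exact loader names as assumptions about your installed `moshi` version.

```python
# Minimal sketch: Mimi as a standalone neural audio codec (RVQ tokenizer).
# Based on the kyutai-labs/moshi README; loader names are assumed stable.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Mimi supports up to 32; Moshi-style models use 8

wav = torch.randn(1, 1, mimi.sample_rate)  # one second of audio, [B, C=1, T]
with torch.no_grad():
    codes = mimi.encode(wav)            # [B, K=8, T'] discrete tokens at 12.5 Hz
    reconstructed = mimi.decode(codes)  # back to a [B, 1, T] waveform
```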
|
|
|
|
|
### Out-of-Scope Use
|
|
|
|
|
Muchi is not intended for:

- Impersonating individuals.
- Malicious applications.
- Professional advice or critical decision-making.
|
|
|
|
|
## Bias, Risks, and Limitations
|
|
|
|
|
Muchi inherits safeguards from Moshi but may still exhibit biases due to the nature of its training data. While toxicity has been minimized, there are risks of over-representation from certain data domains. The model is trained to produce a consistent voice and is not designed for impersonation. Further testing is necessary to evaluate long-term sociotechnical impacts.
|
|
|
|
|
|