
Thread for understanding customisation for end-to-end use cases

by harsh2ai - opened

Thanks to the NVIDIA team for releasing the model. There has been a big spike of activity in the speech AI domain lately, and I had a couple of questions. The model is exactly what the industry needs, supporting a full-duplex architecture with customisation for role taking, but:

  • how can I connect it to some kind of MCP server? (Say I want a user to query xyz details related to them, and I have built an MCP server for textual answering; how can I plug and play it within the same architecture? Is that possible or not?)

  • how can I use it for my own custom language (Hindi, Arabic, Tamil, etc.)?

  • NVIDIA has been releasing multiple speech models lately. Is there any plan to let us patch in an encoder and decoder of our own choice, something like what Hugging Face allows with an AutoModelForCausalLM-style module, so engineers and devs can prototype rapidly before going down the harder path of fine-tuning?

NVIDIA org

This architecture is quite monolithic, so it is hard to modularize by patching in a different encoder or decoder. Also, it is English-only at this point.

Excellent point about the MCP server. We don't have tool-calling support like that right now, but we will try to add support for your scenario in future models.
A workaround that might be possible: prompt the model to say something like "I am looking it up" when it needs external information. Detect that phrase being emitted in the text channel, feed the conversation transcript (produced by a lightweight ASR model such as parakeet running in parallel) to a query-summarizer prompt, and make the MCP query. Then, when the results come back, start a new context for this model with the previous conversation summary and the MCP results in the text prompt. You could play some sort of "on hold" tone/music in the meantime. This is a very experimental suggestion.
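To make the control flow of that workaround concrete, here is a minimal sketch in Python. Every name in it (`stream_text_channel`, `play_hold_audio`, `restart_with_text_prompt`, `call_tool`, the `lookup_user_details` tool) is a hypothetical placeholder, not an API that personaplex or any particular MCP client actually exposes; you would wire each stub to your own serving stack and MCP server.

```python
# Hypothetical sketch of the "detect trigger phrase -> put on hold -> MCP query ->
# restart with summary" loop described above. No real personaplex or MCP API is used.

TRIGGER = "i am looking it up"  # phrase the system prompt asks the model to emit


def summarize_query(transcript: str) -> str:
    """Placeholder: in practice, send the running ASR transcript (e.g. from
    parakeet) to a small LLM with a 'turn this into one lookup query' prompt."""
    return transcript[-500:]  # naive fallback: use the tail of the conversation


def run_with_mcp(session, mcp_client, transcript: str) -> None:
    """Watch the model's text channel; when the trigger phrase appears,
    hold the call, query the MCP server, and restart the duplex session."""
    for chunk in session.stream_text_channel():        # hypothetical text-token stream
        if TRIGGER in chunk.lower():
            session.play_hold_audio()                   # hypothetical "on hold" tone/music

            # Summarize the conversation so far into a single query string.
            query = summarize_query(transcript)

            # Ask the external MCP tool for the user-specific details.
            result = mcp_client.call_tool("lookup_user_details", {"query": query})

            # Start a fresh context seeded with a conversation summary + tool result.
            session.restart_with_text_prompt(
                f"Summary of the call so far: {transcript}\n"
                f"Lookup result: {result}\n"
                "Resume the conversation and relay this information to the user."
            )
            break
```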
