Thread for understanding customisation for end to end use cases

#2
by harsh2ai - opened

Thanks to the NVIDIA team for releasing the model. There has been a big spike of activity in the Speech AI domain lately, and I have a couple of questions. The model is exactly what the industry needs: a full-duplex architecture with customisation for role-taking. But:

  • how can I wire it up to some kind of MCP server? Say I wanted a user to query xyz details related to them, and I had built an MCP server for textual answering — how could I plug it into the same architecture? Is that possible?

  • how can I use it for my own custom language (Hindi, Arabic, Tamil, etc.)?

  • Lately NVIDIA has been releasing multiple speech models. Is there any plan to let us swap in an encoder and decoder of our own choice, something like what Hugging Face allows with the AutoModelForCausalLM-style modules? That would give engineers and devs room for rapid prototyping before committing to the hard route of fine-tuning.

NVIDIA org

This architecture is quite monolithic, so it's hard to modularize with swappable encoders and decoders. It's also English-only at this point.

Excellent point about the MCP server. We don't have tool-calling support like that now, and will try to add support for your scenario in future models.
A workaround that might be possible: prompt the model to say something like "I am looking it up" when it needs external information. Detect that phrase being emitted in the text channel, feed the conversation transcript (using a lightweight ASR model like Parakeet in parallel) to a query-summarizer prompt, and make the MCP query. When the results come back, start a new context for this model with the previous conversation summary and the MCP results in the text prompt. You could play some sort of "on hold" tone or music in the meantime. This is a very experimental suggestion.
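The detection step of this workaround can be sketched as a small rolling-buffer matcher over the text channel. This is only an illustration: the trigger phrase and all function names here are our own placeholders, not part of the model's API.

```typescript
// Trigger phrase and names are placeholders, not PersonaPlex API.
const TRIGGER = "i am looking it up";

function makeTriggerDetector(onTrigger: () => void) {
  let buffer = "";
  return (token: string) => {
    // Rolling window so the buffer never grows unbounded.
    buffer = (buffer + token).toLowerCase().slice(-200);
    if (buffer.includes(TRIGGER)) {
      buffer = "";
      onTrigger(); // kick off ASR transcript -> summarizer -> MCP query
    }
  };
}

// Feed tokens one at a time, as the text channel emits them.
let fired = 0;
const feed = makeTriggerDetector(() => { fired += 1; });
for (const t of ["Sure, ", "I am ", "looking ", "it up", " now."]) feed(t);
console.log(fired); // 1
```

A rolling buffer matters here because the phrase will usually arrive split across several tokens rather than in one message.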

royrajarshi changed discussion status to closed

We built a working implementation of the tool-calling architecture @royrajarshi described, and wanted to share what we learned — including what works and what doesn’t.

What Doesn’t Work: text_prompt for Content Delivery

The suggested approach of restarting the model with MCP results in text_prompt runs into a fundamental limitation: text_prompt is fine-tuned for persona shaping only, not content injection.

PersonaPlex’s text_prompt training data (~2,250 hours of synthetic dialogues) only included persona-shaping prompts ("You are a wise teacher", "You are an astronaut named Alex"). It never included patterns like "Tell the user these facts: [data]" or "Relay this information: [results]".

We tested extensively:

  • "IMPORTANT: Tell the user X" → model greets normally, ignores content
  • "You just looked something up and found: [results]" → model greets normally
  • "Your job right now is to share this: X" → model greets normally
  • First-person framing ("You just found: X") → model greets normally

The model adopts persona (name, role, style) from text_prompt perfectly, but will not relay specific facts. The greeting behavior ("Hello, this is [Name]!") is baked into instruction fine-tuning weights and overrides any content in the prompt.

What Works: Drip-Feed Token Injection + External TTS

We built a Talker-Reasoner architecture (based on DeepMind’s "Agents Thinking Fast and Slow") where PersonaPlex is System 1 and a Letta AI agent with persistent memory + web search is System 2.

Key discovery — drip-feed sendText() at Moshi’s frame rate:

Moshi’s Inner Monologue (per the Moshi paper) predicts one text token per audio frame at 12.5Hz (80ms intervals). Burst injection of 300+ characters at once overwhelms temporal alignment and causes repetition degeneration ("plus plus plus...").

Fix: send 20 characters every 80ms, matching the model’s per-frame consumption rate:

```javascript
// Split into 20-character chunks and pace them at one chunk per 80 ms,
// matching the model's 12.5 Hz per-frame text consumption.
const chunks = response.match(/.{1,20}/g) ?? [response];
for (const chunk of chunks) {
  sendText(chunk);
  await new Promise((r) => setTimeout(r, 80));
}
```

This transfers knowledge into the model’s Inner Monologue without degeneration. However, PersonaPlex also tries to speak the injected text (garbled), so we:

  1. Gate PersonaPlex audio at the server — drop audio frames during delivery + 8s cooldown
  2. Play clean TTS audio instead — sentence-chunked external TTS (~500ms to first audio)
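The gating step can be sketched as a tiny state machine on the server's frame loop. Only the 8 s cooldown comes from the setup above; the class and method names are ours, assuming the server has a monotonic clock and can drop frames:

```typescript
// COOLDOWN_MS comes from the 8 s cooldown described above;
// everything else is a placeholder for your server's frame loop.
const COOLDOWN_MS = 8000;

class AudioGate {
  private gateUntil = 0;

  // Call when a drip-feed delivery of `deliveryMs` length begins.
  startDelivery(now: number, deliveryMs: number): void {
    this.gateUntil = now + deliveryMs + COOLDOWN_MS;
  }

  // Forward model audio frames only once delivery + cooldown have elapsed.
  shouldForward(now: number): boolean {
    return now >= this.gateUntil;
  }
}

const gate = new AudioGate();
gate.startDelivery(0, 1600);           // 1.6 s of drip-fed text
console.log(gate.shouldForward(5000)); // false: still inside the cooldown
console.log(gate.shouldForward(9601)); // true: 1600 + 8000 ms elapsed
```

Keeping the gate server-side (rather than muting in the client) means the garbled frames never leave the bridge at all.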

The drip-feed gives PersonaPlex the knowledge for follow-up questions, while the external TTS delivers the clean answer.

For the MCP use case @harsh2ai asked about: the pattern is trigger detection → MCP tool call → drip-feed results into PersonaPlex + play TTS for the user. PersonaPlex has no idea MCP exists; it just receives knowledge through its Inner Monologue channel.
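Putting the pattern together, here is a hedged sketch with the external pieces injected as dependencies. `queryMcp`, `dripFeed`, and `speakTts` are hypothetical stand-ins for your MCP client, the chunked sendText loop above, and your TTS engine:

```typescript
// All three dependencies are hypothetical stand-ins, not a real API.
type Deps = {
  queryMcp: (q: string) => Promise<string>;   // your MCP client
  dripFeed: (text: string) => Promise<void>;  // 20 chars / 80 ms sendText loop
  speakTts: (text: string) => Promise<void>;  // clean external TTS for the user
};

async function handleTrigger(userQuery: string, deps: Deps): Promise<string> {
  const result = await deps.queryMcp(userQuery);
  // In parallel: PersonaPlex absorbs the knowledge, the user hears clean TTS.
  await Promise.all([deps.dripFeed(result), deps.speakTts(result)]);
  return result;
}

// Usage with stubs:
handleTrigger("account balance", {
  queryMcp: async (q) => `result for "${q}"`,
  dripFeed: async () => {},
  speakTts: async () => {},
}).then((r) => console.log(r)); // result for "account balance"
```

Injecting the dependencies this way also makes the orchestration testable without a live model or MCP server.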

Detailed technical writeup: gist
Code: vaos-voice-bridge

The idea of the model garbling to itself as it loads context is so sci-fi. It's the same thing humans do when they sub-vocally read notes to themselves.

Hi Rajarshi and team,

I've spent the last two weeks building extensively with PersonaPlex and wanted to share detailed feedback that I hope is useful for future model development.

First: the full-duplex experience is genuinely extraordinary. The responsiveness, natural backchanneling, interruption handling, and conversational dynamics are unlike anything else available. My team and I were consistently impressed by how natural the conversations felt. This technology is a step change.

However, we've hit fundamental limitations that prevent us from shipping a production-level product on PersonaPlex:

  1. Text prompt cannot control content. We confirmed (and found other developers reporting the same) that text_prompt is trained for persona shaping only. Our use case requires the model to reference specific facts from uploaded documents, deliver scripted openings, and follow structured conversation flows. The model consistently ignores content instructions while adopting persona attributes (name, role) correctly.

  2. The greeting behaviour is baked in. Regardless of prompt content, the model defaults to its trained greeting pattern. Instructions like "your first message must be exactly: [specific text]" are ignored.

  3. Strategic reasoning depth. For our use case, we wanted the model to push back on assumptions, connect insights across conversation topics, and deliver specific observations about the info we fed it. The 7B model handles general conversation well, but can't sustain this level of depth.

What we tried:

  • Shortening prompts to under 150 tokens (marginal improvement)
  • Appending structured data from pdf extraction (model references it inconsistently)
  • The drip-feed architecture via sendText() (this works for knowledge injection, but it doesn't proactively surface the knowledge)
  • Reconnection-based prompt updates mid-conversation (server doesn't handle rapid disconnect/reconnect gracefully)

We're now exploring a cascading pipeline (STT → Claude → TTS) for our product, which gives us the content control we need at the cost of full-duplex.
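For anyone weighing the same trade-off, a single cascading turn can be sketched like this. The `stt`, `llm`, and `tts` functions are placeholders for whichever providers you use; the sketch just shows where the content control comes from and why full duplex is lost:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// stt/llm/tts are placeholders for your providers; each turn is
// strictly sequential, which is what sacrifices full duplex.
async function cascadeTurn(
  audioIn: ArrayBuffer,
  history: Turn[],
  stt: (a: ArrayBuffer) => Promise<string>,
  llm: (h: Turn[]) => Promise<string>,
  tts: (t: string) => Promise<ArrayBuffer>,
): Promise<ArrayBuffer> {
  const userText = await stt(audioIn);
  history.push({ role: "user", content: userText });
  const reply = await llm(history); // full content control lives here
  history.push({ role: "assistant", content: reply });
  return tts(reply);
}

// Usage with stubs:
const history: Turn[] = [];
cascadeTurn(
  new ArrayBuffer(0),
  history,
  async () => "hi",
  async (h) => `echo: ${h[h.length - 1].content}`,
  async (t) => new ArrayBuffer(t.length),
).then((audio) => console.log(audio.byteLength)); // 8
```

Because the LLM sees the full `history` as text, scripted openings and document facts can simply be placed in the prompt — exactly what text_prompt could not do.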

For future models, what would unlock PersonaPlex for our use cases:

  • Content injection training: fine-tune on patterns like "relay this information to the user" and "reference these facts in conversation"
  • Longer context window: 163 seconds is tight even for simple use cases
  • Tool-calling support: the ability to pause, call an external API, and incorporate results
  • Controllable opening: let developers script the first utterance

We're applying for Nemotron 3 VoiceChat early access; it looks like it could address several of these limitations.

Happy to share our full technical findings and test transcripts, or to discuss further. We're genuine advocates for this technology and want to see it succeed.
