Update README.md
Hindi-Moshi is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking. It was trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.

<p align="center">
  <img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/>
</p>

## Model Details

| | |
|---|---|
| **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| **License** | CC-BY-4.0 |

## What was changed from base Moshi
The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:

- `depformer.emb.0` – text token embedding in the Depth Transformer
- `text_linear` – text output projection layer

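
The selective reinitialisation can be sketched as follows. This is a minimal stdlib illustration of the selection logic, not the actual training code; `d_model`, the Gaussian initialisation scale, and the nested-list weight representation are assumptions made for the sketch.

```python
import random

HINDI_VOCAB = 32_000  # Hindi SentencePiece vocabulary size (from this card)

# Vocabulary-dependent parameters named above; everything else keeps
# its pre-trained values.
REINIT_NAMES = ("depformer.emb.0", "text_linear")

def reinit_vocab_params(state, d_model, std=0.02, seed=0):
    """Return a copy of `state` where only the vocabulary-dependent
    matrices are replaced by fresh (HINDI_VOCAB x d_model) Gaussian
    matrices; all other entries are passed through untouched."""
    rng = random.Random(seed)
    out = dict(state)
    for name in REINIT_NAMES:
        out[name] = [[rng.gauss(0.0, std) for _ in range(d_model)]
                     for _ in range(HINDI_VOCAB)]
    return out
```
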
All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
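
As a consistency check, the codec figures quoted in this card fit together. The 8 residual codebooks and the 2048-entry (11-bit) codebook size are assumptions taken from the upstream Moshi/Mimi release, not stated here:

```python
FRAME_RATE_HZ = 12.5   # Mimi token rate quoted in this card
SAMPLE_RATE = 24_000   # Mimi input sample rate
CODEBOOKS = 8          # residual codebooks (assumed from upstream Mimi)
BITS_PER_CODE = 11     # log2 of an assumed 2048-entry codebook

# 1920 audio samples are compressed into each 12.5 Hz token step,
# and the total token bitrate comes out at the quoted 1.1 kbps.
samples_per_frame = int(SAMPLE_RATE / FRAME_RATE_HZ)
bitrate_kbps = FRAME_RATE_HZ * CODEBOOKS * BITS_PER_CODE / 1000
print(samples_per_frame, bitrate_kbps)  # 1920 1.1
```
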
For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037).

## Training

### Two-stage training recipe

**Stage 1 – Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
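
These figures are mutually consistent, which is worth a quick check (the per-sample segment length is inferred from the batch size, not stated in the card):

```python
CORPUS_HOURS = 26_000    # Stage 1 corpus
BATCH_SIZE = 64          # effective batch size
HOURS_PER_UPDATE = 2.9   # audio seen per optimiser step

# Each of the 64 sequences in a batch carries roughly 163 s of audio,
# and one pass over the corpus takes ~9,000 steps, i.e. the "~10,000
# steps" quoted above for a single epoch.
seconds_per_sample = HOURS_PER_UPDATE * 3600 / BATCH_SIZE
steps_per_epoch = CORPUS_HOURS / HOURS_PER_UPDATE
print(round(seconds_per_sample), round(steps_per_epoch))  # 163 8966
```
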
**Stage 2 – Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
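
The split learning rates map naturally onto optimiser parameter groups. A stdlib sketch of the grouping logic (the `depformer.` prefix for Depth Transformer parameters is an assumption generalised from `depformer.emb.0` above; in real training the returned groups would be handed to AdamW):

```python
TEMPORAL_LR = 2e-6  # Temporal Transformer fine-tuning LR
DEPTH_LR = 4e-6     # Depth Transformer fine-tuning LR

def split_lr_groups(param_names):
    """Partition parameter names into two optimiser groups by module
    prefix, mirroring the Stage 2 split learning rates."""
    depth = [n for n in param_names if n.startswith("depformer.")]
    temporal = [n for n in param_names if not n.startswith("depformer.")]
    return [{"params": temporal, "lr": TEMPORAL_LR},
            {"params": depth, "lr": DEPTH_LR}]
```
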
### Human Evaluation
63 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.

| Dataset | Ratings | Female | Male | 18–25 | 25–30 | 30–35 |
|---|---|---|---|---|---|---|
| Speech Dialogue Eval. | 2,125 | 34 | 29 | 28 | 19 | 8 |

**Perceptual quality:**

| Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
|---|---|---|---|---|---|
| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
| Clarity | 4.05 | 3.04 | – | – | – |
Generated speech achieves high perceptual quality, with naturalness scores approaching human speech and most pairwise comparisons resulting in ties.
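
The preference and tie percentages above can be derived from paired ratings as sketched below (the exact aggregation used in the study is not specified; this is one straightforward reading, where each pair counts as a win for whichever side scored higher):

```python
def summarise_pairs(ratings):
    """Summarise (human, model) 5-point Likert pairs into mean scores
    and win/tie rates, as percentages of all rated pairs."""
    n = len(ratings)
    human_mean = sum(h for h, _ in ratings) / n
    model_mean = sum(m for _, m in ratings) / n
    human_pref = 100 * sum(h > m for h, m in ratings) / n
    model_pref = 100 * sum(m > h for h, m in ratings) / n
    tie = 100 * sum(h == m for h, m in ratings) / n
    return human_mean, model_mean, human_pref, model_pref, tie
```

Under this reading the three percentage columns sum to 100%, as they do in the naturalness row above.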
**Conversational rubric evaluation:**
Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.

| Rubric | Pass Rate |
|---|---|
| Human-like interaction | ~85% |
| Appropriateness (response follows prompt) | ~53% |
| Completion (response forms a complete reply) | ~42% |

While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge.
### Turn-Taking Analysis

Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.

| Hindi-Moshi | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
| Hindi-Moshi | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |

## Conversation Style
Hindi-Moshi is trained on **topic-driven conversations** - real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.
After an initial introduction, the model will typically **propose a topic and steer the conversation toward it**, preferring structured discussion over open-ended chitchat. Users can also **introduce their own topic** - the model will pick it up and engage in a focused discussion around it. This is an intentional design choice - the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics.
This makes the model particularly well-suited for **domain-specific conversational applications**. Our key finding is that the model's ability to stay on-topic emerges naturally from the structure of the training data alone - without any explicit prompting, reward shaping, or guardrails. This suggests that with sufficient hours of domain-specific conversational data, this approach can produce models that learn the conversational norms of virtually any domain - customer support, healthcare consultations, language tutoring, sales, therapy, and more - opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.

## Files
```
├── tokenizer-e351c8d8-checkpoint125.safetensors  # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                         # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                         # Vocabulary reference
├── hindi_moshi_architecture.svg                  # Architecture diagram
└── README.md
```
```bibtex
@article{hindimoshi2026,
  title       = {A Full-Duplex Conversational Modelling Framework in Hindi using Real-World Conversations},
  author      = {Bhaskar Singh and Shobhit Bhanga and Pranav},
  year        = {2026},
  institution = {JoshTalks}
}
```