bhaskarbuilds committed on
Commit ba280f6 · verified · 1 Parent(s): 520ac49

Update README.md

Files changed (1): README.md +40 -13

README.md CHANGED
@@ -19,6 +19,10 @@ pipeline_tag: audio-to-audio
 
 Hindi-Moshi is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking — trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.
 
 ## Model Details
 
 | | |
@@ -32,15 +36,7 @@ Hindi-Moshi is the first full-duplex spoken dialogue model for Hindi, built by a
 | **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
 | **License** | CC-BY-4.0 |
 
- ## Architecture
-
- Hindi-Moshi builds on the Moshi architecture comprising three components:
-
- **Mimi** is a neural audio codec that encodes 24 kHz speech into discrete tokens at 12.5 Hz using 8 codebook layers. Layer 1 captures semantic content while layers 2–8 capture acoustic detail. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55) and is frozen throughout training.
-
- **The RQ-Transformer** is a hierarchical architecture. The Temporal Transformer (7B parameters) models 17 parallel streams per timestep (1 text + 8 Moshi audio + 8 user audio). The Depth Transformer then autoregressively generates 16 audio tokens conditioned on the Temporal Transformer's hidden state.
-
- ### What was changed from base Moshi
 
 The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:
 
@@ -48,7 +44,9 @@ The original English SentencePiece tokenizer was replaced with a Hindi SentenceP
 - `depformer.emb.0` — text token embedding in the Depth Transformer
 - `text_linear` — text output projection layer
 
- All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values.
 
 ## Training
 
@@ -68,7 +66,7 @@ The stereo recording format with separate speaker channels enables direct learni
 
 ### Two-stage training recipe
 
- **Stage 1 — Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (\~2.9 hours of audio per update). Trained for 1 epoch (\~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
 
 **Stage 2 — Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
 
@@ -91,13 +89,33 @@ Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.
 
 ### Human Evaluation
 
- 130 native Hindi speakers evaluated audio samples on 5-point scales.
 
 | Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
 |---|---|---|---|---|---|
 | Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
 | Clarity | 4.05 | 3.04 | — | — | — |
 
 ### Turn-Taking Analysis
 
 Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.
@@ -109,6 +127,14 @@ Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.
 | Hindi-Moshi | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
 | Hindi-Moshi | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
 
 ## Files
 
 ```
@@ -116,6 +142,7 @@ Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.
 ├── tokenizer-e351c8d8-checkpoint125.safetensors # Mimi audio codec (frozen, from Moshi)
 ├── tokenizer_hindi.model # Hindi SentencePiece tokenizer
 ├── tokenizer_hindi.vocab # Vocabulary reference
 └── README.md
 ```
 
@@ -168,7 +195,7 @@ The model is intended for research in full-duplex spoken dialogue systems for Hi
 
 ```bibtex
 @article{hindimoshi2026,
-   title = {A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
   author = {Bhaskar Singh and Shobhit Bhanga and Pranav},
   year = {2026},
   institution = {JoshTalks}
 
 
 Hindi-Moshi is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking — trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.
 
+ <p align="center">
+   <img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/>
+ </p>
+
 ## Model Details
 
 | | |
 
 | **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
 | **License** | CC-BY-4.0 |
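The table's codec bitrate can be sanity-checked from the token rate. A short sketch, assuming Mimi's 2048-entry codebooks (11 bits per token) — the codebook size comes from the Moshi release and is not stated in this card:

```python
import math

FRAME_RATE_HZ = 12.5   # Mimi token frame rate (from the table above)
CODEBOOKS = 8          # residual codebook layers per frame
CODEBOOK_SIZE = 2048   # assumed entries per codebook -> 11 bits each

bits_per_second = FRAME_RATE_HZ * CODEBOOKS * math.log2(CODEBOOK_SIZE)
print(bits_per_second / 1000, "kbps")  # 1.1 kbps, matching the table
```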
 
+ ## What was changed from base Moshi
 
 The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:
 
 - `depformer.emb.0` — text token embedding in the Depth Transformer
 - `text_linear` — text output projection layer
 
+ All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
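Swapping the tokenizer while keeping the rest of the checkpoint amounts to reinitialising only the vocabulary-sized tensors. A minimal PyTorch sketch of the idea — module names mirror the groups listed above, but the width and init scheme here are illustrative, not the actual Moshi layout:

```python
import torch.nn as nn

VOCAB = 32000  # Hindi SentencePiece vocabulary size (from the card)
DIM = 512      # illustrative model width; the real 7B model is far wider

# Hypothetical stand-ins for the three vocabulary-dependent groups:
text_emb = nn.Embedding(VOCAB, DIM)              # text embedding (Temporal Transformer)
depformer_emb0 = nn.Embedding(VOCAB, DIM)        # depformer.emb.0
text_linear = nn.Linear(DIM, VOCAB, bias=False)  # text output projection

# Fresh init for the new vocabulary; every other weight would be
# loaded unchanged from the pre-trained Moshi checkpoint.
for w in (text_emb.weight, depformer_emb0.weight, text_linear.weight):
    nn.init.normal_(w, mean=0.0, std=DIM ** -0.5)
```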
+
+ For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037).
 
 ## Training
 
 
 
 ### Two-stage training recipe
 
+ **Stage 1 — Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
 
 **Stage 2 — Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
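The two optimiser setups translate directly to PyTorch parameter groups. A sketch with toy modules standing in for the two transformers (the hyperparameters are the ones listed above; the modules are placeholders):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the Temporal and Depth Transformers.
temporal = nn.Linear(16, 16)
depth = nn.Linear(16, 16)

# Stage 1: one learning rate for everything, matching Moshi pre-training.
stage1_opt = torch.optim.AdamW(
    list(temporal.parameters()) + list(depth.parameters()),
    lr=3e-5, betas=(0.9, 0.95), weight_decay=0.1,
)

# Stage 2: split learning rates via per-component parameter groups.
stage2_opt = torch.optim.AdamW(
    [
        {"params": temporal.parameters(), "lr": 2e-6},
        {"params": depth.parameters(), "lr": 4e-6},
    ],
    betas=(0.9, 0.95), weight_decay=0.1,
)
```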
 
 
 
 ### Human Evaluation
 
+ 63 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: human, Voice B: model) rated on 5-point Likert scales for naturalness and clarity.
+
+ | Dataset | Ratings | Female | Male | Age 18–25 | Age 25–30 | Age 30–35 |
+ |---|---|---|---|---|---|---|
+ | Speech Dialogue Eval. | 2,125 | 34 | 29 | 28 | 19 | 8 |
+
+ **Perceptual quality:**
 
 | Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
 |---|---|---|---|---|---|
 | Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
 | Clarity | 4.05 | 3.04 | — | — | — |
 
+ Generated speech achieves high perceptual quality, with naturalness scores approaching those of human speech and most pairwise comparisons resulting in ties.
+
+ **Conversational rubric evaluation:**
+
+ Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.
+
+ | Rubric | Pass Rate |
+ |---|---|
+ | Human-like interaction | ≈85% |
+ | Appropriateness (response follows prompt) | ≈53% |
+ | Completion (response forms a complete reply) | ≈42% |
+
+ While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remain ongoing challenges.
 
 ### Turn-Taking Analysis
 
 Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.
 
 | Hindi-Moshi | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
 | Hindi-Moshi | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
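Sampling temperature rescales the token logits before the softmax: τ below 1 sharpens the distribution, τ above 1 flattens it, which is why the turn-taking statistics above shift with τ. A generic, dependency-free sketch of the mechanism (not Moshi's actual sampling code):

```python
import math
import random

def temperature_probs(logits, tau):
    """Softmax over logits / tau; lower tau concentrates probability mass."""
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample(logits, tau=0.9, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(temperature_probs(logits, tau)):
        acc += p
        if r <= acc:
            return i
    return len(logits) - 1
```

At τ=0.9, the setting reported above, the distribution is slightly sharpened relative to the raw logits, which in this evaluation gave turn-taking dynamics closest to the ground-truth conversations.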
 
+ ## Conversation Style
+
+ Hindi-Moshi is trained on **topic-driven conversations**: real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.
+
+ After an initial introduction, the model will typically **propose a topic and steer the conversation toward it**, preferring structured discussion over open-ended chitchat. Users can also **introduce their own topic**, and the model will pick it up and engage in a focused discussion around it. This is an intentional design choice: the training data consists of real conversations in which speakers engage in focused, in-depth discussions on assigned topics.
+
+ This makes the model particularly well suited for **domain-specific conversational applications**. Our key finding is that the model's ability to stay on topic emerges naturally from the structure of the training data alone, without any explicit prompting, reward shaping, or guardrails. This suggests that, given sufficient hours of domain-specific conversational data, the same approach can produce models that learn the conversational norms of virtually any domain (customer support, healthcare consultations, language tutoring, sales, therapy, and more), opening a direct path from curated conversations to deployable real-world voice agents. Exploring this is an active direction of our future work.
+
 ## Files
 
 ```
 
 ├── tokenizer-e351c8d8-checkpoint125.safetensors # Mimi audio codec (frozen, from Moshi)
 ├── tokenizer_hindi.model # Hindi SentencePiece tokenizer
 ├── tokenizer_hindi.vocab # Vocabulary reference
+ ├── hindi_moshi_architecture.svg # Architecture diagram
 └── README.md
 ```
 
 ```bibtex
 @article{hindimoshi2026,
+   title = {A Full-Duplex Conversational Modelling Framework in Hindi using Real-World Conversations},
   author = {Bhaskar Singh and Shobhit Bhanga and Pranav},
   year = {2026},
   institution = {JoshTalks}