Update README.md
Hindi-Moshi is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking. It was trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.

<p align="center">
  <img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/>
</p>

## Model Details

| | |
|---|---|
| **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| **License** | CC-BY-4.0 |

## What was changed from base Moshi
The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:

- `depformer.emb.0` – text token embedding in the Depth Transformer
- `text_linear` – text output projection layer

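
The selective reinitialisation can be sketched as follows. This is a minimal stdlib illustration of the selection logic, not the actual training code; `d_model`, the Gaussian initialisation scale, and the nested-list weight representation are assumptions made for the sketch.

```python
import random

HINDI_VOCAB = 32_000  # Hindi SentencePiece vocabulary size (from this card)

# Vocabulary-dependent parameters named above; everything else keeps
# its pre-trained values.
REINIT_NAMES = ("depformer.emb.0", "text_linear")

def reinit_vocab_params(state, d_model, std=0.02, seed=0):
    """Return a copy of `state` where only the vocabulary-dependent
    matrices are replaced by fresh (HINDI_VOCAB x d_model) Gaussian
    matrices; all other entries are passed through untouched."""
    rng = random.Random(seed)
    out = dict(state)
    for name in REINIT_NAMES:
        out[name] = [[rng.gauss(0.0, std) for _ in range(d_model)]
                     for _ in range(HINDI_VOCAB)]
    return out
```
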
All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
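
As a consistency check, the codec figures quoted in this card fit together. The 8 residual codebooks and the 2048-entry (11-bit) codebook size are assumptions taken from the upstream Moshi/Mimi release, not stated here:

```python
FRAME_RATE_HZ = 12.5   # Mimi token rate quoted in this card
SAMPLE_RATE = 24_000   # Mimi input sample rate
CODEBOOKS = 8          # residual codebooks (assumed from upstream Mimi)
BITS_PER_CODE = 11     # log2 of an assumed 2048-entry codebook

# 1920 audio samples are compressed into each 12.5 Hz token step,
# and the total token bitrate comes out at the quoted 1.1 kbps.
samples_per_frame = int(SAMPLE_RATE / FRAME_RATE_HZ)
bitrate_kbps = FRAME_RATE_HZ * CODEBOOKS * BITS_PER_CODE / 1000
print(samples_per_frame, bitrate_kbps)  # 1920 1.1
```
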
For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037).

## Training

### Two-stage training recipe

**Stage 1 – Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
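
These figures are mutually consistent, which is worth a quick check (the per-sample segment length is inferred from the batch size, not stated in the card):

```python
CORPUS_HOURS = 26_000    # Stage 1 corpus
BATCH_SIZE = 64          # effective batch size
HOURS_PER_UPDATE = 2.9   # audio seen per optimiser step

# Each of the 64 sequences in a batch carries roughly 163 s of audio,
# and one pass over the corpus takes ~9,000 steps, i.e. the "~10,000
# steps" quoted above for a single epoch.
seconds_per_sample = HOURS_PER_UPDATE * 3600 / BATCH_SIZE
steps_per_epoch = CORPUS_HOURS / HOURS_PER_UPDATE
print(round(seconds_per_sample), round(steps_per_epoch))  # 163 8966
```
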
**Stage 2 – Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
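
The split learning rates map naturally onto optimiser parameter groups. A stdlib sketch of the grouping logic (the `depformer.` prefix for Depth Transformer parameters is an assumption generalised from `depformer.emb.0` above; in real training the returned groups would be handed to AdamW):

```python
TEMPORAL_LR = 2e-6  # Temporal Transformer fine-tuning LR
DEPTH_LR = 4e-6     # Depth Transformer fine-tuning LR

def split_lr_groups(param_names):
    """Partition parameter names into two optimiser groups by module
    prefix, mirroring the Stage 2 split learning rates."""
    depth = [n for n in param_names if n.startswith("depformer.")]
    temporal = [n for n in param_names if not n.startswith("depformer.")]
    return [{"params": temporal, "lr": TEMPORAL_LR},
            {"params": depth, "lr": DEPTH_LR}]
```
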
### Human Evaluation
63 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.

| Dataset | Ratings | Female | Male | 18–25 | 25–30 | 30–35 |
|---|---|---|---|---|---|---|
| Speech Dialogue Eval. | 2,125 | 34 | 29 | 28 | 19 | 8 |

**Perceptual quality:**

| Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
|---|---|---|---|---|---|
| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
| Clarity | 4.05 | 3.04 | – | – | – |
Generated speech achieves high perceptual quality, with naturalness scores approaching human speech and most pairwise comparisons resulting in ties.
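
The preference and tie percentages above can be derived from paired ratings as sketched below (the exact aggregation used in the study is not specified; this is one straightforward reading, where each pair counts as a win for whichever side scored higher):

```python
def summarise_pairs(ratings):
    """Summarise (human, model) 5-point Likert pairs into mean scores
    and win/tie rates, as percentages of all rated pairs."""
    n = len(ratings)
    human_mean = sum(h for h, _ in ratings) / n
    model_mean = sum(m for _, m in ratings) / n
    human_pref = 100 * sum(h > m for h, m in ratings) / n
    model_pref = 100 * sum(m > h for h, m in ratings) / n
    tie = 100 * sum(h == m for h, m in ratings) / n
    return human_mean, model_mean, human_pref, model_pref, tie
```

Under this reading the three percentage columns sum to 100%, as they do in the naturalness row above.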
**Conversational rubric evaluation:**
Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.

| Rubric | Pass Rate |
|---|---|
| Human-like interaction | ~85% |
| Appropriateness (response follows prompt) | ~53% |
| Completion (response forms a complete reply) | ~42% |

While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge.
### Turn-Taking Analysis

Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.

| Hindi-Moshi | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
| Hindi-Moshi | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |

## Conversation Style
Hindi-Moshi is trained on **topic-driven conversations** - real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.
After an initial introduction, the model will typically **propose a topic and steer the conversation toward it**, preferring structured discussion over open-ended chitchat. Users can also **introduce their own topic** - the model will pick it up and engage in a focused discussion around it. This is an intentional design choice - the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics.
This makes the model particularly well-suited for **domain-specific conversational applications**. Our key finding is that the model's ability to stay on-topic emerges naturally from the structure of the training data alone - without any explicit prompting, reward shaping, or guardrails. This suggests that with sufficient hours of domain-specific conversational data, this approach can produce models that learn the conversational norms of virtually any domain - customer support, healthcare consultations, language tutoring, sales, therapy, and more - opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.

## Files
```
├── tokenizer-e351c8d8-checkpoint125.safetensors  # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                         # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                         # Vocabulary reference
├── hindi_moshi_architecture.svg                  # Architecture diagram
└── README.md
```
```bibtex
@article{hindimoshi2026,
  title       = {A Full-Duplex Conversational Modelling Framework in Hindi using Real-World Conversations},
  author      = {Bhaskar Singh and Shobhit Bhanga and Pranav},
  year        = {2026},
  institution = {JoshTalks}
}
```