---
license: cc-by-4.0
language:
- hi
tags:
- moshi
- speech-to-speech
- hindi
- conversational-ai
- audio
- full-duplex
- duplex-dialogue
- indian-languages
base_model: kyutai/moshiko-pytorch-bf16
pipeline_tag: audio-to-audio
---

# Hindi-Moshi: A Full-Duplex Conversational Model for Hindi

Hindi-Moshi is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking. The model is trained on 26,000 hours of real, spontaneous Hindi conversations from 14,695 speakers.

## Model Details

| | |
|---|---|
| **Developed by** | Bhaskar Singh, Shobhit Bhanga, Pranav ([JoshTalks](https://joshtalks.com)) |
| **Base model** | [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16) |
| **Language** | Hindi (hi) |
| **Model type** | Full-duplex speech-to-speech dialogue |
| **Format** | SafeTensors (fp32) |
| **Tokenizer** | Custom Hindi SentencePiece (32,000-token vocabulary) |
| **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| **License** | CC-BY-4.0 |

## Architecture

Hindi-Moshi builds on the Moshi architecture, which comprises three components: the Mimi audio codec, a Temporal Transformer, and a Depth Transformer.

**Mimi** is a neural audio codec that encodes 24 kHz speech into discrete tokens at 12.5 Hz using 8 codebook layers. Layer 1 captures semantic content while layers 2–8 capture acoustic detail. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55) and is frozen throughout training.
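
The 1.1 kbps figure follows from the frame rate and codebook count. A quick sanity check, assuming 2048-entry codebooks (11 bits per token) — an assumption on our part, since the codebook size is not stated in this card:

```python
import math

# Back-of-the-envelope check of Mimi's quoted 1.1 kbps bitrate.
# The 2048-entry codebook size is an assumption, not from this card.
frame_rate_hz = 12.5              # token frames per second
num_codebooks = 8                 # 1 semantic + 7 acoustic layers
bits_per_token = math.log2(2048)  # 11 bits per codebook entry

bitrate_bps = frame_rate_hz * num_codebooks * bits_per_token
print(bitrate_bps)  # 1100.0 bits/s, i.e. 1.1 kbps
```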

**The RQ-Transformer** is a hierarchical architecture. The Temporal Transformer (7B parameters) models 17 parallel streams per timestep (1 text + 8 Moshi audio + 8 user audio). The Depth Transformer then autoregressively generates the 16 audio tokens for each timestep, conditioned on the Temporal Transformer's hidden state.
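
The stream layout above can be sketched numerically; the constant names below are illustrative, not taken from the codebase:

```python
# Per-timestep token layout: at each 80 ms frame (12.5 Hz), the
# Temporal Transformer sees one token per stream.
TEXT_STREAMS = 1
MOSHI_AUDIO_STREAMS = 8  # the model's own voice, 8 Mimi codebooks
USER_AUDIO_STREAMS = 8   # the user's voice, 8 Mimi codebooks

streams_per_step = TEXT_STREAMS + MOSHI_AUDIO_STREAMS + USER_AUDIO_STREAMS
tokens_per_second = streams_per_step * 12.5

# 16 of the 17 streams are audio; those are what the Depth Transformer emits.
depth_tokens_per_step = MOSHI_AUDIO_STREAMS + USER_AUDIO_STREAMS

print(streams_per_step, depth_tokens_per_step, tokens_per_second)
```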

### What was changed from base Moshi

The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000-token vocabulary) trained on a large Hindi text corpus. This required reinitialising three vocabulary-dependent parameter groups:

- `text_emb` – text token embedding in the Temporal Transformer
- `depformer.emb.0` – text token embedding in the Depth Transformer
- `text_linear` – text output projection layer

All audio processing components (the Mimi codec) and all remaining transformer weights retain their pre-trained values.
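
A minimal PyTorch sketch of this reinitialisation, not the actual training code: the hidden size `d_model` is illustrative (the real value is not stated here), and only the two Temporal Transformer layers are shown.

```python
import torch
from torch import nn

# Swap the vocabulary-dependent layers for a fresh 32,000-token Hindi
# vocabulary while leaving every other module untouched.
NEW_VOCAB = 32_000
d_model = 1024  # illustrative; not the model's actual hidden size

model = nn.ModuleDict({
    "text_emb": nn.Embedding(NEW_VOCAB, d_model),  # replaces English embedding
    "text_linear": nn.Linear(d_model, NEW_VOCAB),  # replaces output projection
})
# depformer.emb.0 would be replaced the same way inside the Depth Transformer.

for module in model.values():
    # Fresh init: pre-trained English rows carry no meaning for Hindi pieces.
    for p in module.parameters():
        nn.init.normal_(p, std=0.02)

print(model["text_emb"].weight.shape)  # torch.Size([32000, 1024])
```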

## Training

### Data

The model was trained on a purpose-built corpus of **26,000 hours** of real, spontaneous Hindi conversations. To our knowledge, this is the largest conversational speech corpus for any Indian language.

| Characteristic | Value |
|---|---|
| Total duration | 26,000 hours |
| Unique speakers | 14,695 |
| Recording type | Spontaneous, unscripted conversations |
| Channels | Stereo (one channel per speaker) |
| Quality control | Trained annotators + manual checks |

The stereo recording format, with a separate channel per speaker, lets the model learn turn-taking, overlaps, and backchannels directly from natural interactions, without requiring artificial speaker diarisation.

### Two-stage training recipe

**Stage 1 – Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training), AdamW with β₁ = 0.9, β₂ = 0.95, and weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80 GB GPUs.
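
The Stage 1 numbers are self-consistent, which is worth a quick check:

```python
# One epoch over 26,000 hours at ~2.9 hours of audio per optimiser
# update should land near the quoted ~10,000 steps.
total_hours = 26_000
hours_per_update = 2.9

steps_per_epoch = total_hours / hours_per_update
print(round(steps_per_epoch))  # 8966 -> on the order of the ~10,000 steps quoted
```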

**Stage 2 – Fine-tuning** on ~990 hours of curated, high-quality conversational data, with split learning rates: 2×10⁻⁶ for the Temporal Transformer and 4×10⁻⁶ for the Depth Transformer. The final checkpoint was selected at step 4,812, at the minimum total validation loss (3.370).

### Training infrastructure

8× NVIDIA H100 80 GB GPUs with bf16 mixed precision.

## Evaluation

### Perplexity

Perplexity is measured with Sarvam-1 (2B) on Whisper-v3 transcriptions of the generated speech; lower is better.

| Temperature | PPL ↓ |
|---|---|
| Ground-truth | 237.1 |
| Hindi-Moshi (τ=0.8) | 356.9 |
| Hindi-Moshi (τ=0.9) | 467.1 |
| Hindi-Moshi (τ=1.0) | 640.6 |

### Human Evaluation

130 native Hindi speakers rated audio samples on 5-point scales.

| Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
|---|---|---|---|---|---|
| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
| Clarity | 4.05 | 3.04 | – | – | – |

### Turn-Taking Analysis

Temperature τ=0.9 produces turn-taking dynamics closest to ground truth.

| Model | τ | IPU/min | Pause | Gap | Overlap |
|---|---|---|---|---|---|
| Ground-truth | – | 35.30 | 10.49 | 8.51 | 3.03 |
| Hindi-Moshi | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
| Hindi-Moshi | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
| Hindi-Moshi | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |

## Files

```
├── model.safetensors                              # Hindi-Moshi LM weights
├── tokenizer-e351c8d8-checkpoint125.safetensors   # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                          # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                          # Vocabulary reference
└── README.md
```

## Quick Start

### Install

```bash
pip install moshi huggingface_hub
```

Or from source:

```bash
git clone https://github.com/kyutai-labs/moshi
cd moshi && pip install -e .
```

### Download & Run

```bash
# Download all files
huggingface-cli download bhaskarbuilds/josh1 --local-dir ./hindi-moshi

# Run the server
uv run -m moshi.server \
  --hf-repo bhaskarbuilds/josh1 \
  --tokenizer hf://bhaskarbuilds/josh1/tokenizer_hindi.model \
  --host 0.0.0.0 \
  --static none
```

## Intended Use

The model is intended for research on full-duplex spoken dialogue systems for Hindi and other Indian languages. It can also serve as a conversational agent for casual Hindi conversation.

## Limitations

- Trained primarily on Hindi conversational speech; performance on other languages or domains is not guaranteed.
- Inherits the base Moshi architecture's limits on audio quality at the 1.1 kbps Mimi bitrate.
- Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. ~65% in English) because Devanagari encodes more phonemic content per token.
- Not intended for impersonation or any malicious use.
- This is a research model; we do not recommend it for giving advice or performing any professional duty.

## Citation

```bibtex
@techreport{hindimoshi2025,
  title       = {A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
  author      = {Bhaskar Singh and Shobhit Bhanga and Pranav},
  institution = {JoshTalks},
  year        = {2025}
}
```

## Acknowledgments

Built on [Moshi](https://github.com/kyutai-labs/moshi) by [Kyutai](https://kyutai.org/). We thank the 14,695 speakers who contributed to the Hindi conversational corpus.