---
license: cc-by-4.0
language:
- hi
tags:
- moshi
- speech-to-speech
- hindi
- conversational-ai
- audio
- full-duplex
- duplex-dialogue
- indian-languages
base_model: kyutai/moshiko-pytorch-bf16
pipeline_tag: audio-to-audio
---

# Hindi-Moshi: A Full-Duplex Conversational Model for Hindi

Hindi-Moshi is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking, and was trained on 26,000 hours of real, spontaneous Hindi conversations from 14,695 speakers.

## Model Details

| | |
|---|---|
| **Developed by** | Bhaskar Singh, Shobhit Bhanga, Pranav ([JoshTalks](https://joshtalks.com)) |
| **Base model** | [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16) |
| **Language** | Hindi (hi) |
| **Model type** | Full-duplex speech-to-speech dialogue |
| **Format** | SafeTensors (fp32) |
| **Tokenizer** | Custom Hindi SentencePiece (32,000-token vocabulary) |
| **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| **License** | CC-BY-4.0 |

## Architecture

Hindi-Moshi builds on the Moshi architecture, which comprises three components:

**Mimi** is a neural audio codec that encodes 24 kHz speech into discrete tokens at 12.5 Hz using 8 codebook layers. Layer 1 captures semantic content while layers 2–8 capture acoustic detail. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55) and is frozen throughout training.

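The 1.1 kbps figure follows from the codec parameters above. Assuming Mimi's published codebook size of 2,048 entries per layer (11 bits per token; the size itself is not stated in this card), a quick sanity check:

```python
import math

frame_rate_hz = 12.5   # Mimi token frames per second
codebooks = 8          # codebook layers per frame (1 semantic + 7 acoustic)
codebook_size = 2048   # entries per codebook (assumed from Mimi's release)

bits_per_token = math.log2(codebook_size)         # 11.0 bits
bitrate_bps = frame_rate_hz * codebooks * bits_per_token
print(f"{bitrate_bps / 1000:.1f} kbps")           # 1.1 kbps
```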
**The RQ-Transformer** is a hierarchical architecture. The Temporal Transformer (7B parameters) models 17 parallel streams per timestep (1 text + 8 Moshi audio + 8 user audio). The Depth Transformer then autoregressively generates the 16 audio tokens for each timestep, conditioned on the Temporal Transformer's hidden state.

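The stream counts are easiest to see laid out explicitly; this sketch only illustrates the 1 + 8 + 8 arithmetic (the stream ordering here is an assumption for clarity, not the model's actual index order):

```python
# Illustrative layout of the 17 parallel streams per timestep.
streams = (
    ["text"]
    + [f"moshi_audio_{k}" for k in range(8)]  # model's own audio codebooks
    + [f"user_audio_{k}" for k in range(8)]   # incoming user audio codebooks
)
print(len(streams))  # 17 streams modelled by the Temporal Transformer

# The Depth Transformer generates only the audio streams at each step.
audio_streams = [s for s in streams if s != "text"]
print(len(audio_streams))  # 16 tokens generated per timestep
```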
### What was changed from base Moshi

The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000-token vocabulary) trained on a large Hindi text corpus. This required reinitialising three vocabulary-dependent parameter groups:

- `text_emb`: text token embedding in the Temporal Transformer
- `depformer.emb.0`: text token embedding in the Depth Transformer
- `text_linear`: text output projection layer

All audio processing components (the Mimi codec) and all remaining transformer weights retain their pre-trained values.

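A minimal sketch of what that reinitialisation looks like, assuming a PyTorch module tree with the parameter names above; the hidden size and the normal-init scheme are illustrative assumptions, not the actual recipe:

```python
import torch.nn as nn

new_vocab = 32_000   # Hindi SentencePiece vocabulary
d_model = 512        # illustrative hidden size, not the real model dimension

# Hypothetical stand-ins for the three vocabulary-dependent parameter groups.
text_emb = nn.Embedding(new_vocab, d_model)              # Temporal Transformer text embedding
depformer_emb_0 = nn.Embedding(new_vocab, d_model)       # Depth Transformer text embedding
text_linear = nn.Linear(d_model, new_vocab, bias=False)  # text output projection

# Reinitialise these from scratch; every other pre-trained weight is kept as-is.
for module in (text_emb, depformer_emb_0, text_linear):
    nn.init.normal_(module.weight, mean=0.0, std=0.02)
```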
## Training

### Data

The model was trained on a purpose-built corpus of **26,000 hours** of real, spontaneous Hindi conversations, to our knowledge the largest conversational speech corpus for any Indian language.

| Characteristic | Value |
|---|---|
| Total duration | 26,000 hours |
| Unique speakers | 14,695 |
| Recording type | Spontaneous, unscripted conversations |
| Channels | Stereo (one channel per speaker) |
| Quality control | Trained annotators + manual checks |

The stereo recording format with one channel per speaker lets the model learn turn-taking, overlaps, and backchannels directly from natural interactions, without requiring artificial speaker diarisation.

### Two-stage training recipe

**Stage 1: Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.

**Stage 2: Fine-tuning** on ~990 hours of curated, high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer and 4×10⁻⁶ for the Depth Transformer. The final checkpoint was selected at step 4,812 based on minimum total validation loss (3.370).

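The Stage 1 numbers can be cross-checked against each other: at ~2.9 hours of audio per update, one epoch over 26,000 hours lands near the reported ~10,000 steps.

```python
corpus_hours = 26_000
hours_per_step = 2.9           # effective batch size of 64

steps_per_epoch = corpus_hours / hours_per_step
print(round(steps_per_epoch))  # ~9,000 steps, consistent with "~10,000"

# Implied average length of each of the 64 batch segments (an inference,
# not a figure stated in the recipe):
segment_minutes = hours_per_step * 60 / 64
print(round(segment_minutes, 1))  # ~2.7 minutes of audio per segment
```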
### Training infrastructure

8× NVIDIA H100 80GB GPUs with bf16 mixed precision.

## Evaluation

### Perplexity

Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.

| System | PPL ↓ |
|---|---|
| Ground-truth | 237.1 |
| Hindi-Moshi (τ=0.8) | 356.9 |
| Hindi-Moshi (τ=0.9) | 467.1 |
| Hindi-Moshi (τ=1.0) | 640.6 |

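Perplexity here is the exponential of the mean per-token negative log-likelihood that the scoring model assigns to the transcription. A toy illustration of the computation (the NLL values below are made up, not from this evaluation):

```python
import math

# Made-up per-token negative log-likelihoods from a scoring LM.
token_nlls = [5.2, 6.1, 5.8, 6.5]

mean_nll = sum(token_nlls) / len(token_nlls)
ppl = math.exp(mean_nll)
print(round(ppl, 1))  # 365.0, the same order of magnitude as the table above
```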
92
+ ### Human Evaluation
93
+
94
+ 130 native Hindi speakers evaluated audio samples on 5-point scales.
95
+
96
+ | Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
97
+ |---|---|---|---|---|---|
98
+ | Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
99
+ | Clarity | 4.05 | 3.04 | β€” | β€” | β€” |
100
+
101
+ ### Turn-Taking Analysis
102
+
103
+ Temperature Ο„=0.9 produces turn-taking dynamics closest to ground-truth.
104
+
105
+ | Model | Ο„ | IPU/min | Pause | Gap | Overlap |
106
+ |---|---|---|---|---|---|
107
+ | Ground-truth | β€” | 35.30 | 10.49 | 8.51 | 3.03 |
108
+ | Hindi-Moshi | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
109
+ | Hindi-Moshi | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
110
+ | Hindi-Moshi | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
111
+
## Files

```
├── model.safetensors                              # Hindi-Moshi LM weights
├── tokenizer-e351c8d8-checkpoint125.safetensors   # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                          # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                          # Vocabulary reference
└── README.md
```

## Quick Start

### Install

```bash
pip install moshi huggingface_hub
```

Or from source:

```bash
git clone https://github.com/kyutai-labs/moshi
cd moshi && pip install -e .
```

### Download & Run

```bash
# Download all files
huggingface-cli download bhaskarbuilds/josh1 --local-dir ./hindi-moshi

# Run the server
uv run -m moshi.server \
  --hf-repo bhaskarbuilds/josh1 \
  --tokenizer hf://bhaskarbuilds/josh1/tokenizer_hindi.model \
  --host 0.0.0.0 \
  --static none
```

## Intended Use

The model is intended for research on full-duplex spoken dialogue systems for Hindi and other Indian languages. It can also serve as a conversational agent for casual Hindi conversation.

## Limitations

- Trained primarily on Hindi conversational speech; performance on other languages or domains is not guaranteed.
- Inherits the base Moshi architecture's audio-quality limitations at the 1.1 kbps bitrate.
- Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) because Devanagari encodes more phonemic content per token.
- Not intended for impersonation or any malicious use.
- This model is for research purposes; we do not recommend it for providing advice or performing any professional duty.

## Citation

```bibtex
@techreport{hindimoshi2025,
  title       = {A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
  author      = {Bhaskar Singh and Shobhit Bhanga and Pranav},
  year        = {2025},
  institution = {JoshTalks}
}
```

## Acknowledgments

Built on [Moshi](https://github.com/kyutai-labs/moshi) by [Kyutai](https://kyutai.org/). We thank the 14,695 speakers who contributed to the Hindi conversational corpus.