---
license: cc-by-4.0
language:
  - hi
tags:
  - moshi
  - speech-to-speech
  - hindi
  - conversational-ai
  - audio
  - full-duplex
  - duplex-dialogue
  - indian-languages
base_model: kyutai/moshiko-pytorch-bf16
pipeline_tag: audio-to-audio
---

# Human-1: A Full-Duplex Conversational Model for Hindi
**πŸŽ™οΈ [Try the live demo β†’](https://ai.joshtalks.com/research/human-1)** | **πŸ“„ [Paper β†’](https://storage.googleapis.com/josh-frontend-asset/human-1.pdf)**

Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking — trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.

<p align="center">
  <img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/>
</p>

## Model Details

| | |
|---|---|
| **Developed by** | Bhaskar Singh, Shobhit Banga, Pranav Sharma — [JoshTalks](https://joshtalks.com) |
| **Base model** | [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16) |
| **Language** | Hindi (hi) |
| **Model type** | Full-duplex speech-to-speech dialogue |
| **Format** | SafeTensors (fp32) |
| **Tokenizer** | Custom Hindi SentencePiece (32,000 vocabulary) |
| **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| **License** | CC-BY-4.0 |

## What was changed from base Moshi

The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:

- `text_emb` — text token embedding in the Temporal Transformer
- `depformer.emb.0` — text token embedding in the Depth Transformer
- `text_linear` — text output projection layer

All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
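The swap can be sketched in PyTorch. The module names below mirror the three parameter groups listed above, but the class is a toy stand-in: the old vocabulary size and hidden dimension are illustrative values, not the real Moshi configuration.

```python
import torch.nn as nn

# Illustrative sizes only; the real Moshi hidden size and the original
# English vocabulary size are not taken from this card.
OLD_VOCAB, NEW_VOCAB, D_MODEL = 8000, 32000, 64

class TextHeads(nn.Module):
    """Stand-in holding just the three vocabulary-dependent groups."""
    def __init__(self, vocab, d_model):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d_model)       # Temporal Transformer
        self.depformer_emb = nn.Embedding(vocab, d_model)  # Depth Transformer
        self.text_linear = nn.Linear(d_model, vocab)       # output projection

def reinit_for_new_vocab(model, new_vocab, d_model):
    # Only the vocabulary-dependent tensors are freshly initialised; in
    # the full model every other weight (attention, MLPs, audio path,
    # and the frozen Mimi codec) keeps its pre-trained values.
    model.text_emb = nn.Embedding(new_vocab, d_model)
    model.depformer_emb = nn.Embedding(new_vocab, d_model)
    model.text_linear = nn.Linear(d_model, new_vocab)
    return model

model = reinit_for_new_vocab(TextHeads(OLD_VOCAB, D_MODEL), NEW_VOCAB, D_MODEL)
print(model.text_linear.out_features)  # 32000
```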

For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037).

## Training

### Data

The model was trained on a purpose-built corpus of **26,000 hours** of real Hindi spontaneous conversations — to our knowledge, the largest conversational speech corpus for any Indian language.

| Characteristic | Value |
|---|---|
| Total duration | 26,000 hours |
| Unique speakers | 14,695 |
| Recording type | Spontaneous, unscripted conversations |
| Channels | Stereo (separate per speaker) |
| Quality control | Trained annotators + manual checks |

The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions — without requiring artificial speaker diarisation.

### Two-stage training recipe

**Stage 1 — Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
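The Stage-1 figures are mutually consistent, as a quick back-of-envelope check shows (the per-sample segment length is inferred from the batch figures, not stated in the card):

```python
# Back-of-envelope check of the Stage-1 numbers.
corpus_hours = 26_000
batch_hours = 2.9      # effective batch of 64 sequences per update
batch_size = 64

seconds_per_sample = batch_hours * 3600 / batch_size
steps_per_epoch = corpus_hours / batch_hours

print(round(seconds_per_sample))  # 163 -> roughly 2.7-minute segments
print(round(steps_per_epoch))     # 8966 -> consistent with "~10,000 steps"
```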

**Stage 2 — Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
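Split learning rates like these are typically expressed as optimiser parameter groups. A minimal PyTorch sketch, with tiny linear layers standing in for the two transformer stacks; note that reusing the Stage-1 betas and weight decay here is an assumption, since the card only states them for pre-training:

```python
import torch

# Tiny stand-ins for the two transformer stacks (sizes are illustrative).
temporal_transformer = torch.nn.Linear(8, 8)
depth_transformer = torch.nn.Linear(8, 8)

# The Stage-2 split learning rates, expressed as AdamW parameter groups.
optimizer = torch.optim.AdamW(
    [
        {"params": temporal_transformer.parameters(), "lr": 2e-6},
        {"params": depth_transformer.parameters(), "lr": 4e-6},
    ],
    betas=(0.9, 0.95),   # assumed carried over from Stage 1
    weight_decay=0.1,    # assumed carried over from Stage 1
)
print([group["lr"] for group in optimizer.param_groups])  # [2e-06, 4e-06]
```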

### Training infrastructure

8× NVIDIA H100 80GB GPUs with bf16 mixed precision.

## Evaluation

### Perplexity

Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.

| System | PPL ↓ |
|---|---|
| Ground-truth | 237.1 |
| Human-1 (τ=0.8) | 356.9 |
| Human-1 (τ=0.9) | 467.1 |
| Human-1 (τ=1.0) | 640.6 |
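Perplexity here is the standard exponentiated negative mean per-token log-probability. A minimal sketch of the metric itself; the log-probabilities below are made-up illustrations, not Sarvam-1 outputs:

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up per-token log-probs for two hypothetical transcriptions;
# the more predictable text receives the lower perplexity.
fluent = [-4.0, -3.5, -4.2, -3.8]
disfluent = [-6.5, -7.1, -6.0, -6.8]
assert perplexity(fluent) < perplexity(disfluent)
```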

### Human Evaluation

130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.

**Perceptual quality:**

| Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
|---|---|---|---|---|---|
| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
| Clarity | 4.05 | 3.04 | — | — | — |

Generated speech achieves high perceptual quality, with naturalness scores approaching human speech and most pairwise comparisons resulting in ties.

**Conversational rubric evaluation:**

Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.

| Rubric | Pass Rate |
|---|---|
| Human-like interaction | ≈85% |
| Appropriateness (response follows prompt) | ≈53% |
| Completion (response forms a complete reply) | ≈42% |

While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge.

### Turn-Taking Analysis

Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.

| Model | τ | IPU/min | Pause | Gap | Overlap |
|---|---|---|---|---|---|
| Ground-truth | — | 35.30 | 10.49 | 8.51 | 3.03 |
| Human-1 | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
| Human-1 | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
| Human-1 | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
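The stereo format makes these statistics directly computable from per-channel voice activity. A minimal sketch of the conventional pause/gap/overlap bookkeeping on frame-level VAD tracks; the classification rules here are the standard definitions, not necessarily the paper's exact procedure:

```python
def turn_taking_stats(vad_a, vad_b):
    """Classify joint silences between two frame-level voice-activity
    tracks: a 'pause' if the same speaker resumes, a 'gap' if the floor
    passes to the other speaker; also count overlapping frames."""
    n = len(vad_a)
    overlap = sum(1 for a, b in zip(vad_a, vad_b) if a and b)
    pauses = gaps = 0
    i = 0
    while i < n:
        if vad_a[i] or vad_b[i]:
            i += 1
            continue
        j = i                                  # silent stretch [i, j)
        while j < n and not (vad_a[j] or vad_b[j]):
            j += 1
        # Who held the floor before and after the silence (None at edges).
        before = ("A" if vad_a[i - 1] else "B") if i > 0 else None
        after = ("A" if vad_a[j] else "B") if j < n else None
        if before and after:
            pauses += before == after
            gaps += before != after
        i = j
    return {"pauses": pauses, "gaps": gaps, "overlap_frames": overlap}

# Speaker A pauses mid-turn, then hands the floor over to B.
stats = turn_taking_stats([1, 1, 0, 0, 1, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0, 1, 1])
print(stats)  # {'pauses': 1, 'gaps': 1, 'overlap_frames': 0}
```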

## Conversation Style

Human-1 is trained on **topic-driven conversations**: real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.

After an initial introduction, the model will typically **propose a topic and steer the conversation toward it**, preferring structured discussion over open-ended chitchat. Users can also **introduce their own topic**, and the model will pick it up and engage in a focused discussion around it. This behaviour is an intentional design choice: the training data consists of real conversations in which speakers hold focused, in-depth discussions on assigned topics.

This makes the model particularly well suited to **domain-specific conversational applications**. Our key finding is that the model's ability to stay on topic emerges from the structure of the training data alone, without any explicit prompting, reward shaping, or guardrails. This suggests that, given sufficient hours of domain-specific conversational data, the same approach can produce models that learn the conversational norms of virtually any domain (customer support, healthcare consultations, language tutoring, sales, therapy, and more), opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.

## Files

```
├── model.safetensors                              # Human-1 LM weights
├── tokenizer-e351c8d8-checkpoint125.safetensors   # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                          # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                          # Vocabulary reference
├── hindi_moshi_architecture.svg                   # Architecture diagram
└── README.md
```

## Quick Start

### 1. Install uv

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

### 2. Create project and install dependencies

```bash
uv init human-1 && cd human-1
uv python install 3.12
uv python pin 3.12
uv add moshi huggingface_hub
```

### 3. Download the model

```bash
uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights
```

### 4. Run the server

```bash
uv run -m moshi.server \
    --moshi-weight ./weights/model.safetensors \
    --mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \
    --tokenizer ./weights/tokenizer_hindi.model
```

## Intended Use

The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations.

## Limitations

- Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed.
- Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate.
- Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token.
- Not intended for impersonation or any malicious use.
- This model is for research purposes. We do not recommend it for providing advice or performing any professional duty.

## Citation

```bibtex
@article{singh2026human1,
  title   = {Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
  author  = {Bhaskar Singh and Shobhit Banga and Pranav Sharma},
  year    = {2026},
  institution = {JoshTalks}
}
```

## Acknowledgments

Built on [Moshi](https://github.com/kyutai-labs/moshi) by [Kyutai](https://kyutai.org/). We thank the 14,695 speakers who contributed to the Hindi conversational corpus.