---
license: cc-by-4.0
language:
- hi
tags:
- moshi
- speech-to-speech
- hindi
- conversational-ai
- audio
- full-duplex
- duplex-dialogue
- indian-languages
base_model: kyutai/moshiko-pytorch-bf16
pipeline_tag: audio-to-audio
---
# Human-1: A Full-Duplex Conversational Model for Hindi
**[Try the live demo](https://ai.joshtalks.com/research/human-1)** | **[Paper](https://storage.googleapis.com/josh-frontend-asset/human-1.pdf)**
Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking, and was trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.
<p align="center">
<img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/>
</p>
## Model Details
| | |
|---|---|
| **Developed by** | Bhaskar Singh, Shobhit Banga, Pranav Sharma β [JoshTalks](https://joshtalks.com) |
| **Base model** | [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16) |
| **Language** | Hindi (hi) |
| **Model type** | Full-duplex speech-to-speech dialogue |
| **Format** | SafeTensors (fp32) |
| **Tokenizer** | Custom Hindi SentencePiece (32,000 vocabulary) |
| **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| **License** | CC-BY-4.0 |
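The 12.5 Hz frame rate and 1.1 kbps bitrate quoted for Mimi are mutually consistent; a quick arithmetic check, assuming Mimi's 8 codebooks of 2048 entries per frame as described in the Moshi paper (an assumption, not read from this checkpoint):

```python
import math

# Sanity check of the Mimi codec figures. The 8 codebooks of 2048 entries
# per frame are taken from the Moshi paper, so treat them as an assumption.
frame_rate_hz = 12.5
num_codebooks = 8
codebook_size = 2048

bits_per_frame = num_codebooks * math.log2(codebook_size)  # 8 * 11 = 88 bits
bitrate_bps = bits_per_frame * frame_rate_hz

print(bitrate_bps)  # 1100.0 bps, i.e. the 1.1 kbps quoted above
```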
## What was changed from base Moshi
The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:
- `text_emb` – text token embedding in the Temporal Transformer
- `depformer.emb.0` – text token embedding in the Depth Transformer
- `text_linear` – text output projection layer
All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
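Schematically, the swap amounts to partitioning the checkpoint into the three vocabulary-dependent groups named above (reinitialised for the Hindi tokenizer) and everything else (kept from Moshi). A minimal sketch, where the example key names are illustrative and the real checkpoint layout may differ:

```python
# Sketch: split checkpoint keys into vocabulary-dependent groups
# (reinitialised for the 32k Hindi vocabulary) and retained weights.
VOCAB_DEPENDENT_PREFIXES = ("text_emb", "depformer.emb.0", "text_linear")

def partition_keys(keys):
    """Return (reinitialised, kept) key lists by prefix match."""
    reinit = [k for k in keys if k.startswith(VOCAB_DEPENDENT_PREFIXES)]
    kept = [k for k in keys if not k.startswith(VOCAB_DEPENDENT_PREFIXES)]
    return reinit, kept

# Illustrative key names only.
keys = ["text_emb.weight", "depformer.emb.0.weight", "text_linear.weight",
        "transformer.layers.0.self_attn.in_proj.weight"]
reinit, kept = partition_keys(keys)
print(reinit)  # the three groups tied to the Hindi vocabulary
```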
For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037).
## Training
### Data
The model was trained on a purpose-built corpus of **26,000 hours** of real Hindi spontaneous conversations, to our knowledge the largest conversational speech corpus for any Indian language.
| Characteristic | Value |
|---|---|
| Total duration | 26,000 hours |
| Unique speakers | 14,695 |
| Recording type | Spontaneous, unscripted conversations |
| Channels | Stereo (separate per speaker) |
| Quality control | Trained annotators + manual checks |
The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions, without requiring automatic speaker diarisation.
### Two-stage training recipe
**Stage 1 – Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
**Stage 2 – Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. The optimal checkpoint was selected at step 4,812 based on minimum total validation loss (3.370).
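The Stage 1 figures are internally consistent; a back-of-the-envelope check (all numbers taken from the recipe above, the per-sequence length is derived, not stated):

```python
# Back-of-the-envelope check of the Stage 1 training figures.
corpus_hours = 26_000
batch_size = 64
hours_per_step = 2.9                 # effective batch of audio, as stated

seconds_per_sequence = hours_per_step * 3600 / batch_size
steps_per_epoch = corpus_hours / hours_per_step

print(round(seconds_per_sequence))   # ~163 s of audio per training sequence
print(round(steps_per_epoch))        # ~8966 steps, consistent with "~10,000"
```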
### Training infrastructure
8× NVIDIA H100 80GB GPUs with bf16 mixed precision.
## Evaluation
### Perplexity
Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.
| Temperature | PPL ↓ |
|---|---|
| Ground-truth | 237.1 |
| Human-1 (τ=0.8) | 356.9 |
| Human-1 (τ=0.9) | 467.1 |
| Human-1 (τ=1.0) | 640.6 |
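For reference, perplexity is the exponential of the mean negative log-likelihood per token. A minimal illustration with made-up log-probs (the table's values come from scoring Whisper-v3 transcriptions with Sarvam-1):

```python
import math

# Perplexity = exp of the mean negative log-likelihood per token.
def perplexity(token_log_probs):
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

log_probs = [-2.1, -0.4, -3.0, -1.2]   # hypothetical natural-log token probs
print(round(perplexity(log_probs), 2))
```

Lower perplexity means the scoring model finds the transcribed speech more predictable, which is why ground-truth speech scores lowest in the table.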
### Human Evaluation
130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.
**Perceptual quality:**
| Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
|---|---|---|---|---|---|
| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
| Clarity | 4.05 | 3.04 | – | – | – |
Generated speech achieves high perceptual quality, with naturalness scores approaching human speech and most pairwise comparisons resulting in ties.
**Conversational rubric evaluation:**
Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.
| Rubric | Pass Rate |
|---|---|
| Human-like interaction | ≈85% |
| Appropriateness (response follows prompt) | ≈53% |
| Completion (response forms a complete reply) | ≈42% |
While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge.
### Turn-Taking Analysis
Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.
| Model | τ | IPU/min | Pause | Gap | Overlap |
|---|---|---|---|---|---|
| Ground-truth | – | 35.30 | 10.49 | 8.51 | 3.03 |
| Human-1 | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
| Human-1 | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
| Human-1 | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
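A rough sketch of how such metrics are derived, under common definitions: an IPU (inter-pausal unit) is one speaker's speech with internal silences shorter than a threshold, and an overlap is time where both channels are active. The 0.2 s threshold and interval data below are illustrative assumptions; the exact thresholds behind the paper's numbers are not specified here.

```python
# Turn-taking metrics from per-speaker voice-activity (start, end) intervals.
def merge_ipus(segments, min_silence=0.2):
    """Merge one speaker's speech segments separated by < min_silence s."""
    ipus = []
    for start, end in sorted(segments):
        if ipus and start - ipus[-1][1] < min_silence:
            ipus[-1] = (ipus[-1][0], max(ipus[-1][1], end))
        else:
            ipus.append((start, end))
    return ipus

def overlap_time(a, b):
    """Total duration where the two speakers' segments intersect."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

spk_a = [(0.0, 1.5), (1.6, 3.0), (5.0, 6.0)]   # hypothetical VAD output
spk_b = [(2.8, 4.5)]
print(len(merge_ipus(spk_a)))               # 2 IPUs (first two segments merge)
print(round(overlap_time(spk_a, spk_b), 2))  # 0.2 s of overlapped speech
```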
## Conversation Style
Human-1 is trained on **topic-driven conversations**: real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.
After an initial introduction, the model will typically **propose a topic and steer the conversation toward it**, preferring structured discussion over open-ended chitchat. Users can also **introduce their own topic**; the model will pick it up and engage in a focused discussion around it. This is an intentional design choice: the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics.
This makes the model particularly well-suited for **domain-specific conversational applications**. Our key finding is that the model's ability to stay on-topic emerges from the structure of the training data alone, without any explicit prompting, reward shaping, or guardrails. This suggests that, given sufficient hours of domain-specific conversational data, the same approach can produce models that learn the conversational norms of virtually any domain (customer support, healthcare consultations, language tutoring, sales, therapy, and more), opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.
## Files
```
├── model.safetensors                              # Human-1 LM weights
├── tokenizer-e351c8d8-checkpoint125.safetensors   # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                          # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                          # Vocabulary reference
├── hindi_moshi_architecture.svg                   # Architecture diagram
└── README.md
```
## Quick Start
### 1. Install uv
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```
### 2. Create project and install dependencies
```bash
uv init human-1 && cd human-1
uv python install 3.12
uv python pin 3.12
uv add moshi huggingface_hub
```
### 3. Download the model
```bash
uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights
```
### 4. Run the server
```bash
uv run -m moshi.server \
--moshi-weight ./weights/model.safetensors \
--mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \
--tokenizer ./weights/tokenizer_hindi.model
```
## Intended Use
The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations.
## Limitations
- Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed.
- Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate.
- Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token.
- Not intended for impersonation or any malicious use.
- This model is for research purposes; we do not recommend using it to provide advice or to perform professional duties.
## Citation
```bibtex
@article{singh2026human1,
  title       = {Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
author = {Bhaskar Singh and Shobhit Banga and Pranav Sharma},
year = {2026},
institution = {JoshTalks}
}
```
## Acknowledgments
Built on [Moshi](https://github.com/kyutai-labs/moshi) by [Kyutai](https://kyutai.org/). We thank the 14,695 speakers who contributed to the Hindi conversational corpus. |