Human-1: A Full-Duplex Conversational Model for Hindi

πŸŽ™οΈ Try the live demo β†’ | πŸ“„ Paper β†’

Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting Kyutai's Moshi architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking, and was trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.

Hindi-Moshi Architecture

Model Details

| Field | Value |
|---|---|
| Developed by | Bhaskar Singh, Shobhit Banga, Pranav Sharma (JoshTalks) |
| Base model | kyutai/moshiko-pytorch-bf16 |
| Language | Hindi (hi) |
| Model type | Full-duplex speech-to-speech dialogue |
| Format | SafeTensors (fp32) |
| Tokenizer | Custom Hindi SentencePiece (32,000 vocabulary) |
| Audio codec | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| License | CC-BY-4.0 |

What was changed from base Moshi

The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:

  • text_emb: text token embedding in the Temporal Transformer
  • depformer.emb.0: text token embedding in the Depth Transformer
  • text_linear: text output projection layer

All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
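Concretely, the tokenizer swap amounts to state-dict surgery on those three parameter groups. A minimal sketch, assuming hypothetical layer widths (the real Moshi dimensions differ) and a standard normal embedding init (the actual initialisation recipe is not stated here):

```python
import torch
from torch import nn

# Hypothetical widths for illustration only; the real Moshi dimensions differ.
VOCAB = 32000   # new Hindi SentencePiece vocabulary size
D_TEMP = 512    # assumed Temporal Transformer width
D_DEP = 256     # assumed Depth Transformer width

# The three vocabulary-dependent parameter groups named above, mapped to the
# shape each must take under the new vocabulary.
VOCAB_DEPENDENT = {
    "text_emb.weight": (VOCAB, D_TEMP),        # Temporal Transformer text embedding
    "depformer.emb.0.weight": (VOCAB, D_DEP),  # Depth Transformer text embedding
    "text_linear.weight": (VOCAB, D_TEMP),     # text output projection
}

def swap_text_vocab(state_dict):
    """Reinitialise only the vocabulary-dependent tensors; every other
    weight (Mimi codec, transformer blocks) keeps its pre-trained value."""
    out = dict(state_dict)
    for name, shape in VOCAB_DEPENDENT.items():
        weight = torch.empty(shape)
        nn.init.normal_(weight, std=0.02)  # assumed init; not from the paper
        out[name] = weight
    return out
```

Every key not listed in `VOCAB_DEPENDENT` passes through untouched, which is what lets the audio path and the bulk of the transformer keep their pre-trained values.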

For full architecture details, see the Moshi paper.

Training

Data

The model was trained on a purpose-built corpus of 26,000 hours of real Hindi spontaneous conversations, which is, to our knowledge, the largest conversational speech corpus for any Indian language.

| Characteristic | Value |
|---|---|
| Total duration | 26,000 hours |
| Unique speakers | 14,695 |
| Recording type | Spontaneous, unscripted conversations |
| Channels | Stereo (separate channel per speaker) |
| Quality control | Trained annotators + manual checks |

The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions, without requiring artificial speaker diarisation.

Two-stage training recipe

Stage 1: Pre-training on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.

Stage 2: Fine-tuning on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
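In PyTorch, the two optimiser configurations above can be sketched with per-parameter-group learning-rate overrides. Toy linear modules stand in for the two transformers here; only the hyperparameters come from the recipe above:

```python
import torch
from torch import nn

# Toy stand-ins for the Temporal and Depth Transformers.
temporal = nn.Linear(8, 8)
depth = nn.Linear(8, 8)

# Stage 1: a single learning rate for all weights, matching Moshi pre-training.
stage1 = torch.optim.AdamW(
    list(temporal.parameters()) + list(depth.parameters()),
    lr=3e-5, betas=(0.9, 0.95), weight_decay=0.1,
)

# Stage 2: split learning rates via per-group overrides.
stage2 = torch.optim.AdamW(
    [
        {"params": temporal.parameters(), "lr": 2e-6},  # Temporal Transformer
        {"params": depth.parameters(), "lr": 4e-6},     # Depth Transformer
    ],
    betas=(0.9, 0.95), weight_decay=0.1,
)
```

As a sanity check on the stage-1 numbers: at ~2.9 hours of audio per update, one pass over 26,000 hours is roughly 26,000 / 2.9 ≈ 9,000 updates, consistent with the ~10,000 steps quoted above.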

Training infrastructure

8× NVIDIA H100 80GB GPUs with bf16 mixed precision.

Evaluation

Perplexity

Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.

| Source | PPL ↓ |
|---|---|
| Ground-truth | 237.1 |
| Human-1 (τ=0.8) | 356.9 |
| Human-1 (τ=0.9) | 467.1 |
| Human-1 (τ=1.0) | 640.6 |
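These values follow the standard definition of perplexity: the exponential of the mean per-token negative log-likelihood assigned by the judge LM (here Sarvam-1, scored on Whisper-v3 transcripts). A self-contained sketch with stand-in token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a uniform distribution over V tokens gives PPL exactly V.
uniform = [math.log(1 / 100)] * 20
print(round(perplexity(uniform)))  # → 100
```

Lower is better: a model that concentrates probability on the tokens that actually occur has a lower mean NLL, hence a lower PPL.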

Human Evaluation

130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.

Perceptual quality:

| Metric | Human score | Model score | Human preferred | Model preferred | Tie |
|---|---|---|---|---|---|
| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
| Clarity | 4.05 | 3.04 | — | — | — |

Generated speech achieves high perceptual quality, with naturalness scores approaching human speech and most pairwise comparisons resulting in ties.

Conversational rubric evaluation:

Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.

| Rubric | Pass rate |
|---|---|
| Human-like interaction | ≈85% |
| Appropriateness (response follows prompt) | ≈53% |
| Completion (response forms a complete reply) | ≈42% |

While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge.

Turn-Taking Analysis

Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.

| Model | τ | IPU/min | Pause | Gap | Overlap |
|---|---|---|---|---|---|
| Ground-truth | — | 35.30 | 10.49 | 8.51 | 3.03 |
| Human-1 | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
| Human-1 | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
| Human-1 | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
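Because the corpus keeps each speaker on a separate channel, statistics like these can be read directly off time-aligned speech segments. A simplified sketch, assuming the usual definitions (pause: silence between two segments of the same speaker; gap: silence at a speaker change; overlap: the next segment starts before the previous one ends):

```python
def turn_events(segments):
    """Count pauses, gaps, and overlaps from time-ordered speech segments.
    Each segment is (speaker, start, end) in seconds."""
    segs = sorted(segments, key=lambda s: s[1])
    pauses = gaps = overlaps = 0
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(segs, segs[1:]):
        if start_b < end_a:
            overlaps += 1   # next segment begins before the previous ends
        elif spk_a == spk_b:
            pauses += 1     # same speaker resumes after silence
        else:
            gaps += 1       # floor changes hands after silence
    return pauses, gaps, overlaps

# Speaker A pauses, then B takes the floor, then A overlaps B's turn.
demo = [("A", 0.0, 1.0), ("A", 1.4, 2.0), ("B", 2.5, 4.0), ("A", 3.8, 5.0)]
print(turn_events(demo))  # → (1, 1, 1)
```

The real analysis additionally needs an IPU (inter-pausal unit) segmentation threshold and per-minute normalisation, both of which are omitted here.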

Conversation Style

Human-1 is trained on topic-driven conversations: real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.

After an initial introduction, the model will typically propose a topic and steer the conversation toward it, preferring structured discussion over open-ended chitchat. Users can also introduce their own topic, and the model will pick it up and engage in a focused discussion around it. This is an intentional design choice: the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics.

This makes the model particularly well-suited for domain-specific conversational applications. Our key finding is that the model's ability to stay on-topic emerges from the structure of the training data alone, without any explicit prompting, reward shaping, or guardrails. This suggests that, given sufficient hours of domain-specific conversational data, the same approach can produce models that learn the conversational norms of virtually any domain (customer support, healthcare consultations, language tutoring, sales, therapy, and more), opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.

Files

├── model.safetensors                              # Human-1 LM weights
├── tokenizer-e351c8d8-checkpoint125.safetensors   # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                          # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                          # Vocabulary reference
├── hindi_moshi_architecture.svg                   # Architecture diagram
└── README.md

Quick Start

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

2. Create project and install dependencies

uv init human-1 && cd human-1
uv python install 3.12
uv python pin 3.12
uv add moshi huggingface_hub

3. Download the model

uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights

4. Run the server

uv run -m moshi.server \
    --moshi-weight ./weights/model.safetensors \
    --mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \
    --tokenizer ./weights/tokenizer_hindi.model

Intended Use

The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations.

Limitations

  • Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed.
  • Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate.
  • Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token.
  • Not intended for impersonation or any malicious use.
  • This model is for research purposes. We do not recommend it for providing advice or performing any professional duty.

Citation

@techreport{singh2026human1,
  title       = {Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
  author      = {Bhaskar Singh and Shobhit Banga and Pranav Sharma},
  year        = {2026},
  institution = {JoshTalks}
}

Acknowledgments

Built on Moshi by Kyutai. We thank the 14,695 speakers who contributed to the Hindi conversational corpus.
