ScrappyLabs Narrator TTS

A fine-tuned Qwen3-TTS 1.7B CustomVoice model producing a warm, expressive narrator voice optimized for storytelling, audiobooks, and conversational AI.

What Is This?

This is one of the first public fine-tunes of Qwen3-TTS. We took the 1.7B CustomVoice base model and trained it on 461 curated narrator-style audio samples to produce a consistent, high-quality storytelling voice without needing a reference audio clip at inference time.

Unlike voice cloning (which requires a reference WAV each time), this fine-tune bakes the voice identity into the model weights — just pass text in and get narrator audio out.

Audio Samples

Storytelling

"Once upon a time, in a forest older than memory, there lived a bear who collected stories the way other bears collected honey."

Suspense

"The door creaked open. Inside, the room was empty, except for a single envelope on the table, addressed to no one."

Educational

"The octopus has three hearts, blue blood, and the ability to change both the color and texture of its skin in less than a second."

Conversational

"Hey, you know what? Today was a good day. Not perfect, but good. And sometimes that's more than enough."

Key Features

  • No reference audio needed — the narrator voice is embedded in the weights
  • Instruct/emotion support — accepts style instructions (e.g., "warm and gentle", "excited", "mysterious whisper")
  • Multilingual capable — inherits the Qwen3-TTS base model's multilingual support (en, zh, ja, ko, de, fr, ru, pt, es, it)
  • Drop-in compatible — works with any Qwen3-TTS serving stack
  • Apache 2.0 — fully open, commercial use OK

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| Training samples | 461 WAV clips (varied lengths, diverse content) |
| Epochs | 5 |
| Batch size | 64 |
| Learning rate | 2e-5 |
| Warmup steps | 200 |
| Precision | bf16 |
| Gradient checkpointing | Enabled |
| Block size | 10240 |
| Hardware | NVIDIA RTX PRO 6000 (Blackwell, 48 GB VRAM) |
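For reference, the hyperparameters above map onto a typical fine-tuning config. The snippet below is an illustrative sketch only — the key names are assumptions, not the exact config file used for this run — and it also works out the implied step count:

```python
# Illustrative fine-tuning config mirroring the table above.
# Key names are assumptions, not the exact config used for this model.
training_config = {
    "base_model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "num_train_epochs": 5,
    "per_device_train_batch_size": 64,
    "learning_rate": 2e-5,
    "warmup_steps": 200,
    "bf16": True,
    "gradient_checkpointing": True,
    "block_size": 10240,
}

# With 461 samples and a batch size of 64, one epoch is ceil(461 / 64) = 8
# optimizer steps, so 5 epochs is roughly 40 steps.
steps_per_epoch = -(-461 // training_config["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_config["num_train_epochs"]
```

Note that 200 warmup steps exceeds the ~40 optimizer steps implied by these numbers, so in practice the warmup only fully plays out if gradient accumulation or data repetition extends the run.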

The training data consists of narrator-style speech samples covering a range of emotional expressions, pacing, and content types — from calm storytelling to dramatic narration.

Model Size

| Component | Size |
|---|---|
| Model weights (model.safetensors) | 3.8 GB |
| Speech tokenizer | 682 MB |
| Total | ~4.5 GB |

Usage

With Qwen3-TTS Server

If you're running a Qwen3-TTS server (like qwen3-tts-server), register this as a finetuned speaker and request it by name:

```python
import requests

response = requests.post("http://localhost:8109/v1/audio/speech", json={
    "input": "Once upon a time, in a land far far away, there lived a brave little bear.",
    "voice": "narrator"
})

with open("output.wav", "wb") as f:
    f.write(response.content)
```
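In application code it helps to build the request payload in one place and check the HTTP status before writing the file, so a failed request doesn't leave an error page saved as `output.wav`. A minimal sketch, assuming the same server and endpoint as above (`build_speech_payload` and `synthesize` are hypothetical helpers, not part of any official client):

```python
import requests

def build_speech_payload(text, voice="narrator", instruct=None):
    """Build the JSON body for the /v1/audio/speech endpoint.

    `instruct` is an optional style hint, e.g. "warm and gentle".
    """
    payload = {"input": text, "voice": voice}
    if instruct is not None:
        payload["instruct"] = instruct
    return payload

def synthesize(text, out_path, base_url="http://localhost:8109", **kwargs):
    """POST the payload and write the returned WAV bytes to out_path."""
    response = requests.post(
        f"{base_url}/v1/audio/speech",
        json=build_speech_payload(text, **kwargs),
        timeout=120,
    )
    response.raise_for_status()  # fail loudly instead of saving an error body
    with open(out_path, "wb") as f:
        f.write(response.content)
```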

With Instruct (Emotion Control)

```python
response = requests.post("http://localhost:8109/v1/audio/speech", json={
    "input": "And then, from the shadows... something moved.",
    "voice": "narrator",
    "instruct": "mysterious and suspenseful, slow pace"
})
```

Direct with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "scrappylabs/narrator-tts",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("scrappylabs/narrator-tts")
```

See the Qwen3-TTS documentation for full inference code.

Why Finetune vs. Voice Clone?

| | Voice Clone | Finetune |
|---|---|---|
| Reference audio needed | Yes (every request) | No |
| Voice consistency | Good | Excellent |
| Inference latency | Higher (processes reference) | Lower (voice is in weights) |
| Emotion/instruct control | Limited | Full support |
| Deployment size | Base model + WAV files | Single model checkpoint |

For production use cases where you need a consistent character voice across thousands of generations, finetuning wins.
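For long-form content such as audiobook chapters, a common pattern is to split the text into sentence-sized chunks, synthesize each chunk, and concatenate the audio. A minimal chunker sketch (the character limit is an assumption — tune it to whatever your serving stack handles well):

```python
import re

def chunk_text(text, max_chars=400):
    """Split text into chunks of whole sentences, each at most max_chars.

    A sentence longer than max_chars is emitted as its own chunk rather
    than being split mid-sentence.
    """
    # Naive split on ., !, ? followed by whitespace; good enough for
    # narration scripts, not a full sentence segmenter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be posted to the server individually and the resulting WAVs joined; for raw PCM WAVs at the same sample rate, the standard-library `wave` module is enough for the concatenation step.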

Use Cases

  • Audiobook narration
  • Children's story apps
  • Podcast intros/outros
  • AI assistant voice
  • Game narration
  • Any application needing a warm, consistent narrator

About ScrappyLabs

ScrappyLabs builds open-source voice AI tools. This model powers our production TTS pipeline, serving narrator voice across our apps including Poo Bear (an AI-powered teddy bear for kids).

License

Apache 2.0 — same as the base Qwen3-TTS model. Use it for anything.

Citation

If you use this model, a mention of ScrappyLabs is appreciated but not required.

```bibtex
@misc{scrappylabs-narrator-tts-2026,
  author = {ScrappyLabs},
  title = {ScrappyLabs Narrator TTS: A Qwen3-TTS 1.7B Voice Finetune},
  year = {2026},
  url = {https://huggingface.co/scrappylabs/narrator-tts}
}
```