# ScrappyLabs Narrator TTS
A fine-tuned Qwen3-TTS 1.7B CustomVoice model producing a warm, expressive narrator voice optimized for storytelling, audiobooks, and conversational AI.
## What Is This?
This is one of the first public fine-tunes of Qwen3-TTS. We took the 1.7B CustomVoice base model and trained it on 461 curated narrator-style audio samples to produce a consistent, high-quality storytelling voice without needing a reference audio clip at inference time.
Unlike voice cloning (which requires a reference WAV on every request), this finetune bakes the voice identity into the model weights: just pass text in and get narrator audio out.
## Audio Samples

### Storytelling

"Once upon a time, in a forest older than memory, there lived a bear who collected stories the way other bears collected honey."

### Suspense

"The door creaked open. Inside, the room was empty, except for a single envelope on the table, addressed to no one."

### Educational

"The octopus has three hearts, blue blood, and the ability to change both the color and texture of its skin in less than a second."

### Conversational

"Hey, you know what? Today was a good day. Not perfect, but good. And sometimes that's more than enough."
## Key Features

- No reference audio needed: the narrator voice is embedded in the weights
- Instruct/emotion support: accepts style instructions (e.g., "warm and gentle", "excited", "mysterious whisper")
- Multilingual capable: inherits the Qwen3-TTS base model's multilingual support (en, zh, ja, ko, de, fr, ru, pt, es, it)
- Drop-in compatible: works with any Qwen3-TTS serving stack
- Apache 2.0: fully open, commercial use OK
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| Training samples | 461 WAV clips (varied lengths, diverse content) |
| Epochs | 5 |
| Batch size | 64 |
| Learning rate | 2e-5 |
| Warmup steps | 200 |
| Precision | bf16 |
| Gradient checkpointing | Enabled |
| Block size | 10240 |
| Hardware | NVIDIA RTX PRO 6000 (Blackwell, 48GB VRAM) |
The training data consists of narrator-style speech samples covering a range of emotional expressions, pacing, and content types, from calm storytelling to dramatic narration.
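As a rough sanity check on the numbers in the table, the run length can be computed in a couple of lines (a sketch, assuming one optimizer step per 64-sample batch and no gradient accumulation; the card does not state either detail):

```python
import math

# Values taken from the training table above.
samples, batch_size, epochs = 461, 64, 5

steps_per_epoch = math.ceil(samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 8 40
```

This is a very short run, which is typical for voice-identity finetunes: the goal is to imprint a speaker, not to retrain the base model's speech capabilities.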
## Model Size

| Component | Size |
|---|---|
| Model weights (model.safetensors) | 3.8 GB |
| Speech tokenizer | 682 MB |
| Total | ~4.5 GB |
## Usage

### With Qwen3-TTS Server
If you're running a Qwen3-TTS server (like qwen3-tts-server), register this as a finetuned speaker and request it by name:
```python
import requests

response = requests.post("http://localhost:8109/v1/audio/speech", json={
    "input": "Once upon a time, in a land far far away, there lived a brave little bear.",
    "voice": "narrator"
})

with open("output.wav", "wb") as f:
    f.write(response.content)
```
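For batch jobs (e.g., narrating a whole story segment by segment), it can be convenient to wrap the request in a small helper. This is an illustrative sketch: the endpoint URL and the `input`/`voice`/`instruct` fields come from the examples in this section, while `build_payload` and `synthesize` are hypothetical names, not part of any shipped client.

```python
BASE_URL = "http://localhost:8109/v1/audio/speech"  # same endpoint as above

def build_payload(text, voice="narrator", instruct=None):
    """Assemble the JSON body used by the examples in this section."""
    payload = {"input": text, "voice": voice}
    if instruct is not None:
        payload["instruct"] = instruct
    return payload

def synthesize(text, out_path, **kwargs):
    """POST one request and write the returned WAV bytes to disk."""
    import requests  # imported lazily so build_payload stays dependency-free

    response = requests.post(BASE_URL, json=build_payload(text, **kwargs))
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)
```

Usage would look like `synthesize("The forest was quiet.", "part1.wav", instruct="calm, slow pace")`, one call per segment.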
### With Instruct (Emotion Control)
```python
response = requests.post("http://localhost:8109/v1/audio/speech", json={
    "input": "And then, from the shadows... something moved.",
    "voice": "narrator",
    "instruct": "mysterious and suspenseful, slow pace"
})
```
### Direct with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "scrappylabs/narrator-tts",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("scrappylabs/narrator-tts")
```
See the Qwen3-TTS documentation for full inference code.
## Why Finetune vs. Voice Clone?

| | Voice Clone | Finetune |
|---|---|---|
| Reference audio needed | Yes (every request) | No |
| Voice consistency | Good | Excellent |
| Inference latency | Higher (processes reference) | Lower (voice is in weights) |
| Emotion/instruct control | Limited | Full support |
| Deployment size | Base model + WAV files | Single model checkpoint |
For production use cases where you need a consistent character voice across thousands of generations, finetuning wins.
## Use Cases
- Audiobook narration
- Children's story apps
- Podcast intros/outros
- AI assistant voice
- Game narration
- Any application needing a warm, consistent narrator
## About ScrappyLabs

ScrappyLabs builds open-source voice AI tools. This model powers our production TTS pipeline, providing the narrator voice across our apps, including Poo Bear (an AI-powered teddy bear for kids).
## License

Apache 2.0, the same license as the base Qwen3-TTS model. Use it for anything.
## Citation
If you use this model, a mention of ScrappyLabs is appreciated but not required.
```bibtex
@misc{scrappylabs-narrator-tts-2026,
  author = {ScrappyLabs},
  title  = {ScrappyLabs Narrator TTS: A Qwen3-TTS 1.7B Voice Finetune},
  year   = {2026},
  url    = {https://huggingface.co/scrappylabs/narrator-tts}
}
```