Orpheus 3B — Bangla TTS (Small Data Fine-tune)

Fine-tuned version of Orpheus 3B for Bangla (Bengali) text-to-speech using LoRA adapters. Trained on ~39K Bangla speech samples (Adiba speaker dataset) for 4,500 steps on an H100 GPU.

For higher-quality output, see the high-data version, trained on ~99K samples.

Model Details

Property            Value
------------------  ---------------------------------------------
Base Model          canopylabs/orpheus-3b-0.1-pretrained
Architecture        Llama 3B + LoRA adapters
Training Data       ~39,000 Bangla speech samples (Adiba speaker)
Training Steps      4,500
Audio Codec         SNAC 24 kHz
Training Platform   Modal (H100 GPU) with Unsloth
Language            Bangla (bn)
License             Apache 2.0

What is Orpheus?

Orpheus TTS is a Llama-based text-to-speech model that generates audio as interleaved SNAC codec tokens. It supports emotional speech tags for expressive synthesis.
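To illustrate the interleaving, here is a small sketch of how a flat stream of generated codec tokens can be split back into the three codebook layers of the SNAC 24 kHz codec. The 7-tokens-per-frame layout (1 coarse, 2 medium, 4 fine) and the positions within each frame are assumptions drawn from the Orpheus/SNAC design, not stated in this model card; verify against the upstream Orpheus code before relying on them.

```python
def deinterleave_snac(tokens):
    """Split a flat Orpheus token stream into the 3 SNAC codebook layers.

    Assumes 7 tokens per audio frame, laid out as
    [coarse, med, fine, fine, med, fine, fine] (an assumption).
    """
    assert len(tokens) % 7 == 0, "expected 7 tokens per frame"
    coarse, medium, fine = [], [], []
    for i in range(0, len(tokens), 7):
        frame = tokens[i:i + 7]
        coarse.append(frame[0])                                   # layer 0: 1 code/frame
        medium.extend([frame[1], frame[4]])                       # layer 1: 2 codes/frame
        fine.extend([frame[2], frame[3], frame[5], frame[6]])     # layer 2: 4 codes/frame
    return coarse, medium, fine
```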

Usage

Note: The base model canopylabs/orpheus-3b-0.1-pretrained is gated; you need a Hugging Face token with approved access.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# 1. Load base model
base_model_id = "canopylabs/orpheus-3b-0.1-pretrained"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, token="YOUR_HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    token="YOUR_HF_TOKEN"
)

# 2. IMPORTANT: resize embeddings to the fine-tuned vocabulary size
#    (text + added audio tokens) BEFORE loading the LoRA adapter
model.resize_token_embeddings(156940)

# 3. Load LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "EMTIAZZ/orpheus-3b-bangla-small-data-finetuning",
    token="YOUR_HF_TOKEN"
)
model = model.merge_and_unload()

# 4. Prepare prompt and generate
text = "আমি বাংলায় কথা বলতে পারি।"
prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1200,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
    )
# `outputs` contains SNAC audio tokens, not text: decode them with the
# SNAC 24 kHz codec to obtain a waveform.
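The generated IDs are custom audio tokens, not text, so before they can be fed to a SNAC decoder they must be mapped back to raw codec codes. Below is a minimal sketch of that mapping; `AUDIO_TOKEN_START` and the per-position stride of 4096 are assumptions about this checkpoint's token layout, not facts from this card. Verify them against the model's tokenizer config before use.

```python
AUDIO_TOKEN_START = 128266  # assumed ID of the first audio token (verify!)
CODEBOOK_SIZE = 4096        # SNAC codebook size per layer

def tokens_to_codes(token_ids):
    """Map generated audio-token IDs back to 0..4095 SNAC codec codes.

    Assumes each of the 7 positions in a frame uses its own
    CODEBOOK_SIZE-wide ID range (an assumption, not from this card).
    """
    codes = []
    for pos, tok in enumerate(token_ids):
        code = tok - AUDIO_TOKEN_START - (pos % 7) * CODEBOOK_SIZE
        codes.append(code)
    return codes
```

The resulting codes would then be grouped into the three SNAC layers and passed to a SNAC 24 kHz decoder (for example, the `snac` package's pretrained 24 kHz model) to synthesize audio.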

Emotional Speech Tags

<laugh>   <chuckle>   <sigh>   <cough>   <sniffle>
<groan>   <yawn>      <gasp>
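The tags are written inline in the input text, as in the Orpheus base model. A minimal example (reusing the prompt template from the usage snippet above):

```python
# Insert an emotion tag directly into the text to be spoken.
text = "<sigh> আমি বাংলায় কথা বলতে পারি।"
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
```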

Training Details

  • Framework: Unsloth + HuggingFace Trainer
  • Method: LoRA (Low-Rank Adaptation)
  • Speaker: Adiba (single-speaker Bangla dataset, ~39K samples)
  • Hardware: H100 GPU on Modal
  • Training: 4,500 steps

When to Use This vs. High-Data Version

  • Use this model if you want a single-speaker voice (Adiba) or need quick prototyping
  • Use the high-data version for better generalization, more natural prosody, and higher overall quality

Citation

@misc{emtiaz2026orpheusbanglasmall,
  author = {Emtiaz Uddin Ahmed},
  title  = {Orpheus 3B Bangla Small-Data Fine-tune},
  year   = {2026},
  url    = {https://huggingface.co/EMTIAZZ/orpheus-3b-bangla-small-data-finetuning}
}

Author

GitHub · Portfolio
