
# MOSS-VoiceGenerator GRPO LoRA

A LoRA adapter for MOSS-VoiceGenerator (1.7B) trained with Group Relative Policy Optimization (GRPO) to improve speaker similarity, emotion expression, and speech intelligibility.

## Model Details

| Property | Value |
|---|---|
| Base model | LAION-AI/MOSS-VoiceGenerator (1.7B params) |
| Architecture | MossTTSDelayModel (Qwen3 backbone + 16 VQ codebooks) |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA targets | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Applied to | `model.language_model` (Qwen3 backbone only) |
| Adapter size | 34 MB |
| Training | GRPO: 500 steps (v1) + 500 steps (v7) = 1000 steps total |

## Training Results

### GRPO v1 (500 steps, lr=5e-5, batch=8, G=4)

| Metric | Start (5-step avg) | End (5-step avg) |
|---|---|---|
| Speaker Similarity (ECAPA-TDNN cosine) | 0.374 | 0.441 |
| Emotion Match (CLAP cosine) | 0.191 | 0.221 |
| Word Error Rate | 0.234 | 0.135 |

### GRPO v7 (500 additional steps, continuing from v1)

| Metric | Start (5-step avg) | End (5-step avg) | Peak |
|---|---|---|---|
| Speaker Similarity | 0.433 | 0.391 | 0.572 (step 258) |
| Emotion Match (CLAP) | 0.218 | 0.203 | 0.308 (step 472) |
| Word Error Rate | 0.111 | 0.186 | 0.057 (step 431) |

### Overall Improvement (v1 start to v7 end, 1000 steps total)

| Metric | v1 Start | v7 End | Change |
|---|---|---|---|
| Speaker Similarity | 0.374 | 0.391 | +0.017 |
| Emotion Match (CLAP) | 0.191 | 0.203 | +0.012 |
| Word Error Rate | 0.234 | 0.186 | -0.048 (improved) |

Reward weights (v7): Speaker 50%, Emotion 40%, Quality 10%, with WER penalty `exp(-10 * WER)`.

## Inference Benchmark (SGLang, 6x H100 80GB, DP=6)

| Batch Size | Wall Clock | Latency (avg) | Real-Time Factor | Requests/s |
|---|---|---|---|---|
| 1 | 0.81s | 1.25s | 8.3x | 1.2 |
| 4 | 0.85s | 0.89s | 31.9x | 4.7 |
| 8 | 1.05s | 0.98s | 49.9x | 7.6 |
| 16 | 1.10s | 1.01s | 95.1x | 14.6 |
| 24 | 1.34s | 1.19s | 117.0x | 17.9 |
| 48 | 1.75s | 1.53s | 178.8x | 27.4 |

Peak throughput: 178.8x realtime at BS=48 (27.4 requests/second).

## Usage

### Quick Start with transformers + peft

```python
import torch
from transformers import AutoModel, AutoProcessor
from peft import PeftModel

# Load base model
model_id = "LAION-AI/MOSS-VoiceGenerator"
codec_path = "LAION-AI/MOSS-Audio-Tokenizer"

proc = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True,
    normalize_inputs=True, codec_path=codec_path)
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True,
    torch_dtype=torch.bfloat16, attn_implementation="sdpa"
).to("cuda")

# Apply LoRA and merge (IMPORTANT: must merge for correct generation)
lora_path = "laion/MOSS-VoiceGenerator-GRPO-LoRA"
model.language_model = PeftModel.from_pretrained(
    model.language_model, lora_path, is_trainable=False)
model.language_model = model.language_model.merge_and_unload()

# Prepare input (4-message ICL format)
# Message 1: empty user turn
# Message 2: reference speaker audio (assistant)
# Message 3: target text + emotion (user)
# Message 4: generated audio (assistant, to be generated)

ref_audio_path = "speaker_reference.wav"
text = "Hello, how are you today?"
emotion = "happiness"

messages = [
    {"role": "user", "content": ""},
    {"role": "assistant", "content": [{"type": "audio", "audio_url": ref_audio_path}]},
    {"role": "user", "content": f"${{instruction:{emotion}}}{text}"},
]

inputs = proc.apply_chat_template(
    messages, tokenize=True, return_tensors="pt").to("cuda")

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,
        audio_temperature=1.0,
        audio_top_p=0.8,
        audio_top_k=50,
        audio_repetition_penalty=1.1,
    )

# Decode to audio
audio = proc.decode(outputs)
# audio contains the waveform at 24 kHz
```

### Production Serving with SGLang (Recommended)

SGLang provides 8-178x realtime performance via the OpenMOSS SGLang fork.

#### Step 1: Merge the LoRA into the base model

```bash
python serve_sglang.py merge \
    --base-model LAION-AI/MOSS-VoiceGenerator \
    --lora-path laion/MOSS-VoiceGenerator-GRPO-LoRA \
    --output ./merged_model
```

#### Step 2: Launch the SGLang server

```bash
# Single GPU
python -m sglang.launch_server \
    --model-path ./merged_model \
    --delay-pattern \
    --trust-remote-code \
    --port 30000

# Multi-GPU (DP=6 for maximum throughput)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m sglang.launch_server \
    --model-path ./merged_model \
    --delay-pattern \
    --trust-remote-code \
    --port 30000 \
    --dp-size 6 \
    --mem-fraction-static 0.85
```

#### Step 3: Generate audio via the API

```python
import base64
import io

import requests
import soundfile as sf

response = requests.post("http://localhost:30000/generate", json={
    "text": "${instruction:happiness}Hello, how are you today?",
    "audio_data": ["speaker_reference.wav"],  # absolute path to the reference audio
    "sampling_params": {
        "temperature": 1.0,
        "top_p": 0.8,
        "top_k": 50,
        "repetition_penalty": 1.1,
        "max_new_tokens": 4096,
    },
    "stream": False,
})

# Decode the base64 WAV response
audio_bytes = base64.b64decode(response.json()["text"])
wav, sr = sf.read(io.BytesIO(audio_bytes))  # 24 kHz
sf.write("output.wav", wav, sr)
```

Or use the included CLI tool:

```bash
python serve_sglang.py generate \
    --text "Hello, how are you today?" \
    --emotion happiness \
    --ref-audio speaker_reference.wav \
    --output output.wav
```

## How GRPO Works

Group Relative Policy Optimization (GRPO) is a reinforcement learning method that directly optimizes generation quality:

1. **Generate**: For each text prompt, generate G=4 audio completions with the current model.
2. **Score**: Rate each completion with three reward models:
   - Speaker similarity (ECAPA-TDNN): cosine similarity between the generated and reference speaker embeddings
   - Emotion match (Voice-OpenCLAP): cosine similarity between the audio embedding and the emotion text
   - Intelligibility (Parakeet ASR): Word Error Rate of the transcribed audio vs. the target text
3. **Advantage**: Normalize rewards within each group: `A_i = (R_i - mean) / std`
4. **Train**: Update the model with the policy gradient: `L = mean(A_i * CE_loss(completion_i))`

This trains the model to produce audio that better matches the target speaker, conveys the right emotion, and maintains intelligibility.
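To make steps 3 and 4 concrete, here is a minimal sketch of the group-relative advantage and the advantage-weighted loss. Tensor shapes and function boundaries are my own; the actual implementation lives in `grpo_train_v6.py`:

```python
import torch

def grpo_loss(ce_per_completion, rewards, group_size=4, eps=1e-6):
    """Group-relative policy loss (sketch).

    ce_per_completion: (B*G,) mean cross-entropy of each generated completion
    rewards:           (B*G,) combined scalar reward of each completion
    """
    # Normalize rewards within each group of G completions for the same prompt.
    r = rewards.view(-1, group_size)                                   # (B, G)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + eps)
    adv = adv.reshape(-1)                                              # (B*G,)
    # Minimizing this lowers CE on above-average completions (reinforcing them)
    # and raises CE on below-average ones (suppressing them).
    return (adv.detach() * ce_per_completion).mean()
```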

## Important Notes

- Always call `merge_and_unload()` before generation. The `PeftModel` wrapper breaks MOSS's multi-head architecture and produces garbled audio.
- The LoRA only modifies `model.language_model` (the Qwen3 backbone). Audio embedding layers and output heads are unchanged.
- Audio is generated at 24 kHz with 16 VQ codebooks using a delay pattern (see the sketch below).
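For intuition on the delay pattern, here is a minimal MusicGen-style sketch: codebook `k` is shifted right by `k` steps so that each frame's higher codebooks are predicted after its lower ones. This is an illustration under assumed conventions, not the `MossTTSDelayModel` implementation; the exact pad/BOS handling may differ.

```python
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Shift codebook k right by k steps.

    codes: (num_codebooks, T) discrete VQ codes for one utterance
    Returns: (num_codebooks, T + num_codebooks - 1), gaps filled with pad_id.
    """
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out
```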

## Training

This section describes how to replicate GRPO training using the included `grpo_train_v6.py` and `grpo_rewards.py` scripts. The pipeline uses SGLang for fast generation, enabling semi-on-policy GRPO on a single 8-GPU node.

### Dependencies

```bash
# Core
pip install "torch>=2.3" transformers peft accelerate safetensors

# Audio processing
pip install librosa soundfile numpy jiwer

# Reward models
pip install speechbrain           # ECAPA-TDNN speaker verification
pip install openai-whisper        # Whisper ASR (fallback)
pip install "nemo_toolkit[asr]"   # Parakeet ASR (preferred)

# CLAP model (clone separately)
# See: https://huggingface.co/laion/voice-openclap-poc

# SGLang (OpenMOSS fork with MOSS TTS support) -- install from source,
# since the PyPI sglang package lacks the fork's MOSS TTS support:
git clone https://github.com/OpenMOSS/sglang && cd sglang && pip install -e ".[all]"

# Data
pip install datasets              # HuggingFace datasets for voice-acting-prompts
```

### Architecture Overview

The GRPO training pipeline uses a semi-on-policy design with pipeline overlap across 8 GPUs:

```
GPU 0:     Training (forward/backward/optimizer) with the base model + LoRA
GPUs 1-6:  SGLang server with data-parallel-size=6 (continuous batching)
GPU 7:     Reward models (ECAPA-TDNN, Voice-OpenCLAP, Parakeet ASR)
```

The key idea is that generation (the bottleneck in GRPO) runs on a dedicated SGLang server with 6-way data parallelism, while training and reward scoring happen on separate GPUs. This enables pipeline overlap: while the training GPU processes batch N (scoring rewards and computing gradients), the SGLang server is already generating completions for batch N+1.

**Semi-on-policy:** LoRA weights are synced from the training model to the SGLang server every `--sync-every` steps (default 5). Between syncs, the SGLang server generates using slightly stale weights: a practical trade-off that avoids the costly restart-per-step approach while keeping the policy reasonably fresh.

Weight sync works by:

1. Saving the current LoRA adapter to disk
2. Merging it into a fused model checkpoint (only regenerating the modified safetensors shard)
3. Triggering SGLang to reload the updated weights
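A sketch of what one sync step could look like. The merge helper and endpoint name are assumptions (SGLang exposes a weight-reload endpoint, but check the fork for the exact API); the real logic lives in `grpo_train_v6.py`:

```python
import requests

def sync_lora_to_sglang(model, step, adapter_dir, merged_dir,
                        server_url="http://localhost:30000"):
    # 1. Save the current LoRA adapter to disk.
    adapter_path = f"{adapter_dir}/step_{step}"
    model.language_model.save_pretrained(adapter_path)

    # 2. Merge into the fused checkpoint; only the backbone shard needs
    #    regenerating since embeddings/heads are frozen. (hypothetical helper)
    merge_adapter_into_checkpoint(adapter_path, merged_dir)

    # 3. Trigger SGLang to reload the updated weights from disk.
    requests.post(f"{server_url}/update_weights_from_disk",
                  json={"model_path": merged_dir})
```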

### Data Pipeline

**Dataset:** voice-acting-prompts, a large collection of expressive text prompts with emotion labels.

**IMPORTANT: English-only filtering is required.** The dataset is approximately 78% non-English (German, French, etc.). The `_is_english` function in `grpo_train_v6.py` filters prompts by checking the ratio of ASCII characters to total characters (threshold: 80% ASCII). Without this filter, the model will train on non-English text and WER rewards become meaningless.

**IMPORTANT: Always shuffle the streaming dataset.** The dataset shards are not randomly ordered: early shards are heavily German. Without shuffling, the first several hundred steps would contain almost exclusively non-English prompts, some of which slip past even the ASCII filter. The training script uses `dataset.shuffle(seed=42)` to ensure a uniform language distribution across training.
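A minimal sketch of this part of the pipeline, assuming the 80% ASCII threshold described above. The dataset id and field name are placeholders; the real `_is_english` lives in `grpo_train_v6.py`:

```python
from datasets import load_dataset

def _is_english(text: str, threshold: float = 0.8) -> bool:
    """Heuristic filter: keep prompts whose characters are mostly ASCII."""
    if not text:
        return False
    return sum(ch.isascii() for ch in text) / len(text) >= threshold

# Stream, shuffle (shards are not randomly ordered), then filter to English.
ds = load_dataset("voice-acting-prompts", split="train", streaming=True)  # placeholder id
ds = ds.shuffle(seed=42).filter(lambda ex: _is_english(ex["text"]))       # assumed field
```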

**Reference speakers:** Emolia dataset speaker clusters (3000 clusters of expressive speech). Each training step randomly samples a speaker cluster and a random utterance within that cluster as the reference audio for voice cloning.

Prompt processing:

1. Stream and shuffle the dataset
2. Filter to English-only using the ASCII character ratio
3. Clean the text (remove quotes and stage directions, normalize whitespace)
4. Pair each prompt with a randomly sampled Emolia speaker reference
5. Format as the MOSS 4-message ICL: `[empty_user, ref_assistant, target_user, gen_assistant]`

### Reward Models

Three reward models run on GPU 7, wrapped by `grpo_rewards.py`:

| Model | What it measures | Output range | Class |
|---|---|---|---|
| ECAPA-TDNN (speechbrain) | Speaker similarity between generated and reference audio | [-1, 1] cosine | `SpeakerReward` |
| Voice-OpenCLAP | Emotion match: audio-text cosine similarity to the emotion description | [-1, 1] cosine | `CLAPReward` |
| Voice-OpenCLAP (quality) | Audio quality: similarity to "High Quality Recording, fluid pleasant performance" | [-1, 1] cosine | `CLAPReward.score_quality()` |
| Parakeet TDT 0.6B (or Whisper fallback) | Intelligibility: Word Error Rate of the transcribed audio vs. the target text | [0, inf) WER | `ASRReward` |

All audio is resampled to 16kHz before reward scoring.

### Reward Formula

Individual rewards are z-normalized using running baseline statistics, then combined with a multiplicative WER penalty:

```
R_total = (w_spk * Z(speaker_sim) + w_clap * Z(emotion_match) + w_qual * Z(quality)) * exp(-beta * WER)
```

Where:

- `Z(x) = (x - mean) / std`: z-normalization using baseline statistics
- `w_spk`, `w_clap`, `w_qual`: reward component weights (must sum to 1.0)
- `beta`: WER penalty strength (default: 10.0)
- `WER`: word error rate from the ASR transcription

The exponential WER penalty ensures that unintelligible speech receives near-zero total reward regardless of how well it matches the speaker or emotion. At WER=0, the penalty is 1.0 (no effect); at WER=0.3, the penalty is ~0.05.
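A sketch of the combination under these definitions. The running-baseline bookkeeping is simplified to a plain dict; see `grpo_rewards.py` for the real version:

```python
import math

def combined_reward(speaker_sim, emotion_match, quality, wer, baselines,
                    w_spk=0.5, w_clap=0.4, w_qual=0.1, beta=10.0):
    """R_total = (w_spk*Z(spk) + w_clap*Z(emo) + w_qual*Z(qual)) * exp(-beta*WER).

    baselines maps component name -> (running_mean, running_std).
    """
    def z(x, name):
        mean, std = baselines[name]
        return (x - mean) / (std + 1e-6)

    base = (w_spk * z(speaker_sim, "speaker")
            + w_clap * z(emotion_match, "emotion")
            + w_qual * z(quality, "quality"))
    return base * math.exp(-beta * wer)  # WER=0 -> x1.0; WER=0.3 -> ~x0.05
```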

Reward weight configurations used:

- v1 (default): `w_spk=0.6, w_clap=0.4, w_qual=0.0`, speaker-heavy with no quality term
- v7: `w_spk=0.5, w_clap=0.4, w_qual=0.1`, balanced with a quality bonus

### GRPO Algorithm Details

For each training step:

1. Sample a batch of B=8 prompts from the dataset
2. Generate G=4 completions per prompt using SGLang (32 audio samples total)
3. Score each completion with all three reward models
4. Combine the rewards using the formula above
5. Compute advantages within each group: `A_i = (R_i - mean(R_group)) / std(R_group)`
6. Train using an advantage-weighted cross-entropy loss on the audio tokens

The loss is computed as:

```
L = mean(A_i * CE_loss(completion_i))
```

where `CE_loss` uses channelwise loss weighting (`--channelwise-loss-weight "1,32"` means the text head has weight 1 and the audio heads share a total weight of 32 across the 16 codebook heads).
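A sketch of one way to apply the "1,32" channelwise weighting. The head layout and the normalization by the weight sum are my assumptions, not the script's exact code:

```python
import torch
import torch.nn.functional as F

def channelwise_ce(head_logits, head_targets, text_weight=1.0, audio_total=32.0):
    """Weighted CE over 1 text head + 16 audio codebook heads.

    head_logits:  list of 17 tensors, each (B, T, vocab_h)
    head_targets: list of 17 tensors, each (B, T)
    """
    num_audio = len(head_logits) - 1
    weights = [text_weight] + [audio_total / num_audio] * num_audio  # 1, then 2 per head
    losses = [F.cross_entropy(lg.flatten(0, 1), tg.flatten())
              for lg, tg in zip(head_logits, head_targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```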

### Key Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--lora-init` | (none) | Path to initial LoRA adapter or HF repo; omit to train a fresh adapter |
| `--output-dir` | `output/grpo_v6` | Directory for checkpoints and logs |
| `--batch-size` | 8 | Prompts per training step |
| `--group-size` | 4 | Completions generated per prompt (G) |
| `--lr` | 5e-5 | Learning rate |
| `--max-steps` | 1600 | Total training steps |
| `--sync-every` | 5 | Steps between LoRA-to-SGLang weight syncs |
| `--save-every` | 200 | Steps between checkpoint saves |
| `--channelwise-loss-weight` | `"1,32"` | Text head vs. total audio weight |
| `--w-speaker` | 0.5 | Speaker similarity reward weight |
| `--w-clap` | 0.4 | Emotion match reward weight |
| `--w-quality` | 0.1 | Audio quality reward weight |
| `--beta-wer` | 10.0 | WER penalty strength |
| `--train-device` | `cuda:0` | GPU for training |
| `--sglang-gpus` | `1,2,3,4,5,6` | GPUs for the SGLang server (DP) |
| `--reward-device` | `cuda:7` | GPU for reward models |
| `--lr-schedule` | `constant` | LR schedule: constant, cosine, or linear |
| `--warmup-steps` | 0 | Linear warmup steps |
| `--lr-min` | 0.0 | Minimum LR for cosine/linear decay |
| `--seed` | 42 | Random seed |
| `--resume-step` | 0 | Resume from this step number |

### LoRA Configuration

LoRA is applied only to `model.language_model`: the Qwen3 backbone that handles sequence modeling. The 16 audio embedding layers (`emb_ext`) and 17 output heads (`lm_heads`) are frozen and unchanged. This is critical because:

- The audio codebook embeddings map discrete VQ codes to continuous representations; modifying them would break the learned codebook alignment
- The output heads project back to codebook logits; these must remain calibrated to the frozen embeddings
- The Qwen3 backbone is where the high-level decisions about what to generate are made, making it the right target for RL fine-tuning

LoRA config: rank=8, alpha=16, dropout=0.0, applied to all linear layers in the Qwen3 model (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`), using `TaskType.FEATURE_EXTRACTION`.
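In peft terms this corresponds roughly to the following (a sketch; only the wrapping call is shown):

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)
# Wrap only the Qwen3 backbone; audio embeddings (emb_ext) and
# output heads (lm_heads) remain frozen outside the adapter.
model.language_model = get_peft_model(model.language_model, lora_config)
```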

### Replicating v7 Training

v7 continued from the v1 checkpoint (500 steps) with adjusted reward weights. To replicate:

```bash
# Requires: 8x GPUs (H100 80GB recommended), ~2-3 hours for 500 steps
# The script will automatically:
#   1. Load the base model + LoRA on GPU 0
#   2. Launch the SGLang server on GPUs 1-6
#   3. Load the reward models on GPU 7
#   4. Stream and filter the voice-acting-prompts dataset

python grpo_train_v6.py \
    --lora-init output/grpo/final \
    --output-dir output/grpo_v7 \
    --max-steps 500 \
    --save-every 100 \
    --sync-every 5 \
    --w-speaker 0.5 --w-clap 0.4 --w-quality 0.1 \
    --lr 5e-5
```

To start from scratch (no prior LoRA, a fresh rank-8 adapter):

```bash
python grpo_train_v6.py \
    --output-dir output/grpo_from_scratch \
    --max-steps 500 \
    --save-every 100 \
    --sync-every 5 \
    --w-speaker 0.5 --w-clap 0.4 --w-quality 0.1 \
    --lr 5e-5
```

Environment variables (optional, for custom paths):

```bash
export MVG_DIR=/path/to/MOSS-VoiceGenerator           # Base model
export CODEC_DIR=/path/to/MOSS-Audio-Tokenizer        # Audio codec
export EMOLIA_DIR=/path/to/emolia/cluster_samples     # Reference speakers
export CLAP_DIR=/path/to/voice-openclap-poc           # CLAP model
```

### Training Tips

- **Monitor WER closely.** If WER rises above ~0.3, the model is producing unintelligible speech. Consider increasing `--beta-wer` or reducing the learning rate.
- **Speaker similarity and emotion match trade off.** Increasing `--w-speaker` improves voice-cloning fidelity but may reduce emotional expressiveness, and vice versa.
- **Checkpoint every 100-200 steps.** Peak metrics often occur mid-training (e.g., v7 peaked at step 258 for speaker similarity and step 472 for CLAP), so the final checkpoint may not be optimal.
- **`--sync-every 5` is a good default.** Lower values (1-2) keep the policy more on-policy but increase weight-sync overhead. Higher values (10+) risk the SGLang server generating with stale weights, reducing training-signal quality.
- **Channelwise loss weight `"1,32"`** means the text token head has weight 1 and the 16 audio codebook heads share a total weight of 32 (2 per head). This upweights audio quality relative to text token prediction.

## Citation

```bibtex
@misc{moss-grpo-lora-2026,
    title={MOSS-VoiceGenerator GRPO LoRA},
    author={LAION},
    year={2026},
    url={https://huggingface.co/laion/MOSS-VoiceGenerator-GRPO-LoRA}
}
```

## License

Same as the base MOSS-VoiceGenerator model.
