# MOSS-VoiceGenerator GRPO LoRA
A LoRA adapter for MOSS-VoiceGenerator (1.7B) trained with Group Relative Policy Optimization (GRPO) to improve speaker similarity, emotion expression, and speech intelligibility.
## Model Details
| Property | Value |
|---|---|
| Base model | LAION-AI/MOSS-VoiceGenerator (1.7B params) |
| Architecture | MossTTSDelayModel (Qwen3 backbone + 16 VQ codebooks) |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Applied to | model.language_model (Qwen3 backbone only) |
| Adapter size | 34 MB |
| Training | GRPO, 1000 steps total (500 in v1 + 500 in v7) |
## Training Results

### GRPO v1 (500 steps, lr=5e-5, batch=8, G=4)
| Metric | Start (5-step avg) | End (5-step avg) |
|---|---|---|
| Speaker Similarity (ECAPA-TDNN cosine) | 0.374 | 0.441 |
| Emotion Match (CLAP cosine) | 0.191 | 0.221 |
| Word Error Rate | 0.234 | 0.135 |
### GRPO v7 (500 additional steps, continuing from v1)
| Metric | Start (5-step avg) | End (5-step avg) | Peak |
|---|---|---|---|
| Speaker Similarity | 0.433 | 0.391 | 0.572 (step 258) |
| Emotion Match (CLAP) | 0.218 | 0.203 | 0.308 (step 472) |
| Word Error Rate | 0.111 | 0.186 | 0.057 (step 431) |
### Overall Improvement (v1 start to v7 end, 1000 steps total)
| Metric | v1 Start | v7 End | Change |
|---|---|---|---|
| Speaker Similarity | 0.374 | 0.391 | +0.017 |
| Emotion Match (CLAP) | 0.191 | 0.203 | +0.012 |
| Word Error Rate | 0.234 | 0.186 | -0.048 (improved) |
**Reward weights (v7):** Speaker 50%, Emotion 40%, Quality 10%, with a multiplicative WER penalty of `exp(-10 * WER)`.
## Inference Benchmark (SGLang, 6x H100 80GB, DP=6)
| Batch Size | Wall Clock | Latency (avg) | Real-Time Factor | Requests/s |
|---|---|---|---|---|
| 1 | 0.81s | 1.25s | 8.3x | 1.2 |
| 4 | 0.85s | 0.89s | 31.9x | 4.7 |
| 8 | 1.05s | 0.98s | 49.9x | 7.6 |
| 16 | 1.10s | 1.01s | 95.1x | 14.6 |
| 24 | 1.34s | 1.19s | 117.0x | 17.9 |
| 48 | 1.75s | 1.53s | 178.8x | 27.4 |
Peak throughput: 178.8x realtime at BS=48 (27.4 requests/second).
## Usage

### Quick Start with transformers + peft
```python
import torch
from transformers import AutoModel, AutoProcessor
from peft import PeftModel

# Load the base model and processor
model_id = "LAION-AI/MOSS-VoiceGenerator"
codec_path = "LAION-AI/MOSS-Audio-Tokenizer"
proc = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True,
    normalize_inputs=True, codec_path=codec_path)
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True,
    torch_dtype=torch.bfloat16, attn_implementation="sdpa",
).to("cuda")

# Apply the LoRA and merge (IMPORTANT: must merge for correct generation)
lora_path = "laion/MOSS-VoiceGenerator-GRPO-LoRA"
model.language_model = PeftModel.from_pretrained(
    model.language_model, lora_path, is_trainable=False)
model.language_model = model.language_model.merge_and_unload()

# Prepare the input (4-message ICL format):
#   message 1: empty user turn
#   message 2: reference speaker audio (assistant)
#   message 3: target text + emotion (user)
#   message 4: generated audio (assistant, produced by generate())
ref_audio_path = "speaker_reference.wav"
text = "Hello, how are you today?"
emotion = "happiness"
messages = [
    {"role": "user", "content": ""},
    {"role": "assistant", "content": [{"type": "audio", "audio_url": ref_audio_path}]},
    {"role": "user", "content": f"${{instruction:{emotion}}}{text}"},
]
inputs = proc.apply_chat_template(
    messages, tokenize=True, return_tensors="pt").to("cuda")

# Generate audio tokens
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,
        audio_temperature=1.0,
        audio_top_p=0.8,
        audio_top_k=50,
        audio_repetition_penalty=1.1,
    )

# Decode tokens to a 24 kHz waveform
audio = proc.decode(outputs)
```
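To save the result, a minimal follow-up, assuming `proc.decode` returns a 1-D numpy waveform (if it returns a torch tensor, convert with `.cpu().numpy()` first):

```python
import soundfile as sf

sf.write("output.wav", audio, 24000)  # MOSS generates audio at 24 kHz
```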
### Production Serving with SGLang (Recommended)

SGLang provides 8-178x realtime performance, depending on batch size, via the OpenMOSS SGLang fork.
#### Step 1: Merge the LoRA into the base model

```bash
python serve_sglang.py merge \
  --base-model LAION-AI/MOSS-VoiceGenerator \
  --lora-path laion/MOSS-VoiceGenerator-GRPO-LoRA \
  --output ./merged_model
```
#### Step 2: Launch the SGLang server

```bash
# Single GPU
python -m sglang.launch_server \
  --model-path ./merged_model \
  --delay-pattern \
  --trust-remote-code \
  --port 30000

# Multi-GPU (DP=6 for maximum throughput)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m sglang.launch_server \
  --model-path ./merged_model \
  --delay-pattern \
  --trust-remote-code \
  --port 30000 \
  --dp-size 6 \
  --mem-fraction-static 0.85
```
#### Step 3: Generate audio via the API

```python
import base64
import io

import requests
import soundfile as sf

response = requests.post("http://localhost:30000/generate", json={
    "text": "${instruction:happiness}Hello, how are you today?",
    "audio_data": ["speaker_reference.wav"],  # absolute path to the reference audio
    "sampling_params": {
        "temperature": 1.0,
        "top_p": 0.8,
        "top_k": 50,
        "repetition_penalty": 1.1,
        "max_new_tokens": 4096,
    },
    "stream": False,
})

# Decode the base64-encoded WAV response
audio_bytes = base64.b64decode(response.json()["text"])
wav, sr = sf.read(io.BytesIO(audio_bytes))  # 24 kHz
sf.write("output.wav", wav, sr)
```
Or use the included CLI tool:
```bash
python serve_sglang.py generate \
  --text "Hello, how are you today?" \
  --emotion happiness \
  --ref-audio speaker_reference.wav \
  --output output.wav
```
## How GRPO Works
Group Relative Policy Optimization (GRPO) is a reinforcement learning method that directly optimizes generation quality:
- **Generate:** For each text prompt, generate G=4 audio completions with the current model
- **Score:** Rate each completion with three reward models:
  - Speaker similarity (ECAPA-TDNN): cosine similarity between generated and reference speaker embeddings
  - Emotion match (Voice-OpenCLAP): cosine similarity between the audio embedding and the emotion text
  - Intelligibility (Parakeet ASR): word error rate of the transcribed audio vs. the target text
- **Advantage:** Normalize rewards within each group: `A_i = (R_i - mean) / std`
- **Train:** Update the model with the policy-gradient loss: `L = mean(A_i * CE_loss(completion_i))`
This trains the model to produce audio that better matches the target speaker, conveys the right emotion, and maintains intelligibility.
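The update itself is compact. Below is a minimal PyTorch sketch of the group-relative advantage and the advantage-weighted loss; the tensor shapes and names are illustrative assumptions, not the actual `grpo_train_v6.py` code.

```python
import torch

def grpo_loss(ce_per_completion: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """ce_per_completion, rewards: (B, G) tensors, one entry per completion.

    ce_per_completion holds each completion's mean cross-entropy over its
    audio tokens; rewards holds the combined scalar rewards.
    """
    mean = rewards.mean(dim=1, keepdim=True)                 # per-group mean
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)   # per-group std
    advantages = (rewards - mean) / std                      # A_i
    return (advantages.detach() * ce_per_completion).mean()  # L = mean(A_i * CE_i)
```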
## Important Notes

- Always use `merge_and_unload()` before generation. The `PeftModel` wrapper breaks MOSS's multi-head architecture and produces garbled audio.
- The LoRA only modifies `model.language_model` (the Qwen3 backbone). Audio embedding layers and output heads are unchanged.
- Audio is generated at 24 kHz with 16 VQ codebooks using a delay pattern.
## Training

This section describes how to replicate GRPO training using the included `grpo_train_v6.py` and `grpo_rewards.py` scripts. The pipeline uses SGLang for fast generation, enabling semi-on-policy GRPO on a single 8-GPU node.
### Dependencies

```bash
# Core
pip install "torch>=2.3" transformers peft accelerate safetensors

# Audio processing
pip install librosa soundfile numpy jiwer

# Reward models
pip install speechbrain          # ECAPA-TDNN speaker verification
pip install openai-whisper       # Whisper ASR (fallback)
pip install "nemo_toolkit[asr]"  # Parakeet ASR (preferred)

# CLAP model (clone separately)
# See: https://huggingface.co/laion/voice-openclap-poc

# SGLang (OpenMOSS fork with MOSS TTS support)
pip install "sglang[all]" --find-links https://github.com/OpenMOSS/sglang
# Or install from source:
# git clone https://github.com/OpenMOSS/sglang && cd sglang && pip install -e ".[all]"

# Data
pip install datasets  # HuggingFace datasets for voice-acting-prompts
```
### Architecture Overview
The GRPO training pipeline uses a semi-on-policy design with pipeline overlap across 8 GPUs:
```text
GPU 0:    training (forward/backward/optimizer) - base model + LoRA
GPUs 1-6: SGLang server with data-parallel-size=6 (continuous batching)
GPU 7:    reward models (ECAPA-TDNN, Voice-OpenCLAP, Parakeet ASR)
```
The key idea is that generation (the bottleneck in GRPO) runs on a dedicated SGLang server with 6-way data parallelism, while training and reward scoring happen on separate GPUs. This enables pipeline overlap: while the training GPU processes batch N (scoring rewards + computing gradients), the SGLang server is already generating completions for batch N+1.
**Semi-on-policy:** LoRA weights are synced from the training model to the SGLang server every `--sync-every` steps (default 5). Between syncs, the SGLang server generates with slightly stale weights: a practical trade-off that avoids the costly restart-per-step approach while keeping the policy reasonably fresh.
Weight sync works by:
- Saving the current LoRA adapter to disk
- Merging it into a fused model checkpoint (only regenerating the modified safetensors shard)
- Triggering SGLang to reload the updated weights
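A schematic of that sync, with hypothetical helper names (`merge_adapter_into` stands in for the shard-regeneration logic, and the reload endpoint is upstream SGLang's `/update_weights_from_disk`, assumed to also exist in the OpenMOSS fork):

```python
import requests

SGLANG_URL = "http://localhost:30000"

def sync_lora_to_sglang(model, merged_dir: str, step: int) -> None:
    adapter_dir = f"output/sync/step_{step}"  # illustrative path
    # 1. Save the current LoRA adapter to disk
    model.language_model.save_pretrained(adapter_dir)
    # 2. Merge into the fused checkpoint, regenerating only the safetensors
    #    shard touched by the LoRA (hypothetical helper)
    merge_adapter_into(merged_dir, adapter_dir)
    # 3. Tell the server to reload the updated weights from disk
    requests.post(f"{SGLANG_URL}/update_weights_from_disk",
                  json={"model_path": merged_dir}).raise_for_status()
```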
### Data Pipeline

**Dataset:** voice-acting-prompts, a large collection of expressive text prompts with emotion labels.
**IMPORTANT:** English-only filtering is required. The dataset is approximately 78% non-English (German, French, etc.). The `_is_english` function in `grpo_train_v6.py` filters prompts by checking the ratio of ASCII characters to total characters (threshold: 80% ASCII). Without this filter, the model trains on non-English text and the WER reward becomes meaningless.
**IMPORTANT:** Always shuffle the streaming dataset. The dataset shards are not randomly ordered; early shards are heavily German, and German text largely passes the ASCII filter, so without shuffling the first several hundred steps would contain almost exclusively non-English prompts. The training script uses `dataset.shuffle(seed=42)` to ensure a uniform language distribution across training.
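A minimal sketch of this setup with the `datasets` streaming API (the repo id and the `text` field name are assumptions, and the real `_is_english` may differ in detail):

```python
from datasets import load_dataset

def is_english(text: str, threshold: float = 0.8) -> bool:
    # Approximate language check: fraction of ASCII characters.
    return bool(text) and sum(ch.isascii() for ch in text) / len(text) >= threshold

# Repo id is illustrative; point this at the actual voice-acting-prompts dataset.
ds = load_dataset("laion/voice-acting-prompts", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=10_000)  # shards are not randomly ordered
english_prompts = (ex for ex in ds if is_english(ex["text"]))
```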
**Reference speakers:** Emolia dataset speaker clusters (3000 clusters of expressive speech). Each training step randomly samples a speaker cluster, and a random utterance within that cluster serves as the reference audio for voice cloning.
**Prompt processing:**

- Stream and shuffle the dataset
- Filter to English-only using the ASCII character ratio
- Clean the text (remove quotes, stage directions, normalize whitespace; see the sketch after this list)
- Pair each prompt with a randomly sampled Emolia speaker reference
- Format as the MOSS 4-message ICL layout: `[empty_user, ref_assistant, target_user, gen_assistant]`
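The cleaning step might look roughly like this; `clean_prompt` is a hypothetical helper illustrating the operations listed above, not the exact code from `grpo_train_v6.py`:

```python
import re

def clean_prompt(text: str) -> str:
    # Hypothetical cleaning helper mirroring the steps described above.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text)  # drop stage directions
    text = text.replace('"', "").replace("\u201c", "").replace("\u201d", "")  # quotes
    return re.sub(r"\s+", " ", text).strip()           # normalize whitespace
```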
### Reward Models

Three reward models run on GPU 7, wrapped by `grpo_rewards.py`:
| Model | What it measures | Output range | Class |
|---|---|---|---|
| ECAPA-TDNN (speechbrain) | Speaker similarity between generated and reference audio | [-1, 1] cosine | SpeakerReward |
| Voice-OpenCLAP | Emotion match: audio-text cosine similarity to emotion description | [-1, 1] cosine | CLAPReward |
| Voice-OpenCLAP (quality) | Audio quality: similarity to "High Quality Recording, fluid pleasant performance" | [-1, 1] cosine | CLAPReward.score_quality() |
| Parakeet TDT 0.6B (or Whisper fallback) | Intelligibility: Word Error Rate of transcribed audio vs target text | [0, inf) WER | ASRReward |
All audio is resampled to 16kHz before reward scoring.
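As an illustration, speaker-similarity scoring with speechbrain's public ECAPA-TDNN checkpoint could look like the sketch below; the checkpoint id and the resampling detail are assumptions about what the `SpeakerReward` class wraps.

```python
import librosa
import torch
from speechbrain.inference.speaker import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def speaker_similarity(gen_wav, ref_wav, sr: int = 24000) -> float:
    # Generated audio is 24 kHz; the reward models expect 16 kHz input.
    gen16 = librosa.resample(gen_wav, orig_sr=sr, target_sr=16000)
    ref16 = librosa.resample(ref_wav, orig_sr=sr, target_sr=16000)
    e_gen = encoder.encode_batch(torch.from_numpy(gen16).unsqueeze(0)).squeeze()
    e_ref = encoder.encode_batch(torch.from_numpy(ref16).unsqueeze(0)).squeeze()
    return torch.nn.functional.cosine_similarity(e_gen, e_ref, dim=-1).item()
```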
### Reward Formula

Individual rewards are z-normalized using running baseline statistics, then combined with a multiplicative WER penalty:

```
R_total = (w_spk * Z(speaker_sim) + w_clap * Z(emotion_match) + w_qual * Z(quality)) * exp(-beta * WER)
```

Where:

- `Z(x) = (x - mean) / std` is z-normalization using baseline statistics
- `w_spk`, `w_clap`, `w_qual` are the reward component weights (must sum to 1.0)
- `beta` is the WER penalty strength (default: 10.0)
- `WER` is the word error rate from the ASR transcription
The exponential WER penalty ensures that unintelligible speech receives near-zero total reward regardless of how well it matches the speaker or emotion. At WER=0, the penalty is 1.0 (no effect); at WER=0.3, the penalty is ~0.05.
Reward weight configurations used:

- v1 (default): `w_spk=0.6, w_clap=0.4, w_qual=0.0` (speaker-heavy, no quality term)
- v7: `w_spk=0.5, w_clap=0.4, w_qual=0.1` (balanced, with a quality bonus)
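In code, the combination is a few lines. A minimal sketch, with the running-baseline statistics simplified to a dict of fixed (mean, std) pairs (the real script updates them online):

```python
import math

def combined_reward(spk: float, clap: float, qual: float, wer: float,
                    baselines: dict,
                    w_spk: float = 0.5, w_clap: float = 0.4,
                    w_qual: float = 0.1, beta: float = 10.0) -> float:
    def z(x: float, key: str) -> float:
        mean, std = baselines[key]  # running baseline statistics
        return (x - mean) / max(std, 1e-6)

    shaped = w_spk * z(spk, "spk") + w_clap * z(clap, "clap") + w_qual * z(qual, "qual")
    return shaped * math.exp(-beta * wer)  # unintelligible speech -> near-zero reward
```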
### GRPO Algorithm Details

For each training step:

- Sample a batch of B=8 prompts from the dataset
- Generate G=4 completions per prompt using SGLang (32 audio samples total)
- Score each completion with all three reward models
- Combine rewards using the formula above
- Compute advantages within each group: `A_i = (R_i - mean(R_group)) / std(R_group)`
- Train using an advantage-weighted cross-entropy loss on the audio tokens
The loss is computed as:

```
L = mean(A_i * CE_loss(completion_i))
```

where `CE_loss` uses channelwise loss weighting (`--channelwise-loss-weight "1,32"` means the text head has weight 1 and the audio heads share a total weight of 32 spread across the 16 codebook heads).
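A sketch of that weighting, assuming the per-head cross-entropy losses are available as a 17-element tensor (text head first, then the 16 audio codebook heads; the ordering and the normalization by the weight sum are assumptions):

```python
import torch

def channelwise_ce(head_losses: torch.Tensor,
                   text_weight: float = 1.0,
                   audio_total: float = 32.0) -> torch.Tensor:
    """head_losses: (17,) CE losses - text head followed by 16 audio heads."""
    n_audio = head_losses.numel() - 1
    weights = torch.full_like(head_losses, audio_total / n_audio)  # 2.0 per audio head
    weights[0] = text_weight
    return (weights * head_losses).sum() / weights.sum()
```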
### Key Training Parameters

| Parameter | Default | Description |
|---|---|---|
| `--lora-init` | (required) | Path to the initial LoRA adapter or HF repo |
| `--output-dir` | `output/grpo_v6` | Directory for checkpoints and logs |
| `--batch-size` | 8 | Prompts per training step |
| `--group-size` | 4 | Completions generated per prompt (G) |
| `--lr` | 5e-5 | Learning rate |
| `--max-steps` | 1600 | Total training steps |
| `--sync-every` | 5 | Steps between LoRA-to-SGLang weight syncs |
| `--save-every` | 200 | Steps between checkpoint saves |
| `--channelwise-loss-weight` | `"1,32"` | Text head vs. total audio head weight |
| `--w-speaker` | 0.5 | Speaker similarity reward weight |
| `--w-clap` | 0.4 | Emotion match reward weight |
| `--w-quality` | 0.1 | Audio quality reward weight |
| `--beta-wer` | 10.0 | WER penalty strength |
| `--train-device` | `cuda:0` | GPU for training |
| `--sglang-gpus` | `1,2,3,4,5,6` | GPUs for the SGLang server (data parallel) |
| `--reward-device` | `cuda:7` | GPU for reward models |
| `--lr-schedule` | `constant` | LR schedule: constant, cosine, or linear |
| `--warmup-steps` | 0 | Linear warmup steps |
| `--lr-min` | 0.0 | Minimum LR for cosine/linear decay |
| `--seed` | 42 | Random seed |
| `--resume-step` | 0 | Resume from this step number |
### LoRA Configuration

LoRA is applied only to `model.language_model`, the Qwen3 backbone that handles sequence modeling. The 16 audio embedding layers (`emb_ext`) and 17 output heads (`lm_heads`) are frozen and unchanged. This is critical because:

- The audio codebook embeddings map discrete VQ codes to continuous representations; modifying them would break the learned codebook alignment
- The output heads project back to codebook logits; these must remain calibrated to the frozen embeddings
- The Qwen3 backbone is where high-level decisions about what to generate are made, making it the right target for RL fine-tuning

LoRA config: rank=8, alpha=16, dropout=0.0, applied to all linear layers in the Qwen3 model (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`), using `TaskType.FEATURE_EXTRACTION`.
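In peft terms, that corresponds roughly to the following (a sketch; the training script may construct the adapter differently):

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_cfg = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# Wrap only the Qwen3 backbone; emb_ext and lm_heads stay frozen.
model.language_model = get_peft_model(model.language_model, lora_cfg)
```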
### Replicating v7 Training

v7 continued from the v1 checkpoint (500 steps) with adjusted reward weights. To replicate:

```bash
# Requires: 8x GPUs (H100 80GB recommended), ~2-3 hours for 500 steps
# The script will automatically:
#   1. Load the base model + LoRA on GPU 0
#   2. Launch the SGLang server on GPUs 1-6
#   3. Load the reward models on GPU 7
#   4. Stream and filter the voice-acting-prompts dataset
python grpo_train_v6.py \
  --lora-init output/grpo/final \
  --output-dir output/grpo_v7 \
  --max-steps 500 \
  --save-every 100 \
  --sync-every 5 \
  --w-speaker 0.5 --w-clap 0.4 --w-quality 0.1 \
  --lr 5e-5
```
To start from scratch (no prior LoRA, fresh rank-8 adapter):
```bash
python grpo_train_v6.py \
  --output-dir output/grpo_from_scratch \
  --max-steps 500 \
  --save-every 100 \
  --sync-every 5 \
  --w-speaker 0.5 --w-clap 0.4 --w-quality 0.1 \
  --lr 5e-5
```
Environment variables (optional, for custom paths):
```bash
export MVG_DIR=/path/to/MOSS-VoiceGenerator        # Base model
export CODEC_DIR=/path/to/MOSS-Audio-Tokenizer     # Audio codec
export EMOLIA_DIR=/path/to/emolia/cluster_samples  # Reference speakers
export CLAP_DIR=/path/to/voice-openclap-poc        # CLAP model
```
### Training Tips

- Monitor WER closely. If WER rises above ~0.3, the model is producing unintelligible speech; consider increasing `--beta-wer` or reducing the learning rate.
- Speaker similarity and emotion match trade off. Increasing `--w-speaker` improves voice-cloning fidelity but may reduce emotional expressiveness, and vice versa.
- Checkpoint every 100-200 steps. Peak metrics often occur mid-training (v7 peaked at step 258 for speaker similarity and step 472 for CLAP), so the final checkpoint may not be the best one.
- `--sync-every 5` is a good default. Lower values (1-2) keep generation closer to on-policy but increase weight-sync overhead; higher values (10+) risk the SGLang server generating with stale weights, reducing training signal quality.
- The channelwise loss weight `"1,32"` means the text token head has weight 1 and the 16 audio codebook heads share a total weight of 32 (2 per head), upweighting audio token prediction relative to text token prediction.
## Citation

```bibtex
@misc{moss-grpo-lora-2026,
  title={MOSS-VoiceGenerator GRPO LoRA},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/MOSS-VoiceGenerator-GRPO-LoRA}
}
```
## License
Same as the base MOSS-VoiceGenerator model.