SoulX-Singer NVFP4 (Quantized)
This is the NVIDIA FP4 (NVFP4) weight-only quantized version of the original SoulX-Singer zero-shot singing voice synthesis model.
It is produced by a one-time torchao NVFP4 quantization pass over the official
fp32 base model, and is verified to be true 4-bit quantization (not a
pseudo-quantization / fake-quant) by direct on-GPU forward comparison against
the original fp32 weights.
Source Model
- Base model: Soul-AILab/SoulX-Singer (fp32, 2687.54 MB)
- Quantization format: NVIDIA FP4 weight-only (E2M1 4-bit elements + float8_e4m3fn per-block scales, 16-element blocks)
- Tooling: torchao>=0.17.0 (NVFP4WeightOnlyConfig)
- Hardware requirement: NVIDIA Blackwell GPU (sm100+), e.g. RTX 5060 / 5090 / B200
- Verified on: NVIDIA GeForce RTX 5060 Laptop GPU (sm120), CUDA 13.2, torch 2.12.0
Why NVFP4
NVFP4 is the native 4-bit floating-point format introduced with the Blackwell architecture. Compared with INT4 / W4A16 pseudo-quantization, NVFP4 has dedicated hardware units on sm100+ and runs the actual 4-bit matmul on silicon, so no on-the-fly dequantization to fp16/bf16 is required inside the GEMM. This gives both real memory savings and real compute throughput on Blackwell.
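To make the storage format concrete, here is a minimal numerical sketch of block-scaled E2M1 rounding in plain Python. It is an emulation for building intuition only, not the torchao or hardware code path. The names `E2M1_GRID`, `quantize_block` and `emulate_roundtrip` are illustrative, the block size of 16 is an assumption of the sketch, and the per-block scale is kept as a Python float instead of being rounded to float8_e4m3fn.

```python
import math

# The eight non-negative magnitudes representable in E2M1 (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed E2M1 grid values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    # Assumption: the scale stays a Python float (no float8_e4m3fn rounding).
    scale = amax / E2M1_MAX
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def emulate_roundtrip(weights, block_size=16):
    """Quantize then dequantize a flat list of floats, block by block."""
    restored = []
    for start in range(0, len(weights), block_size):
        scale, codes = quantize_block(weights[start:start + block_size])
        restored.extend(code * scale for code in codes)
    return restored


weights = [0.01 * i for i in range(-16, 16)]  # 32 values, i.e. two blocks of 16
restored = emulate_roundtrip(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the widest gap on the E2M1 grid is between 4 and 6, the worst-case rounding error inside a block is one scale unit, which is why blocks with large-magnitude weights show larger absolute error.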
Quantization Scope
Only nn.Linear layers with both weight dimensions divisible by 16 are
quantized to NVFP4. The following tensors stay in their original precision:
- nn.Embedding (phoneme / pitch / note-type / F0 encoders)
- nn.Conv1d (depth-wise convolutions in ConvNeXtV2 blocks)
- nn.LayerNorm / GRN parameters
- All biases
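The selection rule above can be sketched without torch as a predicate over layer type and weight shape. The function name `is_nvfp4_eligible` and the layer names in the inventory are hypothetical and only illustrate the rule; they are not taken from the model.

```python
def is_nvfp4_eligible(layer_type, weight_shape, multiple=16):
    """True only for Linear layers whose two weight dimensions are multiples of 16."""
    if layer_type != "Linear" or len(weight_shape) != 2:
        return False
    out_features, in_features = weight_shape
    return out_features % multiple == 0 and in_features % multiple == 0


# Hypothetical inventory: name -> (layer type, weight shape).
layers = {
    "example.attn.q_proj": ("Linear", (1024, 1024)),
    "example.pwconv1": ("Linear", (4096, 1024)),
    "example.odd_head": ("Linear", (1024, 100)),   # 100 is not a multiple of 16
    "example.dwconv": ("Conv1d", (1024, 1, 7)),    # not a Linear layer
    "example.embed": ("Embedding", (512, 1024)),   # not a Linear layer
}

selected = [name for name, (kind, shape) in layers.items()
            if is_nvfp4_eligible(kind, shape)]
```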
Layer coverage
| Statistic | Value |
|---|---|
| Total nn.Linear layers in the model | 277 |
| Layers quantized to NVFP4 | 276 |
| Layers left in higher precision | 1 (single non-divisible-by-16 Linear) |
| Total state-dict entries | 587 |
| NVFP4Tensor entries | 276 |
| Regular torch.Tensor entries | 311 |
Quantized module groups
| Group | # layers | Examples |
|---|---|---|
| preflow (ConvNeXtV2 blocks) | 8 | preflow.{0..3}.pwconv1/2 |
| cfm_decoder.model.cond_emb | 1 | conditioning embedding projection |
| cfm_decoder.model.diff_estimator (DiffLlama, 22 layers) | 207 | attention Q/K/V/O, MLP up/down, layernorm gates, time-step MLP |
| vocoder.model.backbone (ConvNeXtV2, 30 blocks) | 60 | convnext.{0..29}.pwconv1/2 |
File Size
| Variant | Size | Compression |
|---|---|---|
| SoulX-Singer/model.pt (fp32) | 2687.54 MB | 1.00× |
| SoulX-Singer-nvfp4/model.pt (NVFP4) | 396.62 MB | 6.78× |
The achieved 6.78× compression is close to the theoretical 4-bit weight-only ratio (~8× for the weight payload alone). The gap comes from the per-block scales stored alongside the 4-bit values and from the un-quantized embeddings, convolutions, layer norms, and biases.
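The arithmetic behind these ratios can be worked through in a few lines. The file sizes come from the table above. The per-weight bit budget (4-bit element plus one 8-bit scale shared by a 16-element block) is an assumption of this sketch about the storage layout, and `bits_per_weight` is an illustrative helper.

```python
FP32_MB = 2687.54   # size of the fp32 checkpoint, from the table above
NVFP4_MB = 396.62   # size of the NVFP4 checkpoint, from the table above

achieved = FP32_MB / NVFP4_MB


def bits_per_weight(element_bits=4, scale_bits=8, block_size=16):
    """Average storage per quantized weight, including its share of the block scale."""
    return element_bits + scale_bits / block_size


payload_only_ratio = 32 / 4                  # ignores scales entirely
with_scales_ratio = 32 / bits_per_weight()   # 32 / 4.5
effective_bits = 32 / achieved               # whole-file average bits per fp32 element

print(f"achieved {achieved:.2f}x, payload-only bound {payload_only_ratio:.2f}x, "
      f"bound with scales {with_scales_ratio:.2f}x, "
      f"effective {effective_bits:.2f} bits/element")
```

Under these assumptions the weight-only upper bound is about 7.1× rather than 8×, so the measured ratio sits closer to the achievable limit than the payload-only figure suggests.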
Precision Verification
The verification is performed by loading both the original fp32 model
and the NVFP4 model on a Blackwell GPU, then running the same random
fp32 input through every NVFP4-quantized nn.Linear using native NVFP4
matmul kernels (no dequantization, no fake-quant emulation). Outputs are
compared layer-by-layer against the fp32 reference.
Per-group results (fp32 reference vs. native NVFP4 forward)
| Group | # layers | Mean MSE | Mean RMSE | Mean MaxAbs | Mean RelRMSE | Mean Cosine |
|---|---|---|---|---|---|---|
| cond_emb | 1 | 1.514e-03 | 0.0389 | 0.1714 | 9.449% | 0.99553 |
| diff_estimator | 207 | 7.512e-03 | 0.0828 | 0.5202 | 9.502% | 0.99539 |
| preflow | 8 | 4.911e-03 | 0.0700 | 0.3197 | 9.590% | 0.99538 |
| vocoder | 60 | 1.339e-01 | 0.3455 | 2.1242 | 9.446% | 0.99550 |
| OVERALL | 276 | 3.488e-02 | 0.1394 | 6.5048 (max over all layers) | 9.492% | 0.99541 |
Metrics:
- MSE / RMSE: element-wise mean squared / root-mean-squared error of the layer output.
- MaxAbs: maximum absolute element-wise error in the output.
- RelRMSE: RMSE divided by the RMS magnitude of the fp32 reference output.
- Cosine: mean cosine similarity over the output feature dimension.
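A minimal sketch of how these four metrics could be computed for one layer, using plain Python lists so it runs without torch. The function name `layer_metrics` is illustrative. The cosine is interpreted here as per-row similarity along the feature dimension, averaged over rows, which is an assumption about the definition above. Rows with zero norm are not handled.

```python
import math


def layer_metrics(reference, quantized):
    """Compare two outputs given as lists of rows (one row per output vector)."""
    diffs = [q - r
             for ref_row, q_row in zip(reference, quantized)
             for r, q in zip(ref_row, q_row)]
    ref_values = [r for row in reference for r in row]

    mse = sum(d * d for d in diffs) / len(diffs)
    rmse = math.sqrt(mse)
    max_abs = max(abs(d) for d in diffs)
    ref_rms = math.sqrt(sum(r * r for r in ref_values) / len(ref_values))

    cosines = []
    for ref_row, q_row in zip(reference, quantized):
        dot = sum(r * q for r, q in zip(ref_row, q_row))
        norms = (math.sqrt(sum(r * r for r in ref_row))
                 * math.sqrt(sum(q * q for q in q_row)))
        cosines.append(dot / norms)

    return {
        "mse": mse,
        "rmse": rmse,
        "max_abs": max_abs,
        "rel_rmse": rmse / ref_rms,
        "cosine": sum(cosines) / len(cosines),
    }


same = layer_metrics([[3.0, 4.0], [1.0, 0.0]], [[3.0, 4.0], [1.0, 0.0]])
orthogonal = layer_metrics([[1.0, 0.0]], [[0.0, 1.0]])
```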
Worst layers by max abs error
| Layer | Shape | MaxAbs | Cosine |
|---|---|---|---|
| vocoder.model.backbone.convnext.29.pwconv2 | 1024 × 4096 | 6.5048 | 0.99549 |
| vocoder.model.backbone.convnext.24.pwconv2 | 1024 × 4096 | 3.4804 | 0.99567 |
| vocoder.model.backbone.convnext.25.pwconv1 | 4096 × 1024 | 3.4157 | 0.99564 |
| vocoder.model.backbone.convnext.27.pwconv1 | 4096 × 1024 | 3.3751 | 0.99563 |
| vocoder.model.backbone.convnext.14.pwconv2 | 1024 × 4096 | 3.2466 | 0.99553 |
All worst-case layers are in the vocoder backbone, which is expected: the BigVGAN-style ConvNeXt blocks operate on large-magnitude activations and have the largest absolute weight values, so they show the largest absolute errors even though the direction (cosine) is still well-preserved.
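The OVERALL row of the per-group table can be cross-checked from the group rows with a layer-weighted mean. The numbers below are copied from that table; the variable names are illustrative. The check reproduces the overall MSE, and it also shows that the layer-weighted mean of the per-group MaxAbs values is well below 6.5048, so that overall figure is the single worst layer, not an average.

```python
# group -> (number of layers, mean MSE, mean MaxAbs), copied from the per-group table
groups = {
    "cond_emb": (1, 1.514e-03, 0.1714),
    "diff_estimator": (207, 7.512e-03, 0.5202),
    "preflow": (8, 4.911e-03, 0.3197),
    "vocoder": (60, 1.339e-01, 2.1242),
}

total_layers = sum(n for n, _, _ in groups.values())
weighted_mse = sum(n * mse for n, mse, _ in groups.values()) / total_layers
weighted_max_abs = sum(n * max_abs for n, _, max_abs in groups.values()) / total_layers

print(f"layers={total_layers} mse={weighted_mse:.3e} mean_max_abs={weighted_max_abs:.4f}")
```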
Worst layers by cosine similarity
All 276 quantized layers stay above 0.9942 cosine similarity, with the lowest being a few layernorm gate projections inside the DiffLlama transformer. The mean over all 276 layers is 0.9954.
Verdict
| Indicator | Value |
|---|---|
| Mean cosine similarity (all 276 layers) | 0.9954 |
| Mean relative RMSE | 9.49% |
| Max abs error (single layer) | 6.505 |
| Quantization grade | GOOD (real NVFP4) |
Conclusion: The checkpoint contains genuine NVFP4-typed weight tensors
(NVFP4Tensor instances), runs through native NVFP4 GEMM kernels on
Blackwell, achieves a 6.78× model-size reduction, and preserves per-layer
output direction to >0.995 cosine similarity on average. This is real
4-bit weight-only quantization, not a pseudo-quantization wrapper.
How to Use
Requirements
pip install "torch>=2.12" "torchao>=0.17.0" omegaconf
# CUDA 13.x with Blackwell (sm100+) support
Loading the NVFP4 model directly (no dequantization)
import torch
import torch.nn as nn
from omegaconf import OmegaConf
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import NVFP4WeightOnlyConfig
from soulxsinger.models.soulxsinger import SoulXSinger


def nvfp4_filter(mod, fqn):
    # Quantize only nn.Linear layers whose weight dimensions are both multiples of 16.
    if not isinstance(mod, nn.Linear):
        return False
    N, K = mod.weight.shape
    return K % 16 == 0 and N % 16 == 0


config = OmegaConf.load("soulxsinger/config/soulxsinger.yaml")
model = SoulXSinger(config)

# Convert eligible Linear weights to NVFP4Tensor wrappers.
quantize_(model, NVFP4WeightOnlyConfig(), nvfp4_filter)

# Load saved NVFP4 weights. NVFP4Tensor does NOT support copy_ via
# load_state_dict, so we assign each quantized Linear's weight directly.
ckpt = torch.load("pretrained_models/SoulX-Singer-nvfp4/model.pt",
                  map_location="cpu", weights_only=False)
assert ckpt["nvfp4_quantized"] is True
nvfp4_sd = ckpt["state_dict"]

# Snapshot the parameter list first, because the loop replaces parameters.
for name, param in list(model.named_parameters()):
    if name not in nvfp4_sd:
        continue
    saved = nvfp4_sd[name]
    if type(saved).__name__ != "NVFP4Tensor":
        param.data.copy_(saved)  # embedding / conv / layernorm / bias
        continue
    # Direct assignment keeps the weight in NVFP4Tensor format.
    parent_name = name.rsplit(".", 1)[0]
    target_mod = model.get_submodule(parent_name)
    # Inference only, so the quantized weight does not need gradients.
    target_mod.weight = nn.Parameter(saved.to(device="cuda"), requires_grad=False)

model = model.to("cuda").eval()
# Forward passes now use native NVFP4 matmul on Blackwell.
Inference example
bash example/infer.sh
# (Point the script at pretrained_models/SoulX-Singer-nvfp4/model.pt)
File Layout
SoulX-Singer-nvfp4/
├── model.pt    # NVFP4-quantized state_dict (396.62 MB)
└── README.md   # this file
The checkpoint model.pt is a regular torch.save dict with the following
keys:
| Key | Type | Meaning |
|---|---|---|
| state_dict | dict[str, Tensor \| NVFP4Tensor] | 587 entries (276 NVFP4Tensor + 311 Tensor) |
| nvfp4_quantized | bool | Always True; marks this as a real NVFP4 checkpoint |
| torchao_required | bool | Always True; torchao is required to load/forward |
| orig_model_path | str | Path to the fp32 source model that was quantized |
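A quick way to confirm this layout is to count state-dict entries by type name. The sketch below uses stand-in classes so it runs without torch or torchao. `summarize_checkpoint` and the demo dictionary are illustrative; with the real file you would pass the dict returned by `torch.load(..., weights_only=False)` and, per the table above, expect 276 NVFP4Tensor entries and 311 regular tensors.

```python
from collections import Counter


def summarize_checkpoint(ckpt):
    """Report the metadata flags and how many state-dict entries have each type."""
    counts = Counter(type(value).__name__ for value in ckpt["state_dict"].values())
    return {
        "nvfp4_quantized": ckpt.get("nvfp4_quantized", False),
        "torchao_required": ckpt.get("torchao_required", False),
        "entries_by_type": dict(counts),
    }


# Stand-ins for torch.Tensor and torchao's NVFP4Tensor (demo only).
class Tensor:
    pass


class NVFP4Tensor:
    pass


demo_ckpt = {
    "state_dict": {
        "block.linear.weight": NVFP4Tensor(),
        "block.linear.bias": Tensor(),
        "embed.weight": Tensor(),
    },
    "nvfp4_quantized": True,
    "torchao_required": True,
    "orig_model_path": "<path to fp32 model>",
}

summary = summarize_checkpoint(demo_ckpt)
```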
Limitations
- Hardware gated. NVFP4 native kernels require NVIDIA Blackwell (sm100+). On older GPUs (Ada / Hopper / Ampere), this checkpoint will either fail to load or silently fall back to a slow emulation path; use the fp32 or an INT8 variant there instead.
- Weight-only. Activations and gradients are not quantized; only nn.Linear weights are NVFP4. Embeddings / convs / norms stay fp32.
- No calibration. NVFP4 weight-only uses the weights' own per-block statistics for scaling, so no representative dataset is needed. The ~9.5% mean relative RMSE is the intrinsic precision floor of 4-bit weights, without any K-V or activation quantization added on top.
- LayerNorm gate projections in the DiffLlama show slightly higher directional error than other layers (~0.9942 cosine). If you observe quality degradation on specific singing samples, consider keeping the to_weight layers in fp16 (they are small).
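Because of the hardware gating noted above, it can help to check the GPU's compute capability before loading the checkpoint. A minimal sketch follows. `supports_native_nvfp4` is a hypothetical helper that encodes only the sm100+ rule stated in this README. On a real machine the tuple would come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`.

```python
def supports_native_nvfp4(capability):
    """True when a (major, minor) compute capability is sm100 or newer."""
    major, _minor = capability
    return major >= 10


# Example capabilities: sm120 (RTX 50 series), sm100, sm90 (Hopper), sm89 (Ada).
checks = {cap: supports_native_nvfp4(cap)
          for cap in [(12, 0), (10, 0), (9, 0), (8, 9)]}

# With torch installed, the live check would look like:
#   import torch
#   ok = torch.cuda.is_available() and supports_native_nvfp4(
#       torch.cuda.get_device_capability())
```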
License
Apache 2.0, inherited from the base Soul-AILab/SoulX-Singer model. See
LICENSE for details.
Citation
@misc{soulxsinger,
title={SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis},
author={Jiale Qian and Hao Meng and Tian Zheng and Pengcheng Zhu and Haopeng Lin and Yuhang Dai and Hanke Xie and Wenxiao Cao and Ruixuan Shang and Jun Wu and Hongmei Liu and Hanlin Wen and Jian Zhao and Zhonglin Jiang and Yong Chen and Shunshun Yin and Ming Tao and Jianguo Wei and Lei Xie and Xinsheng Wang},
year={2026},
eprint={2602.07803},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2602.07803},
}
Contact
For questions about the NVFP4 quantization pipeline in this repository,
please open an issue. For questions about the
original SoulX-Singer model, contact Soul-AILab qianjiale@soulapp.cn /
menghao@soulapp.cn / wangxinsheng@soulapp.cn.