Ornith 1.0 35B ModelOpt NVFP4 Expert
This repository contains a community ModelOpt NVFP4 experts-only quantization of
deepreinforce-ai/Ornith-1.0-35B.
This is not an official DeepReinforce release. The source BF16 checkpoint is unchanged and remains available from the upstream repository.
Model Details
| Field | Value |
|---|---|
| Base model | deepreinforce-ai/Ornith-1.0-35B |
| Base revision | 5df2ed3f675c7beaa490328cc70bb573b65fb660 |
| Release repo | LS-ML/Ornith-1.0-35B-ModelOpt-NVFP4-Expert |
| Architecture | Qwen3_5MoeForConditionalGeneration |
| Model type | qwen3_5_moe |
| Modality | Text and vision-language |
| Max context tested | 262144 tokens |
| Quantization | ModelOpt NVFP4, experts-only target |
| Group size | 16 |
| Exported checkpoint size | 23G on disk |
| Source BF16 checkpoint size | 66G on disk |
| License | MIT |
The repository name uses Expert, but the quantization target is experts-only:
the MoE expert linear weights are quantized while embeddings, lm head, visual
modules, attention, linear-attention modules, and shared experts are excluded.
See hf_quant_config.json and config.json for the exact ModelOpt
quantization metadata.
Benchmark Summary
The checkpoint was benchmarked against the upstream BF16 model on the same DGX Spark / GB10, same vLLM nightly container, same 262K serving profile, same FP8 KV cache allocation, and same benchmark harness.
Performance summary:
| Metric | BF16 | ModelOpt NVFP4 | Result |
|---|---|---|---|
| Model directory size | 66G |
23G |
2.9x smaller |
| vLLM model loading memory | 65.53 GiB |
22.41 GiB |
2.9x lower |
| Weight loading time | 454.55s |
153.13s |
3.0x faster |
| Text prefill speedup range | baseline | 1.08x to 1.41x |
faster in every row |
| Text decode speedup range | baseline | 1.16x to 1.45x |
faster in every row |
| 4K image prefill | 1204.6 tok/s |
1357.6 tok/s |
1.13x faster |
| 4K image decode | 30.0 tok/s |
36.1 tok/s |
1.20x faster |
Accuracy summary:
| Benchmark | BF16 acc / score | NVFP4 acc / score | Delta NVFP4-BF16 |
|---|---|---|---|
| MMLU-Pro | 55.8% |
55.0% |
-0.8 pp |
| GPQA Diamond mirror | 22.7% |
13.6% |
-9.1 pp |
| MATH-500 | 34.6% |
36.6% |
+2.0 pp |
| HumanEval+ | 39.0% |
21.3% |
-17.7 pp |
| MBPP+ | 78.0% |
76.5% |
-1.6 pp |
| MMMU validation | 57.4% |
56.1% |
-1.3 pp |
| OCRBench | 69.9% |
70.2% |
+0.3 pp |
| BFCL v3 10-each subset | 1.0% |
0.0% |
-1.0 pp |
| ToolEvalBench | 82 |
88 |
+6 |
Detailed performance and accuracy benchmark tables are included below.
Benchmark Graphics
Performance benchmark:
Accuracy and quality benchmark:
Quantization
The checkpoint was produced with NVIDIA ModelOpt from a fused-expert staging copy of the upstream BF16 model. The upstream checkpoint stores split expert tensors; the staging copy fused expert gate/up/down tensors into the layout expected by current Transformers and ModelOpt. The upstream BF16 checkpoint itself was not modified.
Quantization metadata:
| Field | Value |
|---|---|
| ModelOpt source version | 0.46.0.dev106+g6cc522658 |
| ModelOpt commit | 6cc5226588f0668679df03ba4646b7dfec32f99c |
| Quant algorithm | NVFP4 |
| KV cache quantization during export | none |
| Calibration dataset | nvidia/Nemotron-SFT-Agentic-v2, split search |
| Calibration settings | calib_size=16, calib_seq=512, batch_size=1 |
| Export mode | low-memory ModelOpt path |
The quantization notes used during export are included in
QUANTIZATION_NOTES.md.
Serving With vLLM
Validated serving stack:
vllm/vllm-openai:nightly- ModelOpt quantization loader:
--quantization modelopt - FlashInfer attention backend
- Blackwell / GB10 tested with
CUTE_DSL_ARCH=sm_121a - OpenAI-compatible chat completions
- Text, image, and Qwen3 XML tool-call smoke tests
Example full-context DGX Spark / GB10 launch:
docker run --rm \
--name ornith35-nvfp4-vllm \
--gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=32g \
-p 8000:8000 \
-v /path/to/models:/models \
-v ~/.cache:/root/.cache \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e CUTE_DSL_ARCH=sm_121a \
vllm/vllm-openai:nightly \
--model /models/Ornith-1.0-35B-ModelOpt-NVFP4-Expert \
--host 0.0.0.0 \
--port 8000 \
--served-model-name ornith35-nvfp4 \
--trust-remote-code \
--dtype bfloat16 \
--quantization modelopt \
--kv-cache-dtype fp8 \
--kv-cache-memory-bytes 28G \
--attention-backend flashinfer \
--max-model-len 262144 \
--max-num-seqs 4 \
--max-num-batched-tokens 8192 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--enable-chunked-prefill
Lower-memory smoke profile:
docker run --rm \
--name ornith35-nvfp4-vllm-smoke \
--gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=32g \
-p 8000:8000 \
-v /path/to/models:/models \
-v ~/.cache:/root/.cache \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e CUTE_DSL_ARCH=sm_121a \
vllm/vllm-openai:nightly \
--model /models/Ornith-1.0-35B-ModelOpt-NVFP4-Expert \
--host 0.0.0.0 \
--port 8000 \
--served-model-name ornith35-nvfp4 \
--trust-remote-code \
--dtype bfloat16 \
--quantization modelopt \
--kv-cache-dtype fp8 \
--kv-cache-memory-bytes 4G \
--attention-backend flashinfer \
--max-model-len 8192 \
--max-num-seqs 1 \
--max-num-batched-tokens 8192 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--enable-chunked-prefill
For non-thinking deterministic chat/eval requests, use:
{
"temperature": 0,
"chat_template_kwargs": {
"enable_thinking": false
}
}
Validation
The quantized checkpoint was validated on a local DGX Spark / GB10 against the BF16 upstream checkpoint using the same vLLM runtime profile, same maximum context, same FP8 KV cache allocation, and the same benchmark harness.
Performance Benchmark Results
All text rows below used unique prompts, prefix caching disabled, max_tokens=128,
and submitted concurrency 1, 2, or 4.
| Context | Conc | BF16 prefill | NVFP4 prefill | Prefill speedup | BF16 decode | NVFP4 decode | Decode speedup | BF16 TTFT | NVFP4 TTFT |
|---|---|---|---|---|---|---|---|---|---|
| 32k | 1 | 3288.8 | 4626.2 | 1.41x | 29.6 | 35.5 | 1.20x | 10.0s | 7.1s |
| 32k | 2 | 4349.7 | 5412.5 | 1.24x | 55.4 | 70.2 | 1.27x | 15.1s | 12.1s |
| 32k | 4 | 4425.1 | 5512.5 | 1.25x | 82.1 | 118.9 | 1.45x | 29.6s | 23.8s |
| 64k | 1 | 3656.1 | 4348.0 | 1.19x | 28.2 | 33.5 | 1.19x | 17.9s | 15.1s |
| 64k | 2 | 3625.8 | 4315.6 | 1.19x | 48.5 | 63.6 | 1.31x | 36.1s | 30.4s |
| 64k | 4 | 3594.4 | 4289.9 | 1.19x | 79.7 | 103.3 | 1.30x | 72.9s | 61.1s |
| 128k | 1 | 2653.1 | 3003.2 | 1.13x | 26.1 | 30.7 | 1.18x | 49.4s | 43.6s |
| 128k | 2 | 2632.2 | 2988.2 | 1.14x | 45.0 | 55.5 | 1.23x | 99.6s | 87.7s |
| 128k | 4 | 2607.5 | 2969.3 | 1.14x | 64.0 | 84.4 | 1.32x | 201.1s | 176.6s |
| full | 1 | 1713.5 | 1853.7 | 1.08x | 22.2 | 25.7 | 1.16x | 152.8s | 141.3s |
| full | 2 | 1704.9 | 1838.8 | 1.08x | 37.5 | 43.8 | 1.17x | 307.2s | 284.8s |
| full | 4 | 1702.5 | 1840.8 | 1.08x | 56.7 | 71.0 | 1.25x | 615.3s | 569.0s |
Throughput units are tokens per second. TTFT is max time to first token for the concurrent batch.
The serial 4096 x 4096 image test:
| Case | BF16 prefill | NVFP4 prefill | Prefill speedup | BF16 decode | NVFP4 decode | Decode speedup | BF16 TTFT | NVFP4 TTFT |
|---|---|---|---|---|---|---|---|---|
| 4K image | 1204.6 | 1357.6 | 1.13x | 30.0 | 36.1 | 1.20x | 13.6s | 12.1s |
Startup and memory observations:
| Metric | BF16 | ModelOpt NVFP4 |
|---|---|---|
| Remote model directory size | 66G |
23G |
| vLLM model loading memory | 65.53 GiB |
22.41 GiB |
| Weight loading time | 454.55s |
153.13s |
| Engine init observed to health | about 610s |
about 290s |
| MemAvailable after health | about 17 GiB |
about 59 GiB |
Accuracy Benchmark Results
Quality runs used deterministic decoding with temperature=0,
chat_template_kwargs={"enable_thinking": false}, and paired per-item scoring.
The key quantization regression count is "BF16 correct / NVFP4 wrong".
| Benchmark | Items | BF16 acc | NVFP4 acc | Delta NVFP4-BF16 | BF16 correct/NVFP4 wrong | NVFP4 correct/BF16 wrong |
|---|---|---|---|---|---|---|
| MMLU-Pro | 12032 | 55.8% | 55.0% | -0.8 pp | 527 | 429 |
| GPQA Diamond mirror | 198 | 22.7% | 13.6% | -9.1 pp | 23 | 5 |
| MATH-500 | 500 | 34.6% | 36.6% | +2.0 pp | 17 | 27 |
| HumanEval+ | 164 | 39.0% | 21.3% | -17.7 pp | 33 | 4 |
| MBPP+ | 378 | 78.0% | 76.5% | -1.6 pp | 13 | 7 |
| MMMU validation | 900 | 57.4% | 56.1% | -1.3 pp | 54 | 42 |
| OCRBench | 1000 | 69.9% | 70.2% | +0.3 pp | 17 | 20 |
| BFCL v3 10-each subset | 100 | 1.0% | 0.0% | -1.0 pp | 1 | 0 |
ToolEvalBench version 2.0.7 was run sequentially over 69 standard scenarios:
| Model | Final score | Points | Deployability | Responsiveness | Safety warnings |
|---|---|---|---|---|---|
| BF16 | 82 | 113 / 138 | 74 | 54 | 3 |
| NVFP4 | 88 | 121 / 138 | 81 | 64 | 1 |
Interpretation: this NVFP4 export is close to BF16 on broad multiple-choice, multimodal, OCR, and MBPP-style code tasks, but it showed clear regressions on GPQA Diamond and HumanEval+ in this run.
Caveats
- This is a community quantization, not an official upstream release.
- The checkpoint has been validated with vLLM ModelOpt loading. Other loaders may not support this ModelOpt NVFP4 format.
- vLLM marks ModelOpt NVFP4 support as experimental, so revalidate after major vLLM, ModelOpt, CUDA, or FlashInfer changes.
- The export does not include FP8 KV q/prob scaling factors. When serving with
--kv-cache-dtype fp8, vLLM reports that it uses scale1.0; treat this as a quality caveat for accuracy-sensitive workloads. - GPQA used the ungated
fingertap/GPQA-Diamondmirror because the official dataset was gated at the time of evaluation. - HumanEval+ and MBPP+ used EvalPlus prompts and expanded inputs, but scoring used a local subprocess checker because the official EvalPlus sandbox failed on the local macOS host with resource-limit errors.
- BFCL v3 10-each should be treated as raw paired signal only; ToolEvalBench is the stronger tool-use benchmark in this report.
Responsible Use
Use this model consistently with the upstream model license and any applicable laws or platform policies. Because this is a quantized derivative, evaluate it for your own target domain before relying on it in production or accuracy-sensitive workflows.
License And Attribution
The upstream model is MIT licensed. This quantized release preserves the MIT
license and attribution to deepreinforce-ai/Ornith-1.0-35B.
Upstream model: https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B
Quantized by LS-ML.
- Downloads last month
- 174
Model tree for LS-ML/Ornith-1.0-35B-ModelOpt-NVFP4-Expert
Base model
deepreinforce-ai/Ornith-1.0-35B
