Ornith 1.0 35B ModelOpt NVFP4 Expert

This repository contains a community ModelOpt NVFP4 experts-only quantization of deepreinforce-ai/Ornith-1.0-35B.

This is not an official DeepReinforce release. The source BF16 checkpoint is unchanged and remains available from the upstream repository.

Model Details

Field	Value
Base model	`deepreinforce-ai/Ornith-1.0-35B`
Base revision	`5df2ed3f675c7beaa490328cc70bb573b65fb660`
Release repo	`LS-ML/Ornith-1.0-35B-ModelOpt-NVFP4-Expert`
Architecture	`Qwen3_5MoeForConditionalGeneration`
Model type	`qwen3_5_moe`
Modality	Text and vision-language
Max context tested	`262144` tokens
Quantization	ModelOpt `NVFP4`, experts-only target
Group size	`16`
Exported checkpoint size	`23G` on disk
Source BF16 checkpoint size	`66G` on disk
License	MIT

The repository name uses Expert, but the quantization target is experts-only: the MoE expert linear weights are quantized while embeddings, lm head, visual modules, attention, linear-attention modules, and shared experts are excluded. See hf_quant_config.json and config.json for the exact ModelOpt quantization metadata.

Benchmark Summary

The checkpoint was benchmarked against the upstream BF16 model on the same DGX Spark / GB10, same vLLM nightly container, same 262K serving profile, same FP8 KV cache allocation, and same benchmark harness.

Performance summary:

Metric	BF16	ModelOpt NVFP4	Result
Model directory size	`66G`	`23G`	`2.9x` smaller
vLLM model loading memory	`65.53 GiB`	`22.41 GiB`	`2.9x` lower
Weight loading time	`454.55s`	`153.13s`	`3.0x` faster
Text prefill speedup range	baseline	`1.08x` to `1.41x`	faster in every row
Text decode speedup range	baseline	`1.16x` to `1.45x`	faster in every row
4K image prefill	`1204.6 tok/s`	`1357.6 tok/s`	`1.13x` faster
4K image decode	`30.0 tok/s`	`36.1 tok/s`	`1.20x` faster

Accuracy summary:

Benchmark	BF16 acc / score	NVFP4 acc / score	Delta NVFP4-BF16
MMLU-Pro	`55.8%`	`55.0%`	`-0.8 pp`
GPQA Diamond mirror	`22.7%`	`13.6%`	`-9.1 pp`
MATH-500	`34.6%`	`36.6%`	`+2.0 pp`
HumanEval+	`39.0%`	`21.3%`	`-17.7 pp`
MBPP+	`78.0%`	`76.5%`	`-1.6 pp`
MMMU validation	`57.4%`	`56.1%`	`-1.3 pp`
OCRBench	`69.9%`	`70.2%`	`+0.3 pp`
BFCL v3 10-each subset	`1.0%`	`0.0%`	`-1.0 pp`
ToolEvalBench	`82`	`88`	`+6`

Detailed performance and accuracy benchmark tables are included below.

Benchmark Graphics

Performance benchmark:

Accuracy and quality benchmark:

Quantization

The checkpoint was produced with NVIDIA ModelOpt from a fused-expert staging copy of the upstream BF16 model. The upstream checkpoint stores split expert tensors; the staging copy fused expert gate/up/down tensors into the layout expected by current Transformers and ModelOpt. The upstream BF16 checkpoint itself was not modified.

Quantization metadata:

Field	Value
ModelOpt source version	`0.46.0.dev106+g6cc522658`
ModelOpt commit	`6cc5226588f0668679df03ba4646b7dfec32f99c`
Quant algorithm	`NVFP4`
KV cache quantization during export	none
Calibration dataset	`nvidia/Nemotron-SFT-Agentic-v2`, split `search`
Calibration settings	`calib_size=16`, `calib_seq=512`, `batch_size=1`
Export mode	low-memory ModelOpt path

The quantization notes used during export are included in QUANTIZATION_NOTES.md.

Serving With vLLM

Validated serving stack:

vllm/vllm-openai:nightly
ModelOpt quantization loader: --quantization modelopt
FlashInfer attention backend
Blackwell / GB10 tested with CUTE_DSL_ARCH=sm_121a
OpenAI-compatible chat completions
Text, image, and Qwen3 XML tool-call smoke tests

Example full-context DGX Spark / GB10 launch:

docker run --rm \
  --name ornith35-nvfp4-vllm \
  --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=32g \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v ~/.cache:/root/.cache \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  --model /models/Ornith-1.0-35B-ModelOpt-NVFP4-Expert \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name ornith35-nvfp4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --kv-cache-memory-bytes 28G \
  --attention-backend flashinfer \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-chunked-prefill

Lower-memory smoke profile:

docker run --rm \
  --name ornith35-nvfp4-vllm-smoke \
  --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=32g \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v ~/.cache:/root/.cache \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  --model /models/Ornith-1.0-35B-ModelOpt-NVFP4-Expert \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name ornith35-nvfp4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --kv-cache-memory-bytes 4G \
  --attention-backend flashinfer \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-chunked-prefill

For non-thinking deterministic chat/eval requests, use:

{
  "temperature": 0,
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}

Validation

The quantized checkpoint was validated on a local DGX Spark / GB10 against the BF16 upstream checkpoint using the same vLLM runtime profile, same maximum context, same FP8 KV cache allocation, and the same benchmark harness.

Performance Benchmark Results

All text rows below used unique prompts, prefix caching disabled, max_tokens=128, and submitted concurrency 1, 2, or 4.

Context	Conc	BF16 prefill	NVFP4 prefill	Prefill speedup	BF16 decode	NVFP4 decode	Decode speedup	BF16 TTFT	NVFP4 TTFT
32k	1	3288.8	4626.2	1.41x	29.6	35.5	1.20x	10.0s	7.1s
32k	2	4349.7	5412.5	1.24x	55.4	70.2	1.27x	15.1s	12.1s
32k	4	4425.1	5512.5	1.25x	82.1	118.9	1.45x	29.6s	23.8s
64k	1	3656.1	4348.0	1.19x	28.2	33.5	1.19x	17.9s	15.1s
64k	2	3625.8	4315.6	1.19x	48.5	63.6	1.31x	36.1s	30.4s
64k	4	3594.4	4289.9	1.19x	79.7	103.3	1.30x	72.9s	61.1s
128k	1	2653.1	3003.2	1.13x	26.1	30.7	1.18x	49.4s	43.6s
128k	2	2632.2	2988.2	1.14x	45.0	55.5	1.23x	99.6s	87.7s
128k	4	2607.5	2969.3	1.14x	64.0	84.4	1.32x	201.1s	176.6s
full	1	1713.5	1853.7	1.08x	22.2	25.7	1.16x	152.8s	141.3s
full	2	1704.9	1838.8	1.08x	37.5	43.8	1.17x	307.2s	284.8s
full	4	1702.5	1840.8	1.08x	56.7	71.0	1.25x	615.3s	569.0s

Throughput units are tokens per second. TTFT is max time to first token for the concurrent batch.

The serial 4096 x 4096 image test:

Case	BF16 prefill	NVFP4 prefill	Prefill speedup	BF16 decode	NVFP4 decode	Decode speedup	BF16 TTFT	NVFP4 TTFT
4K image	1204.6	1357.6	1.13x	30.0	36.1	1.20x	13.6s	12.1s

Startup and memory observations:

Metric	BF16	ModelOpt NVFP4
Remote model directory size	`66G`	`23G`
vLLM model loading memory	`65.53 GiB`	`22.41 GiB`
Weight loading time	`454.55s`	`153.13s`
Engine init observed to health	about `610s`	about `290s`
MemAvailable after health	about `17 GiB`	about `59 GiB`

Accuracy Benchmark Results

Quality runs used deterministic decoding with temperature=0, chat_template_kwargs={"enable_thinking": false}, and paired per-item scoring. The key quantization regression count is "BF16 correct / NVFP4 wrong".

Benchmark	Items	BF16 acc	NVFP4 acc	Delta NVFP4-BF16	BF16 correct/NVFP4 wrong	NVFP4 correct/BF16 wrong
MMLU-Pro	12032	55.8%	55.0%	-0.8 pp	527	429
GPQA Diamond mirror	198	22.7%	13.6%	-9.1 pp	23	5
MATH-500	500	34.6%	36.6%	+2.0 pp	17	27
HumanEval+	164	39.0%	21.3%	-17.7 pp	33	4
MBPP+	378	78.0%	76.5%	-1.6 pp	13	7
MMMU validation	900	57.4%	56.1%	-1.3 pp	54	42
OCRBench	1000	69.9%	70.2%	+0.3 pp	17	20
BFCL v3 10-each subset	100	1.0%	0.0%	-1.0 pp	1	0

ToolEvalBench version 2.0.7 was run sequentially over 69 standard scenarios:

Model	Final score	Points	Deployability	Responsiveness	Safety warnings
BF16	82	113 / 138	74	54	3
NVFP4	88	121 / 138	81	64	1

Interpretation: this NVFP4 export is close to BF16 on broad multiple-choice, multimodal, OCR, and MBPP-style code tasks, but it showed clear regressions on GPQA Diamond and HumanEval+ in this run.

Caveats

This is a community quantization, not an official upstream release.
The checkpoint has been validated with vLLM ModelOpt loading. Other loaders may not support this ModelOpt NVFP4 format.
vLLM marks ModelOpt NVFP4 support as experimental, so revalidate after major vLLM, ModelOpt, CUDA, or FlashInfer changes.
The export does not include FP8 KV q/prob scaling factors. When serving with --kv-cache-dtype fp8, vLLM reports that it uses scale 1.0; treat this as a quality caveat for accuracy-sensitive workloads.
GPQA used the ungated fingertap/GPQA-Diamond mirror because the official dataset was gated at the time of evaluation.
HumanEval+ and MBPP+ used EvalPlus prompts and expanded inputs, but scoring used a local subprocess checker because the official EvalPlus sandbox failed on the local macOS host with resource-limit errors.
BFCL v3 10-each should be treated as raw paired signal only; ToolEvalBench is the stronger tool-use benchmark in this report.

Responsible Use

Use this model consistently with the upstream model license and any applicable laws or platform policies. Because this is a quantized derivative, evaluate it for your own target domain before relying on it in production or accuracy-sensitive workflows.

License And Attribution

The upstream model is MIT licensed. This quantized release preserves the MIT license and attribution to deepreinforce-ai/Ornith-1.0-35B.

Upstream model: https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B

Quantized by LS-ML.

Downloads last month: 174

Safetensors

Model size

19B params

Tensor type

BF16

F8_E4M3

Model tree for LS-ML/Ornith-1.0-35B-ModelOpt-NVFP4-Expert

Base model

deepreinforce-ai/Ornith-1.0-35B

Quantized

(58)

this model