Ornith 1.0 35B ModelOpt NVFP4 Expert

This repository contains a community ModelOpt NVFP4 experts-only quantization of deepreinforce-ai/Ornith-1.0-35B.

This is not an official DeepReinforce release. The source BF16 checkpoint is unchanged and remains available from the upstream repository.

Model Details

Field Value
Base model deepreinforce-ai/Ornith-1.0-35B
Base revision 5df2ed3f675c7beaa490328cc70bb573b65fb660
Release repo LS-ML/Ornith-1.0-35B-ModelOpt-NVFP4-Expert
Architecture Qwen3_5MoeForConditionalGeneration
Model type qwen3_5_moe
Modality Text and vision-language
Max context tested 262144 tokens
Quantization ModelOpt NVFP4, experts-only target
Group size 16
Exported checkpoint size 23G on disk
Source BF16 checkpoint size 66G on disk
License MIT

The repository name uses Expert, but the quantization target is experts-only: the MoE expert linear weights are quantized while embeddings, lm head, visual modules, attention, linear-attention modules, and shared experts are excluded. See hf_quant_config.json and config.json for the exact ModelOpt quantization metadata.

Benchmark Summary

The checkpoint was benchmarked against the upstream BF16 model on the same DGX Spark / GB10, same vLLM nightly container, same 262K serving profile, same FP8 KV cache allocation, and same benchmark harness.

Performance summary:

Metric BF16 ModelOpt NVFP4 Result
Model directory size 66G 23G 2.9x smaller
vLLM model loading memory 65.53 GiB 22.41 GiB 2.9x lower
Weight loading time 454.55s 153.13s 3.0x faster
Text prefill speedup range baseline 1.08x to 1.41x faster in every row
Text decode speedup range baseline 1.16x to 1.45x faster in every row
4K image prefill 1204.6 tok/s 1357.6 tok/s 1.13x faster
4K image decode 30.0 tok/s 36.1 tok/s 1.20x faster

Accuracy summary:

Benchmark BF16 acc / score NVFP4 acc / score Delta NVFP4-BF16
MMLU-Pro 55.8% 55.0% -0.8 pp
GPQA Diamond mirror 22.7% 13.6% -9.1 pp
MATH-500 34.6% 36.6% +2.0 pp
HumanEval+ 39.0% 21.3% -17.7 pp
MBPP+ 78.0% 76.5% -1.6 pp
MMMU validation 57.4% 56.1% -1.3 pp
OCRBench 69.9% 70.2% +0.3 pp
BFCL v3 10-each subset 1.0% 0.0% -1.0 pp
ToolEvalBench 82 88 +6

Detailed performance and accuracy benchmark tables are included below.

Benchmark Graphics

Performance benchmark:

Ornith 1.0 35B BF16 vs ModelOpt NVFP4 performance benchmark

Accuracy and quality benchmark:

Ornith 1.0 35B BF16 vs ModelOpt NVFP4 accuracy benchmark

Quantization

The checkpoint was produced with NVIDIA ModelOpt from a fused-expert staging copy of the upstream BF16 model. The upstream checkpoint stores split expert tensors; the staging copy fused expert gate/up/down tensors into the layout expected by current Transformers and ModelOpt. The upstream BF16 checkpoint itself was not modified.

Quantization metadata:

Field Value
ModelOpt source version 0.46.0.dev106+g6cc522658
ModelOpt commit 6cc5226588f0668679df03ba4646b7dfec32f99c
Quant algorithm NVFP4
KV cache quantization during export none
Calibration dataset nvidia/Nemotron-SFT-Agentic-v2, split search
Calibration settings calib_size=16, calib_seq=512, batch_size=1
Export mode low-memory ModelOpt path

The quantization notes used during export are included in QUANTIZATION_NOTES.md.

Serving With vLLM

Validated serving stack:

  • vllm/vllm-openai:nightly
  • ModelOpt quantization loader: --quantization modelopt
  • FlashInfer attention backend
  • Blackwell / GB10 tested with CUTE_DSL_ARCH=sm_121a
  • OpenAI-compatible chat completions
  • Text, image, and Qwen3 XML tool-call smoke tests

Example full-context DGX Spark / GB10 launch:

docker run --rm \
  --name ornith35-nvfp4-vllm \
  --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=32g \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v ~/.cache:/root/.cache \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  --model /models/Ornith-1.0-35B-ModelOpt-NVFP4-Expert \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name ornith35-nvfp4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --kv-cache-memory-bytes 28G \
  --attention-backend flashinfer \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-chunked-prefill

Lower-memory smoke profile:

docker run --rm \
  --name ornith35-nvfp4-vllm-smoke \
  --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=32g \
  -p 8000:8000 \
  -v /path/to/models:/models \
  -v ~/.cache:/root/.cache \
  -e FLASHINFER_DISABLE_VERSION_CHECK=1 \
  -e CUTE_DSL_ARCH=sm_121a \
  vllm/vllm-openai:nightly \
  --model /models/Ornith-1.0-35B-ModelOpt-NVFP4-Expert \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name ornith35-nvfp4 \
  --trust-remote-code \
  --dtype bfloat16 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --kv-cache-memory-bytes 4G \
  --attention-backend flashinfer \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --enable-chunked-prefill

For non-thinking deterministic chat/eval requests, use:

{
  "temperature": 0,
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}

Validation

The quantized checkpoint was validated on a local DGX Spark / GB10 against the BF16 upstream checkpoint using the same vLLM runtime profile, same maximum context, same FP8 KV cache allocation, and the same benchmark harness.

Performance Benchmark Results

All text rows below used unique prompts, prefix caching disabled, max_tokens=128, and submitted concurrency 1, 2, or 4.

Context Conc BF16 prefill NVFP4 prefill Prefill speedup BF16 decode NVFP4 decode Decode speedup BF16 TTFT NVFP4 TTFT
32k 1 3288.8 4626.2 1.41x 29.6 35.5 1.20x 10.0s 7.1s
32k 2 4349.7 5412.5 1.24x 55.4 70.2 1.27x 15.1s 12.1s
32k 4 4425.1 5512.5 1.25x 82.1 118.9 1.45x 29.6s 23.8s
64k 1 3656.1 4348.0 1.19x 28.2 33.5 1.19x 17.9s 15.1s
64k 2 3625.8 4315.6 1.19x 48.5 63.6 1.31x 36.1s 30.4s
64k 4 3594.4 4289.9 1.19x 79.7 103.3 1.30x 72.9s 61.1s
128k 1 2653.1 3003.2 1.13x 26.1 30.7 1.18x 49.4s 43.6s
128k 2 2632.2 2988.2 1.14x 45.0 55.5 1.23x 99.6s 87.7s
128k 4 2607.5 2969.3 1.14x 64.0 84.4 1.32x 201.1s 176.6s
full 1 1713.5 1853.7 1.08x 22.2 25.7 1.16x 152.8s 141.3s
full 2 1704.9 1838.8 1.08x 37.5 43.8 1.17x 307.2s 284.8s
full 4 1702.5 1840.8 1.08x 56.7 71.0 1.25x 615.3s 569.0s

Throughput units are tokens per second. TTFT is max time to first token for the concurrent batch.

The serial 4096 x 4096 image test:

Case BF16 prefill NVFP4 prefill Prefill speedup BF16 decode NVFP4 decode Decode speedup BF16 TTFT NVFP4 TTFT
4K image 1204.6 1357.6 1.13x 30.0 36.1 1.20x 13.6s 12.1s

Startup and memory observations:

Metric BF16 ModelOpt NVFP4
Remote model directory size 66G 23G
vLLM model loading memory 65.53 GiB 22.41 GiB
Weight loading time 454.55s 153.13s
Engine init observed to health about 610s about 290s
MemAvailable after health about 17 GiB about 59 GiB

Accuracy Benchmark Results

Quality runs used deterministic decoding with temperature=0, chat_template_kwargs={"enable_thinking": false}, and paired per-item scoring. The key quantization regression count is "BF16 correct / NVFP4 wrong".

Benchmark Items BF16 acc NVFP4 acc Delta NVFP4-BF16 BF16 correct/NVFP4 wrong NVFP4 correct/BF16 wrong
MMLU-Pro 12032 55.8% 55.0% -0.8 pp 527 429
GPQA Diamond mirror 198 22.7% 13.6% -9.1 pp 23 5
MATH-500 500 34.6% 36.6% +2.0 pp 17 27
HumanEval+ 164 39.0% 21.3% -17.7 pp 33 4
MBPP+ 378 78.0% 76.5% -1.6 pp 13 7
MMMU validation 900 57.4% 56.1% -1.3 pp 54 42
OCRBench 1000 69.9% 70.2% +0.3 pp 17 20
BFCL v3 10-each subset 100 1.0% 0.0% -1.0 pp 1 0

ToolEvalBench version 2.0.7 was run sequentially over 69 standard scenarios:

Model Final score Points Deployability Responsiveness Safety warnings
BF16 82 113 / 138 74 54 3
NVFP4 88 121 / 138 81 64 1

Interpretation: this NVFP4 export is close to BF16 on broad multiple-choice, multimodal, OCR, and MBPP-style code tasks, but it showed clear regressions on GPQA Diamond and HumanEval+ in this run.

Caveats

  • This is a community quantization, not an official upstream release.
  • The checkpoint has been validated with vLLM ModelOpt loading. Other loaders may not support this ModelOpt NVFP4 format.
  • vLLM marks ModelOpt NVFP4 support as experimental, so revalidate after major vLLM, ModelOpt, CUDA, or FlashInfer changes.
  • The export does not include FP8 KV q/prob scaling factors. When serving with --kv-cache-dtype fp8, vLLM reports that it uses scale 1.0; treat this as a quality caveat for accuracy-sensitive workloads.
  • GPQA used the ungated fingertap/GPQA-Diamond mirror because the official dataset was gated at the time of evaluation.
  • HumanEval+ and MBPP+ used EvalPlus prompts and expanded inputs, but scoring used a local subprocess checker because the official EvalPlus sandbox failed on the local macOS host with resource-limit errors.
  • BFCL v3 10-each should be treated as raw paired signal only; ToolEvalBench is the stronger tool-use benchmark in this report.

Responsible Use

Use this model consistently with the upstream model license and any applicable laws or platform policies. Because this is a quantized derivative, evaluate it for your own target domain before relying on it in production or accuracy-sensitive workflows.

License And Attribution

The upstream model is MIT licensed. This quantized release preserves the MIT license and attribution to deepreinforce-ai/Ornith-1.0-35B.

Upstream model: https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B

Quantized by LS-ML.

Downloads last month
174
Safetensors
Model size
19B params
Tensor type
BF16
F8_E4M3
U8
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for LS-ML/Ornith-1.0-35B-ModelOpt-NVFP4-Expert

Quantized
(58)
this model