Ornith-1.0-35B-FP8

Block-wise FP8 (E4M3) quantization of deepreinforce-ai/Ornith-1.0-35B — DeepReinforce's self-scaffolding agentic-coding MoE (Qwen3.5-35B-A3B hybrid). Served size ~35 GB; fits a single 80–96 GB GPU with full context.

Quantized and serve-validated by protoLabs on RTX PRO 6000 Blackwell (vLLM 0.22.1).

Why this exists

The upstream FP8 release was reported broken. This is a clean, serve-validated FP8 built to the Qwen official block-wise format (served natively on vLLM's fused-MoE Triton path), with the entire linear-attention / SSM path and the MoE router kept in bf16 — FP8-quantizing those corrupts the hybrid SSM and is the most likely failure mode for a naive FP8 of this architecture.

Quantization recipe

  • Scheme: block-wise [128, 128] FP8 E4M3, dynamic per-token activations (quant_method: fp8, activation_scheme: dynamic).
  • Quantized: all expert FFNs (gate/up/down), shared-expert FFNs, full-attention projections (q/k/v/o).
  • Kept bf16 (modules_to_not_convert): lm_head, embed_tokens, MoE router gates (mlp.gate, shared_expert_gate), all linear_attn.* (SSM: in_proj_*, out_proj, conv1d, A_log, dt_bias), all norms, and the vision tower.
  • Streaming tensor-by-tensor quantizer (peak RAM ≈ 2× largest tensor) — no full-model load.

Validation (protoLabs harness, thinking-on, single trial)

Smoke: coherent across coding / reasoning / long-form / multilingual / tool-calling — no gibberish, no reasoning leak.

Metric bf16 source This FP8
custom coding (one-shot) 1.00 (10/10) 0.975 (9/10)
function-call 93% (50/54) 89% (48/54)
decode tok/s (1× RTX PRO 6000) 208 207

Deltas are within single-trial run-to-run variance (temp 0.6–0.7); throughput is identical to the source.

Serving (vLLM)

vllm serve protoLabsAI/Ornith-1.0-35B-FP8 \
  --served-model-name ornith-35b \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

Context: full 256K (262144). Vision: the base is multimodal (Qwen-VL-style image + video tokens) and the vision tower is preserved in bf16 — the recipe above keeps it enabled. For text-only serving (smaller footprint), add --language-model-only. Verified serving with vision on at 256K on RTX PRO 6000 (Blackwell, sm120).

Ornith is a reasoning model: the assistant turn opens with a <think>…</think> block surfaced as reasoning_content; tool calls are emitted as standard tool_calls. Recommended sampling: temperature=0.6, top_p=0.95, top_k=20.

License & attribution

MIT, inheriting Ornith-1.0. All credit for the base model to the DeepReinforce team (blog); this repo only adds the FP8 quantization.

@misc{ornith-35b, title={{Ornith-1.0-35B}: Agentic Coding, Open to All},
  url={https://deep-reinforce.com/ornith_1_0.html}, author={{DeepReinforce Team}}, year={2026}}
Downloads last month
1,672
Safetensors
Model size
35B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for protoLabsAI/Ornith-1.0-35B-FP8

Quantized
(82)
this model