Soofi-S-Isar-Preview-FP8

⚠️ Preview / internal checkpoint. Weights and metadata may still change.

FP8 (W8A8 dynamic) quantization of Soofi-Project/Soofi-S-Isar-Preview for space-efficient serving with vLLM. This is the Isar reasoning variant — it emits explicit <think> traces before the final answer.

Quantized from bf16 safetensors with llm-compressor to the compressed-tensors FP8 format, which vLLM loads natively at roughly half the weight memory of the bf16 checkpoint.

Architecture support: SOOFI-S is a custom hybrid Mamba-2/MoE model and ships with its own modeling code (trust_remote_code). FP8 serving requires a vLLM build that understands this architecture — verify against the actual checkpoint before relying on this artifact.

Quantization details

| Property | Value |
| --- | --- |
| Scheme | FP8_DYNAMIC (W8A8) |
| Weights | FP8 E4M3, per-channel static scales |
| Activations | FP8 E4M3, per-token dynamic (quantized at runtime) |
| Calibration | none (data-free) |
| Kept in full precision | lm_head, Mamba-2 in_proj/out_proj |
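
For orientation, the settings listed above could be written as an llm-compressor recipe roughly as follows. This is a sketch, not the script that produced this checkpoint: the regex patterns for the Mamba-2 projections are guesses based on the layer names mentioned here, save_dir is a hypothetical output path, and the import paths assume a recent llm-compressor release.

```python
def build_fp8_recipe_kwargs():
    """Arguments for llm-compressor's QuantizationModifier (illustrative)."""
    return {
        "targets": "Linear",      # quantize nn.Linear modules only
        "scheme": "FP8_DYNAMIC",  # static per-channel weights, dynamic per-token activations
        "ignore": [
            "lm_head",
            # Assumed module-name patterns; check them against the real model,
            # since they would also match any other layer named in_proj/out_proj.
            "re:.*in_proj$",
            "re:.*out_proj$",
        ],
    }


def quantize(model_id="Soofi-Project/Soofi-S-Isar-Preview",
             save_dir="Soofi-S-Isar-Preview-FP8"):  # save_dir is hypothetical
    # Heavy imports stay local so the recipe can be inspected without them.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    recipe = QuantizationModifier(**build_fp8_recipe_kwargs())
    oneshot(model=model, recipe=recipe)  # data-free: no calibration dataset is passed

    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
```

Printing build_fp8_recipe_kwargs() before calling quantize() is a cheap way to confirm which modules would be skipped.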

Why dynamic / data-free? Dynamic per-token activation scales need no calibration dataset and are robust for MoE, where a single static activation scale across experts is a poor fit. The MoE router is not an nn.Linear, so it stays full precision automatically; the Mamba-2 in_proj/out_proj (the recurrent SSM path) are kept bf16 as the most quantization-sensitive layers.
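
The difference between static per-channel weight scales and dynamic per-token activation scales can be illustrated in plain Python. This is a toy sketch: it only computes the scales and rescales values into the FP8 E4M3 range (maximum finite magnitude 448); it does not perform the actual FP8 rounding, and the matrices are made-up numbers.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_channel_weight_scales(weight):
    """One static scale per output channel (row) of a weight matrix."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight]


def per_token_activation_scales(activations):
    """One scale per token (row), recomputed at runtime for every input."""
    return [max(abs(a) for a in token) / FP8_E4M3_MAX for token in activations]


def scale_rows(matrix, scales):
    """Divide each row by its scale so the values fit the FP8 range before casting.

    Assumes no row is all zeros (the scale would be 0).
    """
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


weight = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]       # 2 output channels
activations = [[3.0, -1.5, 0.5], [120.0, 40.0, -8.0]]  # 2 tokens

w_scales = per_channel_weight_scales(weight)         # computed once, stored with the weights
a_scales = per_token_activation_scales(activations)  # computed on every forward pass

scaled_w = scale_rows(weight, w_scales)
scaled_a = scale_rows(activations, a_scales)
```

The second token has a much larger magnitude than the first, so it gets its own larger scale instead of forcing the first token into a coarse grid. That is the property that makes per-token scales a better fit than one static activation scale.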

Size scales with the total 30B parameters (not the 3.5B active), so the FP8 checkpoint is roughly half the bf16 size, slightly more because the few full-precision tensors above are not halved.
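
A back-of-the-envelope estimate of the weight footprint, using only the parameter count stated here. It is a rough sketch: it ignores the tensors kept in full precision and the stored scales, so the FP8 figure is a lower bound rather than the actual file size.

```python
def estimate_weight_gib(total_params, bytes_per_param):
    """Approximate weight memory in GiB for a given storage width."""
    return total_params * bytes_per_param / 2**30


TOTAL_PARAMS = 30e9  # total parameters; all experts must be resident, not just the active ones

bf16_gib = estimate_weight_gib(TOTAL_PARAMS, 2)  # bf16: 2 bytes per parameter
fp8_gib = estimate_weight_gib(TOTAL_PARAMS, 1)   # FP8: 1 byte per parameter (lower bound)

print(f"bf16 ~{bf16_gib:.1f} GiB, FP8 at least ~{fp8_gib:.1f} GiB")
```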

Usage with vLLM

# OpenAI-compatible server (shell)
vllm serve Soofi-Project/Soofi-S-Isar-Preview-FP8 --trust-remote-code

# Offline inference (Python)
from vllm import LLM, SamplingParams

llm = LLM(model="Soofi-Project/Soofi-S-Isar-Preview-FP8", trust_remote_code=True)
out = llm.chat(
    [{"role": "user", "content": "How many r's are in strawberry?"}],
    # reasoning models spend output tokens on the <think> trace — budget generously
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048),
)
print(out[0].outputs[0].text)

--trust-remote-code is required: the custom hybrid Mamba-2/MoE modeling code travels with the checkpoint. FP8 needs a GPU with hardware FP8 support (NVIDIA Hopper/Ada/Blackwell — e.g. H100, L40S, RTX 4090) for the fast path; on older GPUs vLLM falls back to a slower Marlin kernel.
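
To check whether a given GPU is in the hardware-FP8 group, one option is to look at its CUDA compute capability. The sketch below assumes a threshold of 8.9 (Ada), with Hopper at 9.0 and later architectures above that; vLLM's own kernel selection may apply different or additional criteria.

```python
def has_native_fp8(major, minor):
    """True for compute capability 8.9 (Ada) and newer, e.g. 9.0 (Hopper)."""
    return (major, minor) >= (8, 9)


def current_gpu_has_native_fp8():
    """Query the first visible CUDA device; requires PyTorch."""
    import torch  # imported lazily; only needed when querying a real device

    if not torch.cuda.is_available():
        return False
    return has_native_fp8(*torch.cuda.get_device_capability())
```

For example, has_native_fp8(8, 0) is False for an A100 (compute capability 8.0), which would therefore take the slower fallback path.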

Identity and tool calling work out of the box. vLLM's chat endpoint applies the Jinja chat_template embedded in the model, so the default identity system prompt and the native tool-calling format are used verbatim. Unlike with Ollama, you do not need to supply a SYSTEM block manually.
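
A minimal client for the OpenAI-compatible server, using only the standard library. It assumes the server started by vllm serve is listening on http://localhost:8000 (adjust base_url otherwise); the helper names are illustrative. Note that the payload carries no system message, since the embedded chat template is expected to supply the default one.

```python
import json
import urllib.request

MODEL = "Soofi-Project/Soofi-S-Isar-Preview-FP8"


def build_chat_payload(user_message, max_tokens=2048):
    """Chat-completions request body with a single user turn."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,  # generous: the <think> trace counts as output
    }


def send_chat(payload, base_url="http://localhost:8000/v1"):  # assumed address
    """POST the payload and return the assistant message text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, send_chat(build_chat_payload("How many r's are in strawberry?")) returns the model's reply as a string.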

Reasoning output: the model emits <think> … </think> blocks inline, which consume output tokens — give it a generous max_tokens budget and context.
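
If only the final answer is needed, the inline trace can be separated after generation. The helper below is a sketch that assumes both the <think> and </think> tags appear literally in the returned text; it also treats an unclosed <think> as a trace that was cut off by max_tokens.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a completion with inline <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    if "<think>" in answer:  # trace was truncated before </think>
        head, _, tail = answer.partition("<think>")
        reasoning = (reasoning + "\n" + tail.strip()).strip()
        answer = head.strip()
    return reasoning, answer
```

An empty answer together with a non-empty reasoning string is a sign that the token budget ran out during the trace and max_tokens should be raised.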

Architecture note

This is a hybrid Mixture-of-Experts model designed from scratch: 23 Mamba-2/MoE layers + 6 attention layers, 128 routed experts + 1 shared expert per MoE layer, 6 experts active per token (30B total / 3.5B active). FP8 is applied to the Linear layers (attention/MoE expert projections); the SSM (Mamba-2) recurrent parameters and the router stay in higher precision. A recent version of vLLM is recommended.
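
The expert counts above imply a top-k selection per token. The sketch below only illustrates that selection with random scores; how the actual router computes its scores, and whether the shared expert counts toward the 6 active experts, is not specified here and is left as an assumption.

```python
import random

NUM_ROUTED_EXPERTS = 128  # routed experts per MoE layer
EXPERTS_PER_TOKEN = 6     # experts selected per token


def select_experts(router_scores, k=EXPERTS_PER_TOKEN):
    """Indices of the k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(router_scores)), key=lambda i: router_scores[i], reverse=True
    )
    return sorted(ranked[:k])


rng = random.Random(0)  # fixed seed; the scores are stand-ins for real router outputs
scores = [rng.random() for _ in range(NUM_ROUTED_EXPERTS)]
chosen = select_experts(scores)

# Only the chosen experts run for this token, which is why
# the active parameter count is far below the total.
print(f"{len(chosen)} of {NUM_ROUTED_EXPERTS} routed experts selected: {chosen}")
```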

License & provenance

Released under a custom license ("Other"), following the base model Soofi-Project/Soofi-S-Isar-Preview. TODO: mirror the full license text once the base model card defines it.
