---
base_model: Soofi-Project/Soofi-S-Instruct-Preview
license: other
language:
- en
- de
- es
- fr
- it
library_name: transformers
pipeline_tag: text-generation
tags:
- fp8
- compressed-tensors
- mamba-2
- moe
- soofi
- vllm
- quantized
- preview
quantized_by: Soofi-Project
---
# Soofi-S-Instruct-Preview-FP8
> ⚠️ **Preview / internal checkpoint.** Weights and metadata may still change.
FP8 (W8A8 dynamic) quantization of
[**Soofi-Project/Soofi-S-Instruct-Preview**](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview)
for space-efficient serving with **vLLM**.
Quantized from bf16 safetensors with
[`llm-compressor`](https://github.com/vllm-project/llm-compressor) to the
`compressed-tensors` FP8 format, which vLLM loads natively at roughly **half the
weight memory** of the bf16 checkpoint.
> **Architecture support:** Soofi-S is a custom hybrid Mamba-2/MoE model and
> ships with its own modeling code (`trust_remote_code`). FP8 serving requires a
> vLLM build that supports this architecture; confirm that your vLLM version
> loads this checkpoint before relying on it.
## Quantization details
| Property | Value |
|---|---|
| Scheme | **FP8_DYNAMIC** (W8A8) |
| Weights | FP8 E4M3, **per-channel** static scales |
| Activations | FP8 E4M3, **per-token dynamic** (quantized at runtime) |
| Calibration | none (data-free) |
| Kept unquantized (bf16) | `lm_head`, Mamba-2 `in_proj`/`out_proj` |
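
To make the table concrete, the following pure-Python sketch simulates the two scaling schemes on a toy linear layer: one static scale per weight output channel (computed once, with no calibration data) and one dynamic scale per token (recomputed on every call). It is an illustration only, not the `llm-compressor` or vLLM implementation; the helper names `round_to_e4m3`, `quantize_rows`, and `fp8_linear` are hypothetical, and the E4M3 rounding is an approximation written for readability.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def round_to_e4m3(value: float) -> float:
    """Simulate rounding to the nearest FP8 E4M3 value (saturating at +-448)."""
    if value == 0.0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    mag = min(abs(value), FP8_E4M3_MAX)
    # floor(log2(mag)), clamped at the smallest normal exponent (-6)
    exponent = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_rows(rows):
    """Quantize each row with its own scale; returns (fp8_rows, scales)."""
    scales = [(max(abs(v) for v in row) / FP8_E4M3_MAX) or 1.0 for row in rows]
    quantized = [[round_to_e4m3(v / s) for v in row] for row, s in zip(rows, scales)]
    return quantized, scales


def fp8_linear(x, q_weight, w_scales):
    """y = x @ W^T with static weight scales and dynamic activation scales."""
    q_x, x_scales = quantize_rows(x)  # per-token scales, recomputed on every call
    return [
        [sum(a * b for a, b in zip(xr, wr)) * sx * sw
         for wr, sw in zip(q_weight, w_scales)]
        for xr, sx in zip(q_x, x_scales)
    ]


# Weights: rows are output channels; quantized once, offline, without calibration data.
weight = [[0.5, -1.0, 0.25], [10.0, 20.0, -40.0]]
q_weight, w_scales = quantize_rows(weight)

# Activations: rows are tokens whose magnitudes differ by two orders of magnitude.
x = [[1.0, 2.0, 3.0], [100.0, -50.0, 25.0]]
y = fp8_linear(x, q_weight, w_scales)
reference = [[sum(a * b for a, b in zip(xr, wr)) for wr in weight] for xr in x]
print(y)
print(reference)
```

Because each token gets its own scale, the small-magnitude first token and the large-magnitude second token both use the full FP8 range, which is the property that makes the dynamic scheme workable without a calibration set.
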
> **Why dynamic / data-free?** Dynamic per-token activation scales need no
> calibration dataset and are robust for MoE, where a single static activation
> scale shared across experts is a poor fit. The MoE **router** is not an
> `nn.Linear`, so it is skipped automatically and stays unquantized; the Mamba-2
> `in_proj`/`out_proj` (the recurrent SSM path) are kept in bf16 as the most
> quantization-sensitive layers.
>
> **Size scales with the *total* 30B parameters** (not the 3.5B active), so the
> FP8 checkpoint is roughly half the bf16 size, plus the overhead of the few
> tensors listed above that stay in bf16.
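
As a back-of-the-envelope check of the size claim, the sketch below estimates weight memory from the 30B total parameter count at 2 bytes per bf16 parameter and 1 byte per FP8 parameter. The share of parameters kept in bf16 is not stated in this card, so the 5% used here is a purely hypothetical placeholder; the small per-channel scale tensors are ignored.

```python
TOTAL_PARAMS = 30e9          # total parameters, from this model card
ASSUMED_BF16_FRACTION = 0.05  # hypothetical share left unquantized (not from the card)


def weight_gigabytes(total_params: float, bf16_fraction: float) -> float:
    """Estimate weight size in GB: 1 byte per FP8 param, 2 bytes per bf16 param."""
    fp8_bytes = total_params * (1.0 - bf16_fraction) * 1
    bf16_bytes = total_params * bf16_fraction * 2
    return (fp8_bytes + bf16_bytes) / 1e9


bf16_checkpoint_gb = weight_gigabytes(TOTAL_PARAMS, bf16_fraction=1.0)
ideal_fp8_gb = weight_gigabytes(TOTAL_PARAMS, bf16_fraction=0.0)
estimated_fp8_gb = weight_gigabytes(TOTAL_PARAMS, ASSUMED_BF16_FRACTION)

print(f"bf16 checkpoint:        ~{bf16_checkpoint_gb:.1f} GB")
print(f"FP8, everything in FP8: ~{ideal_fp8_gb:.1f} GB")
print(f"FP8, assumed bf16 share: ~{estimated_fp8_gb:.1f} GB")
```

These figures cover weights only; KV cache, Mamba-2 state, and activation memory come on top at serving time.
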
## Usage with vLLM
```bash
# Start an OpenAI-compatible server (port 8000 by default)
vllm serve Soofi-Project/Soofi-S-Instruct-Preview-FP8 --trust-remote-code

# Query the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Soofi-Project/Soofi-S-Instruct-Preview-FP8",
       "messages": [{"role": "user", "content": "Explain AI sovereignty in one sentence."}]}'
```
```python
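from vllm import LLM, SamplingParams

# trust_remote_code=True loads the custom modeling code shipped with the checkpoint
llm = LLM(model="Soofi-Project/Soofi-S-Instruct-Preview-FP8", trust_remote_code=True)
# chat() applies the model's chat template before generating
out = llm.chat(
    [{"role": "user", "content": "Explain AI sovereignty in one sentence."}],
    SamplingParams(temperature=0.6, top_p=0.95),
)
print(out[0].outputs[0].text)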
```
> **`--trust-remote-code`** is required: the custom hybrid Mamba-2/MoE modeling
> code travels with the checkpoint. The fast path needs a GPU with hardware FP8
> support (NVIDIA Ada Lovelace, Hopper, or Blackwell, e.g. L40S, RTX 4090, H100);
> on older GPUs such as Ampere, vLLM falls back to a slower weight-only Marlin kernel.
>
> The chat template is applied by vLLM's chat endpoint, so the model's native
> identity and tool-calling format work out of the box — no manual template.
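
The hardware note above can be turned into a quick pre-flight check. The sketch below maps a CUDA compute capability to the path vLLM is expected to take; the thresholds (8.9 for native FP8, 8.0 for the Marlin fallback) are assumptions based on NVIDIA's architecture numbering and vLLM's FP8 documentation, and the function name `fp8_execution_path` is hypothetical. Check the documentation of your vLLM version before relying on them.

```python
def fp8_execution_path(capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to the expected FP8 path."""
    if capability >= (8, 9):
        # Ada Lovelace (8.9), Hopper (9.0), Blackwell (10.x / 12.x)
        return "native FP8 (W8A8)"
    if capability >= (8, 0):
        # Ampere: assumed weight-only fallback through the Marlin kernel
        return "Marlin fallback (weight-only FP8)"
    return "FP8 likely unsupported"


# Example capabilities: H100, L40S / RTX 4090, A100, T4
for name, cap in [("H100", (9, 0)), ("L40S", (8, 9)), ("A100", (8, 0)), ("T4", (7, 5))]:
    print(f"{name}: {fp8_execution_path(cap)}")
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.
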
## Architecture note
This is a **hybrid Mixture-of-Experts** model designed from scratch: 23
Mamba-2/MoE layers + 6 attention layers, 128 routed experts + 1 shared expert
per MoE layer, 6 experts active per token (30B total / 3.5B active). FP8 is
applied to the `nn.Linear` layers (attention and MoE expert projections); the
Mamba-2 `in_proj`/`out_proj`, the SSM recurrent parameters, and the router stay
in higher precision. A recent version of vLLM is recommended.
## Related models
- Base (bf16): [Soofi-Project/Soofi-S-Instruct-Preview](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview)
- GGUF (llama.cpp/Ollama): [Soofi-Project/Soofi-S-Instruct-Preview-GGUF](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-GGUF)
## License & provenance
Released under a custom license ("Other"), following the base model
[Soofi-Project/Soofi-S-Instruct-Preview](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview).
TODO: mirror the full license text once the base model card defines it.