Instructions for using Soofi-Project/Soofi-S-Isar-Preview-FP8 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Soofi-Project/Soofi-S-Isar-Preview-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Soofi-Project/Soofi-S-Isar-Preview-FP8", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Soofi-Project/Soofi-S-Isar-Preview-FP8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Soofi-Project/Soofi-S-Isar-Preview-FP8", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Reasoning models spend output tokens on the <think> trace, so leave room for it
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Soofi-Project/Soofi-S-Isar-Preview-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Soofi-Project/Soofi-S-Isar-Preview-FP8" --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Soofi-Project/Soofi-S-Isar-Preview-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Soofi-Project/Soofi-S-Isar-Preview-FP8

SGLang

How to use Soofi-Project/Soofi-S-Isar-Preview-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Soofi-Project/Soofi-S-Isar-Preview-FP8" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Soofi-Project/Soofi-S-Isar-Preview-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Soofi-Project/Soofi-S-Isar-Preview-FP8" \
        --trust-remote-code \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Soofi-Project/Soofi-S-Isar-Preview-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use Soofi-Project/Soofi-S-Isar-Preview-FP8 with Docker Model Runner:

```
docker model run hf.co/Soofi-Project/Soofi-S-Isar-Preview-FP8
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Soofi-S-Isar-Preview-FP8

⚠️ Preview / internal checkpoint. Weights and metadata may still change.

FP8 (W8A8 dynamic) quantization of Soofi-Project/Soofi-S-Isar-Preview for space-efficient serving with vLLM. This is the Isar reasoning variant — it emits explicit <think> traces before the final answer.

Quantized from bf16 safetensors with llm-compressor to the compressed-tensors FP8 format, which vLLM loads natively at roughly half the weight memory of the bf16 checkpoint.
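
A data-free FP8 run of this kind could look roughly like the sketch below. It is not the script that produced this checkpoint: the output directory and the `ignore` patterns are assumptions (the real module names should be checked with `model.named_modules()`, and `out_proj` may also match attention layers in some naming schemes), and the `oneshot` / `QuantizationModifier` usage follows llm-compressor's published FP8 example, which may differ between versions.

```python
# Sketch only: not the script used to produce this checkpoint.
BASE_MODEL = "Soofi-Project/Soofi-S-Isar-Preview"  # bf16 source named in this card
SAVE_DIR = "Soofi-S-Isar-Preview-FP8"              # hypothetical output directory


def fp8_recipe_kwargs():
    """Arguments for llm-compressor's QuantizationModifier (assumed patterns)."""
    return {
        "targets": "Linear",      # quantize nn.Linear modules only
        "scheme": "FP8_DYNAMIC",  # per-channel weights, dynamic per-token activations
        # Assumed name patterns for the layers this card says stay unquantized.
        "ignore": ["lm_head", "re:.*in_proj$", "re:.*out_proj$"],
    }


def quantize():
    # Heavy imports live here so the recipe can be inspected without them.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

    # No dataset argument: FP8_DYNAMIC needs no calibration data.
    oneshot(model=model, recipe=QuantizationModifier(**fp8_recipe_kwargs()))

    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
```

Calling `quantize()` downloads the full bf16 checkpoint, so it is left uncalled here.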

Architecture support: SOOFI-S is a custom hybrid Mamba-2/MoE model and ships with its own modeling code (trust_remote_code). FP8 serving requires a vLLM build that supports this architecture; confirm that your vLLM version loads this checkpoint and produces sensible output before depending on it.

Quantization details

| Property | Value |
| --- | --- |
| Scheme | FP8_DYNAMIC (W8A8) |
| Weights | FP8 E4M3, per-channel static scales |
| Activations | FP8 E4M3, per-token dynamic (quantized at runtime) |
| Calibration | none (data-free) |
| Kept in full precision | `lm_head`, Mamba-2 `in_proj`/`out_proj` |

Why dynamic / data-free? Dynamic per-token activation scales need no calibration dataset and are robust for MoE, where a single static activation scale shared across experts is a poor fit. The MoE router is not an nn.Linear, so Linear-targeted quantization skips it and it stays in full precision automatically; the Mamba-2 in_proj/out_proj (the recurrent SSM path) are kept in bf16 because they are the most quantization-sensitive layers.
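
To make the difference between static per-channel and dynamic per-token scales concrete, here is a pure-Python sketch of the numerics. It only emulates E4M3 rounding on lists of floats; the helper names are made up for illustration and this is not how vLLM or llm-compressor implement the kernels.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def round_to_e4m3(x):
    """Round a float to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)           # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)      # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)   # 3 mantissa bits => 8 steps per binade
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)


def quantize_rows(rows):
    """Scale each row so its largest magnitude maps to 448, then round to E4M3."""
    scales = [(max(abs(v) for v in row) / E4M3_MAX) or 1.0 for row in rows]
    q = [[round_to_e4m3(v / s) for v in row] for row, s in zip(rows, scales)]
    return q, scales


# Weights: rows are output channels; scales are computed once and stored.
w_q, w_scales = quantize_rows([[0.5, -0.25], [0.02, 0.01]])

# Activations: rows are tokens; scales are recomputed for every input at runtime,
# so a token with large values does not reduce the resolution of a small one.
x_q, x_scales = quantize_rows([[1.0, -2.0], [100.0, 50.0]])

# Dequantize to inspect the rounding error: q * scale approximates the input.
x_hat = [[v * s for v in row] for row, s in zip(x_q, x_scales)]
print(x_scales)
print(x_hat)
```

Because each token gets its own scale, nothing about the activation range has to be known ahead of time, which is why no calibration set is needed.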

Size scales with the total 30B parameters (not the 3.5B active), so the FP8 weights are roughly half the bf16 size, plus the overhead of the few tensors kept in full precision (listed above).
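
A back-of-the-envelope version of that estimate is sketched below. The 30B total comes from this card; the share of parameters stored in FP8 is a made-up illustrative number, and scale tensors and other metadata are ignored.

```python
def weight_memory_gb(total_params, fp8_fraction):
    """Approximate weight size: 1 byte per FP8 param, 2 bytes per bf16 param."""
    fp8_params = total_params * fp8_fraction
    bf16_params = total_params - fp8_params
    return (fp8_params * 1 + bf16_params * 2) / 1e9


TOTAL_PARAMS = 30e9  # total parameters, from this card

bf16_gb = weight_memory_gb(TOTAL_PARAMS, fp8_fraction=0.0)
# 0.95 is a hypothetical share of parameters stored in FP8, for illustration only.
fp8_gb = weight_memory_gb(TOTAL_PARAMS, fp8_fraction=0.95)

print(f"bf16: {bf16_gb:.1f} GB, FP8 checkpoint: {fp8_gb:.1f} GB")
```

The active parameter count affects compute per token, not how much memory the weights occupy, since every expert has to be resident.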

Usage with vLLM

# OpenAI-compatible server
vllm serve Soofi-Project/Soofi-S-Isar-Preview-FP8 --trust-remote-code

from vllm import LLM, SamplingParams

llm = LLM(model="Soofi-Project/Soofi-S-Isar-Preview-FP8", trust_remote_code=True)
out = llm.chat(
    [{"role": "user", "content": "How many r's are in strawberry?"}],
    # reasoning models spend output tokens on the <think> trace — budget generously
    SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048),
)
print(out[0].outputs[0].text)

--trust-remote-code is required: the custom hybrid Mamba-2/MoE modeling code travels with the checkpoint. FP8 needs a GPU with hardware FP8 support (NVIDIA Hopper/Ada/Blackwell — e.g. H100, L40S, RTX 4090) for the fast path; on older GPUs vLLM falls back to a slower Marlin kernel.
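
One way to check which path a machine is likely to take is to look at the CUDA compute capability: Ada is 8.9 and Hopper is 9.0, so anything at or above (8, 9) has FP8 tensor cores. The sketch below assumes PyTorch is available for the device query; the helper names are illustrative, and vLLM's own kernel selection remains the authority.

```python
# Compute capability of the first NVIDIA architecture with FP8 tensor cores.
FP8_MIN_CAPABILITY = (8, 9)  # Ada Lovelace; Hopper is (9, 0)


def has_native_fp8(capability):
    """True if a CUDA compute capability (major, minor) has hardware FP8."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


def describe_local_gpu():
    """Query GPU 0 via PyTorch, if it is installed and a CUDA device is visible."""
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    cap = torch.cuda.get_device_capability(0)
    path = "native FP8" if has_native_fp8(cap) else "no native FP8 (expect the slower fallback)"
    return f"{torch.cuda.get_device_name(0)}: sm_{cap[0]}{cap[1]}, {path}"


print(describe_local_gpu())
```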

Identity and tool calling work out of the box. vLLM's chat endpoint applies the model's own embedded Jinja chat_template, so the default identity system prompt and the native tool-calling format are applied as-is. Unlike with Ollama, you do not need to supply a SYSTEM block manually.
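
A client for the chat endpoint therefore only needs to send the user turn. The standard-library sketch below assumes the server is running on vLLM's default port 8000 on localhost; the function names are illustrative, and the sampling values mirror the offline example in this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: local server on the default port
MODEL = "Soofi-Project/Soofi-S-Isar-Preview-FP8"


def build_chat_request(user_text, max_tokens=2048):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        # No system message: the server-side chat template supplies the default.
        "messages": [{"role": "user", "content": user_text}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Who are you?"))` should return the model's self-description without any system prompt in the request.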

Reasoning output: the model emits <think> … </think> blocks inline, and these count against the output token limit, so set a generous max_tokens and context length.
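
If only the final answer is needed downstream, the trace can be split off with plain string handling. This is a minimal sketch that assumes the tags appear literally in the returned text; `split_reasoning` is an illustrative helper, not part of vLLM or the model's code.

```python
def split_reasoning(text):
    """Return (reasoning, answer) from raw output with an inline <think> block."""
    head, closing_tag, tail = text.partition("</think>")
    if not closing_tag:
        if "<think>" in text:
            # Opened but never closed: the token budget may have run out
            # while the model was still reasoning, so there is no answer yet.
            return text.split("<think>", 1)[1].strip(), ""
        return "", text.strip()
    # Tolerate output where the opening tag is missing from the returned text.
    reasoning = head.split("<think>", 1)[-1].strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>s-t-r-a-w-b-e-r-r-y</think>\nThree.")
print(answer)
```

An empty answer alongside non-empty reasoning is a useful signal that max_tokens was too small for the prompt.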

Architecture note

This is a hybrid Mixture-of-Experts model designed from scratch: 23 Mamba-2/MoE layers plus 6 attention layers, with 128 routed experts and 1 shared expert per MoE layer and 6 experts active per token (30B total / 3.5B active parameters). FP8 is applied to the Linear layers (attention and MoE expert projections); the Mamba-2 SSM parameters and the router stay in higher precision. A recent version of vLLM is recommended.
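
The gap between total and active parameters comes from sparse routing: each token runs through only a few of the routed experts. The toy sketch below illustrates the idea with scalar "experts". The gating rule (softmax over the selected logits) and the reading that the 6 active experts are routed ones on top of the shared expert are assumptions for illustration, not details confirmed by this card.

```python
import math

NUM_ROUTED = 128  # routed experts per MoE layer (from this card)
TOP_K = 6         # experts active per token (from this card)


def route(router_logits, top_k=TOP_K):
    """Pick the top_k routed experts and softmax-normalize their gate weights.

    Assumption: softmax over the selected logits; the real router may differ.
    """
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, routed_experts, shared_expert):
    """Shared expert always runs; routed experts are mixed by their gate weights."""
    out = shared_expert(x)
    for idx, gate in route(router_logits):
        out += gate * routed_experts[idx](x)
    return out


# Toy setup: expert k multiplies its input by k; the shared expert is the identity.
experts = [lambda x, k=k: k * x for k in range(NUM_ROUTED)]
logits = [float(i) for i in range(NUM_ROUTED)]
chosen = route(logits)
print([idx for idx, _ in chosen])
print(moe_forward(1.0, logits, experts, lambda x: x))
```

Only 6 of the 128 routed experts do any work for a given token here, yet all 128 must be stored, which matches the memory-versus-compute split described above.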

Related models

Base (bf16): Soofi-Project/Soofi-S-Isar-Preview
Other reasoning variant: Soofi-Project/Soofi-S-Rhine-Preview
GGUF (llama.cpp/Ollama): Soofi-Project/Soofi-S-Isar-Preview-GGUF

License & provenance

Released under a custom license ("Other"), following the base model Soofi-Project/Soofi-S-Isar-Preview. TODO: mirror the full license text once the base model card defines it.


Safetensors

Model size: 32B params

Tensor types: F32, BF16, F8_E4M3
