# Axe-Strada-28b
A 28 billion parameter multimodal model, built for NVIDIA Blackwell hardware. Compressed with 4-bit block floating point across the language model's linear layers, with the vision tower, routing gates, and embedding layers fully preserved. The result fits where the original could not and runs substantially faster where the original was already fast.
The standard approach of quantizing everything uniformly trades correctness for simplicity. We take the opposite position: compress aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.
## How the Compression Works

### The Numerical Format
The compressed layers in Axe Strada operate in a two-level block floating point format. Every weight value is stored as an E2M1 4-bit float: 1 sign bit, 2 exponent bits, 1 mantissa bit. Sixteen of these values are grouped into a block, and each block carries a shared F8_E4M3 scale factor. A single F32 scale applies across the full tensor as a second-level anchor.
The representable values in E2M1 are a small, fixed codebook:

$$\{0,\ \pm 0.5,\ \pm 1,\ \pm 1.5,\ \pm 2,\ \pm 3,\ \pm 4,\ \pm 6\}$$

That is 14 non-zero values plus zero (16 bit patterns, with +0 and -0 coinciding). But the two-level scaling architecture means the actual range covered by any given block is determined by its FP8 scale, and the range covered by the full tensor is determined by the F32 scale. The per-block scale maps the 16 local values so the largest-magnitude element in that block lands at the FP4 maximum representable value. What looks like a narrow codebook becomes, with local rescaling, a numerically faithful representation of the original weight distribution.
The simple version. Each weight is stored in 4 bits. Every 16 weights share a small header that tells the GPU what magnitude range those weights live in. A second header tells the GPU the global scale of the whole tensor. At compute time, the GPU reconstructs the full-precision effective value on the fly, inside the matrix multiply unit, without ever writing a higher-precision copy back to memory. Storage is 4 bits per weight plus a small bookkeeping overhead, giving an effective cost of about 4.5 bits per parameter versus 16 bits in BF16.
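A minimal NumPy sketch of the two-level scheme, for intuition only: the shipped kernels do all of this inside the Tensor Core, and the real format additionally rounds the block scale to F8_E4M3, which this sketch skips.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes
FP4_MAX = 6.0   # largest E2M1 magnitude
BLOCK = 16      # weights sharing one block scale

def quantize_block(w, s_tensor):
    """Quantize one 16-element block to E2M1 codes plus a shared block scale."""
    absmax = np.abs(w).max()
    # The per-block scale maps the block's absmax onto the FP4 maximum.
    s_block = absmax / (FP4_MAX * s_tensor) if absmax > 0 else 1.0
    scaled = w / (s_block * s_tensor)
    # Round each value to the nearest codebook magnitude, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], s_block

def dequantize_block(q, s_block, s_tensor):
    # The reconstruction the Tensor Core performs inline during the matmul.
    return q * s_block * s_tensor

rng = np.random.default_rng(0)
w = rng.standard_normal(BLOCK).astype(np.float32)
q, s_block = quantize_block(w, s_tensor=1.0)
print(np.abs(w - dequantize_block(q, s_block, 1.0)).max())  # small per-block error
```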
### How the Matrix Multiply Changes
Every linear layer computes:

$$Y = XW^{\top}$$
In BF16, both $X$ and $W$ are 16-bit values. The GPU loads 16 bits per weight element from VRAM, performs the multiply-accumulate in an FP32 accumulator, and writes the output. Memory bandwidth is the primary constraint during autoregressive decode.
In the 4-bit block format, the operation is restructured. For a block of 16 weight elements $\{w_0, \dots, w_{15}\}$ with block scale $s_{block}$ and tensor scale $s_{tensor}$, each reconstructed weight is:

$$\hat{w}_i = q_i \cdot s_{block} \cdot s_{tensor}$$

where $q_i$ is the stored 4-bit E2M1 code for $w_i$.
The Blackwell Tensor Core handles this reconstruction natively. The two-level dequantization is fused into the matrix multiply instruction. The GPU executes FP4 matrix multiplies with inline scale application, accumulating into FP16 or FP32, and never touches a BF16 weight copy at any point in the pipeline.
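For reference, the arithmetic (not the execution model; on hardware the scales are applied inside the MMA instruction, never as separate memory traffic) can be emulated in plain NumPy:

```python
import numpy as np

def fp4_matmul(x, q, block_scales, s_tensor):
    """x: [M, K] activations; q: [K, N] stored E2M1 values;
    block_scales: [K // 16, N] -- one scale per block of 16 rows of q."""
    # Inline two-level dequant: w_hat[i] = q[i] * s_block * s_tensor ...
    w_hat = q * np.repeat(block_scales, 16, axis=0) * s_tensor
    # ... followed by an ordinary matmul with FP32 accumulation.
    return x.astype(np.float32) @ w_hat.astype(np.float32)
```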
What this means for throughput. The B200's fifth-generation Tensor Cores expose a dedicated FP4 compute path that does not exist on any prior architecture:
| Precision | Tensor Core throughput |
|---|---|
| BF16 | ~4.5 PFLOPS |
| FP8 | ~9 PFLOPS |
| FP4 | ~18 PFLOPS |
The FP4 path delivers 4x the raw compute throughput of BF16 on the same chip. Combined with the 3.5x reduction in weight data movement, the compounding effect on decode throughput is substantial at the batch sizes relevant to production serving.
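A back-of-envelope check on the bandwidth side, assuming the B200's nominal ~8 TB/s HBM3e bandwidth and counting weight traffic only:

```python
# Ceilings only: decode must stream the touched weights from HBM for every
# generated token. This counts the full weight set, which is pessimistic
# for MoE (only routed experts are read) and ignores KV/activation traffic.
HBM_BYTES_PER_S = 8e12  # assumed nominal B200 HBM3e bandwidth

for label, weight_bytes in [("BF16 (~55 GB)", 55e9), ("FP4 block (~19 GB)", 19e9)]:
    ceiling = HBM_BYTES_PER_S / weight_bytes
    print(f"{label}: <= {ceiling:.0f} tok/s single-stream ceiling")
```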
Activations are quantized per-token at runtime, with scales derived from the live activation distribution of each forward pass. No calibration corpus. No offline statistics.
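A sketch of what that per-token scaling looks like, in illustrative NumPy (the production kernels fuse this into the GEMM):

```python
import numpy as np

FP4_MAX = 6.0  # largest E2M1 magnitude, as in the weight codebook above

def quantize_activations_per_token(x):
    """x: [tokens, hidden]. Each token row gets its own scale, taken from the
    live absmax of this forward pass -- no calibration corpus involved."""
    scales = np.abs(x).max(axis=-1, keepdims=True) / FP4_MAX
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero rows
    return x / scales, scales  # scaled values are then rounded to E2M1 codes
```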
## Precision Mapping Across the Architecture
Through our own layer-by-layer profiling of activation distributions, routing sensitivity, and accumulated rounding error, we identified exactly which components of this architecture can absorb 4-bit compression without behavioral change.
### Quantized to 4-bit block floating point
All standard linear projections within the language model: Q, K, V, and output projections in attention, and the up, gate, and down projections in the routed expert MLPs. These layers constitute the overwhelming majority of parameter count and memory bandwidth in the model.
### Preserved at BF16
| Component | Reason |
|---|---|
| Visual encoder | Vision features have a fundamentally different activation distribution from language features. 4-bit compression here degrades spatial and perceptual grounding in ways that propagate into cross-modal attention. |
| MoE router gates | Routing is a discrete, winner-take-all decision. Small numerical errors here misroute tokens to the wrong expert entirely, with effects that cannot be recovered downstream in the same forward pass. |
| Shared expert gate | Controls whether the shared expert fires at all, every token, every forward pass. Same sensitivity class as the router. |
| Language model head | The final projection onto vocabulary logits shapes the output distribution. Errors here affect sampling, greedy decoding, and structured output fidelity at the token level. |
The stored tensor types in this model are F32, BF16, F8_E4M3, and U8 -- reflecting the two-level scale storage (F32 global, F8 block) alongside the packed 4-bit weight values.
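In code, this precision map reduces to a name-based filter at packing time. The module-name fragments below are hypothetical placeholders in the style of Qwen-family checkpoints, not this model's actual parameter names:

```python
# Hypothetical name fragments for the components preserved at BF16.
PRESERVE_PATTERNS = (
    "visual.",             # vision encoder
    "mlp.gate.",           # MoE router gate
    "shared_expert_gate",  # shared expert gate
    "lm_head",             # language model head
    "embed_tokens",        # embedding layers
)

# Standard linear projections inside the language model are packed to FP4.
QUANTIZABLE_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                        "up_proj", "gate_proj", "down_proj")

def is_quantized(name: str) -> bool:
    return name.endswith(QUANTIZABLE_SUFFIXES) and not any(
        p in name for p in PRESERVE_PATTERNS
    )
```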
## Memory and KV Cache
Original Qwen3.6-27B in BF16 occupies approximately 55 GB. Axe Strada brings this to approximately 19 GB on disk -- a 2.9x reduction in footprint. That difference is not just storage. It is the gap between needing multiple GPUs and fitting on one. It is the headroom that becomes available KV cache.
The KV cache scales with sequence length and concurrent requests:

$$\text{KV bytes} = 2 \cdot L \cdot H \cdot d \cdot T \cdot b$$

Where $L$ is the number of layers, $H$ is the number of KV heads, $d$ is the head dimension, $T$ is the sequence length, and $b$ is bytes per element; the factor of 2 covers keys and values. With the quantized layers' weight footprint reduced by 3.5x, VRAM previously committed to holding model weights is now available for KV cache. At 256K context on a 192 GB B200, this means more concurrent requests, longer contexts, and higher aggregate throughput without any hardware change.
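Plugging placeholder numbers into the formula (the layer and head counts below are illustrative, not this model's actual configuration):

```python
def kv_cache_bytes(L, H, d, T, b):
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * L * H * d * T * b

# e.g. 48 attention layers, 8 KV heads, head dim 128, 256K tokens, FP8 (1 byte):
gib = kv_cache_bytes(L=48, H=8, d=128, T=262_144, b=1) / 2**30
print(f"{gib:.0f} GiB of KV cache for one full-context request")  # 24 GiB
```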
## Throughput
Measured on a single NVIDIA Blackwell GPU, vLLM 0.19.1rc1, 256K context, KV FP8, max-num-seqs 2:
| Prompt length | Single tok/s | 2-parallel aggregate tok/s | Per-request tok/s |
|---|---|---|---|
| Short (50 tokens) | 59.3 | 110.7 | 55.3 |
| Medium (350 tokens) | 60.6 | 123.5 | 61.75 |
| Long-form (700 tokens) | 60.7 | 122.8 | 61.4 |
Available KV cache memory at 256K context with FP8 KV cache: 66.79 GiB, supporting a maximum concurrency of 7.95 requests at the full 256K context length (roughly 8.4 GiB of KV cache per such request).
## Deployment via vLLM
Axe Strada is compatible with vLLM on NVIDIA Blackwell hardware.
Production config -- 256K context with FP8 KV cache:
```bash
vllm serve srswti/axe-strada-28b \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3
```
Text only -- skip the vision encoder to free memory for additional KV cache:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3 --language-model-only
```
Multimodal -- full vision and language support:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3
```
Tool use:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
Speculative decoding via Multi-Token Prediction:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}'
```
Send requests using the OpenAI-compatible endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-strada-28b",
    messages=messages,
)

print(response.choices[0].message.content)
```
Requirements: NVIDIA Blackwell GPU (SM120), vLLM >= 0.19.
## Evaluation
Benchmarks are in progress. This page will be updated when results across the full suite are verified.
Base model: Qwen/Qwen3.6-27B