# Axe-Strada-28b
A 28 billion parameter multimodal model, built for NVIDIA Blackwell hardware. Compressed with 4-bit block floating point across the language model's linear layers, with the vision tower, routing gates, and embedding layers fully preserved. The result fits where the original could not and runs substantially faster where the original was already fast.
The standard approach of quantizing everything uniformly trades correctness for simplicity. We take the opposite position: compress aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.
## How the Compression Works

### The Numerical Format
The compressed layers in Axe Strada operate in a two-level block floating point format. Every weight value is stored as an E2M1 4-bit float: 1 sign bit, 2 exponent bits, 1 mantissa bit. Sixteen of these values are grouped into a block, and each block carries a shared F8_E4M3 scale factor. A single F32 scale applies across the full tensor as a second-level anchor.
The representable values in E2M1 are a small, fixed codebook:

$$\{0,\ \pm 0.5,\ \pm 1,\ \pm 1.5,\ \pm 2,\ \pm 3,\ \pm 4,\ \pm 6\}$$

That is 14 non-zero values plus zero (16 bit patterns, with +0 and -0 coinciding). But the two-level scaling architecture means the actual range covered by any given block is determined by its FP8 scale, and the range covered by the full tensor is determined by the F32 scale. The per-block scale maps the 16 local values so the largest-magnitude element in that block lands at the FP4 maximum representable value. What looks like a narrow codebook becomes, with local rescaling, a numerically faithful representation of the original weight distribution.
The simple version. Each weight is stored in 4 bits. Every 16 weights share a small header that tells the GPU what magnitude range those weights live in. A second header tells the GPU the global scale of the whole tensor. At compute time, the GPU reconstructs the full-precision effective value on the fly, inside the matrix multiply unit, without ever writing a higher-precision copy back to memory. Storage is 4 bits per weight plus a small bookkeeping overhead, giving an effective cost of about 4.5 bits per parameter versus 16 bits in BF16.
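A minimal NumPy sketch of the two-level scheme, for intuition only: the shipped kernels do all of this inside the Tensor Core, and the real format additionally rounds the block scale to F8_E4M3, which this sketch skips.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes
FP4_MAX = 6.0   # largest E2M1 magnitude
BLOCK = 16      # weights sharing one block scale

def quantize_block(w, s_tensor):
    """Quantize one 16-element block to E2M1 codes plus a shared block scale."""
    absmax = np.abs(w).max()
    # The per-block scale maps the block's absmax onto the FP4 maximum.
    s_block = absmax / (FP4_MAX * s_tensor) if absmax > 0 else 1.0
    scaled = w / (s_block * s_tensor)
    # Round each value to the nearest codebook magnitude, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], s_block

def dequantize_block(q, s_block, s_tensor):
    # The reconstruction the Tensor Core performs inline during the matmul.
    return q * s_block * s_tensor

rng = np.random.default_rng(0)
w = rng.standard_normal(BLOCK).astype(np.float32)
q, s_block = quantize_block(w, s_tensor=1.0)
print(np.abs(w - dequantize_block(q, s_block, 1.0)).max())  # small per-block error
```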
### How the Matrix Multiply Changes
Every linear layer computes:

$$Y = XW^{\top}$$
In BF16, both $X$ and $W$ are 16-bit values. The GPU loads 16 bits per weight element from VRAM, performs the multiply-accumulate in an FP32 accumulator, and writes the output. Memory bandwidth is the primary constraint during autoregressive decode.
In the 4-bit block format, the operation is restructured. For a block of 16 weight elements $\{w_0, \dots, w_{15}\}$ with block scale $s_{block}$ and tensor scale $s_{tensor}$, each reconstructed weight is:

$$\hat{w}_i = q_i \cdot s_{block} \cdot s_{tensor}$$

where $q_i$ is the stored 4-bit E2M1 code for $w_i$.
The Blackwell Tensor Core handles this reconstruction natively. The two-level dequantization is fused into the matrix multiply instruction. The GPU executes FP4 matrix multiplies with inline scale application, accumulating into FP16 or FP32, and never touches a BF16 weight copy at any point in the pipeline.
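For reference, the arithmetic (not the execution model; on hardware the scales are applied inside the MMA instruction, never as separate memory traffic) can be emulated in plain NumPy:

```python
import numpy as np

def fp4_matmul(x, q, block_scales, s_tensor):
    """x: [M, K] activations; q: [K, N] stored E2M1 values;
    block_scales: [K // 16, N] -- one scale per block of 16 rows of q."""
    # Inline two-level dequant: w_hat[i] = q[i] * s_block * s_tensor ...
    w_hat = q * np.repeat(block_scales, 16, axis=0) * s_tensor
    # ... followed by an ordinary matmul with FP32 accumulation.
    return x.astype(np.float32) @ w_hat.astype(np.float32)
```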
What this means for throughput. The B200's fifth-generation Tensor Cores expose a dedicated FP4 compute path that does not exist on any prior architecture:
| Precision | Tensor Core throughput |
|---|---|
| BF16 | ~4.5 PFLOPS |
| FP8 | ~9 PFLOPS |
| FP4 | ~18 PFLOPS |
The FP4 path delivers 4x the raw compute throughput of BF16 on the same chip. Combined with the 3.5x reduction in weight data movement, the compounding effect on decode throughput is substantial at the batch sizes relevant to production serving.
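A back-of-envelope check on the bandwidth side, assuming the B200's nominal ~8 TB/s HBM3e bandwidth and counting weight traffic only:

```python
# Ceilings only: decode must stream the touched weights from HBM for every
# generated token. This counts the full weight set, which is pessimistic
# for MoE (only routed experts are read) and ignores KV/activation traffic.
HBM_BYTES_PER_S = 8e12  # assumed nominal B200 HBM3e bandwidth

for label, weight_bytes in [("BF16 (~55 GB)", 55e9), ("FP4 block (~19 GB)", 19e9)]:
    ceiling = HBM_BYTES_PER_S / weight_bytes
    print(f"{label}: <= {ceiling:.0f} tok/s single-stream ceiling")
```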
Activations are quantized per-token at runtime, with scales derived from the live activation distribution of each forward pass. No calibration corpus. No offline statistics.
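A sketch of what that per-token scaling looks like, in illustrative NumPy (the production kernels fuse this into the GEMM):

```python
import numpy as np

FP4_MAX = 6.0  # largest E2M1 magnitude, as in the weight codebook above

def quantize_activations_per_token(x):
    """x: [tokens, hidden]. Each token row gets its own scale, taken from the
    live absmax of this forward pass -- no calibration corpus involved."""
    scales = np.abs(x).max(axis=-1, keepdims=True) / FP4_MAX
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero rows
    return x / scales, scales  # scaled values are then rounded to E2M1 codes
```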
## Precision Mapping Across the Architecture
Through our own layer-by-layer profiling of activation distributions, routing sensitivity, and accumulated rounding error, we identified exactly which components of this architecture can absorb 4-bit compression without behavioral change.
### Quantized to 4-bit block floating point
All standard linear projections within the language model: Q, K, V, and output projections in attention, and the up, gate, and down projections in the routed expert MLPs. These layers constitute the overwhelming majority of parameter count and memory bandwidth in the model.
### Preserved at BF16
| Component | Reason |
|---|---|
| Visual encoder | Vision features have a fundamentally different activation distribution from language features. 4-bit compression here degrades spatial and perceptual grounding in ways that propagate into cross-modal attention. |
| MoE router gates | Routing is a discrete, winner-take-all decision. Small numerical errors here misroute tokens to the wrong expert entirely, with effects that cannot be recovered downstream in the same forward pass. |
| Shared expert gate | Controls whether the shared expert fires at all, every token, every forward pass. Same sensitivity class as the router. |
| Language model head | The final projection onto vocabulary logits shapes the output distribution. Errors here affect sampling, greedy decoding, and structured output fidelity at the token level. |
The stored tensor types in this model are F32, BF16, F8_E4M3, and U8 -- reflecting the two-level scale storage (F32 global, F8 block) alongside the packed 4-bit weight values.
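In code, this precision map reduces to a name-based filter at packing time. The module-name fragments below are hypothetical placeholders in the style of Qwen-family checkpoints, not this model's actual parameter names:

```python
# Hypothetical name fragments for the components preserved at BF16.
PRESERVE_PATTERNS = (
    "visual.",             # vision encoder
    "mlp.gate.",           # MoE router gate
    "shared_expert_gate",  # shared expert gate
    "lm_head",             # language model head
    "embed_tokens",        # embedding layers
)

# Standard linear projections inside the language model are packed to FP4.
QUANTIZABLE_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                        "up_proj", "gate_proj", "down_proj")

def is_quantized(name: str) -> bool:
    return name.endswith(QUANTIZABLE_SUFFIXES) and not any(
        p in name for p in PRESERVE_PATTERNS
    )
```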
## Memory and KV Cache
Original Qwen3.6-27B in BF16 occupies approximately 55 GB. Axe Strada brings this to approximately 19 GB on disk -- a 2.9x reduction in footprint. That difference is not just storage. It is the gap between needing multiple GPUs and fitting on one. It is the headroom that becomes available KV cache.
The KV cache scales with sequence length and concurrent requests:

$$\text{KV bytes} = 2 \cdot L \cdot H \cdot d \cdot T \cdot b$$

Where $L$ is the number of layers, $H$ is the number of KV heads, $d$ is the head dimension, $T$ is the sequence length, and $b$ is bytes per element; the factor of 2 covers keys and values. With the quantized layers' weight footprint reduced by 3.5x, VRAM previously committed to holding model weights is now available for KV cache. At 256K context on a 192 GB B200, this means more concurrent requests, longer contexts, and higher aggregate throughput without any hardware change.
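Plugging placeholder numbers into the formula (the layer and head counts below are illustrative, not this model's actual configuration):

```python
def kv_cache_bytes(L, H, d, T, b):
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * L * H * d * T * b

# e.g. 48 attention layers, 8 KV heads, head dim 128, 256K tokens, FP8 (1 byte):
gib = kv_cache_bytes(L=48, H=8, d=128, T=262_144, b=1) / 2**30
print(f"{gib:.0f} GiB of KV cache for one full-context request")  # 24 GiB
```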
## Throughput
Measured on a single NVIDIA Blackwell GPU, vLLM 0.19.1rc1, 256K context, KV FP8, max-num-seqs 2:
| Prompt length | Single tok/s | 2-parallel aggregate tok/s | Per-request tok/s |
|---|---|---|---|
| Short (50 tokens) | 59.3 | 110.7 | 55.3 |
| Medium (350 tokens) | 60.6 | 123.5 | 61.75 |
| Long-form (700 tokens) | 60.7 | 122.8 | 61.4 |
Available KV cache memory at 256K context with FP8 KV cache: 66.79 GiB, supporting a maximum concurrency of 7.95 requests at the full 256K context length (roughly 8.4 GiB of KV cache per such request).
## Deployment via vLLM
Axe Strada is compatible with vLLM on NVIDIA Blackwell hardware.
Production config -- 256K context with FP8 KV cache:
```bash
vllm serve srswti/axe-strada-28b \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3
```
Text only -- skip the vision encoder to free memory for additional KV cache:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3 --language-model-only
```
Multimodal -- full vision and language support:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3
```
Tool use:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```
Speculative decoding via Multi-Token Prediction:
```bash
vllm serve srswti/axe-strada-28b --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}'
```
Send requests using the OpenAI-compatible endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-strada-28b",
    messages=messages,
)

print(response.choices[0].message.content)
```
Requirements: NVIDIA Blackwell GPU (SM120), vLLM >= 0.19.
## Evaluation
Benchmarks are in progress. This page will be updated when results across the full suite are verified.
Base model: Qwen/Qwen3.6-27B