Axe Corsa 28B

A 28 billion parameter language model with weights compressed to 4-bit integers via activation-aware quantization. Activations stay at full BF16 precision throughout every forward pass. The compression is calibration-guided -- it knows which weights matter most before it touches any of them.

The standard approach of compressing everything uniformly trades correctness for simplicity. We take the opposite position: quantize aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.


How the Compression Works

The Two-Stage Pipeline

Axe Corsa is produced by two sequential operations applied to the base model. Each stage has a distinct job and neither can substitute for the other.

Stage 1: Activation-Aware Weight Scaling

Before a single weight is quantized, AWQ runs a calibration forward pass through the model on representative data and measures the per-channel activation magnitudes at every linear layer. The result is an importance score for each input channel:

$$\text{importance}_i = \mathbb{E}\left[\,|x_i|\,\right]$$

The insight is that not all weights contribute equally to the output. A weight that multiplies a consistently large activation carries proportionally more signal than one that multiplies near-zero activations. Rounding the former introduces far more output error than rounding the latter. Plain round-to-nearest (RTN) quantization ignores this entirely. AWQ uses it as the foundation for everything that follows.
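For intuition, here is a minimal sketch of how such per-channel statistics could be gathered with forward hooks. The model and calibration loader are placeholder names for the sketch; the actual AWQ calibration lives in its own tooling.

import torch

def collect_channel_importance(model, calib_loader, device="cuda"):
    """Estimate E[|x_i|] per input channel of every nn.Linear via forward hooks."""
    stats, counts, hooks = {}, {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()
            # Flatten leading dims so each row is one token's activation vector.
            x = x.reshape(-1, x.shape[-1]).abs().float()
            mean_abs, n = x.mean(dim=0), x.shape[0]
            if name not in stats:
                stats[name], counts[name] = mean_abs, n
            else:
                total = counts[name] + n
                stats[name] = (stats[name] * counts[name] + mean_abs * n) / total
                counts[name] = total
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in calib_loader:      # representative calibration batches (placeholder)
            model(batch.to(device))

    for h in hooks:
        h.remove()
    return stats                        # importance_i = E[|x_i|] per layer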

AWQ computes a per-channel smoothing scale $s$ derived from the importance scores:

$$s_i = \text{importance}_i^{\,\alpha}$$

where $\alpha \in (0, 1)$ is a tuned exponent. This scale is applied to the weight matrix and its inverse is absorbed into the preceding layer:

$$W \cdot X = \underbrace{(W \cdot \text{diag}(s))}_{W'} \cdot \underbrace{(\text{diag}(s)^{-1} \cdot X)}_{X'}$$

The output is mathematically identical. But $W'$ -- the rescaled weight matrix -- has a much more uniform per-channel magnitude distribution. The channels that matter most have been pushed into a range where the quantization grid is densest relative to their values. The channels that matter least have been pushed into a range where quantization error, even if larger in absolute terms, has minimal effect on the output.

The division $X' = \text{diag}(s)^{-1} \cdot X$ is absorbed into the preceding LayerNorm weights and biases -- zero inference overhead, no extra operation at runtime.
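A small sketch of the rescaling identity on a toy linear layer. The exponent $\alpha$ and the tensor sizes are chosen only for illustration; in the real pipeline the inverse scale is folded into the preceding LayerNorm rather than applied to activations directly.

import torch

torch.manual_seed(0)
d_in, d_out, alpha = 8, 4, 0.5

W = torch.randn(d_out, d_in)              # weight of a linear layer (out x in)
X = torch.randn(16, d_in)                 # a batch of activations

importance = X.abs().mean(dim=0)          # E[|x_i|] per input channel
s = importance.clamp(min=1e-5) ** alpha   # per-channel smoothing scale

W_prime = W * s                           # scale each input channel of the weights by s
X_prime = X / s                           # inverse scale on activations
                                          # (absorbed into the preceding LayerNorm in practice)

# The product is unchanged up to floating-point error.
assert torch.allclose(X @ W.T, X_prime @ W_prime.T, atol=1e-5)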

Stage 2: W4A16_ASYM -- Asymmetric 4-bit Weight Quantization

After AWQ has rescaled the weights, quantization maps them to a 4-bit integer grid. The scheme is asymmetric: each group of 128 consecutive weights gets both a scale and a zero point, allowing the quantization range to shift away from zero and cover non-symmetric weight distributions faithfully.

The forward mapping:

$$q_i = \text{round}\!\left(\frac{w_i}{s_g}\right) + z_g \quad \in [0,\, 15]$$

The inverse at inference time (dequantization, fused into the matrix multiply):

$$\hat{w}_i = (q_i - z_g) \times s_g$$

Where $s_g$ is the BF16 scale for group $g$ and $z_g$ is the INT32 zero point for group $g$. Activations $X$ are never quantized -- they enter the matrix multiply at full BF16 precision, and the dequantized weights $\hat{W}$ are reconstructed on-the-fly in GPU registers before the multiply-accumulate executes.

Why asymmetric matters here. Symmetric quantization forces the zero of the quantization grid to align with the zero of the weight distribution. For weight distributions with a non-zero mean -- which is common after AWQ's rescaling has shifted channel magnitudes -- symmetric quantization wastes half the available integer range. Asymmetric quantization lets the grid float to wherever the weights actually are. All 16 representable INT4 values are used.
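A sketch of the group-wise asymmetric round trip, assuming a simple min/max scale per group (the production quantizer may choose scales and zero points differently):

import torch

def quantize_asym_int4(w, group_size=128):
    """Group-wise asymmetric 4-bit quantization of a flat weight slice."""
    w = w.reshape(-1, group_size)                      # one row per group
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)   # 16 levels: 0..15
    zero = torch.round(-w_min / scale)                 # integer zero point per group
    q = torch.clamp(torch.round(w / scale) + zero, 0, 15)
    return q.to(torch.uint8), scale, zero.to(torch.int32)

def dequantize_asym_int4(q, scale, zero):
    return (q.float() - zero.float()) * scale          # w_hat = (q - z) * s

w = torch.randn(4096)                                  # e.g. one slice of a projection matrix
q, s, z = quantize_asym_int4(w)
w_hat = dequantize_asym_int4(q, s, z).reshape(-1)
print("max abs reconstruction error:", (w - w_hat).abs().max().item())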

What the Tensor Types Tell You

The model stores three distinct tensor types, each with a specific role:

BF16 -- Quantization scales (one per group of 128 weights, stored as BF16) and the lm_head weight matrix, which is excluded from quantization and preserved at full precision.

I32 -- Zero points for asymmetric quantization. One per group of 128 weights. An integer container is required because zero points are offsets on the integer grid, not floating point values.

I64 -- Packed INT4 weight tensors. The compressed-tensors format packs two INT4 values into each byte, and stores the result in 64-bit integer containers. The actual weight precision is 4 bits per parameter. The I64 container is a serialisation choice, not a precision statement.
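A toy packing example to illustrate the density argument. The nibble ordering below is an assumption for the sketch; the actual bit layout used by the compressed-tensors format may differ.

import numpy as np

def pack_int4(q):
    """Pack pairs of 4-bit values (0..15) into bytes, low nibble first."""
    q = np.asarray(q, dtype=np.uint8)
    assert q.size % 2 == 0 and q.max() <= 15
    lo, hi = q[0::2], q[1::2]
    return (lo | (hi << 4)).astype(np.uint8)

def unpack_int4(packed):
    packed = np.asarray(packed, dtype=np.uint8)
    lo, hi = packed & 0x0F, packed >> 4
    return np.stack([lo, hi], axis=1).reshape(-1)

q = np.array([3, 15, 0, 7], dtype=np.uint8)
assert (unpack_int4(pack_int4(q)) == q).all()          # lossless round trip, 2 values per byte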

Storage Cost

The effective memory cost per quantized parameter:

$$\text{bits/param} = 4\ (\text{weight}) + \frac{16}{128}\ (\text{BF16 scale}) + \frac{32}{128}\ (\text{I32 zero point}) \approx 4.375\ \text{bits}$$

$$\text{Compression vs BF16} \approx \frac{16}{4.375} \approx 3.66\times$$

Original Qwen3.6-27B in BF16 occupies approximately 54 GB. The quantized weight layers in Axe Corsa come to roughly 14--15 GB for those parameters, with the lm_head and metadata stored at full precision on top. Total on-disk footprint is substantially below the original.
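The same arithmetic as a short script. The 27B parameter count for the quantized layers is an assumption used only to reproduce the rough 14--15 GB figure above.

group_size = 128
bits_per_param = 4 + 16 / group_size + 32 / group_size   # weight + BF16 scale + I32 zero point
print(bits_per_param)                                     # 4.375

compression_vs_bf16 = 16 / bits_per_param
print(round(compression_vs_bf16, 2))                      # ~3.66x

quantized_params = 27e9                  # assumed parameter count of the quantized layers
bytes_quantized = quantized_params * bits_per_param / 8
print(f"{bytes_quantized / 1e9:.1f} GB") # ~14.8 GB, in line with the 14-15 GB estimate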


Precision Mapping Across the Architecture

Quantized to INT4 (W4A16_ASYM, group_size=128)

All Linear layers within the transformer blocks: Q, K, V, and output projections in attention, and the up, gate, and down projections in the expert MLPs.

AWQ's activation-aware scaling is applied before quantization on all of these layers, meaning the quantization grid is calibrated to the actual activation distribution of each channel -- not a uniform assumption.

Preserved at full precision

Component: Language model head (lm_head)
Reason: The final projection onto vocabulary logits. Errors here directly reshape the output distribution and affect every token the model generates.

Memory

The 3.66x weight compression frees VRAM that maps directly into available KV cache budget. At long context lengths, this is the difference between fitting a request and dropping it. At fixed hardware, it is the difference between serving a handful of concurrent users and serving many more with the same machine.
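To make the trade concrete, here is a back-of-envelope conversion of freed weight memory into KV-cache headroom. The layer count, KV head count, and head dimension below are assumptions chosen for the sketch, not the model's published configuration.

num_layers     = 48                    # assumed
num_kv_heads   = 8                     # assumed
head_dim       = 128                   # assumed
bytes_per_elem = 1                     # FP8 KV cache (see the production config below)

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
freed_weight_gb = 54 - 15              # rough weight saving from the storage figures above

extra_tokens = freed_weight_gb * 1e9 / kv_bytes_per_token
print(f"~{extra_tokens / 1e3:.0f}k additional tokens of KV-cache headroom")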


Deployment via vLLM

Axe Corsa is compatible with vLLM.

Text generation:

vllm serve srswti/axe-corsa-28b --reasoning-parser qwen3

Tool use:

vllm serve srswti/axe-corsa-28b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Speculative decoding via Multi-Token Prediction:

vllm serve srswti/axe-corsa-28b \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}'

Production config -- long context with FP8 KV cache:

vllm serve srswti/axe-corsa-28b \
  --trust-remote-code \
  --max-model-len 131072 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3

Send requests using the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-corsa-28b",
    messages=messages,
)

print(response.choices[0].message.content)
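If the server was launched with the tool-use flags above, tool calls go through the standard OpenAI tools parameter. The get_weather function below is a made-up example for illustration, and the snippet reuses the client defined above.

# Hypothetical tool definition; any JSON-schema function works the same way.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="srswti/axe-corsa-28b",
    messages=[{"role": "user", "content": "What's the weather in Lisbon right now?"}],
    tools=tools,
)

print(response.choices[0].message.tool_calls)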

Evaluation

Benchmarks are in progress. This page will be updated when results across the full suite are verified.


Model tree

Base model: Qwen/Qwen3.6-27B (quantized to produce this model)