Axe Corsa 28B
A 28 billion parameter language model with weights compressed to 4-bit integers via activation-aware quantization. Activations stay at full BF16 precision throughout every forward pass. The compression is calibration-guided -- it knows which weights matter most before it touches any of them.
The standard approach of compressing everything uniformly trades correctness for simplicity. We take the opposite position: quantize aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.
How the Compression Works
The Two-Stage Pipeline
Axe Corsa is produced by two sequential operations applied to the base model. Each stage has a distinct job and neither can substitute for the other.
Stage 1: Activation-Aware Weight Scaling
Before a single weight is quantized, AWQ runs a calibration forward pass through the model on representative data and measures the per-channel activation magnitudes at every linear layer. The result is an importance score for each input channel:

$$\text{imp}_j = \mathbb{E}_x\big[\,|x_j|\,\big]$$
The insight is that not all weights contribute equally to the output. A weight that multiplies a consistently large activation carries proportionally more signal than one that multiplies near-zero activations. Rounding the former introduces far more output error than rounding the latter. RTN-style quantization ignores this entirely. AWQ uses it as the foundation for everything that follows.
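A minimal numpy sketch of this calibration step (not the production AWQ code; the shapes and values are illustrative). Channels that carry consistently large activations receive proportionally higher importance scores:

```python
import numpy as np

# Hypothetical calibration batch for one linear layer:
# X has shape (tokens, in_features). Channels 0, 3, and 6 are
# given large magnitudes to stand in for "important" channels.
rng = np.random.default_rng(0)
channel_magnitude = np.array([5.0, 0.01, 1.0, 3.0, 0.02, 1.0, 4.0, 0.5])
X = rng.normal(size=(512, 8)) * channel_magnitude

# Importance score per input channel: mean absolute activation.
importance = np.abs(X).mean(axis=0)
```

Weights multiplying channels 0, 3, and 6 would dominate the output error under naive rounding, which is exactly the signal AWQ exploits.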
AWQ computes a per-channel smoothing scale $s$ derived from the importance scores:

$$s_j = \left(\mathbb{E}_x\big[\,|x_j|\,\big]\right)^{\alpha}$$
where $\alpha \in (0, 1)$ is a tuned exponent. This scale is applied to the weight matrix and its inverse is absorbed into the preceding layer:

$$W' = W \cdot \text{diag}(s), \qquad W' \cdot \left(\text{diag}(s)^{-1} X\right) = W X$$
The output is mathematically identical. But $W'$ -- the rescaled weight matrix -- has a much more uniform per-channel magnitude distribution. The channels that matter most have been pushed into a range where the quantization grid is densest relative to their values. The channels that matter least have been pushed into a range where quantization error, even if larger in absolute terms, has minimal effect on the output.
The division $X' = \text{diag}(s)^{-1} \cdot X$ is absorbed into the preceding LayerNorm weights and biases -- zero inference overhead, no extra operation at runtime.
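The rescale-and-absorb identity can be checked directly. A sketch for a single linear layer, with $\alpha = 0.5$ as an illustrative exponent choice:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 8))     # activations: (tokens, in_features)
W = rng.normal(size=(16, 8))     # weights: (out_features, in_features)

# Importance-derived smoothing scale; alpha in (0, 1).
alpha = 0.5
s = np.abs(X).mean(axis=0) ** alpha

W_scaled = W * s                 # W' = W · diag(s): scale weight columns
X_scaled = X / s                 # X' = diag(s)^{-1} · X; in the real model this
                                 # division folds into the preceding LayerNorm

# The rescaled product is mathematically identical to the original.
assert np.allclose(X_scaled @ W_scaled.T, X @ W.T)
```

The point of the exercise: $W'$ is what gets quantized, and its columns now sit in ranges matched to how much each channel's activations actually contribute.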
Stage 2: W4A16_ASYM -- Asymmetric 4-bit Weight Quantization
After AWQ has rescaled the weights, quantization maps them to a 4-bit integer grid. The scheme is asymmetric: each group of 128 consecutive weights gets both a scale and a zero point, allowing the quantization range to shift away from zero and cover non-symmetric weight distributions faithfully.
The forward mapping:

$$q = \text{clamp}\!\left(\text{round}\!\left(\frac{w}{s_g}\right) + z_g,\; 0,\; 15\right)$$
The inverse at inference time (dequantization, fused into the matrix multiply):

$$\hat{w} = s_g \cdot (q - z_g)$$
where $s_g$ is the BF16 scale for group $g$ and $z_g$ is the INT32 zero point for group $g$. Activations $X$ are never quantized -- they enter the matrix multiply at full BF16 precision, and the dequantized weights $\hat{W}$ are reconstructed on-the-fly in GPU registers before the multiply-accumulate executes.
Why asymmetric matters here. Symmetric quantization forces the zero of the quantization grid to align with the zero of the weight distribution. For weight distributions with a non-zero mean -- which is common after AWQ's rescaling has shifted channel magnitudes -- symmetric quantization wastes half the available integer range. Asymmetric quantization lets the grid float to wherever the weights actually are. All 16 representable INT4 values are used.
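The round trip can be sketched in a few lines. This is an illustrative reference implementation of group-wise asymmetric INT4 quantization, not the fused production kernel; the deliberately non-zero-mean weights show the case where the asymmetric grid earns its keep:

```python
import numpy as np

def quantize_asym_int4(w, group_size=128):
    """Group-wise asymmetric 4-bit quantization: one scale and one
    zero point per group of consecutive weights (reference sketch)."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                  # 2^4 - 1 steps span the group
    zero = np.round(-w_min / scale)                 # integer zero point in [0, 15]
    q = np.clip(np.round(w / scale) + zero, 0, 15)  # forward mapping
    return q.astype(np.int32), scale, zero

def dequantize(q, scale, zero):
    """Inverse mapping, fused into the matmul at inference time."""
    return (q - zero) * scale

rng = np.random.default_rng(2)
w = rng.normal(loc=0.3, size=1024)                  # shifted, non-zero-mean weights
q, scale, zero = quantize_asym_int4(w)
w_hat = dequantize(q, scale, zero).reshape(-1)

# Because the grid floats to [w_min, w_max] per group, the worst-case
# reconstruction error stays within one quantization step.
max_err = np.abs(w - w_hat).max()
```

A symmetric grid on the same shifted distribution would pin its zero to the weight value 0 and spend part of its 16 codes on a range the weights never occupy.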
What the Tensor Types Tell You
The model stores three distinct tensor types, each with a specific role:
- BF16 -- Quantization scales (one per group of 128 weights, stored as BF16) and the lm_head weight matrix, which is excluded from quantization and preserved at full precision.
- I32 -- Zero points for asymmetric quantization, one per group of 128 weights. An integer container is required because zero points are offsets on the integer grid, not floating-point values.
- I64 -- Packed INT4 weight tensors. The compressed-tensors format packs two INT4 values into each byte and stores the result in 64-bit integer containers. The actual weight precision is 4 bits per parameter; the I64 container is a serialisation choice, not a precision statement.
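The byte-level idea behind the packing is simple to demonstrate. A sketch of two-INT4-per-byte nibble packing (the actual compressed-tensors layout groups these bytes into I64 containers, but the 4-bits-per-parameter arithmetic is identical):

```python
import numpy as np

# Six INT4 values, each in [0, 15].
q = np.array([3, 15, 0, 7, 12, 1], dtype=np.uint8)

# Pack pairs: even-index value in the high nibble, odd-index in the low.
packed = (q[0::2] << 4) | q[1::2]

# Unpacking recovers the original values exactly -- packing is lossless;
# all precision loss happened earlier, at quantization time.
unpacked = np.empty_like(q)
unpacked[0::2] = packed >> 4
unpacked[1::2] = packed & 0x0F
assert np.array_equal(q, unpacked)
```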
Storage Cost
The effective memory cost per quantized parameter:

$$\underbrace{4}_{\text{INT4 weight}} + \underbrace{\tfrac{16}{128}}_{\text{BF16 scale}} + \underbrace{\tfrac{32}{128}}_{\text{INT32 zero point}} = 4.375 \text{ bits per parameter}$$
Original Qwen3.6-27B in BF16 occupies approximately 54 GB. The quantized weight layers in Axe Corsa come to roughly 14--15 GB for those parameters, with the lm_head and quantization metadata stored at full precision on top. The total on-disk footprint is under a third of the original.
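The arithmetic behind these figures, spelled out (parameter count approximated as 27B):

```python
# Per-parameter storage for W4A16_ASYM with group_size=128.
weight_bits = 4
scale_bits = 16 / 128   # one BF16 scale shared by 128 weights
zero_bits = 32 / 128    # one INT32 zero point shared by 128 weights
bits_per_param = weight_bits + scale_bits + zero_bits   # 4.375 bits

# Compression ratio versus 16-bit BF16 weights.
compression = 16 / bits_per_param                       # ~3.66x

# Approximate size of the quantized weight layers for ~27B parameters.
gb = 27e9 * bits_per_param / 8 / 1e9                    # ~14.8 GB
```

The group metadata (scale plus zero point) costs only 0.375 bits per parameter, which is why the effective rate stays so close to the raw 4-bit weight precision.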
Precision Mapping Across the Architecture
Quantized to INT4 (W4A16_ASYM, group_size=128)
All Linear layers within the transformer blocks: Q, K, V, and output projections in attention, and the up, gate, and down projections in the expert MLPs.
AWQ's activation-aware scaling is applied before quantization on all of these layers, meaning the quantization grid is calibrated to the actual activation distribution of each channel -- not a uniform assumption.
Preserved at full precision
| Component | Reason |
|---|---|
| Language model head (lm_head) | The final projection onto vocabulary logits. Errors here directly reshape the output distribution and affect every token the model generates. |
Memory
The 3.66x weight compression frees VRAM that maps directly into available KV cache budget. At long context lengths, this is the difference between fitting a request and dropping it. At fixed hardware, it is the difference between serving a handful of concurrent users and serving many more with the same machine.
Deployment via vLLM
Axe Corsa is compatible with vLLM.
Text generation:
```shell
vllm serve srswti/axe-corsa-28b --reasoning-parser qwen3
```
Tool use:
```shell
vllm serve srswti/axe-corsa-28b \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Speculative decoding via Multi-Token Prediction:
```shell
vllm serve srswti/axe-corsa-28b \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}'
```
Production config -- long context with FP8 KV cache:
```shell
vllm serve srswti/axe-corsa-28b \
  --trust-remote-code \
  --max-model-len 131072 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3
```
Send requests using the OpenAI-compatible endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-corsa-28b",
    messages=messages,
)

print(response.choices[0].message.content)
```
Evaluation
Benchmarks are in progress. This page will be updated when results across the full suite are verified.
Model tree for srswti/axe-corsa-28b
Base model: Qwen/Qwen3.6-27B