Instructions for using srswti/axe-superveloce-37b with libraries and local apps. Quickstart snippets for each option follow.
- Libraries
- Transformers
How to use srswti/axe-superveloce-37b with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="srswti/axe-superveloce-37b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("srswti/axe-superveloce-37b")
model = AutoModelForImageTextToText.from_pretrained("srswti/axe-superveloce-37b")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Local Apps
- vLLM
How to use srswti/axe-superveloce-37b with vLLM:
Install from pip and serve the model:
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "srswti/axe-superveloce-37b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "srswti/axe-superveloce-37b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
- SGLang
How to use srswti/axe-superveloce-37b with SGLang:
Install from pip and serve the model:
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "srswti/axe-superveloce-37b" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "srswti/axe-superveloce-37b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images:
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "srswti/axe-superveloce-37b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "srswti/axe-superveloce-37b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use srswti/axe-superveloce-37b with Docker Model Runner:
```bash
docker model run hf.co/srswti/axe-superveloce-37b
```
A 37-billion-parameter mixture-of-experts model, built to serve state-of-the-art reasoning, instruction following, and code generation at a fraction of the memory cost of its base model.
The standard approach of quantizing everything uniformly trades correctness for simplicity. We take the opposite position, and a step further: compress aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.
Why F8_E4M3 over simple INT8
INT8 spaces its 256 representable values evenly across the number line. Every step is the same size, regardless of where the weights actually live:
```
INT8 -- uniform spacing, equal steps everywhere

        <- large gap ->    <- large gap ->
-128   -64   -32   -1    0    1   32   64   128
  |     |     |    |     |    |    |    |     |
```
F8_E4M3 spends its representable values where neural network weights actually cluster -- densely packed near zero, thinning out toward the extremes:
```
F8_E4M3 -- non-uniform spacing, dense near zero

                many fine steps here
                     vvvvvvvvv
-448    -4    -1  -0.25   0   0.25   1     4    448
  |      |     |  |||||   |   |||||  |     |     |
                ^^^^^^^^^^^^^^^^^^^
               most weights live here
```
The result: F8_E4M3 represents typical weight distributions with smaller rounding error than INT8 at the same bit width -- which is why FP8 compression consistently loses less accuracy than INT8 compression.
Say you have a weight with value 0.0317. You need to store it in 8 bits.
With INT8, your available grid points near zero are:
```
...  -2/128   -1/128      0      1/128    2/128  ...
... -0.0156  -0.0078      0     0.0078   0.0156  ...
```
The closest representable value to 0.0317 is 0.0313 -- off by 0.0004. Fine.
But now say another weight is 0.0021. The nearest INT8 grid points are 0 and 0.0078, and rounding snaps it to 0 -- a 100% relative error that erases the weight entirely, because the grid steps are too coarse near zero.
With F8_E4M3, the grid near zero has many more tick marks packed in. You can
represent 0.0021 much more faithfully because the format was designed knowing
that most weights live exactly there.
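To make this concrete, here is a minimal sketch that round-trips both weights through each grid. It assumes PyTorch 2.1+ for the `float8_e4m3fn` dtype, and uses the 1/128 step from the example above as a stand-in for INT8's grid; it is an illustration, not the production quantizer.

```python
# Sketch: rounding error of INT8 vs F8_E4M3 on the two example weights.
# Assumes PyTorch >= 2.1 (torch.float8_e4m3fn) and the 1/128 INT8 step
# from the worked example above.
import torch

def int8_round(w: torch.Tensor, step: float = 1 / 128) -> torch.Tensor:
    # Snap to the uniform INT8 grid: multiples of `step`, clamped to [-128, 127].
    return (w / step).round().clamp(-128, 127) * step

def f8_round(w: torch.Tensor) -> torch.Tensor:
    # Round-trip through the non-uniform F8_E4M3 grid.
    return w.to(torch.float8_e4m3fn).to(torch.float32)

for w in (0.0317, 0.0021):
    t = torch.tensor(w)
    q_int8, q_f8 = int8_round(t), f8_round(t)
    print(f"w={w:.4f}  int8 -> {q_int8.item():.4f} (err {abs(q_int8 - t).item():.4f})"
          f"  f8_e4m3 -> {q_f8.item():.6f} (err {abs(q_f8 - t).item():.6f})")
```

For 0.0317 both grids land on 0.03125, but for 0.0021 INT8 rounds to 0 while F8_E4M3 lands on a subnormal value near 0.00195, an order of magnitude less error.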
When you round a weight to the nearest grid point, that rounding error flows forward through every matrix multiply that weight participates in. If most of your weights have small rounding errors, the accumulated output error across a 96-layer model stays small. INT8's uniform grid burns precision on ranges the model never uses. F8_E4M3 concentrates precision exactly where the model needs it.
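The accumulation effect is easy to simulate. The sketch below uses toy dimensions and a synthetic weight distribution with a few outliers (nothing measured from this model); the outliers stretch INT8's per-tensor scale, which is where the uniform grid loses:

```python
# Toy simulation of rounding-error accumulation across stacked matmuls.
# Dimensions and weight statistics are illustrative assumptions only.
import torch

torch.manual_seed(0)
depth, dim = 16, 256
weights = []
for _ in range(depth):
    w = torch.randn(dim, dim) * 0.05
    w.view(-1)[::5000] = 2.5  # a few outlier weights, as real LLM layers exhibit
    weights.append(w)

def int8_rt(w):
    # Uniform grid with a per-tensor max-abs scale: outliers stretch the step.
    s = w.abs().max() / 127
    return (w / s).round().clamp(-128, 127) * s

def f8_rt(w):
    # Non-uniform F8_E4M3 grid: fine steps near zero survive the outliers.
    return w.to(torch.float8_e4m3fn).to(torch.float32)

def forward(x, ws):
    for w in ws:
        x = x @ w.T
    return x

x = torch.randn(8, dim)
ref = forward(x, weights)
for name, rt in (("INT8", int8_rt), ("F8_E4M3", f8_rt)):
    out = forward(x, [rt(w) for w in weights])
    print(f"{name}: relative output error = {(out - ref).norm() / ref.norm():.2%}")
```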
There is also a raw speed argument. On GPUs with native FP8 tensor cores, the F8 path offers roughly four times the peak throughput of FP16:

| Format | Peak tensor-core throughput |
|---|---|
| FP16 | ~989 TFLOPS |
| F8_E4M3 | ~3958 TFLOPS (4x faster than FP16) |

At its core, this architecture is built for high-batch serving, where that compute headroom is what matters.
The core computation in every linear layer is a matrix multiply. For an input activation matrix $X$ and a weight matrix $W$, the output is:

$$Y = X W^{\top}$$
In BF16, both $X$ and $W$ are 16-bit values. The multiply-accumulate happens in FP32 accumulator registers on the GPU, and the output is written back in BF16. Memory bandwidth cost per element: 2 bytes for weights, 2 bytes for activations.
In F8_E4M3, the same operation runs differently. Weights in the quantized layers are stored as 8-bit values. Before the matrix multiply, each output channel is rescaled by a learned per-channel scale factor $s_c$ so that the full dynamic range of that channel's weights maps onto the F8_E4M3 grid as efficiently as possible:

$$\hat{W}_c = \operatorname{quant}_{\mathrm{F8}}\!\left(\frac{W_c}{s_c}\right)$$
At inference, the multiply-accumulate runs on the compressed weights, and the result is rescaled back:

$$Y_{:,c} = s_c \left( X \hat{W}_c^{\top} \right)$$
For activations, the scale is not precomputed. It is derived token by token at runtime. For each token vector $x_t$, the scale is:

$$s_t = \frac{\max_j |x_{t,j}|}{448}$$

where 448 is the largest finite value representable in F8_E4M3.
The activation is quantized, the matrix multiply executes, and the result is dequantized before passing to the next operation. All of this happens within the same kernel; from the outside, it is invisible.
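As a rough illustration of that kernel's arithmetic, here is a plain-PyTorch simulation. It uses a max-abs scale as a stand-in for the learned per-channel scale described above, and emulates the fused FP8 tensor-core matmul in FP32:

```python
# Sketch (illustrative, not the served kernel): per-channel weight scaling
# plus dynamic per-token activation scaling around an F8_E4M3 matmul.
import torch

F8_MAX = 448.0  # largest finite F8_E4M3 value

def quantize_weights(w: torch.Tensor):
    # One scale per output channel (row of W), mapping that row's
    # dynamic range onto the F8 grid.
    s_c = w.abs().amax(dim=1, keepdim=True) / F8_MAX
    w_q = (w / s_c).to(torch.float8_e4m3fn)
    return w_q, s_c

def quantized_linear(x: torch.Tensor, w_q, s_c):
    # Per-token scale, derived at runtime from each token vector.
    s_t = x.abs().amax(dim=-1, keepdim=True) / F8_MAX
    x_q = (x / s_t).to(torch.float8_e4m3fn)
    # Real kernels run this on FP8 tensor cores; emulate in FP32 here.
    y = x_q.to(torch.float32) @ w_q.to(torch.float32).T
    return y * s_t * s_c.T  # dequantize: undo both scales

x = torch.randn(4, 64)           # 4 tokens, hidden size 64 (toy sizes)
w = torch.randn(128, 64) * 0.02  # a toy linear layer's weights
w_q, s_c = quantize_weights(w)
err = (quantized_linear(x, w_q, s_c) - x @ w.T).abs().max()
print(f"max abs error vs full-precision reference: {err:.4f}")
```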
What this means for throughput. GPU memory bandwidth is the primary bottleneck for autoregressive inference. At BF16, loading a weight matrix costs 2 bytes per parameter. At F8_E4M3, it costs 1 byte. The matrix multiply itself runs on the same tensor cores, but the time spent moving data from VRAM to compute units is halved. For large batch serving where compute is the bottleneck, modern GPUs also expose native F8 tensor core paths with higher theoretical throughput than BF16.
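A back-of-envelope version of the bandwidth argument, where both inputs are assumptions for illustration (roughly 3B active parameters per token for an A3B-style MoE, and ~3.35 TB/s of HBM bandwidth for a current datacenter GPU):

```python
# Rough decode-throughput ceiling from weight streaming alone.
# ACTIVE_PARAMS and HBM_BYTES_PER_S are illustrative assumptions,
# not measured properties of this model or any specific GPU.
ACTIVE_PARAMS = 3e9
HBM_BYTES_PER_S = 3.35e12

for fmt, bytes_per_param in (("BF16", 2), ("F8_E4M3", 1)):
    tokens_per_s = HBM_BYTES_PER_S / (ACTIVE_PARAMS * bytes_per_param)
    print(f"{fmt}: <= {tokens_per_s:,.0f} tokens/s per sequence (bandwidth bound)")
```

Halving bytes per parameter doubles the bandwidth-bound ceiling, exactly as the paragraph above describes.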
Precision Mapping Across the Architecture
Through our own layer-by-layer profiling of activation distributions, routing sensitivity, and accumulated rounding error across the full architecture, we identified exactly which components can absorb 8-bit compression without behavioral change.
Quantized to F8_E4M3
All standard linear projections within the transformer blocks: Q, K, V, and output projections in attention, and the up, gate, and down projections in the routed expert MLPs. These layers represent the overwhelming majority of parameter count and memory bandwidth in the model.
Preserved at BF16
| Component | Reason |
|---|---|
| Visual encoder | Vision features have distributions that are structurally unlike language activations. Compressing them introduces grounding errors that propagate into cross-modal attention. |
| Gated DeltaNet / linear attention | Recurrent state is carried forward across every token in the sequence. Rounding errors here do not stay local. They accumulate. |
| MoE router gates | Routing decisions are discrete. A small numerical error can send a token to the wrong expert entirely, with effects that are not recoverable downstream. |
| Shared expert gate | The gate controls whether the shared expert fires at all. Same sensitivity as the router, applied every forward pass. |
| Shared expert MLP | Unlike routed experts, this layer is active for every token without exception. Its contribution compounds across the full sequence. |
| Token embeddings | A lookup table. Quantizing it saves almost nothing and introduces a fixed error floor on every single token representation before any computation begins. |
| Language model head | The final projection onto vocabulary logits. Precision here determines the shape of the output distribution. Errors at this layer affect sampling, greedy decoding, and low-probability token generation. |
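One way to verify this precision mapping yourself is to read the safetensors headers of a downloaded checkpoint, which record every tensor's dtype. The local path below is a placeholder:

```python
# Count tensor dtypes straight from the safetensors headers
# (format: 8-byte little-endian header length, then a JSON header).
import json
import struct
from collections import Counter
from glob import glob

counts = Counter()
for shard in glob("axe-superveloce-37b/*.safetensors"):  # placeholder path
    with open(shard, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    for name, info in header.items():
        if name != "__metadata__":
            counts[info["dtype"]] += 1

print(counts)  # expect mostly F8_E4M3, with BF16 on the preserved components
```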
Memory and KV Cache
Every quantized weight drops from 2 bytes to 1 byte. For the layers that are quantized, this is a direct 2x reduction in the memory required to hold the model.
The KV cache savings compound on top. During inference, every processed token writes a key vector and a value vector into a cache that persists for the duration of the request. The size of that cache is:

$$\text{cache size} = 2 \cdot L \cdot H \cdot d \cdot T \cdot b$$
Where $L$ is the number of layers, $H$ is the number of KV heads, $d$ is the head dimension, $T$ is the sequence length, and $b$ is bytes per element. Halving $b$ from 2 (BF16) to 1 (F8) halves the KV cache at every sequence length. At 32K tokens, this frees several gigabytes per active request. That headroom goes directly toward concurrent capacity. Same hardware, more users.
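Plugging in illustrative numbers (the dimensions below are assumptions for the sake of arithmetic, not this model's actual config):

```python
# KV-cache size from the formula above; the leading 2 counts keys and values.
L, H, d = 48, 8, 128      # assumed layers, KV heads, head dim (illustrative)
T = 32_768                # 32K-token sequence

for fmt, b in (("BF16", 2), ("F8", 1)):
    gib = 2 * L * H * d * T * b / 1024**3
    print(f"{fmt} (b={b}): {gib:.1f} GiB per 32K-token request")
```

With these dimensions, the cache drops from 6.0 GiB to 3.0 GiB per 32K-token request.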
Benchmarks
Base model: Qwen/Qwen3.6-35B-A3B. All evaluations run at 0-shot using lm-evaluation-harness and lighteval, served with vLLM under --language-model-only.
| Category | Benchmark | Qwen3.6-35B-A3B | Axe Superveloce 37B | Recovery |
|---|---|---|---|---|
| Reasoning | GSM8K-Platinum (0-shot) | 94.98 | 95.12 | 100.1% |
| | MMLU-Pro (0-shot) | 85.65 | 85.65 | 100.0% |
| | Math 500 (0-shot) | 84.93 | 84.33 | 99.3% |
| | AIME 25 (0-shot) | 91.25 | 91.25 | 100.0% |
| | GPQA Diamond (0-shot) | 83.00 | 83.16 | 100.2% |
| Instruction Following | IFEval prompt-level strict (0-shot) | 91.00 | 90.45 | 99.4% |
| | IFEval inst-level strict (0-shot) | 93.69 | 93.29 | 99.6% |
| Coding | LiveCodeBench v6 (0-shot) | 75.43 | 76.38 | 101.3% |
On five of eight benchmarks, Axe Superveloce matches or exceeds the base model score. The compressed model outperforms its uncompressed counterpart on coding, a result consistent with per-channel weight scaling producing a tighter effective dynamic range on the neurons most active during code generation tasks.
Deployment via vLLM
Axe Superveloce is fully compatible with vLLM and loads natively without additional configuration.
Text only -- skip the vision encoder to free VRAM for additional KV cache:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3 --language-model-only
```

Multimodal -- full vision and language support:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3
```

Tool use:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

Speculative decoding via Multi-Token Prediction:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```
Send requests using the OpenAI-compatible endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-superveloce-37b",
    messages=messages,
)
print(response.choices[0].message.content)
```
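When serving in the multimodal mode, the same endpoint accepts image content parts, mirroring the curl examples earlier. This reuses the `client` from the snippet above:

```python
# Multimodal request against the same OpenAI-compatible endpoint.
response = client.chat.completions.create(
    model="srswti/axe-superveloce-37b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```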
Developed by SRSWTI Inc., building the world's fastest retrieval and inference engines.