Instructions for using srswti/axe-superveloce-37b with libraries and local apps. Quickstart snippets for each option follow.
- Libraries
- Transformers
How to use srswti/axe-superveloce-37b with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="srswti/axe-superveloce-37b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("srswti/axe-superveloce-37b")
model = AutoModelForImageTextToText.from_pretrained("srswti/axe-superveloce-37b")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Local Apps
- vLLM
How to use srswti/axe-superveloce-37b with vLLM:
Install from pip and serve the model:
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "srswti/axe-superveloce-37b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "srswti/axe-superveloce-37b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
- SGLang
How to use srswti/axe-superveloce-37b with SGLang:
Install from pip and serve the model:
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "srswti/axe-superveloce-37b" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "srswti/axe-superveloce-37b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images:
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "srswti/axe-superveloce-37b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "srswti/axe-superveloce-37b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use srswti/axe-superveloce-37b with Docker Model Runner:
```bash
docker model run hf.co/srswti/axe-superveloce-37b
```
A 37-billion-parameter mixture-of-experts model, built to serve state-of-the-art reasoning, instruction following, and code generation at a fraction of the memory cost of its base model.
The standard approach of quantizing everything uniformly trades correctness for simplicity. We take the opposite position, and a step further: compress aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.
Why F8_E4M3 over simple INT8
INT8 spaces its 256 representable values evenly across the number line. Every step is the same size, regardless of where the weights actually live:
```
INT8 -- uniform spacing, equal steps everywhere

        <- large gap ->    <- large gap ->
-128   -64   -32   -1    0    1   32   64   128
  |     |     |    |     |    |    |    |     |
```
F8_E4M3 spends its representable values where neural network weights actually cluster -- densely packed near zero, thinning out toward the extremes:
```
F8_E4M3 -- non-uniform spacing, dense near zero

                many fine steps here
                     vvvvvvvvv
-448    -4    -1  -0.25   0   0.25   1     4    448
  |      |     |  |||||   |   |||||  |     |     |
                ^^^^^^^^^^^^^^^^^^^
               most weights live here
```
The result: F8_E4M3 represents typical weight distributions with smaller rounding error than INT8 at the same bit width -- which is why FP8 compression consistently loses less accuracy than INT8 compression.
Say you have a weight with value 0.0317. You need to store it in 8 bits.
With INT8, your available grid points near zero are:
```
...  -2/128   -1/128      0      1/128    2/128  ...
... -0.0156  -0.0078      0     0.0078   0.0156  ...
```
The closest representable value to 0.0317 is 0.0313 -- off by 0.0004. Fine.
But now say another weight is 0.0021. The nearest INT8 grid points are 0 and 0.0078, and rounding snaps it to 0 -- a 100% relative error that erases the weight entirely, because the grid steps are too coarse near zero.
With F8_E4M3, the grid near zero has many more tick marks packed in. You can
represent 0.0021 much more faithfully because the format was designed knowing
that most weights live exactly there.
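To make this concrete, here is a minimal sketch that round-trips both weights through each grid. It assumes PyTorch 2.1+ for the `float8_e4m3fn` dtype, and uses the 1/128 step from the example above as a stand-in for INT8's grid; it is an illustration, not the production quantizer.

```python
# Sketch: rounding error of INT8 vs F8_E4M3 on the two example weights.
# Assumes PyTorch >= 2.1 (torch.float8_e4m3fn) and the 1/128 INT8 step
# from the worked example above.
import torch

def int8_round(w: torch.Tensor, step: float = 1 / 128) -> torch.Tensor:
    # Snap to the uniform INT8 grid: multiples of `step`, clamped to [-128, 127].
    return (w / step).round().clamp(-128, 127) * step

def f8_round(w: torch.Tensor) -> torch.Tensor:
    # Round-trip through the non-uniform F8_E4M3 grid.
    return w.to(torch.float8_e4m3fn).to(torch.float32)

for w in (0.0317, 0.0021):
    t = torch.tensor(w)
    q_int8, q_f8 = int8_round(t), f8_round(t)
    print(f"w={w:.4f}  int8 -> {q_int8.item():.4f} (err {abs(q_int8 - t).item():.4f})"
          f"  f8_e4m3 -> {q_f8.item():.6f} (err {abs(q_f8 - t).item():.6f})")
```

For 0.0317 both grids land on 0.03125, but for 0.0021 INT8 rounds to 0 while F8_E4M3 lands on a subnormal value near 0.00195, an order of magnitude less error.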
When you round a weight to the nearest grid point, that rounding error flows forward through every matrix multiply that weight participates in. If most of your weights have small rounding errors, the accumulated output error across a 96-layer model stays small. INT8's uniform grid burns precision on ranges the model never uses. F8_E4M3 concentrates precision exactly where the model needs it.
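The accumulation effect is easy to simulate. The sketch below uses toy dimensions and a synthetic weight distribution with a few outliers (nothing measured from this model); the outliers stretch INT8's per-tensor scale, which is where the uniform grid loses:

```python
# Toy simulation of rounding-error accumulation across stacked matmuls.
# Dimensions and weight statistics are illustrative assumptions only.
import torch

torch.manual_seed(0)
depth, dim = 16, 256
weights = []
for _ in range(depth):
    w = torch.randn(dim, dim) * 0.05
    w.view(-1)[::5000] = 2.5  # a few outlier weights, as real LLM layers exhibit
    weights.append(w)

def int8_rt(w):
    # Uniform grid with a per-tensor max-abs scale: outliers stretch the step.
    s = w.abs().max() / 127
    return (w / s).round().clamp(-128, 127) * s

def f8_rt(w):
    # Non-uniform F8_E4M3 grid: fine steps near zero survive the outliers.
    return w.to(torch.float8_e4m3fn).to(torch.float32)

def forward(x, ws):
    for w in ws:
        x = x @ w.T
    return x

x = torch.randn(8, dim)
ref = forward(x, weights)
for name, rt in (("INT8", int8_rt), ("F8_E4M3", f8_rt)):
    out = forward(x, [rt(w) for w in weights])
    print(f"{name}: relative output error = {(out - ref).norm() / ref.norm():.2%}")
```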
There is also a raw speed argument. On GPUs with native FP8 tensor cores, the F8 path offers roughly four times the peak throughput of FP16:

| Format | Peak tensor-core throughput |
|---|---|
| FP16 | ~989 TFLOPS |
| F8_E4M3 | ~3958 TFLOPS (4x faster than FP16) |

At its core, this architecture is built for high-batch serving, where that compute headroom is what matters.
The core computation in every linear layer is a matrix multiply. For an input activation matrix $X$ and a weight matrix $W$, the output is:

$$Y = X W^{\top}$$
In BF16, both $X$ and $W$ are 16-bit values. The multiply-accumulate happens in FP32 accumulator registers on the GPU, and the output is written back in BF16. Memory bandwidth cost per element: 2 bytes for weights, 2 bytes for activations.
In F8_E4M3, the same operation runs differently. Weights in the quantized layers are stored as 8-bit values. Before the matrix multiply, each output channel is rescaled by a learned per-channel scale factor $s_c$ so that the full dynamic range of that channel's weights maps onto the F8_E4M3 grid as efficiently as possible:

$$\hat{W}_c = \operatorname{quant}_{\mathrm{F8}}\!\left(\frac{W_c}{s_c}\right)$$
At inference, the multiply-accumulate runs on the compressed weights, and the result is rescaled back:

$$Y_{:,c} = s_c \left( X \hat{W}_c^{\top} \right)$$
For activations, the scale is not precomputed. It is derived token by token at runtime. For each token vector $x_t$, the scale is:

$$s_t = \frac{\max_j |x_{t,j}|}{448}$$

where 448 is the largest finite value representable in F8_E4M3.
The activation is quantized, the matrix multiply executes, and the result is dequantized before passing to the next operation. All of this happens within the same kernel; from the outside, it is invisible.
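As a rough illustration of that kernel's arithmetic, here is a plain-PyTorch simulation. It uses a max-abs scale as a stand-in for the learned per-channel scale described above, and emulates the fused FP8 tensor-core matmul in FP32:

```python
# Sketch (illustrative, not the served kernel): per-channel weight scaling
# plus dynamic per-token activation scaling around an F8_E4M3 matmul.
import torch

F8_MAX = 448.0  # largest finite F8_E4M3 value

def quantize_weights(w: torch.Tensor):
    # One scale per output channel (row of W), mapping that row's
    # dynamic range onto the F8 grid.
    s_c = w.abs().amax(dim=1, keepdim=True) / F8_MAX
    w_q = (w / s_c).to(torch.float8_e4m3fn)
    return w_q, s_c

def quantized_linear(x: torch.Tensor, w_q, s_c):
    # Per-token scale, derived at runtime from each token vector.
    s_t = x.abs().amax(dim=-1, keepdim=True) / F8_MAX
    x_q = (x / s_t).to(torch.float8_e4m3fn)
    # Real kernels run this on FP8 tensor cores; emulate in FP32 here.
    y = x_q.to(torch.float32) @ w_q.to(torch.float32).T
    return y * s_t * s_c.T  # dequantize: undo both scales

x = torch.randn(4, 64)           # 4 tokens, hidden size 64 (toy sizes)
w = torch.randn(128, 64) * 0.02  # a toy linear layer's weights
w_q, s_c = quantize_weights(w)
err = (quantized_linear(x, w_q, s_c) - x @ w.T).abs().max()
print(f"max abs error vs full-precision reference: {err:.4f}")
```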
What this means for throughput. GPU memory bandwidth is the primary bottleneck for autoregressive inference. At BF16, loading a weight matrix costs 2 bytes per parameter. At F8_E4M3, it costs 1 byte. The matrix multiply itself runs on the same tensor cores, but the time spent moving data from VRAM to compute units is halved. For large batch serving where compute is the bottleneck, modern GPUs also expose native F8 tensor core paths with higher theoretical throughput than BF16.
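A back-of-envelope version of the bandwidth argument, where both inputs are assumptions for illustration (roughly 3B active parameters per token for an A3B-style MoE, and ~3.35 TB/s of HBM bandwidth for a current datacenter GPU):

```python
# Rough decode-throughput ceiling from weight streaming alone.
# ACTIVE_PARAMS and HBM_BYTES_PER_S are illustrative assumptions,
# not measured properties of this model or any specific GPU.
ACTIVE_PARAMS = 3e9
HBM_BYTES_PER_S = 3.35e12

for fmt, bytes_per_param in (("BF16", 2), ("F8_E4M3", 1)):
    tokens_per_s = HBM_BYTES_PER_S / (ACTIVE_PARAMS * bytes_per_param)
    print(f"{fmt}: <= {tokens_per_s:,.0f} tokens/s per sequence (bandwidth bound)")
```

Halving bytes per parameter doubles the bandwidth-bound ceiling, exactly as the paragraph above describes.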
Precision Mapping Across the Architecture
Through our own layer-by-layer profiling of activation distributions, routing sensitivity, and accumulated rounding error across the full architecture, we identified exactly which components can absorb 8-bit compression without behavioral change.
Quantized to F8_E4M3
All standard linear projections within the transformer blocks: Q, K, V, and output projections in attention, and the up, gate, and down projections in the routed expert MLPs. These layers represent the overwhelming majority of parameter count and memory bandwidth in the model.
Preserved at BF16
| Component | Reason |
|---|---|
| Visual encoder | Vision features have distributions that are structurally unlike language activations. Compressing them introduces grounding errors that propagate into cross-modal attention. |
| Gated DeltaNet / linear attention | Recurrent state is carried forward across every token in the sequence. Rounding errors here do not stay local. They accumulate. |
| MoE router gates | Routing decisions are discrete. A small numerical error can send a token to the wrong expert entirely, with effects that are not recoverable downstream. |
| Shared expert gate | The gate controls whether the shared expert fires at all. Same sensitivity as the router, applied every forward pass. |
| Shared expert MLP | Unlike routed experts, this layer is active for every token without exception. Its contribution compounds across the full sequence. |
| Token embeddings | A lookup table. Quantizing it saves almost nothing and introduces a fixed error floor on every single token representation before any computation begins. |
| Language model head | The final projection onto vocabulary logits. Precision here determines the shape of the output distribution. Errors at this layer affect sampling, greedy decoding, and low-probability token generation. |
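One way to verify this precision mapping yourself is to read the safetensors headers of a downloaded checkpoint, which record every tensor's dtype. The local path below is a placeholder:

```python
# Count tensor dtypes straight from the safetensors headers
# (format: 8-byte little-endian header length, then a JSON header).
import json
import struct
from collections import Counter
from glob import glob

counts = Counter()
for shard in glob("axe-superveloce-37b/*.safetensors"):  # placeholder path
    with open(shard, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    for name, info in header.items():
        if name != "__metadata__":
            counts[info["dtype"]] += 1

print(counts)  # expect mostly F8_E4M3, with BF16 on the preserved components
```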
Memory and KV Cache
Every quantized weight drops from 2 bytes to 1 byte. For the layers that are quantized, this is a direct 2x reduction in the memory required to hold the model.
The KV cache savings compound on top. During inference, every processed token writes a key vector and a value vector into a cache that persists for the duration of the request. The size of that cache is:

$$\text{cache size} = 2 \cdot L \cdot H \cdot d \cdot T \cdot b$$
Where $L$ is the number of layers, $H$ is the number of KV heads, $d$ is the head dimension, $T$ is the sequence length, and $b$ is bytes per element. Halving $b$ from 2 (BF16) to 1 (F8) halves the KV cache at every sequence length. At 32K tokens, this frees several gigabytes per active request. That headroom goes directly toward concurrent capacity. Same hardware, more users.
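Plugging in illustrative numbers (the dimensions below are assumptions for the sake of arithmetic, not this model's actual config):

```python
# KV-cache size from the formula above; the leading 2 counts keys and values.
L, H, d = 48, 8, 128      # assumed layers, KV heads, head dim (illustrative)
T = 32_768                # 32K-token sequence

for fmt, b in (("BF16", 2), ("F8", 1)):
    gib = 2 * L * H * d * T * b / 1024**3
    print(f"{fmt} (b={b}): {gib:.1f} GiB per 32K-token request")
```

With these dimensions, the cache drops from 6.0 GiB to 3.0 GiB per 32K-token request.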
Benchmarks
Base model: Qwen/Qwen3.6-35B-A3B. All evaluations run at 0-shot using lm-evaluation-harness and lighteval, served with vLLM under --language-model-only.
| Category | Benchmark | Qwen3.6-35B-A3B | Axe Superveloce 37B | Recovery |
|---|---|---|---|---|
| Reasoning | GSM8K-Platinum (0-shot) | 94.98 | 95.12 | 100.1% |
| | MMLU-Pro (0-shot) | 85.65 | 85.65 | 100.0% |
| | Math 500 (0-shot) | 84.93 | 84.33 | 99.3% |
| | AIME 25 (0-shot) | 91.25 | 91.25 | 100.0% |
| | GPQA Diamond (0-shot) | 83.00 | 83.16 | 100.2% |
| Instruction Following | IFEval prompt-level strict (0-shot) | 91.00 | 90.45 | 99.4% |
| | IFEval inst-level strict (0-shot) | 93.69 | 93.29 | 99.6% |
| Coding | LiveCodeBench v6 (0-shot) | 75.43 | 76.38 | 101.3% |
On five of eight benchmarks, Axe Superveloce matches or exceeds the base model score. The compressed model outperforms its uncompressed counterpart on coding, a result consistent with per-channel weight scaling producing a tighter effective dynamic range on the neurons most active during code generation tasks.
Deployment via vLLM
Axe Superveloce is fully compatible with vLLM and loads natively without additional configuration.
Text only -- skip the vision encoder to free VRAM for additional KV cache:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3 --language-model-only
```

Multimodal -- full vision and language support:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3
```

Tool use:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

Speculative decoding via Multi-Token Prediction:

```bash
vllm serve srswti/axe-superveloce-37b --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```
Send requests using the OpenAI-compatible endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-superveloce-37b",
    messages=messages,
)
print(response.choices[0].message.content)
```
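When serving in the multimodal mode, the same endpoint accepts image content parts, mirroring the curl examples earlier. This reuses the `client` from the snippet above:

```python
# Multimodal request against the same OpenAI-compatible endpoint.
response = client.chat.completions.create(
    model="srswti/axe-superveloce-37b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```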
Developed by SRSWTI Inc., building the world's fastest retrieval and inference engines.