# step-3.5-flash-imatrix-gguf

This repo contains GGUF weights for stepfun-ai/Step-3.5-Flash, quantized with an importance matrix (imatrix). Tested on Strix Halo.

This imatrix version fits into ~104 GiB of VRAM/RAM, saving roughly 7 GiB compared to the standard Q4_K_M while providing slightly better output quality (lower perplexity).
## Performance & Efficiency Benchmark (Strix Halo)
This model was tested on the AMD Strix Halo platform (Debian, Kernel 6.18.5) using llama.cpp 7966 (8872ad212) with two different backends: ROCm and Vulkan.
Key Findings:
- ROCm is more efficient: For a full benchmark run, ROCm was 4.7x faster and consumed 65% less energy than Vulkan.
- Prompt Processing: ROCm dominates in prompt ingestion speed, reaching over 350 t/s for short contexts and maintaining much higher throughput as context grows.
- Token Generation: Vulkan shows slightly higher raw generation speed (t/s) at small contexts, but at a significantly higher energy cost; it stops being efficient at contexts >= 8k.
- Context Scaling: The model remains usable and was tested up to 131k context, though energy costs grow roughly exponentially on the Vulkan backend versus a more linear progression on ROCm.
| Backend | Total Time | Total Energy |
|---|---|---|
| ROCm | 31m 14s | 60.63 Wh |
| Vulkan | 149m 03s | 175.47 Wh |
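For reference, the headline figures follow directly from these totals (times converted to seconds):

$$
\frac{149\,\text{m}\,03\,\text{s}}{31\,\text{m}\,14\,\text{s}} = \frac{8943\ \text{s}}{1874\ \text{s}} \approx 4.77,
\qquad
1 - \frac{60.63\ \text{Wh}}{175.47\ \text{Wh}} \approx 0.654
$$

consistent with the ~4.7x speedup and ~65% energy saving quoted above.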
Full performance charts (Power, T/s, Energy) are available in the image below.
## Memory Requirements & Comparison

| Quantization | Size (binary, GiB) | Size (decimal, GB) | Perplexity (PPL) |
|---|---|---|---|
| **Q4_K_S (imatrix, this version)** | 104 GiB | 111 GB | 2.4130 |
| Q4_K_M (standard) | 111 GiB | 119 GB | 2.4177 |
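The two size columns describe the same files; GiB is the binary unit (2^30 bytes), GB the decimal one (10^9 bytes):

$$
1\ \text{GiB} = 2^{30}\ \text{B} \approx 1.074 \times 10^{9}\ \text{B} \approx 1.074\ \text{GB}
$$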
## Quantization Details

- Method: `llama-quantize`
- llama.cpp version: 7966 (8872ad212)
- Original model precision: BF16
- imatrix calibration data: wikitext-103-raw-v1
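For anyone wanting to reproduce the quantization, the pipeline looks roughly like the sketch below. The file names (`step-3.5-flash-bf16.gguf`, `wiki.train.raw`, `imatrix.dat`) are placeholders rather than the exact names used here, and it assumes a llama.cpp build that includes the `llama-imatrix` and `llama-quantize` tools:

```bash
# 1) Compute an importance matrix from the BF16 model over the calibration
#    text (wiki.train.raw: a plain-text dump of wikitext-103-raw-v1).
./llama-imatrix -m step-3.5-flash-bf16.gguf -f wiki.train.raw -o imatrix.dat -ngl 99

# 2) Quantize to Q4_K_S, guided by the importance matrix.
./llama-quantize --imatrix imatrix.dat step-3.5-flash-bf16.gguf step-3.5-flash-q4_k_s.gguf Q4_K_S
```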
## Files Provided

| File | Quant Method | Size | Description |
|---|---|---|---|
| `step-3.5-flash-q4_k_s-0000{1..3}-of-00003.gguf` | Q4_K_S | 104 GiB | High quality, built with imatrix; a great fit for Strix Halo |
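To fetch all three shards at once, something like the following should work (a sketch assuming the `huggingface_hub` CLI is installed; the target directory is arbitrary):

```bash
# Download every shard of this repo into ./step-3.5-flash
huggingface-cli download mixer3d/step-3.5-flash-imatrix-gguf --local-dir ./step-3.5-flash
```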
## Usage

You can run these models with llama.cpp. Pointing `llama-server` at the first shard is enough; the remaining split files are picked up automatically:

```bash
./llama-server -m step-3.5-flash-q4_k_s-00001-of-00003.gguf --no-mmap -ngl 99 --port 8080 -c 0 -fa 1 --jinja
```
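Once running, the server exposes an OpenAI-compatible HTTP API on the chosen port (`-c 0` uses the model's full trained context length, `-ngl 99` offloads all layers to the GPU, and `--jinja` enables the model's built-in chat template). A quick smoke test, assuming the local setup above:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "In one sentence, what is a GGUF file?"}
    ]
  }'
```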