step-3.5-flash-imatrix-gguf

This repo contains imatrix-calibrated GGUF weights for stepfun-ai/Step-3.5-Flash (197B parameters, step35 architecture), tested on AMD Strix Halo.

This imatrix version fits in ~104 GiB of VRAM/RAM, saving roughly 7 GiB over the standard Q4_K_M while delivering slightly better output quality (lower perplexity).

Performance & Efficiency Benchmark (Strix Halo)

This model was tested on the AMD Strix Halo platform (Debian, Kernel 6.18.5) using llama.cpp 7966 (8872ad212) with two different backends: ROCm and Vulkan.

Key Findings:

  • ROCm is more efficient: For a full benchmark run, ROCm was 4.7x faster and consumed 65% less energy than Vulkan.
  • Prompt Processing: ROCm dominates in prompt ingestion speed, reaching over 350 t/s for short contexts and maintaining much higher throughput as context grows.
  • Token Generation: Vulkan shows slightly higher raw generation speed (t/s) at small contexts, but at a significantly higher energy cost; it is no longer efficient at contexts of 8k and above.
  • Context Scaling: The model remains usable and was tested up to 131k context, though energy costs scale exponentially on the Vulkan backend compared to a more linear progression on ROCm.

Backend   Total Time   Total Energy
ROCm      31m 14s      60.63 Wh
Vulkan    149m 03s     175.47 Wh
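
These totals come from the author's full benchmark run. A comparable sweep can be driven with llama.cpp's bundled llama-bench tool; a minimal sketch, with parameter values that are illustrative rather than the exact methodology used here:

# Sketch of a llama-bench sweep (values are illustrative).
# -p: prompt sizes to test, -n: tokens to generate per test,
# -ngl: layers offloaded to GPU, -r: repetitions per data point.
./llama-bench -m step-3.5-flash-q4_k_s-00001-of-00003.gguf -ngl 99 -fa 1 -p 512,4096,16384 -n 128 -r 3

Running the same sweep once with a ROCm build and once with a Vulkan build of llama.cpp reproduces the backend comparison above.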

Full performance charts (power, t/s, energy) are available in the image below.

[Benchmark charts]

Memory Requirements & Comparison

Quantization                     Size (Binary GiB)   Size (Decimal GB)   PPL (Perplexity)
Q4_K_S (imatrix, THIS VERSION)   104 GiB             111 GB              2.4130
Q4_K_M (standard)                111 GiB             119 GB              2.4177
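
Perplexity figures like these are commonly produced with llama.cpp's llama-perplexity tool. A minimal sketch; the evaluation file name is an assumption, as the card does not state which corpus was used for the comparison:

# Sketch of a perplexity measurement (evaluation file is illustrative).
./llama-perplexity -m step-3.5-flash-q4_k_s-00001-of-00003.gguf -f wiki.test.raw -ngl 99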

Quantization Details

  • Method: llama-quantize
  • Llama.cpp Version: 7966 (8872ad212)
  • Original Model Precision: BF16
  • imatrix calibration dataset: wikitext-103-raw-v1
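
For reference, imatrix quantization in llama.cpp is a two-step flow: compute an importance matrix from calibration text, then quantize using it. A minimal sketch with illustrative file names (the BF16 source GGUF and calibration text path are assumptions):

# 1) Compute an importance matrix from the calibration text.
./llama-imatrix -m step-3.5-flash-bf16.gguf -f wikitext-103-raw-v1.txt -o imatrix.dat -ngl 99
# 2) Quantize to Q4_K_S using that matrix.
./llama-quantize --imatrix imatrix.dat step-3.5-flash-bf16.gguf step-3.5-flash-q4_k_s.gguf Q4_K_S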

Files Provided

File                                             Quant Method   Size      Description
step-3.5-flash-q4_k_s-0000{1..3}-of-00003.gguf   Q4_K_S         104 GiB   High quality, imatrix-calibrated; well suited to Strix Halo
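
One way to fetch all three shards is the Hugging Face CLI; a minimal sketch (assumes huggingface_hub is installed, target directory is illustrative):

# Download the full repo, including all three GGUF shards.
huggingface-cli download mixer3d/step-3.5-flash-imatrix-gguf --local-dir ./step-3.5-flash

llama.cpp loads the remaining shards automatically when pointed at the first split file.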

Usage

You can use these models with llama.cpp:

./llama-server -m step-3.5-flash-q4_k_s-00001-of-00003.gguf --no-mmap -ngl 99 --port 8080 -c 0 -fa 1 --jinja
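
Once the server is running, it exposes an OpenAI-compatible HTTP API. A minimal request sketch against the port used above:

# Send a chat completion request to the local llama-server instance.
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'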