step-3.5-flash-imatrix-gguf
This repo contains GGUF weights for stepfun-ai/Step-3.5-Flash, quantized with an importance matrix (imatrix). Tested on Strix Halo.
This imatrix version fits into ~104 GiB of VRAM/RAM, saving roughly 7 GiB compared with the standard Q4_K_M, while achieving a slightly lower perplexity (2.4130 vs. 2.4177).
Performance & Efficiency Benchmark (Strix Halo)
This model was tested on the AMD Strix Halo platform (Debian, Kernel 6.18.5) using llama.cpp 7966 (8872ad212) with two different backends: ROCm and Vulkan.
Key Findings:
- ROCm is more efficient: For a full benchmark run, ROCm was roughly 4.8x faster (31m 14s vs. 149m 03s) and consumed 65% less energy than Vulkan.
- Prompt Processing: ROCm dominates in prompt ingestion speed, reaching over 350 t/s for short contexts and maintaining much higher throughput as context grows.
- Token Generation: Vulkan shows slightly higher raw generation speeds (t/s) for small contexts, but at a significantly higher energy cost. It is not efficient at context sizes of 8k and above.
- Context Scaling: The model was tested and remains usable up to a 131k context, though energy costs scale exponentially on the Vulkan backend, compared with a more linear progression on ROCm.
| Backend | Total Time | Total Energy |
|---|---|---|
| ROCm | 31m 14s | 60.63 Wh |
| Vulkan | 149m 03s | 175.47 Wh |
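The headline ratios in the key findings can be recomputed from the totals in this table. The sketch below is only an illustration and is not part of the original benchmark tooling; the helper name `to_seconds` is hypothetical, and the average power is simply total energy divided by wall time.

```python
# Totals copied from the benchmark table above.
runs = {
    "ROCm": {"time": (31, 14), "energy_wh": 60.63},
    "Vulkan": {"time": (149, 3), "energy_wh": 175.47},
}


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a (minutes, seconds) pair to total seconds."""
    return minutes * 60 + seconds


rocm_s = to_seconds(*runs["ROCm"]["time"])
vulkan_s = to_seconds(*runs["Vulkan"]["time"])

# How many times longer the Vulkan run took than the ROCm run.
speedup = vulkan_s / rocm_s

# Fraction of energy saved by ROCm relative to Vulkan.
energy_saving = 1 - runs["ROCm"]["energy_wh"] / runs["Vulkan"]["energy_wh"]

# Average power draw over each run: Wh divided by hours.
avg_power_w = {
    name: run["energy_wh"] / (to_seconds(*run["time"]) / 3600)
    for name, run in runs.items()
}

print(f"ROCm speedup over Vulkan: {speedup:.2f}x")
print(f"ROCm energy saving: {energy_saving:.1%}")
for name, watts in avg_power_w.items():
    print(f"{name} average power: {watts:.1f} W")
```

With the table's numbers this works out to a speedup of about 4.77x and an energy saving of about 65%. The average-power figures also suggest that Vulkan's higher total energy comes from the much longer run time rather than from a higher power draw.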
Full performance charts (Power, T/s, Energy) are available in the image below.
Memory Requirements & Comparison
| Quantization | Size (Binary GiB) | Size (Decimal GB) | PPL (Perplexity) |
|---|---|---|---|
| Q4_K_S (imatrix, this version) | 104 GiB | 111 GB | 2.4130 |
| Q4_K_M (standard) | 111 GiB | 119 GB | 2.4177 |
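The two size columns differ only in unit: 1 GiB is 2^30 bytes and 1 GB is 10^9 bytes. As a rough check, the sketch below converts the GiB figures and computes the differences between the two quantizations. The GiB values in the table are rounded, so the converted numbers can be off by about 1 GB from the decimal column.

```python
GIB = 2**30  # bytes in one binary gibibyte
GB = 10**9   # bytes in one decimal gigabyte


def gib_to_gb(size_gib: float) -> float:
    """Convert binary GiB to decimal GB."""
    return size_gib * GIB / GB


# Sizes (GiB) and perplexities copied from the table above.
quants = {
    "Q4_K_S (imatrix)": {"gib": 104, "ppl": 2.4130},
    "Q4_K_M (standard)": {"gib": 111, "ppl": 2.4177},
}

for name, q in quants.items():
    print(f"{name}: {q['gib']} GiB = {gib_to_gb(q['gib']):.1f} GB, PPL {q['ppl']}")

saved_gib = quants["Q4_K_M (standard)"]["gib"] - quants["Q4_K_S (imatrix)"]["gib"]
ppl_delta = quants["Q4_K_M (standard)"]["ppl"] - quants["Q4_K_S (imatrix)"]["ppl"]
print(f"Saved: {saved_gib} GiB, PPL lower by {ppl_delta:.4f}")
```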
Quantization Details
- Method: llama-quantize
- llama.cpp Version: 7966 (8872ad212)
- Original Model Precision: BF16
- imatrix with wikitext-103-raw-v1
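The exact commands used for this quantization are not given here. As a rough sketch of the usual two-step llama.cpp workflow (compute an imatrix with `llama-imatrix`, then pass it to `llama-quantize`), the snippet below only builds and prints the command lines. All file names are hypothetical placeholders, and the flags should be checked against `--help` for the build in use.

```python
import shlex


def build_quantize_commands(
    bf16_model: str,
    calibration_text: str,
    imatrix_file: str,
    output_model: str,
    quant_type: str = "Q4_K_S",
) -> list[list[str]]:
    """Return the two llama.cpp commands for an imatrix quantization run."""
    # Step 1: collect importance statistics over the calibration text.
    imatrix_cmd = [
        "llama-imatrix",
        "-m", bf16_model,
        "-f", calibration_text,
        "-o", imatrix_file,
    ]
    # Step 2: quantize the BF16 model, weighting tensors with the imatrix.
    quantize_cmd = [
        "llama-quantize",
        "--imatrix", imatrix_file,
        bf16_model,
        output_model,
        quant_type,
    ]
    return [imatrix_cmd, quantize_cmd]


# All file names below are hypothetical placeholders.
commands = build_quantize_commands(
    bf16_model="step-3.5-flash-bf16.gguf",
    calibration_text="wiki.train.raw",
    imatrix_file="imatrix.dat",
    output_model="step-3.5-flash-q4_k_s.gguf",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Nothing is executed here; the printed lines can be reviewed and run by hand once the paths match the local setup.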
Files Provided
| File | Quant Method | Size | Description |
|---|---|---|---|
| step-3.5-flash-q4_k_s-0000{1..3}-of-00003.gguf | Q4_K_S | 104 GiB | High quality with imatrix; well suited to Strix Halo |
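The `0000{1..3}` in the file name is shell brace notation for three numbered shards. A small sketch that expands it into the individual file names follows; the helper `shard_names` is hypothetical and only mirrors the naming pattern shown in the table.

```python
def shard_names(prefix: str, total: int) -> list[str]:
    """Expand a split-GGUF prefix into its numbered shard file names."""
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


files = shard_names("step-3.5-flash-q4_k_s", 3)
for name in files:
    print(name)
```

All three files are expected to sit in the same directory; llama.cpp is then pointed at the first shard only.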
Usage
You can use these models with llama.cpp:
./llama-server -m step-3.5-flash-q4_k_s-00001-of-00003.gguf --no-mmap -ngl 99 --port 8080 -c 0 -fa 1 --jinja
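Once the server is up, it can be queried through its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `localhost` on the port given above; the function names and the `max_tokens` default are illustrative choices.

```python
import json
import urllib.request

# Assumes the llama-server command above is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the server and return the assistant's reply."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("What is the capital of France?"))
```

Building the request separately from sending it keeps the payload easy to inspect without a running server.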