Instructions to use notSnix/Step-3.7-Flash-MTP-Draft-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use notSnix/Step-3.7-Flash-MTP-Draft-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="notSnix/Step-3.7-Flash-MTP-Draft-GGUF", filename="Step-3.7-Flash-MTP-BF16.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use notSnix/Step-3.7-Flash-MTP-Draft-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
Use Docker
docker model run hf.co/notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use notSnix/Step-3.7-Flash-MTP-Draft-GGUF with Ollama:
ollama run hf.co/notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
- Unsloth Studio
How to use notSnix/Step-3.7-Flash-MTP-Draft-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for notSnix/Step-3.7-Flash-MTP-Draft-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for notSnix/Step-3.7-Flash-MTP-Draft-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for notSnix/Step-3.7-Flash-MTP-Draft-GGUF to start chatting
- Docker Model Runner
How to use notSnix/Step-3.7-Flash-MTP-Draft-GGUF with Docker Model Runner:
docker model run hf.co/notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
- Lemonade
How to use notSnix/Step-3.7-Flash-MTP-Draft-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Step-3.7-Flash-MTP-Draft-GGUF-Q4_K_M
List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:# Run inference directly in the terminal:
llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:# Run inference directly in the terminal:
./llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:# Run inference directly in the terminal:
./build/bin/llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Use Docker
docker model run hf.co/notSnix/Step-3.7-Flash-MTP-Draft-GGUF:Step 3.7 Flash MTP Draft GGUFs
These are companion MTP draft GGUFs for speculative decoding. They are not standalone chat models.
Use them with the full model repo:
notSnix/Step-3.7-Flash-Q4_K_M-GGUF
The draft GGUF is passed with --model-draft; the full model is passed with --model.
Files
| File | Size | SHA256 | Purpose |
|---|---|---|---|
Step-3.7-Flash-MTP-Q8_0.gguf |
3.5 GB | 017de8990140621b5b4af431448f20873fbf0b052f6c50d2afac15f45802a98d |
Recommended MTP draft |
Step-3.7-Flash-MTP-Q6_K.gguf |
2.7 GB | f41736e0dcce133d0dd0b81e14bd2965091e27dff306a28cec11ceb19fadbf46 |
Smaller Q6_K MTP draft |
Step-3.7-Flash-MTP-Q4_K_M.gguf |
2.0 GB | 44118cfe64f45b38127ad6fb626e16bd94ee5a827cb34aa83d9e6df3450aebaf |
Smaller MTP draft |
Step-3.7-Flash-MTP-BF16.gguf |
6.5 GB | fd811c81d14c786d314d8006655bba61971059abcfdfb6109ce83fd768f8b289 |
Experimental BF16 MTP draft |
Runtime
Current llama.cpp main supports Step MTP-tail draft loading natively. This was smoke-tested with clean llama.cpp commit d545a2a993849fcf3b752d85ae256fc9d6a9de79.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j
Usage
llama-server \
--model Step-3.7-Flash-Q4_K_M.gguf \
--model-draft Step-3.7-Flash-MTP-Q8_0.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 262144 \
--n-gpu-layers all \
--split-mode layer \
--parallel 1 \
--reasoning on \
--reasoning-format deepseek \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.60
Which Draft Should I Use?
Use Step-3.7-Flash-MTP-Q8_0.gguf first. It was the best local default in testing.
Use Step-3.7-Flash-MTP-Q6_K.gguf if you want a smaller draft file while staying above Q4.
Use Step-3.7-Flash-MTP-Q4_K_M.gguf if you want the smaller draft file.
Use Step-3.7-Flash-MTP-BF16.gguf for experimentation.
Checksums
sha256sum -c SHA256SUMS
Notes
- These files intentionally keep the upstream Step MTP tail-layer numbering (
blk.45,blk.46,blk.47). - They are companion speculative-decoding draft GGUFs, not full-model quants.
- The full Q4_K_M model is hosted separately so Hugging Face's GGUF widget does not display draft files as tiny full-model quantizations.
- This is a community GGUF conversion of the upstream Apache-2.0 model, not an official StepFun release.
- Downloads last month
- 1,237
4-bit
6-bit
8-bit
16-bit
Model tree for notSnix/Step-3.7-Flash-MTP-Draft-GGUF
Base model
stepfun-ai/Step-3.7-Flash
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF:# Run inference directly in the terminal: llama-cli -hf notSnix/Step-3.7-Flash-MTP-Draft-GGUF: