How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

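llm = Llama.from_pretrained(
	repo_id="tarruda/Step-3.7-Flash-GGUF",
	# Placeholder: set this to a GGUF file in the repo (a filename or glob pattern).
	filename="",
)
# The result is an OpenAI-style chat completion dict.
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
print(response["choices"][0]["message"]["content"])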

This is the same Step 3.7 Flash IQ4_XS quant published by stepfun: https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF/tree/main/IQ4_XS

The only change I made is replacing the embedded chat template with one that works around this issue.

I also added a preserve_thinking option to the chat template. It preserves thinking across user turns, which can improve the experience when prompt processing speed is a bottleneck.
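
As a rough sketch of how the option might be switched on per request: llama-server accepts a chat_template_kwargs field in the body of /v1/chat/completions and forwards it to the chat template. The helper name build_chat_payload is hypothetical, and passing preserve_thinking as a boolean is an assumption about how this template reads the option.

```python
import json


def build_chat_payload(messages, preserve_thinking=True):
    """Build a request body for llama-server's /v1/chat/completions endpoint."""
    return {
        "messages": messages,
        # chat_template_kwargs is forwarded to the chat template.
        # Assumption: the template expects a boolean named preserve_thinking.
        "chat_template_kwargs": {"preserve_thinking": preserve_thinking},
    }


payload = build_chat_payload(
    [{"role": "user", "content": "What is the capital of France?"}]
)
print(json.dumps(payload, indent=2))
```

The payload can be sent with any HTTP client. Setting preserve_thinking to False should restore the template's default handling of earlier reasoning, if the template treats the flag that way.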

This is the script I use to run this model:

#!/bin/sh -e

model=IQ4_XS/Step-3.7-Flash-IQ4_XS-00001-of-00003.gguf
mmproj=Step-3.7-Flash-mmproj-Q8_0.gguf
 
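# Context length per slot, and the number of parallel slots.
ctx=262144
parallel=1

# Total context passed to --ctx-size: one full $ctx window per slot.
ctx_size=$((ctx * parallel))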

reasoning_budget_message="...

Actually, I will stop now.

Let me provide the user with a comprehensive answer."

llama-server --no-mmap --no-warmup --model "$model" --mmproj "$mmproj" \
  --ctx-size $ctx_size -np $parallel -fa on --temp 1.0 --top-p 0.95 \
  --repeat-penalty 1.0 --presence-penalty 0.0 \
  --reasoning-budget-message "$reasoning_budget_message" \
  -ctxcp 4 --checkpoint-min-step 512 \
  --cache-ram 6144 \
  --host 0.0.0.0
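
Once the server is running, it can be queried through its OpenAI-compatible endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable on 127.0.0.1 at llama-server's default port 8080 (the script does not set --port), and that the response may carry the model's thinking in a reasoning_content field. The helper names chat_url, extract_reply, and ask are hypothetical.

```python
import json
import urllib.request


def chat_url(host="127.0.0.1", port=8080):
    """URL of the OpenAI-compatible chat endpoint (8080 is llama-server's default port)."""
    return f"http://{host}:{port}/v1/chat/completions"


def extract_reply(response):
    """Return (reasoning, answer) from a chat completion response dict.

    Assumption: reasoning, when the server separates it, is in
    message["reasoning_content"]; it is None when the field is absent.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def ask(prompt, host="127.0.0.1", port=8080, timeout=600):
    """Send a single user message and return (reasoning, answer)."""
    body = json.dumps(
        {"messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    request = urllib.request.Request(
        chat_url(host, port),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Example (requires the server to be running):
# reasoning, answer = ask("What is the capital of France?")
# print(answer)
```

The generous timeout is a guess for a large model with a long context; adjust it to your hardware.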

Format: GGUF
Model size: 197B params
Architecture: step35