Instructions to use tcclaviger/Step-3.7-Flash-240REAP-MXFP416 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use tcclaviger/Step-3.7-Flash-240REAP-MXFP416 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="tcclaviger/Step-3.7-Flash-240REAP-MXFP416", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("tcclaviger/Step-3.7-Flash-240REAP-MXFP416", trust_remote_code=True, device_map="auto")

Local Apps

vLLM

How to use tcclaviger/Step-3.7-Flash-240REAP-MXFP416 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tcclaviger/Step-3.7-Flash-240REAP-MXFP416"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tcclaviger/Step-3.7-Flash-240REAP-MXFP416",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/tcclaviger/Step-3.7-Flash-240REAP-MXFP416

SGLang

How to use tcclaviger/Step-3.7-Flash-240REAP-MXFP416 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tcclaviger/Step-3.7-Flash-240REAP-MXFP416" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tcclaviger/Step-3.7-Flash-240REAP-MXFP416",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tcclaviger/Step-3.7-Flash-240REAP-MXFP416" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tcclaviger/Step-3.7-Flash-240REAP-MXFP416",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use tcclaviger/Step-3.7-Flash-240REAP-MXFP416 with Docker Model Runner:
```
docker model run hf.co/tcclaviger/Step-3.7-Flash-240REAP-MXFP416
```

mxfp4_16 quant of stepfun-ai/Step-3.7-Flash

Runtime: requires tcclaviger/vllm22:latest — an RDNA 4 (gfx12xx) vLLM image and the only build with the mxfp4_16 kernels; no other vLLM build loads these weights. Not validated on any other hardware at this time.

Step-3.7-Flash-240REAP

A REAP-pruned variant of stepfun-ai/Step-3.7-Flash: routed experts reduced 288 → 240 per MoE layer via Cerebras REAP. Top-8 routing and all non-MoE components are structurally unchanged, but have been quantized to make the most of the 128GB envelope this model targets.
The quantization is new and experimental, inspired by Q4_NL, MXFP4, and NVFP4. It is currently only qualified on RDNA4 GPUs but should, in theory, work on any FP16-capable GPU. It is more accurate than NVFP4, Q4_NL, MXFP4, or any other 4-bit quant I've tested; MSE is in the ballpark of Q5/Q6_K_L.
A prebuilt vllm-22 Docker image is available: pull tcclaviger/vllm22:latest from Docker Hub (more kernels integrated soon™). It is currently the only way to run this model, and it also has a TON of fixes for RDNA 4 in general. You will not find a faster vllm build on 4x9700s than this. It has come to my attention that TP8 is not accounted for; I'll be adding the configs this weekend.
The chat template and the parser in the container have been modified to respect the normal vllm chat kwargs overrides, so thinking can be disabled or controlled correctly. With thinking off, the model now behaves like an instruct model, eagerly following commands with no reasoning.
I AM dogfooding this, using it to drive a modified openclaude as my system agent. It is working quite well; the vibe feels ~Opus 4.5ish.
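
A minimal sketch of how the chat-kwargs override mentioned above might be passed to the OpenAI-compatible endpoint. vLLM accepts a `chat_template_kwargs` object in the request body; the key name `enable_thinking`, the port, and the helper name `build_chat_request` are assumptions that this card does not confirm, so check the chat template shipped in the container for the actual switch.

```python
import json

MODEL_ID = "tcclaviger/Step-3.7-Flash-240REAP-MXFP416"


def build_chat_request(prompt, thinking=True, model=MODEL_ID):
    """Build a /v1/chat/completions payload with a thinking toggle.

    `enable_thinking` is an ASSUMED key name; the chat template in the
    container defines which kwargs it actually reads.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # vLLM forwards this dict to the chat template at render time.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


if __name__ == "__main__":
    payload = build_chat_request("List the files in /tmp.", thinking=False)
    # POST this JSON to http://localhost:8000/v1/chat/completions (port assumed).
    print(json.dumps(payload, indent=2))
```

With the `openai` Python client, the same dict can be sent through `extra_body={"chat_template_kwargs": {...}}` on `client.chat.completions.create`.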

Docker Container RDNA4 Fixes Include

Fixed attention selection order to select Triton before ROCM attention. ROCM attention is currently brittle and slow, about 10x slower as context length grows.
Added a tunable Triton attention kernel. This is the single greatest performance uplift: the default unified attention kernel is wildly misconfigured for RDNA4, which is why the 9070/9700 perform worse than an RTX 4060 in vllm right now. The result is nearly flat curves from a 1-token prompt to a max-length prompt, exactly as a correct attention kernel should behave (note the ~linear fall-off instead of the quadratic collapse of normal RDNA4 charts).
Added a wide range of tuned GEMM configuration files for the R9700 that lift performance.
Fixed the Q projection under --kv-cache-dtype fp8 so that Q can use FP8 on the R9700. Quantizing KV is now a speedup instead of a slowdown, as it should be: the gain grows linearly with context length, reaching roughly 3x (from ~780 to ~2450) at 200k on this model.
Fixed the path where KV is quantized to FP8 but Q is not, which is essentially what most GPUs use in Triton attention. It previously had a nasty bug; fixing it doubled long-context throughput. On this code path, throughput is ~2x that of the original Triton FP8 KV quantization code.

Figure: Step 3.7 Flash 240 REAP mxfp4_16 decode scaling with prompt length on 4x R9700s with MTP3 (variability is just MTP hit rates).

Figure: Qwen 3.6 35B A3B mxfp4_16 decode scaling with prompt length on 4x R9700s without MTP.

Results (this variant)

Perplexity: 3.944 (up slightly due to FP8 attention quantization)
Perplexity: 3.951 (up slightly due to FP8 attention quantization plus dynamic FP8 KV quantization, for when you need the space for long context)
tcclaviger codeneedle: 100 / 100 (with whitespace normalization omitted from scoring)
Tool calls: 0 failed

REAP Expert Pruning

This model was produced by applying REAP — Router-weighted Expert Activation Pruning (Cerebras, arXiv:2510.13999) — to the base Step-3.7-Flash Mixture-of-Experts.

Routed experts: 288 → 240 per MoE layer (48 pruned). Top-8 routing is unchanged.
Pruning touches the MoE text layers (3–44) only. Dense layers 0–2, the always-on shared expert in each MoE layer, attention, the router gate/bias, norms, lm_head, and the vision tower are left untouched.
REAP scores each expert by router-weighted activation magnitude over a calibration set, drops the lowest-contribution experts, renumbers the survivors, and slices the matching rows from the router (moe.gate / router_bias). Surviving expert weights are copied verbatim — no re-training.
config.json moe_num_experts is updated to 240 to match.
MTP layers (45–47) — the multi-token-prediction / speculative-decoding heads — are preserved untouched in the checkpoint (all 51 tensors retained), so MTP/EAGLE speculative decoding still works.
MTP accuracy is observed at a consistent 50 to 70 percent across different domains; the typical distribution is ~0.92, 0.75, 0.55 for the predicted tokens.
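
The pruning steps in the list above can be sketched in a few lines of pure Python. This is a toy illustration of router-weighted scoring under stated assumptions, not the Cerebras REAP implementation: the function names, the input layout, and the exact saliency definition (mean of gate weight times expert-output norm over the tokens routed to each expert) are hypothetical.

```python
def reap_scores(routed, num_experts):
    """Average gate_weight * output_norm per expert over routed tokens.

    `routed` is a list of tokens; each token is a list of
    (expert_id, gate_weight, output_norm) for its selected experts.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for token in routed:
        for expert_id, gate, norm in token:
            totals[expert_id] += gate * norm
            counts[expert_id] += 1
    # Experts that never fired get a score of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def prune_experts(scores, keep, protected=()):
    """Return surviving original expert ids, in ascending order."""
    forced = set(protected)
    if len(forced) > keep:
        raise ValueError("more protected experts than slots")
    ranked = sorted(
        (e for e in range(len(scores)) if e not in forced),
        key=lambda e: scores[e],
        reverse=True,
    )
    survivors = forced | set(ranked[: keep - len(forced)])
    return sorted(survivors)


def slice_router(router_rows, survivors):
    """Keep only the router rows of surviving experts (this renumbers them)."""
    return [router_rows[e] for e in survivors]


if __name__ == "__main__":
    # Toy layer: 4 experts, top-2 routing, 3 calibration tokens.
    routed = [
        [(0, 0.7, 2.0), (1, 0.3, 1.0)],
        [(0, 0.6, 2.5), (2, 0.4, 0.5)],
        [(1, 0.5, 1.0), (3, 0.5, 0.1)],
    ]
    scores = reap_scores(routed, num_experts=4)
    # Expert 3 scores lowest but is protected, so expert 2 is dropped instead.
    survivors = prune_experts(scores, keep=3, protected=[3])
    router = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    print(survivors)
    print(slice_router(router, survivors))
```

The `protected` argument mirrors the idea of shielding selected experts from pruning; for this checkpoint the real numbers are 288 experts and `keep=240` per MoE layer.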

Preserved super-experts

Per REAP guidelines, super-experts — those with anomalously high max-activation magnitudes — are protected from pruning and explicitly retained. The following 8 super-experts (outlier mode, all layers) were identified via REAP's get_super_expert_indices over the calibration metrics and verified present in the final survivor set. Expert IDs are in the original (pre-renumber) indexing:

Layer	Expert	Max activation
41	e186	1984.2 ← the giant
42	e128	1102.1
44	e125	689.3
44	e86	527.3
43	e67	490.0
44	e88	389.6
29	e171	268.6
30	e187	206.7

All 8 were kept.
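
As a small illustration of the "verified present in the final survivor set" step, the sketch below maps original expert IDs from the table to their renumbered indices and fails loudly if one is missing. The survivor list is a made-up toy example and the helper names are hypothetical; the real per-layer survivor sets come from the pruning run and are not listed in this card.

```python
# (layer, original expert id, max activation) copied from the table above.
SUPER_EXPERTS = [
    (41, 186, 1984.2), (42, 128, 1102.1), (44, 125, 689.3), (44, 86, 527.3),
    (43, 67, 490.0), (44, 88, 389.6), (29, 171, 268.6), (30, 187, 206.7),
]


def renumbered_index(survivors, original_id):
    """Map an original expert id to its post-pruning index.

    `survivors` is the ascending list of original ids kept in one layer.
    Raises ValueError if the expert was pruned.
    """
    if original_id not in survivors:
        raise ValueError(f"expert e{original_id} was pruned")
    return survivors.index(original_id)


def super_experts_by_layer(rows=SUPER_EXPERTS):
    """Group the protected expert ids per layer."""
    grouped = {}
    for layer, expert, _ in rows:
        grouped.setdefault(layer, []).append(expert)
    return {layer: sorted(ids) for layer, ids in grouped.items()}


if __name__ == "__main__":
    print(super_experts_by_layer())
    # HYPOTHETICAL survivor set for layer 44: ids 0-47 pruned, 48-287 kept.
    toy_survivors = list(range(48, 288))
    for expert in super_experts_by_layer()[44]:
        print(f"e{expert} -> new index {renumbered_index(toy_survivors, expert)}")
```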

FP8 KV-Cache Scales

This checkpoint ships with calibrated per-tensor FP8 KV-cache scales baked in, so FP8 KV cache runs at the model's measured activation ranges instead of generic defaults.

90 scalar scales — one k_scale and one v_scale per attention layer (45 layers), stored as FP32 scalars in model-kvscales.safetensors and registered in model.safetensors.index.json (tensor names model.layers.{i}.self_attn.{k,v}_proj.{k,v}_scale).
Scales were calibrated, not assumed: observed ranges are k_scale ≈ 0.021–0.120 and v_scale ≈ 0.018–0.184, varying per layer. Calibration used 3072 samples of varying length, covering chat, huge code files, single images, multiple images per request, mixed content, agentic tasks, etc.: a representative workload.
Weights are untouched (bf16). These scales only affect the KV cache — they are the static dequant scales for an FP8 (e4m3) KV cache.
Enable in vLLM with --kv-cache-dtype fp8; the scales load straight from the checkpoint, so no separate calibration file or runtime scale-search is needed.
Why: FP8 KV roughly halves KV-cache memory, which is what makes 200k context at MTP 3 fit inside the 128GB envelope; the embedded calibration keeps quantization error low versus default/unit scales.
Safety margin: The scales have some headroom built in so that small excursions beyond the calibration-observed range don't clip: ~15% for K and ~25% for V. If they do clip, it will be a rare edge case. Accuracy is still vastly improved versus a 1.0 scalar.
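
To make the headroom arithmetic concrete, here is a minimal sketch of how a static per-tensor scale could be derived from an observed absolute maximum. It assumes the common convention that the scale maps values onto the FP8 e4m3 finite range (maximum 448) and that the 45 attention layers are numbered 0 to 44; the function names and sample numbers are illustrative, not the calibration code used for this checkpoint.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3fn format


def kv_scale(observed_amax, headroom):
    """Static dequant scale so that amax * (1 + headroom) maps to FP8 max."""
    return observed_amax * (1.0 + headroom) / FP8_E4M3_MAX


def clips(value, scale):
    """True if `value` would saturate when stored as value / scale in FP8."""
    return abs(value) / scale > FP8_E4M3_MAX


def scale_tensor_names(num_layers=45):
    """Expand the name pattern from the card into concrete tensor names.

    ASSUMPTION: attention layers are indexed 0 .. num_layers - 1.
    """
    names = []
    for i in range(num_layers):
        for kv in ("k", "v"):
            names.append(f"model.layers.{i}.self_attn.{kv}_proj.{kv}_scale")
    return names


if __name__ == "__main__":
    amax_k = 10.0  # hypothetical observed |K| maximum for one layer
    k_scale = kv_scale(amax_k, headroom=0.15)  # ~15% headroom for K
    print(k_scale)
    print(clips(11.0, k_scale))  # 10% above the observed max: inside the margin
    print(clips(12.0, k_scale))  # 20% above the observed max: saturates
    print(len(scale_tensor_names()))
```

A larger headroom trades a little resolution for fewer saturated values, which is the trade-off the safety margin above describes.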

The intent was to provide a model that retains nearly full 3.7 Flash capability but is friendlier to 128GB systems such as Strix Halo 395+, Spark, and Thor, or to 4x32 or 4x48 GPU setups. With careful deployment it is possible to get 200k context and MTP 3 within the 128GB envelope when quantized.

[ModelPage]: https://static.stepfun.com/blog/step-3.7-flash/

1. Introduction

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.

We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.

2. Capabilities & Performance

Multimodal Perception and Verification

The model delivers top-tier visual intelligence, securing first place on SimpleVQA (Search) with a score of 79.2 and achieving frontier parity on V* (Python) at 95.3. These metrics reflect strong visual grounding and retrieval-augmented reasoning beyond basic image description. The model accurately processes dense visual interfaces, such as UI wireframes, application GUIs, and data charts, to map them into structured code. When it encounters an incomplete visual asset, it can independently identify missing data and execute lookups to verify context before returning a factually verified conclusion.

Workflow Integrity and Tool Orchestration

Execution reliability is critical for autonomous agents. Step 3.7 Flash leads the ClawEval-1.1 benchmark with a score of 67.1, which significantly outperforms the next closest competitor at 59.8. This performance demonstrates high resistance to adversarial traps and strict adherence to system policies during multi-turn orchestration. Backed by scores of 49.5 on Toolathlon and 48.1 on HLE w. Tool, this profile ensures high trajectory integrity. Step 3.7 Flash reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions or violating system constraints.

Code Engineering and Professional Baselines

Step 3.7 Flash is built for live engineering tasks and secured a definitive second-place finish on SWE-Bench PRO with a score of 56.3. It can independently trace multi-file repositories, isolate bugs from raw issue reports, and generate functional patches that pass automated unit tests. While evaluations like Terminal-Bench 2.1 (59.5) and GDPVal-AA (45.8) show clear areas for future optimization compared to the absolute peak of the cohort, they establish a dependable baseline for system interactions and structured professional deliverables.

3. Pricing

Token Type	Price
Input (cache miss)	$0.20 / M tokens
Input (cache hit)	$0.04 / M tokens
Output	$1.15 / M tokens
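
For budgeting, the table translates directly into a small cost estimate. The sketch below uses the three listed prices; the function name and the example token counts are illustrative assumptions.

```python
# USD per million tokens, taken from the pricing table above.
PRICE_PER_M = {"input_miss": 0.20, "input_hit": 0.04, "output": 1.15}


def estimate_cost(input_miss=0, input_hit=0, output=0):
    """Return the estimated USD cost for the given token counts."""
    tokens = {"input_miss": input_miss, "input_hit": input_hit, "output": output}
    return sum(tokens[k] * PRICE_PER_M[k] for k in PRICE_PER_M) / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent run: 2M fresh input, 8M cached input, 0.5M output.
    cost = estimate_cost(input_miss=2_000_000, input_hit=8_000_000, output=500_000)
    print(f"estimated cost: ${cost:.3f}")
```

Because cached input is five times cheaper than uncached input, the cache-hit ratio tends to dominate the bill for long agent loops that resend the same context.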

4. Availability, Deployment, and Ecosystem

Availability: Step 3.7 Flash is available on the StepFun Open Platform — platform.stepfun.ai (Global) and platform.stepfun.com (China), OpenRouter, and NVIDIA NIM. StepFun is also partnering with DeepInfra, Fireworks AI, and Modal to expand availability soon.
Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / MacBook Pro devices with at least 128GB unified memory.
Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development and customization workflows, StepFun model support has landed in the NVIDIA NeMo ecosystem, including AutoModel, Megatron Core, and Megatron Bridge. Step 3.7 Flash is also available as an NVIDIA NIM inference microservice for on-prem, cloud, or hybrid deployment.

5. Examples

You can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.

Pick the right base_url for your region. StepFun operates two regional platforms with separate API hosts. The base_url you pass to the OpenAI client must match the platform where your API key was issued, otherwise requests will be rejected as unauthorized.

Global: platform.stepfun.ai — base_url=https://api.stepfun.ai/v1

China: platform.stepfun.com — base_url=https://api.stepfun.com/v1

To avoid hard-coding the wrong region, the examples below read both the API key and base URL from environment variables. Export them once before running:
export STEP_API_KEY="sk-..."
export STEP_BASE_URL="https://api.stepfun.ai/v1"   # use https://api.stepfun.com/v1 for the China platform
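
Since a key and base URL from different regions are rejected as unauthorized, it can help to check the environment before creating a client. The helper below is a sketch: the function name `load_step_config` and its error messages are assumptions, and it only checks the URL against the two hosts listed above, not whether the key actually belongs to that region.

```python
import os

# Base URLs listed above for the two regional platforms.
KNOWN_BASE_URLS = {
    "global": "https://api.stepfun.ai/v1",
    "china": "https://api.stepfun.com/v1",
}


def load_step_config(env=os.environ):
    """Read STEP_API_KEY / STEP_BASE_URL and report which region is targeted."""
    api_key = env.get("STEP_API_KEY")
    base_url = env.get("STEP_BASE_URL", "").rstrip("/")
    if not api_key:
        raise RuntimeError("STEP_API_KEY is not set")
    for region, url in KNOWN_BASE_URLS.items():
        if base_url == url:
            return {"api_key": api_key, "base_url": base_url, "region": region}
    raise RuntimeError(f"STEP_BASE_URL {base_url!r} is not a known StepFun host")


if __name__ == "__main__":
    # Demo with an in-memory mapping instead of the real environment.
    demo_env = {"STEP_API_KEY": "sk-demo", "STEP_BASE_URL": "https://api.stepfun.ai/v1"}
    config = load_step_config(demo_env)
    print(config["region"], config["base_url"])
```

The returned `api_key` and `base_url` can be passed straight to `OpenAI(...)` in the examples that follow.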

5.1 Chat Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ["STEP_BASE_URL"],
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant provided by StepFun. You are good at Chinese, English, and many other languages, and you can see, think, and act to help users get things done.",
        },
        {
            "role": "user",
            "content": "Introduce StepFun's artificial intelligence capabilities."
        },
    ],
)

print(completion)

5.2 Text and Image Input Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ["STEP_BASE_URL"],
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        },
    ],
)

print(completion)

6. Local Deployment

Step 3.7 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.

6.1 vLLM

We recommend using StepFun's prebuilt vLLM Docker image with Step 3.7 support.

Install vLLM.

# via Docker
docker pull vllm/vllm-openai:stepfun37

Launch the server.

For FP8 model

vllm serve <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p7-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code

For BF16 model

vllm serve <MODEL_PATH_OR_HF_ID> \
--served-model-name step3p7-flash-bf16 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code

For NVFP4 model

Compared to standard precisions, running the FP4 quantized version requires modelopt activation and FP8 KV Cache alignment.

python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port ${PORT} \
--model stepfun-ai/Step-3.7-Flash-NVFP4 \
--served-model-name step3p7 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--trust-remote-code \
--quantization modelopt \
--kv-cache-dtype fp8 \
--max-model-len 8192 \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--async-scheduling

6.2 SGLang

Install SGLang.

# via Docker
docker pull lmsysorg/sglang:dev-step-3.7-flash

# or from source (pip)
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"

Launch the server.

Note: For Blackwell GPUs, --mm-attention-backend fa4 may be used.

For BF16 model

sglang serve --model-path stepfun-ai/Step-3.7-Flash \
  --tp 8 \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5 \
  --enable-multimodal \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

For FP8 model

sglang serve --model-path stepfun-ai/Step-3.7-Flash-FP8 \
  --tp 8 \
  --ep 4 \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5 \
  --enable-multimodal \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

For NVFP4 model

sglang serve --model-path stepfun-ai/Step-3.7-Flash-NVFP4 \
  --tp 4 --ep 4 \
  --moe-runner-backend flashinfer_trtllm \
  --kv-cache-dtype fp8_e4m3 \
  --quantization modelopt_fp4 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --tool-call-parser step3p5 \
  --attention-backend trtllm_mha

6.3 Transformers (Debug / Verification)

Use this snippet for quick functional verification. For high-throughput serving, use vLLM or SGLang.

Note: Deployment of this model requires transformers 5.0 or later.

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_PATH = "<MODEL_PATH_OR_HF_ID>"

# 1. Setup
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    dtype="auto",
    trust_remote_code=True
)

# 2. Prepare Input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What is in this picture?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# 3. Generate
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
output_text = processor.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(output_text)

6.4 llama.cpp

System Requirements

GGUF Model Weights:

Component	Quantization	File Size
Language Model	Q4_K_S	111.5 GB
Language Model	IQ4_XS	104.99 GB
Language Model	Q3_K_L	102.5 GB
Multimodal Projector	FP16	3.97 GB

Runtime Overhead: ~7 GB
Minimum unified memory / VRAM: 120 GB (e.g., Mac Studio, NVIDIA DGX Station, AMD Ryzen AI Max+ 395)
Recommended: 128 GB unified memory
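
The figures above can be added up per quantization to see how much room each leaves. This sketch simply sums the listed file sizes with the stated ~7 GB overhead; the function name is an assumption, and it does not model how runtime overhead changes with context length.

```python
# File sizes in GB from the table above.
LANGUAGE_MODEL_GB = {"Q4_K_S": 111.5, "IQ4_XS": 104.99, "Q3_K_L": 102.5}
PROJECTOR_GB = 3.97      # multimodal projector, FP16
RUNTIME_OVERHEAD_GB = 7  # approximate, per the figure above


def required_memory_gb(quant, with_vision=True):
    """Weights + optional projector + runtime overhead, in GB."""
    total = LANGUAGE_MODEL_GB[quant] + RUNTIME_OVERHEAD_GB
    if with_vision:
        total += PROJECTOR_GB
    return total


if __name__ == "__main__":
    for quant in LANGUAGE_MODEL_GB:
        need = required_memory_gb(quant)
        print(f"{quant}: ~{need:.2f} GB, fits in 128 GB: {need <= 128}")
```

By this arithmetic, Q4_K_S with the projector comes to about 122.5 GB, which is why 128 GB is the comfortable target; IQ4_XS and Q3_K_L leave more headroom.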

Steps

Use llama.cpp:

git clone https://github.com/stepfun-ai/llama.cpp.git
cd llama.cpp
git checkout -b step3.7 origin/step3.7

Build llama.cpp on Mac:

cmake -B build-macos -S . \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_TESTS=ON \
    -DGGML_METAL=ON \
    -DGGML_METAL_EMBED_LIBRARY=ON \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=Apple \
    -DGGML_ACCELERATE=ON \
    -DGGML_NATIVE=ON
cmake --build build-macos -j8

Build llama.cpp on DGX-Spark:

cmake -S . -B build-cuda \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DLLAMA_OPENSSL=OFF \
  -DLLAMA_BUILD_COMMON=ON \
  -DLLAMA_BUILD_TOOLS=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_EXAMPLES=OFF \
  -DLLAMA_BUILD_TESTS=OFF
cmake --build build-cuda -j8

Build llama.cpp on AMD Windows:

cmake -S . -B build-vulkan \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_UI=OFF \
  -DLLAMA_BUILD_TOOLS=ON
cmake --build build-vulkan -j8

Run with llama-cli:

./llama-cli -m Step3.7_Q4_K_S.gguf -b 2048 -ub 2048 -fa on --temp 1.0 -p "What's your name?"

Test performance with llama-batched-bench:

./llama-batched-bench -m step3.7_Q4_K_S.gguf -c 32768 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1

7. Using Step 3.7 Flash on Agent Platforms

You can use Step 3.7 Flash on Agent platforms such as Hermes Agent, OpenClaw, Kilo Code, and more.

8. Getting in Touch

As we work to shape the future of AGI by expanding broad model capabilities, we want to ensure we are solving the right problems. We invite you to be part of this continuous feedback loop — your insights directly influence our priorities.

Join the Conversation: Our Discord community is the primary hub for brainstorming future architectures, proposing capabilities, and getting early access updates 🚀
Report Friction: Encountering limitations? You can open an issue or start a discussion on GitHub / HuggingFace, or flag it directly in our Discord support channels.

📄 License

This project is open-sourced under the Apache 2.0 License.

Downloads last month: 31

Safetensors
Model size: 100B params
Tensor types: F32, BF16, F8_E4M3, F16

Model tree for tcclaviger/Step-3.7-Flash-240REAP-MXFP416
Base model: stepfun-ai/Step-3.7-Flash
Quantized (40): this model

Paper for tcclaviger/Step-3.7-Flash-240REAP-MXFP416

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Paper • 2510.13999 • Published Oct 15, 2025
