stepfun-ai/Step-3.5-Flash-Int4

WinstonDeng committed 2 days ago

Commit 58d2849 (verified) · 1 Parent(s): f937f1e

Update README.md

Files changed (1)

README.md +1 -166

@@ -511,169 +511,4 @@ As we work to shape the future of AGI by expanding broad model capabilities, we
 - **Report Friction**: Encountering limitations? You can open an issue on GitHub or flag it directly in our Discord support channels.
 ## License
-This project is open-sourced under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
-## 1. Introduction
-**Step3.5** is our most capable open-source reasoning model, purpose-built for agentic workflows.
-It bridges the gap between massive scale and high performance by combining 196B parameters of knowledge with the inference latency of an 11B model.
-We prioritized developer needs to balance speed, cost, and accessibility. This enables the creation of production-grade agents that are fast, stable, and cost-effective.
-## 2. Key Capabilities
-- Frontier intelligence at 200 tokens/s: Step3.5 matches GPT-5 and Gemini 3.0 Pro in reasoning but runs 4x faster. By leveraging Multi-Token Prediction (MTP-3), Step3.5 predicts three tokens simultaneously, achieving 200 tokens/s for real-time responsiveness.
-- Easy local deployment: Despite its massive 196B total parameter count, Step3.5's sparse MoE architecture allows it to run locally on high-end consumer hardware (e.g. Mac Studio M2/M3 Ultra). This enables secure, offline deployment of elite-level intelligence.
-- Agentic & coding mastery: Step3.5 is fine-tuned for reliability. It achieves 85.5% on LiveCodeBench and 72.1% on SWE-bench Verified, making it a robust engine for autonomous software engineering and multi-step planning.
-- Cost-effective long context: Optimized with a 3:1 sliding window attention strategy (512 window), Step3.5 handles extended contexts with minimal memory overhead, perfect for RAG applications and analyzing large codebases.
-## 3. Benchmarks
-## Architecture
-### Key Features:
-- Hybrid Attention Schedules and Compensation for SWA
-- Mixture-of-Experts Routing And Load balancing
-### Architecture Details
-- Backbone: 45-layer Transformer
-- Vocabulary: 128,896 tokens
-- Hidden Dim: 4,096
-- MoE Blocks:
-  - 288 routed experts + 1 shared expert per block
-  - Top-8 expert selection per token
-- Parameters: Total:
-  196.81B (Backbone: 196B + MTP Head: 0.81B)
-- Activated per token:
-  11B (excludes embedding/output projections)
-- Special Components:
-  Multi-token Prediction (MTP) head with sliding-window attention and dense FFN
-## 5. Getting started
-## Deployment Resource Specifications
-- Model Weights: 20 GB
-- Runtime Overhead: ~4 GB
-- Minimum VRAM Required: 24 GB (e.g., RTX 4090 or A100)
-## Deploy Step3.5 Locally
-For local deployment, Step3.5-preview supports inference frameworks including vLLM and SGLang. Comprehensive deployment instructions are available in the official [Github](#) repository.
-vLLM and SGLang only support Step3.5-preview on their main branches. you can use their official docker images for inference.
-### vLLM
-Using Docker as:
-```shell
-docker pull vllm/vllm-openai:nightly
-```
-or using pip (must use pypi.org as the index url):
-```shell
-pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
-```
-### SGLang
-Using Docker as:
-```shell
-docker pull lmsysorg/sglang:dev
-```
-or using pip install sglang from source.
-### transformers
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-MODEL_PATH = "xxxxxx"
-messages = [{"role": "user", "content": "hello"}]
-tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
-inputs = tokenizer.apply_chat_template(
-    messages,
-    tokenize=True,
-    add_generation_prompt=True,
-    return_dict=True,
-    return_tensors="pt",
-)
-model = AutoModelForCausalLM.from_pretrained(
-    pretrained_model_name_or_path=MODEL_PATH,
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-)
-inputs = inputs.to(model.device)
-generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
-output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1] :])
-print(output_text)
-```
-### vLLM
-```shell
-vllm serve {xxx} \
-     --tensor-parallel-size 4 \
-     --speculative-config.method mtp \
-     --speculative-config.num_speculative_tokens 1 \
-     --tool-call-parser {xxx} \
-     --reasoning-parser {xxx} \
-     --enable-auto-tool-choice \
-     --served-model-name {xxx}
-```
-### SGLang
-```shell
-python3 -m sglang.launch_server \
-  --model-path {xxx} \
-  --tp-size 8 \
-  --tool-call-parser {xxx}  \
-  --reasoning-parser {xxx} \
-  --speculative-algorithm EAGLE \
-  --speculative-num-steps 3 \
-  --speculative-eagle-topk 1 \
-  --speculative-num-draft-tokens 4 \
-  --mem-fraction-static 0.8 \
-  --served-model-name {xxx} \
-  --host 0.0.0.0 \
-  --port 8000
-```
-### Parameter Instructions
-- When using `vLLM` and `SGLang`, thinking mode is enabled by default when sending requests.
-- Both support tool calling. Please use OpenAI-style tool description format for calls.
-<!-- ## Citation
-If you find our work useful in your research, please consider citing the following paper:
-```bibtex
-@misc{xxxx,
-      title={Step3.5-preview},
-      author={StepFun Team},
-      year={2026},
-      eprint={xxxx},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/xxxxx},
-}
-``` -->
-## 📄 License
-This project is open-sourced under the [Apache 2.0 License](https://www.google.com/search?q=LICENSE).

+This project is open-sourced under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).