Transformers
WinstonDeng committed on
Commit 58d2849 · verified · 1 Parent(s): f937f1e

Update README.md

  1. README.md +1 -166
README.md CHANGED
@@ -511,169 +511,4 @@ As we work to shape the future of AGI by expanding broad model capabilities, we
  - **Report Friction**: Encountering limitations? You can open an issue on GitHub or flag it directly in our Discord support channels.
 
  ## License
- This project is open-sourced under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
-
-
-
-
-
-
-
- ## 1. Introduction
-
- **Step3.5** is our most capable open-source reasoning model, purpose-built for agentic workflows.
- It bridges the gap between massive scale and high performance by combining 196B parameters of knowledge with the inference latency of an 11B model.
- We prioritized developer needs to balance speed, cost, and accessibility. This enables the creation of production-grade agents that are fast, stable, and cost-effective.
-
- ## 2. Key Capabilities
-
- - Frontier intelligence at 200 tokens/s: Step3.5 matches GPT-5 and Gemini 3.0 Pro in reasoning but runs 4x faster. By leveraging Multi-Token Prediction (MTP-3), Step3.5 predicts three tokens simultaneously, achieving 200 tokens/s for real-time responsiveness.
- - Easy local deployment: Despite its massive 196B total parameter count, Step3.5's sparse MoE architecture allows it to run locally on high-end consumer hardware (e.g., Mac Studio M2/M3 Ultra). This enables secure, offline deployment of elite-level intelligence.
- - Agentic & coding mastery: Step3.5 is fine-tuned for reliability. It achieves 85.5% on LiveCodeBench and 72.1% on SWE-bench Verified, making it a robust engine for autonomous software engineering and multi-step planning.
- - Cost-effective long context: Optimized with a 3:1 sliding window attention strategy (512 window), Step3.5 handles extended contexts with minimal memory overhead, perfect for RAG applications and analyzing large codebases.
-
- ## 3. Benchmarks
-
- ## Architecture
-
- ### Key Features:
- - Hybrid Attention Schedules and Compensation for SWA
-
- - Mixture-of-Experts Routing and Load Balancing
-
- ### Architecture Details
-
- - Backbone: 45-layer Transformer
- - Vocabulary: 128,896 tokens
- - Hidden Dim: 4,096
- - MoE Blocks:
-   - 288 routed experts + 1 shared expert per block
-   - Top-8 expert selection per token
- - Parameters: Total:
-   196.81B (Backbone: 196B + MTP Head: 0.81B)
- - Activated per token:
-   11B (excludes embedding/output projections)
- - Special Components:
-
-   Multi-token Prediction (MTP) head with sliding-window attention and dense FFN
-
- ## 5. Getting started
-
- ## Deployment Resource Specifications
-
- - Model Weights: 20 GB
- - Runtime Overhead: ~4 GB
- - Minimum VRAM Required: 24 GB (e.g., RTX 4090 or A100)
-
- ## Deploy Step3.5 Locally
-
- For local deployment, Step3.5-preview supports inference frameworks including vLLM and SGLang. Comprehensive deployment instructions are available in the official [GitHub](#) repository.
-
- vLLM and SGLang only support Step3.5-preview on their main branches. You can use their official Docker images for inference.
-
- ### vLLM
-
- Using Docker:
-
- ```shell
- docker pull vllm/vllm-openai:nightly
- ```
-
- Or using pip (pypi.org must be the index URL):
-
- ```shell
- pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
- ```
-
- ### SGLang
-
- Using Docker:
-
- ```shell
- docker pull lmsysorg/sglang:dev
- ```
-
- Or install SGLang from source with pip.
-
- ### transformers
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- MODEL_PATH = "xxxxxx"
- messages = [{"role": "user", "content": "hello"}]
- tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
- inputs = tokenizer.apply_chat_template(
-     messages,
-     tokenize=True,
-     add_generation_prompt=True,
-     return_dict=True,
-     return_tensors="pt",
- )
- model = AutoModelForCausalLM.from_pretrained(
-     pretrained_model_name_or_path=MODEL_PATH,
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
- )
- inputs = inputs.to(model.device)
- generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
- output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1] :])
- print(output_text)
- ```
-
- ### vLLM
-
- ```shell
- vllm serve {xxx} \
-     --tensor-parallel-size 4 \
-     --speculative-config.method mtp \
-     --speculative-config.num_speculative_tokens 1 \
-     --tool-call-parser {xxx} \
-     --reasoning-parser {xxx} \
-     --enable-auto-tool-choice \
-     --served-model-name {xxx}
- ```
-
- ### SGLang
-
- ```shell
- python3 -m sglang.launch_server \
-     --model-path {xxx} \
-     --tp-size 8 \
-     --tool-call-parser {xxx} \
-     --reasoning-parser {xxx} \
-     --speculative-algorithm EAGLE \
-     --speculative-num-steps 3 \
-     --speculative-eagle-topk 1 \
-     --speculative-num-draft-tokens 4 \
-     --mem-fraction-static 0.8 \
-     --served-model-name {xxx} \
-     --host 0.0.0.0 \
-     --port 8000
- ```
-
- ### Parameter Instructions
-
- - With `vLLM` and `SGLang`, thinking mode is enabled by default for every request.
- - Both support tool calling. Describe tools in the OpenAI-style tool description format.
-
- <!-- ## Citation
-
- If you find our work useful in your research, please consider citing the following paper:
-
- ```bibtex
- @misc{xxxx,
-   title={Step3.5-preview},
-   author={StepFun Team},
-   year={2026},
-   eprint={xxxx},
-   archivePrefix={arXiv},
-   primaryClass={cs.CL},
-   url={https://arxiv.org/abs/xxxxx},
- }
- ``` -->
-
- ## 📄 License
-
- This project is open-sourced under the [Apache 2.0 License](https://www.google.com/search?q=LICENSE).
 
+ This project is open-sourced under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
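
The removed "Parameter Instructions" lines state that both vLLM and SGLang expect tools in the OpenAI-style description format. As a minimal sketch of what such a request body could look like, the snippet below builds a chat-completions payload with a single tool. The `get_weather` tool, its `city` argument, and the `step3.5-preview` served model name are hypothetical placeholders and do not come from the README. The snippet only constructs and prints the JSON and does not contact a server.

```python
import json


def build_tool_request(model_name: str, user_message: str) -> dict:
    """Build an OpenAI-style chat-completions payload that offers one tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool name, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {  # JSON Schema describing the tool arguments
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
    return {
        "model": model_name,  # should match the --served-model-name value
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


# "step3.5-preview" is an assumed served model name, not taken from the README.
payload = build_tool_request("step3.5-preview", "What is the weather in Paris?")
print(json.dumps(payload, indent=2))
```

A payload of this shape would typically be sent as a POST to the OpenAI-compatible `/v1/chat/completions` route that both servers expose, for example on the port passed via `--port`.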