Instructions to use unsloth/Step-3.7-Flash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use unsloth/Step-3.7-Flash with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="unsloth/Step-3.7-Flash", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("unsloth/Step-3.7-Flash", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use unsloth/Step-3.7-Flash with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "unsloth/Step-3.7-Flash"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/Step-3.7-Flash",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/unsloth/Step-3.7-Flash

SGLang

How to use unsloth/Step-3.7-Flash with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "unsloth/Step-3.7-Flash" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/Step-3.7-Flash",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "unsloth/Step-3.7-Flash" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/Step-3.7-Flash",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'


danielhanchen committed 3 days ago

Commit 6c0b16e (verified) · 1 parent: 26a7625

Upload folder using huggingface_hub

Files changed (40)

.gitattributes +1 -0
README.md +421 -0
assets/benchmarks.png +3 -0
chat_template.jinja +89 -0
config.json +410 -0
configuration_step3p7.py +207 -0
model-00001.safetensors +3 -0
model-00002.safetensors +3 -0
model-00003.safetensors +3 -0
model-00004.safetensors +3 -0
model-00005.safetensors +3 -0
model-00006.safetensors +3 -0
model-00007.safetensors +3 -0
model-00008.safetensors +3 -0
model-00009.safetensors +3 -0
model-00010.safetensors +3 -0
model-00011.safetensors +3 -0
model-00012.safetensors +3 -0
model-00013.safetensors +3 -0
model-00014.safetensors +3 -0
model-00015.safetensors +3 -0
model-00016.safetensors +3 -0
model-00017.safetensors +3 -0
model-00018.safetensors +3 -0
model-00019.safetensors +3 -0
model-00020.safetensors +3 -0
model-00021.safetensors +3 -0
model-00022.safetensors +3 -0
model-00023.safetensors +3 -0
model-00024.safetensors +3 -0
model-vit-00001.safetensors +3 -0
model-vit-00002.safetensors +3 -0
model.safetensors.index.json +0 -0
modeling_step3p7.py +1405 -0
processing_step3.py +475 -0
processor_config.json +6 -0
special_tokens_map.json +23 -0
tokenizer.json +0 -0
tokenizer_config.json +22 -0
vision_encoder.py +452 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
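
The added `*.png` rule is what routes `assets/benchmarks.png` through Git LFS. As a rough illustration, the sketch below approximates how slash-free gitattributes patterns are matched against a file's basename. It uses Python's `fnmatch`, which is only an approximation of Git's own matcher, and the helper name `is_lfs_tracked` is hypothetical. Only the patterns visible in this hunk are included.

```python
import fnmatch
import posixpath

# Patterns visible in the hunk above; the real .gitattributes lists more.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "*.png"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate gitattributes matching for patterns without a slash.

    A slash-free pattern is compared against the basename of the path,
    so it applies at any directory depth. This is a hypothetical helper,
    not Git's actual implementation.
    """
    name = posixpath.basename(path)
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


print(is_lfs_tracked("assets/benchmarks.png"))
print(is_lfs_tracked("README.md"))
```

Patterns that contain a slash, such as `saved_model/**/*`, follow different rules and are deliberately left out of this sketch.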

README.md ADDED

	@@ -0,0 +1,421 @@

+---
+base_model:
+- stepfun-ai/Step-3.7-Flash
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+language:
+- en
+tags:
+- vision-language
+- unsloth
+- multimodal
+- moe
+---
+<div>
+<p style="margin-top: 0;margin-bottom: 0;">
+    <em><a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0</a> achieves superior accuracy & outperforms other leading quants.</em>
+  </p>
+  <div style="display: flex; gap: 5px; align-items: center; ">
+    <a href="https://github.com/unslothai/unsloth/">
+      <img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
+    </a>
+    <a href="https://discord.gg/unsloth">
+      <img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
+    </a>
+    <a href="https://docs.unsloth.ai/">
+      <img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
+    </a>
+  </div>
+</div>
+**[ModelPage]**: https://static.stepfun.com/blog/step-3.7-flash/
+## 1. Introduction
+Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.
+We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.
+## 2. Capabilities & Performance
+### Multimodal Perception and Verification
+The model delivers top-tier visual intelligence, securing first place on SimpleVQA (Search) with a score of 79.2 and achieving frontier parity on V* (Python) at 95.3. These metrics reflect strong visual grounding and retrieval-augmented reasoning beyond basic image description. The model accurately processes dense visual interfaces, such as UI wireframes, application GUIs, and data charts, to map them into structured code. When it encounters an incomplete visual asset, it can independently identify missing data and execute lookups to verify context before returning a factually verified conclusion.
+### Workflow Integrity and Tool Orchestration
+Execution reliability is critical for autonomous agents. Step 3.7 Flash leads the ClawEval-1.1 benchmark with a score of 67.1, which significantly outperforms the next closest competitor at 59.8. This performance demonstrates high resistance to adversarial traps and strict adherence to system policies during multi-turn orchestration. Backed by scores of 49.5 on Toolathlon and 48.1 on HLE w. Tool, this profile ensures high trajectory integrity. Step 3.7 Flash reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions or violating system constraints.
+### Code Engineering and Professional Baselines
+Step 3.7 Flash is built for live engineering tasks and secured a definitive second-place finish on SWE-Bench PRO with a score of 56.3. It can independently trace multi-file repositories, isolate bugs from raw issue reports, and generate functional patches that pass automated unit tests. While evaluations like Terminal-Bench 2.1 (59.5) and GDPVal-AA (45.8) show clear areas for future optimization compared to the absolute peak of the cohort, they establish a dependable baseline for system interactions and structured professional deliverables.
+![Step 3.7 Flash benchmark results across General Agent, Agentic Coding, and Multimodal evaluations](assets/benchmarks.png)
+## 3. Pricing
+| Token Type | Price |
+|---|---|
+| Input (cache miss) | $0.20 / M tokens |
+| Input (cache hit) | $0.04 / M tokens |
+| Output | $1.15 / M tokens |
+## 4. Availability, Deployment, and Ecosystem
+- Availability: Step 3.7 Flash is available on the StepFun Open Platform — [platform.stepfun.ai](https://platform.stepfun.ai) (Global) and [platform.stepfun.com](https://platform.stepfun.com) (China), OpenRouter, and NVIDIA NIM. StepFun is also partnering with DeepInfra, Fireworks AI, and Modal to expand availability soon.
+- Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, Step 3.7 Flash can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / MacBook Pro devices with at least 128GB unified memory.
+- Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development & customization workflows, StepFun model support has landed in the NVIDIA NeMo ecosystem, including AutoModel, Megatron Core and Megatron Bridge. Step 3.7 Flash is also available as an NVIDIA NIM inference microservice for on-prem, cloud, or hybrid deployment.
+## 5. Examples
+You can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.
+> Pick the right `base_url` for your region. StepFun operates two regional platforms with separate API hosts. The `base_url` you pass to the OpenAI client must match the platform where your API key was issued, otherwise requests will be rejected as unauthorized.
+>
+> - **Global**: [platform.stepfun.ai](https://platform.stepfun.ai) — `base_url=https://api.stepfun.ai/v1`
+> - **China**: [platform.stepfun.com](https://platform.stepfun.com) — `base_url=https://api.stepfun.com/v1`
+>
+> To avoid hard-coding the wrong region, the examples below read both the API key and base URL from environment variables. Export them once before running:
+>
+> ```bash
+> export STEP_API_KEY="sk-..."
+> export STEP_BASE_URL="https://api.stepfun.ai/v1"   # use https://api.stepfun.com/v1 for the China platform
+> ```
+### 5.1 Chat Example
+```python
+import os
+from openai import OpenAI
+client = OpenAI(
+    api_key=os.environ["STEP_API_KEY"],
+    base_url=os.environ["STEP_BASE_URL"],
+)
+completion = client.chat.completions.create(
+    model="step-3.7-flash",
+    messages=[
+        {
+            "role": "system",
+            "content": "You are an AI assistant provided by StepFun. You are good at Chinese, English, and many other languages, and you can see, think, and act to help users get things done.",
+        },
+        {
+            "role": "user",
+            "content": "Introduce StepFun's artificial intelligence capabilities."
+        },
+    ],
+)
+print(completion)
+```
+### 5.2 Text and Image Input Example
+```python
+import os
+from openai import OpenAI
+client = OpenAI(
+    api_key=os.environ["STEP_API_KEY"],
+    base_url=os.environ["STEP_BASE_URL"],
+)
+completion = client.chat.completions.create(
+    model="step-3.7-flash",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What is in this picture?"},
+                {
+                    "type": "image_url",
+                    "image_url": {"url": "https://example.com/photo.jpg"},
+                },
+            ],
+        },
+    ],
+)
+print(completion)
+```
+## 6. Local Deployment
+Step 3.7 Flash is optimized for local inference and supports industry-standard backends including vLLM, SGLang, Hugging Face Transformers and llama.cpp.
+### 6.1 vLLM
+We recommend using StepFun's prebuilt vLLM Docker image with Step 3.7 support.
+1. Install vLLM.
+```bash
+# via Docker
+docker pull vllm/vllm-openai:stepfun37
+```
+2. Launch the server.
+  - For FP8 model
+  ```bash
+  vllm serve <MODEL_PATH_OR_HF_ID> \
+  --served-model-name step3p7-flash \
+  --tensor-parallel-size 8 \
+  --enable-expert-parallel \
+  --disable-cascade-attn \
+  --reasoning-parser step3p5 \
+  --enable-auto-tool-choice \
+  --tool-call-parser step3p5 \
+  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
+  --trust-remote-code
+  ```
+  - For BF16 model
+  ```bash
+  vllm serve <MODEL_PATH_OR_HF_ID> \
+  --served-model-name step3p7-flash-bf16 \
+  --tensor-parallel-size 8 \
+  --enable-expert-parallel \
+  --disable-cascade-attn \
+  --reasoning-parser step3p5 \
+  --enable-auto-tool-choice \
+  --tool-call-parser step3p5 \
+  --speculative_config '{"method": "mtp", "num_speculative_tokens": 3}' \
+  --trust-remote-code
+  ```
+  - For NVFP4 model
+  Unlike the FP8 and BF16 launches above, the NVFP4 checkpoint needs ModelOpt quantization (`--quantization modelopt`) and an FP8 KV cache (`--kv-cache-dtype fp8`).
+  ```bash
+  python3 -m vllm.entrypoints.openai.api_server \
+  --host 0.0.0.0 \
+  --port ${PORT} \
+  --model stepfun-ai/Step-3.7-Flash-NVFP4 \
+  --served-model-name step3p7 \
+  --tensor-parallel-size 4 \
+  --gpu-memory-utilization 0.9 \
+  --enable-expert-parallel \
+  --trust-remote-code \
+  --quantization modelopt \
+  --kv-cache-dtype fp8 \
+  --max-model-len 8192 \
+  --reasoning-parser step3p5 \
+  --enable-auto-tool-choice \
+  --tool-call-parser step3p5 \
+  --async-scheduling
+  ```
+### 6.2 SGLang
+1. Install SGLang.
+```bash
+# via Docker
+docker pull lmsysorg/sglang:dev-step-3.7-flash
+# or from source (pip)
+pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
+```
+2. Launch the server.
+> **Note:** For Blackwell GPUs, `--mm-attention-backend fa4` may be used.
+- For BF16 model
+```bash
+sglang serve --model-path stepfun-ai/Step-3.7-Flash \
+  --tp 8 \
+  --reasoning-parser step3p5 \
+  --tool-call-parser step3p5 \
+  --enable-multimodal \
+  --speculative-algorithm EAGLE \
+  --speculative-num-steps 3 \
+  --speculative-eagle-topk 1 \
+  --speculative-num-draft-tokens 4 \
+  --enable-multi-layer-eagle \
+  --trust-remote-code \
+  --host 0.0.0.0 \
+  --port 8000
+```
+- For FP8 model
+```bash
+sglang serve --model-path stepfun-ai/Step-3.7-Flash-FP8 \
+  --tp 8 \
+  --ep 4 \
+  --reasoning-parser step3p5 \
+  --tool-call-parser step3p5 \
+  --enable-multimodal \
+  --speculative-algorithm EAGLE \
+  --speculative-num-steps 3 \
+  --speculative-eagle-topk 1 \
+  --speculative-num-draft-tokens 4 \
+  --enable-multi-layer-eagle \
+  --trust-remote-code \
+  --host 0.0.0.0 \
+  --port 8000
+```
+- For NVFP4 model
+```bash
+sglang serve --model-path stepfun-ai/Step-3.7-Flash-NVFP4 \
+  --tp 4 --ep 4 \
+  --moe-runner-backend flashinfer_trtllm \
+  --kv-cache-dtype fp8_e4m3 \
+  --quantization modelopt_fp4 \
+  --trust-remote-code \
+  --reasoning-parser step3p5 \
+  --tool-call-parser step3p5 \
+  --attention-backend trtllm_mha
+```
+### 6.3 Transformers (Debug / Verification)
+Use this snippet for quick functional verification. For high-throughput serving, use vLLM or SGLang.
+> **Note:** Deployment of this model requires `transformers` 5.0 or later.
+```python
+from transformers import AutoProcessor, AutoModelForCausalLM
+MODEL_PATH = "<MODEL_PATH_OR_HF_ID>"
+# 1. Setup
+processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_PATH,
+    device_map="auto",
+    dtype="auto",
+    trust_remote_code=True
+)
+# 2. Prepare Input
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://example.com/photo.jpg"},
+            {"type": "text", "text": "What is in this picture?"}
+        ]
+    },
+]
+inputs = processor.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+# 3. Generate
+generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+output_text = processor.decode(generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+print(output_text)
+```
+### 6.4 llama.cpp
+**System Requirements**
+GGUF Model Weights:
+| Component | Quantization | File Size |
+|---|---|---|
+| Language Model | Q4_K_S | 111.5 GB |
+| Language Model | IQ4_XS | 104.99 GB |
+| Language Model | Q3_K_L | 102.5 GB |
+| Multimodal Projector | FP16 | 3.97 GB |
+- **Runtime Overhead:** ~7 GB
+- **Minimum unified memory / VRAM:** 120 GB (e.g., Mac Studio, NVIDIA DGX Station, AMD Ryzen AI Max+ 395)
+- **Recommended:** 128 GB unified memory
+**Steps**
+1. Clone StepFun's llama.cpp fork and check out the `step3.7` branch:
+```bash
+git clone https://github.com/stepfun-ai/llama.cpp.git
+cd llama.cpp
+git checkout -b step3.7 origin/step3.7
+```
+2. Build llama.cpp on Mac:
+```bash
+cmake -B build-macos -S . \
+    -DCMAKE_BUILD_TYPE=Release \
+    -DBUILD_SHARED_LIBS=ON \
+    -DLLAMA_BUILD_SERVER=ON \
+    -DLLAMA_BUILD_TESTS=ON \
+    -DGGML_METAL=ON \
+    -DGGML_METAL_EMBED_LIBRARY=ON \
+    -DGGML_BLAS=ON \
+    -DGGML_BLAS_VENDOR=Apple \
+    -DGGML_ACCELERATE=ON \
+    -DGGML_NATIVE=ON
+cmake --build build-macos -j8
+```
+3. Build llama.cpp on DGX-Spark:
+```bash
+cmake -S . -B build-cuda \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DGGML_CUDA=ON \
+  -DGGML_CUDA_GRAPHS=ON \
+  -DGGML_CUDA_FORCE_MMQ=ON \
+  -DLLAMA_OPENSSL=OFF \
+  -DLLAMA_BUILD_COMMON=ON \
+  -DLLAMA_BUILD_TOOLS=ON \
+  -DLLAMA_BUILD_SERVER=ON \
+  -DLLAMA_BUILD_EXAMPLES=OFF \
+  -DLLAMA_BUILD_TESTS=OFF
+cmake --build build-cuda -j8
+```
+4. Build llama.cpp on AMD Windows:
+```bash
+cmake -S . -B build-vulkan \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DGGML_VULKAN=ON \
+  -DGGML_NATIVE=ON \
+  -DLLAMA_BUILD_SERVER=ON \
+  -DLLAMA_BUILD_UI=OFF \
+  -DLLAMA_BUILD_TOOLS=ON
+cmake --build build-vulkan -j8
+```
+5. Run with `llama-cli`:
+```bash
+./llama-cli -m Step3.7_Q4_K_S.gguf -b 2048 -ub 2048 -fa on --temp 1.0 -p "What's your name?"
+```
+6. Test performance with `llama-batched-bench`:
+```bash
+./llama-batched-bench -m step3.7_Q4_K_S.gguf -c 32768 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1
+```
+## 7. Using Step 3.7 Flash on Agent Platforms
+You can use Step 3.7 Flash on agent platforms such as Hermes Agent, OpenClaw, and Kilo Code, among others.
+## 8. Getting in Touch
+As we work to shape the future of AGI by expanding broad model capabilities, we want to ensure we are solving the right problems. We invite you to be part of this continuous feedback loop — your insights directly influence our priorities.
+- **Join the Conversation:** Our [Discord](https://discord.gg/RcMJhNVAQc) community is the primary hub for brainstorming future architectures, proposing capabilities, and getting early access updates 🚀
+- **Report Friction:** Encountering limitations? You can open an issue or start a discussion on GitHub / HuggingFace, or flag it directly in our Discord support channels.
+## 📄 License
+This project is open-sourced under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
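
Section 6.4 of the README lists GGUF file sizes, a runtime overhead of about 7 GB, and a 120 GB minimum / 128 GB recommended memory budget. The sketch below adds those figures up for each quantization. It assumes the footprint is simply weights plus projector plus the stated overhead, which ignores KV cache growth with context length, and the function name `estimated_footprint_gb` is hypothetical.

```python
# File sizes in GB, copied from the GGUF table in README section 6.4.
LANGUAGE_MODEL_GB = {"Q4_K_S": 111.5, "IQ4_XS": 104.99, "Q3_K_L": 102.5}
PROJECTOR_GB = 3.97        # multimodal projector, FP16
RUNTIME_OVERHEAD_GB = 7.0  # the README's approximate figure


def estimated_footprint_gb(quant: str) -> float:
    """Naive sum: language model + projector + runtime overhead (assumption)."""
    return round(LANGUAGE_MODEL_GB[quant] + PROJECTOR_GB + RUNTIME_OVERHEAD_GB, 2)


for quant in LANGUAGE_MODEL_GB:
    total = estimated_footprint_gb(quant)
    for budget_gb in (120, 128):
        verdict = "fits within" if total <= budget_gb else "exceeds"
        print(f"{quant}: ~{total} GB {verdict} {budget_gb} GB")
```

Under this naive sum, Q4_K_S comes to roughly 122.5 GB, which is above the stated 120 GB minimum but below the 128 GB recommendation, while IQ4_XS and Q3_K_L stay under 120 GB.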

assets/benchmarks.png ADDED

Git LFS Details

SHA256: 3d26171162c0421a57c6c2c22074b9b276b626c5d90fe3a62e9fceb8ad988ae7
Pointer size: 131 Bytes
Size of remote file: 322 kB
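
The 131-byte pointer size is consistent with the standard Git LFS pointer layout: a version line, an `oid sha256:` line, and a `size` line. The sketch below rebuilds such a pointer from the SHA256 shown above. The exact byte count of the image is not shown on this page, so `ASSUMED_SIZE_BYTES` is a hypothetical six-digit value chosen to match "322 kB"; any six-digit size gives the same pointer length.

```python
def lfs_pointer(oid_sha256: str, size_bytes: int) -> str:
    """Build the text of a Git LFS pointer file (spec v1 layout)."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid_sha256}\n"
        f"size {size_bytes}\n"
    )


# SHA256 copied from the LFS details above.
OID = "3d26171162c0421a57c6c2c22074b9b276b626c5d90fe3a62e9fceb8ad988ae7"
# Assumption: the page only reports "322 kB", so this exact value is hypothetical.
ASSUMED_SIZE_BYTES = 322_000

pointer = lfs_pointer(OID, ASSUMED_SIZE_BYTES)
print(pointer)
print(len(pointer.encode("utf-8")))  # 131
```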

chat_template.jinja ADDED

	@@ -0,0 +1,89 @@

+{% macro render_message_content(message) %}{% if message.content is none %}{{- '' }}{% elif message.content is string %}{{- message.content }}{% elif message.content is mapping %}{{- message.content['value'] if 'value' in message.content else message.content['text'] }}{% elif message.content is iterable %}{% set ns = namespace(needs_text_separator=false) %}{% for item in message.content %}{% if item.type == 'text' %}{% if ns.needs_text_separator %}{{- ' ' }}{% endif %}{{- item['value'] if 'value' in item else item['text'] }}{% set ns.needs_text_separator = true %}{% elif item.type == 'image' %}<im_patch>{% set ns.needs_text_separator = false %}{% endif %}{% endfor %}{% endif %}{% endmacro %}
+{{bos_token}}{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if reasoning_effort is defined %}
+        {{- "Reasoning: " + reasoning_effort + '\n\n' }}
+    {%- endif %}
+    {%- if messages[0].role == 'system' %}
+        {{- render_message_content(messages[0]) + '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou have access to the following functions in JSONSchema format:\n\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson(ensure_ascii=False) }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...>\n...\n</function> block must be nested within <tool_call>\n...\n</tool_call> XML tags\n- Required parameters MUST be specified\n</IMPORTANT><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' }}
+        {%- if reasoning_effort is defined %}
+            {{- "Reasoning: " + reasoning_effort + '\n\n' }}
+        {%- endif %}
+        {{- render_message_content(messages[0]) + '<|im_end|>\n' }}
+    {%- elif reasoning_effort is defined %}
+        {{- '<|im_start|>system\n' + "Reasoning: " + reasoning_effort + '\n\n' + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" and render_message_content(message) is string and not(render_message_content(message).startswith('<tool_response>') and render_message_content(message).endswith('</tool_response>')) %}
+        {%- set ns.multi_step_tool = false %}
+        {%- set ns.last_query_index = index %}
+    {%- endif %}
+{%- endfor %}
+{%- for message in messages %}
+    {%- set content = render_message_content(message) %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+        {%- set role_name = 'observation' if (message.role == "system" and not loop.first and message.name == 'observation') else message.role %}
+        {{- '<|im_start|>' + role_name + '\n' + content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+            {%- else %}
+                {%- set reasoning_content = '' %}
+            {%- endif %}
+        {%- endif %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n' + content }}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '\n' + content }}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if tool_call.function is defined %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
+                {%- if tool_call.arguments is defined %}
+                    {%- set arguments = tool_call.arguments | fromjson if tool_call.arguments is string else tool_call.arguments %}
+                    {%- for args_name, args_value in arguments|items %}
+                        {{- '<parameter=' + args_name + '>\n' }}
+                        {%- set args_value = args_value | tojson(ensure_ascii=False) | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
+                        {{- args_value }}
+                        {{- '\n</parameter>\n' }}
+                    {%- endfor %}
+                {%- endif %}
+                {{- '</function>\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>tool_response\n' }}
+        {%- endif %}
+        {{- '<tool_response>' }}
+        {{- content }}
+        {{- '</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n<think>\n' }}
+{%- endif %}
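
To make the template's behavior easier to follow, here is a plain-Python sketch of its simplest path: no tools, and every message content is a string. It mirrors the template as read from this diff and is not the code Transformers actually runs. The function name `render_prompt` is hypothetical, `bos_token` defaults to an empty string because its real value is not shown on this page, and `reasoning_effort=None` stands in for the template's "not defined" case.

```python
def render_prompt(messages, reasoning_effort=None,
                  add_generation_prompt=True, bos_token=""):
    """Approximate the chat template for text-only messages without tools."""
    out = [bos_token]

    # System header: emitted if there is a leading system message
    # or a reasoning effort was supplied.
    has_system = bool(messages) and messages[0]["role"] == "system"
    if has_system or reasoning_effort is not None:
        out.append("<|im_start|>system\n")
        if reasoning_effort is not None:
            out.append(f"Reasoning: {reasoning_effort}\n\n")
        if has_system:
            out.append(messages[0]["content"])
        out.append("<|im_end|>\n")

    # Index of the last real user query (not a wrapped tool response).
    last_query = len(messages) - 1
    for i in range(len(messages) - 1, -1, -1):
        m = messages[i]
        c = m["content"]
        wrapped = c.startswith("<tool_response>") and c.endswith("</tool_response>")
        if m["role"] == "user" and not wrapped:
            last_query = i
            break

    for i, m in enumerate(messages):
        role, content = m["role"], m["content"]
        if role == "user" or (role == "system" and i > 0):
            is_obs = role == "system" and m.get("name") == "observation"
            name = "observation" if is_obs else role
            out.append(f"<|im_start|>{name}\n{content}<|im_end|>\n")
        elif role == "assistant":
            reasoning = m.get("reasoning_content")
            if not isinstance(reasoning, str):
                if "</think>" in content:
                    head = content.split("</think>")[0].rstrip("\n")
                    reasoning = head.split("<think>")[-1].lstrip("\n")
                    content = content.split("</think>")[-1].lstrip("\n")
                else:
                    reasoning = ""
            # Reasoning is kept only for turns after the last user query.
            if i > last_query:
                out.append(
                    f"<|im_start|>assistant\n<think>\n{reasoning}\n</think>\n{content}"
                )
            else:
                out.append(f"<|im_start|>assistant\n{content}")
            out.append("<|im_end|>\n")

    if add_generation_prompt:
        out.append("<|im_start|>assistant\n<think>\n")
    return "".join(out)


print(render_prompt([{"role": "user", "content": "Hi"}], reasoning_effort="low"))
```

Two points stand out in this reading: the reasoning level travels in the system block as a `Reasoning: <level>` line, and `<think>` content from earlier assistant turns is dropped once a newer user query follows them.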

config.json ADDED
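
The long per-layer lists in this file's `text_config` follow a regular period-4 pattern: every fourth entry, starting at index 0, is a `full_attention` layer with `partial_rotary_factors` 0.5 and `rope_theta` 5000000.0, and the other three are `sliding_attention` layers with 1.0 and 10000.0. The sketch below regenerates the three lists from that rule. The rule is inferred from the values in the diff rather than taken from StepFun's code, and the names `NUM_ENTRIES` and `is_full_attention` are hypothetical.

```python
# Each per-layer list in text_config has 48 entries in the diff, which
# equals num_hidden_layers (45) + num_nextn_predict_layers (3).
NUM_ENTRIES = 48


def is_full_attention(index: int) -> bool:
    """Inferred rule: indices 0, 4, 8, ... use full attention."""
    return index % 4 == 0


layer_types = [
    "full_attention" if is_full_attention(i) else "sliding_attention"
    for i in range(NUM_ENTRIES)
]
partial_rotary_factors = [
    0.5 if is_full_attention(i) else 1.0 for i in range(NUM_ENTRIES)
]
rope_theta = [
    5000000.0 if is_full_attention(i) else 10000.0 for i in range(NUM_ENTRIES)
]

print(layer_types[:5])
print(layer_types.count("full_attention"), layer_types.count("sliding_attention"))
```

The `swiglu_limits` and `swiglu_limits_shared` lists do not follow this rule: they are zero everywhere except at indices 43 and 44 (7 and 16 respectively), so they are left out of the sketch.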

	@@ -0,0 +1,410 @@

+{
+  "architectures": [
+    "Step3p7ForConditionalGeneration"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_step3p7.Step3p7Config",
+    "AutoModelForCausalLM": "modeling_step3p7.Step3p7ForConditionalGeneration",
+    "AutoProcessor": "processing_step3.Step3VLProcessor"
+  },
+  "hidden_size": 4096,
+  "im_end_token": "<im_end>",
+  "im_patch_token": "<im_patch>",
+  "im_start_token": "<im_start>",
+  "image_token_id": 128001,
+  "image_token_len": 169,
+  "max_position_embeddings": 262144,
+  "model_type": "step3p7",
+  "pad_token_id": 2,
+  "patch_token_len": 81,
+  "projector_bias": false,
+  "text_config": {
+    "architectures": [
+      "Step3p5ForCausalLM"
+    ],
+    "att_impl_type": "GQA",
+    "attention_dropout": 0.0,
+    "attention_other_setting": {
+      "attention_type": "sliding_attention",
+      "head_dim": 128,
+      "num_attention_groups": 8,
+      "num_attention_heads": 96,
+      "true_head_dim": 128
+    },
+    "bos_token_id": 0,
+    "torch_dtype": "bfloat16",
+    "eos_token_id": [
+      1,
+      2,
+      128007
+    ],
+    "head_dim": 128,
+    "hidden_size": 4096,
+    "intermediate_size": 11264,
+    "layer_types": [
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention"
+    ],
+    "max_position_embeddings": 262144,
+    "max_seq_len": 262144,
+    "model_type": "step3p5",
+    "moe_every_n_layer": 1,
+    "moe_intermediate_size": 1280,
+    "moe_layer_offset": 0,
+    "moe_layers_enum": "3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44",
+    "moe_num_experts": 288,
+    "moe_router_activation": "sigmoid",
+    "moe_router_scaling_factor": 3.0,
+    "moe_top_k": 8,
+    "need_fp32_gate": true,
+    "norm_expert_weight": true,
+    "num_attention_groups": 8,
+    "num_attention_heads": 64,
+    "num_hidden_layers": 45,
+    "num_nextn_predict_layers": 3,
+    "pad_token_id": 1,
+    "partial_rotary_factors": [
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0
+    ],
+    "rms_norm_eps": 1e-05,
+    "rope_parameters": {
+      "factor": 2.0,
+      "high_freq_factor": 32.0,
+      "low_freq_factor": 1.0,
+      "original_max_position_embeddings": 131072,
+      "rope_theta": [
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0,
+        5000000.0,
+        10000.0,
+        10000.0,
+        10000.0
+      ],
+      "rope_type": "llama3"
+    },
+    "rope_theta": [
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0
+    ],
+    "share_expert_dim": 1280,
+    "sink": false,
+    "sliding_window": 512,
+    "swiglu_limits": [
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      7,
+      7,
+      0.0,
+      0.0,
+      0.0
+    ],
+    "swiglu_limits_shared": [
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      16,
+      16,
+      0.0,
+      0.0,
+      0.0
+    ],
+    "use_head_wise_attn_gate": true,
+    "use_mfa": false,
+    "use_moe": true,
+    "use_moe_router_bias": true,
+    "use_qk_norm": false,
+    "use_rope_layers": [],
+    "vocab_size": 128896,
+    "yarn_only_types": [
+      "full_attention"
+    ]
+  },
+  "transformers_version": "5.10.0.dev0",
+  "understand_projector_stride": 2,
+  "unsloth_fixed": true,
+  "use_im_start_end": "true",
+  "vision_config": {
+    "heads": 16,
+    "hidden_act": "quick_gelu",
+    "image_size": 728,
+    "layer_norm_eps": 1e-05,
+    "layers": 47,
+    "ls_init_value": 0.1,
+    "mlp_ratio": 5.833333333333333,
+    "model_type": "perception_encoder",
+    "num_channels": 3,
+    "output_dim": null,
+    "patch_size": 14,
+    "pool_type": "none",
+    "ues_cls_token": false,
+    "use_abs_posemb": true,
+    "use_cls_token": false,
+    "use_ln_post": false,
+    "use_ln_pre": true,
+    "use_rope2d": true,
+    "width": 1536
+  },
+  "vision_select_layer": -1
+}
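As a quick sanity check on the `vision_config` values above, the sketch below derives the patch grid and MLP width they imply. It assumes the usual ViT conventions (non-overlapping square patches, `mlp_ratio` multiplying `width`, `width` split evenly across `heads`), which this diff does not state explicitly. The helper name `vit_shapes` is hypothetical and not part of the repository.

```python
def vit_shapes(image_size: int, patch_size: int, width: int,
               mlp_ratio: float, heads: int) -> dict:
    """Derive shapes implied by a ViT-style config (assumed conventions)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size          # patches per side
    return {
        "grid": grid,
        "num_patches": grid * grid,          # patches per image, no CLS token
        "mlp_hidden": round(width * mlp_ratio),
        "head_dim": width // heads,
    }


# Values copied from the vision_config block of config.json.
shapes = vit_shapes(image_size=728, patch_size=14, width=1536,
                    mlp_ratio=5.833333333333333, heads=16)
print(shapes)
```

Under those assumptions the config describes a 52 x 52 grid (2704 patches), an MLP hidden width of 8960 (matching the `8960/1536` default in `configuration_step3p7.py`), and 96 channels per head.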

configuration_step3p7.py ADDED

	@@ -0,0 +1,207 @@

+from typing import Any, Optional, Sequence, Union
+from transformers.configuration_utils import PretrainedConfig
+class StepRoboticsVisionEncoderConfig(PretrainedConfig):
+    model_type = "perception_encoder"
+    def __init__(
+        self,
+        width=1536,
+        layers=47,
+        heads=16,
+        num_channels=3,
+        image_size=728,
+        mlp_ratio=8960 / 1536,
+        patch_size=14,
+        hidden_act="quick_gelu",
+        layer_norm_eps=1e-5,
+        ues_cls_token=False,
+        use_cls_token: Optional[bool] = None,
+        use_ln_pre=True,
+        use_ln_post=False,
+        use_abs_posemb=True,
+        use_rope2d=True,
+        ls_init_value=0.1,
+        **kwargs,
+    ):
+        self.width = width
+        self.layers = layers
+        self.heads = heads
+        self.num_channels = num_channels
+        self.patch_size = patch_size
+        self.image_size = image_size
+        self.mlp_ratio = mlp_ratio
+        self.layer_norm_eps = layer_norm_eps
+        self.hidden_act = hidden_act
+        if use_cls_token is None:
+            use_cls_token = ues_cls_token
+        self.ues_cls_token = use_cls_token
+        self.use_cls_token = use_cls_token
+        self.use_ln_pre = use_ln_pre
+        self.ls_init_value = ls_init_value
+        self.use_ln_post = use_ln_post
+        self.use_abs_posemb = use_abs_posemb
+        self.use_rope2d = use_rope2d
+        super().__init__(**kwargs)
+class Step3p7TextConfig(PretrainedConfig):
+    model_type = "step3p5"
+    architectures = ["Step3p5ForCausalLM"]
+    def __init__(
+        self,
+        hidden_size: int = 4096,
+        intermediate_size: int = 11264,
+        num_attention_heads: int = 64,
+        num_attention_groups: int = 8,
+        num_hidden_layers: int = 45,
+        max_seq_len: int = 128000,
+        vocab_size: int = 128815,
+        rms_norm_eps: float = 1e-5,
+        moe_intermediate_size: int = 1280,
+        moe_num_experts: int = 288,
+        moe_top_k: int = 8,
+        rope_theta: float = 10000,
+        rope_scaling: Optional[dict[str, Any]] = None,
+        max_position_embeddings: int = 128000,
+        share_expert_dims: int = 1280,
+        share_expert_dim: Optional[int] = None,
+        head_dim: int = 128,
+        norm_expert_weight: bool = True,
+        layer_types: Optional[list[str]] = None,
+        sliding_window: Optional[int] = None,
+        pad_token_id: int = 1,
+        attention_dropout: float = 0.0,
+        use_head_wise_attn_gate: bool = False,
+        use_moe_router_bias: bool = False,
+        moe_router_activation: str = "softmax",
+        moe_router_scaling_factor: float = 1.0,
+        need_fp32_gate: bool = False,
+        attention_other_setting: Optional[dict[str, Any]] = None,
+        swiglu_limits: Optional[list[Optional[float]]] = None,
+        swiglu_limits_shared: Optional[list[Optional[float]]] = None,
+        use_rope_layers: Optional[list[bool]] = None,
+        yarn_only_types: Optional[list[str]] = None,
+        moe_layers_enum: tuple[int, ...] = (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+                                       15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
+                                       25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
+                                       35, 36, 37, 38, 39, 40, 41, 42, 43, 44),
+        **kwargs,
+    ) -> None:
+        torch_dtype = kwargs.get("torch_dtype")
+        trim_layer_types = _normalize_per_layer_values(layer_types,
+                                                  num_hidden_layers)
+        if isinstance(rope_scaling, dict):
+            rope_scaling = dict(rope_scaling)
+        if share_expert_dim is None:
+            share_expert_dim = share_expert_dims
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_attention_heads = num_attention_heads
+        self.num_attention_groups = num_attention_groups
+        self.num_hidden_layers = num_hidden_layers
+        self.max_seq_len = max_seq_len
+        self.vocab_size = vocab_size
+        self.rms_norm_eps = rms_norm_eps
+        self.moe_intermediate_size = moe_intermediate_size
+        self.moe_num_experts = moe_num_experts
+        self.moe_top_k = moe_top_k
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.max_position_embeddings = max_position_embeddings
+        self.share_expert_dim = share_expert_dim
+        self.head_dim = head_dim
+        self.norm_expert_weight = norm_expert_weight
+        self.moe_layers_enum = moe_layers_enum
+        self.layer_types = trim_layer_types
+        self.sliding_window = sliding_window
+        self.pad_token_id = pad_token_id
+        self.attention_dropout = attention_dropout
+        self.use_head_wise_attn_gate = use_head_wise_attn_gate
+        self.use_moe_router_bias = use_moe_router_bias
+        self.moe_router_activation = moe_router_activation
+        self.moe_router_scaling_factor = moe_router_scaling_factor
+        self.need_fp32_gate = need_fp32_gate
+        self.attention_other_setting = attention_other_setting
+        self.swiglu_limits = swiglu_limits
+        self.swiglu_limits_shared = swiglu_limits_shared
+        self.use_rope_layers = use_rope_layers
+        self.yarn_only_types = yarn_only_types
+        super().__init__(**kwargs)
+        if torch_dtype is not None:
+            self.torch_dtype = torch_dtype
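+        # Note: this overwrites the trimmed value assigned before super().__init__()
+        # with the original, untrimmed `layer_types` argument.
+        self.layer_types = layer_types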
+    def to_dict(self):
+        output = super().to_dict()
+        torch_dtype = getattr(self, "torch_dtype", None)
+        if torch_dtype is not None:
+            output["torch_dtype"] = torch_dtype
+        return output
+def _normalize_per_layer_values(
+    values: Optional[Sequence[Any]],
+    num_hidden_layers: int,
+) -> Optional[list[Any]]:
+    if values is None:
+        return None
+    normalized = list(values)
+    if not normalized:
+        return normalized
+    if len(normalized) < num_hidden_layers:
+        normalized.extend([normalized[-1]] *
+                          (num_hidden_layers - len(normalized)))
+    # Some checkpoints keep MTP/spec layer entries after the decoder layers.
+    # This config only builds num_hidden_layers decoder layers, and HF strict
+    # validation requires per-layer fields to match that decoder count.
+    return normalized[:num_hidden_layers]
+class Step3p7Config(PretrainedConfig):
+    # This loader is a compatibility shim for original Step VL checkpoints
+    # whose top-level config model_type is `step3p7`.
+    model_type = "step3p7"
+    def __init__(
+        self,
+        vision_config: Optional[Union[dict, StepRoboticsVisionEncoderConfig]] = None,
+        text_config: Optional[Union[dict, Step3p7TextConfig]] = None,
+        understand_projector_stride: int = 2,
+        projector_bias: bool = False,
+        image_token_id: int = 151679,
+        **kwargs,
+    ) -> None:
+        shared_rope_scaling = kwargs.get("rope_scaling")
+        if isinstance(shared_rope_scaling, dict):
+            shared_rope_scaling = dict(shared_rope_scaling)
+        if vision_config is None:
+            vision_config = StepRoboticsVisionEncoderConfig()
+        elif isinstance(vision_config, dict):
+            vision_config = StepRoboticsVisionEncoderConfig(**vision_config)
+        self.vision_config = vision_config
+        if text_config is None:
+            text_config = Step3p7TextConfig(rope_scaling=shared_rope_scaling)
+        elif isinstance(text_config, dict):
+            text_config = dict(text_config)
+            if shared_rope_scaling is not None and "rope_scaling" not in text_config:
+                text_config["rope_scaling"] = shared_rope_scaling
+            text_config = Step3p7TextConfig(**text_config)
+        elif shared_rope_scaling is not None and text_config.rope_scaling is None:
+            text_config.rope_scaling = dict(shared_rope_scaling)
+        self.text_config = text_config
+        rope_scaling = kwargs.get("rope_scaling")
+        if isinstance(rope_scaling, dict):
+            kwargs["rope_scaling"] = dict(rope_scaling)
+        self.understand_projector_stride = understand_projector_stride
+        self.projector_bias = projector_bias
+        self.hidden_size = text_config.hidden_size
+        self.max_position_embeddings = text_config.max_position_embeddings
+        self.image_token_id = image_token_id
+        # Help Auto classes find the correct implementation when saving/loading.
+        super().__init__(**kwargs)
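To make the padding and trimming behaviour of `_normalize_per_layer_values` concrete, here is a self-contained sketch that reproduces its logic and runs it on two inputs. The 48-entry list mirrors the `rope_theta` pattern in `config.json`, used here purely for illustration, since the config class only calls the helper on `layer_types`. The `"sliding_attention"` label is a placeholder and not taken from this diff.

```python
from typing import Any, Optional, Sequence


def normalize_per_layer_values(
    values: Optional[Sequence[Any]],
    num_hidden_layers: int,
) -> Optional[list[Any]]:
    # Same logic as `_normalize_per_layer_values`, copied so the example runs alone.
    if values is None:
        return None
    normalized = list(values)
    if not normalized:
        return normalized
    if len(normalized) < num_hidden_layers:
        # Short lists are padded by repeating the last entry.
        normalized.extend([normalized[-1]] * (num_hidden_layers - len(normalized)))
    # Long lists (e.g. with trailing MTP/spec entries) are cut to the decoder count.
    return normalized[:num_hidden_layers]


# 48 entries: one high-theta layer followed by three low-theta layers, 12 times.
per_layer_theta = [5000000.0, 10000.0, 10000.0, 10000.0] * 12
trimmed = normalize_per_layer_values(per_layer_theta, 45)
padded = normalize_per_layer_values(["full_attention", "sliding_attention"], 4)
print(len(trimmed), padded)
```

The trimmed list keeps the first 45 entries, so the three trailing entries that would correspond to layers 45 to 47 are dropped.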

model-00001.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5a2d47133d0ffa22f50a24ad4974c559c1b31f26f5baca24fc4f4dfe198b46c6
+size 924094096

model-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:67c13067deed696b62763643b7d531fd2cfde4c6e81cfcaba5460551e510d0af
+size 9808156008

model-00003.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6f3567584681f4d2792e4d949c9440198f792a5afd93220d3770b509728b6ef1
+size 18557475928

model-00004.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d035fb813758ed63f1d537bbf41f6cbb2c5c8eb05f187de18a448c7766a64960
+size 18624846944

model-00005.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f9a2c0daa3a49fc88e53e0b6419f2e4db7e412f40760488d49ca0f834fe83725
+size 18557475928

model-00006.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7fee76c5fb28547ad0d4094a0bae7755a292dd439cc23b054210a24c965b093f
+size 18624846976

model-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ccad5d228ec280d95419fbbcf2590f2cdfc4c932a7249a7669dc7f509dc7fe66
+size 18557475968

model-00008.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4d537acabde8deace533c23df8e43268f1423b41e7b6e27c79232955283f4e44
+size 18624846976

model-00009.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:48be665fd9bce6e2fdac06d03a1a9916794fce4231b03009e6a4cfca1055a2c9
+size 18557475968

model-00010.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dd61c7f6d62725005a07fe778dc572b9642972054424b2a12d1494e7ca241d91
+size 18624846976

model-00011.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:51c5fe0dce035dd7fc01333fe3ba0fff46e65412ad7a71c09fa8e2992b8d26a7
+size 18557475968

model-00012.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0f3e890ede3949af958a72da0beb99db6834853ee22978eb7782a600d013abac
+size 18624846976

model-00013.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:98802ed9091498df2ef7a73b2697f5ac275a64892d984b9045a0a99f7b459c78
+size 18557475968

model-00014.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:459e5814b710f888b6763385fb179d52f746f59e702dd165f0c5d5cc73417b03
+size 18624846976

model-00015.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:13a51f345afa384b930387d40ac79ed6614f02129d61a9714e213f726970f47c
+size 18557475968

model-00016.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3475a9dcaff31af71b6183371f8e355bdedea5f4dbb1ade6e84dcfe28ddc9517
+size 18624846976

model-00017.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:92917af53ef59cd99d43d49de2ffcbec3d21db7ebc59107a66aa2438da2eca14
+size 18557475968

model-00018.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aba73fb3d39556bba83fe864f7a7b60e8b2085204b074101500531e69525ee4f
+size 18624846976

model-00019.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:617c98c96871403936caa0dcea602e7650cb947493555c142dc80e6c991adad8
+size 18557475968

model-00020.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1ccea8f04adaeeb446b8def20c6042c96f6da4eb68da6bf2a76bacf65350e4e9
+size 18624846976

model-00021.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:af8c9ca65f1830163f6d5741569b4dd4c62468a1c21556e7b760e303bc3b7818
+size 18557475968

model-00022.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0cc5137141b5e2522fd3e69a4c828a0dbb602569ab8a0afcce5151b06800339f
+size 18624846976

model-00023.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:05c2c2a08df421f617794e137429246a6ea60dd908fc691263242a12325dae7f
+size 9245052456

model-00024.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7688adfc7748c12fdc8504187c57fe6ec6005798a02defc0d3372f921b1400a1
+size 6968188464

model-vit-00001.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:22aa3f3679feffb57c2fb0bc885db0f5613db3536efef5d4b0984e8d769f6017
+size 1613990904

model-vit-00002.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1f63ca4700a4184459d3ddb3a86c54a62914d359cedfddcfc14739ae782be082
+size 2348122376
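Each `.safetensors` entry above is a Git LFS pointer rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. A minimal sketch of reading one such pointer follows. The function name `parse_lfs_pointer` is hypothetical, and the sample text is the `model-vit-00002.safetensors` pointer with the diff's leading `+` removed.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the three-line Git LFS pointer format shown in this diff."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")   # split on the first space only
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),          # size of the real file, in bytes
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:1f63ca4700a4184459d3ddb3a86c54a62914d359cedfddcfc14739ae782be082
size 2348122376
"""
info = parse_lfs_pointer(pointer)
print(info["algo"], info["size"] / 1e9, "GB")
```

Summing `size` over all pointers gives the download footprint of the checkpoint without fetching any shard.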

model.safetensors.index.json ADDED

The diff for this file is too large to render. See raw diff

modeling_step3p7.py ADDED

	@@ -0,0 +1,1405 @@

+# Copyright 2025 The LLAMA4 and HuggingFace Inc. team. All rights reserved.
+#
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import copy
+import inspect
+from dataclasses import dataclass
+from typing import Callable, Literal, Optional, Tuple, TypedDict, Union
+from PIL import Image
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache
+from transformers.generation import GenerationMixin
+from transformers.masking_utils import (
+    create_causal_mask,
+    create_sliding_window_causal_mask,
+)
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.modeling_layers import GradientCheckpointingLayer
+from transformers.modeling_outputs import BaseModelOutputWithPast, ModelOutput
+from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from transformers.processing_utils import Unpack
+from transformers.utils import TransformersKwargs, can_return_tuple, logging
+from .configuration_step3p7 import Step3p7Config, Step3p7TextConfig
+from .vision_encoder import StepRoboticsVisionEncoder
+logger = logging.get_logger(__name__)
+_MASK_INPUT_EMBEDS_ARG = (
+    "inputs_embeds"
+    if "inputs_embeds" in inspect.signature(create_causal_mask).parameters
+    else "input_embeds"
+)
+__all__ = [
+    "Step3p7Model",
+]
+class StepVLImagePixelInputs(TypedDict):
+    type: Literal["pixel_values"]
+    pixel_values: torch.Tensor
+    patch_pixel_values: Optional[torch.Tensor]
+    num_patches: list[int]
+class StepVLImageEmbeddingInputs(TypedDict):
+    type: Literal["image_embeds"]
+    image_embeds: torch.Tensor
+StepVLImageInputs = Union[StepVLImagePixelInputs, StepVLImageEmbeddingInputs]
+def _flatten_embeddings(embeddings) -> torch.Tensor:
+    """
+    Recursively flattens and concatenates NestedTensors on all but the last
+    dimension.
+    """
+    if isinstance(embeddings, torch.Tensor):
+        # Flatten all but the last dimension.
+        return embeddings.flatten(0, -2)
+    return torch.cat(tuple(_flatten_embeddings(t) for t in embeddings))
+def _embedding_count_expression(embeddings) -> str:
+    """
+    Constructs a debugging representation of the number of embeddings in the
+    NestedTensors.
+    """
+    if isinstance(embeddings, torch.Tensor):
+        return " x ".join([str(dim) for dim in embeddings.shape[:-1]])
+    return " + ".join(_embedding_count_expression(inner) for inner in embeddings)
+def _merge_multimodal_embeddings(
+    inputs_embeds: torch.Tensor,
+    is_multimodal: torch.Tensor,
+    multimodal_embeddings,
+) -> torch.Tensor:
+    """
+    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
+    positions in ``inputs_embeds`` corresponding to placeholder tokens in
+    ``input_ids``.
+    Note:
+        This updates ``inputs_embeds`` in place.
+    """
+    num_expected_tokens = is_multimodal.sum().item()
+    assert isinstance(num_expected_tokens, int)
+    flattened = _flatten_embeddings(multimodal_embeddings)
+    if flattened.shape[0] != num_expected_tokens:
+        expr = _embedding_count_expression(multimodal_embeddings)
+        raise ValueError(
+            f"Attempted to assign {expr} = {flattened.shape[0]} "
+            f"multimodal tokens to {num_expected_tokens} placeholders"
+        )
+    is_multimodal = is_multimodal.to(inputs_embeds.device)
+    flattened = flattened.to(inputs_embeds.device)
+    inputs_embeds[is_multimodal] = flattened
+    return inputs_embeds
+def merge_multimodal_embeddings(
+    input_ids: torch.Tensor,
+    inputs_embeds: torch.Tensor,
+    multimodal_embeddings,
+    placeholder_token_id: Union[int, list[int]],
+) -> torch.Tensor:
+    """
+    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
+    positions in ``inputs_embeds`` corresponding to placeholder tokens in
+    ``input_ids``.
+    ``placeholder_token_id`` can be a list of token ids (e.g., token ids
+    of img_start, img_break, and img_end tokens) when needed. This means
+    the order of these tokens in the ``input_ids`` MUST MATCH the order of
+    their embeddings in ``multimodal_embeddings`` since we need to
+    slice-merge instead of individually scattering.
+    For example, if input_ids is "TTTTTSIIIBIIIBIIIETTT", where
+    - T is text token
+    - S is image start token
+    - I is image embedding token
+    - B is image break token
+    - E is image end token.
+    Then the image embeddings (that correspond to I's) from vision encoder
+    must be padded with embeddings of S, B, and E in the same order of
+    input_ids for a correct embedding merge.
+    Note:
+        This updates ``inputs_embeds`` in place.
+    """
+    if isinstance(placeholder_token_id, list):
+        placeholder_token_id = torch.tensor(
+            placeholder_token_id, device=input_ids.device
+        )
+        return _merge_multimodal_embeddings(
+            inputs_embeds,
+            torch.isin(input_ids, placeholder_token_id),
+            multimodal_embeddings,
+        )
+    return _merge_multimodal_embeddings(
+        inputs_embeds,
+        (input_ids == placeholder_token_id),
+        multimodal_embeddings,
+    )
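The merge contract described in the docstrings above (placeholder count must equal embedding count, rows are filled in order, and the input is modified in place) can be sketched without tensors. This is a list-based stand-in for illustration only, not the repository's implementation. `merge_placeholders` and the token id `IMG = 9` are hypothetical.

```python
def merge_placeholders(input_ids, inputs_embeds, multimodal_embeds, placeholder_ids):
    """List-based stand-in for the tensor merge: fill placeholder rows in order."""
    placeholder_ids = set(placeholder_ids)
    positions = [i for i, tok in enumerate(input_ids) if tok in placeholder_ids]
    if len(positions) != len(multimodal_embeds):
        # Same failure mode as the ValueError raised in _merge_multimodal_embeddings.
        raise ValueError(
            f"Attempted to assign {len(multimodal_embeds)} multimodal tokens "
            f"to {len(positions)} placeholders"
        )
    for pos, emb in zip(positions, multimodal_embeds):
        inputs_embeds[pos] = emb  # overwrites in place, like the original
    return inputs_embeds


IMG = 9  # hypothetical placeholder token id
ids = [1, 2, IMG, IMG, 3]
embeds = [[0.0], [0.0], [0.0], [0.0], [0.0]]
merged = merge_placeholders(ids, embeds, [[7.0], [8.0]], [IMG])
print(merged)
```

Because the fill is positional, the embeddings must arrive in the same order as their placeholders appear in `input_ids`, which is the ordering requirement the docstring spells out for multi-token placeholders.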
+class Step3p7PreTrainedModel(PreTrainedModel):
+    # Link this model family to its configuration class so PreTrainedModel.from_pretrained
+    # can load the config instead of failing with a NoneType error.
+    config_class = Step3p7Config
+    supports_gradient_checkpointing = True
+    _skip_keys_device_placement = ["past_key_values"]
+    _keys_to_ignore_on_load_unexpected = [
+        r"model\.layers\.45\..*",
+        r"model\.layers\.46\..*",
+        r"model\.layers\.47\..*",
+    ]
+    _supports_flash_attn = False
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _supports_static_cache = True
+    _supports_attention_backend = True
+    @classmethod
+    def from_pretrained(
+        cls, pretrained_model_name_or_path, *model_args, **kwargs
+    ):
+        key_mapping = getattr(cls, "_checkpoint_conversion_mapping", None)
+        if key_mapping is not None and kwargs.get("key_mapping") is None:
+            # Transformers only applies checkpoint renaming when key_mapping is
+            # passed explicitly; inheriting the class attribute alone is not enough.
+            kwargs["key_mapping"] = copy.deepcopy(key_mapping)
+        return super().from_pretrained(
+            pretrained_model_name_or_path, *model_args, **kwargs
+        )
+class Step3p7RotaryEmbedding(nn.Module):
+    def __init__(self, config: Step3p7TextConfig, device=None, layer_idx=None):
+        super().__init__()
+        self.layer_idx = layer_idx
+        self.max_seq_len_cached = config.max_position_embeddings
+        self.original_max_seq_len = config.max_position_embeddings
+        rope_theta = config.rope_theta
+        if isinstance(rope_theta, list):
+            rope_theta = rope_theta[0 if layer_idx is None else layer_idx]
+        partial_rotary_factor = getattr(config, "partial_rotary_factor", 1.0)
+        partial_rotary_factors = getattr(config, "partial_rotary_factors", None)
+        if partial_rotary_factors is not None:
+            partial_rotary_factor = partial_rotary_factors[
+                0 if layer_idx is None else layer_idx
+            ]
+        self.rope_theta = rope_theta
+        self.partial_rotary_factor = partial_rotary_factor
+        self.config = copy.copy(config)
+        self.config.rope_theta = rope_theta
+        self.config.partial_rotary_factor = partial_rotary_factor
+        if config.rope_parameters is not None:
+            self.config.rope_parameters = copy.deepcopy(config.rope_parameters)
+            self.config.rope_parameters["rope_theta"] = rope_theta
+            self.config.rope_parameters["partial_rotary_factor"] = (
+                partial_rotary_factor
+            )
+            self.rope_type = self.config.rope_parameters.get(
+                "rope_type", self.config.rope_parameters.get("type")
+            )
+        else:
+            self.rope_type = "default"
+        self.rope_init_fn = self.compute_default_rope_parameters
+        if self.rope_type != "default":
+            self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+        inv_freq, self.attention_scaling = self.rope_init_fn(
+            self.config, device
+        )
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.original_inv_freq = self.inv_freq
+    @torch.no_grad()
+    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+    def forward(self, x, position_ids):
+        inv_freq_expanded = (
+            self.inv_freq[None, :, None]
+            .float()
+            .expand(position_ids.shape[0], -1, 1)
+            .to(x.device)
+        )
+        position_ids_expanded = position_ids[:, None, :].float().to(x.device)
+        device_type = (
+            x.device.type
+            if isinstance(x.device.type, str) and x.device.type != "mps"
+            else "cpu"
+        )
+        with torch.autocast(
+            device_type=device_type, enabled=False
+        ):  # Force float32
+            freqs = (
+                inv_freq_expanded.float() @ position_ids_expanded.float()
+            ).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos() * self.attention_scaling
+            sin = emb.sin() * self.attention_scaling
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+    @staticmethod
+    def compute_default_rope_parameters(
+        config: Step3p7TextConfig | None = None,
+        device: Optional["torch.device"] = None,
+    ) -> tuple["torch.Tensor", float]:
+        """
+        Computes the inverse frequencies according to the original RoPE implementation.
+        Args:
+            config ([`~transformers.PreTrainedConfig`]):
+                The model configuration.
+            device (`torch.device`):
+                The device to use for initialization of the inverse frequencies.
+        Returns:
+            Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+            post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
+        """
+        base = config.rope_theta
+        partial_rotary_factor = getattr(
+            config, "partial_rotary_factor", 1.0
+        )
+        head_dim = (
+            getattr(config, "head_dim", None)
+            or config.hidden_size // config.num_attention_heads
+        )
+        dim = int(head_dim * partial_rotary_factor)
+        attention_factor = 1.0  # Unused in this type of RoPE
+        # Compute the inverse frequencies
+        inv_freq = 1.0 / (
+            base
+            ** (
+                torch.arange(0, dim, 2, dtype=torch.int64).to(
+                    device=device, dtype=torch.float
+                )
+                / dim
+            )
+        )
+        return inv_freq, attention_factor
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., :x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2:]
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            Unused in this implementation. `cos` and `sin` are applied as given, so they must already be
+            broadcastable to `q` and `k`. For example, if q and k have the shape
+            [batch_size, heads, seq_len, head_dim], cos and sin of shape [batch_size, seq_len, head_dim] need a
+            singleton dimension at index 1; if q and k have the shape [batch_size, seq_len, heads, head_dim],
+            they need it at index 2.
+    Returns:
+        `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    rotary_dim = cos.shape[-1]
+    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
+    k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]
+    # Apply rotary embeddings to the first `rotary_dim` channels; the rest pass through unchanged
+    q_embed = (q_rot * cos) + (rotate_half(q_rot) * sin)
+    k_embed = (k_rot * cos) + (rotate_half(k_rot) * sin)
+    # Concatenate back to full shape
+    q_embed = torch.cat([q_embed, q_pass], dim=-1)
+    k_embed = torch.cat([k_embed, k_pass], dim=-1)
+    return q_embed, k_embed
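To illustrate what the rotary code above computes, here is a pure-Python sketch on single vectors. It mirrors the inverse-frequency formula, the `(freqs, freqs)` concatenation, and `rotate_half`, but uses plain lists instead of tensors and ignores `attention_scaling` and partial rotary factors. `apply_rope` and `dot` are illustrative helpers, not functions from this file.

```python
import math


def inv_freq(base: float, dim: int) -> list[float]:
    # Mirrors 1 / base ** (arange(0, dim, 2) / dim) from compute_default_rope_parameters.
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def rotate_half(x: list[float]) -> list[float]:
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x: list[float], position: int, base: float = 10000.0) -> list[float]:
    freqs = [position * f for f in inv_freq(base, len(x))]
    angles = freqs + freqs  # same (freqs, freqs) layout as the forward pass
    rotated = rotate_half(x)
    return [a * math.cos(t) + b * math.sin(t) for a, b, t in zip(x, rotated, angles)]


def dot(a: list[float], b: list[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]
near = dot(apply_rope(q, 3), apply_rope(k, 1))      # positions 3 and 1
far = dot(apply_rope(q, 103), apply_rope(k, 101))   # same offset, shifted by 100
print(near, far)
```

The two scores agree up to floating-point error because each channel pair `(x[i], x[i + dim/2])` is rotated by an angle proportional to position, so the query-key dot product depends only on the position difference. Changing `base` (for example to the 5000000.0 used by some layers in `config.json`) changes how quickly the angles grow with position.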
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(
+        batch, num_key_value_heads, n_rep, slen, head_dim
+    )
+    return hidden_states.reshape(
+        batch, num_key_value_heads * n_rep, slen, head_dim
+    )
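The head ordering produced by `repeat_kv` is easy to get wrong, so here is a small label-based sketch of it. It assumes `n_rep` equals `num_attention_heads // num_attention_groups` (64 // 8 with the text config defaults), which is not shown in this part of the diff. `repeat_kv_labels` and `kv_head_for_query_head` are hypothetical helpers.

```python
def repeat_kv_labels(kv_heads: list[str], n_rep: int) -> list[str]:
    """Order produced by expand + reshape: each KV head repeated n_rep times in a row."""
    return [head for head in kv_heads for _ in range(n_rep)]


def kv_head_for_query_head(query_head: int, n_rep: int) -> int:
    # With the interleaved layout, consecutive query heads share one KV head.
    return query_head // n_rep


n_rep = 64 // 8  # assumed: num_attention_heads // num_attention_groups
expanded = repeat_kv_labels([f"kv{i}" for i in range(8)], n_rep)
print(expanded[:9])
```

Query heads 0 to 7 read `kv0`, heads 8 to 15 read `kv1`, and so on. This is the `repeat_interleave` layout named in the docstring, as opposed to tiling the KV heads as `kv0, kv1, ..., kv7, kv0, ...`.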
+# Adapted from transformers.models.llama.modeling_llama.eager_attention_forward.
+# Llama4 does not cast attention weights to fp32 here.
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: Optional[torch.Tensor],
+    scaling: float,
+    dropout: float = 0.0,
+    **kwargs,
+):
+    key_states = repeat_kv(key, module.num_key_value_groups)
+    value_states = repeat_kv(value, module.num_key_value_groups)
+    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+    if attention_mask is not None:
+        causal_mask = attention_mask[:, :, :, :key_states.shape[-2]]
+        attn_weights = attn_weights + causal_mask
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
+    attn_weights = nn.functional.dropout(
+        attn_weights, p=dropout, training=module.training
+    )
+    attn_output = torch.matmul(attn_weights, value_states)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+    return attn_output, attn_weights
+@dataclass
+class Step3p7CausalLMOutputWithPast(ModelOutput):
+    r"""
+    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
+        Language modeling loss (for next-token prediction).
+    last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+        Hidden states at the output of the last layer of the model.
+    logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
+        Prediction scores of the language modeling head (scores for each vocabulary token before softmax).
+    past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+        Pre-computed key and value states of the self-attention blocks, which can be passed back in (see the
+        `past_key_values` input) to speed up sequential decoding.
+    hidden_states (`tuple(torch.FloatTensor)`, *optional*):
+        Per-layer hidden states, each of shape `(batch_size, sequence_length, hidden_size)`.
+    attentions (`tuple(torch.FloatTensor)`, *optional*):
+        Per-layer attention weights after the attention softmax.
+    """
+    loss: Optional[torch.FloatTensor] = None
+    last_hidden_state: Optional[torch.FloatTensor] = None
+    logits: torch.FloatTensor = None
+    past_key_values: Optional[list[torch.FloatTensor]] = None
+    hidden_states: Optional[tuple[torch.FloatTensor]] = None
+    attentions: Optional[tuple[torch.FloatTensor]] = None
+class Step3p7MLP(nn.Module):
+    def __init__(self, config, intermediate_size=None, swiglu_limit=None):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = (
+            intermediate_size
+            if intermediate_size is not None
+            else config.intermediate_size
+        )
+        self.gate_proj = nn.Linear(self.hidden_size,
+                                   self.intermediate_size,
+                                   bias=False)
+        self.up_proj = nn.Linear(self.hidden_size,
+                                 self.intermediate_size,
+                                 bias=False)
+        self.down_proj = nn.Linear(self.intermediate_size,
+                                   self.hidden_size,
+                                   bias=False)
+        self.act_fn = ACT2FN["silu"]
+        self.limit = swiglu_limit
+    def forward(self, x):
+        up = self.up_proj(x)
+        gate = self.act_fn(self.gate_proj(x))
+        if self.limit is not None:
+            gate = gate.clamp(min=None, max=self.limit)
+            up = up.clamp(min=-self.limit, max=self.limit)
+        return self.down_proj(gate * up)
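The clamping in `Step3p7MLP.forward` is asymmetric: the activated gate is bounded only from above, while the up projection is bounded on both sides. A scalar sketch in plain Python, leaving out the linear projections (`clamped_swiglu` is a hypothetical name used only for illustration):

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clamped_swiglu(gate_pre, up_pre, limit=None):
    """Elementwise SwiGLU product with the optional clamping used in the MLP above."""
    gate = silu(gate_pre)
    up = up_pre
    if limit is not None:
        gate = min(gate, limit)            # upper bound only
        up = max(-limit, min(up, limit))   # symmetric bound
    return gate * up


unclamped = clamped_swiglu(100.0, 50.0)
clamped = clamped_swiglu(100.0, 50.0, limit=7.0)
negative_gate = clamped_swiglu(-1.0, 2.0, limit=0.1)
```

With `limit=7.0` both factors saturate at 7, so the product is capped at 49 instead of roughly 5000. A negative gate value passes through unchanged because there is no lower bound on the gate.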
+def sigmoid_routing_function(gating_output: torch.Tensor, topk: int,
+                             renormalize: bool):
+    gating_output = gating_output.float()
+    gate_prob = torch.sigmoid(gating_output)
+    gate_prob = gate_prob / gate_prob.sum(dim=-1, keepdim=True)
+    topk_prob, indices = torch.topk(gate_prob, k=topk, dim=1)
+    expert_topk_weight = topk_prob
+    if renormalize:
+        expert_topk_weight = expert_topk_weight / torch.sum(
+            expert_topk_weight, dim=-1, keepdim=True)
+    return expert_topk_weight, indices
+def softmax_routing_function(gating_output: torch.Tensor, top_k: int,
+                             renormalize: bool):
+    gating_output = gating_output.float()
+    gate_prob = torch.softmax(gating_output, dim=-1)
+    gate_prob = gate_prob / gate_prob.sum(dim=-1, keepdim=True)
+    topk_prob, indices = torch.topk(gate_prob, k=top_k, dim=1)
+    expert_topk_weight = topk_prob
+    if renormalize:
+        expert_topk_weight = expert_topk_weight / torch.sum(
+            expert_topk_weight, dim=-1, keepdim=True)
+    return expert_topk_weight, indices.to(torch.int32)
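Both routing functions divide by the sum over all experts before top-k and, when `renormalize=True`, divide again by the sum over the selected experts. The sketch below (plain Python, single token, hypothetical helper `sigmoid_topk`) illustrates that in the renormalized case the first division does not change the result, since a common positive factor affects neither the ranking nor the renormalized ratios:

```python
import math


def sigmoid_topk(logits, top_k, prenormalize):
    """Sigmoid routing for one token with top-k renormalization."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    if prenormalize:
        total = sum(probs)
        probs = [p / total for p in probs]
    # Indices of the top_k largest probabilities, largest first
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    selected_sum = sum(probs[i] for i in chosen)
    weights = [probs[i] / selected_sum for i in chosen]
    return chosen, weights


logits = [0.5, -1.0, 2.0, 0.0]
idx_a, w_a = sigmoid_topk(logits, top_k=2, prenormalize=True)
idx_b, w_b = sigmoid_topk(logits, top_k=2, prenormalize=False)
```

With `renormalize=False` the two variants would differ, so this observation applies only to the renormalized path.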
+class MoELinear(nn.Module):
+    def __init__(self, num_experts, in_features, out_features):
+        super().__init__()
+        self.num_experts = num_experts
+        self.in_features = in_features
+        self.out_features = out_features
+        self.weight = nn.Parameter(
+            torch.empty(num_experts, out_features, in_features))
+    def forward(self, x, expert_id):
+        x = F.linear(x.float(), self.weight[expert_id].float())
+        return x
+class Step3p7MoEMLP(nn.Module):
+    def __init__(self, config, swiglu_limit=None):
+        super().__init__()
+        self.num_experts = config.moe_num_experts
+        self.top_k = config.moe_top_k
+        self.hidden_size = config.hidden_size
+        self.moe_intermediate_size = config.moe_intermediate_size
+        self.use_moe_router_bias = config.use_moe_router_bias
+        if self.use_moe_router_bias:
+            self.router_bias = nn.Parameter(torch.zeros(config.moe_num_experts,
+                                                        dtype=torch.float32),
+                                            requires_grad=False)
+            self.custom_routing_function = self.router_bias_func
+        elif config.moe_router_activation == "sigmoid":
+            self.custom_routing_function = sigmoid_routing_function
+        else:
+            self.custom_routing_function = None
+        self.need_fp32_gate = config.need_fp32_gate
+        self.routed_scaling_factor = getattr(config,
+                                             "moe_router_scaling_factor", 1.0)
+        # gating
+        self.gate = nn.Linear(self.hidden_size, self.num_experts, bias=False)
+        self.act_fn = ACT2FN["silu"]
+        self.limit = swiglu_limit
+        self.up_proj = MoELinear(self.num_experts, self.hidden_size,
+                                 self.moe_intermediate_size)
+        self.gate_proj = MoELinear(self.num_experts, self.hidden_size,
+                                   self.moe_intermediate_size)
+        self.down_proj = MoELinear(self.num_experts,
+                                   self.moe_intermediate_size,
+                                   self.hidden_size)
+    def router_bias_func(self, gating_output: torch.Tensor, topk: int,
+                         renormalize: bool):
+        gate_prob = torch.sigmoid(gating_output.float())
+        gate_prob_with_bias = gate_prob + self.router_bias.unsqueeze(0)
+        _, indices = torch.topk(gate_prob_with_bias, k=topk, dim=1)
+        topk_prob = torch.gather(gate_prob, 1, indices)
+        expert_topk_weight = topk_prob
+        if renormalize:
+            expert_topk_weight = expert_topk_weight / (
+                torch.sum(expert_topk_weight, dim=-1, keepdim=True) + 1e-20)
+        return expert_topk_weight, indices
+    def get_expert_output(self, inputs: torch.Tensor, expert_id):
+        up = self.up_proj(inputs, expert_id)
+        gate = self.act_fn(self.gate_proj(inputs, expert_id))
+        if self.limit is not None:
+            gate = gate.clamp(min=None, max=self.limit)
+            up = up.clamp(min=-self.limit, max=self.limit)
+        return self.down_proj(gate * up, expert_id)
+    def forward(self, hidden_states):
+        """Route each token to its top-k experts and sum the weighted expert outputs."""
+        batch_size, sequence_length, hidden_dim = hidden_states.shape
+        hidden_states = hidden_states.view(-1, hidden_dim)
+        if self.need_fp32_gate:
+            router_logits = torch.matmul(
+                hidden_states.to(torch.float32),
+                self.gate.weight.t().to(torch.float32),
+            )
+        else:
+            # router_logits: (batch * sequence_length, n_experts)
+            router_logits = self.gate(hidden_states)
+        if self.custom_routing_function:
+            routing_weights, selected_experts = self.custom_routing_function(
+                router_logits, self.top_k, renormalize=True)
+        else:
+            routing_weights = F.softmax(router_logits,
+                                        dim=1,
+                                        dtype=torch.float)
+            routing_weights, selected_experts = torch.topk(routing_weights,
+                                                           self.top_k,
+                                                           dim=-1)
+        routing_weights = routing_weights * self.routed_scaling_factor
+        final_hidden_states = torch.zeros(
+            (batch_size * sequence_length, hidden_dim),
+            dtype=hidden_states.dtype,
+            device=hidden_states.device)
+        # One-hot encode the selected experts to create an expert mask;
+        # it is used to look up which tokens are routed to each expert.
+        expert_mask = torch.nn.functional.one_hot(
+            selected_experts, num_classes=self.num_experts).permute(2, 1, 0)
+        # Loop over all available experts in the model and perform the computation on each expert
+        for expert_idx in range(self.num_experts):
+            idx, top_x = torch.where(expert_mask[expert_idx])
+            # Index the hidden states of the tokens routed to the current expert and
+            # compute the expert output. The output is multiplied by the matching
+            # `routing_weights` entry for each token (one per top-k slot).
+            current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
+            current_hidden_states = (
+                self.get_expert_output(current_state, expert_idx) *
+                routing_weights[top_x, idx, None])
+            # `index_add_` only supports tensor indices, so we index with
+            # the `top_x` tensor here.
+            final_hidden_states.index_add_(
+                0, top_x, current_hidden_states.to(hidden_states.dtype))
+        final_hidden_states = final_hidden_states.reshape(
+            batch_size, sequence_length, hidden_dim)
+        return final_hidden_states
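In `router_bias_func`, the bias influences only which experts are selected. The weights applied to the expert outputs are gathered from the unbiased sigmoid probabilities and then renormalized. A single-token sketch in plain Python (the function name, the `scaling` argument, and the example values are illustrative assumptions; in the patch the scaling factor is applied later in `forward`):

```python
import math


def route_with_bias(logits, bias, top_k, scaling=1.0):
    """Select experts by biased score, weight them by unbiased probability."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    biased = [p + b for p, b in zip(probs, bias)]
    # Selection uses the biased scores
    chosen = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Weights use the unbiased probabilities, renormalized over the selection
    total = sum(probs[i] for i in chosen) + 1e-20
    weights = [scaling * probs[i] / total for i in chosen]
    return chosen, weights


logits = [2.0, 0.0, -2.0, 1.0]
bias = [0.0, 0.0, 1.0, 0.0]  # boosts expert 2 during selection only
chosen, weights = route_with_bias(logits, bias, top_k=2)
no_bias_chosen, _ = route_with_bias(logits, [0.0] * 4, top_k=2)
```

Expert 2 is selected first because of its bias, yet it receives the smaller weight because its unbiased probability is low. Without the bias, experts 0 and 3 would be chosen instead.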
+class Step3p7RMSNorm(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        eps: float = 1e-5,
+    ) -> None:
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        dtype = x.dtype
+        x = x.float()
+        variance = x.pow(2).mean(dim=-1, keepdim=True)
+        normed = x * torch.rsqrt(variance + self.variance_epsilon)
+        normed = normed * (self.weight.float() + 1)
+        return normed.to(dtype)
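`Step3p7RMSNorm` scales by `weight + 1` rather than `weight`, so a stored weight of zero corresponds to an identity scale. Because `__init__` above initializes `weight` to ones, a freshly constructed module scales by 2 until checkpoint weights are loaded. A plain-Python sketch for one vector (`rms_norm` is a hypothetical stand-in for the module's `forward`):

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """RMS-normalize x, then scale each entry by (weight + 1)."""
    variance = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [v * inv_rms * (w + 1.0) for v, w in zip(x, weight)]


x = [3.0, 4.0]
identity_scaled = rms_norm(x, [0.0, 0.0])  # stored weight 0 -> scale 1
ones_scaled = rms_norm(x, [1.0, 1.0])      # stored weight 1 -> scale 2
mean_square = sum(v * v for v in identity_scaled) / len(identity_scaled)
```

With zero weights the output has a mean square close to 1, and with unit weights every entry is doubled.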
+class Step3p7Attention(nn.Module):
+    def __init__(self, config: Step3p7TextConfig, layer_idx):
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.num_attention_heads = config.num_attention_heads
+        self.num_key_value_heads = config.num_attention_groups
+        layer_types = getattr(config, "layer_types", [])
+        if layer_types:
+            enable_sliding_window = layer_types[
+                self.layer_idx] == "sliding_attention"
+        else:
+            enable_sliding_window = self.layer_idx % 2 == 0
+        yarn_only_types = getattr(config, "yarn_only_types", None)
+        if yarn_only_types and layer_types[
+                self.layer_idx] not in yarn_only_types:
+            config.rope_parameters = None
+        else:
+            config.rope_parameters = getattr(config, "rope_scaling", None)
+        self.sliding_window = config.sliding_window
+        if enable_sliding_window:
+            self.num_attention_heads = config.attention_other_setting[
+                "num_attention_heads"]
+            self.num_key_value_heads = config.attention_other_setting[
+                "num_attention_groups"]
+        if not enable_sliding_window:
+            self.sliding_window = None
+        self.head_dim = getattr(config, "head_dim",
+                        config.hidden_size // self.num_attention_heads)
+        self.num_key_value_groups = self.num_attention_heads // self.num_key_value_heads
+        self.rotary_emb = Step3p7RotaryEmbedding(config, layer_idx=layer_idx)
+        self.q_size = self.num_attention_heads * self.head_dim
+        self.kv_size = self.num_key_value_heads * self.head_dim
+        self.scaling = self.head_dim**-0.5
+        self.q_proj = nn.Linear(config.hidden_size, self.q_size, bias=False)
+        self.k_proj = nn.Linear(config.hidden_size, self.kv_size, bias=False)
+        self.v_proj = nn.Linear(config.hidden_size, self.kv_size, bias=False)
+        self.o_proj = nn.Linear(self.q_size, config.hidden_size, bias=False)
+        self.attention_dropout = getattr(config, "attention_dropout", 0.0)
+        self.q_norm = Step3p7RMSNorm(self.head_dim,
+                                    eps=config.rms_norm_eps)
+        self.k_norm = Step3p7RMSNorm(self.head_dim,
+                                    eps=config.rms_norm_eps)
+        self.use_head_wise_attn_gate = config.use_head_wise_attn_gate
+        if self.use_head_wise_attn_gate:
+            self.g_proj = nn.Linear(config.hidden_size,
+                                    self.num_attention_heads,
+                                    bias=False)
+        self.use_rope = True
+        use_rope_layers = getattr(config, "use_rope_layers", None)
+        if use_rope_layers:
+            self.use_rope = use_rope_layers[self.layer_idx]
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor],
+        past_key_value: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor],
+               Optional[Tuple[torch.Tensor]]]:
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        query_states = self.q_norm(
+            self.q_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+        key_states = self.k_norm(
+            self.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(
+            1, 2)
+        if self.use_head_wise_attn_gate:
+            gate_states = self.g_proj(hidden_states)
+        cos, sin = self.rotary_emb(hidden_states, position_ids)
+        query_states, key_states = apply_rotary_pos_emb(
+            query_states, key_states, cos, sin)
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position is needed for the static cache
+            cache_kwargs = {
+                "sin": sin,
+                "cos": cos,
+                "cache_position": cache_position
+            }
+            key_states, value_states = past_key_value.update(
+                key_states, value_states, self.layer_idx, cache_kwargs)
+        attention_interface: Callable = eager_attention_forward
+        # TODO: consider FP8;
+        # RuntimeError: Expected attn_mask dtype to be bool or float or to match query dtype,
+        # but got attn_mask.dtype: long int and  query.dtype: c10::BFloat16 instead.
+        if self.config._attn_implementation != "eager":
+            attention_interface = ALL_ATTENTION_FUNCTIONS[
+                self.config._attn_implementation]
+        attn_output, attn_weights = attention_interface(
+            self,
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            dropout=0.0 if not self.training else self.attention_dropout,
+            scaling=self.scaling,
+            sliding_window=self.sliding_window,  # main diff with Llama
+            **kwargs,
+        )
+        attn_output = attn_output.reshape(*input_shape, -1)
+        if self.use_head_wise_attn_gate:
+            output = attn_output.view(
+                *attn_output.shape[:-1], self.num_attention_heads,
+                self.head_dim) * gate_states.unsqueeze(-1).sigmoid()
+            attn_output = output.view(*attn_output.shape)
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+class Step3p7DecoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config, layer_idx):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.layer_idx = layer_idx
+        self.self_attn = Step3p7Attention(config, layer_idx)
+        layer_types = getattr(config, "layer_types", None) or []
+        if layer_types:
+            self.attention_type = layer_types[layer_idx]
+        else:
+            self.attention_type = (
+                "sliding_attention" if layer_idx % 2 == 0 else "full_attention"
+            )
+        moe_layers_enum = getattr(config, "moe_layers_enum", None)
+        if moe_layers_enum is not None:
+            if isinstance(moe_layers_enum, str):
+                moe_layers_idx = [
+                    int(i) for i in moe_layers_enum.split(',') if i.strip()
+                ]
+            else:
+                moe_layers_idx = [int(i) for i in moe_layers_enum]
+        else:
+            moe_layers_idx = [i for i in range(1, config.num_hidden_layers)]
+        self.is_moe_layer = layer_idx in moe_layers_idx
+        self.use_moe = False
+        if config.swiglu_limits_shared and config.swiglu_limits_shared[
+                layer_idx] is not None and config.swiglu_limits_shared[
+                    layer_idx] != 0:
+            swiglu_limit_shared = config.swiglu_limits_shared[layer_idx]
+        else:
+            swiglu_limit_shared = None
+        if config.swiglu_limits and config.swiglu_limits[
+                layer_idx] is not None and config.swiglu_limits[layer_idx] != 0:
+            swiglu_limit = config.swiglu_limits[layer_idx]
+        else:
+            swiglu_limit = None
+        if self.is_moe_layer:
+            self.moe = Step3p7MoEMLP(config, swiglu_limit=swiglu_limit)
+            self.share_expert = Step3p7MLP(
+                config,
+                intermediate_size=config.share_expert_dim,
+                swiglu_limit=swiglu_limit_shared)
+            self.use_moe = True
+        else:
+            self.mlp = Step3p7MLP(config,
+                                 intermediate_size=config.intermediate_size,
+                                 swiglu_limit=swiglu_limit_shared)
+        self.input_layernorm = Step3p7RMSNorm(
+            config.hidden_size,
+            eps=config.rms_norm_eps)
+        self.post_attention_layernorm = Step3p7RMSNorm(
+            config.hidden_size,
+            eps=config.rms_norm_eps)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[tuple[torch.Tensor]] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> torch.FloatTensor:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        hidden_states, _ = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        if self.use_moe:
+            share_output = self.share_expert(hidden_states)
+            moe_output = self.moe(hidden_states)
+            ffn_output = moe_output + share_output
+        else:
+            ffn_output = self.mlp(hidden_states)
+        if isinstance(ffn_output, tuple):
+            hidden_states, _ = ffn_output
+        else:
+            hidden_states = ffn_output
+        hidden_states = residual + hidden_states
+        return hidden_states
+class Step3p7TextPreTrainedModel(PreTrainedModel):
+    # Link this model family to its configuration class so PreTrainedModel.from_pretrained
+    # can load the config instead of failing with a NoneType error.
+    config_class = Step3p7TextConfig
+    supports_gradient_checkpointing = True
+    _skip_keys_device_placement = ["past_key_values"]
+    _keys_to_ignore_on_load_unexpected = [
+        r"model\.layers\.45\.*",
+        r"model\.layers\.46\.*",
+        r"model\.layers\.47\.*",
+    ]
+    _supports_flash_attn = False
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _supports_static_cache = True
+    _supports_attention_backend = True
+class Step3p7TextModel(Step3p7TextPreTrainedModel, GenerationMixin):
+    _no_split_modules = ["Step3p7DecoderLayer"]
+    base_model_prefix = "model"
+    _tied_weights_keys = ["lm_head.weight"]
+    config: Step3p7TextConfig
+    def __init__(self, config: Step3p7TextConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size,
+                                         self.padding_idx)
+        self.layers = nn.ModuleList([
+            Step3p7DecoderLayer(config, layer_idx)
+            for layer_idx in range(config.num_hidden_layers)
+        ])
+        self.norm = Step3p7RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.gradient_checkpointing = False
+        layer_types = self.config.layer_types or []
+        self.has_sliding_layers = (not layer_types or
+                                   "sliding_attention" in layer_types)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self, input_ids):
+        return self.embed_tokens(input_ids)
+    @can_return_tuple
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> Union[tuple, BaseModelOutputWithPast]:
+        output_attentions = (
+            output_attentions
+            if output_attentions is not None
+            else self.config.output_attentions
+        )
+        output_hidden_states = (
+            output_hidden_states
+            if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = (
+            return_dict
+            if return_dict is not None
+            else getattr(self.config, "return_dict", True)
+        )
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError(
+                "You must specify exactly one of input_ids or inputs_embeds")
+        if self.gradient_checkpointing and self.training and use_cache:
+            logger.warning_once(
+                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+            )
+            use_cache = False
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(
+                input_ids.to(self.embed_tokens.weight.device))
+        if use_cache and past_key_values is None:
+            past_key_values = DynamicCache()
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length(
+            ) if past_key_values is not None else 0
+            cache_position = torch.arange(past_seen_tokens,
+                                          past_seen_tokens +
+                                          inputs_embeds.shape[1],
+                                          device=inputs_embeds.device)
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+        hidden_states = inputs_embeds
+        # It may already have been prepared by e.g. `generate`
+        if not isinstance(causal_mask_mapping := attention_mask, dict):
+            # Prepare mask arguments
+            mask_kwargs = {
+                "config": self.config,
+                "attention_mask": attention_mask,
+                "past_key_values": past_key_values,
+                "position_ids": position_ids,
+            }
+            mask_kwargs[_MASK_INPUT_EMBEDS_ARG] = inputs_embeds
+            # Create the masks
+            causal_mask_mapping = {
+                "full_attention": create_causal_mask(**mask_kwargs),
+            }
+            # The sliding window alternating layers are not always activated depending on the config
+            if self.has_sliding_layers:
+                causal_mask_mapping[
+                    "sliding_attention"] = create_sliding_window_causal_mask(
+                        **mask_kwargs)
+        # Decoder layers (rotary embeddings are computed inside each attention layer)
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        for decoder_layer in self.layers[:self.config.num_hidden_layers]:
+            if output_hidden_states:
+                all_hidden_states += (hidden_states, )
+            layer_outputs = decoder_layer(
+                hidden_states,
+                attention_mask=causal_mask_mapping[
+                    decoder_layer.attention_type],
+                position_ids=position_ids,
+                past_key_value=past_key_values,
+                output_attentions=output_attentions,
+                use_cache=use_cache,
+                cache_position=cache_position,
+                **kwargs,
+            )
+            hidden_states = layer_outputs
+        hidden_states = self.norm(hidden_states)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=past_key_values if use_cache else None,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+class Step3p7Model(Step3p7PreTrainedModel, GenerationMixin):
+    config: Step3p7Config
+    _tied_weights_keys = ["lm_head.weight"]
+    base_model_prefix = ""
+    def __init__(self, config: Step3p7Config):
+        super().__init__(config)
+        self.vision_model = StepRoboticsVisionEncoder(config.vision_config)
+        self.language_model = Step3p7TextModel(config.text_config)
+        self.vocab_size = config.text_config.vocab_size
+        self.vit_large_projector = nn.Linear(
+                config.vision_config.width * 4,
+                config.text_config.hidden_size,
+                bias=config.projector_bias)
+        self.image_placeholder_token_id = config.image_token_id
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(
+        self,
+        input_ids: torch.Tensor,
+        multimodal_embeddings  = None,
+    ) -> torch.Tensor:
+        input_ids = input_ids.squeeze(0)
+        if multimodal_embeddings is None:
+            inputs_embeds = self.language_model.get_input_embeddings(input_ids)
+        else:
+            is_text = input_ids != self.config.image_token_id
+            text_ids = input_ids[is_text]
+            text_embeds = self.language_model.get_input_embeddings(text_ids)
+            inputs_embeds = torch.empty(input_ids.shape[0],
+                                        text_embeds.shape[-1],
+                                        dtype=text_embeds.dtype,
+                                        device=text_embeds.device)
+            inputs_embeds[is_text] = text_embeds
+            inputs_embeds = merge_multimodal_embeddings(
+                input_ids, inputs_embeds, multimodal_embeddings,
+                self.config.image_token_id)
+        inputs_embeds = inputs_embeds.unsqueeze(0)
+        return inputs_embeds
+    def set_input_embeddings(self, value):
+        return self.language_model.set_input_embeddings(value)
+    def set_decoder(self, decoder):
+        self.language_model = decoder
+    def get_decoder(self):
+        return self.language_model
+    def _parse_and_validate_image_input(
+            self, **kwargs: object) -> Optional[StepVLImageInputs]:
+        pixel_values = kwargs.pop("pixel_values", None)
+        patch_pixel_values = kwargs.pop("patch_pixel_values", None)
+        num_patches = kwargs.pop("num_patches", None)
+        image_embeds = kwargs.pop("image_embeds", None)
+        if pixel_values is None and image_embeds is None:
+            return None
+        if pixel_values is not None:
+            # pixel_values = flatten_bn(pixel_values, concat=True)
+            if pixel_values.dim() >= 3:
+                pixel_values = pixel_values.view(-1, *pixel_values.shape[-3:])
+            if patch_pixel_values is not None:
+                # patch_pixel_values = flatten_bn(patch_pixel_values,
+                #                                 concat=True)
+                patch_pixel_values = patch_pixel_values.view(
+                    -1, *patch_pixel_values.shape[-3:])
+                # Handle empty patch_pixel_values by setting to None
+                if patch_pixel_values.shape[0] == 0:
+                    patch_pixel_values = None
+            return StepVLImagePixelInputs(
+                type="pixel_values",
+                pixel_values=pixel_values.to(self.dtype).to(self.device),
+                patch_pixel_values=patch_pixel_values.to(self.dtype).to(
+                    self.device) if patch_pixel_values is not None else None,
+                num_patches=num_patches,
+            )
+        if image_embeds is not None:
+            if image_embeds.dim() >= 2:
+                image_embeds = image_embeds.view(-1, image_embeds.shape[-1])
+            else:
+                raise ValueError(
+                    f"Unexpected shape for image_embeds: {image_embeds.shape}")
+            return StepVLImageEmbeddingInputs(
+                type="image_embeds",
+                image_embeds=image_embeds.to(self.dtype).to(self.device),
+            )
+        return None
+    def _process_image_features(self,
+                                image_features: torch.Tensor) -> torch.Tensor:
+        B, P = image_features.shape[:2]
+        HW = int(P ** 0.5)
+        image_features = image_features.permute(0, 2, 1).view(B, -1, HW, HW)
+        image_features = self.vision_model.vit_downsampler1(image_features)
+        image_features = self.vision_model.vit_downsampler2(image_features)
+        B, C, HW, HW = image_features.shape
+        image_features = image_features.view(B, -1, HW * HW).permute(0, 2, 1)
+        image_features = self.vit_large_projector(image_features)
+        return image_features
+    def _get_vision_model_output(self,
+                                 input_tensor: torch.Tensor) -> torch.Tensor:
+        return self.vision_model(input_tensor)
+    def _process_image_input(
+            self, image_input: StepVLImageInputs) -> tuple[torch.Tensor, ...]:
+        if image_input["type"] == "image_embeds":
+            image_features = image_input["image_embeds"]
+        else:
+            image_features = self._get_vision_model_output(
+                image_input["pixel_values"])
+            patch_image_features = self._get_vision_model_output(
+                image_input["patch_pixel_values"]
+            ) if image_input["patch_pixel_values"] is not None else None
+            num_patches = image_input["num_patches"]
+        image_features = self._process_image_features(image_features)
+        patch_image_features = self._process_image_features(
+            patch_image_features) if patch_image_features is not None else None
+        merged_image_features = []
+        cur_patch_idx = 0
+        for i, num_patch in enumerate(num_patches):
+            cur_feature = []
+            if num_patch > 0:
+                patch_slice = patch_image_features[
+                    cur_patch_idx:cur_patch_idx + num_patch]
+                cur_feature.append(patch_slice.view(-1, patch_slice.shape[-1]))
+            cur_feature.append(image_features[i].view(
+                -1, image_features.shape[-1]))
+            cur_patch_idx += num_patch
+            merged_image_features.append(
+                torch.cat(cur_feature) if len(cur_feature) >
+                1 else cur_feature[0])
+        return merged_image_features
+    def get_multimodal_embeddings(self, **kwargs):
+        image_input = self._parse_and_validate_image_input(**kwargs)
+        if image_input is None:
+            return None
+        vision_embeddings = self._process_image_input(image_input)
+        return vision_embeddings
+    @can_return_tuple
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Union[Cache, list[torch.FloatTensor]]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        images: Optional[list[Image.Image]] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> Union[tuple, Step3p7CausalLMOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, Llama4ForCausalLM
+        >>> model = Llama4ForCausalLM.from_pretrained("meta-llama4/Llama4-2-7b-hf")
+        >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama4/Llama4-2-7b-hf")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        output_attentions = (
+            output_attentions
+            if output_attentions is not None
+            else self.config.output_attentions
+        )
+        output_hidden_states = (
+            output_hidden_states
+            if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        return_dict = (
+            return_dict if return_dict is not None else self.config.use_return_dict
+        )
+        if inputs_embeds is None:
+            vision_embeddings = self.get_multimodal_embeddings(**kwargs)
+            inputs_embeds = self.get_input_embeddings(input_ids,
+                                                      vision_embeddings)
+            input_ids = None
+        # The decoder output consists of (dec_features, layer_state, dec_hidden, dec_attn).
+        outputs = self.language_model(
+            input_ids=None,
+            position_ids=position_ids,
+            attention_mask=attention_mask,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=True,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        output = Step3p7CausalLMOutputWithPast(
+            last_hidden_state=outputs.last_hidden_state,
+            past_key_values=outputs.past_key_values,
+            attentions=outputs.attentions,
+        )
+        return output if return_dict else output.to_tuple()
+class Step3p7ForConditionalGeneration(Step3p7PreTrainedModel, GenerationMixin):
+    _checkpoint_conversion_mapping = {
+        "^vision_model": "model.vision_model",
+        r"^model(?!\.(language_model|vision_model))": "model.language_model",
+        "^vit_large_projector": "model.vit_large_projector",
+    }
+    _tied_weights_keys = ["lm_head.weight"]
+    config: Step3p7Config
+    def __init__(self, config: Step3p7Config):
+        super().__init__(config)
+        self.model = Step3p7Model(config)
+        self.lm_head = nn.Linear(config.hidden_size,
+                                config.text_config.vocab_size,
+                                bias=False)
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.model.get_input_embeddings()
+    def set_input_embeddings(self, value):
+        self.model.set_input_embeddings(value)
+    def get_output_embeddings(self):
+        return self.model.get_output_embeddings()
+    def set_output_embeddings(self, new_embeddings):
+        self.model.set_output_embeddings(new_embeddings)
+    def set_decoder(self, decoder):
+        self.model.set_decoder(decoder)
+    def get_decoder(self):
+        return self.model.get_decoder()
+    @property
+    def language_model(self):
+        return self.model.language_model
+    @property
+    def visual(self):
+        return self.model.vision_model
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        pixel_values: Optional[torch.Tensor] = None,
+        num_patches=None,
+        patch_pixel_values=None,
+        patch_newline_mask=None,
+        image_embeds: Optional[torch.FloatTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> Union[tuple, Step3p7CausalLMOutputWithPast]:
+        output_attentions = (
+            output_attentions
+            if output_attentions is not None
+            else self.config.output_attentions
+        )
+        output_hidden_states = (
+            output_hidden_states
+            if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        outputs = self.model(
+            input_ids=input_ids,
+            # Forward the image inputs too; they are named parameters here,
+            # so they are not part of **kwargs and would otherwise be dropped.
+            pixel_values=pixel_values,
+            image_embeds=image_embeds,
+            num_patches=num_patches,
+            patch_pixel_values=patch_pixel_values,
+            patch_newline_mask=patch_newline_mask,
+            position_ids=position_ids,
+            attention_mask=attention_mask,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            cache_position=cache_position,
+            **kwargs,
+        )
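+        hidden_states = outputs.last_hidden_state
+        logits = self.lm_head(hidden_states)
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(
+                logits=logits, labels=labels, vocab_size=self.config.vocab_size
+            )
+        # Assumes Step3p7CausalLMOutputWithPast declares a `loss` field, as the
+        # standard causal LM output classes do.
+        return Step3p7CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            attentions=outputs.attentions,
+        )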
+    def prepare_inputs_for_generation(
+        self,
+        input_ids,
+        past_key_values=None,
+        inputs_embeds=None,
+        pixel_values=None,
+        patch_pixel_values=None,
+        num_patches=None,
+        image_embeds=None,
+        attention_mask=None,
+        cache_position=None,
+        logits_to_keep=None,
+        **kwargs,
+    ):
+        model_inputs = super().prepare_inputs_for_generation(
+            input_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            cache_position=cache_position,
+            logits_to_keep=logits_to_keep,
+            **kwargs,
+        )
+        generation_cache_position = model_inputs.get("cache_position", cache_position)
+        is_prefill = past_key_values is None
+        if generation_cache_position is not None and generation_cache_position.numel() > 0:
+            is_prefill = generation_cache_position[0].item() == 0
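+        if is_prefill:
+            # During cached decoding, input ids no longer contain image tokens,
+            # so image inputs should only be passed at the first step.
+            model_inputs["pixel_values"] = pixel_values
+            model_inputs["patch_pixel_values"] = patch_pixel_values
+            model_inputs["num_patches"] = num_patches
+            model_inputs["image_embeds"] = image_embeds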
+        return model_inputs
+    def _fix_state_dict_key_on_load(self, key: str) -> tuple[str, bool]:
+        if key.startswith("language_model."):
+            return key[len("language_model.") :], True
+        return key, False
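
The `_checkpoint_conversion_mapping` on `Step3p7ForConditionalGeneration` renames checkpoint keys with regular expressions. The sketch below replays those three patterns on a few made-up key names. It assumes each key is rewritten by the first pattern that matches; `remap_key` and the sample keys are illustrative and are not part of this diff or of the Transformers loader.

```python
import re

# Patterns copied from _checkpoint_conversion_mapping in the diff above.
CONVERSION_MAPPING = {
    r"^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    r"^vit_large_projector": "model.vit_large_projector",
}


def remap_key(key: str) -> str:
    """Hypothetical helper: rewrite one key with the first matching pattern."""
    for pattern, replacement in CONVERSION_MAPPING.items():
        new_key, n_subs = re.subn(pattern, replacement, key)
        if n_subs:
            return new_key
    return key  # no pattern matched, keep the key unchanged


if __name__ == "__main__":
    # Sample keys are invented for illustration only.
    for key in [
        "vision_model.conv1.weight",
        "model.layers.0.mlp.weight",
        "model.language_model.layers.0.mlp.weight",
        "vit_large_projector.weight",
        "lm_head.weight",
    ]:
        print(f"{key} -> {remap_key(key)}")
```

The negative lookahead in the second pattern is what keeps keys that already start with `model.language_model` or `model.vision_model` from being prefixed a second time.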

processing_step3.py ADDED

	@@ -0,0 +1,475 @@

+from itertools import product
+from math import ceil
+from typing import Literal, Optional, TypedDict, Union
+import numpy as np
+import torch
+from PIL import Image
+from torchvision import transforms
+from torchvision.transforms.functional import InterpolationMode
+from transformers import BaseImageProcessor
+from transformers.feature_extraction_utils import BatchFeature, TensorType
+from transformers.image_utils import ImageInput
+from transformers.processing_utils import ProcessorMixin
+from transformers.tokenization_utils_tokenizers import TokenizersBackend
+MAX_IMAGE_SIZE: int = 3024
+class Step3VLImagePixelInputs(TypedDict):
+    type: Literal["pixel_values"]
+    pixel_values: torch.Tensor
+    patch_pixel_values: Optional[torch.Tensor]
+    num_patches: list[int]
+class Step3VLImageEmbeddingInputs(TypedDict):
+    type: Literal["image_embeds"]
+    image_embeds: torch.Tensor
+ImageWithPatches = tuple[Image.Image, list[Image.Image], list[bool] | None]
+class GPUToTensor(torch.nn.Module):
+    def forward(self, raw_image: Union[np.ndarray,
+                                       Image.Image]) -> torch.Tensor:
+        if isinstance(raw_image, Image.Image):
+            return transforms.ToTensor()(raw_image)
+        if raw_image.ndim == 2:
+            raw_image = raw_image[:, :, None].repeat(3, -1)
+        if torch.cuda.is_available():
+            device = torch.device("cuda")
+        else:
+            device = torch.device("cpu")
+        image_tensor = torch.from_numpy(raw_image).to(device)
+        image_tensor = torch.permute(image_tensor, (2, 0, 1)).contiguous()
+        if image_tensor.dtype == torch.uint8:
+            image_tensor = image_tensor.to(torch.float32).div(255)
+        return image_tensor
+class Step3VisionProcessor(BaseImageProcessor):
+    def __init__(self, size, interpolation_mode="bicubic", patch_size=None):
+        mean = [0.48145466, 0.4578275, 0.40821073]
+        std = [0.26862954, 0.26130258, 0.27577711]
+        patch_size = patch_size if patch_size is not None else size
+        self.transform = transforms.Compose([
+            GPUToTensor(),
+            transforms.Normalize(mean, std),
+            transforms.Resize(
+                (size, size),
+                interpolation=InterpolationMode.BICUBIC if interpolation_mode
+                == "bicubic" else InterpolationMode.BILINEAR,
+                antialias=True),
+        ])
+        self.patch_transform = transforms.Compose([
+            GPUToTensor(),
+            transforms.Normalize(mean, std),
+            transforms.Resize(
+                (patch_size, patch_size),
+                interpolation=InterpolationMode.BICUBIC if interpolation_mode
+                == "bicubic" else InterpolationMode.BILINEAR,
+                antialias=True),
+        ]) if patch_size is not None else None
+    def __call__(self, image, is_patch=False):
+        if is_patch:
+            return {"pixel_values": self.patch_transform(image).unsqueeze(0)}
+        else:
+            return {"pixel_values": self.transform(image).unsqueeze(0)}
+class ImagePatcher:
+    def determine_window_size(self, long: int, short: int) -> int:
+        if long <= 728:
+            return short if long / short > 1.5 else 0
+        return min(short, 504) if long / short > 4 else 504
+    def slide_window(
+        self,
+        width: int,
+        height: int,
+        sizes: list[tuple[int, int]],
+        steps: list[tuple[int, int]],
+        img_rate_thr: float = 0.6,
+    ) -> tuple[list[tuple[int, int, int, int]], tuple[int, int]]:
+        assert 1 >= img_rate_thr >= 0, "The `img_rate_thr` should lie in 0~1"
+        windows = []
+        # Sliding windows.
+        for size, step in zip(sizes, steps):
+            size_w, size_h = size
+            step_w, step_h = step
+            x_num = 1 if width <= size_w else ceil((width - size_w) / step_w +
+                                                   1)
+            x_start = [step_w * i for i in range(x_num)]
+            if len(x_start) > 1 and x_start[-1] + size_w > width:
+                x_start[-1] = width - size_w
+            y_num = 1 if height <= size_h else ceil((height - size_h) /
+                                                    step_h + 1)
+            y_start = [step_h * i for i in range(y_num)]
+            if len(y_start) > 1 and y_start[-1] + size_h > height:
+                y_start[-1] = height - size_h
+            start = np.array(list(product(y_start, x_start)), dtype=int)
+            start[:, [0, 1]] = start[:, [1, 0]]
+            windows.append(np.concatenate([start, start + size], axis=1))
+        windows = np.concatenate(windows, axis=0)
+        return [(int(box[0]), int(box[1]), int(box[2] - box[0]),
+                 int(box[3] - box[1])) for box in windows], (x_num, y_num)
+    def square_pad(self, img: Image.Image) -> Image.Image:
+        w, h = img.size
+        if w == h:
+            return img
+        size = max(w, h)
+        padded = Image.new(img.mode, (size, size), 0)
+        padded.paste(img, (0, 0))
+        return padded
+    def get_image_size_for_padding(self, img_width: int,
+                                   img_height: int) -> tuple[int, int]:
+        ratio = img_width / img_height
+        if min(img_height, img_width) < 32 and (ratio > 4 or ratio < 1 / 4):
+            new_size = max(img_height, img_width)
+            return new_size, new_size
+        return img_width, img_height
+    def get_image_size_for_preprocess(self, img_width: int,
+                                      img_height: int) -> tuple[int, int]:
+        if max(img_height, img_width) > MAX_IMAGE_SIZE:
+            scale_factor = MAX_IMAGE_SIZE / max(img_height, img_width)
+            img_width = int(img_width * scale_factor)
+            img_height = int(img_height * scale_factor)
+        return img_width, img_height
+    def get_image_size_for_crop(self, img_width: int, img_height: int,
+                                window_size: int):
+        w_ratio = img_width / window_size
+        h_ratio = img_height / window_size
+        if w_ratio < 1:
+            width_new = img_width
+        else:
+            decimal_w = w_ratio - img_width // window_size
+            w_ratio = int(w_ratio) + 1 if decimal_w > 0.2 else int(w_ratio)
+            width_new = window_size * w_ratio
+        if h_ratio < 1:
+            height_new = img_height
+        else:
+            decimal_h = h_ratio - img_height // window_size
+            h_ratio = int(h_ratio) + 1 if decimal_h > 0.2 else int(h_ratio)
+            height_new = window_size * h_ratio
+        return int(width_new), int(height_new)
+    def patch_crop(self, img: Image.Image, i: int, j: int, th: int, tw: int):
+        target = img.crop((j, i, j + tw, i + th))
+        return target
+    def get_num_patches(self, img_width: int,
+                        img_height: int) -> tuple[int, int]:
+        img_width, img_height = self.get_image_size_for_padding(
+            img_width, img_height)
+        img_width, img_height = self.get_image_size_for_preprocess(
+            img_width, img_height)
+        window_size = self.determine_window_size(max(img_height, img_width),
+                                                 min(img_height, img_width))
+        if window_size == 0:
+            return 0, 0
+        else:
+            img_width, img_height = self.get_image_size_for_crop(
+                img_width, img_height, window_size)
+            center_list, (x_num, y_num) = self.slide_window(
+                img_width, img_height, [(window_size, window_size)],
+                [(window_size, window_size)])
+            full_rows = (len(center_list) - 1) // x_num + 1
+            if len(center_list) > 0 and len(center_list) % x_num == 0:
+                full_rows -= 1
+            return len(center_list), full_rows
+    def __call__(
+        self, img: Image.Image
+    ) -> tuple[Image.Image, list[Image.Image], list[bool] | None]:
+        img_width, img_height = img.size
+        new_img_width, new_img_height = self.get_image_size_for_padding(
+            img_width, img_height)
+        if new_img_width != img_width or new_img_height != img_height:
+            img = self.square_pad(img)
+            img_width, img_height = img.size
+        new_img_width, new_img_height = self.get_image_size_for_preprocess(
+            img_width, img_height)
+        img = img.resize((new_img_width, new_img_height),
+                         Image.Resampling.BILINEAR)
+        window_size = self.determine_window_size(
+            max(new_img_height, new_img_width),
+            min(new_img_height, new_img_width))
+        if window_size == 0:
+            return img, [], None
+        else:
+            new_img_width, new_img_height = self.get_image_size_for_crop(
+                new_img_width, new_img_height, window_size)
+            if (new_img_width, new_img_height) != (img_width, img_height):
+                img_for_crop = img.resize((new_img_width, new_img_height),
+                                          Image.Resampling.BILINEAR)
+            else:
+                img_for_crop = img
+            patches = []
+            newlines = []
+            center_list, (x_num, y_num) = self.slide_window(
+                new_img_width, new_img_height, [(window_size, window_size)],
+                [(window_size, window_size)])
+            for patch_id, center_lf_point in enumerate(center_list):
+                x, y, patch_w, patch_h = center_lf_point
+                big_patch = self.patch_crop(img_for_crop, y, x, patch_h,
+                                            patch_w)
+                patches.append(big_patch)
+                if (patch_id + 1) % x_num == 0:
+                    newlines.append(patch_id)
+            if newlines and newlines[-1] == len(patches) - 1:
+                newlines.pop()
+            return img, patches, [i in newlines for i in range(len(patches))] if len(patches) > 0 else None
+class Step3VLProcessor(ProcessorMixin):
+    # Align ProcessorMixin with our custom components.
+    # We only have an image processor (not a feature extractor) plus a tokenizer.
+    attributes = ["tokenizer"]
+    tokenizer_class = "AutoTokenizer"
+    @classmethod
+    def _load_tokenizer_from_pretrained(
+        cls, sub_processor_type, pretrained_model_name_or_path, subfolder="", **kwargs
+    ):
+        return TokenizersBackend.from_pretrained(
+            pretrained_model_name_or_path,
+            subfolder=subfolder,
+            **kwargs,
+        )
+    def __init__(
+        self,
+        tokenizer=None,
+        chat_template=None,
+        **kwargs
+    ) -> None:
+        self.image_size = 728
+        self.patch_size = 504
+        self.image_preprocessor = Step3VisionProcessor(self.image_size,
+                                                       "bilinear",
+                                                       self.patch_size)
+        self.num_image_feature_size = 169
+        self.num_patch_feature_size = 81
+        self.image_token = "<im_patch>"
+        self.image_feature_placeholder = (self.image_token *
+                                          self.num_image_feature_size)
+        self.patch_feature_placeholder = (self.image_token *
+                                          self.num_patch_feature_size)
+        super().__init__(tokenizer=tokenizer, chat_template=chat_template, **kwargs)
+        self.patcher = ImagePatcher()
+    @property
+    def image_token_id(self) -> int:
+        return self.tokenizer.get_vocab()[self.image_token]
+    def get_num_image_tokens(self, img_width: int, img_height: int) -> int:
+        num_patches, num_newlines = self.patcher.get_num_patches(
+            img_width, img_height)
+        return num_patches * (
+            self.num_patch_feature_size +
+            2) + self.num_image_feature_size + 2 + num_newlines
+    def _split_images(self,
+                      images: list[Image.Image]) -> list[ImageWithPatches]:
+        result = []
+        for img in images:
+            result.append(self.patcher(img))
+        return result
+    def _convert_images_to_pixel_values(
+        self,
+        images: list[Image.Image],
+        is_patch: bool = False,
+    ) -> list[torch.Tensor]:
+        return [
+            self.image_preprocessor(img, is_patch=is_patch)["pixel_values"]
+            for img in images
+        ]
+    def _get_patch_repl(
+        self,
+        num_patches: int,
+        patch_newline_mask: list[bool] | None,
+    ) -> tuple[str, list[int]]:
+        text = ""
+        token_ids = []
+        for i in range(num_patches):
+            assert len(patch_newline_mask) == num_patches
+            text += f"<patch_start>{self.patch_feature_placeholder}<patch_end>"
+            token_ids.extend(
+                [self.tokenizer.convert_tokens_to_ids("<patch_start>")] +
+                [self.image_token_id] * self.num_patch_feature_size +
+                [self.tokenizer.convert_tokens_to_ids("<patch_end>")])
+            if patch_newline_mask and patch_newline_mask[i]:
+                text += "<patch_newline>"
+                token_ids.append(
+                    self.tokenizer.convert_tokens_to_ids("<patch_newline>"))
+        return text, token_ids
+    def _get_image_repl(
+        self,
+        num_images: int,
+    ) -> tuple[str, list[int]]:
+        text = f"<im_start>{self.image_feature_placeholder}<im_end>"
+        token_ids = [
+            self.tokenizer.convert_tokens_to_ids("<im_start>")
+        ] + [self.image_token_id] * self.num_image_feature_size + [
+            self.tokenizer.convert_tokens_to_ids("<im_end>")
+        ]
+        return text * num_images, token_ids * num_images
+    def _get_image_repl_features(
+        self,
+        num_images: int,
+        num_patches: int,
+        patch_new_line_idx: Optional[list[bool]],
+    ) -> tuple[str, list[int]]:
+        if num_patches > 0:
+            patch_repl, patch_repl_ids = self._get_patch_repl(
+                num_patches, patch_new_line_idx)
+        else:
+            patch_repl = ""
+            patch_repl_ids = []
+        image_repl, image_repl_ids = self._get_image_repl(num_images)
+        return patch_repl + image_repl, patch_repl_ids + image_repl_ids
+    def replace_placeholder(self, text: str, placeholder: str,
+                            repls: list[str]) -> str:
+        parts = text.split(placeholder)
+        if len(parts) - 1 != len(repls):
+            raise ValueError(
+                "The number of placeholders does not match the number of replacements."  # noqa: E501
+            )
+        result = [parts[0]]
+        for i, repl in enumerate(repls):
+            result.append(repl)
+            result.append(parts[i + 1])
+        return "".join(result)
+    def __call__(
+        self,
+        text: Optional[Union[str, list[str]]] = None,
+        images: ImageInput | None = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs,
+    ) -> BatchFeature:
+        if images is not None:
+            images = self.image_preprocessor.fetch_images(images)
+        if text is None:
+            text = []
+        if not isinstance(text, list):
+            text = [text]
+        if images is None:
+            images = []
+        elif not isinstance(images, list):
+            images = [images]
+        elif isinstance(images[0], list):
+            images = images[0]
+        if len(images) == 0:
+            image_inputs = {}
+            text_inputs = self.tokenizer(text)
+        else:
+            splitted_images_data = self._split_images(images)
+            pixel_values_lst = []
+            patch_pixel_values_lst = []
+            patch_newline_mask_lst = []
+            image_repl_str_lst = []
+            image_repl_ids_lst = []
+            num_patches = []
+            for raw_img, img_patches, patch_newline_mask in splitted_images_data:  # noqa: E501
+                pixel_values_lst.extend(
+                    self._convert_images_to_pixel_values([raw_img]))
+                if len(img_patches) > 0:
+                    patch_pixel_values_lst.extend(
+                        self._convert_images_to_pixel_values(img_patches,
+                                                             is_patch=True))
+                num_patches.append(len(img_patches))
+                image_repl_str, image_repl_ids = self._get_image_repl_features(
+                    1, len(img_patches), patch_newline_mask)
+                image_repl_str_lst.append(image_repl_str)
+                image_repl_ids_lst.extend(image_repl_ids)
+                if patch_newline_mask is not None:
+                    patch_newline_mask_lst.extend(patch_newline_mask)
+            image_inputs = {
+                "pixel_values": torch.cat(pixel_values_lst),
+                "num_patches": num_patches,
+            }
+            if patch_pixel_values_lst:
+                image_inputs["patch_pixel_values"] = torch.cat(
+                    patch_pixel_values_lst)
+            if patch_newline_mask_lst:
+                image_inputs["patch_newline_mask"] = torch.tensor(
+                    patch_newline_mask_lst, dtype=torch.bool)
+            text = [
+                self.replace_placeholder(t, self.image_token,
+                                         image_repl_str_lst) for t in text
+            ]
+            text_inputs = self.tokenizer(text)
+        return BatchFeature(
+            {
+                **text_inputs,
+                **image_inputs,
+            },
+            tensor_type=return_tensors,
+        )
+    def batch_decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to the wrapped tokenizer's [`~PreTrainedTokenizer.batch_decode`].
+        Please refer to the docstring of that method for more information.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+    def decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to the wrapped tokenizer's [`~PreTrainedTokenizer.decode`]. Please
+        refer to the docstring of that method for more information.
+        """
+        return self.tokenizer.decode(*args, **kwargs)
+__all__ = ["Step3VLProcessor"]
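
`get_num_image_tokens` combines the patch grid from `ImagePatcher.get_num_patches` with the fixed feature sizes (169 tokens for the global view, 81 per patch, plus start/end markers and row newlines). The sketch below re-derives that count in plain Python so the arithmetic can be checked without loading the tokenizer. The function names `window_size`, `snap`, `patch_grid`, and `num_image_tokens` are hypothetical, and the code only mirrors the sizing rules shown in the diff above; it does not crop or resize any image.

```python
from math import ceil

MAX_IMAGE_SIZE = 3024  # same cap as in processing_step3.py
IMAGE_TOKENS = 169     # num_image_feature_size
PATCH_TOKENS = 81      # num_patch_feature_size


def window_size(long_side: int, short_side: int) -> int:
    """Mirror of determine_window_size: 0 means no patches are cut."""
    if long_side <= 728:
        return short_side if long_side / short_side > 1.5 else 0
    return min(short_side, 504) if long_side / short_side > 4 else 504


def snap(length: int, window: int) -> int:
    """Mirror of get_image_size_for_crop for a single dimension."""
    ratio = length / window
    if ratio < 1:
        return length
    frac = ratio - length // window
    return window * (int(ratio) + 1 if frac > 0.2 else int(ratio))


def patch_grid(width: int, height: int) -> tuple[int, int]:
    """Return (columns, rows) of the patch grid, or (0, 0) without patches."""
    aspect = width / height
    if min(width, height) < 32 and (aspect > 4 or aspect < 1 / 4):
        width = height = max(width, height)  # square padding for thin strips
    longest = max(width, height)
    if longest > MAX_IMAGE_SIZE:
        scale = MAX_IMAGE_SIZE / longest
        width, height = int(width * scale), int(height * scale)
    win = window_size(max(width, height), min(width, height))
    if win == 0:
        return 0, 0
    width, height = snap(width, win), snap(height, win)
    cols = 1 if width <= win else ceil((width - win) / win + 1)
    rows = 1 if height <= win else ceil((height - win) / win + 1)
    return cols, rows


def num_image_tokens(width: int, height: int) -> int:
    cols, rows = patch_grid(width, height)
    newlines = max(rows - 1, 0)  # one newline token between patch rows
    return cols * rows * (PATCH_TOKENS + 2) + IMAGE_TOKENS + 2 + newlines


if __name__ == "__main__":
    print(num_image_tokens(728, 728))    # 171: global view only
    print(num_image_tokens(1008, 1008))  # 504: 2x2 patches plus global view
    print(num_image_tokens(1512, 504))   # 420: one row of three patches
```

For a 1008x1008 input, for example, the 2x2 grid contributes 4 * (81 + 2) = 332 tokens, the global view 169 + 2 = 171, and the single row break one `<patch_newline>`, for 504 in total.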

processor_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "auto_map": {
+    "AutoProcessor": "processing_step3.Step3VLProcessor"
+  },
+  "processor_class": "Step3VLProcessor"
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<｜begin▁of▁sentence｜>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<｜end▁of▁sentence｜>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,22 @@

+{
+  "add_prefix_space": null,
+  "auto_map": {
+    "AutoProcessor": "processing_step3.Step3VLProcessor"
+  },
+  "backend": "tokenizers",
+  "bos_token": "<｜begin▁of▁sentence｜>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "is_local": true,
+  "legacy": true,
+  "local_files_only": false,
+  "model_max_length": 262144,
+  "pad_token": "<｜▁pad▁｜>",
+  "padding_side": "left",
+  "processor_class": "Step3VLProcessor",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": null,
+  "use_default_system_prompt": false,
+  "chat_template": "{% macro render_message_content(message) %}{% if message.content is none %}{{- '' }}{% elif message.content is string %}{{- message.content }}{% elif message.content is mapping %}{{- message.content['value'] if 'value' in message.content else message.content['text'] }}{% elif message.content is iterable %}{% set ns = namespace(needs_text_separator=false) %}{% for item in message.content %}{% if item.type == 'text' %}{% if ns.needs_text_separator %}{{- ' ' }}{% endif %}{{- item['value'] if 'value' in item else item['text'] }}{% set ns.needs_text_separator = true %}{% elif item.type == 'image' %}<im_patch>{% set ns.needs_text_separator = false %}{% endif %}{% endfor %}{% endif %}{% endmacro %}\n{{bos_token}}{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if reasoning_effort is defined %}\n        {{- \"Reasoning: \" + reasoning_effort + '\\n\\n' }}\n    {%- endif %}\n    {%- if messages[0].role == 'system' %}\n        {{- render_message_content(messages[0]) + '\\n\\n' }}\n    {%- endif %}\n    {{- \"# Tools\\n\\nYou have access to the following functions in JSONSchema format:\\n\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson(ensure_ascii=False) }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nIf you choose to call a function ONLY reply in the following format with NO suffix:\\n\\n<tool_call>\\n<function=example_function_name>\\n<parameter=example_parameter_1>\\nvalue_1\\n</parameter>\\n<parameter=example_parameter_2>\\nThis is the value for the second parameter\\nthat can span\\nmultiple lines\\n</parameter>\\n</function>\\n</tool_call>\\n\\n<IMPORTANT>\\nReminder:\\n- Function calls MUST follow the specified format: an inner <function=...>\\n...\\n</function> block must be nested within <tool_call>\\n...\\n</tool_call> XML tags\\n- Required parameters MUST be specified\\n</IMPORTANT><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0].role == 'system' %}\n        {{- 
'<|im_start|>system\\n' }}\n        {%- if reasoning_effort is defined %}\n            {{- \"Reasoning: \" + reasoning_effort + '\\n\\n' }}\n        {%- endif %}\n        {{- render_message_content(messages[0]) + '<|im_end|>\\n' }}\n    {%- elif reasoning_effort is defined %}\n        {{- '<|im_start|>system\\n' + \"Reasoning: \" + reasoning_effort + '\\n\\n' + '<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == \"user\" and render_message_content(message) is string and not(render_message_content(message).startswith('<tool_response>') and render_message_content(message).endswith('</tool_response>')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- set content = render_message_content(message) %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n        {%- set role_name = 'observation' if (message.role == \"system\" and not loop.first and message.name == 'observation') else message.role %}\n        {{- '<|im_start|>' + role_name + '\\n' + content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {%- if message.reasoning_content is string %}\n            {%- set reasoning_content = message.reasoning_content %}\n        {%- else %}\n            {%- if '</think>' in content %}\n                {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n                {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n            {%- else %}\n                {%- set reasoning_content = '' %}\n            {%- endif %}\n        {%- endif %}\n        {%- if loop.index0 > 
ns.last_query_index %}\n            {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content + '\\n</think>\\n' + content }}\n        {%- else %}\n            {{- '<|im_start|>' + message.role + '\\n' + content }}\n        {%- endif %}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if tool_call.function is defined %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- '<tool_call>\\n<function=' + tool_call.name + '>\\n' }}\n                {%- if tool_call.arguments is defined %}\n                    {%- set arguments = tool_call.arguments | fromjson if tool_call.arguments is string else tool_call.arguments %}\n                    {%- for args_name, args_value in arguments|items %}\n                        {{- '<parameter=' + args_name + '>\\n' }}\n                        {%- set args_value = args_value | tojson(ensure_ascii=False) | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}\n                        {{- args_value }}\n                        {{- '\\n</parameter>\\n' }}\n                    {%- endfor %}\n                {%- endif %}\n                {{- '</function>\\n</tool_call>' }}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>tool_response\\n' }}\n        {%- endif %}\n        {{- '<tool_response>' }}\n        {{- content }}\n        {{- '</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n<think>\\n' }}\n{%- endif %}\n"
+}
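
The assistant branch of the chat template above can be sketched in plain Python. This is a minimal illustration, not the template itself: it mirrors only the fallback used when `message.reasoning_content` is not a string, and it ignores tool calls. The helper names `split_reasoning` and `render_assistant` are hypothetical and do not appear in the repository.

```python
from typing import Tuple


def split_reasoning(content: str) -> Tuple[str, str]:
    """Split '<think>...</think>' reasoning from the visible reply.

    Mirrors the template's fallback string operations.
    """
    if "</think>" not in content:
        # No reasoning block: the template leaves content as-is.
        return "", content
    reasoning = (content.split("</think>")[0].rstrip("\n")
                 .split("<think>")[-1].lstrip("\n"))
    answer = content.split("</think>")[-1].lstrip("\n")
    return reasoning, answer


def render_assistant(content: str, after_last_query: bool) -> str:
    """Render one assistant turn (hypothetical helper, tool calls omitted).

    Reasoning is kept only for turns after the last real user query,
    matching the `loop.index0 > ns.last_query_index` check.
    """
    reasoning, answer = split_reasoning(content)
    if after_last_query:
        body = "<think>\n" + reasoning + "\n</think>\n" + answer
    else:
        body = answer
    return "<|im_start|>assistant\n" + body + "<|im_end|>\n"


if __name__ == "__main__":
    raw = "<think>\nadd 2 and 2\n</think>\n4"
    print(split_reasoning(raw))
    print(render_assistant(raw, after_last_query=False))
```

Earlier assistant turns are therefore re-rendered without their reasoning, while turns after the last user query keep the `<think>` block.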

vision_encoder.py ADDED

	@@ -0,0 +1,452 @@

+from typing import Literal, Optional, Tuple, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers.activations import ACT2FN
+from .configuration_step3p7 import StepRoboticsVisionEncoderConfig
+def rotate_half(x: torch.Tensor) -> torch.Tensor:
+    """Rotate interleaved pairs (a, b) -> (-b, a) along the last dimension (used by RoPE)."""
+    x = x.reshape(*x.shape[:-1], -1, 2)
+    x1, x2 = x.unbind(dim=-1)
+    x = torch.stack((-x2, x1), dim=-1)
+    return x.reshape(*x.shape[:-2], -1)
+def apply_rotary_emb(freqs: torch.Tensor,
+                     t: torch.Tensor,
+                     start_index: int = 0,
+                     scale: float = 1.0,
+                     seq_dim: int = -2) -> torch.Tensor:
+    """Apply 2D rotary embeddings to queries / keys."""
+    dtype = t.dtype
+    if t.ndim == 3:
+        seq_len = t.shape[seq_dim]
+        freqs = freqs[-seq_len:]
+    rot_dim = freqs.shape[-1]
+    end_index = start_index + rot_dim
+    assert rot_dim <= t.shape[-1], (
+        f"feature dimension {t.shape[-1]} is too small for rot_dim {rot_dim}")
+    t_left, t, t_right = (
+        t[..., :start_index],
+        t[..., start_index:end_index],
+        t[..., end_index:],
+    )
+    t = (t * freqs.cos() * scale) + (rotate_half(t) * freqs.sin() * scale)
+    out = torch.cat((t_left, t, t_right), dim=-1)
+    return out.type(dtype)
+class EncoderRope2D(nn.Module):
+    """Cacheable 2D rotary positional embedding."""
+    def __init__(
+        self,
+        dim: int,
+        max_grid_height: int,
+        max_grid_width: int,
+        use_cls_token: bool = False,
+        theta: Union[int, float] = 10000,
+        max_freq: int = 10,
+        num_freqs: int = 1,
+        theta_rescale_factor: float = 1.0,
+    ):
+        super().__init__()
+        self.dim = dim
+        self.max_grid_height = max_grid_height
+        self.max_grid_width = max_grid_width
+        self.use_cls_token = use_cls_token
+        self.theta = theta * theta_rescale_factor**(dim / (dim - 2))
+        self.max_freq = max_freq
+        self.num_freqs = num_freqs
+        cache = self._compute_2d_freqs()
+        self.register_buffer("freqs_cache", cache, persistent=False)
+    def _compute_inv_freq(self, base: Union[int, float],
+                          dim: int) -> torch.Tensor:
+        freqs = 1.0 / (base**(
+            torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))
+        return freqs
+    def _compute_freqs(self, t: torch.Tensor, inv_freq: torch.Tensor):
+        freqs = torch.einsum("..., f -> ... f", t.type(inv_freq.dtype),
+                             inv_freq)
+        freqs = freqs.repeat_interleave(2, dim=-1)
+        return freqs
+    def _compute_2d_freqs(self) -> torch.Tensor:
+        grid_h_range = torch.arange(self.max_grid_height, dtype=torch.float)
+        grid_w_range = torch.arange(self.max_grid_width, dtype=torch.float)
+        if self.use_cls_token:
+            grid_h_range += 1
+            grid_w_range += 1
+        inv_freq = self._compute_inv_freq(self.theta, self.dim // 2)
+        freqs_h = self._compute_freqs(grid_h_range, inv_freq)[:, None].expand(
+            self.max_grid_height, self.max_grid_width, -1)
+        freqs_w = self._compute_freqs(grid_w_range, inv_freq)[None, :].expand(
+            self.max_grid_height, self.max_grid_width, -1)
+        freqs = torch.cat([freqs_w, freqs_h], dim=-1).reshape(
+            self.max_grid_height * self.max_grid_width, -1)
+        if self.use_cls_token:
+            freqs = torch.cat([torch.zeros(1, freqs.shape[-1]), freqs], dim=0)
+        freqs = freqs[None, None, ...]
+        return freqs
+    def forward(self, q: torch.Tensor, k: torch.Tensor,
+                grid_hw: tuple[int, int]):
+        # Reuse the cache directly when the grid matches the cached shape;
+        # otherwise gather the sub-grid's entries from the cache.
+        if grid_hw[0] != self.max_grid_height or grid_hw[1] != self.max_grid_width:
+            rows = torch.arange(grid_hw[0], device=q.device).view(-1, 1)
+            cols = torch.arange(grid_hw[1], device=q.device).view(1, -1)
+            positions = (rows * self.max_grid_width + cols).reshape(-1).to(
+                torch.long)
+            if self.use_cls_token:
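+                # Slot 0 is the cls token; the index tensor must stay integer-typed
+                # for index_select, so create the zero with dtype=torch.long.
+                positions = torch.cat(
+                    [torch.zeros(1, dtype=torch.long, device=q.device),
+                     positions + 1], dim=0)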
+            freqs = self.freqs_cache.index_select(2, positions)
+        else:
+            freqs = self.freqs_cache
+        q = apply_rotary_emb(freqs, q)
+        k = apply_rotary_emb(freqs, k)
+        return q, k
+class EncoderLayerScale(nn.Module):
+    """Per-channel residual scaling used when ls_init_value is set."""
+    def __init__(self, dim: int, init_values: float):
+        super().__init__()
+        self.gamma = nn.Parameter(torch.full((dim,), init_values))
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:  # (B, L, D)
+        return hidden_states * self.gamma
+class EncoderMLP(nn.Module):
+    """Feed-forward network used inside each transformer block."""
+    def __init__(self, hidden_size: int, intermediate_size: int,
+                 hidden_act: str):
+        super().__init__()
+        self.c_fc = nn.Linear(hidden_size, intermediate_size, bias=True)
+        self.act_fn = ACT2FN[hidden_act]
+        self.c_proj = nn.Linear(intermediate_size, hidden_size, bias=True)
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.c_proj(self.act_fn(self.c_fc(hidden_states)))
+        return hidden_states
+class EncoderVisionAttention(nn.Module):
+    """Multi-head self attention with optional 2D RoPE."""
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        max_grid_height: int,
+        max_grid_width: int,
+        use_cls_token: bool = False,
+        use_rope2d: bool = True,
+        rope_theta: Union[int, float] = 10000,
+        rope_max_freq: int = 10,
+        rope_num_freqs: int = 1,
+        rope_theta_rescale_factor: float = 1.0,
+        rope_freqs_for: Literal["lang", "pixel", "constant"] = "lang",
+    ):
+        super().__init__()
+        if hidden_size % num_heads != 0:
+            raise ValueError(
+                f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads})."
+            )
+        self.num_heads = num_heads
+        self.head_dim = hidden_size // num_heads
+        self.scale = self.head_dim**-0.5
+        self.in_proj_weight = nn.Parameter(torch.zeros(hidden_size * 3, hidden_size))
+        self.in_proj_bias = nn.Parameter(torch.zeros(hidden_size * 3))
+        self.out_proj = nn.Linear(hidden_size, hidden_size, bias=True)
+        self.rope = None
+        if use_rope2d:
+            self.rope = EncoderRope2D(
+                dim=self.head_dim,
+                max_grid_height=max_grid_height,
+                max_grid_width=max_grid_width,
+                use_cls_token=use_cls_token,
+                theta=rope_theta,
+                max_freq=rope_max_freq,
+                num_freqs=rope_num_freqs,
+                theta_rescale_factor=rope_theta_rescale_factor,
+            )
+    def forward(self, hidden_states: torch.Tensor, grid_hw: tuple[int, int]) -> torch.Tensor:
+        bsz, seq_len, _ = hidden_states.shape
+        qkv = F.linear(
+            hidden_states,
+            self.in_proj_weight,
+            self.in_proj_bias,
+        )
+        q, k, v = qkv.chunk(3, dim=-1)
+        q = q.view(bsz, seq_len, self.num_heads,
+                   self.head_dim).transpose(1, 2)
+        k = k.view(bsz, seq_len, self.num_heads,
+                   self.head_dim).transpose(1, 2)
+        if self.rope is not None:
+            q, k = self.rope(q, k, grid_hw=grid_hw)
+        v = v.view(bsz, seq_len, self.num_heads,
+                   self.head_dim).transpose(1, 2)
+        attn_output = F.scaled_dot_product_attention(
+            q, k, v, is_causal=False, scale=self.scale)
+        attn_output = attn_output.transpose(1, 2).reshape(
+            bsz, seq_len, self.num_heads * self.head_dim)
+        return self.out_proj(attn_output)
+class EncoderVisionBlock(nn.Module):
+    """A single Vision Transformer block (self-attention + MLP)."""
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        mlp_ratio: float,
+        hidden_act: str,
+        layer_norm_eps: float,
+        ls_init_value: Optional[float] = None,
+        max_grid_height: Optional[int] = None,
+        max_grid_width: Optional[int] = None,
+        use_cls_token: bool = False,
+        use_rope2d: bool = True,
+        rope_kwargs: Optional[dict] = None,
+    ):
+        super().__init__()
+        rope_kwargs = rope_kwargs or {}
+        self.attn = EncoderVisionAttention(
+            hidden_size,
+            num_heads,
+            max_grid_height=max_grid_height,
+            max_grid_width=max_grid_width,
+            use_cls_token=use_cls_token,
+            use_rope2d=use_rope2d,
+            **rope_kwargs,
+        )
+        self.ln_1 = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
+        self.ln_2 = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
+        intermediate = int(hidden_size * mlp_ratio)
+        self.mlp = EncoderMLP(hidden_size, intermediate, hidden_act)
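+        # Layer scale is optional: torch.full cannot take None, so fall back
+        # to identity when ls_init_value is not set.
+        self.ls_1 = (EncoderLayerScale(hidden_size, ls_init_value)
+                     if ls_init_value is not None else nn.Identity())
+        self.ls_2 = (EncoderLayerScale(hidden_size, ls_init_value)
+                     if ls_init_value is not None else nn.Identity())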
+    def forward(self, hidden_states: torch.Tensor,
+                grid_hw: tuple[int, int]) -> torch.Tensor:
+        residual = hidden_states
+        hidden_states = self.ln_1(hidden_states)
+        hidden_states = self.attn(hidden_states, grid_hw=grid_hw)
+        hidden_states = residual + self.ls_1(hidden_states)
+        residual = hidden_states
+        hidden_states = self.ln_2(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + self.ls_2(hidden_states)
+        return hidden_states
+class EncoderVisionTransformer(nn.Module):
+    """Stack of encoder blocks configured from StepRoboticsVisionEncoderConfig fields."""
+    def __init__(
+        self,
+        embed_dim: int,
+        depth: int,
+        num_heads: int,
+        mlp_ratio: float,
+        hidden_act: str,
+        layer_norm_eps: float,
+        ls_init_value: Optional[float] = None,
+        max_grid_height: Optional[int] = None,
+        max_grid_width: Optional[int] = None,
+        use_cls_token: bool = False,
+        use_rope2d: bool = True,
+        rope_kwargs: Optional[dict] = None,
+    ):
+        super().__init__()
+        self.layers = depth
+        rope_kwargs = rope_kwargs or {}
+        self.resblocks = nn.ModuleList([
+            EncoderVisionBlock(embed_dim, num_heads, mlp_ratio, hidden_act,
+                               layer_norm_eps,
+                               max_grid_height=max_grid_height,
+                               max_grid_width=max_grid_width,
+                               use_cls_token=use_cls_token,
+                               use_rope2d=use_rope2d,
+                               ls_init_value=ls_init_value,
+                               rope_kwargs=rope_kwargs)
+            for _ in range(depth)
+        ])
+    def forward(self,
+                hidden_states: torch.Tensor,
+                grid_hw: tuple[int, int]) -> torch.Tensor:
+        for block in self.resblocks:
+            hidden_states = block(hidden_states, grid_hw=grid_hw)
+        return hidden_states
+class StepRoboticsVisionEncoder(nn.Module):
+    """
+    Vision encoder built from StepRoboticsVisionEncoderConfig.
+    The encoder performs patch embedding followed by a stack of transformer
+    blocks. Only the config fields defined in StepRoboticsVisionEncoderConfig (and
+    StepRoboticVLConfig.vision_config) are expected.
+    """
+    def __init__(self, config: StepRoboticsVisionEncoderConfig):
+        super().__init__()
+        self.config = config
+        # Align commonly used attributes so downstream code (e.g. StepRoboticVL)
+        # can access them without extra renaming.
+        self.hidden_size = config.width
+        self.num_heads = config.heads
+        self.num_hidden_layers = config.layers
+        self.patch_size = config.patch_size
+        self.image_size = config.image_size
+        self.use_cls_token = getattr(config, "use_cls_token", False)
+        self.use_rope2d = getattr(config, "use_rope2d", True)
+        self.use_abs_posemb = getattr(config, "use_abs_posemb", True)
+        self.layer_norm_eps = config.layer_norm_eps
+        self.mlp_ratio = getattr(config, "mlp_ratio", 8960 / 1536)
+        self.ls_init_value = getattr(config, "ls_init_value", None)
+        self.hidden_act = config.hidden_act
+        self.use_ln_pre = getattr(config, "use_ln_pre", False)
+        self.use_ln_post = getattr(config, "use_ln_post", True)
+        # Patch embedding.
+        self.conv1 = nn.Conv2d(in_channels=config.num_channels,
+                               out_channels=self.hidden_size,
+                               kernel_size=self.patch_size,
+                               stride=self.patch_size,
+                               bias=False)
+        self.ln_pre = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps) if self.use_ln_pre else nn.Identity()
+        self.ln_post = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps) if self.use_ln_post else nn.Identity()
+        grid_size = self.image_size // self.patch_size
+        self.base_grid = (grid_size, grid_size)
+        if self.use_cls_token:
+            self.class_embedding = nn.Parameter(
+                torch.randn(self.hidden_size) * (self.hidden_size**-0.5))
+        else:
+            self.class_embedding = None
+        if self.use_abs_posemb:
+            self.posemb_grid_size = self.image_size // self.patch_size
+            self.positional_embedding = nn.Parameter(
+                (self.hidden_size**-0.5) * torch.randn(
+                    int(self.use_cls_token) + self.posemb_grid_size**2,
+                    self.hidden_size,
+                ))
+        self.transformer = EncoderVisionTransformer(
+            embed_dim=self.hidden_size,
+            depth=self.num_hidden_layers,
+            num_heads=self.num_heads,
+            mlp_ratio=self.mlp_ratio,
+            hidden_act=self.hidden_act,
+            layer_norm_eps=self.layer_norm_eps,
+            ls_init_value=self.ls_init_value,
+            max_grid_height=self.base_grid[0],
+            max_grid_width=self.base_grid[1],
+            use_cls_token=self.use_cls_token,
+            use_rope2d=self.use_rope2d,
+            rope_kwargs={
+                "rope_theta": getattr(config, "rope_theta", 10000),
+                "rope_max_freq": getattr(config, "rope_max_freq", 10),
+                "rope_num_freqs": getattr(config, "rope_num_freqs", 1),
+                "rope_theta_rescale_factor":
+                getattr(config, "rope_theta_rescale_factor", 1.0),
+                "rope_freqs_for": getattr(config, "rope_freqs_for", "lang"),
+            },
+        )
+        self.vit_downsampler1 = nn.Conv2d(self.hidden_size,
+                                          self.hidden_size * 2,
+                                          kernel_size=3,
+                                          stride=2,
+                                          padding=1)
+        self.vit_downsampler2 = nn.Conv2d(self.hidden_size * 2,
+                                          self.hidden_size * 4,
+                                          kernel_size=3,
+                                          stride=2,
+                                          padding=1)
+    def sample_abs_posemb(self, grid_h: int, grid_w: int):
+        if self.posemb_grid_size == grid_h and self.posemb_grid_size == grid_w:
+            return self.positional_embedding[None, ...]
+        pos_embed = self.positional_embedding
+        if self.use_cls_token:
+            cls_token_embed, pos_embed = pos_embed[:1], pos_embed[1:]
+        pos_embed = (pos_embed.reshape(1, self.posemb_grid_size,
+                                       self.posemb_grid_size,
+                                       -1).permute(0, 3, 1, 2).contiguous())
+        pos_embed = F.interpolate(pos_embed,
+                                  size=(grid_h, grid_w),
+                                  mode="bilinear",
+                                  align_corners=False)
+        pos_embed = pos_embed.permute(0, 2, 3, 1).reshape(-1, self.hidden_size)
+        if self.use_cls_token:
+            pos_embed = torch.cat([cls_token_embed, pos_embed], dim=0)
+        return pos_embed[None, ...]
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            pixel_values: Image tensor of shape (B, C, H, W).
+        Returns:
+            Patch features of shape (B, Gh * Gw, D), where Gh = H // patch_size
+            and Gw = W // patch_size. The cls token, if used, is removed.
+        """
+        bsz, _, height, width = pixel_values.shape
+        grid_h, grid_w = height // self.patch_size, width // self.patch_size
+        hidden_state = self.conv1(pixel_values)  # (B, D, Gh, Gw)
+        hidden_state = hidden_state.flatten(2).transpose(1, 2)  # (B, Gh*Gw, D)
+        if self.use_cls_token:
+            cls_token = self.class_embedding.view(1, 1,
+                                                  -1).expand(bsz, -1, -1)
+            hidden_state = torch.cat([cls_token, hidden_state], dim=1)
+        if self.use_abs_posemb:
+            pos_emb = self.sample_abs_posemb(grid_h, grid_w)
+            hidden_state = hidden_state + pos_emb
+        hidden_state = self.ln_pre(hidden_state)
+        hidden_state = self.transformer(hidden_state, grid_hw=(grid_h, grid_w))
+        if self.use_ln_post:
+            hidden_state = self.ln_post(hidden_state)
+        if self.use_cls_token:
+            hidden_state = hidden_state[:, 1:, :]
+        return hidden_state
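
The grid arithmetic in `StepRoboticsVisionEncoder.forward` and the cache lookup in `EncoderRope2D.forward` can be sketched without PyTorch. This is a minimal illustration under stated assumptions: `patch_grid` and `cache_positions` are hypothetical helper names, and the patch size of 14 in the usage lines is an example value, not one taken from the model config.

```python
from typing import List, Tuple


def patch_grid(height: int, width: int, patch_size: int) -> Tuple[int, int]:
    """Patch grid (Gh, Gw) for an image, as computed in the encoder's forward."""
    return height // patch_size, width // patch_size


def cache_positions(grid_hw: Tuple[int, int],
                    max_grid_width: int,
                    use_cls_token: bool = False) -> List[int]:
    """Row-major indices into a cache laid out as (max_grid_height * max_grid_width).

    Mirrors `rows * max_grid_width + cols` followed by a flatten. With a cls
    token, slot 0 is reserved for it and every patch index shifts by one.
    """
    grid_h, grid_w = grid_hw
    positions = [r * max_grid_width + c
                 for r in range(grid_h)
                 for c in range(grid_w)]
    if use_cls_token:
        positions = [0] + [p + 1 for p in positions]
    return positions


if __name__ == "__main__":
    # Example values only: a 224x448 image with an assumed patch size of 14.
    print(patch_grid(224, 448, 14))
    # A 2x3 sub-grid read from a cache that is 4 patches wide.
    print(cache_positions((2, 3), max_grid_width=4))
    print(cache_positions((2, 3), max_grid_width=4, use_cls_token=True))
```

Because the indices are computed against the cached width, a grid larger than `(max_grid_height, max_grid_width)` would index past the end of the cache, so inputs are presumably expected to fit within the configured image size.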