Upload folder using huggingface_hub

Files changed (10)

.gitattributes +2 -34
README.md +110 -0
config.json +345 -0
configuration_step3p7.py +207 -0
generation_config.json +10 -0
model.safetensors +3 -0
model.safetensors.index.json +59 -0
special_tokens_map.json +23 -0
tokenizer.json +0 -0
tokenizer_config.json +0 -0

.gitattributes CHANGED

@@ -1,35 +1,3 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text

 *.safetensors filter=lfs diff=lfs merge=lfs -text
+*.json -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,110 @@

+---
+license: apache-2.0
+base_model:
+  - stepfun-ai/Step-3.7-Flash
+  - stepfun-ai/Step-3.7-Flash-NVFP4
+tags:
+  - speculative-decoding
+  - mtp
+  - multi-token-prediction
+  - vllm
+  - nvfp4
+  - step3
+language:
+  - en
+  - zh
+  - ja
+library_name: vllm
+pipeline_tag: text-generation
+---
+# Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)
+A tiny **Multi-Token-Prediction (MTP / nextn) draft** for **`stepfun-ai/Step-3.7-Flash-NVFP4`**, so you can run
+**speculative decoding** on the NVFP4 checkpoint in vLLM.
+> **Why this exists:** the official `Step-3.7-Flash-NVFP4` checkpoint **declares**
+> `num_nextn_predict_layers: 3` in its config but **ships zero MTP weights** — the
+> 3 nextn layers were dropped during quantization, and the per-layer config arrays
+> were truncated to 45 (so even loading them would `IndexError`). The BF16 and FP8
+> releases keep the MTP weights, but **the NVFP4 one — the SM120-friendly, smallest
+> one — cannot do speculative decoding out of the box.** This repo is the missing
+> piece: the 3 MTP layers extracted from the BF16 release, kept in BF16 (they're
+> tiny), packaged as a vLLM-loadable draft.
+- **~5.9 GB**, BF16. Base = NVFP4 (mixed precision is fine; the draft is small).
+- **Lossless** in the speculative sense: vLLM's rejection sampling provably matches
+  the target distribution; at `temperature=0` it follows the target's greedy tokens.
+- Drop-in: point vLLM's `--speculative-config` at this directory.
+## Usage (vLLM `stepfun37` image, or any vLLM build that includes `Step3p5MTP`)
+The draft is auto-routed to vLLM's native `Step3p5MTP` + `Step3p5MTPProposer`
+because its config is `model_type: step3p7` with `num_nextn_predict_layers > 0`.
+```bash
+docker run -d --gpus all --ipc=host --shm-size=64g --network host \
+  -v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
+  -v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
+  vllm/vllm-openai:stepfun37 \
+  /model \
+    --served-model-name step3p7 --port 8000 \
+    --trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
+    --quantization modelopt --kv-cache-dtype fp8 \
+    --max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
+    --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'
+```
+JSON for `--speculative-config` must have **no spaces** (brace-expansion safety).
+**`num_speculative_tokens: 1` (K=1) is the sweet spot** — see below.
+## Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)
+Measured on the NVFP4 base + this draft, K=1, vs. NVFP4 with speculation off.
+`per_req` = decode tok/s a single user feels (prefill excluded). Acceptance ≈ **0.80** in production traffic.
+**Single-stream decode (short context):**
+| workload | base | + MTP K=1 | speedup | accept |
+|---|---|---|---|---|
+| free-form | 106.8 | **125.5** | +17.5% | 0.77 |
+| code | 106.7 | **133.7** | +25.3% | 0.88 |
+| Japanese | 107.0 | **129.3** | +20.9% | 0.80 |
+| tool-call | 106.9 | **135.4** | +26.6% | 0.90 |
+**Decode speedup tends to grow with context length** (longer KV → base is more
+memory-bound → bigger speculative win; `C` = concurrent requests):
+| context | C=1 | C=2 | C=4 | C=8 |
+|---|---|---|---|---|
+| 1K | +20% | +8% | +32% | +34% |
+| 8K | +22% | +24% | +25% | **+44%** |
+| 32K | +22% | +26% | +20% | +17% |
+| **128K** | **+28%** | **+33%** | **+38%** | — |
+Net-positive across the whole concurrency range we tested (MoE stays memory-bound
+to high batch). Best `K`: **K=1** (K=2/K=3 lose to draft cost — later positions
+have lower acceptance and add forward cost). NaN-free on SM120 (Gate0 5/5).
+## How it was built (reproducible)
+The draft is **not retrained** — it's the original StepFun MTP layers, extracted verbatim:
+1. From `stepfun-ai/Step-3.7-Flash` (BF16), take 52 tensors: the 51 tensors of
+   `model.layers.{45,46,47}.*` (the 3 nextn layers, dense-MLP, 17 tensors each)
+   plus `model.embed_tokens.weight`. They all live in one shard
+   (`model-00024.safetensors`).
+2. Keep the **original BF16 weight names** — vLLM's `Step3p5MTP` loader does its own
+   renaming (`.transformer.` strip, `shared_head.output→head`, `.mtp_block.` insert).
+3. `config.json` = the **BF16 original** config (NOT the NVFP4 one): its per-layer
+   arrays (`layer_types`, `partial_rotary_factors`, …) are length 48 and cover the
+   MTP layer indices 45-47. **Strip `quantization_config`** so the draft loads as BF16.
+Full scripts + benchmark harness: **[GitHub repo](#)** (`build_draft.py`,
+`launch_mtp.sh`, `eval_mtp.py`, `bench_matrix.py`).
+## License & attribution
+Apache-2.0, inherited from the base model **`stepfun-ai/Step-3.7-Flash`**. These are
+StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft).
+All credit for the model and the MTP layers goes to StepFun.
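
Step 1 of "How it was built" amounts to filtering the BF16 checkpoint's weight map by layer index. The sketch below is not the author's `build_draft.py`: `select_draft_tensors` and the toy weight map are hypothetical names made up for illustration, and it operates on the `weight_map` dictionary of an index file rather than on real tensors.

```python
import re

# Hypothetical helper for illustration; not taken from build_draft.py.
MTP_LAYERS = (45, 46, 47)                  # nextn layer indices named in the README
EMBED_NAME = "model.embed_tokens.weight"   # shared embedding, also kept in the draft
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def select_draft_tensors(weight_map, mtp_layers=MTP_LAYERS):
    """Return {tensor_name: shard_file} for the MTP layers plus the embedding."""
    selected = {}
    for name, shard in weight_map.items():
        match = _LAYER_RE.match(name)
        is_mtp_layer = match is not None and int(match.group(1)) in mtp_layers
        if name == EMBED_NAME or is_mtp_layer:
            selected[name] = shard  # original name kept; the loader renames later
    return selected


# Toy weight map standing in for the BF16 index file (entries are illustrative).
toy_map = {
    "model.embed_tokens.weight": "model-00024.safetensors",
    "model.layers.44.mlp.up_proj.weight": "model-00023.safetensors",
    "model.layers.45.eh_proj.weight": "model-00024.safetensors",
    "model.layers.47.transformer.shared_head.output.weight": "model-00024.safetensors",
    "lm_head.weight": "model-00024.safetensors",
}
draft = select_draft_tensors(toy_map)
print(sorted(draft))
```

On the real index, the same filter would be expected to return the 52 names listed in this commit's `model.safetensors.index.json`.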

config.json ADDED

	@@ -0,0 +1,345 @@

+{
+  "architectures": [
+    "Step3p7ForConditionalGeneration"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_step3p7.Step3p7Config",
+    "AutoProcessor": "processing_step3.Step3VLProcessor",
+    "AutoModelForCausalLM": "modeling_step3p7.Step3p7ForConditionalGeneration"
+  },
+  "model_type": "step3p7",
+  "im_end_token": "<im_end>",
+  "im_patch_token": "<im_patch>",
+  "im_start_token": "<im_start>",
+  "image_token_len": 169,
+  "patch_token_len": 81,
+  "image_token_id": 128001,
+  "understand_projector_stride": 2,
+  "use_im_start_end": "true",
+  "vision_select_layer": -1,
+  "projector_bias": false,
+  "vision_config": {
+    "model_type": "perception_encoder",
+    "image_size": 728,
+    "patch_size": 14,
+    "width": 1536,
+    "layers": 47,
+    "heads": 16,
+    "pool_type": "none",
+    "output_dim": null,
+    "use_cls_token": false,
+    "ls_init_value": 0.1,
+    "use_ln_post": false,
+    "hidden_act": "quick_gelu"
+  },
+  "text_config": {
+    "architectures": [
+      "Step3p5ForCausalLM"
+    ],
+    "rope_scaling": {
+      "rope_type": "llama3",
+      "factor": 2.0,
+      "original_max_position_embeddings": 131072,
+      "low_freq_factor": 1.0,
+      "high_freq_factor": 32.0
+    },
+    "yarn_only_types": [
+      "full_attention"
+    ],
+    "model_type": "step3p5",
+    "hidden_size": 4096,
+    "intermediate_size": 11264,
+    "num_hidden_layers": 45,
+    "max_seq_len": 262144,
+    "max_position_embeddings": 262144,
+    "vocab_size": 128896,
+    "torch_dtype": "bfloat16",
+    "use_qk_norm": false,
+    "moe_layers_enum": "3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44",
+    "use_mfa": false,
+    "num_attention_heads": 64,
+    "num_attention_groups": 8,
+    "head_dim": 128,
+    "use_moe": true,
+    "moe_num_experts": 288,
+    "moe_top_k": 8,
+    "moe_intermediate_size": 1280,
+    "share_expert_dim": 1280,
+    "moe_layer_offset": 0,
+    "moe_every_n_layer": 1,
+    "norm_expert_weight": true,
+    "moe_router_activation": "sigmoid",
+    "moe_router_scaling_factor": 3.0,
+    "att_impl_type": "GQA",
+    "num_nextn_predict_layers": 3,
+    "rope_theta": [
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0,
+      5000000.0,
+      10000.0,
+      10000.0,
+      10000.0
+    ],
+    "use_head_wise_attn_gate": true,
+    "sliding_window": 512,
+    "use_moe_router_bias": true,
+    "need_fp32_gate": true,
+    "sink": false,
+    "layer_types": [
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "sliding_attention"
+    ],
+    "use_rope_layers": [],
+    "partial_rotary_factors": [
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0,
+      0.5,
+      1.0,
+      1.0,
+      1.0
+    ],
+    "eos_token_id": [
+      1,
+      2,
+      128007
+    ],
+    "bos_token_id": 0,
+    "attention_other_setting": {
+      "attention_type": "sliding_attention",
+      "num_attention_heads": 96,
+      "num_attention_groups": 8,
+      "head_dim": 128,
+      "true_head_dim": 128
+    },
+    "swiglu_limits": [
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      7,
+      7,
+      0.0,
+      0.0,
+      0.0
+    ],
+    "swiglu_limits_shared": [
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      0.0,
+      16,
+      16,
+      0.0,
+      0.0,
+      0.0
+    ]
+  }
+}
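
The README says the draft needs per-layer arrays of length 48 (45 decoder layers plus 3 nextn layers), and that the NVFP4 config truncates them to 45. A small check like the following can confirm that for any `text_config`. It is a sketch: `check_per_layer_arrays` and the `PER_LAYER_FIELDS` tuple are assumptions made for illustration, and empty lists (such as `use_rope_layers` here) are treated as "not set".

```python
# Fields assumed to be indexed by layer; chosen by reading this config.json.
PER_LAYER_FIELDS = (
    "rope_theta",
    "layer_types",
    "partial_rotary_factors",
    "swiglu_limits",
    "swiglu_limits_shared",
    "use_rope_layers",
)


def check_per_layer_arrays(text_config):
    """Return {field: actual_length} for per-layer lists with an unexpected length."""
    expected = (text_config["num_hidden_layers"]
                + text_config.get("num_nextn_predict_layers", 0))
    mismatched = {}
    for field in PER_LAYER_FIELDS:
        value = text_config.get(field)
        if not isinstance(value, list) or not value:
            continue  # missing, scalar, or empty: nothing to validate
        if len(value) != expected:
            mismatched[field] = len(value)
    return mismatched


# Toy configs: a full 48-entry array and one truncated to the 45 decoder layers.
pattern = ["full_attention"] + ["sliding_attention"] * 3
full = {
    "num_hidden_layers": 45,
    "num_nextn_predict_layers": 3,
    "layer_types": pattern * 12,
    "use_rope_layers": [],
}
truncated = dict(full, layer_types=full["layer_types"][:45])

print(check_per_layer_arrays(full))       # → {}
print(check_per_layer_arrays(truncated))  # → {'layer_types': 45}
```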

configuration_step3p7.py ADDED

	@@ -0,0 +1,207 @@

+from typing import Any, Optional, Sequence, Union
+from transformers.configuration_utils import PretrainedConfig
+class StepRoboticsVisionEncoderConfig(PretrainedConfig):
+    model_type = "perception_encoder"
+    def __init__(
+        self,
+        width=1536,
+        layers=47,
+        heads=16,
+        num_channels=3,
+        image_size=728,
+        mlp_ratio = 8960/1536,
+        patch_size=14,
+        hidden_act="quick_gelu",
+        layer_norm_eps=1e-5,
+        ues_cls_token=False,
+        use_cls_token: Optional[bool] = None,
+        use_ln_pre=True,
+        use_ln_post=False,
+        use_abs_posemb=True,
+        use_rope2d=True,
+        ls_init_value=0.1,
+        **kwargs,
+    ):
+        self.width = width
+        self.layers = layers
+        self.heads = heads
+        self.num_channels = num_channels
+        self.patch_size = patch_size
+        self.image_size = image_size
+        self.mlp_ratio = mlp_ratio
+        self.layer_norm_eps = layer_norm_eps
+        self.hidden_act = hidden_act
+        if use_cls_token is None:
+            use_cls_token = ues_cls_token
+        self.ues_cls_token = use_cls_token
+        self.use_cls_token = use_cls_token
+        self.use_ln_pre = use_ln_pre
+        self.ls_init_value = ls_init_value
+        self.use_ln_post = use_ln_post
+        self.use_abs_posemb = use_abs_posemb
+        self.use_rope2d = use_rope2d
+        super().__init__(**kwargs)
+class Step3p7TextConfig(PretrainedConfig):
+    model_type = "step3p5"
+    architectures = ["Step3p5ForCausalLM"]
+    def __init__(
+        self,
+        hidden_size: int = 4096,
+        intermediate_size: int = 11264,
+        num_attention_heads: int = 64,
+        num_attention_groups: int = 8,
+        num_hidden_layers: int = 45,
+        max_seq_len: int = 128000,
+        vocab_size: int = 128815,
+        rms_norm_eps: float = 1e-5,
+        moe_intermediate_size: int = 1280,
+        moe_num_experts: int = 288,
+        moe_top_k: int = 8,
+        rope_theta: float = 10000,
+        rope_scaling: Optional[dict[str, Any]] = None,
+        max_position_embeddings: int = 128000,
+        share_expert_dims: int = 1280,
+        share_expert_dim: Optional[int] = None,
+        head_dim: int = 128,
+        norm_expert_weight: bool = True,
+        layer_types: list[str] = None,
+        sliding_window: Optional[int] = None,
+        pad_token_id: int = 1,
+        attention_dropout: float = 0.0,
+        use_head_wise_attn_gate: bool = False,
+        use_moe_router_bias: bool = False,
+        moe_router_activation: str = "softmax",
+        moe_router_scaling_factor: float = 1.0,
+        need_fp32_gate: bool = False,
+        attention_other_setting: Optional[dict[str, Any]] = None,
+        swiglu_limits: Optional[list[Optional[float]]] = None,
+        swiglu_limits_shared: Optional[list[Optional[float]]] = None,
+        use_rope_layers: Optional[list[bool]] = None,
+        yarn_only_types: Optional[list[str]] = None,
+        moe_layers_enum: tuple[int] = (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
+                                       15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
+                                       25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
+                                       35, 36, 37, 38, 39, 40, 41, 42, 43, 44),
+        **kwargs,
+    ) -> None:
+        torch_dtype = kwargs.get("torch_dtype")
+        trim_layer_types = _normalize_per_layer_values(layer_types,
+                                                  num_hidden_layers)
+        if isinstance(rope_scaling, dict):
+            rope_scaling = dict(rope_scaling)
+        if share_expert_dim is None:
+            share_expert_dim = share_expert_dims
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_attention_heads = num_attention_heads
+        self.num_attention_groups = num_attention_groups
+        self.num_hidden_layers = num_hidden_layers
+        self.max_seq_len = max_seq_len
+        self.vocab_size = vocab_size
+        self.rms_norm_eps = rms_norm_eps
+        self.moe_intermediate_size = moe_intermediate_size
+        self.moe_num_experts = moe_num_experts
+        self.moe_top_k = moe_top_k
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.max_position_embeddings = max_position_embeddings
+        self.share_expert_dim = share_expert_dim
+        self.head_dim = head_dim
+        self.norm_expert_weight = norm_expert_weight
+        self.moe_layers_enum = moe_layers_enum
+        self.layer_types = trim_layer_types
+        self.sliding_window = sliding_window
+        self.pad_token_id = pad_token_id
+        self.attention_dropout = attention_dropout
+        self.use_head_wise_attn_gate = use_head_wise_attn_gate
+        self.use_moe_router_bias = use_moe_router_bias
+        self.moe_router_activation = moe_router_activation
+        self.moe_router_scaling_factor = moe_router_scaling_factor
+        self.need_fp32_gate = need_fp32_gate
+        self.attention_other_setting = attention_other_setting
+        self.swiglu_limits = swiglu_limits
+        self.swiglu_limits_shared = swiglu_limits_shared
+        self.use_rope_layers = use_rope_layers
+        self.yarn_only_types = yarn_only_types
+        super().__init__(**kwargs)
+        if torch_dtype is not None:
+            self.torch_dtype = torch_dtype
+        self.layer_types = layer_types
+    def to_dict(self):
+        output = super().to_dict()
+        torch_dtype = getattr(self, "torch_dtype", None)
+        if torch_dtype is not None:
+            output["torch_dtype"] = torch_dtype
+        return output
+def _normalize_per_layer_values(
+    values: Optional[Sequence[Any]],
+    num_hidden_layers: int,
+) -> Optional[list[Any]]:
+    if values is None:
+        return None
+    normalized = list(values)
+    if not normalized:
+        return normalized
+    if len(normalized) < num_hidden_layers:
+        normalized.extend([normalized[-1]] *
+                          (num_hidden_layers - len(normalized)))
+    # Some checkpoints keep MTP/spec layer entries after the decoder layers.
+    # This config only builds num_hidden_layers decoder layers, and HF strict
+    # validation requires per-layer fields to match that decoder count.
+    return normalized[:num_hidden_layers]
+class Step3p7Config(PretrainedConfig):
+    # This loader is a compatibility shim for original Step VL checkpoints
+    # whose top-level config model_type is `step3p7`.
+    model_type = "step3p7"
+    def __init__(
+        self,
+        vision_config: Optional[Union[dict, StepRoboticsVisionEncoderConfig]] = None,
+        text_config: Optional[Union[dict, Step3p7TextConfig]] = None,
+        understand_projector_stride: int = 2,
+        projector_bias: bool = False,
+        image_token_id: int = 151679,
+        **kwargs,
+    ) -> None:
+        shared_rope_scaling = kwargs.get("rope_scaling")
+        if isinstance(shared_rope_scaling, dict):
+            shared_rope_scaling = dict(shared_rope_scaling)
+        if vision_config is None:
+            vision_config = StepRoboticsVisionEncoderConfig()
+        elif isinstance(vision_config, dict):
+            vision_config = StepRoboticsVisionEncoderConfig(**vision_config)
+        self.vision_config = vision_config
+        if text_config is None:
+            text_config = Step3p7TextConfig(rope_scaling=shared_rope_scaling)
+        elif isinstance(text_config, dict):
+            text_config = dict(text_config)
+            if shared_rope_scaling is not None and "rope_scaling" not in text_config:
+                text_config["rope_scaling"] = shared_rope_scaling
+            text_config = Step3p7TextConfig(**text_config)
+        elif shared_rope_scaling is not None and text_config.rope_scaling is None:
+            text_config.rope_scaling = dict(shared_rope_scaling)
+        self.text_config = text_config
+        rope_scaling = kwargs.get("rope_scaling")
+        if isinstance(rope_scaling, dict):
+            kwargs["rope_scaling"] = dict(rope_scaling)
+        self.understand_projector_stride = understand_projector_stride
+        self.projector_bias = projector_bias
+        self.hidden_size = text_config.hidden_size
+        self.max_position_embeddings = text_config.max_position_embeddings
+        self.image_token_id = image_token_id
+        # Help Auto classes find the correct implementation when saving/loading.
+        super().__init__(**kwargs)

generation_config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 0,
+  "eos_token_id": [
+    1,
+    2,
+    128007
+  ],
+  "transformers_version": "4.56.2"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dc7600d64dba5fc566a9a00d09f3f4fa7691aa6eccc27b606a225c5ff7cbc7bc
+size 5912264080

model.safetensors.index.json ADDED

	@@ -0,0 +1,59 @@

+{
+  "metadata": {
+    "total_size": 5912258048
+  },
+  "weight_map": {
+    "model.embed_tokens.weight": "model.safetensors",
+    "model.layers.45.eh_proj.weight": "model.safetensors",
+    "model.layers.45.enorm.weight": "model.safetensors",
+    "model.layers.45.hnorm.weight": "model.safetensors",
+    "model.layers.45.input_layernorm.weight": "model.safetensors",
+    "model.layers.45.mlp.down_proj.weight": "model.safetensors",
+    "model.layers.45.mlp.gate_proj.weight": "model.safetensors",
+    "model.layers.45.mlp.up_proj.weight": "model.safetensors",
+    "model.layers.45.post_attention_layernorm.weight": "model.safetensors",
+    "model.layers.45.self_attn.g_proj.weight": "model.safetensors",
+    "model.layers.45.self_attn.k_norm.weight": "model.safetensors",
+    "model.layers.45.self_attn.k_proj.weight": "model.safetensors",
+    "model.layers.45.self_attn.o_proj.weight": "model.safetensors",
+    "model.layers.45.self_attn.q_norm.weight": "model.safetensors",
+    "model.layers.45.self_attn.q_proj.weight": "model.safetensors",
+    "model.layers.45.self_attn.v_proj.weight": "model.safetensors",
+    "model.layers.45.transformer.shared_head.norm.weight": "model.safetensors",
+    "model.layers.45.transformer.shared_head.output.weight": "model.safetensors",
+    "model.layers.46.eh_proj.weight": "model.safetensors",
+    "model.layers.46.enorm.weight": "model.safetensors",
+    "model.layers.46.hnorm.weight": "model.safetensors",
+    "model.layers.46.input_layernorm.weight": "model.safetensors",
+    "model.layers.46.mlp.down_proj.weight": "model.safetensors",
+    "model.layers.46.mlp.gate_proj.weight": "model.safetensors",
+    "model.layers.46.mlp.up_proj.weight": "model.safetensors",
+    "model.layers.46.post_attention_layernorm.weight": "model.safetensors",
+    "model.layers.46.self_attn.g_proj.weight": "model.safetensors",
+    "model.layers.46.self_attn.k_norm.weight": "model.safetensors",
+    "model.layers.46.self_attn.k_proj.weight": "model.safetensors",
+    "model.layers.46.self_attn.o_proj.weight": "model.safetensors",
+    "model.layers.46.self_attn.q_norm.weight": "model.safetensors",
+    "model.layers.46.self_attn.q_proj.weight": "model.safetensors",
+    "model.layers.46.self_attn.v_proj.weight": "model.safetensors",
+    "model.layers.46.transformer.shared_head.norm.weight": "model.safetensors",
+    "model.layers.46.transformer.shared_head.output.weight": "model.safetensors",
+    "model.layers.47.eh_proj.weight": "model.safetensors",
+    "model.layers.47.enorm.weight": "model.safetensors",
+    "model.layers.47.hnorm.weight": "model.safetensors",
+    "model.layers.47.input_layernorm.weight": "model.safetensors",
+    "model.layers.47.mlp.down_proj.weight": "model.safetensors",
+    "model.layers.47.mlp.gate_proj.weight": "model.safetensors",
+    "model.layers.47.mlp.up_proj.weight": "model.safetensors",
+    "model.layers.47.post_attention_layernorm.weight": "model.safetensors",
+    "model.layers.47.self_attn.g_proj.weight": "model.safetensors",
+    "model.layers.47.self_attn.k_norm.weight": "model.safetensors",
+    "model.layers.47.self_attn.k_proj.weight": "model.safetensors",
+    "model.layers.47.self_attn.o_proj.weight": "model.safetensors",
+    "model.layers.47.self_attn.q_norm.weight": "model.safetensors",
+    "model.layers.47.self_attn.q_proj.weight": "model.safetensors",
+    "model.layers.47.self_attn.v_proj.weight": "model.safetensors",
+    "model.layers.47.transformer.shared_head.norm.weight": "model.safetensors",
+    "model.layers.47.transformer.shared_head.output.weight": "model.safetensors"
+  }
+}
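
The `total_size` in this index can be cross-checked against `config.json` with a little arithmetic. The sketch below is a sanity check, not part of the commit. The tensor shapes are inferred from the config rather than read from the checkpoint: layers 45-47 are taken to be sliding-attention layers using `attention_other_setting`, `eh_proj` is assumed to map a concatenated `2 * hidden_size` input to `hidden_size`, and `g_proj` is assumed to emit one gate value per attention head.

```python
# Values copied from config.json (text_config) in this commit.
HIDDEN = 4096          # hidden_size
VOCAB = 128896         # vocab_size
INTERMEDIATE = 11264   # intermediate_size of the dense MLP
HEAD_DIM = 128         # head_dim
Q_HEADS = 96           # attention_other_setting.num_attention_heads
KV_GROUPS = 8          # attention_other_setting.num_attention_groups
BYTES_PER_PARAM = 2    # BF16


def mtp_layer_params():
    """Parameter count of one nextn layer under the assumed shapes."""
    eh_proj = 2 * HIDDEN * HIDDEN                  # assumed (2*hidden) -> hidden
    mlp = 3 * HIDDEN * INTERMEDIATE                # gate_proj, up_proj, down_proj
    q_and_o = 2 * HIDDEN * Q_HEADS * HEAD_DIM      # q_proj and o_proj
    k_and_v = 2 * HIDDEN * KV_GROUPS * HEAD_DIM    # k_proj and v_proj
    g_proj = HIDDEN * Q_HEADS                      # assumed head-wise gate
    norms = 5 * HIDDEN + 2 * HEAD_DIM              # 5 hidden-size norms + q/k norm
    shared_head = VOCAB * HIDDEN                   # shared_head.output
    return eh_proj + mlp + q_and_o + k_and_v + g_proj + norms + shared_head


embed_params = VOCAB * HIDDEN                      # model.embed_tokens.weight
total_bytes = BYTES_PER_PARAM * (embed_params + 3 * mtp_layer_params())
vocab_bytes = BYTES_PER_PARAM * 4 * VOCAB * HIDDEN  # embedding + 3 output heads

print(total_bytes)   # → 5912258048
print(vocab_bytes)   # → 4223664128
```

Under these assumptions the result equals the `total_size` field exactly, and about 4.2 GB of the ~5.9 GB comes from the embedding table and the three vocabulary-sized output heads.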

special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<｜begin▁of▁sentence｜>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<｜end▁of▁sentence｜>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

The diff for this file is too large to render. See raw diff