Commit 5c6d204 (verified) · committed by yunfeixie · 1 parent: 3d6bcd8

Upload folder using huggingface_hub

INFERENCE.md ADDED
@@ -0,0 +1,162 @@
1
+ # Run (b) – best reproducible checkpoint for SimplerBridge
2
+
3
+ This bundle contains everything needed to load and evaluate our run-(b) Qwen2.5-VL-3B freeze-vision VLA on SimplerBridge and reproduce (and exceed) the paper's reported 23.95% SR.
4
+
5
+ - Checkpoint: `stepstep=0030000.fp32.pt` (15 GB, FP32 single-file state dict)
6
+ - Target task suite: SimplerBridge visual-matching (4 tasks × 24 episodes)
7
+ - Best-of-sweep SR: **30.21%** (at `execute_step=2`) – paper target: 23.95%
8
+ - Vision encoder: **frozen** during training (Qwen2.5-VL ViT, 668 M params)
9
+ - Trainable params at eval time: 3.09 B (language tower + word-embed + FC action head)
10
+
11
+ ## What's in this bundle
12
+
13
+ ```
14
+ run_b_best/
15
+ ├── stepstep=0030000.fp32.pt # checkpoint weights (15 GB, FP32)
16
+ ├── project.json # training-time config (Lightning snapshot)
17
+ ├── dataset_statistics_bridge.json # BridgeV2 action stats (used for un-normalization)
18
+ ├── configs/
19
+ │ └── qwen2.5-vl-3b-instruct/ # exact HF base-model files used at training time
20
+ ├── patches/
21
+ │ ├── eval_calvin_model_wrapper.py.patch # fix: `get_text_function` 4-arg -> 2-arg
22
+ │ ├── eval_simpler_main_inference.py.patch # fix: set args.policy_model from configs["model"]
23
+ │ └── convert_ckpt_standalone.py # DS stage-2 -> FP32 converter (thin wrapper)
24
+ ├── requirements-eval.txt # `pip freeze` of the eval conda env
25
+ ├── INFERENCE.md # this file
26
+ └── RESULTS.md # full 30-cell sweep matrix + best-of
27
+ ```
28
+
29
+ ## Environment (known-good)
30
+
31
+ - Linux + CUDA 12.4
32
+ - 1x A100-80GB (for a single-ckpt eval); 8x A100-80GB for a full 30-cell sweep
33
+ - `conda` (miniforge3 works), Python 3.10
34
+ - Disk: ~25 GB for this bundle + ~1-2 GB of per-task eval artifacts
35
+
36
+ ### Install (fresh machine)
37
+
38
+ ```bash
39
+ # 1. Clone the fork with the eval fixes already applied
40
+ git clone https://github.com/yunfeixie233/VLM4VLA.git
41
+ cd VLM4VLA
42
+ git checkout b4ddb40 # or the latest main with patches applied
43
+
44
+ # 2. Training / inference deps
45
+ conda create -y -n vlm4vla_eval python=3.10
46
+ conda activate vlm4vla_eval
47
+ pip install -e .
48
+ pip install -e ./openvla
49
+ git clone https://github.com/moojink/dlimp_openvla.git /tmp/dlimp_openvla
50
+ pip install -e /tmp/dlimp_openvla
51
+ pip install -U hydra-core
52
+ pip install bitsandbytes pretty_errors deepspeed qwen-vl-utils decord accelerate
53
+ pip install 'huggingface_hub<1.0,>=0.34.0'
54
+ pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
55
+
56
+ # 3. SimplerEnv stack (conflicts in numpy, do it last)
57
+ git clone https://github.com/simpler-env/SimplerEnv --recurse-submodules /tmp/SimplerEnv
58
+ cd /tmp/SimplerEnv/ManiSkill2_real2sim && pip install -e .
59
+ cd /tmp/SimplerEnv && pip install -e .
60
+ pip install mediapy 'numpy==1.24.4' 'opencv-python<4.11' 'setuptools<81'
61
+
62
+ # 4. Asset symlink expected by scripts/bridge.bash
63
+ ln -sfn /tmp/SimplerEnv/ManiSkill2_real2sim/data/real_inpainting "${VLM4VLA_REPO:?set VLM4VLA_REPO to the absolute path of your VLM4VLA clone}/real_inpainting"
64
+ ```
65
+
66
+ If you *cannot* use the fork, apply the two patches in `patches/` to the upstream repo:
67
+ ```bash
68
+ git apply patches/eval_calvin_model_wrapper.py.patch
69
+ git apply patches/eval_simpler_main_inference.py.patch
70
+ ```
71
+ (These are `git show` dumps; `git apply` strips one leading path component by default, i.e. `-p1`, so no extra flag is needed.)
72
+
73
+ `requirements-eval.txt` is the exact `pip freeze` of the conda env that produced the sweep results below; use it if you need to pin precisely.
74
+
75
+ ## Fix up paths in `project.json`
76
+
77
+ The training config hard-codes paths that were valid on the training box (`/workspace/models/...`, `/workspace/data/...`). Before inference, point them at this bundle:
78
+
79
+ ```bash
80
+ python - <<'PY'
81
+ import json, os
82
+ BUNDLE = os.path.abspath(".")
83
+ p = os.path.join(BUNDLE, "project.json")
84
+ c = json.load(open(p))
85
+ qwen = os.path.join(BUNDLE, "configs/qwen2.5-vl-3b-instruct")
86
+ c["model_path"] = qwen
87
+ c["model_config"] = os.path.join(qwen, "config.json")
88
+ c["tokenizer"]["pretrained_model_name_or_path"] = qwen
89
+ c["vlm"]["pretrained_model_name_or_path"] = qwen
90
+ # data_root_dir is unused at inference; leave as-is (or point to a Bridge copy if you resume training)
91
+ json.dump(c, open(p, "w"), indent=4)
92
+ print("patched", p)
93
+ PY
94
+ ```
95
+
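After patching, it can be worth confirming that every rewritten entry resolves on disk before launching an eval. The following is a minimal sketch, not part of the repo: the helper name `missing_paths` is hypothetical, and it assumes `project.json` has exactly the four keys rewritten by the patch script above.

```python
import json
import os


def missing_paths(config: dict) -> list:
    """Return the patched config entries whose target does not exist on disk."""
    # These keys mirror the ones rewritten by the patch script above.
    candidates = {
        "model_path": config["model_path"],
        "model_config": config["model_config"],
        "tokenizer.pretrained_model_name_or_path": config["tokenizer"]["pretrained_model_name_or_path"],
        "vlm.pretrained_model_name_or_path": config["vlm"]["pretrained_model_name_or_path"],
    }
    return [f"{key} -> {path}" for key, path in candidates.items() if not os.path.exists(path)]


# Only runs when a project.json is present in the current directory.
if __name__ == "__main__" and os.path.exists("project.json"):
    with open("project.json") as fh:
        problems = missing_paths(json.load(fh))
    print("all paths resolve" if not problems else "\n".join(problems))
```

Run it from the bundle directory; any line it prints names a key that still points at a path from the training box.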
96
+ ## Also copy the BridgeV2 action stats into the repo
97
+
98
+ The `BaseModelInference` class reads `configs/data/oxe_dataset_stats/dataset_statistics_bridge.json` relative to `$CWD`. Inside the repo:
99
+
100
+ ```bash
101
+ cp /path/to/run_b_best/dataset_statistics_bridge.json \
102
+ VLM4VLA/configs/data/oxe_dataset_stats/dataset_statistics_bridge.json
103
+ ```
104
+
105
+ ## One-shot eval (best cell)
106
+
107
+ Our headline result is `execute_step=2` at step 30000, avg SR = 30.21%. Run it with a single call of the canonical bridge script:
108
+
109
+ ```bash
110
+ cd VLM4VLA
111
+ conda activate vlm4vla_eval
112
+ export TF_CPP_MIN_LOG_LEVEL=2
113
+ BUNDLE=/abs/path/to/run_b_best
114
+ bash scripts/bridge.bash \
115
+ ${BUNDLE}/stepstep=0030000.fp32.pt \
116
+ ${BUNDLE}/project.json \
117
+ 2 \
118
+ 0 # GPU index
119
+ ```
120
+
121
+ Wall time: ~17-25 min on one A100 (4 tasks, 24 episodes each, up to 120 env steps/episode).
122
+
123
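+ Per-task `Average success` lines are printed to stdout. If you captured stdout to `eval.log` (for example by appending `2>&1 | tee eval.log` to the `bash scripts/bridge.bash ...` call above), find them with: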
124
+ ```bash
125
+ grep -n 'Average success' eval.log
126
+ ```
127
+
128
+ Expected output (identical up to sim-stochasticity):
129
+ ```
130
+ PutCarrotOnPlate: 16.7 %
131
+ StackGreenCubeOnYellow: 0.0 %
132
+ PutSpoonOnTableCloth: 12.5 %
133
+ PutEggplantInBasket: 91.7 %
134
+ -> avg = 30.21 %
135
+ ```
136
+
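The 30.21 % average follows directly from the per-task numbers. The sketch below redoes the arithmetic; the per-task success counts are inferred from the percentages above (for example, 16.7 % of 24 episodes is 4 successes) rather than read from a log, and the function names are illustrative only.

```python
EPISODES_PER_TASK = 24

# Success counts inferred from the expected output above; not taken from a log.
successes = {
    "PutCarrotOnPlate": 4,
    "StackGreenCubeOnYellow": 0,
    "PutSpoonOnTableCloth": 3,
    "PutEggplantInBasket": 22,
}


def success_rates(counts, episodes=EPISODES_PER_TASK):
    """Per-task success rate in percent."""
    return {task: 100.0 * n / episodes for task, n in counts.items()}


def average_sr(counts, episodes=EPISODES_PER_TASK):
    """Unweighted mean of the per-task success rates, in percent."""
    rates = success_rates(counts, episodes)
    return sum(rates.values()) / len(rates)


for task, rate in success_rates(successes).items():
    print(f"{task}: {rate:.1f} %")
print(f"-> avg = {average_sr(successes):.2f} %")
```

With 96 episodes in total, a single episode flipping changes the average by about 1.04 pp, which gives a sense of how much run-to-run variation to expect.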
137
+ ## Full sweep (reproduces paper's `eval_ckpts_bridge.py` protocol)
138
+
139
+ The paper's protocol iterates over `execute_step in [4, 2, 1]` and picks the best cell; on 8 GPUs the full sweep finishes in ~107 min. Use the parallel launcher from the fork:
140
+
141
+ ```bash
142
+ cd VLM4VLA
143
+ conda activate vlm4vla_eval
144
+ python eval/simpler/sweep_parallel_bridge.py \
145
+ --base-path /abs/path/to/run_b_best \
146
+ --ngpu 8 --exec-steps 4,2,1
147
+ ```
148
+
149
+ See `RESULTS.md` for the full 30-cell SR matrix we obtained.
150
+
151
+ ## Known quirks
152
+
153
+ - `StackGreenCubeOnYellow` has **0 % SR** across all 10 checkpoints × 3 execute_steps. This is a known hard task in SimplerBridge; the paper's run (b) also scores near-zero on it (the 23.95 % headline is dominated by eggplant).
154
+ - `PutEggplantInBasket` is the dominant SR driver (up to 100 %). Evaluate it at both exec=4 and exec=2 for the most stable number.
155
+ - `execute_step=1` is consistently the slowest (~30 min per cell) and rarely the best. `execute_step=4` is fastest (~21 min).
156
+ - Vision freeze verification after load: at startup the code prints `Trainable Model Parameters: 3092.51M` and `-- trainable backbone Parameters: 3085.94M`. If you see 3761 M the freeze did not take effect (the ViT is trainable).
157
+
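The freeze check in the last bullet can be automated against a captured startup log. This is a rough sketch under stated assumptions: the helper `check_freeze` is hypothetical, the log line is assumed to look exactly like the `Trainable Model Parameters: 3092.51M` string quoted above, and the 1 M tolerance is an arbitrary choice.

```python
import re

EXPECTED_TRAINABLE_M = 3092.51  # value printed when the ViT is frozen
UNFROZEN_TRAINABLE_M = 3761.0   # approximate value when the ViT is trainable


def check_freeze(log_text, tol_m=1.0):
    """Classify the vision-freeze state from the startup log text."""
    match = re.search(r"Trainable Model Parameters:\s*([\d.]+)M", log_text)
    if match is None:
        return "unknown"  # the expected line was not found
    value = float(match.group(1))
    if abs(value - EXPECTED_TRAINABLE_M) <= tol_m:
        return "frozen"
    if abs(value - UNFROZEN_TRAINABLE_M) <= tol_m:
        return "not frozen"
    return "unexpected"


print(check_freeze("Trainable Model Parameters: 3092.51M"))
```

The gap between the two values, roughly 668 M, is the size of the ViT, which is why 3761 M indicates the encoder was left trainable.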
158
+ ## Caveats on reproduction
159
+
160
+ - Training was done on this fork with `learning_rate=5e-5` (in the LOCAL config) vs the paper's non-LOCAL config setting of `2e-5`. Despite the larger LR, our converged SR *exceeds* the paper's, so the LR difference is not limiting reproduction.
161
+ - We trained with `max_steps=50000`, the same value as the paper. However, the peak performance in our sweep is at **step 30000**, not step 50000. If you are retraining, save checkpoints often and do not assume the last step is best.
162
+ - SimplerEnv upstream may drift. We pinned the specific SimplerEnv + ManiSkill2_real2sim HEAD as of Tue Apr 21 2026; future versions may perturb SR by a few pp due to sim changes.
RESULTS.md ADDED
@@ -0,0 +1,96 @@
1
+ # Run (b) – full SimplerBridge sweep (30 cells, 10 ckpts × 3 exec_steps)
2
+
3
+ Training: Qwen2.5-VL-3B-Instruct + FCDecoder (action head), **vision encoder frozen**, BridgeData V2 RLDS, Lightning + DeepSpeed stage-2, bf16, 50 000 opt steps, global batch 512 (8 per GPU × 8 accumulate × 8x A100-80GB), LR 5e-5. See `project.json` for the complete config snapshot.
4
+
5
+ Eval: `scripts/bridge.bash` – 4 visual-matching tasks, 24 episodes per task, `max_episode_steps=60` (120 for `PutEggplantInBasket`), `control_freq=5`, `sim_freq=500`, 8x A100 running 30 `(ckpt, execute_step)` cells in parallel (~107 min wall).
6
+
7
+ Numbers below are per-task success rates in **percent** (24 episodes per task), followed by the 4-task mean (`avg`). The paper target for run (b) is **23.95 %**.
8
+
9
+ ## Per-cell SR matrix
10
+
11
+ | step | exec | carrot | stack | spoon | eggplant | avg SR |
12
+ |-------:|-----:|-------:|------:|------:|---------:|-------:|
13
+ | 5000 | 1 | 0.0 | 0.0 | 0.0 | 12.5 | 3.12 |
14
+ | 5000 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 |
15
+ | 5000 | 4 | 0.0 | 0.0 | 0.0 | 33.3 | 8.33 |
16
+ | 10000 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 |
17
+ | 10000 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 |
18
+ | 10000 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 |
19
+ | 15000 | 1 | 0.0 | 0.0 | 0.0 | 75.0 | 18.75 |
20
+ | 15000 | 2 | 0.0 | 0.0 | 0.0 | 91.7 | 22.92 |
21
+ | 15000 | 4 | 0.0 | 0.0 | 0.0 | 79.2 | 19.79 |
22
+ | 20000 | 1 | 0.0 | 0.0 | 0.0 | 70.8 | 17.71 |
23
+ | 20000 | 2 | 0.0 | 0.0 | 0.0 | 62.5 | 15.62 |
24
+ | 20000 | 4 | 0.0 | 0.0 | 0.0 | 75.0 | 18.75 |
25
+ | 25000 | 1 | 0.0 | 0.0 | 4.2 | 58.3 | 15.62 |
26
+ | 25000 | 2 | 4.2 | 0.0 | 4.2 | 37.5 | 11.46 |
27
+ | 25000 | 4 | 0.0 | 0.0 | 4.2 | 20.8 | 6.25 |
28
+ | **30000** | **2** | **16.7** | **0.0** | **12.5** | **91.7** | **30.21** |
29
+ | 30000 | 4 | 12.5 | 0.0 | 4.2 | 100.0 | 29.17 |
30
+ | 30000 | 1 | 12.5 | 0.0 | 4.2 | 87.5 | 26.04 |
31
+ | 35000 | 1 | 0.0 | 0.0 | 4.2 | 70.8 | 18.75 |
32
+ | 35000 | 2 | 4.2 | 0.0 | 0.0 | 91.7 | 23.96 |
33
+ | 35000 | 4 | 0.0 | 0.0 | 0.0 | 95.8 | 23.96 |
34
+ | 40000 | 1 | 8.3 | 0.0 | 0.0 | 45.8 | 13.54 |
35
+ | 40000 | 2 | 8.3 | 0.0 | 0.0 | 50.0 | 14.58 |
36
+ | 40000 | 4 | 0.0 | 0.0 | 0.0 | 16.7 | 4.17 |
37
+ | 45000 | 1 | 0.0 | 0.0 | 0.0 | 83.3 | 20.83 |
38
+ | 45000 | 2 | 0.0 | 0.0 | 0.0 | 91.7 | 22.92 |
39
+ | 45000 | 4 | 0.0 | 0.0 | 0.0 | 95.8 | 23.96 |
40
+ | 50000 | 1 | 4.2 | 0.0 | 0.0 | 58.3 | 15.62 |
41
+ | 50000 | 2 | 4.2 | 0.0 | 0.0 | 79.2 | 20.83 |
42
+ | 50000 | 4 | 8.3 | 0.0 | 4.2 | 45.8 | 14.58 |
43
+
44
+ ## Aggregates
45
+
46
+ ### Best cell per `execute_step`
47
+
48
+ | exec | best step | avg SR |
49
+ |-----:|----------:|-------:|
50
+ | 4 | 30000 | 29.17 |
51
+ | 2 | 30000 | **30.21** |
52
+ | 1 | 30000 | 26.04 |
53
+
54
+ ### Best cell per step (across `execute_step`)
55
+
56
+ | step | best exec | avg SR |
57
+ |-------:|----------:|-------:|
58
+ | 5000 | 4 | 8.33 |
59
+ | 10000 | any (all tie) | 0.00 |
60
+ | 15000 | 2 | 22.92 |
61
+ | 20000 | 4 | 18.75 |
62
+ | 25000 | 1 | 15.62 |
63
+ | 30000 | 2 | **30.21** |
64
+ | 35000 | 4 or 2 | 23.96 |
65
+ | 40000 | 2 | 14.58 |
66
+ | 45000 | 4 | 23.96 |
67
+ | 50000 | 2 | 20.83 |
68
+
69
+ ### Overall best
70
+
71
+ - **step=30000, exec_step=2, avg SR = 30.21 %** (paper: 23.95 %, **+6.26 pp**)
72
+ - Per-task: carrot 16.7 · stack 0.0 · spoon 12.5 · eggplant 91.7
73
+
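The aggregates above can be recomputed from the `avg SR` column of the per-cell matrix. The sketch below copies those 30 values into a nested dict and derives the best cell per `execute_step` and overall; the function names are illustrative and not part of the repo.

```python
# avg SR (%) per (step, execute_step), copied from the per-cell matrix.
AVG_SR = {
    5000: {1: 3.12, 2: 0.00, 4: 8.33},
    10000: {1: 0.00, 2: 0.00, 4: 0.00},
    15000: {1: 18.75, 2: 22.92, 4: 19.79},
    20000: {1: 17.71, 2: 15.62, 4: 18.75},
    25000: {1: 15.62, 2: 11.46, 4: 6.25},
    30000: {1: 26.04, 2: 30.21, 4: 29.17},
    35000: {1: 18.75, 2: 23.96, 4: 23.96},
    40000: {1: 13.54, 2: 14.58, 4: 4.17},
    45000: {1: 20.83, 2: 22.92, 4: 23.96},
    50000: {1: 15.62, 2: 20.83, 4: 14.58},
}
PAPER_SR = 23.95


def best_per_exec(table):
    """Map each execute_step to its (best step, avg SR); ties keep the earliest step."""
    best = {}
    for step, row in table.items():
        for exec_step, sr in row.items():
            if exec_step not in best or sr > best[exec_step][1]:
                best[exec_step] = (step, sr)
    return best


def overall_best(table):
    """Return (step, execute_step, avg SR) of the single best cell."""
    cells = ((step, e, sr) for step, row in table.items() for e, sr in row.items())
    return max(cells, key=lambda cell: cell[2])


step, exec_step, sr = overall_best(AVG_SR)
print(best_per_exec(AVG_SR))
print(f"best: step={step}, exec={exec_step}, avg SR={sr:.2f} % ({sr - PAPER_SR:+.2f} pp vs paper)")
```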
74
+ ## Key observations
75
+
76
+ 1. **The peak is at step 30 000, not step 50 000.** Evaluating only the final checkpoint (our earlier single-cell run) gave 14.58 % at exec=4, which is 15.63 pp below the step-30k best of 30.21 %. A single-checkpoint eval would not have reproduced the paper.
77
+ 2. **Both execute_step 4 and 2 are competitive.** Exec=1 is rarely the best (it is the outright winner only at step 25k) and is the slowest, roughly 40-50 % slower than exec=4 (~30 min vs ~21 min per cell), because every env step requires a fresh forward pass. Exec=4 finishes fastest; exec=2 generally gives the best SR in the converged regime (step 30k-45k).
78
+ 3. **StackGreenCubeOnYellow is 0.0 % everywhere.** That is consistent with the paper's own per-task breakdown; this task is dominated by other policies' zero-shot priors and is near-impossible without large-scale stacking data in the finetune mix.
79
+ 4. **Eggplant-in-basket carries the SR.** It peaks at 100 % (step 30k, exec=4) and is mostly 80-95 % in the converged regime, with step 40k as the exception (16.7-50.0 %).
80
+ 5. **Learning-rate comparison.** Our LOCAL config uses `lr=5e-5`; the paper's non-LOCAL config uses `lr=2e-5`. Despite the 2.5× larger LR, our converged SR exceeds the paper's headline by 6 pp. The larger LR may have pushed the peak earlier (step 30k) relative to where the paper's run peaks.
81
+ 6. **Training-time freeze verification held end-to-end.** The converter-side log at `convert_ckpt_standalone.py` reports `Reconstructed Frozen fp32 state dict with 390 params 668 684 288 elements` at every save step – exactly the Qwen2.5-VL ViT parameter count. The loaded state dict matches `BaseTrainer` with `missing=0, unexpected=0`.
82
+ 7. **Inference-time shape check.** The policy's `inference_step` returns an action chunk of shape `(B=1, seq=1, chunk=4, act_dim=7)` – i.e. 4 distinct 7-D actions per inference. This matches the `execute_step` semantics in `eval/calvin/model_wrapper.py`.
83
+
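To make the `execute_step` semantics from observations 2 and 7 concrete, here is a small stand-alone sketch. It is not the code in `eval/calvin/model_wrapper.py`; it assumes that each forward pass returns a chunk of 4 actions and that only the first `execute_step` of them are executed before the policy is queried again. `run_episode` and `dummy_policy` are hypothetical names.

```python
def run_episode(policy, max_steps, execute_step, chunk_size=4):
    """Execute the first `execute_step` actions of each chunk; count forward passes."""
    assert 1 <= execute_step <= chunk_size
    executed, forward_passes = [], 0
    while len(executed) < max_steps:
        chunk = policy(len(executed))  # one forward pass yields a full action chunk
        forward_passes += 1
        for action in chunk[:execute_step]:
            if len(executed) == max_steps:
                break  # episode step budget reached mid-chunk
            executed.append(action)
    return executed, forward_passes


def dummy_policy(t, chunk_size=4, act_dim=7):
    """Stand-in policy: returns `chunk_size` placeholder 7-D actions."""
    return [[float(t + k)] * act_dim for k in range(chunk_size)]


for execute_step in (4, 2, 1):
    _, passes = run_episode(dummy_policy, max_steps=60, execute_step=execute_step)
    print(f"execute_step={execute_step}: {passes} forward passes for 60 env steps")
```

Under these assumptions a 60-step episode needs 15, 30, or 60 forward passes for `execute_step` 4, 2, or 1, which is in line with exec=1 being the slowest setting.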
84
+ ## Reproduction command
85
+
86
+ ```bash
87
+ cd VLM4VLA
88
+ conda activate vlm4vla_eval
89
+ python eval/simpler/sweep_parallel_bridge.py \
90
+ --base-path /abs/path/to/run_b_best \
91
+ --ngpu 8 --exec-steps 4,2,1
92
+ ```
93
+
94
+ Wall time on 8× A100-80GB: **~107 min** (first wave starts at t=0, last cell finishes at t≈107m). RAM peak ~20 GB per process (transient during ckpt load); steady VRAM ~9 GB per process.
95
+
96
+ See `INFERENCE.md` for the single-cell command that reproduces just the best cell.
configs/qwen2.5-vl-3b-instruct/LICENSE ADDED
@@ -0,0 +1,54 @@
1
+ Qwen RESEARCH LICENSE AGREEMENT
2
+
3
+ Qwen RESEARCH LICENSE AGREEMENT Release Date: September 19, 2024
4
+
5
+ By clicking to agree or by using or distributing any portion or element of the Qwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
6
+
7
+ 1. Definitions
8
+ a. This Qwen RESEARCH LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
9
+ b. "We" (or "Us") shall mean Alibaba Cloud.
10
+ c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
11
+ d. "Third Parties" shall mean individuals or legal entities that are not under common control with us or you.
12
+ e. "Qwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by us.
13
+ f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Qwen and Documentation (and any portion thereof) made available under this Agreement.
14
+ g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
15
+ h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
16
+ i. "Non-Commercial" shall mean for research or evaluation purposes only.
17
+
18
+ 2. Grant of Rights
19
+ a. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials FOR NON-COMMERCIAL PURPOSES ONLY.
20
+ b. If you are commercially using the Materials, you shall request a license from us.
21
+
22
+ 3. Redistribution
23
+ You may distribute copies or make the Materials, or derivative works thereof, available as part of a product or service that contains any of them, with or without modifications, and in Source or Object form, provided that you meet the following conditions:
24
+ a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
25
+ b. You shall cause any modified files to carry prominent notices stating that you changed the files;
26
+ c. You shall retain in all copies of the Materials that you distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
27
+ d. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
28
+
29
+ 4. Rules of use
30
+ a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
31
+ b. If you use the Materials or any outputs or results therefrom to create, train, fine-tune, or improve an AI model that is distributed or made available, you shall prominently display “Built with Qwen” or “Improved using Qwen” in the related product documentation.
32
+
33
+ 5. Intellectual Property
34
+ a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
35
+ b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
36
+ c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licenses granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
37
+
38
+ 6. Disclaimer of Warranty and Limitation of Liability
39
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Qwen Materials or to grant any license thereto.
40
+ b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
41
+ c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
42
+ d. You will defend, indemnify and hold harmless us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
43
+
44
+ 7. Survival and Termination.
45
+ a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
46
+ b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 6 and 8 shall survive the termination of this Agreement.
47
+
48
+ 8. Governing Law and Jurisdiction.
49
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
50
+ b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
51
+
52
+ 9. Other Terms and Conditions.
53
+ a. Any arrangements, understandings, or agreements regarding the Material not stated herein are separate from and independent of the terms and conditions of this Agreement. You shall request a separate license from us, if you use the Materials in ways not expressly agreed to in this Agreement.
54
+ b. We shall not be bound by any additional or different terms or conditions communicated by you unless expressly agreed.
configs/qwen2.5-vl-3b-instruct/README.md ADDED
@@ -0,0 +1,525 @@
1
+
2
+ ---
3
+ license_name: qwen-research
4
+ license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
5
+ language:
6
+ - en
7
+ pipeline_tag: image-text-to-text
8
+ tags:
9
+ - multimodal
10
+ library_name: transformers
11
+ ---
12
+
13
+ # Qwen2.5-VL-3B-Instruct
14
+ <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
15
+ <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
16
+ </a>
17
+
18
+ ## Introduction
19
+
20
+ In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
21
+
22
+ #### Key Enhancements:
23
+ * **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but it is highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
24
+
25
+ * **Being agentic**: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
26
+
27
+ * **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
28
+
29
+ * **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
30
+
31
+ * **Generating structured outputs**: For data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which benefits applications in finance, commerce, and similar domains.
32
+
33
+
34
+ #### Model Architecture Updates:
35
+
36
+ * **Dynamic Resolution and Frame Rate Training for Video Understanding**:
37
+
38
+ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.
39
+
40
+ <p align="center">
41
+ <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
42
+ </p>
43
+
44
+
45
+ * **Streamlined and Efficient Vision Encoder**
46
+
47
+ We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
48
+
49
+
50
+ We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
51
+
52
+
53
+
54
+ ## Evaluation
55
+
56
+ ### Image benchmark
57
+
58
+ | Benchmark | InternVL2.5-4B |Qwen2-VL-7B |Qwen2.5-VL-3B |
59
+ | :--- | :---: | :---: | :---: |
60
+ | MMMU<sub>val</sub> | 52.3 | 54.1 | 53.1|
61
+ | MMMU-Pro<sub>val</sub> | **32.7** | 30.5 | 31.6|
62
+ | AI2D<sub>test</sub> | 81.4 | **83.0** | 81.5 |
63
+ | DocVQA<sub>test</sub> | 91.6 | 94.5 | **93.9** |
64
+ | InfoVQA<sub>test</sub> | 72.1 | 76.5 | **77.1** |
65
+ | TextVQA<sub>val</sub> | 76.8 | **84.3** | 79.3|
66
+ | MMBench-V1.1<sub>test</sub> | 79.3 | **80.7** | 77.6 |
67
+ | MMStar | 58.3 | **60.7** | 55.9 |
68
+ | MathVista<sub>testmini</sub> | 60.5 | 58.2 | **62.3** |
69
+ | MathVision<sub>full</sub> | 20.9 | 16.3 | **21.2** |
70
+
71
+
72
+ ### Video benchmark
73
+ | Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
74
+ | :--- | :---: | :---: | :---: |
75
+ | MVBench | 71.6 | 67.0 | 67.0 |
76
+ | VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
77
+ | MLVU | 48.3 | - | 68.2 |
78
+ | LVBench | - | - | 43.3 |
79
+ | MMBench-Video | 1.73 | 1.44 | 1.63 |
80
+ | EgoSchema | - | - | 64.8 |
81
+ | PerceptionTest | - | - | 66.9 |
82
+ | TempCompass | - | - | 64.4 |
83
+ | LongVideoBench | 55.2 | 55.6 | 54.2 |
84
+ | CharadesSTA/mIoU | - | - | 38.8 |
85
+
86
+
87
+ ### Agent benchmark
88
+ | Benchmarks | Qwen2.5-VL-3B |
89
+ |-------------------------|---------------|
90
+ | ScreenSpot | 55.5 |
91
+ | ScreenSpot Pro | 23.9 |
92
+ | AITZ_EM | 76.9 |
93
+ | Android Control High_EM | 63.7 |
94
+ | Android Control Low_EM | 22.2 |
95
+ | AndroidWorld_SR | 90.8 |
96
+ | MobileMiniWob++_SR | 67.9 |
97
+
98
+ ## Requirements
99
+ The code for Qwen2.5-VL is included in the latest Hugging Face `transformers`. We advise building from source with the following command:
100
+ ```
101
+ pip install git+https://github.com/huggingface/transformers accelerate
102
+ ```
103
+ or you might encounter the following error:
104
+ ```
105
+ KeyError: 'qwen2_5_vl'
106
+ ```
107
+
108
+
109
+ ## Quickstart
110
+
111
+ Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
112
+
122
+
123
+ We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
124
+
125
+ ```bash
126
+ # It's highly recommended to use the `[decord]` feature for faster video loading.
127
+ pip install qwen-vl-utils[decord]==0.0.8
128
+ ```
129
+
130
+ If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils` which will fall back to using torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) to get decord used when loading video.
131
+
132
+ ### Using 🤗 Transformers to Chat
133
+
134
+ The following snippet shows how to use the chat model with `transformers` and `qwen_vl_utils`:
135
+
136
+ ```python
137
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
138
+ from qwen_vl_utils import process_vision_info
139
+
140
+ # default: Load the model on the available device(s)
141
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
142
+ "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
143
+ )
144
+
145
+ # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
146
+ # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
147
+ # "Qwen/Qwen2.5-VL-3B-Instruct",
148
+ # torch_dtype=torch.bfloat16,
149
+ # attn_implementation="flash_attention_2",
150
+ # device_map="auto",
151
+ # )
152
+
153
+ # default processor
154
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
155
+
156
+ # The default range for the number of visual tokens per image in the model is 4-16384.
157
+ # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
158
+ # min_pixels = 256*28*28
159
+ # max_pixels = 1280*28*28
160
+ # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
161
+
162
+ messages = [
163
+ {
164
+ "role": "user",
165
+ "content": [
166
+ {
167
+ "type": "image",
168
+ "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
169
+ },
170
+ {"type": "text", "text": "Describe this image."},
171
+ ],
172
+ }
173
+ ]
174
+
175
+ # Preparation for inference
176
+ text = processor.apply_chat_template(
177
+ messages, tokenize=False, add_generation_prompt=True
178
+ )
179
+ image_inputs, video_inputs = process_vision_info(messages)
180
+ inputs = processor(
181
+ text=[text],
182
+ images=image_inputs,
183
+ videos=video_inputs,
184
+ padding=True,
185
+ return_tensors="pt",
186
+ )
187
+ inputs = inputs.to("cuda")
188
+
189
+ # Inference: Generation of the output
190
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
191
+ generated_ids_trimmed = [
192
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
193
+ ]
194
+ output_text = processor.batch_decode(
195
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
196
+ )
197
+ print(output_text)
198
+ ```
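The trimming step in the example above slices the prompt tokens off each generated sequence before decoding, because `generate` returns the prompt followed by the completion. The following is a minimal sketch of that slicing on plain Python lists. The token ids are made up for illustration and the helper name `trim_prompt_tokens` is hypothetical, not part of any library.

```python
from typing import List


def trim_prompt_tokens(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the leading prompt tokens from every generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token ids: each output starts with its own prompt.
prompts = [[101, 7, 8], [101, 9, 10]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

trimmed = trim_prompt_tokens(prompts, outputs)
print(trimmed)
```

With real tensors the same slicing applies row by row. Only the trimmed ids should be passed to `batch_decode`, otherwise the prompt text is echoed back in the output.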
199
+ <details>
200
+ <summary>Multi image inference</summary>
201
+
202
+ ```python
203
+ # Messages containing multiple images and a text query
204
+ messages = [
205
+ {
206
+ "role": "user",
207
+ "content": [
208
+ {"type": "image", "image": "file:///path/to/image1.jpg"},
209
+ {"type": "image", "image": "file:///path/to/image2.jpg"},
210
+ {"type": "text", "text": "Identify the similarities between these images."},
211
+ ],
212
+ }
213
+ ]
214
+
215
+ # Preparation for inference
216
+ text = processor.apply_chat_template(
217
+ messages, tokenize=False, add_generation_prompt=True
218
+ )
219
+ image_inputs, video_inputs = process_vision_info(messages)
220
+ inputs = processor(
221
+ text=[text],
222
+ images=image_inputs,
223
+ videos=video_inputs,
224
+ padding=True,
225
+ return_tensors="pt",
226
+ )
227
+ inputs = inputs.to("cuda")
228
+
229
+ # Inference
230
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
231
+ generated_ids_trimmed = [
232
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
233
+ ]
234
+ output_text = processor.batch_decode(
235
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
236
+ )
237
+ print(output_text)
238
+ ```
239
+ </details>
240
+
241
+ <details>
242
+ <summary>Video inference</summary>
243
+
244
+ ```python
245
+ # Messages containing a list of images as a video and a text query
246
+ messages = [
247
+ {
248
+ "role": "user",
249
+ "content": [
250
+ {
251
+ "type": "video",
252
+ "video": [
253
+ "file:///path/to/frame1.jpg",
254
+ "file:///path/to/frame2.jpg",
255
+ "file:///path/to/frame3.jpg",
256
+ "file:///path/to/frame4.jpg",
257
+ ],
258
+ },
259
+ {"type": "text", "text": "Describe this video."},
260
+ ],
261
+ }
262
+ ]
263
+
264
+ # Messages containing a local video path and a text query
265
+ messages = [
266
+ {
267
+ "role": "user",
268
+ "content": [
269
+ {
270
+ "type": "video",
271
+ "video": "file:///path/to/video1.mp4",
272
+ "max_pixels": 360 * 420,
273
+ "fps": 1.0,
274
+ },
275
+ {"type": "text", "text": "Describe this video."},
276
+ ],
277
+ }
278
+ ]
279
+
280
+ # Messages containing a video url and a text query
281
+ messages = [
282
+ {
283
+ "role": "user",
284
+ "content": [
285
+ {
286
+ "type": "video",
287
+ "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
288
+ },
289
+ {"type": "text", "text": "Describe this video."},
290
+ ],
291
+ }
292
+ ]
293
+
294
+ # In Qwen2.5-VL, frame rate information is also passed to the model so that frames can be aligned with absolute time.
295
+ # Preparation for inference
296
+ text = processor.apply_chat_template(
297
+ messages, tokenize=False, add_generation_prompt=True
298
+ )
299
+ image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
300
+ inputs = processor(
301
+ text=[text],
302
+ images=image_inputs,
303
+ videos=video_inputs,
304
+ # fps is not passed explicitly: it is already included in video_kwargs (returned above)
305
+ padding=True,
306
+ return_tensors="pt",
307
+ **video_kwargs,
308
+ )
309
+ inputs = inputs.to("cuda")
310
+
311
+ # Inference
312
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
313
+ generated_ids_trimmed = [
314
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
315
+ ]
316
+ output_text = processor.batch_decode(
317
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
318
+ )
319
+ print(output_text)
320
+ ```
321
+
322
+ Video URL compatibility largely depends on the version of the third-party library used to read the video; the details are in the table below. To override the default backend, set the environment variable `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord`.
323
+
324
+ | Backend | HTTP | HTTPS |
325
+ |-------------|------|-------|
326
+ | torchvision >= 0.19.0 | ✅ | ✅ |
327
+ | torchvision < 0.19.0 | ❌ | ❌ |
328
+ | decord | ✅ | ❌ |
329
+ </details>
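The backend override mentioned above is an environment variable, so it can also be set from Python. The sketch below assumes the variable has to be in place before the first video is decoded. The helper name `select_video_reader` is hypothetical, and the two accepted values are simply the ones listed in the text.

```python
import os

# The two backend names given in the text above.
ALLOWED_BACKENDS = {"torchvision", "decord"}


def select_video_reader(backend: str) -> str:
    """Set FORCE_QWENVL_VIDEO_READER and return the value now in effect."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
    return os.environ["FORCE_QWENVL_VIDEO_READER"]


# Assumption: call this before any video is loaded in the process.
chosen = select_video_reader("torchvision")
print(chosen)
```

Setting the variable in the shell before launching the script has the same effect and avoids ordering problems with imports.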
330
+
331
+ <details>
332
+ <summary>Batch inference</summary>
333
+
334
+ ```python
335
+ # Sample messages for batch inference
336
+ messages1 = [
337
+ {
338
+ "role": "user",
339
+ "content": [
340
+ {"type": "image", "image": "file:///path/to/image1.jpg"},
341
+ {"type": "image", "image": "file:///path/to/image2.jpg"},
342
+ {"type": "text", "text": "What are the common elements in these pictures?"},
343
+ ],
344
+ }
345
+ ]
346
+ messages2 = [
347
+ {"role": "system", "content": "You are a helpful assistant."},
348
+ {"role": "user", "content": "Who are you?"},
349
+ ]
350
+ # Combine messages for batch processing
351
+ messages = [messages1, messages2]
352
+
353
+ # Preparation for batch inference
354
+ texts = [
355
+ processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
356
+ for msg in messages
357
+ ]
358
+ image_inputs, video_inputs = process_vision_info(messages)
359
+ inputs = processor(
360
+ text=texts,
361
+ images=image_inputs,
362
+ videos=video_inputs,
363
+ padding=True,
364
+ return_tensors="pt",
365
+ )
366
+ inputs = inputs.to("cuda")
367
+
368
+ # Batch Inference
369
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
370
+ generated_ids_trimmed = [
371
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
372
+ ]
373
+ output_texts = processor.batch_decode(
374
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
375
+ )
376
+ print(output_texts)
377
+ ```
378
+ </details>
379
+
380
+ ### 🤖 ModelScope
381
+ We strongly advise users, especially those in mainland China, to use ModelScope. Its `snapshot_download` function can help resolve problems with downloading checkpoints.
382
+
383
+
384
+ ### More Usage Tips
385
+
386
+ For input images, we support local files, base64, and URLs. For videos, local files and URLs are supported, although URL support depends on the video backend.
387
+
388
+ ```python
389
+ # You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
390
+ ## Local file path
391
+ messages = [
392
+ {
393
+ "role": "user",
394
+ "content": [
395
+ {"type": "image", "image": "file:///path/to/your/image.jpg"},
396
+ {"type": "text", "text": "Describe this image."},
397
+ ],
398
+ }
399
+ ]
400
+ ## Image URL
401
+ messages = [
402
+ {
403
+ "role": "user",
404
+ "content": [
405
+ {"type": "image", "image": "http://path/to/your/image.jpg"},
406
+ {"type": "text", "text": "Describe this image."},
407
+ ],
408
+ }
409
+ ]
410
+ ## Base64 encoded image
411
+ messages = [
412
+ {
413
+ "role": "user",
414
+ "content": [
415
+ {"type": "image", "image": "data:image;base64,/9j/..."},
416
+ {"type": "text", "text": "Describe this image."},
417
+ ],
418
+ }
419
+ ]
420
+ ```
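The base64 form above expects the image bytes as a `data:image;base64,...` string. A minimal sketch of building one with the standard library follows. The helper name `image_bytes_to_data_uri` is hypothetical, and the four bytes used here are only a JPEG header stub, not a decodable image.

```python
import base64


def image_bytes_to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as a data URI in the form shown above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image;base64,{encoded}"


# Illustrative stub: the first four bytes of a typical JPEG file.
jpeg_header = b"\xff\xd8\xff\xe0"
uri = image_bytes_to_data_uri(jpeg_header)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": uri},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(uri)
```

In practice the bytes would come from reading a real file in binary mode, for example `open(path, "rb").read()`.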
421
+ #### Image Resolution for performance boost
422
+
423
+ The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
424
+
425
+ ```python
426
+ min_pixels = 256 * 28 * 28
427
+ max_pixels = 1280 * 28 * 28
428
+ processor = AutoProcessor.from_pretrained(
429
+ "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
430
+ )
431
+ ```
432
+
433
+ In addition, we provide two methods for fine-grained control over the image size input to the model:
434
+
435
+ 1. Define `min_pixels` and `max_pixels`: images are resized, preserving their aspect ratio, so that the pixel count falls within the range from `min_pixels` to `max_pixels`.
436
+
437
+ 2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
438
+
439
+ ```python
440
+ # resized_height and resized_width
441
+ messages = [
442
+ {
443
+ "role": "user",
444
+ "content": [
445
+ {
446
+ "type": "image",
447
+ "image": "file:///path/to/your/image.jpg",
448
+ "resized_height": 280,
449
+ "resized_width": 420,
450
+ },
451
+ {"type": "text", "text": "Describe this image."},
452
+ ],
453
+ }
454
+ ]
455
+ # min_pixels and max_pixels
456
+ messages = [
457
+ {
458
+ "role": "user",
459
+ "content": [
460
+ {
461
+ "type": "image",
462
+ "image": "file:///path/to/your/image.jpg",
463
+ "min_pixels": 50176,
464
+ "max_pixels": 50176,
465
+ },
466
+ {"type": "text", "text": "Describe this image."},
467
+ ],
468
+ }
469
+ ]
470
+ ```
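The pixel budgets above are written as a token count times `28 * 28`, and explicit dimensions are rounded to a multiple of 28. The arithmetic can be sketched as below. Both helper names are hypothetical, and the rounding uses Python's built-in `round`, so exact ties may be resolved differently from the library.

```python
FACTOR = 28  # per the text above: one visual token corresponds to a 28x28 pixel area


def tokens_to_pixels(n_tokens: int) -> int:
    """Convert a visual-token budget into a pixel budget."""
    return n_tokens * FACTOR * FACTOR


def round_to_factor(value: int, factor: int = FACTOR) -> int:
    """Round a height or width to the nearest multiple of `factor` (at least one factor)."""
    return max(factor, round(value / factor) * factor)


min_pixels = tokens_to_pixels(256)
max_pixels = tokens_to_pixels(1280)
print(min_pixels, max_pixels)

# 280 and 420 are already multiples of 28; 300 is not and gets adjusted.
print(round_to_factor(280), round_to_factor(420), round_to_factor(300))

# The 50176-pixel budget in the example above corresponds to 64 tokens.
print(50176 // (FACTOR * FACTOR))
```

Setting `min_pixels` equal to `max_pixels`, as in the second example above, pins every image to roughly the same token count regardless of its original size.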
471
+
472
+ ### Processing Long Texts
473
+
474
+ The current `config.json` is set for context length up to 32,768 tokens.
475
+ To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
476
+
477
+ For supported frameworks, you could add the following to `config.json` to enable YaRN:
478
+
479
+ ```
480
+ {
481
+ ...,
482
+ "type": "yarn",
483
+ "mrope_section": [
484
+ 16,
485
+ 24,
486
+ 24
487
+ ],
488
+ "factor": 4,
489
+ "original_max_position_embeddings": 32768
490
+ }
491
+ ```
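As a rough guide to what the values above imply, the sketch below multiplies the scaling factor by the original window. This assumes the usual YaRN convention that the target context length is approximately `factor` times `original_max_position_embeddings`. The dictionary mirrors only the fields shown above and is not a complete `config.json`.

```python
# Fields copied from the snippet above; everything else in config.json is omitted.
yarn_settings = {
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
}

# Assumption: extended context is about factor * original window.
extended_context = (
    yarn_settings["factor"] * yarn_settings["original_max_position_embeddings"]
)
print(extended_context)
```

The result is an upper bound on position range, not a guarantee of quality at that length.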
492
+
493
+ However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
494
+
495
+ At the same time, for long video inputs, since MRoPE itself is more economical with position ids, `max_position_embeddings` can be set directly to a larger value, such as 64k.
496
+
497
+
498
+
499
+ ## Citation
500
+
501
+ If you find our work helpful, feel free to cite us.
502
+
503
+ ```
504
+ @misc{qwen2.5-VL,
505
+ title = {Qwen2.5-VL},
506
+ url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
507
+ author = {Qwen Team},
508
+ month = {January},
509
+ year = {2025}
510
+ }
511
+
512
+ @article{Qwen2VL,
513
+ title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
514
+ author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
515
+ journal={arXiv preprint arXiv:2409.12191},
516
+ year={2024}
517
+ }
518
+
519
+ @article{Qwen-VL,
520
+ title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
521
+ author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
522
+ journal={arXiv preprint arXiv:2308.12966},
523
+ year={2023}
524
+ }
525
+ ```
configs/qwen2.5-vl-3b-instruct/chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
3
+ }
configs/qwen2.5-vl-3b-instruct/config.json ADDED
@@ -0,0 +1,61 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2_5_VLForConditionalGeneration"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": 151643,
7
+ "eos_token_id": 151645,
8
+ "vision_start_token_id": 151652,
9
+ "vision_end_token_id": 151653,
10
+ "vision_token_id": 151654,
11
+ "image_token_id": 151655,
12
+ "video_token_id": 151656,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 2048,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 11008,
17
+ "max_position_embeddings": 128000,
18
+ "max_window_layers": 70,
19
+ "model_type": "qwen2_5_vl",
20
+ "num_attention_heads": 16,
21
+ "num_hidden_layers": 36,
22
+ "num_key_value_heads": 2,
23
+ "rms_norm_eps": 1e-06,
24
+ "rope_theta": 1000000.0,
25
+ "sliding_window": 32768,
26
+ "tie_word_embeddings": true,
27
+ "torch_dtype": "bfloat16",
28
+ "transformers_version": "4.41.2",
29
+ "use_cache": true,
30
+ "use_sliding_window": false,
31
+ "vision_config": {
32
+ "depth": 32,
33
+ "hidden_act": "silu",
34
+ "hidden_size": 1280,
35
+ "intermediate_size": 3420,
36
+ "num_heads": 16,
37
+ "in_chans": 3,
38
+ "out_hidden_size": 2048,
39
+ "patch_size": 14,
40
+ "spatial_merge_size": 2,
41
+ "spatial_patch_size": 14,
42
+ "window_size": 112,
43
+ "fullatt_block_indexes": [
44
+ 7,
45
+ 15,
46
+ 23,
47
+ 31
48
+ ],
49
+ "tokens_per_second": 2,
50
+ "temporal_patch_size": 2
51
+ },
52
+ "rope_scaling": {
53
+ "type": "mrope",
54
+ "mrope_section": [
55
+ 16,
56
+ 24,
57
+ 24
58
+ ]
59
+ },
60
+ "vocab_size": 151936
61
+ }
configs/qwen2.5-vl-3b-instruct/generation_config.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "bos_token_id": 151643,
3
+ "pad_token_id": 151643,
4
+ "do_sample": true,
5
+ "eos_token_id": [
6
+ 151645,
7
+ 151643
8
+ ],
9
+ "repetition_penalty": 1.05,
10
+ "temperature": 0.000001,
11
+ "transformers_version": "4.49.0"
12
+ }
configs/qwen2.5-vl-3b-instruct/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
configs/qwen2.5-vl-3b-instruct/model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:41a8895c164b4d32bae6b302f4603fcbc1797f32dafa45c7e9bcda23c6755df8
3
+ size 3982649232
configs/qwen2.5-vl-3b-instruct/model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:365531ff8752420e89dee707b79d021fb2d6e25abafe486f080555a4fe6972e4
3
+ size 3526688744
configs/qwen2.5-vl-3b-instruct/model.safetensors.index.json ADDED
@@ -0,0 +1,831 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 7509245952
4
+ },
5
+ "weight_map": {
6
+ "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
7
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
8
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
9
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
10
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
11
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
12
+ "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
13
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
14
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
15
+ "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
16
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
17
+ "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
18
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
19
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
20
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
21
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
22
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
23
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
24
+ "model.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
25
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
26
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
27
+ "model.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
28
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
29
+ "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
30
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
31
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
32
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
33
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
34
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
35
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
36
+ "model.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
37
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
38
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
39
+ "model.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
40
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
41
+ "model.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
42
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
43
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
44
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
49
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
50
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
52
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
53
+ "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
54
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
55
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
61
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
62
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
63
+ "model.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
64
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
65
+ "model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
66
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
67
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.13.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
69
+ "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
70
+ "model.layers.13.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
71
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
72
+ "model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
73
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
74
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
75
+ "model.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
76
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
77
+ "model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
78
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
79
+ "model.layers.14.input_layernorm.weight": "model-00002-of-00002.safetensors",
80
+ "model.layers.14.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
81
+ "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
82
+ "model.layers.14.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
83
+ "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
84
+ "model.layers.14.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
85
+ "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
86
+ "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
87
+ "model.layers.14.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
88
+ "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
89
+ "model.layers.14.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
90
+ "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
91
+ "model.layers.15.input_layernorm.weight": "model-00002-of-00002.safetensors",
92
+ "model.layers.15.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
93
+ "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
94
+ "model.layers.15.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
95
+ "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
96
+ "model.layers.15.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
97
+ "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
98
+ "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
99
+ "model.layers.15.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
100
+ "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
101
+ "model.layers.15.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
102
+ "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
103
+ "model.layers.16.input_layernorm.weight": "model-00002-of-00002.safetensors",
104
+ "model.layers.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
105
+ "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
106
+ "model.layers.16.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
107
+ "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
108
+ "model.layers.16.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
109
+ "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
110
+ "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
111
+ "model.layers.16.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
112
+ "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
113
+ "model.layers.16.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
114
+ "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
115
+ "model.layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
116
+ "model.layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
117
+ "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
118
+ "model.layers.17.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
119
+ "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
120
+ "model.layers.17.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
121
+ "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
122
+ "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
123
+ "model.layers.17.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
124
+ "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
125
+ "model.layers.17.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
126
+ "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
127
+ "model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
128
+ "model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
129
+ "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
130
+ "model.layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
131
+ "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
132
+ "model.layers.18.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
133
+ "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
134
+ "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
135
+ "model.layers.18.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
136
+ "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
137
+ "model.layers.18.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
138
+ "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
139
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
140
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
141
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
142
+ "model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
143
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
144
+ "model.layers.19.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
145
+ "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
146
+ "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
147
+ "model.layers.19.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
148
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
149
+ "model.layers.19.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
150
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
151
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
152
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
153
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
154
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
155
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
156
+ "model.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
157
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
158
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
159
+ "model.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
160
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
161
+ "model.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
162
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
163
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
164
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
165
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
166
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
167
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
168
+ "model.layers.20.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
169
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
170
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
171
+ "model.layers.20.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
172
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
173
+ "model.layers.20.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
174
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
175
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
176
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
177
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
178
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
179
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
180
+ "model.layers.21.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
181
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
182
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
183
+ "model.layers.21.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
184
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
185
+ "model.layers.21.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
186
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
187
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
188
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
189
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
190
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
191
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
192
+ "model.layers.22.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
193
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
194
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.26.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.27.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.28.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.29.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.30.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.31.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.32.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.33.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.34.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.35.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
+ "model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.norm.weight": "model-00002-of-00002.safetensors",
+ "visual.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
585
+ "visual.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
586
+ "visual.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
587
+ "visual.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
588
+ "visual.blocks.2.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
589
+ "visual.blocks.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
590
+ "visual.blocks.2.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
591
+ "visual.blocks.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
592
+ "visual.blocks.2.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
593
+ "visual.blocks.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
594
+ "visual.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
595
+ "visual.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
596
+ "visual.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
597
+ "visual.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
598
+ "visual.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
599
+ "visual.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
600
+ "visual.blocks.20.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
601
+ "visual.blocks.20.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
602
+ "visual.blocks.20.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
603
+ "visual.blocks.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
604
+ "visual.blocks.20.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
605
+ "visual.blocks.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
606
+ "visual.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
607
+ "visual.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
608
+ "visual.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
609
+ "visual.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
610
+ "visual.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
611
+ "visual.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
612
+ "visual.blocks.21.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
613
+ "visual.blocks.21.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
614
+ "visual.blocks.21.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
615
+ "visual.blocks.21.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
616
+ "visual.blocks.21.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
617
+ "visual.blocks.21.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
618
+ "visual.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
619
+ "visual.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
620
+ "visual.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
621
+ "visual.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
622
+ "visual.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
623
+ "visual.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
624
+ "visual.blocks.22.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
625
+ "visual.blocks.22.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
626
+ "visual.blocks.22.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
627
+ "visual.blocks.22.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
628
+ "visual.blocks.22.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
629
+ "visual.blocks.22.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
630
+ "visual.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
631
+ "visual.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
632
+ "visual.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
633
+ "visual.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
634
+ "visual.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
635
+ "visual.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
636
+ "visual.blocks.23.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
637
+ "visual.blocks.23.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
638
+ "visual.blocks.23.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
639
+ "visual.blocks.23.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
640
+ "visual.blocks.23.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
641
+ "visual.blocks.23.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
642
+ "visual.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
643
+ "visual.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
644
+ "visual.blocks.24.attn.proj.bias": "model-00001-of-00002.safetensors",
645
+ "visual.blocks.24.attn.proj.weight": "model-00001-of-00002.safetensors",
646
+ "visual.blocks.24.attn.qkv.bias": "model-00001-of-00002.safetensors",
647
+ "visual.blocks.24.attn.qkv.weight": "model-00001-of-00002.safetensors",
648
+ "visual.blocks.24.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
649
+ "visual.blocks.24.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
650
+ "visual.blocks.24.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
651
+ "visual.blocks.24.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
652
+ "visual.blocks.24.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
653
+ "visual.blocks.24.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
654
+ "visual.blocks.24.norm1.weight": "model-00001-of-00002.safetensors",
655
+ "visual.blocks.24.norm2.weight": "model-00001-of-00002.safetensors",
656
+ "visual.blocks.25.attn.proj.bias": "model-00001-of-00002.safetensors",
657
+ "visual.blocks.25.attn.proj.weight": "model-00001-of-00002.safetensors",
658
+ "visual.blocks.25.attn.qkv.bias": "model-00001-of-00002.safetensors",
659
+ "visual.blocks.25.attn.qkv.weight": "model-00001-of-00002.safetensors",
660
+ "visual.blocks.25.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
661
+ "visual.blocks.25.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
662
+ "visual.blocks.25.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
663
+ "visual.blocks.25.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
664
+ "visual.blocks.25.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
665
+ "visual.blocks.25.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
666
+ "visual.blocks.25.norm1.weight": "model-00001-of-00002.safetensors",
667
+ "visual.blocks.25.norm2.weight": "model-00001-of-00002.safetensors",
668
+ "visual.blocks.26.attn.proj.bias": "model-00001-of-00002.safetensors",
669
+ "visual.blocks.26.attn.proj.weight": "model-00001-of-00002.safetensors",
670
+ "visual.blocks.26.attn.qkv.bias": "model-00001-of-00002.safetensors",
671
+ "visual.blocks.26.attn.qkv.weight": "model-00001-of-00002.safetensors",
672
+ "visual.blocks.26.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
673
+ "visual.blocks.26.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
674
+ "visual.blocks.26.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
675
+ "visual.blocks.26.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
676
+ "visual.blocks.26.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
677
+ "visual.blocks.26.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
678
+ "visual.blocks.26.norm1.weight": "model-00001-of-00002.safetensors",
679
+ "visual.blocks.26.norm2.weight": "model-00001-of-00002.safetensors",
680
+ "visual.blocks.27.attn.proj.bias": "model-00001-of-00002.safetensors",
681
+ "visual.blocks.27.attn.proj.weight": "model-00001-of-00002.safetensors",
682
+ "visual.blocks.27.attn.qkv.bias": "model-00001-of-00002.safetensors",
683
+ "visual.blocks.27.attn.qkv.weight": "model-00001-of-00002.safetensors",
684
+ "visual.blocks.27.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
685
+ "visual.blocks.27.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
686
+ "visual.blocks.27.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
687
+ "visual.blocks.27.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
688
+ "visual.blocks.27.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
689
+ "visual.blocks.27.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
690
+ "visual.blocks.27.norm1.weight": "model-00001-of-00002.safetensors",
691
+ "visual.blocks.27.norm2.weight": "model-00001-of-00002.safetensors",
692
+ "visual.blocks.28.attn.proj.bias": "model-00001-of-00002.safetensors",
693
+ "visual.blocks.28.attn.proj.weight": "model-00001-of-00002.safetensors",
694
+ "visual.blocks.28.attn.qkv.bias": "model-00001-of-00002.safetensors",
695
+ "visual.blocks.28.attn.qkv.weight": "model-00001-of-00002.safetensors",
696
+ "visual.blocks.28.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
697
+ "visual.blocks.28.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
698
+ "visual.blocks.28.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
699
+ "visual.blocks.28.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
700
+ "visual.blocks.28.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
701
+ "visual.blocks.28.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
702
+ "visual.blocks.28.norm1.weight": "model-00001-of-00002.safetensors",
703
+ "visual.blocks.28.norm2.weight": "model-00001-of-00002.safetensors",
704
+ "visual.blocks.29.attn.proj.bias": "model-00001-of-00002.safetensors",
705
+ "visual.blocks.29.attn.proj.weight": "model-00001-of-00002.safetensors",
706
+ "visual.blocks.29.attn.qkv.bias": "model-00001-of-00002.safetensors",
707
+ "visual.blocks.29.attn.qkv.weight": "model-00001-of-00002.safetensors",
708
+ "visual.blocks.29.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
709
+ "visual.blocks.29.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
710
+ "visual.blocks.29.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
711
+ "visual.blocks.29.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
712
+ "visual.blocks.29.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
713
+ "visual.blocks.29.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
714
+ "visual.blocks.29.norm1.weight": "model-00001-of-00002.safetensors",
715
+ "visual.blocks.29.norm2.weight": "model-00001-of-00002.safetensors",
716
+ "visual.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
717
+ "visual.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
718
+ "visual.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
719
+ "visual.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
720
+ "visual.blocks.3.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
721
+ "visual.blocks.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
722
+ "visual.blocks.3.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
723
+ "visual.blocks.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
724
+ "visual.blocks.3.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
725
+ "visual.blocks.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
726
+ "visual.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
727
+ "visual.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
728
+ "visual.blocks.30.attn.proj.bias": "model-00001-of-00002.safetensors",
729
+ "visual.blocks.30.attn.proj.weight": "model-00001-of-00002.safetensors",
730
+ "visual.blocks.30.attn.qkv.bias": "model-00001-of-00002.safetensors",
731
+ "visual.blocks.30.attn.qkv.weight": "model-00001-of-00002.safetensors",
732
+ "visual.blocks.30.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
733
+ "visual.blocks.30.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
734
+ "visual.blocks.30.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
735
+ "visual.blocks.30.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
736
+ "visual.blocks.30.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
737
+ "visual.blocks.30.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
738
+ "visual.blocks.30.norm1.weight": "model-00001-of-00002.safetensors",
739
+ "visual.blocks.30.norm2.weight": "model-00001-of-00002.safetensors",
740
+ "visual.blocks.31.attn.proj.bias": "model-00001-of-00002.safetensors",
741
+ "visual.blocks.31.attn.proj.weight": "model-00001-of-00002.safetensors",
742
+ "visual.blocks.31.attn.qkv.bias": "model-00001-of-00002.safetensors",
743
+ "visual.blocks.31.attn.qkv.weight": "model-00001-of-00002.safetensors",
744
+ "visual.blocks.31.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
745
+ "visual.blocks.31.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
746
+ "visual.blocks.31.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
747
+ "visual.blocks.31.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
748
+ "visual.blocks.31.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
749
+ "visual.blocks.31.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
750
+ "visual.blocks.31.norm1.weight": "model-00001-of-00002.safetensors",
751
+ "visual.blocks.31.norm2.weight": "model-00001-of-00002.safetensors",
752
+ "visual.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
753
+ "visual.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
754
+ "visual.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
755
+ "visual.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
756
+ "visual.blocks.4.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
757
+ "visual.blocks.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
758
+ "visual.blocks.4.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
759
+ "visual.blocks.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
760
+ "visual.blocks.4.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
761
+ "visual.blocks.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
762
+ "visual.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
763
+ "visual.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
764
+ "visual.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
765
+ "visual.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
766
+ "visual.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
767
+ "visual.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
768
+ "visual.blocks.5.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
769
+ "visual.blocks.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
770
+ "visual.blocks.5.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
771
+ "visual.blocks.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
772
+ "visual.blocks.5.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
773
+ "visual.blocks.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
774
+ "visual.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
775
+ "visual.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
776
+ "visual.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
777
+ "visual.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
778
+ "visual.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
779
+ "visual.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
780
+ "visual.blocks.6.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
781
+ "visual.blocks.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
782
+ "visual.blocks.6.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
783
+ "visual.blocks.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
784
+ "visual.blocks.6.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
785
+ "visual.blocks.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
786
+ "visual.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
787
+ "visual.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
788
+ "visual.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
789
+ "visual.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
790
+ "visual.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
791
+ "visual.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
792
+ "visual.blocks.7.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
793
+ "visual.blocks.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
794
+ "visual.blocks.7.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
795
+ "visual.blocks.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
796
+ "visual.blocks.7.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
797
+ "visual.blocks.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
798
+ "visual.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
799
+ "visual.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
800
+ "visual.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
801
+ "visual.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
802
+ "visual.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
803
+ "visual.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
804
+ "visual.blocks.8.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
805
+ "visual.blocks.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
806
+ "visual.blocks.8.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
807
+ "visual.blocks.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
808
+ "visual.blocks.8.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
809
+ "visual.blocks.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
810
+ "visual.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
811
+ "visual.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
812
+ "visual.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
813
+ "visual.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
814
+ "visual.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
815
+ "visual.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
816
+ "visual.blocks.9.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
817
+ "visual.blocks.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
818
+ "visual.blocks.9.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
819
+ "visual.blocks.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
820
+ "visual.blocks.9.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
821
+ "visual.blocks.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
822
+ "visual.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
823
+ "visual.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
824
+ "visual.merger.ln_q.weight": "model-00001-of-00002.safetensors",
825
+ "visual.merger.mlp.0.bias": "model-00001-of-00002.safetensors",
826
+ "visual.merger.mlp.0.weight": "model-00001-of-00002.safetensors",
827
+ "visual.merger.mlp.2.bias": "model-00001-of-00002.safetensors",
828
+ "visual.merger.mlp.2.weight": "model-00001-of-00002.safetensors",
829
+ "visual.patch_embed.proj.weight": "model-00001-of-00002.safetensors"
830
+ }
831
+ }
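
The `weight_map` entries above pair each parameter name with the shard file that stores it. As a minimal sketch of how such an index can be inspected, the snippet below groups tensors by shard; the three-entry dictionary is only an excerpt of the map above, and `count_by_shard` is a hypothetical helper name, not something shipped in this bundle.

```python
from collections import Counter

# Excerpt of the weight_map shown above; the real index holds many more entries.
weight_map = {
    "visual.blocks.31.norm2.weight": "model-00001-of-00002.safetensors",
    "visual.merger.ln_q.weight": "model-00001-of-00002.safetensors",
    "visual.patch_embed.proj.weight": "model-00001-of-00002.safetensors",
}


def count_by_shard(mapping):
    """Return how many tensors each shard file holds (hypothetical helper)."""
    return Counter(mapping.values())


shard_counts = count_by_shard(weight_map)
# Vision-tower parameters all share the "visual." prefix in this index.
visual_keys = [name for name in weight_map if name.startswith("visual.")]
print(shard_counts, len(visual_keys))
```

For the full file, the same helper would presumably be fed `json.load(...)["weight_map"]` from the index file in `configs/qwen2.5-vl-3b-instruct/`.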
configs/qwen2.5-vl-3b-instruct/preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "min_pixels": 3136,
+   "max_pixels": 12845056,
+   "patch_size": 14,
+   "temporal_patch_size": 2,
+   "merge_size": 2,
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "image_processor_type": "Qwen2VLImageProcessor",
+   "processor_class": "Qwen2_5_VLProcessor"
+ }
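
The pixel limits in this config line up with the patch geometry. The sketch below works through that arithmetic under one stated assumption: that a `merge_size` of 2 means each visual token covers a 2 × 2 group of 14-pixel patches, i.e. 28 × 28 pixels. `visual_tokens` is a hypothetical helper for illustration and ignores any resizing the real processor applies.

```python
# Values copied from preprocessor_config.json above.
patch_size, merge_size = 14, 2
min_pixels, max_pixels = 3136, 12845056

# Assumption: one merged visual token spans (patch_size * merge_size)^2 pixels.
pixels_per_token = (patch_size * merge_size) ** 2

min_tokens = min_pixels // pixels_per_token
max_tokens = max_pixels // pixels_per_token


def visual_tokens(height, width):
    """Token count for an image whose sides are already multiples of 28."""
    side = patch_size * merge_size
    return (height // side) * (width // side)


print(pixels_per_token, min_tokens, max_tokens, visual_tokens(224, 224))
# prints: 784 4 16384 64
```

Under that assumption, the configured range corresponds to between 4 and 16384 visual tokens per image.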
configs/qwen2.5-vl-3b-instruct/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
configs/qwen2.5-vl-3b-instruct/tokenizer_config.json ADDED
@@ -0,0 +1,207 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+   "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null,
+   "add_bos_token": false
+ }
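
The `chat_template` above is a Jinja template, which is hard to read on one line. As an illustration only, the sketch below re-implements its behaviour in plain Python for the simplest case, where every message's `content` is a string. `render_chatml` is a hypothetical name, and the image and video branches of the real template are deliberately left out.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Mimic the template above for string-only message contents."""
    parts = []
    for index, message in enumerate(messages):
        # The template injects a default system turn when none is supplied.
        if index == 0 and message["role"] != "system":
            parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([{"role": "user", "content": "put the spoon on the towel"}])
print(prompt)
```

In practice the tokenizer applies the real template; this version is only meant to make the resulting turn structure visible.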
configs/qwen2.5-vl-3b-instruct/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
dataset_statistics_bridge.json ADDED
@@ -0,0 +1,116 @@
+ {
+   "action": {
+     "mean": [
+       0.0002334193413844332,
+       0.0001300490548601374,
+       -0.0001276246621273458,
+       -0.00015565502690151334,
+       -0.0004039333143737167,
+       0.0002355769247515127,
+       0.5764579772949219
+     ],
+     "std": [
+       0.009765916503965855,
+       0.013689138926565647,
+       0.012667354196310043,
+       0.02853417582809925,
+       0.0306379534304142,
+       0.07691461592912674,
+       0.49737000465393066
+     ],
+     "max": [
+       0.41691166162490845,
+       0.25864794850349426,
+       0.21218234300613403,
+       3.122201919555664,
+       1.8618112802505493,
+       6.280478477478027,
+       1.0
+     ],
+     "min": [
+       -0.4007510244846344,
+       -0.13874775171279907,
+       -0.22553899884223938,
+       -3.2010786533355713,
+       -1.8618112802505493,
+       -6.279075622558594,
+       0.0
+     ],
+     "q01": [
+       -0.02872725307941437,
+       -0.04170349963009357,
+       -0.026093858778476715,
+       -0.08092105075716972,
+       -0.09288699507713317,
+       -0.20718276381492615,
+       0.0
+     ],
+     "q99": [
+       0.028309678435325586,
+       0.040855254605412394,
+       0.040161586627364146,
+       0.08192047759890528,
+       0.07792850524187081,
+       0.20382574498653397,
+       1.0
+     ]
+   },
+   "proprio": {
+     "mean": [
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0
+     ],
+     "std": [
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0
+     ],
+     "max": [
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0
+     ],
+     "min": [
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0
+     ],
+     "q01": [
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0
+     ],
+     "q99": [
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0,
+       0.0
+     ]
+   },
+   "num_transitions": 2135463,
+   "num_trajectories": 60064
+ }
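
INFERENCE.md describes these statistics as being used for action un-normalization. The exact scheme the harness applies is not shown in this diff, so the sketch below assumes one common convention: model outputs in [-1, 1] are mapped linearly onto the [q01, q99] range of each action dimension. `unnormalize` is a hypothetical helper, and the real code may treat the gripper dimension differently.

```python
# q01 / q99 copied from the "action" block of dataset_statistics_bridge.json.
Q01 = [-0.02872725307941437, -0.04170349963009357, -0.026093858778476715,
       -0.08092105075716972, -0.09288699507713317, -0.20718276381492615, 0.0]
Q99 = [0.028309678435325586, 0.040855254605412394, 0.040161586627364146,
       0.08192047759890528, 0.07792850524187081, 0.20382574498653397, 1.0]


def unnormalize(norm_action, low=Q01, high=Q99):
    """Map each value from [-1, 1] onto [low, high] (assumed convention)."""
    return [
        0.5 * (value + 1.0) * (hi - lo) + lo
        for value, lo, hi in zip(norm_action, low, high)
    ]


# A normalized action of all zeros lands at the midpoint of each range.
midpoint = unnormalize([0.0] * 7)
print(midpoint)
```

If the harness instead normalizes with `mean` and `std`, the same file supplies those values and only the formula changes.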
patches/convert_ckpt_standalone.py ADDED
@@ -0,0 +1,18 @@
+ """Standalone DeepSpeed stage-2 -> FP32 single-file converter.
+
+ Usage:
+     python scripts/convert_ckpt_standalone.py <ds_ckpt_dir> <fp32_out_path>
+ """
+ import os
+ import sys
+ from vlm4vla.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict
+
+ src = sys.argv[1]
+ dst = sys.argv[2]
+
+ assert os.path.isdir(src), f"not a directory: {src}"
+ os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)  # abspath: a bare filename has an empty dirname
+
+ print(f"converting {src} -> {dst}")
+ convert_zero_checkpoint_to_fp32_state_dict(src, dst)
+ print(f"done, size: {os.path.getsize(dst) / 1e9:.2f} GB")
patches/eval_calvin_model_wrapper.py.patch ADDED
@@ -0,0 +1,39 @@
+ commit b4ddb404e6bce2e116b04c598b9495a99bf40fdc
+ Author: yunfeixie <x908717327@gmail.com>
+ Date: Wed Apr 22 02:33:50 2026 +0000
+
+     Eval: SimplerBridge eval harness fixes + parallel sweep launcher
+
+     - eval/simpler/eval_ckpts_bridge.py: parameterize base_path/exec_steps/device
+       via env vars (STEP_FILTER, EXEC_STEPS, CUDA_DEV); point default at
+       /workspace/ckpts_archive/run_b_all so reproduction on this box is one command.
+     - eval/simpler/main_inference.py: set args.policy_model from configs["model"]
+       so maniskill2_evaluator -> get_robot_control_mode no longer
+       AttributeErrors on Qwen2.5-VL configs.
+     - eval/calvin/model_wrapper.py: call get_text_function with the arity the
+       function actually has (2 args); the 4-arg call was dead since data_utils.py
+       defines get_text_function(tokenizer, tokenizer_type, max_length=256).
+     - eval/simpler/sweep_parallel_bridge.py: new launcher that schedules
+       (ckpt, execute_step) cells across NGPU GPUs concurrently, one cell per GPU.
+       Ports cleanly to other boxes by --base-path/--ngpu flags; full porting
+       guide lives in the module docstring. Expected wall time for 10 ckpts x
+       3 exec_steps on 8 A100s: ~90 min.
+     - eval/simpler/diag_one_episode.sh: single-episode helper used to confirm
+       FCDecoder output shape is (1, 1, 4, 7) at inference time, i.e. the harness
+       correctly replays 4 distinct actions at execute_step=4.
+     - .gitignore: ignore the /real_inpainting symlink that points at
+       SimplerEnv/ManiSkill2_real2sim/data/real_inpainting.
+
+ diff --git a/eval/calvin/model_wrapper.py b/eval/calvin/model_wrapper.py
+ index 11fcef1..4a89788 100644
+ --- a/eval/calvin/model_wrapper.py
+ +++ b/eval/calvin/model_wrapper.py
+ @@ -141,7 +141,7 @@ class CustomModel:
+ else:
+ robot_prompt = None
+ print('robot_prompt', robot_prompt)
+ - self.text_preprocess = get_text_function(self.model.model.tokenizer, configs["model"], qwen25_seq_id, robot_prompt)
+ + self.text_preprocess = get_text_function(self.model.model.tokenizer, configs["model"])
+
+ self.action_space = self.configs["act_head"].get("action_space", "continuous")
+
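
The arity fix in this patch can be reproduced in isolation. The sketch below uses a stand-in `get_text_function` whose signature is taken from the commit message (`tokenizer, tokenizer_type, max_length=256`); its body and the `"qwen25vl"` value are hypothetical and chosen only so the calls can run without the real repository.

```python
def get_text_function(tokenizer, tokenizer_type, max_length=256):
    """Stand-in with the signature quoted in the commit message."""
    def preprocess(text):
        # Placeholder behaviour: tag the text and truncate it to max_length.
        return (tokenizer_type, text[:max_length])
    return preprocess


# Old call shape: four positional arguments against a three-parameter function.
try:
    get_text_function(None, "qwen25vl", 0, None)
    old_call_failed = False
except TypeError:
    old_call_failed = True

# Patched call shape: two arguments, max_length falls back to its default.
text_preprocess = get_text_function(None, "qwen25vl")
print(old_call_failed, text_preprocess("pick up the carrot"))
```

Python rejects the old call before the function body runs, which is consistent with the commit message describing the 4-argument call as dead.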
patches/eval_simpler_main_inference.py.patch ADDED
@@ -0,0 +1,38 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ commit b4ddb404e6bce2e116b04c598b9495a99bf40fdc
+ Author: yunfeixie <x908717327@gmail.com>
+ Date: Wed Apr 22 02:33:50 2026 +0000
+
+ Eval: SimplerBridge eval harness fixes + parallel sweep launcher
+
+ - eval/simpler/eval_ckpts_bridge.py: parameterize base_path/exec_steps/device
+ via env vars (STEP_FILTER, EXEC_STEPS, CUDA_DEV); point default at
+ /workspace/ckpts_archive/run_b_all so reproduction on this box is one command.
+ - eval/simpler/main_inference.py: set args.policy_model from configs["model"]
+ so maniskill2_evaluator -> get_robot_control_mode no longer
+ AttributeErrors on Qwen2.5-VL configs.
+ - eval/calvin/model_wrapper.py: call get_text_function with the arity the
+ function actually has (2 args); the 4-arg call was dead since data_utils.py
+ defines get_text_function(tokenizer, tokenizer_type, max_length=256).
+ - eval/simpler/sweep_parallel_bridge.py: new launcher that schedules
+ (ckpt, execute_step) cells across NGPU GPUs concurrently, one cell per GPU.
+ Ports cleanly to other boxes by --base-path/--ngpu flags; full porting
+ guide lives in the module docstring. Expected wall time for 10 ckpts x
+ 3 exec_steps on 8 A100s: ~90 min.
+ - eval/simpler/diag_one_episode.sh: single-episode helper used to confirm
+ FCDecoder output shape is (1, 1, 4, 7) at inference time, i.e. the harness
+ correctly replays 4 distinct actions at execute_step=4.
+ - .gitignore: ignore the /real_inpainting symlink that points at
+ SimplerEnv/ManiSkill2_real2sim/data/real_inpainting.
+
+ diff --git a/eval/simpler/main_inference.py b/eval/simpler/main_inference.py
+ index 59ce4d5..f438f74 100644
+ --- a/eval/simpler/main_inference.py
+ +++ b/eval/simpler/main_inference.py
+ @@ -235,6 +235,7 @@ if __name__ == "__main__":
+ args.model_name += f'_{configs["exp_name"]}'
+ if args.double_step:
+ args.model_name += "double"
+ + args.policy_model = configs.get("model", "vlm4vla")
+ os.environ["DISPLAY"] = ""
+ # prevent a single jax process from taking up all the GPU memory
+ os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
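The commit message describes a launcher that spreads (ckpt, execute_step) cells over NGPU GPUs, one cell per GPU. A minimal sketch of that idea follows. It is not the code in `sweep_parallel_bridge.py`: the `plan_waves` helper, the checkpoint names, the `execute_step` values, and the lockstep "wave" scheduling are all assumptions made for illustration (a real launcher might instead hand out the next cell as soon as a GPU frees up).

```python
from itertools import product


def plan_waves(ckpts, exec_steps, ngpu):
    """Group (ckpt, execute_step) cells into waves of at most `ngpu` cells.

    Each wave is a list of (gpu_index, ckpt, execute_step) tuples, so no GPU
    is assigned more than one cell within a wave.
    """
    cells = list(product(ckpts, exec_steps))
    waves = []
    for start in range(0, len(cells), ngpu):
        chunk = cells[start:start + ngpu]
        waves.append([(gpu, ckpt, step) for gpu, (ckpt, step) in enumerate(chunk)])
    return waves


# Hypothetical sweep matching the sizes in the commit message:
# 10 checkpoints x 3 execute_step values on 8 GPUs.
ckpts = [f"ckpt_{i:02d}" for i in range(10)]  # placeholder names
waves = plan_waves(ckpts, exec_steps=[1, 2, 4], ngpu=8)  # step values are assumed

for i, wave in enumerate(waves):
    print(f"wave {i}: {len(wave)} cells")
```

Under these assumptions the 30 cells fall into four waves of 8, 8, 8, and 6 cells.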
project.json ADDED
@@ -0,0 +1,166 @@
+ {
+     "robovlm_name": "RoboQwen25VL",
+     "parent": null,
+     "task_name": "bridge_finetune",
+     "model": "qwen25vl",
+     "model_url": "https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct",
+     "seq_len": 1,
+     "image_size": 224,
+     "image_mean": [
+         0.48145466,
+         0.4578275,
+         0.40821073
+     ],
+     "image_std": [
+         0.26862954,
+         0.26130258,
+         0.27577711
+     ],
+     "window_size": 1,
+     "fwd_pred_next_n": 4,
+     "arm_gripper_loss_ratio": 0.01,
+     "cap_loss_ratio": 0.05,
+     "fwd_loss_ratio": 0,
+     "seed": 123,
+     "batch_size": 8,
+     "num_workers": 4,
+     "data_scale": 1,
+     "optimizer": "adam",
+     "learning_rate": 5e-05,
+     "min_lr_scale": 0.01,
+     "weight_decay": 0,
+     "warmup_epochs": 0.25,
+     "warmup_steps": 0,
+     "warmup_ratio": null,
+     "use_hand_rgb": false,
+     "use_time_causal_attn": false,
+     "use_mim_obs_loss": false,
+     "use_pixel_loss": true,
+     "use_obs_queries": true,
+     "use_vision_resampler": false,
+     "vision_masked_ratio": 0.9,
+     "use_tube_mask": false,
+     "cache_root": "/dev/shm/vlm4vla/cache/qwen25vl",
+     "model_load_path": null,
+     "model_load_source": "torch",
+     "resume": null,
+     "model_path": "/workspace/models/Qwen2.5-VL-3B-Instruct",
+     "model_config": "/workspace/models/Qwen2.5-VL-3B-Instruct/config.json",
+     "train_setup": {
+         "precision": "bf16",
+         "predict_action": true,
+         "predict_forward": false,
+         "predict_forward_hand": false,
+         "predict_caption": false,
+         "train_vision": false,
+         "bits": -1,
+         "freeze_mm_mlp_adapter": false,
+         "freeze_backbone": false,
+         "freeze_resampler": false,
+         "tune_mm_mlp_adapter": false,
+         "mm_use_im_start_end": false,
+         "mm_use_im_patch_token": false,
+         "gradient_checkpointing": false,
+         "lora_enable": false,
+         "mm_projector_lr": 0.0001,
+         "lora_r": 64,
+         "lora_alpha": 16,
+         "lora_dropout": 0.05,
+         "lora_bias": "none",
+         "train_text_embedding": true
+     },
+     "vision_resampler": {
+         "vis_dim": 1024,
+         "depth": 8,
+         "dim_head": 64,
+         "heads": 8,
+         "num_latents": 64
+     },
+     "act_encoder": null,
+     "act_head": {
+         "type": "FCDecoder",
+         "hidden_size": 1024,
+         "action_dim": 7,
+         "down_sample": "none",
+         "latent": 1,
+         "fwd_pred_next_n": 1,
+         "window_size": 1,
+         "action_space": "continuous",
+         "with_history": true,
+         "history_type": "post"
+     },
+     "fwd_head": null,
+     "tokenizer": {
+         "type": "AutoProcessor",
+         "pretrained_model_name_or_path": "/workspace/models/Qwen2.5-VL-3B-Instruct",
+         "tokenizer_type": "qwen25vl",
+         "additional_special_tokens": null
+     },
+     "vlm": {
+         "type": "Qwen2_5_VLForConditionalGeneration",
+         "pretrained_model_name_or_path": "/workspace/models/Qwen2.5-VL-3B-Instruct",
+         "name": "qwen25vl"
+     },
+     "trainer": {
+         "accelerator": "gpu",
+         "strategy": "deepspeed_stage_2",
+         "precision": "bf16",
+         "logger": [
+             "wandb"
+         ],
+         "gradient_clip_val": 1.0,
+         "use_distributed_sampler": false,
+         "log_every_n_steps": 10,
+         "max_epochs": 5,
+         "val_check_interval": 40000,
+         "check_val_every_n_epoch": null,
+         "max_steps": 50000,
+         "accumulate_grad_batches": 8
+     },
+     "train_dataset": {
+         "type": "OpenVLADataset",
+         "data_root_dir": "/workspace/data",
+         "model_name": "qwen25vl",
+         "image_aug": true,
+         "mode": "train",
+         "data_mix": "bridge",
+         "window_sample": "sliding",
+         "organize_type": "interleave",
+         "shuffle_buffer_size": 51200,
+         "train": true
+     },
+     "val_dataset": {
+         "type": "OpenVLADataset",
+         "data_root_dir": "/workspace/data",
+         "model_name": "qwen25vl",
+         "mode": "train",
+         "data_mix": "bridge",
+         "window_sample": "sliding",
+         "organize_type": "interleave",
+         "shuffle_buffer_size": 10000,
+         "train": false
+     },
+     "norm_action": true,
+     "norm_min": -0.65,
+     "norm_max": 0.65,
+     "raw_config_path": "/workspace/VLM4VLA/configs/oxe_training/bridge/finetune_qwen25vl-3b_bridge_LOCAL_freezevis.json",
+     "num_nodes": 1,
+     "config": "/workspace/VLM4VLA/configs/oxe_training/bridge/finetune_qwen25vl-3b_bridge_LOCAL_freezevis.json",
+     "gpus": 8,
+     "log_dir": "/dev/shm/vlm4vla/logs/qwen25vl/bridge_finetune/2026-04-18/bridge_-bs512-lr5e-05-ws1-FCDecoder-latent1-freeze_vision",
+     "output_dir": "/dev/shm/vlm4vla/ckpts/qwen25vl/bridge_finetune/2026-04-18/bridge_-bs512-lr5e-05-ws1-FCDecoder-latent1-freeze_vision",
+     "data_dir": null,
+     "annotation_file": null,
+     "data_subfolder": null,
+     "task_num": null,
+     "exp_name": "21-04",
+     "use_multi_modal_emb": false,
+     "no_video_pretrained_model": false,
+     "finetune": false,
+     "llm": {
+         "type": null,
+         "n_embd": null,
+         "n_layer": null,
+         "n_head": null
+     }
+ }
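As a quick consistency check on the snapshot above, the `bs512` tag in `log_dir` appears to be the product of the per-step batch settings. The sketch below assumes `batch_size` is a per-GPU value and that the run tag is derived this way; neither is stated in the config itself, and the `effective_batch_size` helper is hypothetical.

```python
# Values copied from project.json; only the keys needed for the check are kept.
config = {
    "batch_size": 8,
    "gpus": 8,
    "num_nodes": 1,
    "trainer": {"accumulate_grad_batches": 8},
}


def effective_batch_size(cfg):
    """Samples per optimizer step, assuming `batch_size` is per GPU (assumption)."""
    return (
        cfg["batch_size"]
        * cfg["gpus"]
        * cfg["num_nodes"]
        * cfg["trainer"]["accumulate_grad_batches"]
    )


total = effective_batch_size(config)
print(total)  # prints: 512
```

The result, 8 x 8 x 1 x 8 = 512, matches the `bs512` fragment of the `log_dir` and `output_dir` paths.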
requirements-eval.txt ADDED
@@ -0,0 +1,212 @@
+ absl-py==2.4.0
+ accelerate==1.13.0
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.13.5
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ antlr4-python3-runtime==4.9.3
+ anyio==4.13.0
+ array_record==0.8.1
+ asttokens==3.0.1
+ astunparse==1.6.3
+ async-timeout==5.0.1
+ attrs==26.1.0
+ av==17.0.1
+ beautifulsoup4==4.14.3
+ bitsandbytes==0.49.2
+ certifi==2026.2.25
+ cffi==2.0.0
+ charset-normalizer==3.4.7
+ click==8.3.2
+ cloudpickle==3.1.2
+ colorama==0.4.6
+ colorlog==6.10.1
+ contourpy==1.3.2
+ cryptography==46.0.7
+ cycler==0.12.1
+ datasets==2.12.0
+ decorator==5.2.1
+ decord==0.6.0
+ deepspeed==0.18.9
+ diffusers==0.37.1
+ dill==0.3.6
+ -e git+https://github.com/moojink/dlimp_openvla.git@040105d256bd28866cc6620621a3d5f7b6b91b46#egg=dlimp
+ dm-tree==0.1.10
+ draccus==0.8.0
+ einops==0.8.2
+ einops-exts==0.0.4
+ etils==1.13.0
+ exceptiongroup==1.3.1
+ executing==2.2.1
+ Farama-Notifications==0.0.4
+ filelock==3.29.0
+ flamingo-pytorch==0.1.2
+ flash_attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl#sha256=ffe17686fa1a0f288de9eae7c32af209d32a27b037ef28614f042b377af5b15a
+ flatbuffers==25.12.19
+ fonttools==4.62.1
+ frozenlist==1.8.0
+ fsspec==2026.3.0
+ ftfy==6.3.1
+ gast==0.7.0
+ gdown==6.0.0
+ gitdb==4.0.12
+ GitPython==3.1.46
+ google-auth==2.49.2
+ google-auth-oauthlib==1.3.1
+ google-pasta==0.2.0
+ grpcio==1.80.0
+ gymnasium==0.29.1
+ h11==0.16.0
+ h5py==3.16.0
+ hf-xet==1.4.3
+ hjson==3.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface_hub==0.36.2
+ hydra-colorlog==1.2.0
+ hydra-core==1.3.2
+ idna==3.12
+ ImageIO==2.37.3
+ imageio-ffmpeg==0.6.0
+ importlib_metadata==9.0.0
+ importlib_resources==7.1.0
+ ipython==8.39.0
+ jedi==0.19.2
+ Jinja2==3.1.6
+ joblib==1.5.3
+ json-numpy==2.1.1
+ jsonlines==4.0.0
+ keras==2.15.0
+ kiwisolver==1.5.0
+ libclang==18.1.1
+ lightning==2.6.1
+ lightning-lite==1.8.6
+ lightning-utilities==0.15.3
+ -e git+https://github.com/simpler-env/ManiSkill2_real2sim@ef7a4d4fdf4b69f2c2154db5b15b9ac8dfe10682#egg=mani_skill2_real2sim&subdirectory=../../ManiSkill2_real2sim
+ Markdown==3.10.2
+ markdown-it-py==4.0.0
+ MarkupSafe==3.0.3
+ matplotlib==3.10.8
+ matplotlib-inline==0.2.1
+ mdurl==0.1.2
+ mediapy==1.2.6
+ mergedeep==1.3.4
+ ml-dtypes==0.2.0
+ mpmath==1.3.0
+ msgpack==1.1.2
+ multidict==6.7.1
+ multiprocess==0.70.14
+ mypy_extensions==1.1.0
+ networkx==3.4.2
+ ninja==1.13.0
+ nltk==3.9.4
+ numpy==1.24.4
+ nvidia-cublas-cu12==12.4.5.8
+ nvidia-cuda-cupti-cu12==12.4.127
+ nvidia-cuda-nvrtc-cu12==12.4.127
+ nvidia-cuda-runtime-cu12==12.4.127
+ nvidia-cudnn-cu12==9.1.0.70
+ nvidia-cufft-cu12==11.2.1.3
+ nvidia-curand-cu12==10.3.5.147
+ nvidia-cusolver-cu12==11.6.1.9
+ nvidia-cusparse-cu12==12.3.1.170
+ nvidia-cusparselt-cu12==0.6.2
+ nvidia-nccl-cu12==2.21.5
+ nvidia-nvjitlink-cu12==12.4.127
+ nvidia-nvtx-cu12==12.4.127
+ oauthlib==3.3.1
+ omegaconf==2.3.0
+ open-clip-torch==2.20.0
+ opencv-python==4.10.0.84
+ OpenEXR==3.4.10
+ -e git+https://github.com/yunfeixie233/VLM4VLA.git@b4ddb404e6bce2e116b04c598b9495a99bf40fdc#egg=openvla&subdirectory=openvla
+ opt_einsum==3.4.0
+ packaging @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_packaging_1776209387/work
+ pandas==2.3.3
+ parso==0.8.6
+ pexpect==4.9.0
+ pillow==12.2.0
+ platformdirs==4.9.6
+ pretty-errors==1.2.25
+ promise==2.3
+ prompt_toolkit==3.0.52
+ propcache==0.4.1
+ protobuf==4.25.9
+ psutil==7.2.2
+ ptyprocess==0.7.0
+ pure_eval==0.2.3
+ py-cpuinfo==9.0.0
+ pyarrow==24.0.0
+ pyasn1==0.6.3
+ pyasn1_modules==0.4.2
+ pycparser==3.0
+ pydantic==2.13.3
+ pydantic_core==2.46.3
+ Pygments==2.20.0
+ pyparsing==3.3.2
+ PySocks==1.7.1
+ python-dateutil==2.9.0.post0
+ pytorch-lightning==2.6.1
+ pytz==2026.1.post1
+ PyYAML==6.0.3
+ pyyaml-include==1.4.1
+ qwen-vl-utils==0.0.14
+ regex==2026.4.4
+ requests==2.33.1
+ requests-oauthlib==2.0.0
+ responses==0.18.0
+ rich==15.0.0
+ rtree==1.4.1
+ ruckig==0.17.3
+ safetensors==0.7.0
+ sapien==2.2.2
+ scikit-learn==1.7.2
+ scipy==1.15.3
+ sentence-transformers==2.2.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.58.0
+ -e git+https://github.com/simpler-env/SimplerEnv@06accaca93535902d408da4855f21cece12bceb7#egg=simpler_env
+ six==1.17.0
+ smmap==5.0.3
+ soupsieve==2.8.3
+ stack-data==0.6.3
+ sympy==1.13.1
+ tabulate==0.10.0
+ tensorboard==2.15.2
+ tensorboard-data-server==0.7.2
+ tensorboardX==2.6.5
+ tensorflow==2.15.0
+ tensorflow-addons==0.23.0
+ tensorflow-datasets==4.9.3
+ tensorflow-estimator==2.15.0
+ tensorflow-graphics==2021.12.3
+ tensorflow-io-gcs-filesystem==0.37.1
+ tensorflow-metadata==1.17.3
+ termcolor==3.3.0
+ threadpoolctl==3.6.0
+ timm==1.0.26
+ tokenizers==0.22.2
+ toml==0.10.2
+ torch==2.6.0
+ torchmetrics==1.9.0
+ torchvision==0.21.0
+ tqdm==4.67.3
+ traitlets==5.14.3
+ transformers==4.57.0
+ transforms3d==0.4.2
+ trimesh==4.11.5
+ triton==3.2.0
+ typeguard==2.13.3
+ typing-inspect==0.9.0
+ typing-inspection==0.4.2
+ typing_extensions==4.15.0
+ tzdata==2026.1
+ urllib3==2.6.3
+ -e git+https://github.com/yunfeixie233/VLM4VLA.git@b4ddb404e6bce2e116b04c598b9495a99bf40fdc#egg=vlm4vla
+ wandb==0.25.0
+ wcwidth==0.6.0
+ Werkzeug==3.1.8
+ wrapt==1.14.2
+ xxhash==3.6.0
+ yarl==1.23.0
+ zipp==3.23.1
stepstep=0030000.fp32.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:594f4440cc6d83069eb5a0a0f65711e1671cce064e00ddbc11fde9fc670697b0
+ size 15045187610
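The three lines above are a Git LFS pointer, not the weights themselves. After downloading the real file, its size and SHA-256 can be compared against the pointer. The sketch below is one way to do that under stated assumptions: the helper names `parse_pointer` and `matches_pointer` are hypothetical, and the local path in the usage note is a placeholder.

```python
import hashlib
import os

# Pointer text as committed (without the diff "+" prefix).
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:594f4440cc6d83069eb5a0a0f65711e1671cce064e00ddbc11fde9fc670697b0
size 15045187610
"""


def parse_pointer(text):
    """Split a Git LFS pointer into hash algorithm, hex digest, and byte size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False  # cheap check first; avoids hashing a truncated download
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


expected = parse_pointer(POINTER)
print(expected["algo"], expected["size"])
```

A call such as `matches_pointer("run_b_best/stepstep=0030000.fp32.pt", expected)` (path assumed) would then confirm the download before loading it. The pointer size, 15045187610 bytes, is consistent with the "15 GB" figure quoted in INFERENCE.md.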