rhli committed on
Commit 85c2ed2 · verified · 1 Parent(s): 393d8f6

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---

# 💡 DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This is the **diffusers-compatible** version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format with a self-contained pipeline script (`deepgen_pipeline.py`) — **no need to clone the DeepGen repository**.

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities—general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering—within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or even surpasses, state-of-the-art unified multimodal models that are 3× to 16× larger.
## 🛠️ Quick Start

### Installation

```bash
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```

### Load Pipeline

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24GB)
# pipe.enable_model_cpu_offload()
```

### Text-to-Image

```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```

### Image Editing

```python
from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```

## 📋 Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing. If `None`, performs text-to-image generation |
| `height` | 512 | Output image height (pixels) |
| `width` | 512 | Output image width (pixels) |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance (CFG) scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for CFG |

## 💾 Memory Requirements

| Mode | VRAM |
|------|------|
| Full GPU | ~20 GB |
| CPU Offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |

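The "Full GPU" figure above is consistent with what bf16 weights alone imply. As a back-of-envelope check (ours, not from the model card — component sizes are taken from the tables on this page, and the overhead is a rough assumption, not a measurement):

```python
# Back-of-envelope VRAM estimate for bf16 weights (2 bytes/parameter).
# Component parameter counts are taken from the tables in this card.
components = {"vlm": 3.0e9, "dit": 2.0e9, "connector": 0.8e9, "vae": 0.08e9}

bytes_per_param = 2  # bfloat16
weight_gb = sum(components.values()) * bytes_per_param / 1024**3

print(f"weights alone: ~{weight_gb:.1f} GB")  # ~11.0 GB
# Activations, text/image token buffers, and the CUDA context push actual
# usage toward the ~20 GB "Full GPU" figure reported above.
```

With CPU offload, only the currently active component resides on the GPU, which is why the requirement drops to roughly 14 GB.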
## 📁 Directory Structure

```
DeepGen-1.0-diffusers/
├── transformer/             # SD3 DiT weights (safetensors)
├── vae/                     # AutoencoderKL weights
├── connector/               # SCB Connector weights + config
├── scheduler/               # FlowMatchEulerDiscreteScheduler config
├── tokenizer/               # Qwen2.5-VL tokenizer
├── prompt_template.json     # Prompt formatting template
├── model_index.json         # Model metadata
└── deepgen_pipeline.py      # Self-contained pipeline script
```

> **Note:** The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.

## 🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.

| Component | Parameters | Description |
|-----------|-----------|-------------|
| VLM (Qwen2.5-VL-3B) | 3B | Vision-language model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |

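At a shape level, the SCB flow described above can be sketched as follows. This is a minimal sketch using the projector dimensions from `connector/config.json` in this repo; the 6-layer connector encoder is stubbed with `nn.Identity` for brevity, and reading 12288 as channel-stacked features from six VLM layers is our interpretation of "Stacked Channel Bridging", not a statement from the paper:

```python
import torch
import torch.nn as nn

num_vlm_layers, llm_hidden, seq_len = 6, 2048, 128  # 6 * 2048 = 12288

projector_1 = nn.Linear(num_vlm_layers * llm_hidden, 2048)  # 12288 -> connector width
connector = nn.Identity()                                   # stands in for the 6-layer encoder
projector_2 = nn.Linear(2048, 2048)                         # pooled conditioning for the DiT
projector_3 = nn.Linear(2048, 4096)                         # per-token conditioning for the DiT

# Hidden states from several VLM layers, stacked along the channel dim.
x = torch.randn(1, seq_len, num_vlm_layers * llm_hidden)
h = connector(projector_1(x))
pooled, seq = projector_2(h.mean(1)), projector_3(h)
print(pooled.shape, seq.shape)  # torch.Size([1, 2048]) torch.Size([1, 128, 4096])
```

The pooled vector plays the role of SD3's pooled text embedding, while the 4096-dim sequence replaces the usual text-encoder token states.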
## 📊 Benchmarks

### 1. General Image Generation

| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
| :--- | :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | — |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | — |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | — | 84.78 | — |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 🥉 | 87.05 | 74.18 🥉 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

### 2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 🥉 | 4.14 🥉 |

### 3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | — | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 🥈 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 🥇 | 46.5 🥈 |

### 4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | — | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 🥇 | 77.5 🥇 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |

## ⭐ Citation

```bibtex
@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}
```

## License

Apache 2.0
connector/config.json ADDED
{
  "connector": {
    "hidden_size": 2048,
    "intermediate_size": 11946,
    "num_hidden_layers": 6,
    "num_attention_heads": 32,
    "hidden_act": "gelu_pytorch_tanh",
    "layer_norm_eps": 1e-06,
    "attention_dropout": 0.0
  },
  "num_queries": 128,
  "projector_1_in": 12288,
  "projector_1_out": 2048,
  "projector_2_in": 2048,
  "projector_2_out": 2048,
  "projector_3_in": 2048,
  "projector_3_out": 4096,
  "llm_hidden_size": 2048,
  "max_length": 1024
}
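As a quick consistency check (ours, not part of the repo): the projector widths in this config chain together — `projector_1` maps the 12288-dim input (6 × 2048, assuming six stacked VLM layers) down to the connector width, whose output both `projector_2` and `projector_3` then consume:

```python
# Sanity-check the projector chain implied by connector/config.json.
# The "6 stacked VLM layers" reading of 12288 is an assumption.
config = {
    "connector": {"hidden_size": 2048, "num_hidden_layers": 6},
    "projector_1_in": 12288, "projector_1_out": 2048,
    "projector_2_in": 2048, "projector_2_out": 2048,
    "projector_3_in": 2048, "projector_3_out": 4096,
    "llm_hidden_size": 2048,
}

# projector_1 fuses the stacked VLM features down to the connector width...
assert config["projector_1_in"] == 6 * config["llm_hidden_size"]
assert config["projector_1_out"] == config["connector"]["hidden_size"]
# ...and projectors 2/3 both read the connector output.
assert config["projector_2_in"] == config["connector"]["hidden_size"]
assert config["projector_3_in"] == config["connector"]["hidden_size"]
print("projector chain consistent")
```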
connector/model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:425ce59f596eeb573f9cd52138b775df47ce69c60d2782fb32cac5498ed208f3
size 1729809448
deepgen_pipeline.py ADDED
1
+ """
2
+ DeepGen Diffusers Pipeline - Standalone pipeline for DeepGen-1.0.
3
+
4
+ This file is self-contained and does not require the DeepGen repository.
5
+ It can be used with `trust_remote_code=True` when loading from HuggingFace Hub.
6
+
7
+ Usage:
8
+ import torch
9
+ from diffusers import DiffusionPipeline
10
+ pipe = DiffusionPipeline.from_pretrained(
11
+ "deepgenteam/DeepGen-1.0-diffusers",
12
+ torch_dtype=torch.bfloat16,
13
+ trust_remote_code=True,
14
+ )
15
+ pipe.to("cuda")
16
+
17
+ # Text-to-Image
18
+ image = pipe("a racoon holding a shiny red apple", height=512, width=512).images[0]
19
+
20
+ # Image Edit
21
+ from PIL import Image
22
+ image = pipe("Place this guitar on a sandy beach.",
23
+ image=Image.open("guitar.png"), height=512, width=512).images[0]
24
+ """
25
+
26
+ import inspect
27
+ import math
28
+ import os
29
+ import json
30
+ import warnings
31
+ from functools import partial
32
+ from typing import Any, Callable, Dict, List, Optional, Tuple, Union
33
+
34
+ import numpy as np
35
+ import torch
36
+ import torch.nn as nn
37
+ import torch.nn.functional as F
38
+ import torch.utils.checkpoint
39
+ from torch.nn.init import _calculate_fan_in_and_fan_out
40
+ from torch.nn.utils.rnn import pad_sequence
41
+
42
+ from einops import rearrange
43
+ from PIL import Image
44
+ from safetensors.torch import load_file
45
+
46
+ from diffusers import AutoencoderKL, FlowMatchEulerDiscreteScheduler
47
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
48
+ from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
49
+ from diffusers.loaders import (
50
+ FromOriginalModelMixin,
51
+ FromSingleFileMixin,
52
+ PeftAdapterMixin,
53
+ SD3IPAdapterMixin,
54
+ SD3LoraLoaderMixin,
55
+ SD3Transformer2DLoadersMixin,
56
+ )
57
+ from diffusers.models.attention import FeedForward, JointTransformerBlock, _chunked_feed_forward
58
+ from diffusers.models.attention_processor import (
59
+ Attention,
60
+ AttentionProcessor,
61
+ FusedJointAttnProcessor2_0,
62
+ JointAttnProcessor2_0,
63
+ )
64
+ from diffusers.models.embeddings import CombinedTimestepTextProjEmbeddings, PatchEmbed
65
+ from diffusers.models.modeling_outputs import Transformer2DModelOutput
66
+ from diffusers.models.modeling_utils import ModelMixin
67
+ from diffusers.models.normalization import AdaLayerNormContinuous, AdaLayerNormZero
68
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline
69
+ from diffusers.pipelines.stable_diffusion_3.pipeline_output import StableDiffusion3PipelineOutput
70
+ from diffusers.utils import (
71
+ USE_PEFT_BACKEND,
72
+ is_torch_xla_available,
73
+ logging,
74
+ scale_lora_layers,
75
+ unscale_lora_layers,
76
+ )
77
+ from diffusers.utils.torch_utils import maybe_allow_in_graph, randn_tensor
78
+
79
+ from transformers import (
80
+ AutoTokenizer,
81
+ CLIPTextModelWithProjection,
82
+ CLIPTokenizer,
83
+ Qwen2_5_VLForConditionalGeneration,
84
+ SiglipImageProcessor,
85
+ SiglipVisionModel,
86
+ T5EncoderModel,
87
+ T5TokenizerFast,
88
+ )
89
+ from transformers.activations import ACT2FN
90
+ from transformers.configuration_utils import PretrainedConfig
91
+ from transformers.utils import (
92
+ is_flash_attn_2_available,
93
+ is_flash_attn_greater_or_equal_2_10,
94
+ )
95
+
96
+ if is_flash_attn_2_available():
97
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
98
+
99
+ if is_torch_xla_available():
100
+ import torch_xla.core.xla_model as xm
101
+ XLA_AVAILABLE = True
102
+ else:
103
+ XLA_AVAILABLE = False
104
+
105
+
106
+ logger = logging.get_logger(__name__)
107
+
108
+ IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)
109
+ IMAGE_STD = (0.26862954, 0.26130258, 0.27577711)
110
+
111
+
112
+ # =============================================================================
113
+ # Connector: Config + Attention + MLP + Encoder
114
+ # =============================================================================
115
+
116
+ class ConnectorConfig(PretrainedConfig):
117
+ def __init__(
118
+ self,
119
+ hidden_size=768,
120
+ intermediate_size=3072,
121
+ num_hidden_layers=12,
122
+ num_attention_heads=12,
123
+ hidden_act="gelu_pytorch_tanh",
124
+ layer_norm_eps=1e-6,
125
+ attention_dropout=0.0,
126
+ **kwargs,
127
+ ):
128
+ super().__init__(**kwargs)
129
+ self.hidden_size = hidden_size
130
+ self.intermediate_size = intermediate_size
131
+ self.num_hidden_layers = num_hidden_layers
132
+ self.num_attention_heads = num_attention_heads
133
+ self.attention_dropout = attention_dropout
134
+ self.layer_norm_eps = layer_norm_eps
135
+ self.hidden_act = hidden_act
136
+
137
+
138
+ def _trunc_normal_(tensor, mean, std, a, b):
139
+ def norm_cdf(x):
140
+ return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0
141
+ if (mean < a - 2 * std) or (mean > b + 2 * std):
142
+ warnings.warn(
143
+ "mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
144
+ "The distribution of values may be incorrect.", stacklevel=2)
145
+ l = norm_cdf((a - mean) / std)
146
+ u = norm_cdf((b - mean) / std)
147
+ tensor.uniform_(2 * l - 1, 2 * u - 1)
148
+ tensor.erfinv_()
149
+ tensor.mul_(std * math.sqrt(2.0))
150
+ tensor.add_(mean)
151
+ tensor.clamp_(min=a, max=b)
152
+
153
+
154
+ def trunc_normal_tf_(tensor, mean=0.0, std=1.0, a=-2.0, b=2.0):
155
+ with torch.no_grad():
156
+ _trunc_normal_(tensor, 0, 1.0, a, b)
157
+ tensor.mul_(std).add_(mean)
158
+
159
+
160
+ def variance_scaling_(tensor, scale=1.0, mode="fan_in", distribution="normal"):
161
+ fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
162
+ denom = {"fan_in": fan_in, "fan_out": fan_out, "fan_avg": (fan_in + fan_out) / 2}[mode]
163
+ variance = scale / denom
164
+ if distribution == "truncated_normal":
165
+ trunc_normal_tf_(tensor, std=math.sqrt(variance) / 0.87962566103423978)
166
+ elif distribution == "normal":
167
+ with torch.no_grad():
168
+ tensor.normal_(std=math.sqrt(variance))
169
+ elif distribution == "uniform":
170
+ bound = math.sqrt(3 * variance)
171
+ with torch.no_grad():
172
+ tensor.uniform_(-bound, bound)
173
+
174
+
175
+ def lecun_normal_(tensor):
176
+ variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
177
+
178
+
179
+ def default_flax_embed_init(tensor):
180
+ variance_scaling_(tensor, mode="fan_in", distribution="normal")
181
+
182
+
183
+ class ConnectorAttention(nn.Module):
184
+ def __init__(self, config):
185
+ super().__init__()
186
+ self.config = config
187
+ self.embed_dim = config.hidden_size
188
+ self.num_heads = config.num_attention_heads
189
+ self.head_dim = self.embed_dim // self.num_heads
190
+ if self.head_dim * self.num_heads != self.embed_dim:
191
+ raise ValueError(
192
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} "
193
+ f"and `num_heads`: {self.num_heads}).")
194
+ self.scale = self.head_dim ** -0.5
195
+ self.dropout = config.attention_dropout
196
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
197
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
198
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
199
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
200
+
201
+ def forward(self, hidden_states, attention_mask=None, output_attentions=False):
202
+ batch_size, q_len, _ = hidden_states.size()
203
+ query_states = self.q_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
204
+ key_states = self.k_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
205
+ value_states = self.v_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
206
+
207
+ k_v_seq_len = key_states.shape[-2]
208
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
209
+ if attention_mask is not None:
210
+ attn_weights = attn_weights + attention_mask
211
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
212
+ attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
213
+ attn_output = torch.matmul(attn_weights, value_states)
214
+ attn_output = attn_output.transpose(1, 2).contiguous().reshape(batch_size, q_len, self.embed_dim)
215
+ attn_output = self.out_proj(attn_output)
216
+ return attn_output, attn_weights
217
+
218
+
219
+ class ConnectorFlashAttention2(ConnectorAttention):
220
+ is_causal = False
221
+
222
+ def __init__(self, *args, **kwargs):
223
+ super().__init__(*args, **kwargs)
224
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
225
+
226
+ def forward(self, hidden_states, attention_mask=None, output_attentions=False):
227
+ output_attentions = False
228
+ batch_size, q_len, _ = hidden_states.size()
229
+ query_states = self.q_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
230
+ key_states = self.k_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
231
+ value_states = self.v_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
232
+ query_states = query_states.transpose(1, 2)
233
+ key_states = key_states.transpose(1, 2)
234
+ value_states = value_states.transpose(1, 2)
235
+ dropout_rate = self.dropout if self.training else 0.0
236
+ input_dtype = query_states.dtype
237
+ if input_dtype == torch.float32:
238
+ if torch.is_autocast_enabled():
239
+ target_dtype = torch.get_autocast_gpu_dtype()
240
+ elif hasattr(self.config, "_pre_quantization_dtype"):
241
+ target_dtype = self.config._pre_quantization_dtype
242
+ else:
243
+ target_dtype = self.q_proj.weight.dtype
244
+ query_states = query_states.to(target_dtype)
245
+ key_states = key_states.to(target_dtype)
246
+ value_states = value_states.to(target_dtype)
247
+ attn_output = _flash_attention_forward(
248
+ query_states, key_states, value_states, attention_mask, q_len,
249
+ dropout=dropout_rate, is_causal=self.is_causal,
250
+ use_top_left_mask=self._flash_attn_uses_top_left_mask)
251
+ attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim).contiguous()
252
+ attn_output = self.out_proj(attn_output)
253
+ return attn_output, None
254
+
255
+
256
+ class ConnectorSdpaAttention(ConnectorAttention):
257
+ is_causal = False
258
+
259
+ def forward(self, hidden_states, attention_mask=None, output_attentions=False):
260
+ if output_attentions:
261
+ return super().forward(hidden_states=hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)
262
+ batch_size, q_len, _ = hidden_states.size()
263
+ query_states = self.q_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
264
+ key_states = self.k_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
265
+ value_states = self.v_proj(hidden_states).view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
266
+ if query_states.device.type == "cuda" and attention_mask is not None:
267
+ query_states = query_states.contiguous()
268
+ key_states = key_states.contiguous()
269
+ value_states = value_states.contiguous()
270
+ is_causal = True if self.is_causal and q_len > 1 else False
271
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
272
+ query_states, key_states, value_states, attn_mask=attention_mask,
273
+ dropout_p=self.dropout if self.training else 0.0, is_causal=is_causal)
274
+ attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, q_len, self.embed_dim)
275
+ attn_output = self.out_proj(attn_output)
276
+ return attn_output, None
277
+
278
+
279
+ CONNECTOR_ATTENTION_CLASSES = {
280
+ "eager": ConnectorAttention,
281
+ "flash_attention_2": ConnectorFlashAttention2,
282
+ "sdpa": ConnectorSdpaAttention,
283
+ }
284
+
285
+
286
+ class ConnectorMLP(nn.Module):
287
+ def __init__(self, config):
288
+ super().__init__()
289
+ self.config = config
290
+ self.activation_fn = ACT2FN[config.hidden_act]
291
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
292
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
293
+
294
+ def forward(self, hidden_states):
295
+ hidden_states = self.fc1(hidden_states)
296
+ hidden_states = self.activation_fn(hidden_states)
297
+ hidden_states = self.fc2(hidden_states)
298
+ return hidden_states
299
+
300
+
301
+ def _init_connector_weights(module):
302
+ if isinstance(module, nn.Embedding):
303
+ default_flax_embed_init(module.weight)
304
+ elif isinstance(module, ConnectorAttention):
305
+ nn.init.xavier_uniform_(module.q_proj.weight)
306
+ nn.init.xavier_uniform_(module.k_proj.weight)
307
+ nn.init.xavier_uniform_(module.v_proj.weight)
308
+ nn.init.xavier_uniform_(module.out_proj.weight)
309
+ nn.init.zeros_(module.q_proj.bias)
310
+ nn.init.zeros_(module.k_proj.bias)
311
+ nn.init.zeros_(module.v_proj.bias)
312
+ nn.init.zeros_(module.out_proj.bias)
313
+ elif isinstance(module, ConnectorMLP):
314
+ nn.init.xavier_uniform_(module.fc1.weight)
315
+ nn.init.xavier_uniform_(module.fc2.weight)
316
+ nn.init.normal_(module.fc1.bias, std=1e-6)
317
+ nn.init.normal_(module.fc2.bias, std=1e-6)
318
+ elif isinstance(module, (nn.Linear, nn.Conv2d)):
319
+ lecun_normal_(module.weight)
320
+ if module.bias is not None:
321
+ nn.init.zeros_(module.bias)
322
+ elif isinstance(module, nn.LayerNorm):
323
+ module.bias.data.zero_()
324
+ module.weight.data.fill_(1.0)
325
+
326
+
327
+ class ConnectorEncoderLayer(nn.Module):
328
+ def __init__(self, config):
329
+ super().__init__()
330
+ self.embed_dim = config.hidden_size
331
+ self.self_attn = CONNECTOR_ATTENTION_CLASSES[config._attn_implementation](config=config)
332
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
333
+ self.mlp = ConnectorMLP(config)
334
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
335
+
336
+ def forward(self, hidden_states, attention_mask, output_attentions=False):
337
+ residual = hidden_states
338
+ hidden_states = self.layer_norm1(hidden_states)
339
+ hidden_states, attn_weights = self.self_attn(
340
+ hidden_states=hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)
341
+ hidden_states = residual + hidden_states
342
+ residual = hidden_states
343
+ hidden_states = self.layer_norm2(hidden_states)
344
+ hidden_states = self.mlp(hidden_states)
345
+ hidden_states = residual + hidden_states
346
+ outputs = (hidden_states,)
347
+ if output_attentions:
348
+ outputs += (attn_weights,)
349
+ return outputs
350
+
351
+
352
+ class ConnectorEncoder(nn.Module):
353
+ def __init__(self, config):
354
+ super().__init__()
355
+ self.config = config
356
+ self.layers = nn.ModuleList([ConnectorEncoderLayer(config) for _ in range(config.num_hidden_layers)])
357
+ self.gradient_checkpointing = False
358
+ self.apply(_init_connector_weights)
359
+
360
+ def forward(self, inputs_embeds):
361
+ hidden_states = inputs_embeds
362
+ for encoder_layer in self.layers:
363
+ if self.gradient_checkpointing and self.training:
364
+ layer_outputs = torch.utils.checkpoint.checkpoint(
365
+ encoder_layer.__call__, hidden_states, None, False, use_reentrant=False)
366
+ else:
367
+ layer_outputs = encoder_layer(hidden_states, None, output_attentions=False)
368
+ hidden_states = layer_outputs[0]
369
+ return hidden_states
370
+
371
+
372
+ class DeepGenConnector(nn.Module):
373
+ """Connector module bridging VLM hidden states to DiT conditioning."""
374
+
375
+ def __init__(self, connector_config, num_queries, llm_hidden_size,
376
+ projector_1_in, projector_1_out,
377
+ projector_2_in, projector_2_out,
378
+ projector_3_in, projector_3_out):
379
+ super().__init__()
380
+ self.connector = ConnectorEncoder(ConnectorConfig(**connector_config))
381
+ self.projector_1 = nn.Linear(projector_1_in, projector_1_out)
382
+ self.projector_2 = nn.Linear(projector_2_in, projector_2_out)
383
+ self.projector_3 = nn.Linear(projector_3_in, projector_3_out)
384
+ self.meta_queries = nn.Parameter(torch.zeros(num_queries, llm_hidden_size))
385
+ self.num_queries = num_queries
386
+
387
+ def llm2dit(self, x):
388
+ x = self.connector(self.projector_1(x))
389
+ pooled_out = self.projector_2(x.mean(1))
390
+ seq_out = self.projector_3(x)
391
+ return pooled_out, seq_out
392
+
393
+
394
+ # =============================================================================
395
+ # Custom SD3 Transformer (dynamic resolution + attention mask)
396
+ # =============================================================================
397
+
398
+ class CustomJointAttnProcessor2_0:
399
+ """Attention processor supporting attention masks for dynamic-resolution SD3."""
400
+
401
+ def __init__(self):
402
+ if not hasattr(F, "scaled_dot_product_attention"):
403
+ raise ImportError("CustomJointAttnProcessor2_0 requires PyTorch 2.0+")
404
+
405
+ def __call__(self, attn, hidden_states, encoder_hidden_states=None,
406
+ attention_mask=None, *args, **kwargs):
407
+ residual = hidden_states
408
+ batch_size = hidden_states.shape[0]
409
+
410
+ query = attn.to_q(hidden_states)
411
+ key = attn.to_k(hidden_states)
412
+ value = attn.to_v(hidden_states)
413
+
414
+ inner_dim = key.shape[-1]
415
+ head_dim = inner_dim // attn.heads
416
+
417
+ query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
418
+ key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
419
+         value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+
+         if attn.norm_q is not None:
+             query = attn.norm_q(query)
+         if attn.norm_k is not None:
+             key = attn.norm_k(key)
+
+         if encoder_hidden_states is not None:
+             ctx_len = encoder_hidden_states.shape[1]
+             encoder_hidden_states_query_proj = attn.add_q_proj(encoder_hidden_states).view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+             encoder_hidden_states_key_proj = attn.add_k_proj(encoder_hidden_states).view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+             encoder_hidden_states_value_proj = attn.add_v_proj(encoder_hidden_states).view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
+
+             if attn.norm_added_q is not None:
+                 encoder_hidden_states_query_proj = attn.norm_added_q(encoder_hidden_states_query_proj)
+             if attn.norm_added_k is not None:
+                 encoder_hidden_states_key_proj = attn.norm_added_k(encoder_hidden_states_key_proj)
+
+             query = torch.cat([query, encoder_hidden_states_query_proj], dim=2)
+             key = torch.cat([key, encoder_hidden_states_key_proj], dim=2)
+             value = torch.cat([value, encoder_hidden_states_value_proj], dim=2)
+
+             if attention_mask is not None:
+                 encoder_attention_mask = torch.ones(
+                     batch_size, ctx_len, dtype=torch.bool, device=hidden_states.device)
+                 attention_mask = torch.cat([attention_mask, encoder_attention_mask], dim=1)
+
+         if attention_mask is not None:
+             attention_mask = attention_mask[:, None] * attention_mask[..., None]
+             indices = range(attention_mask.shape[1])
+             attention_mask[:, indices, indices] = True
+             attention_mask = attention_mask[:, None]
+
+         hidden_states = F.scaled_dot_product_attention(
+             query, key, value, dropout_p=0.0, is_causal=False, attn_mask=attention_mask)
+         hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
+         hidden_states = hidden_states.to(query.dtype)
+
+         if encoder_hidden_states is not None:
+             hidden_states, encoder_hidden_states = (
+                 hidden_states[:, :residual.shape[1]],
+                 hidden_states[:, residual.shape[1]:])
+             if not attn.context_pre_only:
+                 encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
+
+         hidden_states = attn.to_out[0](hidden_states)
+         hidden_states = attn.to_out[1](hidden_states)
+
+         if encoder_hidden_states is not None:
+             return hidden_states, encoder_hidden_states
+         else:
+             return hidden_states
+
+
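+ The processor above expands a 1-D padding mask into a pairwise (query × key) attention mask via an outer product, then forces the diagonal to True so that padded rows still attend to themselves instead of producing all-masked softmax rows. A torch-free sketch of that step, assuming boolean lists; `build_pairwise_mask` is an illustrative name, not part of the pipeline:

```python
# Torch-free sketch of the 2-D mask construction: a 1-D keep mask
# becomes a pairwise mask via an outer "and", and the diagonal is
# forced True so padded positions still attend to themselves.
def build_pairwise_mask(keep):
    n = len(keep)
    mask = [[keep[i] and keep[j] for j in range(n)] for i in range(n)]
    for i in range(n):
        mask[i][i] = True  # mirrors attention_mask[:, indices, indices] = True
    return mask

pairwise = build_pairwise_mask([True, True, False])
```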
+ class CustomJointTransformerBlock(JointTransformerBlock):
+     def __init__(self, *args, **kwargs):
+         super().__init__(*args, **kwargs)
+         self.attn.set_processor(CustomJointAttnProcessor2_0())
+         if self.attn2 is not None:
+             self.attn2.set_processor(CustomJointAttnProcessor2_0())
+
+     def forward(self, hidden_states, encoder_hidden_states, temb,
+                 attention_mask=None, joint_attention_kwargs=None):
+         joint_attention_kwargs = joint_attention_kwargs or {}
+         if self.use_dual_attention:
+             norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp, norm_hidden_states2, gate_msa2 = self.norm1(hidden_states, emb=temb)
+         else:
+             norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(hidden_states, emb=temb)
+
+         if self.context_pre_only:
+             norm_encoder_hidden_states = self.norm1_context(encoder_hidden_states, temb)
+         else:
+             norm_encoder_hidden_states, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.norm1_context(encoder_hidden_states, emb=temb)
+
+         attn_output, context_attn_output = self.attn(
+             hidden_states=norm_hidden_states, attention_mask=attention_mask,
+             encoder_hidden_states=norm_encoder_hidden_states, **joint_attention_kwargs)
+
+         attn_output = gate_msa.unsqueeze(1) * attn_output
+         hidden_states = hidden_states + attn_output
+
+         if self.use_dual_attention:
+             attn_output2 = self.attn2(hidden_states=norm_hidden_states2, attention_mask=attention_mask, **joint_attention_kwargs)
+             attn_output2 = gate_msa2.unsqueeze(1) * attn_output2
+             hidden_states = hidden_states + attn_output2
+
+         norm_hidden_states = self.norm2(hidden_states)
+         norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
+         if self._chunk_size is not None:
+             ff_output = _chunked_feed_forward(self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size)
+         else:
+             ff_output = self.ff(norm_hidden_states)
+         ff_output = gate_mlp.unsqueeze(1) * ff_output
+         hidden_states = hidden_states + ff_output
+
+         if self.context_pre_only:
+             encoder_hidden_states = None
+         else:
+             context_attn_output = c_gate_msa.unsqueeze(1) * context_attn_output
+             encoder_hidden_states = encoder_hidden_states + context_attn_output
+             norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states)
+             norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]
+             if self._chunk_size is not None:
+                 context_ff_output = _chunked_feed_forward(self.ff_context, norm_encoder_hidden_states, self._chunk_dim, self._chunk_size)
+             else:
+                 context_ff_output = self.ff_context(norm_encoder_hidden_states)
+             encoder_hidden_states = encoder_hidden_states + c_gate_mlp.unsqueeze(1) * context_ff_output
+
+         return encoder_hidden_states, hidden_states
+
+
+ class SD3Transformer2DModel(
+     ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin, SD3Transformer2DLoadersMixin
+ ):
+     _supports_gradient_checkpointing = True
+     _no_split_modules = ["JointTransformerBlock", "CustomJointTransformerBlock"]
+     _skip_layerwise_casting_patterns = ["pos_embed", "norm"]
+
+     @register_to_config
+     def __init__(
+         self,
+         sample_size: int = 128,
+         patch_size: int = 2,
+         in_channels: int = 16,
+         num_layers: int = 18,
+         attention_head_dim: int = 64,
+         num_attention_heads: int = 18,
+         joint_attention_dim: int = 4096,
+         caption_projection_dim: int = 1152,
+         pooled_projection_dim: int = 2048,
+         out_channels: int = 16,
+         pos_embed_max_size: int = 96,
+         dual_attention_layers: Tuple[int, ...] = (),
+         qk_norm: Optional[str] = None,
+     ):
+         super().__init__()
+         self.out_channels = out_channels if out_channels is not None else in_channels
+         self.inner_dim = num_attention_heads * attention_head_dim
+
+         self.pos_embed = PatchEmbed(
+             height=sample_size, width=sample_size, patch_size=patch_size,
+             in_channels=in_channels, embed_dim=self.inner_dim,
+             pos_embed_max_size=pos_embed_max_size)
+         self.time_text_embed = CombinedTimestepTextProjEmbeddings(
+             embedding_dim=self.inner_dim, pooled_projection_dim=pooled_projection_dim)
+         self.context_embedder = nn.Linear(joint_attention_dim, caption_projection_dim)
+
+         self.transformer_blocks = nn.ModuleList([
+             CustomJointTransformerBlock(
+                 dim=self.inner_dim,
+                 num_attention_heads=num_attention_heads,
+                 attention_head_dim=attention_head_dim,
+                 context_pre_only=i == num_layers - 1,
+                 qk_norm=qk_norm,
+                 use_dual_attention=i in dual_attention_layers,
+             ) for i in range(num_layers)
+         ])
+
+         self.norm_out = AdaLayerNormContinuous(self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6)
+         self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True)
+         self.gradient_checkpointing = False
+
+     @property
+     def attn_processors(self):
+         processors = {}
+
+         def fn_recursive_add_processors(name, module, processors):
+             if hasattr(module, "get_processor"):
+                 processors[f"{name}.processor"] = module.get_processor()
+             for sub_name, child in module.named_children():
+                 fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
+             return processors
+
+         for name, module in self.named_children():
+             fn_recursive_add_processors(name, module, processors)
+         return processors
+
+     def set_attn_processor(self, processor):
+         count = len(self.attn_processors.keys())
+         if isinstance(processor, dict) and len(processor) != count:
+             raise ValueError(f"A dict of processors was passed, but the number of processors {len(processor)} does not match the number of attention layers: {count}.")
+
+         def fn_recursive_attn_processor(name, module, processor):
+             if hasattr(module, "set_processor"):
+                 if not isinstance(processor, dict):
+                     module.set_processor(processor)
+                 else:
+                     module.set_processor(processor.pop(f"{name}.processor"))
+             for sub_name, child in module.named_children():
+                 fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
+
+         for name, module in self.named_children():
+             fn_recursive_attn_processor(name, module, processor)
+
+     def forward(
+         self,
+         hidden_states,
+         encoder_hidden_states=None,
+         cond_hidden_states=None,
+         pooled_projections=None,
+         timestep=None,
+         block_controlnet_hidden_states=None,
+         joint_attention_kwargs=None,
+         return_dict=True,
+         skip_layers=None,
+     ):
+         if joint_attention_kwargs is not None:
+             joint_attention_kwargs = joint_attention_kwargs.copy()
+             lora_scale = joint_attention_kwargs.pop("scale", 1.0)
+         else:
+             lora_scale = 1.0
+
+         if USE_PEFT_BACKEND:
+             scale_lora_layers(self, lora_scale)
+         elif joint_attention_kwargs is not None and joint_attention_kwargs.get("scale", None) is not None:
+             logger.warning("Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective.")
+
+         latent_sizes = [hs.shape[-2:] for hs in hidden_states]
+         bsz = len(hidden_states)
+
+         hidden_states_list = []
+         for idx in range(bsz):
+             hidden_states_per_sample = self.pos_embed(hidden_states[idx][None])[0]
+             if cond_hidden_states is not None:
+                 for ref in cond_hidden_states[idx]:
+                     hidden_states_per_sample = torch.cat(
+                         [hidden_states_per_sample, self.pos_embed(ref[None])[0]])
+             hidden_states_list.append(hidden_states_per_sample)
+
+         max_len = max(len(hs) for hs in hidden_states_list)
+         attention_mask = torch.zeros(bsz, max_len, dtype=torch.bool, device=self.device)
+         for i, hs in enumerate(hidden_states_list):
+             attention_mask[i, :len(hs)] = True
+
+         hidden_states = pad_sequence(hidden_states_list, batch_first=True, padding_value=0.0, padding_side='right')
+
+         temb = self.time_text_embed(timestep, pooled_projections)
+         encoder_hidden_states = self.context_embedder(encoder_hidden_states)
+
+         if joint_attention_kwargs is not None and "ip_adapter_image_embeds" in joint_attention_kwargs:
+             ip_adapter_image_embeds = joint_attention_kwargs.pop("ip_adapter_image_embeds")
+             ip_hidden_states, ip_temb = self.image_proj(ip_adapter_image_embeds, timestep)
+             joint_attention_kwargs.update(ip_hidden_states=ip_hidden_states, temb=ip_temb)
+
+         for index_block, block in enumerate(self.transformer_blocks):
+             is_skip = skip_layers is not None and index_block in skip_layers
+             if torch.is_grad_enabled() and self.gradient_checkpointing and not is_skip:
+                 encoder_hidden_states, hidden_states = self._gradient_checkpointing_func(
+                     block, hidden_states, encoder_hidden_states, temb, attention_mask, joint_attention_kwargs)
+             elif not is_skip:
+                 encoder_hidden_states, hidden_states = block(
+                     hidden_states=hidden_states, encoder_hidden_states=encoder_hidden_states,
+                     temb=temb, attention_mask=attention_mask, joint_attention_kwargs=joint_attention_kwargs)
+
+             if block_controlnet_hidden_states is not None and block.context_pre_only is False:
+                 interval_control = len(self.transformer_blocks) / len(block_controlnet_hidden_states)
+                 hidden_states = hidden_states + block_controlnet_hidden_states[int(index_block / interval_control)]
+
+         hidden_states = self.norm_out(hidden_states, temb)
+         hidden_states = self.proj_out(hidden_states)
+
+         patch_size = self.config.patch_size
+         latent_sizes = [(ls[0] // patch_size, ls[1] // patch_size) for ls in latent_sizes]
+
+         output = [rearrange(hs[:math.prod(latent_size)], '(h w) (p q c) -> c (h p) (w q)',
+                             h=latent_size[0], w=latent_size[1], p=patch_size, q=patch_size)
+                   for hs, latent_size in zip(hidden_states, latent_sizes)]
+
+         try:
+             output = torch.stack(output)
+         except RuntimeError:
+             # Samples have different latent sizes and cannot be stacked;
+             # return the list of per-sample tensors instead.
+             pass
+
+         if USE_PEFT_BACKEND:
+             unscale_lora_layers(self, lora_scale)
+
+         if not return_dict:
+             return (output,)
+         return Transformer2DModelOutput(sample=output)
+
+
+ # =============================================================================
+ # Custom StableDiffusion3Pipeline (with cond_latents + dynamic shift)
+ # =============================================================================
+
+ def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096, base_shift=0.5, max_shift=1.15):
+     m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
+     b = base_shift - m * base_seq_len
+     mu = image_seq_len * m + b
+     return mu
+
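+ `calculate_shift` linearly interpolates the timestep-shift parameter `mu` between `base_shift` at `base_seq_len` tokens and `max_shift` at `max_seq_len` tokens, so larger images get a stronger shift. A standalone sketch of the same arithmetic, checked at the two anchor points:

```python
# Standalone copy of the linear interpolation used for dynamic shifting:
# mu = m * seq_len + b, anchored so that mu(base_seq_len) == base_shift
# and mu(max_seq_len) == max_shift.
def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                    base_shift=0.5, max_shift=1.15):
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b
```

Any sequence length between the anchors maps to a `mu` strictly between the two shift values.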
+
+ def retrieve_timesteps(scheduler, num_inference_steps=None, device=None, timesteps=None, sigmas=None, **kwargs):
+     if timesteps is not None and sigmas is not None:
+         raise ValueError("Only one of `timesteps` or `sigmas` can be passed.")
+     if timesteps is not None:
+         accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
+         if not accepts_timesteps:
+             raise ValueError(f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom timestep schedules.")
+         scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
+         timesteps = scheduler.timesteps
+         num_inference_steps = len(timesteps)
+     elif sigmas is not None:
+         accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
+         if not accept_sigmas:
+             raise ValueError(f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom sigmas schedules.")
+         scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
+         timesteps = scheduler.timesteps
+         num_inference_steps = len(timesteps)
+     else:
+         scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
+         timesteps = scheduler.timesteps
+     return timesteps, num_inference_steps
+
+
+ class _SD3Pipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, SD3IPAdapterMixin):
+     """Internal SD3 pipeline with cond_latents support."""
+
+     model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->image_encoder->transformer->vae"
+     _optional_components = ["image_encoder", "feature_extractor"]
+     _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]
+
+     def __init__(self, transformer, scheduler, vae, text_encoder, tokenizer,
+                  text_encoder_2, tokenizer_2, text_encoder_3, tokenizer_3,
+                  image_encoder=None, feature_extractor=None):
+         super().__init__()
+         self.register_modules(
+             vae=vae, text_encoder=text_encoder, text_encoder_2=text_encoder_2,
+             text_encoder_3=text_encoder_3, tokenizer=tokenizer, tokenizer_2=tokenizer_2,
+             tokenizer_3=tokenizer_3, transformer=transformer, scheduler=scheduler,
+             image_encoder=image_encoder, feature_extractor=feature_extractor)
+         self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
+         self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
+         self.tokenizer_max_length = self.tokenizer.model_max_length if hasattr(self, "tokenizer") and self.tokenizer is not None else 77
+         self.default_sample_size = self.transformer.config.sample_size if hasattr(self, "transformer") and self.transformer is not None else 128
+         self.patch_size = self.transformer.config.patch_size if hasattr(self, "transformer") and self.transformer is not None else 2
+
+     def check_inputs(self, prompt, prompt_2, prompt_3, height, width, negative_prompt=None,
+                      negative_prompt_2=None, negative_prompt_3=None, prompt_embeds=None,
+                      negative_prompt_embeds=None, pooled_prompt_embeds=None,
+                      negative_pooled_prompt_embeds=None, callback_on_step_end_tensor_inputs=None,
+                      max_sequence_length=None):
+         if height % (self.vae_scale_factor * self.patch_size) != 0 or width % (self.vae_scale_factor * self.patch_size) != 0:
+             raise ValueError(f"`height` and `width` have to be divisible by {self.vae_scale_factor * self.patch_size}.")
+         if prompt_embeds is not None and pooled_prompt_embeds is None:
+             raise ValueError("If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed.")
+         if negative_prompt_embeds is not None and negative_pooled_prompt_embeds is None:
+             raise ValueError("If `negative_prompt_embeds` are provided, `negative_pooled_prompt_embeds` also have to be passed.")
+
+     def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None):
+         if latents is not None:
+             return latents.to(device=device, dtype=dtype)
+         shape = (batch_size, num_channels_latents, int(height) // self.vae_scale_factor, int(width) // self.vae_scale_factor)
+         if isinstance(generator, list) and len(generator) != batch_size:
+             raise ValueError(f"You have passed a list of generators of length {len(generator)}, but requested an effective batch size of {batch_size}.")
+         latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+         return latents
+
+     @property
+     def guidance_scale(self):
+         return self._guidance_scale
+
+     @property
+     def do_classifier_free_guidance(self):
+         return self._guidance_scale > 1
+
+     @property
+     def joint_attention_kwargs(self):
+         return self._joint_attention_kwargs
+
+     @torch.no_grad()
+     def __call__(
+         self,
+         prompt=None, prompt_2=None, prompt_3=None,
+         height=None, width=None, num_inference_steps=28, sigmas=None,
+         guidance_scale=7.0,
+         negative_prompt=None, negative_prompt_2=None, negative_prompt_3=None,
+         num_images_per_prompt=1, generator=None, latents=None,
+         cond_latents=None,
+         prompt_embeds=None, negative_prompt_embeds=None,
+         pooled_prompt_embeds=None, negative_pooled_prompt_embeds=None,
+         output_type="pil", return_dict=True,
+         joint_attention_kwargs=None, callback_on_step_end=None,
+         callback_on_step_end_tensor_inputs=["latents"],
+         max_sequence_length=256, mu=None, **kwargs,
+     ):
+         height = height or self.default_sample_size * self.vae_scale_factor
+         width = width or self.default_sample_size * self.vae_scale_factor
+
+         self.check_inputs(prompt, prompt_2, prompt_3, height, width,
+                           negative_prompt=negative_prompt, prompt_embeds=prompt_embeds,
+                           negative_prompt_embeds=negative_prompt_embeds,
+                           pooled_prompt_embeds=pooled_prompt_embeds,
+                           negative_pooled_prompt_embeds=negative_pooled_prompt_embeds)
+
+         self._guidance_scale = guidance_scale
+         self._joint_attention_kwargs = joint_attention_kwargs
+         self._interrupt = False
+
+         if prompt is not None and isinstance(prompt, str):
+             batch_size = 1
+         elif prompt is not None and isinstance(prompt, list):
+             batch_size = len(prompt)
+         else:
+             batch_size = prompt_embeds.shape[0]
+
+         device = self._execution_device
+
+         # Prompt embeddings are expected to be precomputed and passed in
+         # by the caller (DeepGenPipeline provides them directly).
+         if self.do_classifier_free_guidance:
+             prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
+             pooled_prompt_embeds = torch.cat([negative_pooled_prompt_embeds, pooled_prompt_embeds], dim=0)
+
+         num_channels_latents = self.transformer.config.in_channels
+         latents = self.prepare_latents(
+             batch_size * num_images_per_prompt, num_channels_latents, height, width,
+             prompt_embeds.dtype, device, generator, latents)
+
+         scheduler_kwargs = {}
+         if self.scheduler.config.get("use_dynamic_shifting", None) and mu is None:
+             _, _, h, w = latents.shape
+             image_seq_len = (h // self.transformer.config.patch_size) * (w // self.transformer.config.patch_size)
+             mu = calculate_shift(
+                 image_seq_len,
+                 self.scheduler.config.get("base_image_seq_len", 256),
+                 self.scheduler.config.get("max_image_seq_len", 4096),
+                 self.scheduler.config.get("base_shift", 0.5),
+                 self.scheduler.config.get("max_shift", 1.16))
+             scheduler_kwargs["mu"] = mu
+         elif mu is not None:
+             scheduler_kwargs["mu"] = mu
+
+         timesteps, num_inference_steps = retrieve_timesteps(
+             self.scheduler, num_inference_steps, device, sigmas=sigmas, **scheduler_kwargs)
+         num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
+
+         if cond_latents is not None and self.do_classifier_free_guidance:
+             if len(cond_latents) == latents.shape[0]:
+                 cond_latents = cond_latents * 2
+
+         with self.progress_bar(total=num_inference_steps) as progress_bar:
+             for i, t in enumerate(timesteps):
+                 if self._interrupt:
+                     continue
+                 latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
+                 timestep = t.expand(latent_model_input.shape[0])
+                 noise_pred = self.transformer(
+                     hidden_states=latent_model_input, cond_hidden_states=cond_latents,
+                     timestep=timestep, encoder_hidden_states=prompt_embeds,
+                     pooled_projections=pooled_prompt_embeds,
+                     joint_attention_kwargs=self.joint_attention_kwargs,
+                     return_dict=False)[0]
+
+                 if self.do_classifier_free_guidance:
+                     noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+                     noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
+
+                 latents_dtype = latents.dtype
+                 latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
+                 if latents.dtype != latents_dtype and torch.backends.mps.is_available():
+                     latents = latents.to(latents_dtype)
+
+                 if callback_on_step_end is not None:
+                     callback_kwargs = {}
+                     for k in callback_on_step_end_tensor_inputs:
+                         callback_kwargs[k] = locals()[k]
+                     callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
+                     latents = callback_outputs.pop("latents", latents)
+
+                 if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
+                     progress_bar.update()
+
+                 if XLA_AVAILABLE:
+                     xm.mark_step()
+
+         if output_type == "latent":
+             image = latents
+         else:
+             latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
+             image = self.vae.decode(latents, return_dict=False)[0]
+             image = self.image_processor.postprocess(image, output_type=output_type)
+
+         self.maybe_free_model_hooks()
+
+         if not return_dict:
+             return (image,)
+         return StableDiffusion3PipelineOutput(images=image)
+
+
+ # =============================================================================
+ # DeepGen Pipeline (main entry point)
+ # =============================================================================
+
+ class DeepGenPipeline(DiffusionPipeline):
+     """
+     DeepGen 1.0 Pipeline for text-to-image generation and image editing.
+
+     This pipeline integrates Qwen2.5-VL (VLM) + SCB Connector + SD3 DiT into a
+     single interface. Standard diffusers components (transformer, vae, scheduler)
+     are loaded by DiffusionPipeline; non-standard components (VLM, connector,
+     tokenizer, prompt_template) are loaded automatically on first use.
+
+     Usage:
+         pipe = DiffusionPipeline.from_pretrained(
+             "deepgenteam/DeepGen-1.0-diffusers",
+             torch_dtype=torch.bfloat16,
+             trust_remote_code=True,
+         )
+         pipe.to("cuda")
+         result = pipe("a raccoon holding an apple", height=512, width=512)
+         result.images[0].save("output.png")
+     """
+
+     _optional_components = []
+
+     def __init__(
+         self,
+         transformer: SD3Transformer2DModel,
+         vae: AutoencoderKL,
+         scheduler: FlowMatchEulerDiscreteScheduler,
+     ):
+         super().__init__()
+         self.register_modules(
+             transformer=transformer,
+             vae=vae,
+             scheduler=scheduler,
+         )
+         self._upgrade_transformer()
+         self._extras_loaded = False
+         self._cpu_offload = False
+         self._gpu_device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+         self.lmm = None
+         self.tokenizer = None
+         self.connector_module = None
+         self.prompt_template = None
+         self.max_length = 1024
+         self.image_token_id = None
+         self.vit_mean = torch.tensor(IMAGE_MEAN)
+         self.vit_std = torch.tensor(IMAGE_STD)
+
+     def _upgrade_transformer(self):
+         """Convert standard diffusers SD3Transformer2DModel to custom version
+         with cond_latents support for image editing. No weight copying needed."""
+         from diffusers.models.transformers.transformer_sd3 import SD3Transformer2DModel as _OrigSD3
+         if isinstance(self.transformer, _OrigSD3) and not isinstance(self.transformer, SD3Transformer2DModel):
+             self.transformer.__class__ = SD3Transformer2DModel
+             for block in self.transformer.transformer_blocks:
+                 block.__class__ = CustomJointTransformerBlock
+                 block.attn.set_processor(CustomJointAttnProcessor2_0())
+                 if block.attn2 is not None:
+                     block.attn2.set_processor(CustomJointAttnProcessor2_0())
+
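+ `_upgrade_transformer` relies on Python's ability to retarget an instance's `__class__` at runtime: the object keeps all of its state (here, the loaded weights) while gaining the subclass's methods, so no weight copying is needed. A minimal illustration with toy classes (`Base`/`Custom` are ours, not the real models):

```python
# Retargeting __class__ swaps an instance's behavior while keeping its state.
class Base:
    def __init__(self):
        self.weight = 42  # stands in for loaded model parameters

    def forward(self):
        return "base"

class Custom(Base):
    def forward(self):
        return "custom"

obj = Base()
obj.__class__ = Custom  # same state (obj.weight), new methods
```

This works because `Custom` adds no new instance attributes that `Base.__init__` would have missed; the pipeline's custom classes follow the same constraint.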
+     def _resolve_pretrained_path(self):
+         path = self.config._name_or_path
+         if os.path.isdir(path):
+             return path
+         from huggingface_hub import snapshot_download
+         return snapshot_download(repo_id=path)
+
+     def _load_extras(self, vlm_model_path=None, attn_implementation="flash_attention_2"):
+         """Load non-standard components (VLM, connector, tokenizer, prompt_template)."""
+         if self._extras_loaded:
+             return
+         path = self._resolve_pretrained_path()
+         dtype = next(self.transformer.parameters()).dtype
+
+         model_index_path = os.path.join(path, "model_index.json")
+         extra_cfg = {}
+         if os.path.isfile(model_index_path):
+             with open(model_index_path, "r") as f:
+                 extra_cfg = json.load(f)
+
+         # Resolve VLM path: prefer local merged VLM (with LoRA baked in)
+         vlm_path = vlm_model_path
+         if vlm_path is None:
+             local_merged = os.path.join(path, "vlm")
+             if os.path.isdir(local_merged):
+                 vlm_path = local_merged
+             else:
+                 vlm_path = extra_cfg.get("vlm", "Qwen/Qwen2.5-VL-3B-Instruct")
+             if not os.path.isdir(vlm_path):
+                 local_candidate = os.path.join("/data/huggingface", vlm_path.split("/")[-1])
+                 if os.path.isdir(local_candidate):
+                     vlm_path = local_candidate
+         print(f"Loading VLM from {vlm_path}...")
+         try:
+             self.lmm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+                 vlm_path, torch_dtype=dtype, attn_implementation=attn_implementation)
+         except Exception:
+             # Fall back to SDPA when the requested attention implementation
+             # (e.g. flash attention) is unavailable.
+             self.lmm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+                 vlm_path, torch_dtype=dtype, attn_implementation="sdpa")
+         self.lmm.requires_grad_(False)
+
+         print("Loading tokenizer...")
+         tokenizer_path = os.path.join(path, "tokenizer")
+         if os.path.isdir(tokenizer_path):
+             self.tokenizer = AutoTokenizer.from_pretrained(
+                 tokenizer_path, trust_remote_code=True, padding_side='right')
+         else:
+             self.tokenizer = AutoTokenizer.from_pretrained(
+                 vlm_path, trust_remote_code=True, padding_side='right')
+
+         print("Loading connector...")
+         connector_dir = os.path.join(path, "connector")
+         with open(os.path.join(connector_dir, "config.json"), "r") as f:
+             connector_cfg = json.load(f)
+
+         conn_cfg = connector_cfg["connector"].copy()
+         conn_cfg["_attn_implementation"] = "sdpa"
+
+         self.connector_module = DeepGenConnector(
+             connector_config=conn_cfg,
+             num_queries=connector_cfg["num_queries"],
+             llm_hidden_size=connector_cfg["llm_hidden_size"],
+             projector_1_in=connector_cfg["projector_1_in"],
+             projector_1_out=connector_cfg["projector_1_out"],
+             projector_2_in=connector_cfg["projector_2_in"],
+             projector_2_out=connector_cfg["projector_2_out"],
+             projector_3_in=connector_cfg["projector_3_in"],
+             projector_3_out=connector_cfg["projector_3_out"],
+         )
+         connector_state = load_file(os.path.join(connector_dir, "model.safetensors"))
+         self.connector_module.load_state_dict(connector_state, strict=True)
+         self.connector_module = self.connector_module.to(dtype=dtype)
+
+         prompt_template_path = os.path.join(path, "prompt_template.json")
+         with open(prompt_template_path, "r") as f:
+             self.prompt_template = json.load(f)
+
+         self.max_length = connector_cfg.get("max_length", 1024)
+         self.image_token_id = self.tokenizer.convert_tokens_to_ids(
+             self.prompt_template['IMG_CONTEXT_TOKEN'])
+
+         if not self._cpu_offload:
+             device = self._gpu_device
+             self.lmm = self.lmm.to(device=device)
+             self.connector_module = self.connector_module.to(device=device, dtype=dtype)
+
+         self.vit_mean = self.vit_mean.to(device=self._gpu_device)
+         self.vit_std = self.vit_std.to(device=self._gpu_device)
+
+         self._extras_loaded = True
+         print("All components loaded.")
+
+
1063
+ @property
1064
+ def llm(self):
1065
+ return self.lmm.language_model
1066
+
1067
+ @property
1068
+ def num_queries(self):
1069
+ return self.connector_module.num_queries
1070
+
1071
+ def to(self, *args, **kwargs):
1072
+ result = super().to(*args, **kwargs)
1073
+ device = None
1074
+ dtype = None
1075
+ for a in args:
1076
+ if isinstance(a, torch.device):
1077
+ device = a
1078
+ elif isinstance(a, str):
1079
+ device = torch.device(a)
1080
+ elif isinstance(a, torch.dtype):
1081
+ dtype = a
1082
+ device = device or kwargs.get("device")
1083
+ dtype = dtype or kwargs.get("dtype")
1084
+
1085
+ if device is not None:
1086
+ self._gpu_device = device
1087
+ if self._extras_loaded:
1088
+ if device is not None:
1089
+ self.lmm = self.lmm.to(device=device)
1090
+ self.connector_module = self.connector_module.to(device=device)
1091
+ self.vit_mean = self.vit_mean.to(device=device)
1092
+ self.vit_std = self.vit_std.to(device=device)
1093
+ if dtype is not None:
1094
+ self.lmm = self.lmm.to(dtype=dtype)
1095
+ self.connector_module = self.connector_module.to(dtype=dtype)
1096
+ return result
1097
+
1098
+ def enable_model_cpu_offload(self, gpu_id=None, device=None):
1099
+ """Enable sequential CPU offload to reduce GPU memory usage (~14GB)."""
1100
+ self._cpu_offload = True
1101
+ if device is not None:
1102
+ self._gpu_device = torch.device(device) if isinstance(device, str) else device
1103
+ elif gpu_id is not None:
1104
+ self._gpu_device = torch.device(f"cuda:{gpu_id}")
1105
+ self.transformer = self.transformer.to("cpu")
1106
+ self.vae = self.vae.to("cpu")
1107
+ if self._extras_loaded:
1108
+ self.lmm = self.lmm.to("cpu")
1109
+ self.connector_module = self.connector_module.to("cpu")
1110
+ self.vit_mean = self.vit_mean.to(self._gpu_device)
1111
+ self.vit_std = self.vit_std.to(self._gpu_device)
1112
+ torch.cuda.empty_cache()
1113
+
1114
+ def _offload_to(self, module, device):
1115
+ module.to(device)
1116
+ if device == torch.device("cpu") or device == "cpu":
1117
+ torch.cuda.empty_cache()
1118
+
1119
+     @classmethod
+     def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
+         """
+         Load the full pipeline. When called directly (not via DiffusionPipeline),
+         loads all components immediately including VLM and connector.
+         """
+         vlm_model_path = kwargs.pop("vlm_model_path", None)
+         attn_implementation = kwargs.pop("attn_implementation", "flash_attention_2")
+
+         pipe = super().from_pretrained(pretrained_model_name_or_path, **kwargs)
+
+         pipe._load_extras(vlm_model_path=vlm_model_path,
+                           attn_implementation=attn_implementation)
+         return pipe
+
+     @torch.no_grad()
+     def pixels_to_latents(self, x):
+         z = self.vae.encode(x).latent_dist.sample()
+         z = (z - self.vae.config.shift_factor) * self.vae.config.scaling_factor
+         return z
+
+     @torch.no_grad()
+     def latents_to_pixels(self, z):
+         z = (z / self.vae.config.scaling_factor) + self.vae.config.shift_factor
+         x_rec = self.vae.decode(z).sample
+         return x_rec
+
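+ `pixels_to_latents` and `latents_to_pixels` wrap the VAE with the affine latent normalization `z = (z_raw - shift) * scale` and its exact inverse. A sketch of just that affine map and its round trip; the constants below are illustrative placeholders, the real values come from `vae.config`:

```python
# Affine latent normalization and its inverse (values are illustrative;
# the pipeline reads scaling_factor and shift_factor from vae.config).
SCALING_FACTOR = 1.5305
SHIFT_FACTOR = 0.0609

def normalize(z_raw):
    return (z_raw - SHIFT_FACTOR) * SCALING_FACTOR

def denormalize(z):
    return z / SCALING_FACTOR + SHIFT_FACTOR
```

Because the two maps are exact inverses, encoding then decoding the latent scaling is lossless up to floating-point error.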
+     def prepare_text2image_prompts(self, texts):
+         texts = [self.prompt_template['GENERATION'].format(input=text) for text in texts]
+         texts = [self.prompt_template['INSTRUCTION'].format(input=text) for text in texts]
+         return self.tokenizer(
+             texts, add_special_tokens=True, return_tensors='pt',
+             padding=True, padding_side='left').to(self._gpu_device)
+
+     def prepare_image2image_prompts(self, texts, num_refs, ref_lens):
+         prompts = []
+         cnt = 0
+         for text, num_ref in zip(texts, num_refs):
+             image_tokens = ''
+             for _ in range(num_ref):
+                 image_tokens += (self.prompt_template['IMG_START_TOKEN'] +
+                                  self.prompt_template['IMG_CONTEXT_TOKEN'] * ref_lens[cnt] +
+                                  self.prompt_template['IMG_END_TOKEN'])
+                 cnt += 1
+             prompts.append(self.prompt_template['INSTRUCTION'].format(
+                 input=f'{image_tokens}\n{text}'))
+         return self.tokenizer(
+             prompts, add_special_tokens=True, return_tensors='pt',
+             padding=True, padding_side='left').to(self._gpu_device)
+
+ def prepare_forward_input(self, query_embeds, input_ids=None,
1170
+ image_embeds=None, image_grid_thw=None,
1171
+ attention_mask=None, past_key_values=None):
1172
+ b, l, _ = query_embeds.shape
1173
+ attention_mask = attention_mask.to(device=self._gpu_device, dtype=torch.bool)
1174
+ input_ids = torch.cat([input_ids, input_ids.new_zeros(b, l)], dim=1)
1175
+ attention_mask = torch.cat([attention_mask, attention_mask.new_ones(b, l)], dim=1)
1176
+
1177
+ position_ids, _ = self.lmm.model.get_rope_index(
1178
+ input_ids=input_ids, image_grid_thw=image_grid_thw,
1179
+ video_grid_thw=None, second_per_grid_ts=None,
1180
+ attention_mask=attention_mask)
1181
+
1182
+ if past_key_values is not None:
1183
+ inputs_embeds = query_embeds
1184
+ position_ids = position_ids[..., -l:]
1185
+ else:
1186
+ input_ids = input_ids[:, :-l]
1187
+ if image_embeds is None:
1188
+ inputs_embeds = self.llm.get_input_embeddings()(input_ids)
1189
+ else:
1190
+ inputs_embeds = torch.zeros(
1191
+ *input_ids.shape, self.llm.config.hidden_size,
1192
+ device=self._gpu_device, dtype=self.transformer.dtype)
1193
+ inputs_embeds[input_ids == self.image_token_id] = \
1194
+ image_embeds.contiguous().view(-1, self.llm.config.hidden_size)
1195
+ inputs_embeds[input_ids != self.image_token_id] = \
1196
+ self.llm.get_input_embeddings()(input_ids[input_ids != self.image_token_id])
1197
+ inputs_embeds = torch.cat([inputs_embeds, query_embeds], dim=1)
1198
+
1199
+ return dict(inputs_embeds=inputs_embeds, attention_mask=attention_mask,
1200
+ position_ids=position_ids, past_key_values=past_key_values)
1201
+
1202
+ @torch.no_grad()
1203
+ def get_semantic_features(self, pixel_values, resize=True):
1204
+ pixel_values = (pixel_values + 1.0) / 2
1205
+ pixel_values = pixel_values - self.vit_mean.view(1, 3, 1, 1)
1206
+ pixel_values = pixel_values / self.vit_std.view(1, 3, 1, 1)
1207
+
1208
+ if resize:
1209
+ pixel_values = F.interpolate(pixel_values, size=(448, 448), mode='bilinear')
1210
+ b, c, h, w = pixel_values.shape
1211
+
1212
+ patch_size = self.lmm.config.vision_config.patch_size
1213
+ spatial_merge_size = self.lmm.config.vision_config.spatial_merge_size
1214
+ temporal_patch_size = self.lmm.config.vision_config.temporal_patch_size
1215
+
1216
+ pixel_values = pixel_values[:, None].expand(b, temporal_patch_size, c, h, w)
1217
+ grid_t = 1
1218
+ grid_h, grid_w = h // patch_size, w // patch_size
1219
+
1220
+ pixel_values = pixel_values.view(
1221
+ b, grid_t, temporal_patch_size, c,
1222
+ grid_h // spatial_merge_size, spatial_merge_size, patch_size,
1223
+ grid_w // spatial_merge_size, spatial_merge_size, patch_size)
1224
+ pixel_values = rearrange(
1225
+ pixel_values, 'b t tp c h m p w n q -> (b t h w m n) (c tp p q)')
1226
+
1227
+ image_grid_thw = torch.tensor(
1228
+ [(grid_t, grid_h, grid_w)] * b).to(self._gpu_device).long()
1229
+ image_embeds = self.lmm.visual(pixel_values, grid_thw=image_grid_thw)
1230
+ image_embeds = rearrange(image_embeds, '(b l) d -> b l d', b=b)
1231
+ return image_embeds, image_grid_thw
1232
+
1233
+ @torch.no_grad()
1234
+ def get_semantic_features_dynamic(self, pixel_values):
1235
+ def multi_apply(func, *args, **kwargs):
1236
+ pfunc = partial(func, **kwargs) if kwargs else func
1237
+ map_results = map(pfunc, *args)
1238
+ return tuple(map(list, zip(*map_results)))
1239
+
1240
+ pixel_values = [F.interpolate(p[None], scale_factor=28/32, mode='bilinear')
1241
+ for p in pixel_values]
1242
+ image_embeds, image_grid_thw = multi_apply(
1243
+ self.get_semantic_features, pixel_values, resize=False)
1244
+ image_embeds = [x[0] for x in image_embeds]
1245
+ image_grid_thw = torch.cat(image_grid_thw, dim=0)
1246
+ return image_embeds, image_grid_thw
1247
+
1248
+ @torch.no_grad()
1249
+ def __call__(
1250
+ self,
1251
+ prompt: Union[str, List[str]],
1252
+ image: Optional[Union[Image.Image, List[Image.Image]]] = None,
1253
+ negative_prompt: str = "",
1254
+ height: int = 512,
1255
+ width: int = 512,
1256
+ num_inference_steps: int = 50,
1257
+ guidance_scale: float = 4.0,
1258
+ seed: Optional[int] = None,
1259
+ num_images_per_prompt: int = 1,
1260
+ ):
1261
+ """
1262
+ Generate or edit images.
1263
+
1264
+ Args:
1265
+ prompt: Text prompt for generation/editing.
1266
+ image: Optional input image(s) for editing. If None, does text-to-image.
1267
+ negative_prompt: Negative prompt for CFG.
1268
+ height: Output image height.
1269
+ width: Output image width.
1270
+ num_inference_steps: Number of denoising steps.
1271
+ guidance_scale: CFG guidance scale.
1272
+ seed: Random seed for reproducibility.
1273
+ num_images_per_prompt: Number of images to generate per prompt.
1274
+
1275
+ Returns:
1276
+ SimpleNamespace with .images attribute (list of PIL Images).
1277
+ """
1278
+ from types import SimpleNamespace
1279
+ self._load_extras()
1280
+
1281
+ offload = self._cpu_offload
1282
+ gpu = self._gpu_device
1283
+
1284
+ if isinstance(prompt, str):
1285
+ prompt = [prompt]
1286
+ b = len(prompt) * num_images_per_prompt
1287
+ prompt = prompt * num_images_per_prompt
1288
+ cfg_prompt = [negative_prompt] * b
1289
+
1290
+ generator = None
1291
+ if seed is not None:
1292
+ generator = torch.Generator(device=gpu).manual_seed(seed)
1293
+
1294
+ # === Stage 1: VLM + Connector ===
1295
+ if offload:
1296
+ self._offload_to(self.lmm, gpu)
1297
+ self._offload_to(self.connector_module, gpu)
1298
+
1299
+ pixel_values_src = None
1300
+ cond_latents = None
1301
+ if image is not None:
1302
+ if isinstance(image, Image.Image):
1303
+ image = [image]
1304
+ ref_images = []
1305
+ for img in image:
1306
+ img = img.convert('RGB').resize((width, height))
1307
+ pv = torch.from_numpy(np.array(img)).float() / 255.0
1308
+ pv = 2 * pv - 1
1309
+ pv = rearrange(pv, 'h w c -> c h w')
1310
+ ref_images.append(pv.to(dtype=self.transformer.dtype, device=gpu))
1311
+
1312
+ pixel_values_src = [[img for img in ref_images]] * b
1313
+ num_refs = [len(ref_images)] * b
1314
+ image_embeds, image_grid_thw = self.get_semantic_features_dynamic(
1315
+ [img for ref_imgs in pixel_values_src for img in ref_imgs])
1316
+ ref_lens = [len(x) for x in image_embeds]
1317
+
1318
+ text_inputs = self.prepare_image2image_prompts(
1319
+ prompt + cfg_prompt, num_refs=num_refs * 2, ref_lens=ref_lens * 2)
1320
+ text_inputs.update(
1321
+ image_embeds=torch.cat(image_embeds * 2),
1322
+ image_grid_thw=torch.cat([image_grid_thw] * 2))
1323
+
1324
+ if offload:
1325
+ self._offload_to(self.vae, gpu)
1326
+ cond_latents = [[self.pixels_to_latents(img[None])[0] for img in ref_imgs]
1327
+ for ref_imgs in pixel_values_src]
1328
+ cond_latents = cond_latents * 2
1329
+ if offload:
1330
+ self._offload_to(self.vae, "cpu")
1331
+ else:
1332
+ text_inputs = self.prepare_text2image_prompts(prompt + cfg_prompt)
1333
+
1334
+ hidden_states = self.connector_module.meta_queries[None].expand(
1335
+ 2 * b, self.num_queries, -1)
1336
+ inputs = self.prepare_forward_input(query_embeds=hidden_states, **text_inputs)
1337
+ output = self.llm(**inputs, return_dict=True, output_hidden_states=True)
1338
+
1339
+ # SCB: extract multi-layer hidden states
1340
+ hidden_states = output.hidden_states
1341
+ num_layers = len(hidden_states) - 1
1342
+ selected_layers = list(range(num_layers - 1, 0, -6))
1343
+ selected_hiddens = [hidden_states[i] for i in selected_layers]
1344
+ merged_hidden = torch.cat(selected_hiddens, dim=-1)
1345
+ pooled_out, seq_out = self.connector_module.llm2dit(merged_hidden)
1346
+
1347
+ if offload:
1348
+ del output, hidden_states, selected_hiddens, merged_hidden
1349
+ self._offload_to(self.lmm, "cpu")
1350
+ self._offload_to(self.connector_module, "cpu")
1351
+
1352
+ # === Stage 2: DiT denoising ===
1353
+ if offload:
1354
+ self._offload_to(self.transformer, gpu)
1355
+
1356
+ pipeline = _SD3Pipeline(
1357
+ transformer=self.transformer, scheduler=self.scheduler,
1358
+ vae=self.vae, text_encoder=None, tokenizer=None,
1359
+ text_encoder_2=None, tokenizer_2=None,
1360
+ text_encoder_3=None, tokenizer_3=None)
1361
+
1362
+ samples = pipeline(
1363
+ height=height, width=width,
1364
+ guidance_scale=guidance_scale,
1365
+ num_inference_steps=num_inference_steps,
1366
+ prompt_embeds=seq_out[:b],
1367
+ pooled_prompt_embeds=pooled_out[:b],
1368
+ negative_prompt_embeds=seq_out[b:],
1369
+ negative_pooled_prompt_embeds=pooled_out[b:],
1370
+ generator=generator,
1371
+ output_type='latent',
1372
+ cond_latents=cond_latents,
1373
+ ).images.to(self.transformer.dtype)
1374
+
1375
+ if offload:
1376
+ self._offload_to(self.transformer, "cpu")
1377
+
1378
+ # === Stage 3: VAE decode ===
1379
+ if offload:
1380
+ self._offload_to(self.vae, gpu)
1381
+
1382
+ pixels = self.latents_to_pixels(samples)
1383
+
1384
+ if offload:
1385
+ self._offload_to(self.vae, "cpu")
1386
+
1387
+ images = []
1388
+ for i in range(pixels.shape[0]):
1389
+ img = pixels[i]
1390
+ img = rearrange(img, 'c h w -> h w c')
1391
+ img = torch.clamp(127.5 * img + 128.0, 0, 255).to("cpu", dtype=torch.uint8).numpy()
1392
+ images.append(Image.fromarray(img))
1393
+
1394
+ return SimpleNamespace(images=images)
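The "SCB" step above concatenates hidden states taken every sixth layer, counting down from the last decoder layer. A minimal sketch of that index arithmetic, assuming the 36-layer Qwen2.5-VL-3B backbone (so `output.hidden_states` has 37 entries, including the embedding layer):

```python
# Sketch of the SCB layer selection (assumption: 36 decoder layers,
# i.e. len(output.hidden_states) == 37 including the embedding output).
num_hidden_states = 37
num_layers = num_hidden_states - 1          # 36 decoder layers
# Step backwards from the last layer in strides of 6, stopping before
# the embedding output at index 0.
selected_layers = list(range(num_layers - 1, 0, -6))
print(selected_layers)  # [35, 29, 23, 17, 11, 5]
```

With six selected layers of hidden size 2048, `merged_hidden` has a last dimension of 6 × 2048 = 12288 before the connector projects it into the DiT's embedding spaces.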
model_index.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_class_name": ["deepgen_pipeline", "DeepGenPipeline"],
+   "_diffusers_version": "0.35.2",
+   "transformer": ["diffusers", "SD3Transformer2DModel"],
+   "vae": ["diffusers", "AutoencoderKL"],
+   "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"]
+ }
prompt_template.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "IMG_START_TOKEN": "<|vision_start|>",
+   "IMG_END_TOKEN": "<|vision_end|>",
+   "IMG_CONTEXT_TOKEN": "<|image_pad|>",
+   "IMG_START_TOKEN_FOR_GENERATION": false,
+   "SYSTEM": "<|im_start|>system\n{system}<|im_end|>\n",
+   "INSTRUCTION": "<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n",
+   "SUFFIX": "<|im_end|>",
+   "SUFFIX_AS_EOS": true,
+   "SEP": "\n",
+   "STOP_WORDS": [
+     "<|im_end|>",
+     "<|endoftext|>"
+   ],
+   "GENERATION": "Generate an image: {input}",
+   "CFG": "Generate an image."
+ }
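For text-to-image, the pipeline's `prepare_text2image_prompts` nests these two templates: the user text is first wrapped by `GENERATION`, then the result by `INSTRUCTION`. A quick sketch of the final string fed to the tokenizer (template values copied from the file above; the prompt text is a made-up example):

```python
# Templates from prompt_template.json above.
GENERATION = "Generate an image: {input}"
INSTRUCTION = "<|im_start|>user\n{input}<|im_end|>\n<|im_start|>assistant\n"

text = GENERATION.format(input="a red bicycle")   # hypothetical user prompt
prompt = INSTRUCTION.format(input=text)
print(prompt)
```

The generation therefore starts from an open `assistant` turn, with `<|im_end|>` serving as EOS per `SUFFIX_AS_EOS`.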
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "_class_name": "FlowMatchEulerDiscreteScheduler",
+   "_diffusers_version": "0.35.2",
+   "base_image_seq_len": 256,
+   "base_shift": 0.5,
+   "invert_sigmas": false,
+   "max_image_seq_len": 4096,
+   "max_shift": 1.15,
+   "num_train_timesteps": 1000,
+   "shift": 3.0,
+   "shift_terminal": null,
+   "stochastic_sampling": false,
+   "time_shift_type": "exponential",
+   "use_beta_sigmas": false,
+   "use_dynamic_shifting": false,
+   "use_exponential_sigmas": false,
+   "use_karras_sigmas": false
+ }
tokenizer/added_tokens.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "</tool_call>": 151658,
+   "<tool_call>": 151657,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
tokenizer/chat_template.jinja ADDED
@@ -0,0 +1,7 @@
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ {% endif %}<|im_start|>{{ message['role'] }}
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
+ {% endif %}
tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
+ size 11421896
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,208 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "right",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
transformer/config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_class_name": "SD3Transformer2DModel",
+   "_diffusers_version": "0.35.2",
+   "_name_or_path": "model_zoo/UniPic2-SD3.5M-Kontext-2B",
+   "attention_head_dim": 64,
+   "caption_projection_dim": 1536,
+   "dual_attention_layers": [
+     0,
+     1,
+     2,
+     3,
+     4,
+     5,
+     6,
+     7,
+     8,
+     9,
+     10,
+     11,
+     12
+   ],
+   "in_channels": 16,
+   "joint_attention_dim": 4096,
+   "num_attention_heads": 24,
+   "num_layers": 24,
+   "out_channels": 16,
+   "patch_size": 2,
+   "pooled_projection_dim": 2048,
+   "pos_embed_max_size": 384,
+   "qk_norm": "rms_norm",
+   "sample_size": 128
+ }
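With `patch_size: 2` over the 16-channel latent grid, the DiT's sequence length depends only on the output resolution. A small sketch of the token count for the pipeline's default 512×512 output, assuming the standard 8× spatial downsampling of the SD-family VAE:

```python
# Sequence-length arithmetic for the SD3 DiT above.
vae_downsample = 8   # assumption: standard SD-family VAE spatial stride
patch_size = 2       # from the transformer config above
max_image_seq_len = 4096  # from the scheduler config above

h = w = 512
lat_h, lat_w = h // vae_downsample, w // vae_downsample      # 64 x 64 latents
tokens = (lat_h // patch_size) * (lat_w // patch_size)       # 32 * 32 patches
print(tokens)  # 1024
assert tokens <= max_image_seq_len
```

So a 512×512 generation uses 1024 image tokens, comfortably inside the scheduler's `max_image_seq_len` of 4096 (which corresponds to 1024×1024 output under the same assumptions).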
transformer/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:54078ee51ab477275e94b610249ea912d6f80d24256c4197fe48945eef706883
+ size 4939433672
vae/config.json ADDED
@@ -0,0 +1,38 @@
+ {
+   "_class_name": "AutoencoderKL",
+   "_diffusers_version": "0.35.2",
+   "_name_or_path": "model_zoo/UniPic2-SD3.5M-Kontext-2B",
+   "act_fn": "silu",
+   "block_out_channels": [
+     128,
+     256,
+     512,
+     512
+   ],
+   "down_block_types": [
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D"
+   ],
+   "force_upcast": true,
+   "in_channels": 3,
+   "latent_channels": 16,
+   "latents_mean": null,
+   "latents_std": null,
+   "layers_per_block": 2,
+   "mid_block_add_attention": true,
+   "norm_num_groups": 32,
+   "out_channels": 3,
+   "sample_size": 1024,
+   "scaling_factor": 1.5305,
+   "shift_factor": 0.0609,
+   "up_block_types": [
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D"
+   ],
+   "use_post_quant_conv": false,
+   "use_quant_conv": false
+ }
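The `scaling_factor` and `shift_factor` here are the constants used by the pipeline's `pixels_to_latents` / `latents_to_pixels` pair: encoding subtracts the shift and multiplies by the scale, and decoding inverts both. A scalar sketch of that round trip:

```python
# Latent (de)normalization round trip, constants from the VAE config above.
scaling_factor, shift_factor = 1.5305, 0.0609

def to_model_space(z):
    # mirrors pixels_to_latents: (z - shift) * scale
    return (z - shift_factor) * scaling_factor

def to_vae_space(z):
    # mirrors latents_to_pixels: z / scale + shift
    return z / scaling_factor + shift_factor

z = 0.25
assert abs(to_vae_space(to_model_space(z)) - z) < 1e-12
```

The same inverse pair is applied tensor-wise in the pipeline, so latents handed to the DiT are roughly unit-scale.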
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a170129752b4ad4ab98996d2e59097c8c693afc31a5964bd2ad60896686c1e2
+ size 167666902
vlm/config.json ADDED
@@ -0,0 +1,143 @@
+ {
+   "architectures": [
+     "Qwen2_5_VLForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "dtype": "bfloat16",
+   "eos_token_id": 151645,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "image_token_id": 151655,
+   "initializer_range": 0.02,
+   "intermediate_size": 11008,
+   "max_position_embeddings": 128000,
+   "max_window_layers": 70,
+   "model_type": "qwen2_5_vl",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 36,
+   "num_key_value_heads": 2,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": {
+     "mrope_section": [
+       16,
+       24,
+       24
+     ],
+     "rope_type": "default",
+     "type": "default"
+   },
+   "rope_theta": 1000000.0,
+   "sliding_window": 32768,
+   "text_config": {
+     "architectures": [
+       "Qwen2_5_VLForConditionalGeneration"
+     ],
+     "attention_dropout": 0.0,
+     "bos_token_id": 151643,
+     "dtype": "bfloat16",
+     "eos_token_id": 151645,
+     "hidden_act": "silu",
+     "hidden_size": 2048,
+     "image_token_id": null,
+     "initializer_range": 0.02,
+     "intermediate_size": 11008,
+     "layer_types": [
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention",
+       "full_attention"
+     ],
+     "max_position_embeddings": 128000,
+     "max_window_layers": 70,
+     "model_type": "qwen2_5_vl_text",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 36,
+     "num_key_value_heads": 2,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": {
+       "mrope_section": [
+         16,
+         24,
+         24
+       ],
+       "rope_type": "default",
+       "type": "default"
+     },
+     "rope_theta": 1000000.0,
+     "sliding_window": null,
+     "tie_word_embeddings": true,
+     "use_cache": true,
+     "use_sliding_window": false,
+     "video_token_id": null,
+     "vision_end_token_id": 151653,
+     "vision_start_token_id": 151652,
+     "vision_token_id": 151654,
+     "vocab_size": 151936
+   },
+   "transformers_version": "4.56.1",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "video_token_id": 151656,
+   "vision_config": {
+     "depth": 32,
+     "dtype": "bfloat16",
+     "fullatt_block_indexes": [
+       7,
+       15,
+       23,
+       31
+     ],
+     "hidden_act": "silu",
+     "hidden_size": 1280,
+     "in_channels": 3,
+     "in_chans": 3,
+     "initializer_range": 0.02,
+     "intermediate_size": 3420,
+     "model_type": "qwen2_5_vl",
+     "num_heads": 16,
+     "out_hidden_size": 2048,
+     "patch_size": 14,
+     "spatial_merge_size": 2,
+     "spatial_patch_size": 14,
+     "temporal_patch_size": 2,
+     "tokens_per_second": 2,
+     "window_size": 112
+   },
+   "vision_end_token_id": 151653,
+   "vision_start_token_id": 151652,
+   "vision_token_id": 151654,
+   "vocab_size": 151936
+ }
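The vision config above determines how many `<|image_pad|>` placeholders each reference image expands to in `prepare_image2image_prompts`. A sketch for the fixed 448×448 path of `get_semantic_features` (the dynamic path rescales by 28/32 first, so its counts differ):

```python
# Vision-token arithmetic for the ViT above (patch_size 14, 2x2 spatial merge).
patch_size = 14
spatial_merge_size = 2

h = w = 448                                   # fixed-resize path
grid_h = grid_w = h // patch_size             # 32 x 32 raw patches
tokens = (grid_h // spatial_merge_size) * (grid_w // spatial_merge_size)
print(tokens)  # 256 merged tokens, i.e. 256 <|image_pad|> placeholders
```

Each merged token carries a 2048-dim embedding (`out_hidden_size`), matching the text model's `hidden_size`, which is what lets the pipeline scatter them directly into the input embedding sequence.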
vlm/generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "repetition_penalty": 1.05,
+   "temperature": 1e-06,
+   "transformers_version": "4.56.1"
+ }
vlm/model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d6f11cc9c30184ce92605f25ac1e6b6644ee217e7b86a28282b3c9b17de3e609
+ size 4997750760
vlm/model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c675c19a231b0ead141d8d3a05fe930892ddf17f1324d75bb94d2a42a79f7ebc
+ size 2511587184
vlm/model.safetensors.index.json ADDED
@@ -0,0 +1,832 @@
+ {
+   "metadata": {
+     "total_parameters": 3754622976,
+     "total_size": 7509245952
+   },
+   "weight_map": {
+     "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
82
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
83
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
84
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
85
+ "model.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
86
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
87
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
88
+ "model.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
89
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
90
+ "model.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
91
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
92
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
93
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
94
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
95
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
96
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
97
+ "model.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
98
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
99
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
100
+ "model.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
101
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
102
+ "model.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
103
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
104
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
105
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
106
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
107
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
108
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
109
+ "model.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
110
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
111
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
112
+ "model.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
113
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
114
+ "model.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
115
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
116
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
120
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
121
+ "model.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
122
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
124
+ "model.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
125
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
126
+ "model.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
127
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
128
+ "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
129
+ "model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
130
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
131
+ "model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
132
+ "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
133
+ "model.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
134
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
135
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
136
+ "model.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
137
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
138
+ "model.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
139
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
140
+ "model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
141
+ "model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
142
+ "model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
143
+ "model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
144
+ "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
145
+ "model.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
146
+ "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
147
+ "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
148
+ "model.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
149
+ "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
150
+ "model.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
151
+ "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
152
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
153
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
154
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
155
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
156
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
157
+ "model.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
158
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
159
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
160
+ "model.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
161
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
162
+ "model.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
163
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
164
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
165
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
166
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
167
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
168
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
169
+ "model.layers.20.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
170
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
171
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
172
+ "model.layers.20.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
173
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
174
+ "model.layers.20.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
175
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
176
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
177
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
178
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
179
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
180
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
181
+ "model.layers.21.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
182
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
183
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
184
+ "model.layers.21.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
185
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
186
+ "model.layers.21.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
187
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
188
+ "model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
189
+ "model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
190
+ "model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
191
+ "model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
192
+ "model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
193
+ "model.layers.22.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
194
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
195
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
196
+ "model.layers.22.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
197
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
198
+ "model.layers.22.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
199
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
200
+ "model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
201
+ "model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
202
+ "model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
203
+ "model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
204
+ "model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
205
+ "model.layers.23.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
206
+ "model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
207
+ "model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
208
+ "model.layers.23.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
209
+ "model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
210
+ "model.layers.23.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
211
+ "model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
212
+ "model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
213
+ "model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
214
+ "model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
215
+ "model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
216
+ "model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
217
+ "model.layers.24.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
218
+ "model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
219
+ "model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
220
+ "model.layers.24.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
221
+ "model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
222
+ "model.layers.24.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
223
+ "model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
224
+ "model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
225
+ "model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
226
+ "model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
227
+ "model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
228
+ "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
229
+ "model.layers.25.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
230
+ "model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
231
+ "model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
232
+ "model.layers.25.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
233
+ "model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
234
+ "model.layers.25.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
235
+ "model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
236
+ "model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
237
+ "model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
238
+ "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
239
+ "model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
240
+ "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
241
+ "model.layers.26.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
242
+ "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
243
+ "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
244
+ "model.layers.26.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
245
+ "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
246
+ "model.layers.26.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
247
+ "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
248
+ "model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
249
+ "model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
250
+ "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
251
+ "model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
252
+ "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
253
+ "model.layers.27.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
254
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
255
+ "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
256
+ "model.layers.27.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
257
+ "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
258
+ "model.layers.27.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
259
+ "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
260
+ "model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
261
+ "model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
262
+ "model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
263
+ "model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
264
+ "model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
265
+ "model.layers.28.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
266
+ "model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
267
+ "model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
268
+ "model.layers.28.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
269
+ "model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
270
+ "model.layers.28.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
271
+ "model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
272
+ "model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
273
+ "model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
274
+ "model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
275
+ "model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
276
+ "model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
277
+ "model.layers.29.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
278
+ "model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
279
+ "model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
280
+ "model.layers.29.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
281
+ "model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
282
+ "model.layers.29.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
283
+ "model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
284
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
285
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
286
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
287
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
288
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
289
+ "model.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
290
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
291
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
292
+ "model.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
293
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
294
+ "model.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
295
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
296
+ "model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
297
+ "model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
298
+ "model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
299
+ "model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
300
+ "model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
301
+ "model.layers.30.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
302
+ "model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
303
+ "model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
304
+ "model.layers.30.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
305
+ "model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
306
+ "model.layers.30.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
307
+ "model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
308
+ "model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
309
+ "model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
310
+ "model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
311
+ "model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
312
+ "model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
313
+ "model.layers.31.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
314
+ "model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
315
+ "model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
316
+ "model.layers.31.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
317
+ "model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
318
+ "model.layers.31.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
319
+ "model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
320
+ "model.layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
321
+ "model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
322
+ "model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
323
+ "model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
324
+ "model.layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
325
+ "model.layers.32.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
326
+ "model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
327
+ "model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
328
+ "model.layers.32.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
329
+ "model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
330
+ "model.layers.32.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
331
+ "model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
332
+ "model.layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
333
+ "model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
334
+ "model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
335
+ "model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
336
+ "model.layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
337
+ "model.layers.33.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
338
+ "model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
339
+ "model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
340
+ "model.layers.33.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
341
+ "model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
342
+ "model.layers.33.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
343
+ "model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
344
+ "model.layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
345
+ "model.layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
346
+ "model.layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
347
+ "model.layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
348
+ "model.layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
349
+ "model.layers.34.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
350
+ "model.layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
351
+ "model.layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
352
+ "model.layers.34.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
353
+ "model.layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
354
+ "model.layers.34.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
355
+ "model.layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
356
+ "model.layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
357
+ "model.layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
358
+ "model.layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
359
+ "model.layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
360
+ "model.layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
361
+ "model.layers.35.self_attn.k_proj.bias": "model-00002-of-00002.safetensors",
362
+ "model.layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
363
+ "model.layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
364
+ "model.layers.35.self_attn.q_proj.bias": "model-00002-of-00002.safetensors",
365
+ "model.layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
366
+ "model.layers.35.self_attn.v_proj.bias": "model-00002-of-00002.safetensors",
367
+ "model.layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
368
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
369
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
370
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
371
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
372
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
373
+ "model.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
374
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
375
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
376
+ "model.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
377
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
378
+ "model.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
379
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
380
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
381
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
382
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
383
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
384
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
385
+ "model.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
386
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
387
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
388
+ "model.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
389
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
390
+ "model.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
391
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
392
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
393
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
394
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
395
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
396
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
397
+ "model.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
398
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
399
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
400
+ "model.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
401
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
402
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
403
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
404
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
405
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
406
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
407
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
408
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
409
+ "model.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
410
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
411
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
412
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
413
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
+ "model.norm.weight": "model-00002-of-00002.safetensors",
+ "visual.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.24.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.25.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.26.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.27.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.28.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.29.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.30.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.31.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.mlp.down_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.mlp.gate_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.mlp.up_proj.bias": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
+ "visual.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
+ "visual.merger.ln_q.weight": "model-00001-of-00002.safetensors",
+ "visual.merger.mlp.0.bias": "model-00001-of-00002.safetensors",
+ "visual.merger.mlp.0.weight": "model-00001-of-00002.safetensors",
+ "visual.merger.mlp.2.bias": "model-00001-of-00002.safetensors",
+ "visual.merger.mlp.2.weight": "model-00001-of-00002.safetensors",
+ "visual.patch_embed.proj.weight": "model-00001-of-00002.safetensors"
+ }
+ }