Safetensors
javisgpt
kkail8 committed
Commit feeac5e · verified · 1 Parent(s): 5265378

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +178 -3
  2. adapter_config.json +196 -0
  3. adapter_model.safetensors +3 -0
  4. config.json +88 -0
  5. mm_proj_all.bin +3 -0
README.md CHANGED
---
license: apache-2.0
arxiv: 2512.22905
---

## <div align="center"> JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation</div>

<div align="center">

[[`HomePage`](https://javisverse.github.io/JavisGPT-page/)]
[[`Paper`](https://arxiv.org/abs/2512.22905)]
[[`GitHub`](https://github.com/JavisVerse/JavisGPT)]
[[`Model`](https://huggingface.co/collections/JavisVerse/javisgpt)]
[[`Dataset`](https://huggingface.co/collections/JavisVerse/javisgpt)]

</div>

## TL;DR

We introduce **`JavisGPT`**, a multimodal LLM that understands audio-visual inputs and simultaneously generates synchronized sounding videos within a single unified model.
We also curate the **`JavisInst-Omni`** dataset to facilitate instruction tuning for comprehension and generation of sounding videos.

## 📰 News

- **[2026.2.26]** 🔥🔥 We release the upgraded [JavisGPT-v1.0-7B-Instruct](https://huggingface.co/JavisVerse/JavisGPT-v1.0-7B-Instruct) checkpoint on Hugging Face, powered by [JavisDiT-v1.0-jav](https://huggingface.co/JavisVerse/JavisDiT-v1.0-jav) to achieve better audio-video generation.
- **[2025.12.30]** 🚀 We release the [JavisInst-Omni](https://huggingface.co/datasets/JavisVerse/JavisInst-Omni) training dataset to support multimodal instruction tuning on sounding-video comprehension and generation tasks, as well as the [MM-PreTrain](https://huggingface.co/datasets/JavisVerse/MM-PreTrain) and [AV-FineTune](https://huggingface.co/datasets/JavisVerse/AV-FineTune) datasets to enable preliminary multimodal alignment for LLMs.
- **[2025.12.26]** 🔥 We release the code of [JavisGPT](https://arxiv.org/abs/2512.22905), along with the preview [JavisGPT-v0.1-7B-Instruct](https://huggingface.co/JavisVerse/JavisGPT-v0.1-7B-Instruct) checkpoint on Hugging Face. Feel free to play with it!
32
+
33
+ ## Code
34
+
35
+
36
+ ### Installation
37
+
38
+ Install the necessary packages:
39
+
40
+ ```bash
41
+ conda create -n javisgpt python=3.10 -y
42
+ conda activate javisgpt
43
+ pip install --upgrade pip # Enable PEP 660 support.
44
+ pip install flash-attn==2.7.4.post1 --no-build-isolation
45
+ pip install -v -e ".[train]"
46
+ cp assets/src/dynamic_modules_utils.py /path/to/python3.10/site-packages/diffusers/utils/
47
+ conda install "ffmpeg<7" -c conda-forge -y # install ffpmeg
48
+ ```
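The `cp` step needs your environment's real site-packages directory in place of the `/path/to/python3.10` placeholder. A small sketch (assuming `python3` is on `PATH`; the actual copy is left commented out because it depends on your checkout and on `diffusers` being installed) resolves it programmatically:

```shell
# Resolve the active interpreter's site-packages directory instead of
# hard-coding /path/to/python3.10 (a sketch; assumes python3 is on PATH).
SITE_PACKAGES=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
echo "copy target: ${SITE_PACKAGES}/diffusers/utils/"
# cp assets/src/dynamic_modules_utils.py "${SITE_PACKAGES}/diffusers/utils/"
```

Run this inside the activated `javisgpt` environment so the resolved path matches the interpreter that imports `diffusers`.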

Install [JavisDiT](https://arxiv.org/abs/2602.19163) dependencies:

```bash
cd ..
git clone https://github.com/JavisVerse/JavisDiT.git
cd JavisDiT
pip install -v -e . --no-deps
cd ../JavisGPT

# make a soft link if necessary
# ln -s ../JavisDiT/javisdit javisdit
```

### Inference

We assume the following data structure:

```bash
/path/to/user/root
|-- projects
|   |-- JavisDiT   # downstream JAV-DiT
|   └-- JavisGPT   # workspace of this project
|-- weights
|   |-- pretrained
|   |   |-- dit    # pretrained weights for JavisDiT
|   |   |   |-- Wan2.1-T2V-1.3B
|   |   |   └-- audioldm2
|   |   |-- mllm   # pretrained weights for JavisGPT
|   |   |   |-- BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt
|   |   |   └-- Qwen2.5-VL-7B-Instruct
|   |-- JavisVerse
|   |   |-- JavisDiT-v1.0-jav
|   |   └-- JavisGPT-v1.0-7B-Instruct
```
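A minimal sketch to bootstrap this layout (`ROOT` is a placeholder for your actual user root; the scripts themselves do not require this helper):

```shell
# Create the directory skeleton assumed by the layout above.
# ROOT is a placeholder; point it at your own user root.
ROOT="/tmp/javis_root"
mkdir -p \
  "${ROOT}/projects" \
  "${ROOT}/weights/pretrained/dit" \
  "${ROOT}/weights/pretrained/mllm" \
  "${ROOT}/weights/JavisVerse"
echo "skeleton created under ${ROOT}"
```

Clone the two repositories under `${ROOT}/projects` and place the downloads from the next step under `${ROOT}/weights`.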

#### 1. Prepare Pretrained Weights

First, download [BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt](https://github.com/microsoft/unilm/tree/master/beats) from [here](https://1drv.ms/u/s!AqeByhGUtINrgcpj8ujXH1YUtxooEg?e=E9Ncea) and [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and put (or link) them into `../../weights/pretrained/mllm`:

```bash
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ../../weights/pretrained/mllm/Qwen2.5-VL-7B-Instruct
```

Then, download our [JavisGPT-v1.0-7B-Instruct](https://huggingface.co/JavisVerse/JavisGPT-v1.0-7B-Instruct) and put it into `../../weights/JavisVerse`, e.g.,

```bash
hf download JavisVerse/JavisGPT-v1.0-7B-Instruct --local-dir ../../weights/JavisVerse/JavisGPT-v1.0-7B-Instruct
```

Finally, download the necessary checkpoints of the downstream JAVG model ([JavisDiT](https://github.com/JavisVerse/JavisDiT.git)) and put them into `../../weights/pretrained/dit` or `../../weights/JavisVerse`, matching the path definitions in `./interface/config/*.py`:

```bash
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir ../../weights/pretrained/dit/Wan2.1-T2V-1.3B
hf download cvssp/audioldm2 --local-dir ../../weights/pretrained/dit/audioldm2
hf download JavisVerse/JavisDiT-v1.0-jav --local-dir ../../weights/JavisVerse/JavisDiT-v1.0-jav
```
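Before running inference, it helps to confirm every checkpoint landed where the configs expect it. A sketch (`check_weights` is a hypothetical helper; the path list mirrors the downloads above, relative to the JavisGPT workspace):

```shell
# Sketch: verify the checkpoints downloaded above exist under a given
# weights root (defaults to ../../weights, as in the layout above).
check_weights() {
  root="${1:-../../weights}"
  missing=0
  for p in \
    "$root/pretrained/mllm/Qwen2.5-VL-7B-Instruct" \
    "$root/pretrained/mllm/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt" \
    "$root/pretrained/dit/Wan2.1-T2V-1.3B" \
    "$root/pretrained/dit/audioldm2" \
    "$root/JavisVerse/JavisDiT-v1.0-jav" \
    "$root/JavisVerse/JavisGPT-v1.0-7B-Instruct"
  do
    [ -e "$p" ] || { echo "missing: $p"; missing=$((missing + 1)); }
  done
  echo "$missing missing"
}
```

Running `check_weights` from the JavisGPT workspace prints `0 missing` once every download above has landed.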

#### 2. Run Target Inference

- **Standalone Audio/Visual Comprehension**

Use the following commands to evaluate the preserved single-modality understanding capability.

For audio comprehension:

```bash
AUDIO_PATH="assets/demos/audio/Creaking_pier.wav"
PROMPT="Is the sound caused by pressure from/against wood?"
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} AUDIO_PATH=${AUDIO_PATH} PROMPT=${PROMPT} \
    bash scripts/demo/demo_audio_visual.sh
```

For video comprehension:

```bash
VIDEO_PATH="assets/demos/video/ZS9XR.mp4"
PROMPT="What happened after the person took the box? A. Ate the medicine. B. Tidied up the blanket. C. Put down the cup/glass/bottle. D. Open the computer."
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} VIDEO_PATH=${VIDEO_PATH} PROMPT=${PROMPT} \
    bash scripts/demo/demo_audio_visual.sh
```

- **Joint Audio-Video Comprehension**

Use the following command to evaluate the joint audio-video comprehension capability.

```bash
VIDEO_PATH="assets/demos/audio_video/00002617.mp4"
PROMPT="How many instruments in the room did not sound from beginning to end? Answer the question using a single word."
USE_AUDIO_IN_VIDEO=True
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} VIDEO_PATH=${VIDEO_PATH} PROMPT=${PROMPT} USE_AUDIO_IN_VIDEO=${USE_AUDIO_IN_VIDEO} \
    bash scripts/demo/demo_audio_visual.sh
```
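The demo invocations above differ only in their environment variables, so batching several demos is just a loop over media/prompt pairs. A sketch (a dry run: it prints each command line instead of executing it, since the checkpoints may not yet be in place; the pairs reuse the demo assets from this README):

```shell
# Batch several comprehension demos by looping over "media|prompt" pairs.
# Dry run: prints each command line instead of executing it.
PAIRS=(
  "assets/demos/audio/Creaking_pier.wav|Is the sound caused by pressure from/against wood?"
  "assets/demos/audio_video/00002617.mp4|How many instruments in the room did not sound from beginning to end? Answer the question using a single word."
)

for pair in "${PAIRS[@]}"; do
  media="${pair%%|*}"   # text before the first "|"
  prompt="${pair#*|}"   # text after the first "|"
  case "$media" in
    *.wav) vars="AUDIO_PATH=$media" ;;
    *.mp4) vars="VIDEO_PATH=$media USE_AUDIO_IN_VIDEO=True" ;;
  esac
  echo "JAV_VERSION=v1.0 $vars PROMPT=\"$prompt\" bash scripts/demo/demo_audio_visual.sh"
done
```

Drop the `echo` (and quote expansion accordingly) to actually execute the commands once the weights are prepared.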

- **Joint Audio-Video Generation**

Use the following command to evaluate the sounding-video generation capability.

```bash
PROMPT="Build a video, ensuring the content is echoed by complementary scenes: A beautiful waterfall cascades down a steep cliff into a clear pool below. Sunlight filters through the surrounding trees, creating shimmering reflections on the falling water. The scene is calm and natural, with continuous flowing water and gentle mist rising from the base. The sound consists of steady rushing water, soft splashes, and faint ambient forest noise."
AV_GENERATE=True
SAVE_PREFIX="./results/avgen/demo"
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} AV_GENERATE=${AV_GENERATE} PROMPT=${PROMPT} SAVE_PREFIX=${SAVE_PREFIX} \
    bash scripts/demo/demo_audio_visual.sh
```

The generated sample will be saved at `${SAVE_PREFIX}.mp4`, e.g., `./results/avgen/demo.mp4`.

## Citation

If you find JavisGPT useful in your project, please kindly cite:

```bibtex
@inproceedings{liu2025javisgpt,
  title={JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation},
  author={Kai Liu and Jungang Li and Yuchong Sun and Shengqiong Wu and Jianzhang Gao and Daoan Zhang and Wei Zhang and Sheng Jin and Sicheng Yu and Geng Zhan and Jiayi Ji and Fan Zhou and Liang Zheng and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
}
```
adapter_config.json ADDED
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/mnt/HithinkOmniSSD/user_workspace/liukai4/weights/pretrained/mllm/Qwen2.5-VL-7B-Instruct",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 256,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 128,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "layers.8.mlp.up_proj",
    "layers.21.mlp.up_proj",
    "model.layers.9.self_attn.k_proj",
    "model.layers.0.self_attn.k_proj",
    "layers.5.mlp.gate_proj",
    "o_proj",
    "layers.25.mlp.gate_proj",
    "layers.10.mlp.up_proj",
    "layers.21.mlp.down_proj",
    "21.self_attn.q_proj",
    "layers.12.mlp.up_proj",
    "model.layers.5.self_attn.v_proj",
    "layers.5.mlp.down_proj",
    "13.self_attn.k_proj",
    "15.self_attn.q_proj",
    "18.self_attn.v_proj",
    "26.self_attn.q_proj",
    "model.layers.1.self_attn.k_proj",
    "22.self_attn.k_proj",
    "layers.12.mlp.gate_proj",
    "model.layers.4.self_attn.k_proj",
    "model.layers.11.self_attn.k_proj",
    "layers.4.mlp.up_proj",
    "model.layers.9.self_attn.v_proj",
    "layers.5.mlp.up_proj",
    "18.self_attn.k_proj",
    "layers.18.mlp.gate_proj",
    "layers.25.mlp.up_proj",
    "24.self_attn.q_proj",
    "layers.20.mlp.down_proj",
    "23.self_attn.v_proj",
    "19.self_attn.q_proj",
    "layers.24.mlp.up_proj",
    "layers.23.mlp.gate_proj",
    "25.self_attn.v_proj",
    "model.layers.7.self_attn.v_proj",
    "15.self_attn.k_proj",
    "layers.0.mlp.down_proj",
    "model.layers.4.self_attn.v_proj",
    "23.self_attn.k_proj",
    "13.self_attn.v_proj",
    "layers.1.mlp.gate_proj",
    "model.layers.9.self_attn.q_proj",
    "layers.16.mlp.down_proj",
    "22.self_attn.q_proj",
    "layers.14.mlp.up_proj",
    "layers.26.mlp.up_proj",
    "layers.19.mlp.up_proj",
    "layers.12.mlp.down_proj",
    "layers.19.mlp.gate_proj",
    "model.layers.3.self_attn.q_proj",
    "layers.16.mlp.up_proj",
    "layers.11.mlp.up_proj",
    "layers.1.mlp.down_proj",
    "model.layers.10.self_attn.v_proj",
    "model.layers.6.self_attn.q_proj",
    "model.layers.6.self_attn.k_proj",
    "layers.22.mlp.down_proj",
    "model.layers.8.self_attn.q_proj",
    "layers.25.mlp.down_proj",
    "14.self_attn.v_proj",
    "layers.0.mlp.gate_proj",
    "layers.2.mlp.up_proj",
    "model.layers.4.self_attn.q_proj",
    "layers.11.mlp.down_proj",
    "layers.26.mlp.gate_proj",
    "14.self_attn.k_proj",
    "layers.17.mlp.up_proj",
    "model.layers.3.self_attn.k_proj",
    "layers.9.mlp.up_proj",
    "layers.7.mlp.gate_proj",
    "15.self_attn.v_proj",
    "20.self_attn.v_proj",
    "layers.27.mlp.gate_proj",
    "model.layers.7.self_attn.q_proj",
    "model.layers.2.self_attn.q_proj",
    "layers.7.mlp.up_proj",
    "27.self_attn.k_proj",
    "model.layers.10.self_attn.k_proj",
    "layers.1.mlp.up_proj",
    "layers.14.mlp.gate_proj",
    "layers.19.mlp.down_proj",
    "layers.27.mlp.up_proj",
    "layers.24.mlp.down_proj",
    "layers.8.mlp.gate_proj",
    "layers.4.mlp.gate_proj",
    "18.self_attn.q_proj",
    "layers.15.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",
    "layers.8.mlp.down_proj",
    "layers.13.mlp.down_proj",
    "model.layers.0.self_attn.q_proj",
    "layers.11.mlp.gate_proj",
    "layers.17.mlp.gate_proj",
    "17.self_attn.q_proj",
    "25.self_attn.q_proj",
    "layers.15.mlp.down_proj",
    "layers.10.mlp.down_proj",
    "12.self_attn.k_proj",
    "layers.15.mlp.up_proj",
    "layers.7.mlp.down_proj",
    "layers.9.mlp.down_proj",
    "16.self_attn.q_proj",
    "layers.13.mlp.gate_proj",
    "layers.20.mlp.up_proj",
    "23.self_attn.q_proj",
    "layers.14.mlp.down_proj",
    "layers.24.mlp.gate_proj",
    "layers.26.mlp.down_proj",
    "24.self_attn.k_proj",
    "model.layers.3.self_attn.v_proj",
    "model.layers.0.self_attn.v_proj",
    "22.self_attn.v_proj",
    "layers.3.mlp.down_proj",
    "25.self_attn.k_proj",
    "layers.2.mlp.down_proj",
    "layers.13.mlp.up_proj",
    "layers.16.mlp.gate_proj",
    "17.self_attn.k_proj",
    "layers.22.mlp.up_proj",
    "layers.6.mlp.gate_proj",
    "19.self_attn.v_proj",
    "model.layers.11.self_attn.v_proj",
    "model.layers.7.self_attn.k_proj",
    "20.self_attn.q_proj",
    "layers.20.mlp.gate_proj",
    "layers.21.mlp.gate_proj",
    "model.layers.8.self_attn.k_proj",
    "24.self_attn.v_proj",
    "21.self_attn.v_proj",
    "27.self_attn.v_proj",
    "layers.6.mlp.up_proj",
    "16.self_attn.k_proj",
    "26.self_attn.k_proj",
    "layers.23.mlp.down_proj",
    "layers.4.mlp.down_proj",
    "layers.3.mlp.up_proj",
    "layers.23.mlp.up_proj",
    "model.layers.6.self_attn.v_proj",
    "26.self_attn.v_proj",
    "16.self_attn.v_proj",
    "13.self_attn.q_proj",
    "12.self_attn.v_proj",
    "model.layers.2.self_attn.k_proj",
    "layers.10.mlp.gate_proj",
    "17.self_attn.v_proj",
    "layers.22.mlp.gate_proj",
    "model.layers.8.self_attn.v_proj",
    "layers.27.mlp.down_proj",
    "model.layers.5.self_attn.k_proj",
    "20.self_attn.k_proj",
    "layers.3.mlp.gate_proj",
    "14.self_attn.q_proj",
    "layers.9.mlp.gate_proj",
    "model.layers.1.self_attn.v_proj",
    "layers.6.mlp.down_proj",
    "model.layers.10.self_attn.q_proj",
    "layers.0.mlp.up_proj",
    "19.self_attn.k_proj",
    "layers.17.mlp.down_proj",
    "layers.2.mlp.gate_proj",
    "27.self_attn.q_proj",
    "model.layers.11.self_attn.q_proj",
    "layers.18.mlp.down_proj",
    "21.self_attn.k_proj",
    "layers.18.mlp.up_proj",
    "model.layers.5.self_attn.q_proj",
    "12.self_attn.q_proj",
    "model.layers.2.self_attn.v_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
adapter_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:247d2b2f1f7b4ba822781b1bbb150bdb06a0857766154927829c2d00592c1772
size 645976488
config.json ADDED
{
  "_attn_implementation_autoset": true,
  "_name_or_path": "/opt/data/private/weights/pretrained/mllm/Qwen2.5-VL-7B-Instruct",
  "architectures": [
    "JavisGPTForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "audio_end_token_id": 151666,
  "audio_pad_token_id": 151667,
  "audio_start_token_id": 151665,
  "audio_video_end_token_id": 151669,
  "audio_video_pad_token_id": 151670,
  "audio_video_start_token_id": 151668,
  "avgen_cfg_path": "/opt/data/private/projects/JavisGPT-dev/config/javisdit2.py",
  "avsync_mode": "merge",
  "avsync_onset_modulate": false,
  "beats_cfg": {
    "activation_dropout": 0.0,
    "activation_fn": "gelu",
    "attention_dropout": 0.0,
    "conv_bias": false,
    "conv_pos": 128,
    "conv_pos_groups": 16,
    "deep_norm": true,
    "dropout": 0.0,
    "dropout_input": 0.0,
    "embed_dim": 512,
    "encoder_attention_heads": 12,
    "encoder_embed_dim": 768,
    "encoder_ffn_embed_dim": 3072,
    "encoder_layerdrop": 0.05,
    "encoder_layers": 12,
    "finetuned_model": true,
    "gru_rel_pos": true,
    "input_patch_size": 16,
    "layer_norm_first": false,
    "layer_wise_gradient_decay_ratio": 0.6,
    "max_distance": 800,
    "num_buckets": 320,
    "predictor_class": 527,
    "predictor_dropout": 0.0,
    "relative_position_embedding": true
  },
  "bos_token_id": 151643,
  "calc_dummy_loss": true,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "javisgpt",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "hidden_size": 1280,
    "in_chans": 3,
    "model_type": "qwen2_5_vl",
    "spatial_patch_size": 14,
    "tokens_per_second": 2,
    "torch_dtype": "bfloat16"
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
mm_proj_all.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:bc12ef8848e83575d7c49209e44a388627a53542a4ca9d94bf2edd261039ffe0
size 2837066875