Safetensors
javisgpt
kkail8 committed
Commit feeac5e · verified · 1 Parent(s): 5265378

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +178 -3
  2. adapter_config.json +196 -0
  3. adapter_model.safetensors +3 -0
  4. config.json +88 -0
  5. mm_proj_all.bin +3 -0
README.md CHANGED
---
license: apache-2.0
arxiv: 2512.22905
---

## <div align="center"> JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation</div>

<div align="center">

[[`HomePage`](https://javisverse.github.io/JavisGPT-page/)]
[[`Paper`](https://arxiv.org/abs/2512.22905)]
[[`GitHub`](https://github.com/JavisVerse/JavisGPT)]
[[`Model`](https://huggingface.co/collections/JavisVerse/javisgpt)]
[[`Dataset`](https://huggingface.co/collections/JavisVerse/javisgpt)]

</div>

## TL;DR

We introduce **`JavisGPT`**, a multimodal LLM that understands audio-visual inputs and simultaneously generates synchronized sounding videos within a single unified model.
We also curate the **`JavisInst-Omni`** dataset to facilitate instruction tuning for comprehension and generation of sounding videos.

## 📰 News

- **[2026.2.26]** 🔥🔥 We release the upgraded [JavisGPT-v1.0-7B-Instruct](https://huggingface.co/JavisVerse/JavisGPT-v1.0-7B-Instruct) checkpoint on Hugging Face, powered by [JavisDiT-v1.0-jav](https://huggingface.co/JavisVerse/JavisDiT-v1.0-jav) to achieve better audio-video generation.
- **[2025.12.30]** 🚀 We release the [JavisInst-Omni](https://huggingface.co/datasets/JavisVerse/JavisInst-Omni) training dataset to support multimodal instruction tuning on sounding-video comprehension and generation tasks, as well as the [MM-PreTrain](https://huggingface.co/datasets/JavisVerse/MM-PreTrain) and [AV-FineTune](https://huggingface.co/datasets/JavisVerse/AV-FineTune) datasets to enable preliminary multimodal alignment for LLMs.
- **[2025.12.26]** 🔥 We release the code of [JavisGPT](https://arxiv.org/abs/2512.22905), along with the preview [JavisGPT-v0.1-7B-Instruct](https://huggingface.co/JavisVerse/JavisGPT-v0.1-7B-Instruct) checkpoint on Hugging Face. Feel free to play with it!
32
+
33
+ ## Code
34
+
35
+
36
+ ### Installation
37
+
38
+ Install the necessary packages:
39
+
40
+ ```bash
41
+ conda create -n javisgpt python=3.10 -y
42
+ conda activate javisgpt
43
+ pip install --upgrade pip # Enable PEP 660 support.
44
+ pip install flash-attn==2.7.4.post1 --no-build-isolation
45
+ pip install -v -e ".[train]"
46
+ cp assets/src/dynamic_modules_utils.py /path/to/python3.10/site-packages/diffusers/utils/
47
+ conda install "ffmpeg<7" -c conda-forge -y # install ffpmeg
48
+ ```
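The `cp` step needs your environment's real site-packages directory in place of the `/path/to/python3.10` placeholder. A small sketch (assuming `python3` is on `PATH`; the actual copy is left commented out because it depends on your checkout and on `diffusers` being installed) resolves it programmatically:

```shell
# Resolve the active interpreter's site-packages directory instead of
# hard-coding /path/to/python3.10 (a sketch; assumes python3 is on PATH).
SITE_PACKAGES=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
echo "copy target: ${SITE_PACKAGES}/diffusers/utils/"
# cp assets/src/dynamic_modules_utils.py "${SITE_PACKAGES}/diffusers/utils/"
```

Run this inside the activated `javisgpt` environment so the resolved path matches the interpreter that imports `diffusers`.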

Install [JavisDiT](https://arxiv.org/abs/2602.19163) dependencies:

```bash
cd ..
git clone https://github.com/JavisVerse/JavisDiT.git
cd JavisDiT
pip install -v -e . --no-deps
cd ../JavisGPT

# make a soft link if necessary
# ln -s ../JavisDiT/javisdit javisdit
```

### Inference

We assume the following data structure:

```bash
/path/to/user/root
|-- projects
|   |-- JavisDiT   # downstream JAV-DiT
|   └-- JavisGPT   # workspace of this project
|-- weights
|   |-- pretrained
|   |   |-- dit    # pretrained weights for JavisDiT
|   |   |   |-- Wan2.1-T2V-1.3B
|   |   |   └-- audioldm2
|   |   |-- mllm   # pretrained weights for JavisGPT
|   |   |   |-- BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt
|   |   |   └-- Qwen2.5-VL-7B-Instruct
|   |-- JavisVerse
|   |   |-- JavisDiT-v1.0-jav
|   |   └-- JavisGPT-v1.0-7B-Instruct
```
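A minimal sketch to bootstrap this layout (`ROOT` is a placeholder for your actual user root; the scripts themselves do not require this helper):

```shell
# Create the directory skeleton assumed by the layout above.
# ROOT is a placeholder; point it at your own user root.
ROOT="/tmp/javis_root"
mkdir -p \
  "${ROOT}/projects" \
  "${ROOT}/weights/pretrained/dit" \
  "${ROOT}/weights/pretrained/mllm" \
  "${ROOT}/weights/JavisVerse"
echo "skeleton created under ${ROOT}"
```

Clone the two repositories under `${ROOT}/projects` and place the downloads from the next step under `${ROOT}/weights`.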

#### 1. Prepare Pretrained Weights

First, download [BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt](https://github.com/microsoft/unilm/tree/master/beats) from [here](https://1drv.ms/u/s!AqeByhGUtINrgcpj8ujXH1YUtxooEg?e=E9Ncea) and [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and put (or link) them into `../../weights/pretrained/mllm`:

```bash
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ../../weights/pretrained/mllm/Qwen2.5-VL-7B-Instruct
```

Then, download our [JavisGPT-v1.0-7B-Instruct](https://huggingface.co/JavisVerse/JavisGPT-v1.0-7B-Instruct) and put it into `../../weights/JavisVerse`, e.g.,

```bash
hf download JavisVerse/JavisGPT-v1.0-7B-Instruct --local-dir ../../weights/JavisVerse/JavisGPT-v1.0-7B-Instruct
```

Finally, download the necessary checkpoints of the downstream JAVG model ([JavisDiT](https://github.com/JavisVerse/JavisDiT.git)) and put them into `../../weights/pretrained/dit` or `../../weights/JavisVerse`, matching the path definitions in `./interface/config/*.py`:

```bash
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir ../../weights/pretrained/dit/Wan2.1-T2V-1.3B
hf download cvssp/audioldm2 --local-dir ../../weights/pretrained/dit/audioldm2
hf download JavisVerse/JavisDiT-v1.0-jav --local-dir ../../weights/JavisVerse/JavisDiT-v1.0-jav
```
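Before running inference, it helps to confirm every checkpoint landed where the configs expect it. A sketch (`check_weights` is a hypothetical helper; the path list mirrors the downloads above, relative to the JavisGPT workspace):

```shell
# Sketch: verify the checkpoints downloaded above exist under a given
# weights root (defaults to ../../weights, as in the layout above).
check_weights() {
  root="${1:-../../weights}"
  missing=0
  for p in \
    "$root/pretrained/mllm/Qwen2.5-VL-7B-Instruct" \
    "$root/pretrained/mllm/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt" \
    "$root/pretrained/dit/Wan2.1-T2V-1.3B" \
    "$root/pretrained/dit/audioldm2" \
    "$root/JavisVerse/JavisDiT-v1.0-jav" \
    "$root/JavisVerse/JavisGPT-v1.0-7B-Instruct"
  do
    [ -e "$p" ] || { echo "missing: $p"; missing=$((missing + 1)); }
  done
  echo "$missing missing"
}
```

Running `check_weights` from the JavisGPT workspace prints `0 missing` once every download above has landed.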

#### 2. Run Target Inference

- **Standalone Audio/Visual Comprehension**

Use the following commands to evaluate the preserved single-modality understanding capability.

For audio comprehension:

```bash
AUDIO_PATH="assets/demos/audio/Creaking_pier.wav"
PROMPT="Is the sound caused by pressure from/against wood?"
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} AUDIO_PATH=${AUDIO_PATH} PROMPT=${PROMPT} \
    bash scripts/demo/demo_audio_visual.sh
```

For video comprehension:

```bash
VIDEO_PATH="assets/demos/video/ZS9XR.mp4"
PROMPT="What happened after the person took the box? A. Ate the medicine. B. Tidied up the blanket. C. Put down the cup/glass/bottle. D. Open the computer."
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} VIDEO_PATH=${VIDEO_PATH} PROMPT=${PROMPT} \
    bash scripts/demo/demo_audio_visual.sh
```

- **Joint Audio-Video Comprehension**

Use the following command to evaluate the joint audio-video comprehension capability.

```bash
VIDEO_PATH="assets/demos/audio_video/00002617.mp4"
PROMPT="How many instruments in the room did not sound from beginning to end? Answer the question using a single word."
USE_AUDIO_IN_VIDEO=True
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} VIDEO_PATH=${VIDEO_PATH} PROMPT=${PROMPT} USE_AUDIO_IN_VIDEO=${USE_AUDIO_IN_VIDEO} \
    bash scripts/demo/demo_audio_visual.sh
```
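The demo invocations above differ only in their environment variables, so batching several demos is just a loop over media/prompt pairs. A sketch (a dry run: it prints each command line instead of executing it, since the checkpoints may not yet be in place; the pairs reuse the demo assets from this README):

```shell
# Batch several comprehension demos by looping over "media|prompt" pairs.
# Dry run: prints each command line instead of executing it.
PAIRS=(
  "assets/demos/audio/Creaking_pier.wav|Is the sound caused by pressure from/against wood?"
  "assets/demos/audio_video/00002617.mp4|How many instruments in the room did not sound from beginning to end? Answer the question using a single word."
)

for pair in "${PAIRS[@]}"; do
  media="${pair%%|*}"   # text before the first "|"
  prompt="${pair#*|}"   # text after the first "|"
  case "$media" in
    *.wav) vars="AUDIO_PATH=$media" ;;
    *.mp4) vars="VIDEO_PATH=$media USE_AUDIO_IN_VIDEO=True" ;;
  esac
  echo "JAV_VERSION=v1.0 $vars PROMPT=\"$prompt\" bash scripts/demo/demo_audio_visual.sh"
done
```

Drop the `echo` (and quote expansion accordingly) to actually execute the commands once the weights are prepared.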

- **Joint Audio-Video Generation**

Use the following command to evaluate the sounding-video generation capability.

```bash
PROMPT="Build a video, ensuring the content is echoed by complementary scenes: A beautiful waterfall cascades down a steep cliff into a clear pool below. Sunlight filters through the surrounding trees, creating shimmering reflections on the falling water. The scene is calm and natural, with continuous flowing water and gentle mist rising from the base. The sound consists of steady rushing water, soft splashes, and faint ambient forest noise."
AV_GENERATE=True
SAVE_PREFIX="./results/avgen/demo"
JAV_VERSION="v1.0"

JAV_VERSION=${JAV_VERSION} AV_GENERATE=${AV_GENERATE} PROMPT=${PROMPT} SAVE_PREFIX=${SAVE_PREFIX} \
    bash scripts/demo/demo_audio_visual.sh
```

The generated sample will be saved at `${SAVE_PREFIX}.mp4`, e.g., `./results/avgen/demo.mp4`.

## Citation

If you find JavisGPT useful in your project, please kindly cite:

```bibtex
@inproceedings{liu2025javisgpt,
  title={JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation},
  author={Kai Liu and Jungang Li and Yuchong Sun and Shengqiong Wu and Jianzhang Gao and Daoan Zhang and Wei Zhang and Sheng Jin and Sicheng Yu and Geng Zhan and Jiayi Ji and Fan Zhou and Liang Zheng and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
}
```
adapter_config.json ADDED
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/mnt/HithinkOmniSSD/user_workspace/liukai4/weights/pretrained/mllm/Qwen2.5-VL-7B-Instruct",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 256,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 128,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "layers.8.mlp.up_proj",
    "layers.21.mlp.up_proj",
    "model.layers.9.self_attn.k_proj",
    "model.layers.0.self_attn.k_proj",
    "layers.5.mlp.gate_proj",
    "o_proj",
    "layers.25.mlp.gate_proj",
    "layers.10.mlp.up_proj",
    "layers.21.mlp.down_proj",
    "21.self_attn.q_proj",
    "layers.12.mlp.up_proj",
    "model.layers.5.self_attn.v_proj",
    "layers.5.mlp.down_proj",
    "13.self_attn.k_proj",
    "15.self_attn.q_proj",
    "18.self_attn.v_proj",
    "26.self_attn.q_proj",
    "model.layers.1.self_attn.k_proj",
    "22.self_attn.k_proj",
    "layers.12.mlp.gate_proj",
    "model.layers.4.self_attn.k_proj",
    "model.layers.11.self_attn.k_proj",
    "layers.4.mlp.up_proj",
    "model.layers.9.self_attn.v_proj",
    "layers.5.mlp.up_proj",
    "18.self_attn.k_proj",
    "layers.18.mlp.gate_proj",
    "layers.25.mlp.up_proj",
    "24.self_attn.q_proj",
    "layers.20.mlp.down_proj",
    "23.self_attn.v_proj",
    "19.self_attn.q_proj",
    "layers.24.mlp.up_proj",
    "layers.23.mlp.gate_proj",
    "25.self_attn.v_proj",
    "model.layers.7.self_attn.v_proj",
    "15.self_attn.k_proj",
    "layers.0.mlp.down_proj",
    "model.layers.4.self_attn.v_proj",
    "23.self_attn.k_proj",
    "13.self_attn.v_proj",
    "layers.1.mlp.gate_proj",
    "model.layers.9.self_attn.q_proj",
    "layers.16.mlp.down_proj",
    "22.self_attn.q_proj",
    "layers.14.mlp.up_proj",
    "layers.26.mlp.up_proj",
    "layers.19.mlp.up_proj",
    "layers.12.mlp.down_proj",
    "layers.19.mlp.gate_proj",
    "model.layers.3.self_attn.q_proj",
    "layers.16.mlp.up_proj",
    "layers.11.mlp.up_proj",
    "layers.1.mlp.down_proj",
    "model.layers.10.self_attn.v_proj",
    "model.layers.6.self_attn.q_proj",
    "model.layers.6.self_attn.k_proj",
    "layers.22.mlp.down_proj",
    "model.layers.8.self_attn.q_proj",
    "layers.25.mlp.down_proj",
    "14.self_attn.v_proj",
    "layers.0.mlp.gate_proj",
    "layers.2.mlp.up_proj",
    "model.layers.4.self_attn.q_proj",
    "layers.11.mlp.down_proj",
    "layers.26.mlp.gate_proj",
    "14.self_attn.k_proj",
    "layers.17.mlp.up_proj",
    "model.layers.3.self_attn.k_proj",
    "layers.9.mlp.up_proj",
    "layers.7.mlp.gate_proj",
    "15.self_attn.v_proj",
    "20.self_attn.v_proj",
    "layers.27.mlp.gate_proj",
    "model.layers.7.self_attn.q_proj",
    "model.layers.2.self_attn.q_proj",
    "layers.7.mlp.up_proj",
    "27.self_attn.k_proj",
    "model.layers.10.self_attn.k_proj",
    "layers.1.mlp.up_proj",
    "layers.14.mlp.gate_proj",
    "layers.19.mlp.down_proj",
    "layers.27.mlp.up_proj",
    "layers.24.mlp.down_proj",
    "layers.8.mlp.gate_proj",
    "layers.4.mlp.gate_proj",
    "18.self_attn.q_proj",
    "layers.15.mlp.gate_proj",
    "model.layers.1.self_attn.q_proj",
    "layers.8.mlp.down_proj",
    "layers.13.mlp.down_proj",
    "model.layers.0.self_attn.q_proj",
    "layers.11.mlp.gate_proj",
    "layers.17.mlp.gate_proj",
    "17.self_attn.q_proj",
    "25.self_attn.q_proj",
    "layers.15.mlp.down_proj",
    "layers.10.mlp.down_proj",
    "12.self_attn.k_proj",
    "layers.15.mlp.up_proj",
    "layers.7.mlp.down_proj",
    "layers.9.mlp.down_proj",
    "16.self_attn.q_proj",
    "layers.13.mlp.gate_proj",
    "layers.20.mlp.up_proj",
    "23.self_attn.q_proj",
    "layers.14.mlp.down_proj",
    "layers.24.mlp.gate_proj",
    "layers.26.mlp.down_proj",
    "24.self_attn.k_proj",
    "model.layers.3.self_attn.v_proj",
    "model.layers.0.self_attn.v_proj",
    "22.self_attn.v_proj",
    "layers.3.mlp.down_proj",
    "25.self_attn.k_proj",
    "layers.2.mlp.down_proj",
    "layers.13.mlp.up_proj",
    "layers.16.mlp.gate_proj",
    "17.self_attn.k_proj",
    "layers.22.mlp.up_proj",
    "layers.6.mlp.gate_proj",
    "19.self_attn.v_proj",
    "model.layers.11.self_attn.v_proj",
    "model.layers.7.self_attn.k_proj",
    "20.self_attn.q_proj",
    "layers.20.mlp.gate_proj",
    "layers.21.mlp.gate_proj",
    "model.layers.8.self_attn.k_proj",
    "24.self_attn.v_proj",
    "21.self_attn.v_proj",
    "27.self_attn.v_proj",
    "layers.6.mlp.up_proj",
    "16.self_attn.k_proj",
    "26.self_attn.k_proj",
    "layers.23.mlp.down_proj",
    "layers.4.mlp.down_proj",
    "layers.3.mlp.up_proj",
    "layers.23.mlp.up_proj",
    "model.layers.6.self_attn.v_proj",
    "26.self_attn.v_proj",
    "16.self_attn.v_proj",
    "13.self_attn.q_proj",
    "12.self_attn.v_proj",
    "model.layers.2.self_attn.k_proj",
    "layers.10.mlp.gate_proj",
    "17.self_attn.v_proj",
    "layers.22.mlp.gate_proj",
    "model.layers.8.self_attn.v_proj",
    "layers.27.mlp.down_proj",
    "model.layers.5.self_attn.k_proj",
    "20.self_attn.k_proj",
    "layers.3.mlp.gate_proj",
    "14.self_attn.q_proj",
    "layers.9.mlp.gate_proj",
    "model.layers.1.self_attn.v_proj",
    "layers.6.mlp.down_proj",
    "model.layers.10.self_attn.q_proj",
    "layers.0.mlp.up_proj",
    "19.self_attn.k_proj",
    "layers.17.mlp.down_proj",
    "layers.2.mlp.gate_proj",
    "27.self_attn.q_proj",
    "model.layers.11.self_attn.q_proj",
    "layers.18.mlp.down_proj",
    "21.self_attn.k_proj",
    "layers.18.mlp.up_proj",
    "model.layers.5.self_attn.q_proj",
    "12.self_attn.q_proj",
    "model.layers.2.self_attn.v_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
adapter_model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:247d2b2f1f7b4ba822781b1bbb150bdb06a0857766154927829c2d00592c1772
size 645976488
config.json ADDED
{
  "_attn_implementation_autoset": true,
  "_name_or_path": "/opt/data/private/weights/pretrained/mllm/Qwen2.5-VL-7B-Instruct",
  "architectures": [
    "JavisGPTForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "audio_end_token_id": 151666,
  "audio_pad_token_id": 151667,
  "audio_start_token_id": 151665,
  "audio_video_end_token_id": 151669,
  "audio_video_pad_token_id": 151670,
  "audio_video_start_token_id": 151668,
  "avgen_cfg_path": "/opt/data/private/projects/JavisGPT-dev/config/javisdit2.py",
  "avsync_mode": "merge",
  "avsync_onset_modulate": false,
  "beats_cfg": {
    "activation_dropout": 0.0,
    "activation_fn": "gelu",
    "attention_dropout": 0.0,
    "conv_bias": false,
    "conv_pos": 128,
    "conv_pos_groups": 16,
    "deep_norm": true,
    "dropout": 0.0,
    "dropout_input": 0.0,
    "embed_dim": 512,
    "encoder_attention_heads": 12,
    "encoder_embed_dim": 768,
    "encoder_ffn_embed_dim": 3072,
    "encoder_layerdrop": 0.05,
    "encoder_layers": 12,
    "finetuned_model": true,
    "gru_rel_pos": true,
    "input_patch_size": 16,
    "layer_norm_first": false,
    "layer_wise_gradient_decay_ratio": 0.6,
    "max_distance": 800,
    "num_buckets": 320,
    "predictor_class": 527,
    "predictor_dropout": 0.0,
    "relative_position_embedding": true
  },
  "bos_token_id": 151643,
  "calc_dummy_loss": true,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "javisgpt",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "hidden_size": 1280,
    "in_chans": 3,
    "model_type": "qwen2_5_vl",
    "spatial_patch_size": 14,
    "tokens_per_second": 2,
    "torch_dtype": "bfloat16"
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
mm_proj_all.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:bc12ef8848e83575d7c49209e44a388627a53542a4ca9d94bf2edd261039ffe0
size 2837066875