Files changed (2)
  1. README.md +144 -191
  2. config.json +116 -1
README.md CHANGED
@@ -1,233 +1,186 @@
  ---
  language:
- - en
- - zh
- library_name: diffusers
- license: mit
- pipeline_tag: text-to-video
  tags:
- - transformers
- - diffusers
- - image-to-video
- - video-continuation
  ---
-
- # LongCat-Video
-
  <div align="center">
- <img src="assets/longcat_logo.svg" width="45%" alt="LongCat-Video" />
  </div>
  <hr>

- <div align="center" style="line-height: 1;">
- <a href='https://meituan-longcat.github.io/LongCat-Video/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
- <a href='https://huggingface.co/papers/2510.22200'><img src='https://img.shields.io/badge/Paper-HuggingFace-red'></a>
- <a href='https://huggingface.co/meituan-longcat/LongCat-Video'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
- </div>
-
- <div align="center" style="line-height: 1;">
- <a href='https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/figures/wechat_official_accounts.png'><img src='https://img.shields.io/badge/WeChat-LongCat-brightgreen?logo=wechat&logoColor=white'></a>
- <a href='https://x.com/Meituan_LongCat'><img src='https://img.shields.io/badge/Twitter-LongCat-white?logo=x&logoColor=white'></a>
- </div>
-
- <div align="center" style="line-height: 1;">
- <a href='https://huggingface.co/meituan-longcat/LongCat-Video/blob/main/LICENSE'><img src='https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53'></a>
  </div>

- ## Model Introduction
- We introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across *Text-to-Video*, *Image-to-Video*, and *Video-Continuation* generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models.
-
- ### Key Features
- - 🌟 **Unified architecture for multiple tasks**: LongCat-Video unifies *Text-to-Video*, *Image-to-Video*, and *Video-Continuation* tasks within a single video generation framework. It natively supports all these tasks with a single model and consistently delivers strong performance across each individual task.
- - 🌟 **Long video generation**: LongCat-Video is natively pretrained on *Video-Continuation* tasks, enabling it to produce minutes-long videos without color drifting or quality degradation.
- - 🌟 **Efficient inference**: LongCat-Video generates $720p$, $30fps$ videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions.
- - 🌟 **Strong performance with multi-reward RLHF**: Powered by multi-reward Group Relative Policy Optimization (GRPO), comprehensive evaluations on both internal and public benchmarks demonstrate that LongCat-Video achieves performance comparable to leading open-source video generation models as well as the latest commercial solutions.
-
- For more details, please refer to the comprehensive [***LongCat-Video Technical Report***](https://huggingface.co/papers/2510.22200).

- ## 🎥 Teaser Video

- <div align="center">
- <video src="https://github.com/user-attachments/assets/00fa63f0-9c4e-461a-a79e-c662ad596d7d" width="2264" height="384"> </video>
  </div>

- ## Quick Start

- ### Installation

- Clone the repo:

- ```shell
- git clone https://github.com/meituan-longcat/LongCat-Video
- cd LongCat-Video
  ```
-
- Install dependencies:
-
- ```shell
- # create conda environment
- conda create -n longcat-video python=3.10
- conda activate longcat-video
-
- # install torch (configure according to your CUDA version)
- pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
-
- # install flash-attn-2
- pip install ninja
- pip install psutil
- pip install packaging
- pip install flash_attn==2.7.4.post1
-
- # install other requirements
- pip install -r requirements.txt
  ```

- FlashAttention-2 is enabled in the model config by default; you can also change the model config to use FlashAttention-3 or xformers.

- ### Model Download

- | Models | Download Link |
- | --- | --- |
- | LongCat-Video | 🤗 [Huggingface](https://huggingface.co/meituan-longcat/LongCat-Video) |

- Download models using huggingface-cli:
- ```shell
- pip install "huggingface_hub[cli]"
- huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
- ```

- ### Run Text-to-Video

- ```shell
- # Single-GPU inference
- torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
-
- # Multi-GPU inference
- torchrun --nproc_per_node=2 run_demo_text_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
  ```

- ### Run Image-to-Video

  ```shell
- # Single-GPU inference
- torchrun run_demo_image_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
-
- # Multi-GPU inference
- torchrun --nproc_per_node=2 run_demo_image_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
  ```

- ### Run Video-Continuation
-
- ```shell
- # Single-GPU inference
- torchrun run_demo_video_continuation.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
-
- # Multi-GPU inference
- torchrun --nproc_per_node=2 run_demo_video_continuation.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
  ```

- ### Run Long-Video Generation

- ```shell
- # Single-GPU inference
- torchrun run_demo_long_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
-
- # Multi-GPU inference
- torchrun --nproc_per_node=2 run_demo_long_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
- ```

- ### Run Interactive Video Generation

- ```shell
- # Single-GPU inference
- torchrun run_demo_interactive_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
-
- # Multi-GPU inference
- torchrun --nproc_per_node=2 run_demo_interactive_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
- ```
-
- ### Run Streamlit
-
- ```shell
- # Single-GPU inference
- streamlit run ./run_streamlit.py --server.fileWatcherType none --server.headless=false
- ```
-
- ## Evaluation Results
-
- ### Text-to-Video
- The *Text-to-Video* MOS evaluation results on our internal benchmark.
-
- | **MOS score** | **Veo3** | **PixVerse-V5** | **Wan 2.2-T2V-A14B** | **LongCat-Video** |
- |---------------|-------------------|--------------------|-------------|-------------|
- | **Accessibility** | Proprietary | Proprietary | Open Source | Open Source |
- | **Architecture** | - | - | MoE | Dense |
- | **# Total Params** | - | - | 28B | 13.6B |
- | **# Activated Params** | - | - | 14B | 13.6B |
- | Text-Alignment↑ | 3.99 | 3.81 | 3.70 | 3.76 |
- | Visual Quality↑ | 3.23 | 3.13 | 3.26 | 3.25 |
- | Motion Quality↑ | 3.86 | 3.81 | 3.78 | 3.74 |
- | Overall Quality↑ | 3.48 | 3.36 | 3.35 | 3.38 |
-
- ### Image-to-Video
- The *Image-to-Video* MOS evaluation results on our internal benchmark.
-
- | **MOS score** | **Seedance 1.0** | **Hailuo-02** | **Wan 2.2-I2V-A14B** | **LongCat-Video** |
- |---------------|-------------------|--------------------|-------------|-------------|
- | **Accessibility** | Proprietary | Proprietary | Open Source | Open Source |
- | **Architecture** | - | - | MoE | Dense |
- | **# Total Params** | - | - | 28B | 13.6B |
- | **# Activated Params** | - | - | 14B | 13.6B |
- | Image-Alignment↑ | 4.12 | 4.18 | 4.18 | 4.04 |
- | Text-Alignment↑ | 3.70 | 3.85 | 3.33 | 3.49 |
- | Visual Quality↑ | 3.22 | 3.18 | 3.23 | 3.27 |
- | Motion Quality↑ | 3.77 | 3.80 | 3.79 | 3.59 |
- | Overall Quality↑ | 3.35 | 3.27 | 3.26 | 3.17 |
-
- ## Community Works
-
- Community works are welcome! Please open a PR or an issue to add your work.
-
- [CacheDiT](https://github.com/vipshop/cache-dit) offers Fully Cache Acceleration support for LongCat-Video with DBCache and TaylorSeer, achieving a nearly 1.7x speedup without obvious loss of precision. Visit their [example](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline/run_longcat_video.py) for more details.

- ## License Agreement
-
- The **model weights** are released under the **MIT License**.
-
- Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.
-
- See the [LICENSE](https://huggingface.co/meituan-longcat/LongCat-Video/blob/main/LICENSE) file for the full license text.
-
- ## Usage Considerations
- This model has not been specifically designed or comprehensively evaluated for every possible downstream application.
-
- Developers should take into account the known limitations of large language models, including performance variations across different languages, and carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios.
- It is the responsibility of developers and downstream users to understand and comply with all applicable laws and regulations relevant to their use case, including but not limited to data protection, privacy, and content safety requirements.
-
- Nothing in this Model Card should be interpreted as altering or restricting the terms of the MIT License under which the model is released.
-
  ## Citation
- We kindly encourage citation of our work if you find it useful.
-
- ```
- @misc{meituanlongcatteam2025longcatvideotechnicalreport,
-       title={LongCat-Video Technical Report},
-       author={Meituan LongCat Team and Xunliang Cai and Qilong Huang and Zhuoliang Kang and Hongyu Li and Shijun Liang and Liya Ma and Siyu Ren and Xiaoming Wei and Rixu Xie and Tong Zhang},
-       year={2025},
-       eprint={2510.22200},
-       archivePrefix={arXiv},
-       primaryClass={cs.CV},
-       url={https://arxiv.org/abs/2510.22200},
- }
- ```
-
- ## Acknowledgements
-
- We would like to thank the contributors to the [Wan](https://huggingface.co/Wan-AI), [UMT5-XXL](https://huggingface.co/google/umt5-xxl), [Diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research.
-
- ## Contact
- Please contact us at <a href="mailto:longcat-team@meituan.com">longcat-team@meituan.com</a> or join our WeChat Group if you have any questions.
  ---
+ pipeline_tag: image-text-to-text
  language:
+ - multilingual
  tags:
+ - deepseek
+ - vision-language
+ - ocr
+ - custom_code
+ license: mit
+ library_name: transformers
  ---

  <div align="center">
+ <img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/logo.svg?raw=true" width="60%" alt="DeepSeek AI" />
  </div>
  <hr>
+ <div align="center">
+ <a href="https://www.deepseek.com/" target="_blank">
+ <img alt="Homepage" src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/badge.svg?raw=true" />
+ </a>
+ <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR" target="_blank">
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
+ </a>
  </div>

+ <div align="center">
+ <a href="https://discord.gg/Tc7c45Zzu5" target="_blank">
+ <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" />
+ </a>
+ <a href="https://twitter.com/deepseek_ai" target="_blank">
+ <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" />
+ </a>
  </div>

+ <p align="center">
+ <a href="https://github.com/deepseek-ai/DeepSeek-OCR"><b>🌟 Github</b></a> |
+ <a href="https://huggingface.co/deepseek-ai/DeepSeek-OCR"><b>📥 Model Download</b></a> |
+ <a href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf"><b>📄 Paper Link</b></a> |
+ <a href="https://arxiv.org/abs/2510.18234"><b>📄 Arxiv Paper Link</b></a>
+ </p>
+ <h2>
+ <p align="center">
+ <a href="https://huggingface.co/papers/2510.18234">DeepSeek-OCR: Contexts Optical Compression</a>
+ </p>
+ </h2>
+ <p align="center">
+ <img src="assets/fig1.png" style="width: 1000px" align=center>
+ </p>
+ <p align="center">
+ <a href="https://huggingface.co/papers/2510.18234">Explore the boundaries of visual-text compression.</a>
+ </p>
+
+ ## Usage
+ Inference with Hugging Face Transformers on NVIDIA GPUs. The following requirements were tested with Python 3.12.9 and CUDA 11.8:
  ```
+ torch==2.6.0
+ transformers==4.46.3
+ tokenizers==0.20.3
+ einops
+ addict
+ easydict
+ pip install flash-attn==2.7.3 --no-build-isolation
  ```

+ ```python
+ from transformers import AutoModel, AutoTokenizer
+ import torch
+ import os
+ os.environ["CUDA_VISIBLE_DEVICES"] = '0'
+ model_name = 'deepseek-ai/DeepSeek-OCR'
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModel.from_pretrained(model_name, _attn_implementation='flash_attention_2', trust_remote_code=True, use_safetensors=True)
+ model = model.eval().cuda().to(torch.bfloat16)
+
+ # prompt = "<image>\nFree OCR. "
+ prompt = "<image>\n<|grounding|>Convert the document to markdown. "
+ image_file = 'your_image.jpg'
+ output_path = 'your/output/dir'
+
+ # infer(self, tokenizer, prompt='', image_file='', output_path=' ', base_size=1024, image_size=640, crop_mode=True, test_compress=False, save_results=False)
+
+ # Tiny:   base_size = 512,  image_size = 512,  crop_mode = False
+ # Small:  base_size = 640,  image_size = 640,  crop_mode = False
+ # Base:   base_size = 1024, image_size = 1024, crop_mode = False
+ # Large:  base_size = 1280, image_size = 1280, crop_mode = False
+
+ # Gundam: base_size = 1024, image_size = 640,  crop_mode = True
+
+ res = model.infer(tokenizer, prompt=prompt, image_file=image_file, output_path=output_path, base_size=1024, image_size=640, crop_mode=True, save_results=True, test_compress=True)
  ```
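
As a quick variant, here is a minimal sketch that reuses the `model` and `tokenizer` loaded above together with the `infer` signature shown in the comments, but runs plain OCR in the native-resolution "Base" mode instead of the dynamic "Gundam" mode (the file paths are placeholders):

```python
# Minimal sketch (assumes `model` and `tokenizer` from the snippet above are already loaded).
# "Base" mode per the comments above: base_size=1024, image_size=1024, crop_mode=False.
prompt = "<image>\nFree OCR. "

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024,
    image_size=1024,
    crop_mode=False,
    save_results=True,
)
```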

+ ## vLLM
+ Refer to the [🌟 GitHub repository](https://github.com/deepseek-ai/DeepSeek-OCR/) for guidance on model inference acceleration, PDF processing, and more.

+ [2025/10/23] 🚀🚀🚀 DeepSeek-OCR is now officially supported in upstream [vLLM](https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-OCR.html#installing-vllm).
  ```shell
+ uv venv
+ source .venv/bin/activate
+ # Until the v0.11.1 release, install vLLM from a nightly build
+ uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
  ```

+ ```python
+ from vllm import LLM, SamplingParams
+ from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
+ from PIL import Image
+
+ # Create model instance
+ llm = LLM(
+     model="deepseek-ai/DeepSeek-OCR",
+     enable_prefix_caching=False,
+     mm_processor_cache_gb=0,
+     logits_processors=[NGramPerReqLogitsProcessor]
+ )
+
+ # Prepare batched input with your image files
+ image_1 = Image.open("path/to/your/image_1.png").convert("RGB")
+ image_2 = Image.open("path/to/your/image_2.png").convert("RGB")
+ prompt = "<image>\nFree OCR."
+
+ model_input = [
+     {
+         "prompt": prompt,
+         "multi_modal_data": {"image": image_1}
+     },
+     {
+         "prompt": prompt,
+         "multi_modal_data": {"image": image_2}
+     }
+ ]
+
+ sampling_param = SamplingParams(
+     temperature=0.0,
+     max_tokens=8192,
+     # ngram logit processor args
+     extra_args=dict(
+         ngram_size=30,
+         window_size=90,
+         whitelist_token_ids={128821, 128822},  # whitelist: <td>, </td>
+     ),
+     skip_special_tokens=False,
+ )
+
+ # Generate output
+ model_outputs = llm.generate(model_input, sampling_param)
+
+ # Print output
+ for output in model_outputs:
+     print(output.outputs[0].text)
  ```
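
The GitHub recipes linked above also cover PDF processing. As a rough, hypothetical sketch of how that could be wired into the vLLM pipeline defined above (assuming the third-party `pdf2image` package and its poppler dependency for rasterizing pages, neither of which is part of the official instructions):

```python
# Hypothetical sketch: OCR a PDF page by page with the `llm` and `sampling_param`
# objects defined in the previous snippet. pdf2image (backed by poppler) is an
# assumed helper for turning PDF pages into PIL images.
from pdf2image import convert_from_path

pages = convert_from_path("path/to/your/document.pdf", dpi=144)  # one PIL image per page

pdf_input = [
    {"prompt": "<image>\nFree OCR.", "multi_modal_data": {"image": page.convert("RGB")}}
    for page in pages
]

outputs = llm.generate(pdf_input, sampling_param)
for page_idx, output in enumerate(outputs, start=1):
    print(f"--- page {page_idx} ---")
    print(output.outputs[0].text)
```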

+ ## Visualizations
+ <table>
+ <tr>
+ <td><img src="assets/show1.jpg" style="width: 500px"></td>
+ <td><img src="assets/show2.jpg" style="width: 500px"></td>
+ </tr>
+ <tr>
+ <td><img src="assets/show3.jpg" style="width: 500px"></td>
+ <td><img src="assets/show4.jpg" style="width: 500px"></td>
+ </tr>
+ </table>

+ ## Acknowledgement

+ We would like to thank [Vary](https://github.com/Ucas-HaoranWei/Vary/), [GOT-OCR2.0](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/), [MinerU](https://github.com/opendatalab/MinerU), [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR), [OneChart](https://github.com/LingyvKong/OneChart), and [Slow Perception](https://github.com/Ucas-HaoranWei/Slow-Perception) for their valuable models and ideas.

+ We also appreciate the benchmarks: [Fox](https://github.com/ucaslcl/Fox) and [OmniDocBench](https://github.com/opendatalab/OmniDocBench).

  ## Citation
+ ```bibtex
+ @article{wei2025deepseek,
+       title={DeepSeek-OCR: Contexts Optical Compression},
+       author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
+       journal={arXiv preprint arXiv:2510.18234},
+       year={2025}
+ }
+ ```
config.json CHANGED
@@ -1,3 +1,118 @@
  {
- "model_name": "LongCat-Video"
  }
  {
+   "_name_or_path": "deepseek-ai/DeepSeek-OCR",
+   "candidate_resolutions": [
+     [
+       1024,
+       1024
+     ]
+   ],
+   "global_view_pos": "head",
+   "architectures": [
+     "DeepseekOCRForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "modeling_deepseekocr.DeepseekOCRConfig",
+     "AutoModel": "modeling_deepseekocr.DeepseekOCRForCausalLM"
+   },
+   "language_config": {
+     "architectures": [
+       "DeepseekV2ForCausalLM"
+     ],
+     "auto_map": {
+       "AutoConfig": "configuration_deepseekv2.DeepseekV2Config",
+       "AutoModel": "modeling_deepseek.DeepseekV2Model",
+       "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
+     },
+     "bos_token_id": 0,
+     "eos_token_id": 1,
+     "first_k_dense_replace": 1,
+     "hidden_size": 1280,
+     "intermediate_size": 6848,
+     "kv_lora_rank": null,
+     "lm_head": true,
+     "max_position_embeddings": 8192,
+     "moe_intermediate_size": 896,
+     "n_group": 1,
+     "n_routed_experts": 64,
+     "n_shared_experts": 2,
+     "num_attention_heads": 10,
+     "num_experts_per_tok": 6,
+     "num_hidden_layers": 12,
+     "num_key_value_heads": 10,
+     "q_lora_rank": null,
+     "qk_nope_head_dim": 0,
+     "qk_rope_head_dim": 0,
+     "rm_head": false,
+     "topk_group": 1,
+     "topk_method": "greedy",
+     "torch_dtype": "bfloat16",
+     "use_mla": false,
+     "v_head_dim": 0,
+     "vocab_size": 129280
+   },
+   "model_type": "deepseek_vl_v2",
+   "projector_config": {
+     "input_dim": 2048,
+     "model_type": "mlp_projector",
+     "n_embed": 1280,
+     "projector_type": "linear"
+   },
+   "tile_tag": "2D",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.46.3",
+   "vision_config": {
+     "image_size": 1024,
+     "mlp_ratio": 3.7362,
+     "model_name": "deeplip_b_l",
+     "model_type": "vision",
+     "width": {
+       "clip-l-14-224": {
+         "heads": 16,
+         "image_size": 224,
+         "layers": 24,
+         "patch_size": 14,
+         "width": 1024
+       },
+       "sam_vit_b": {
+         "downsample_channels": [
+           512,
+           1024
+         ],
+         "global_attn_indexes": [
+           2,
+           5,
+           8,
+           11
+         ],
+         "heads": 12,
+         "layers": 12,
+         "width": 768
+       }
+     }
+   },
+   "bos_token_id": 0,
+   "eos_token_id": 1,
+   "first_k_dense_replace": 1,
+   "hidden_size": 1280,
+   "intermediate_size": 6848,
+   "kv_lora_rank": null,
+   "lm_head": true,
+   "max_position_embeddings": 8192,
+   "moe_intermediate_size": 896,
+   "n_group": 1,
+   "n_routed_experts": 64,
+   "n_shared_experts": 2,
+   "num_attention_heads": 10,
+   "num_experts_per_tok": 6,
+   "num_hidden_layers": 12,
+   "num_key_value_heads": 10,
+   "q_lora_rank": null,
+   "qk_nope_head_dim": 0,
+   "qk_rope_head_dim": 0,
+   "rm_head": false,
+   "topk_group": 1,
+   "topk_method": "greedy",
+   "use_mla": false,
+   "v_head_dim": 0,
+   "vocab_size": 129280
  }
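
For orientation, a small sketch (not part of the repository) of how this custom configuration can be loaded and inspected through the `auto_map` entries above; it assumes `transformers` is installed and that executing the model's remote code is acceptable:

```python
# Sketch: load the custom DeepSeek-OCR config registered via auto_map above.
# trust_remote_code=True is needed because AutoConfig maps to modeling_deepseekocr.DeepseekOCRConfig.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-OCR", trust_remote_code=True)

print(cfg.model_type)       # expected: "deepseek_vl_v2"
print(cfg.hidden_size)      # expected: 1280
print(cfg.language_config)  # nested DeepSeek-V2-style language-model settings
print(cfg.vision_config)    # CLIP-L / SAM-ViT-B vision tower settings
```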