Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +98 -84
README.md CHANGED
@@ -1,84 +1,98 @@
- ---
- license: apache-2.0
- base_model:
- - Qwen/Qwen2.5-7B-Instruct
- ---
- # Valley 2.0
-
- <p align="center">
- <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/valley_logo.jpg" width="500"/>
- </p>
-
- <p align="center">
- 🎮️ <a href="https://github.com/bytedance/Valley">Github</a>&nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/bytedance-research/Valley-Eagle-7B">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp; 🤖 <a href="https://www.modelscope.cn/models/Hyggge/Valley-Eagle-7B">ModelScope</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="https://hyggge.github.io/projects/valley/index.html">Home Page</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📙 <a href="https://arxiv.org/abs/2501.05901">Paper</a>
- </p>
-
- ## Introduction
- Valley, developed by ByteDance, is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, image, and video data. When evaluated against models of the same scale, our model:
-
- - Achieved the best results on in-house e-commerce and short-video benchmarks
- - Demonstrated comparatively outstanding performance on OpenCompass (average score > 67)
-
- ## Release
- - [02/15] 🔥 Updated Valley-Eagle-DPO, which achieves 69.6 on OpenCompass, and updated AutoModel usage for checkpoints.
- - [01/13] 🔥 Released the tech report: [Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](https://arxiv.org/abs/2501.05901)
- - [12/23] Announced [Valley-Qwen2.5-7B](https://huggingface.co/ByteDance)!
-
- ## Valley-Eagle
- The foundational version of Valley is a multimodal large model that aligns SigLIP with Qwen2.5 and incorporates a LargeMLP and ConvAdapter to construct the projector.
-
- - In the final version, we also drew on Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and runs in parallel with the original visual tokens.
- - This enhancement bolsters the model's performance in extreme scenarios; we chose the Qwen2-VL VisionEncoder for this purpose.
-
- The model structure is shown below:
-
- <div style="display:flex;">
- <img src="valley_structure.jpeg" alt="Valley model structure" style="height:600px;" />
- </div>
-
- ## Environment Setup
- ```bash
- pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
- pip install -r requirements.txt
- ```
-
- ## License Agreement
- All of our open-source models are licensed under the [Apache-2.0](./LICENSE) license.
-
- ## Related Projects
- We list related projects below:
- - [Valley: Video Assistant with Large Language model Enhanced abilitY](https://github.com/RupertLuo/Valley)
- - [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
- - [Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders](https://github.com/NVlabs/EAGLE)
- - [LLaVA-CoT: Let Vision Language Models Reason Step-by-Step](https://github.com/PKU-YuanGroup/LLaVA-CoT)
- - [Qwen2.5](https://github.com/QwenLM/Qwen2.5)
-
- ## We are Hiring
- The Data-Ecommerce-Platform Governance-Basic Algorithms Team focuses on the research and development of multimodal large-model algorithms and foundational algorithms, and continues to delve deeply into this field. Our mission is to optimize algorithms and collaborate with business teams to comprehensively govern the quality and ecosystem of ByteDance's e-commerce products. The team currently has a strong demand for foundational algorithm expertise in NLP, CV, and multimodal technologies. We welcome inquiries and look forward to working on challenging projects with talented individuals like you!
-
- Location: Beijing / Shanghai / Singapore
-
- Contact & Resume Submission: wuheng.2024@bytedance.com
-
- > TikTok e-commerce: the basic algorithms team focuses on the R&D of multimodal large-model algorithms and foundational algorithms, and continues to dig deep in this direction. We look forward to doing challenging work with outstanding people like you (intern/full-time)!
- >
- > Locations: Beijing / Shanghai / Singapore
- >
- > Inquiries & resume submission: wuheng.2024@bytedance.com
-
- ## Citation
- ```
- @article{wu2025valley2,
-   title={Valley2: Exploring Multimodal Models with Scalable Vision-Language Design},
-   author={Wu, Ziheng and Chen, Zhenghao and Luo, Ruipu and Zhang, Can and Gao, Yuan and He, Zhentao and Wang, Xian and Lin, Haoran and Qiu, Minghui},
-   journal={arXiv preprint arXiv:2501.05901},
-   year={2025}
- }
- ```
+ ---
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---
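For quick reference, the three-letter values in the new `language` block are ISO 639-3 identifiers. The mapping to English names below is hand-written for illustration; it is not pulled from the ISO registry or any library:

```python
# Hand-written mapping of the ISO 639-3 codes added in this PR to their
# English language names (for illustration only; not from a registry).
LANGUAGE_TAGS = {
    "zho": "Chinese",
    "eng": "English",
    "fra": "French",
    "spa": "Spanish",
    "por": "Portuguese",
    "deu": "German",
    "ita": "Italian",
    "rus": "Russian",
    "jpn": "Japanese",
    "kor": "Korean",
    "vie": "Vietnamese",
    "tha": "Thai",
    "ara": "Arabic",
}

# Every ISO 639-3 identifier is exactly three lowercase ASCII letters.
assert all(len(c) == 3 and c.isascii() and c.isalpha() and c.islower()
           for c in LANGUAGE_TAGS)
```
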
+ # Valley 2.0
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/valley_logo.jpg" width="500"/>
+ </p>
+
+ <p align="center">
+ 🎮️ <a href="https://github.com/bytedance/Valley">Github</a>&nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/bytedance-research/Valley-Eagle-7B">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp; 🤖 <a href="https://www.modelscope.cn/models/Hyggge/Valley-Eagle-7B">ModelScope</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="https://hyggge.github.io/projects/valley/index.html">Home Page</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📙 <a href="https://arxiv.org/abs/2501.05901">Paper</a>
+ </p>
+
+ ## Introduction
+ Valley, developed by ByteDance, is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, image, and video data. When evaluated against models of the same scale, our model:
+
+ - Achieved the best results on in-house e-commerce and short-video benchmarks
+ - Demonstrated comparatively outstanding performance on OpenCompass (average score > 67)
+
+ ## Release
+ - [02/15] 🔥 Updated Valley-Eagle-DPO, which achieves 69.6 on OpenCompass, and updated AutoModel usage for checkpoints.
+ - [01/13] 🔥 Released the tech report: [Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](https://arxiv.org/abs/2501.05901)
+ - [12/23] Announced [Valley-Qwen2.5-7B](https://huggingface.co/ByteDance)!
+
+ ## Valley-Eagle
+ The foundational version of Valley is a multimodal large model that aligns SigLIP with Qwen2.5 and incorporates a LargeMLP and ConvAdapter to construct the projector.
+
+ - In the final version, we also drew on Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and runs in parallel with the original visual tokens.
+ - This enhancement bolsters the model's performance in extreme scenarios; we chose the Qwen2-VL VisionEncoder for this purpose.
+
+ The model structure is shown below:
+
+ <div style="display:flex;">
+ <img src="valley_structure.jpeg" alt="Valley model structure" style="height:600px;" />
+ </div>
+
+ ## Environment Setup
+ ```bash
+ pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
+ pip install -r requirements.txt
+ ```
+
+ ## License Agreement
+ All of our open-source models are licensed under the [Apache-2.0](./LICENSE) license.
+
+ ## Related Projects
+ We list related projects below:
+ - [Valley: Video Assistant with Large Language model Enhanced abilitY](https://github.com/RupertLuo/Valley)
+ - [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
+ - [Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders](https://github.com/NVlabs/EAGLE)
+ - [LLaVA-CoT: Let Vision Language Models Reason Step-by-Step](https://github.com/PKU-YuanGroup/LLaVA-CoT)
+ - [Qwen2.5](https://github.com/QwenLM/Qwen2.5)
+
+ ## We are Hiring
+ The Data-Ecommerce-Platform Governance-Basic Algorithms Team focuses on the research and development of multimodal large-model algorithms and foundational algorithms, and continues to delve deeply into this field. Our mission is to optimize algorithms and collaborate with business teams to comprehensively govern the quality and ecosystem of ByteDance's e-commerce products. The team currently has a strong demand for foundational algorithm expertise in NLP, CV, and multimodal technologies. We welcome inquiries and look forward to working on challenging projects with talented individuals like you!
+
+ Location: Beijing / Shanghai / Singapore
+
+ Contact & Resume Submission: wuheng.2024@bytedance.com
+
+ > TikTok e-commerce: the basic algorithms team focuses on the R&D of multimodal large-model algorithms and foundational algorithms, and continues to dig deep in this direction. We look forward to doing challenging work with outstanding people like you (intern/full-time)!
+ >
+ > Locations: Beijing / Shanghai / Singapore
+ >
+ > Inquiries & resume submission: wuheng.2024@bytedance.com
+
+ ## Citation
+ ```
+ @article{wu2025valley2,
+   title={Valley2: Exploring Multimodal Models with Scalable Vision-Language Design},
+   author={Wu, Ziheng and Chen, Zhenghao and Luo, Ruipu and Zhang, Can and Gao, Yuan and He, Zhentao and Wang, Xian and Lin, Haoran and Qiu, Minghui},
+   journal={arXiv preprint arXiv:2501.05901},
+   year={2025}
+ }
+ ```