Hyggge committed on
Commit 92a8c71 · 1 Parent(s): 3e3a71d

feat: update README.md, add update structure image

.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  valley_structure.jpeg filter=lfs diff=lfs merge=lfs -text
+ valley_structure.png filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
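Review note: the two added `.gitattributes` lines overlap, since the `*.png` wildcard also matches `valley_structure.png`. A minimal sketch of how to check which LFS patterns cover a given file is below. It assumes basename-only patterns and uses Python's `fnmatchcase` as an approximation of Git's attribute matching (it does not handle `**` or path separators the way Git does); the helper names `lfs_patterns` and `matching_patterns` are hypothetical and not part of this repository.

```python
from fnmatch import fnmatchcase

# Entries copied from the .gitattributes hunk in this commit.
GITATTRIBUTES = """\
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
valley_structure.jpeg filter=lfs diff=lfs merge=lfs -text
valley_structure.png filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(text):
    """Return the patterns whose attribute list includes filter=lfs."""
    patterns = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and comments; keep only LFS-filtered patterns.
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def matching_patterns(filename, patterns):
    """Return every pattern that matches the given file name (approximation)."""
    return [p for p in patterns if fnmatchcase(filename, p)]


patterns = lfs_patterns(GITATTRIBUTES)
png_matches = matching_patterns("valley_structure.png", patterns)
print(png_matches)
```

Under these assumptions the renamed image is matched by both the explicit entry and the wildcard, so the explicit `valley_structure.png` line looks redundant, though harmless.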
README.md CHANGED
@@ -3,7 +3,7 @@ license: apache-2.0
  base_model:
  - Qwen/Qwen2.5-7B-Instruct
  ---
- # Valley 2.0
+ # Valley2
 
  <p align="center">
  <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/valley_logo.jpg" width="500"/>
@@ -17,25 +17,23 @@ base_model:
  Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data, which is developed by ByteDance. Our model not only
 
  - Achieved the best results in the inhouse e-commerce and short-video benchmarks
- - Demonstrated comparatively outstanding performance in the OpenCompass (average scores > 67) tests
-
- when evaluated against models of the same scale.
+ - Demonstrated comparatively outstanding performance in the OpenCompass leaderboard when evaluated against models of the same scale.
 
  ## Release
- - [02/15] 🔥 Update Valley-Eagle-DPO, achieve 69.6 on OpenCompass and update AutoModel usage for checkpoints.
- - [01/13] 🔥 Release TechReport. [Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](https://arxiv.org/abs/2501.05901)
- - [12/23] Announcing [Valley-Qwen2.5-7B](https://huggingface.co/ByteDance)!
+ - [2025/02/15] 🔥 Update Valley2-DPO, achieve 69.6 on OpenCompass and update AutoModel usage for checkpoints.
+ - [2025/01/13] 🔥 Release TechReport. [Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](https://arxiv.org/abs/2501.05901)
+ - [2024/12/23] 🔥 Announcing [Valley2](https://huggingface.co/ByteDance) (Valley-Eagle-7B) !
 
- ## Valley-Eagle
- The foundational version of Valley is a multimodal large model aligned with Siglip and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.
+ ## Architecture
+ The foundational version of Valley2 is a multimodal large model aligned with Siglip and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.
 
  - In the final version, we also referenced Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and is parallelized with the original visual tokens.
  - This enhancement supplements the model’s performance in extreme scenarios, and we chose the Qwen2vl VisionEncoder for this purpose.
 
- and the model structure is shown as follows:
+ The model structure is shown as follows:
 
- <div style="display:flex;">
- <img src="valley_structure.jpeg" alt="opencompass" style="height:600px;" />
+ <div style="display: flex;">
+ <img src="valley_structure.png" alt="opencompass" style="width: 100%; height: auto;" />
  </div>
 
 
valley_structure.jpeg → valley_structure.png RENAMED
File without changes