feat: update README.md, add update structure image
- .gitattributes +2 -0
- README.md +10 -12
- valley_structure.jpeg → valley_structure.png +2 -2
.gitattributes
CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 valley_structure.jpeg filter=lfs diff=lfs merge=lfs -text
+valley_structure.png filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
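Note on the two added rules: `*.png` already matches `valley_structure.png`, so the file-specific rule looks redundant rather than wrong. The Python sketch below checks this using `fnmatch` as a rough stand-in for Git's own matcher, which is an approximation that only holds for slash-free patterns like these. The helper name `lfs_rules_matching` and the path `docs/figure.png` are hypothetical, used only for illustration.

```python
from fnmatch import fnmatchcase
import posixpath

# Slash-free patterns copied from the .gitattributes hunk in this commit.
LFS_PATTERNS = [
    "*.zst",
    "*tfevents*",
    "valley_structure.jpeg",
    "valley_structure.png",
    "*.png",
]


def lfs_rules_matching(path: str) -> list[str]:
    """Return the patterns from LFS_PATTERNS that match `path`.

    Approximation: a slash-free pattern is compared against the file's
    basename, which mirrors how Git treats such patterns at any depth.
    """
    name = posixpath.basename(path)
    return [pattern for pattern in LFS_PATTERNS if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # The renamed image is covered by both new rules.
    print(lfs_rules_matching("valley_structure.png"))
    # Any other PNG (hypothetical path) is covered by the wildcard alone.
    print(lfs_rules_matching("docs/figure.png"))
    # Plain text files stay outside LFS.
    print(lfs_rules_matching("README.md"))
```

If the sketch reflects Git's behaviour here, dropping the `valley_structure.png` line would leave LFS tracking unchanged.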
README.md
CHANGED
@@ -3,7 +3,7 @@ license: apache-2.0
 base_model:
 - Qwen/Qwen2.5-7B-Instruct
 ---
-#
+# Valley2
 
 <p align="center">
 <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/valley_logo.jpg" width="500"/>
@@ -17,25 +17,23 @@ base_model:
 Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data, which is developed by ByteDance. Our model not only
 
 - Achieved the best results in the inhouse e-commerce and short-video benchmarks
-- Demonstrated comparatively outstanding performance in the OpenCompass
-
-when evaluated against models of the same scale.
+- Demonstrated comparatively outstanding performance in the OpenCompass leaderboard when evaluated against models of the same scale.
 
 ## Release
-- [02/15] 🔥 Update
-- [01/13] 🔥 Release TechReport. [Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](https://arxiv.org/abs/2501.05901)
-- [12/23] Announcing [
+- [2025/02/15] 🔥 Update Valley2-DPO, achieve 69.6 on OpenCompass and update AutoModel usage for checkpoints.
+- [2025/01/13] 🔥 Release TechReport. [Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](https://arxiv.org/abs/2501.05901)
+- [2024/12/23] 🔥 Announcing [Valley2](https://huggingface.co/ByteDance) (Valley-Eagle-7B) !
 
-##
-The foundational version of
+## Architecture
+The foundational version of Valley2 is a multimodal large model aligned with Siglip and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.
 
 - In the final version, we also referenced Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and is parallelized with the original visual tokens.
 - This enhancement supplements the model’s performance in extreme scenarios, and we chose the Qwen2vl VisionEncoder for this purpose.
 
-
+The model structure is shown as follows:
 
-<div style="display:flex;">
-<img src="valley_structure.
+<div style="display: flex;">
+<img src="valley_structure.png" alt="opencompass" style="width: 100%; height: auto;" />
 </div>
 
 
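Note on the Architecture text in this hunk: it describes an additional vision encoder whose token count can be adjusted and whose output runs in parallel with the original visual tokens. The toy Python sketch below illustrates only that idea and is not Valley2's implementation. The function names, token counts, embedding width, and the choice to join the two streams along the sequence axis are all assumptions for illustration, since the README does not specify them.

```python
from typing import List

Token = List[float]  # one visual token, represented as a plain list of floats


def fixed_path(dim: int, n_tokens: int = 4) -> List[Token]:
    """Hypothetical stand-in for the original path: a fixed token count."""
    return [[0.0] * dim for _ in range(n_tokens)]


def flexible_path(dim: int, token_budget: int) -> List[Token]:
    """Hypothetical stand-in for the extra encoder: the caller sets the count."""
    return [[1.0] * dim for _ in range(token_budget)]


def merge_parallel(a: List[Token], b: List[Token]) -> List[Token]:
    """Join both streams along the sequence axis (an assumed merge rule)."""
    if a and b and len(a[0]) != len(b[0]):
        raise ValueError("both paths must share one embedding width")
    return a + b


if __name__ == "__main__":
    for budget in (2, 8):
        tokens = merge_parallel(
            fixed_path(dim=3),
            flexible_path(dim=3, token_budget=budget),
        )
        # The total grows with the budget; the fixed path stays constant.
        print(budget, len(tokens))
```

In this toy setup, raising `token_budget` changes only the second stream's length, which is one way to read "flexibly adjust the number of tokens".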
valley_structure.jpeg → valley_structure.png
RENAMED
File without changes