feat: update README.md
README.md
CHANGED
|
@@ -1,3 +1,81 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
base_model:
|
| 4 |
+
- Qwen/Qwen2.5-7B-Instruct
|
| 5 |
+
---
|
| 6 |
+
# Valley2
|
| 7 |
+
|
| 8 |
+
<p align="center">
|
| 9 |
+
<img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/valley_logo.jpg" width="500"/>
|
| 10 |
+
</p>
|
| 11 |
+
|
| 12 |
+
<p align="center">
|
| 13 |
+
🎮️ <a href="https://github.com/bytedance/Valley">Github</a>   |    🤗 <a href="https://huggingface.co/bytedance-research/Valley-Eagle-7B">Hugging Face</a>   |   🤖 <a href="https://www.modelscope.cn/models/Hyggge/Valley-Eagle-7B">ModelScope</a>    |    📑 <a href="https://hyggge.github.io/projects/valley/index.html">Home Page</a>    |    📙 <a href="https://arxiv.org/abs/2501.05901">Paper</a>
|
| 14 |
+
</p>
|
| 15 |
+
|
| 16 |
+
## Introduction
|
| 17 |
+
Valley is a multimodal large model developed by ByteDance and designed to handle a variety of tasks involving text, image, and video data. Our model:
|
| 18 |
+
|
| 19 |
+
- Achieved the best results on our in-house e-commerce and short-video benchmarks, much better than other SOTA open-source models.
|
| 20 |
+
- Demonstrated comparatively strong performance on the OpenCompass benchmark.
|
| 21 |
+
|
| 22 |
+
## Release
|
| 23 |
+
- [2025/10/26] 🔥🔥🔥 Released [Valley3](https://huggingface.co/bytedance-research/Valley3), which significantly enhances multimodal understanding and reasoning capabilities and achieves 74.4 on the OpenCompass Multi-modal Academic Leaderboard!
|
| 24 |
+
- [2025/02/15] 🔥 Released [Valley2-DPO](https://huggingface.co/bytedance-research/Valley2-DPO), which achieves 69.6 on the OpenCompass Multi-modal Academic Leaderboard, and updated the AutoModel usage for checkpoints.
|
| 25 |
+
- [2025/01/13] 🔥 Released the technical report: [Valley2: Exploring Multimodal Models with Scalable Vision-Language Design](https://arxiv.org/abs/2501.05901).
|
| 26 |
+
- [2024/12/23] 🔥 Announcing [Valley2](https://huggingface.co/bytedance-research/Valley-Eagle-7B) (Valley-Eagle-7B)!
|
| 27 |
+
|
| 28 |
+
## Architecture
|
| 29 |
+
For the LLM, we use Qwen3-8B-Base, chosen for its strong reasoning and language comprehension abilities. The vision encoder is Qwen2-VL-ViT, which processes dynamic-resolution inputs and is more robust than the commonly used tiling approach on images with extreme aspect ratios. The projector applies 2×2 pixel-shuffle downsampling to the visual tokens, followed by a two-layer MLP with a 64k hidden dimension, providing high alignment capacity between modalities.
|
| 30 |
+
This architectural design ensures that Valley3 achieves a balanced trade-off between representational power, computational efficiency, and multimodal adaptability.
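As a rough illustration of what 2×2 pixel-shuffle downsampling does to the visual tokens, the sketch below merges each 2×2 block of neighbouring tokens into a single token by concatenating their channels, so the token count drops by 4× while the channel dimension grows by 4×. This is a minimal pure-Python sketch, not the Valley implementation: the function name `pixel_shuffle_downsample`, the nested-list token layout, and the channel ordering inside each merged token are all assumptions made for illustration, and the two-layer MLP that follows in the projector is omitted.

```python
from typing import List

Vector = List[float]


def pixel_shuffle_downsample(grid: List[List[Vector]], factor: int = 2) -> List[List[Vector]]:
    """Merge each factor x factor block of tokens into one wider token.

    `grid` is a hypothetical H x W layout of token vectors (each of length C).
    The result is an (H / factor) x (W / factor) grid whose tokens have
    length C * factor * factor.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid height and width must be divisible by factor")

    merged: List[List[Vector]] = []
    for top in range(0, height, factor):
        row: List[Vector] = []
        for left in range(0, width, factor):
            token: Vector = []
            # Concatenate the block's tokens in row-major order (assumed ordering).
            for dy in range(factor):
                for dx in range(factor):
                    token.extend(grid[top + dy][left + dx])
            row.append(token)
        merged.append(row)
    return merged


# A 4 x 6 grid of 3-dimensional tokens becomes a 2 x 3 grid of 12-dimensional tokens.
grid = [[[float(r), float(c), 1.0] for c in range(6)] for r in range(4)]
merged = pixel_shuffle_downsample(grid)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 3 12
```

In this layout, 24 input tokens are reduced to 6 before they reach the MLP, which is the source of the reduction in visual sequence length.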
|
| 31 |
+
|
| 32 |
+
The overall architecture is shown below:
|
| 33 |
+
|
| 34 |
+
<div style="display: flex;">
|
| 35 |
+
<img src="valley_structure.png" alt="Valley architecture" style="width: 100%; height: auto;" />
|
| 36 |
+
</div>
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
## Environment Setup
|
| 40 |
+
``` bash
|
| 41 |
+
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
|
| 42 |
+
pip install -r requirements.txt
|
| 43 |
+
```
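After installation, it can help to confirm that the pinned versions are the ones actually present in the environment. The following is a small sketch using only the standard library. The helper name `check_pins` and the report format are assumptions for illustration and are not part of this repository.

```python
from importlib import metadata

# Versions pinned by the install command above.
PINNED = {"torch": "2.4.0", "torchvision": "0.19.0", "torchaudio": "2.4.0"}


def check_pins(pins):
    """Return a {package: status} report comparing installed versions to pins."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # CUDA wheels may carry a local suffix such as "+cu121"; compare the base version.
        base = installed.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"mismatch (found {installed})"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINNED).items():
        print(f"{name}: {status}")
```

A `missing` or `mismatch` entry suggests rerunning the `pip install` commands in a clean virtual environment.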
|
| 44 |
+
|
| 45 |
+
## License Agreement
|
| 46 |
+
All of our open-source models are licensed under the [Apache-2.0](./LICENSE) license.
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
## Related Projects
|
| 50 |
+
We list related projects below:
|
| 51 |
+
- [Valley: Video Assistant with Large Language model Enhanced abilitY](https://github.com/RupertLuo/Valley)
|
| 52 |
+
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
|
| 53 |
+
- [Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders](https://github.com/NVlabs/EAGLE)
|
| 54 |
+
- [LLaVA-CoT: Let Vision Language Models Reason Step-by-Step](https://github.com/PKU-YuanGroup/LLaVA-CoT)
|
| 55 |
+
- [Qwen2.5](https://github.com/QwenLM/Qwen2.5)
|
| 56 |
+
|
| 60 |
+
## We are Hiring
|
| 61 |
+
The Data-Ecommerce-Platform Governance-Basic Algorithms Team focuses on the research and development of multimodal large model algorithms and foundational algorithms. Our mission is to optimize algorithms and collaborate with business teams to comprehensively govern the quality and ecosystem of ByteDance's e-commerce products. The team currently has strong demand for foundational algorithm expertise in NLP, CV, and multimodal technologies. We welcome inquiries and look forward to working on challenging projects with talented individuals like you!
|
| 62 |
+
|
| 63 |
+
Location: Beijing / Shanghai / Singapore
|
| 64 |
+
|
| 65 |
+
Contact & Resume Submission: wuheng.2024@bytedance.com
|
| 66 |
+
|
| 67 |
+
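> TikTok E-commerce: the Basic Algorithms Team focuses on the research and development of multimodal large model algorithms and foundational algorithms, and continues to invest deeply in this area. We look forward to taking on challenging work together with talented people like you (internship / full-time)!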
|
| 68 |
+
>
|
| 69 |
+
> Locations: Beijing / Shanghai / Singapore
|
| 70 |
+
>
|
| 71 |
+
> Inquiries & resume submission: wuheng.2024@bytedance.com
|
| 72 |
+
|
| 73 |
+
## Citation
|
| 74 |
+
```
|
| 75 |
+
@article{wu2025valley2,
|
| 76 |
+
title={Valley2: Exploring Multimodal Models with Scalable Vision-Language Design},
|
| 77 |
+
author={Wu, Ziheng and Chen, Zhenghao and Luo, Ruipu and Zhang, Can and Gao, Yuan and He, Zhentao and Wang, Xian and Lin, Haoran and Qiu, Minghui},
|
| 78 |
+
journal={arXiv preprint arXiv:2501.05901},
|
| 79 |
+
year={2025}
|
| 80 |
+
}
|
| 81 |
+
```
|