Improve model card metadata and documentation

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +34 -29
README.md CHANGED
@@ -1,14 +1,26 @@
 
 
 
 
 
 # A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

 <div align="center" style="line-height: 1;">
- <a href="https://arxiv.org/abs/2509.03498" target="_blank" style="margin: 2px;">
 <img alt="Arxiv" src="https://img.shields.io/badge/Wallaroo-Paper-red?logo=arxiv&logoColor=red" fill-opacity="1" style="display: inline-block; vertical-align: middle;"/>
 </a>
 <a href="https://huggingface.co/jiezhueval/Wallaroo" target="_blank" style="margin: 2px;">
 <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Wallaroo-Model-yellow" style="display: inline-block; vertical-align: middle;"/>
 </a>
 </div>


 <p align="center">
 <img src="overview.png" height=400>
@@ -16,7 +28,7 @@

 ## Why did we develop Wallaroo?

- It is widely acknowledged that unifying understanding, generation, and editing has become an inevitable trend. To achieve this, autoregressive paradigm, as a representative choice, has been naturally considered. To advance this direction and establish a benchmark, we introduce Wallaroo, a straightforward autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as bilingual support for both Chinese and English. In a nutshell, Wallaroo is a comprehensive comparison baseline model.

 ## Getting Started

@@ -29,15 +41,16 @@ It is widely acknowledged that unifying understanding, generation, and editing h
 pip3 install -r requirements.txt
 ```
 - Download the [Wallaroo 7B](https://huggingface.co/jiezhueval/Wallaroo)
- - Download the [LLamaGen Tokenizer](https://huggingface.co/peizesun/llamagen_t2i/resolve/main/vq_ds16_t2i.pt)

 ### Evaluation

 #### Visual Understanding

 - Download the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)
- - Add the code in vlm/qwen2_vl/model.py 313L to allow Wallaroo 7B checkpoint loading
- ```
 else:
 self.model = MODEL_CLS.from_pretrained(
 model_path, torch_dtype='auto', device_map="auto", attn_implementation='flash_attention_2'
@@ -53,14 +66,14 @@ else:
 new_dict = {}
 for key, value in resume_checkpoint['state_dict'].items():

- if 'visual' in key:
- new_dict[key.replace('wallaroo', 'model')] = value

- elif 'model' in key:
- new_dict[key.replace('model', 'language_model').replace('wallaroo', 'model')] = value

- elif 'lm_head' in key:
- new_dict['lm_head.weight'] = value


 m, u = self.model.load_state_dict(new_dict, strict=False)
@@ -68,44 +81,36 @@ else:

 self.model.eval()
 ```
- - Follow the instructions in VLMEvalKit

 #### Image Generation

- ```
 cd scripts/evaluate
 sh test_ar_t2i.sh
 ```

 #### Image Editing

- ```
 cd scripts/evaluate
 sh test_ar_i2i.sh
 ```

 ### Training

- See examples/wallaroo/ar_wallaroo_7
-
- This folder contains the config yaml files and corresponding training python files from different stages.
-
- One can see detailed command in train_script.sh.

 ## Citation
- ```
 @article{Zhu2026Simple,
- title = {# A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction},
- author = {Jie Zhu, Hanghang Ma, Jia Wang, Yayong Guan, Yanbing Zeng, Lishuai Gao,
- Junqiang Wu, Jie Hu, Leye Wang},
- journal = {arXiv preprint arXiv:2603.04980},
- year = {2026}
 }
 ```

-
-
 ## Acknowledgments

- This work is built on Qwen2.5 VL, Show-o, and LLamaGen. Thanks for their wonderful open-source work.
-
 
+ ---
+ pipeline_tag: any-to-any
+ library_name: transformers
+ ---
+
 # A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction

 <div align="center" style="line-height: 1;">
+ <a href="https://huggingface.co/papers/2603.04980" target="_blank" style="margin: 2px;">
 <img alt="Arxiv" src="https://img.shields.io/badge/Wallaroo-Paper-red?logo=arxiv&logoColor=red" fill-opacity="1" style="display: inline-block; vertical-align: middle;"/>
 </a>
+ <a href="https://github.com/JiePKU/Wallaroo" target="_blank" style="margin: 2px;">
+ <img alt="GitHub" src="https://img.shields.io/badge/Wallaroo-Code-blue?logo=github" style="display: inline-block; vertical-align: middle;"/>
+ </a>
 <a href="https://huggingface.co/jiezhueval/Wallaroo" target="_blank" style="margin: 2px;">
 <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Wallaroo-Model-yellow" style="display: inline-block; vertical-align: middle;"/>
 </a>
 </div>

+ Wallaroo is a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing. It supports multi-resolution image input and output, and is bilingual in Chinese and English.
+
+ - **Paper:** [A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction](https://huggingface.co/papers/2603.04980)
+ - **Code:** [https://github.com/JiePKU/Wallaroo](https://github.com/JiePKU/Wallaroo)

 <p align="center">
 <img src="overview.png" height=400>
 

 ## Why did we develop Wallaroo?

+ It is widely acknowledged that unifying understanding, generation, and editing has become an inevitable trend. To achieve this, the autoregressive paradigm, as a representative choice, has been naturally considered. To advance this direction and establish a benchmark, we introduce Wallaroo, a straightforward autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. In a nutshell, Wallaroo is a comprehensive comparison baseline model.

 ## Getting Started

 
 
 pip3 install -r requirements.txt
 ```
 - Download the [Wallaroo 7B](https://huggingface.co/jiezhueval/Wallaroo)
+ - Download the [LLamaGen Tokenizer](https://huggingface.co/peizesun/llamagen_t2i/resolve/main/vq_ds16_t2i.pt)

 ### Evaluation

 #### Visual Understanding

 - Download the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)
+ - Add the following code in `vlm/qwen2_vl/model.py` (around line 313) to allow Wallaroo 7B checkpoint loading:
+
+ ```python
 else:
 self.model = MODEL_CLS.from_pretrained(
 model_path, torch_dtype='auto', device_map="auto", attn_implementation='flash_attention_2'
 
 new_dict = {}
 for key, value in resume_checkpoint['state_dict'].items():

+ if 'visual' in key:
+ new_dict[key.replace('wallaroo', 'model')] = value

+ elif 'model' in key:
+ new_dict[key.replace('model', 'language_model').replace('wallaroo', 'model')] = value

+ elif 'lm_head' in key:
+ new_dict['lm_head.weight'] = value


 m, u = self.model.load_state_dict(new_dict, strict=False)
 

 self.model.eval()
 ```
+ - Follow the instructions in VLMEvalKit.
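
The renaming rules in the checkpoint-loading patch above can be exercised in isolation on a toy `state_dict`. The key names below are illustrative stand-ins, not the real Wallaroo checkpoint layout:

```python
# Illustrative only: toy keys that mimic the patch's renaming rules;
# the actual checkpoint keys and tensor values will differ.

def remap_keys(state_dict):
    """Apply the same key-renaming rules as the VLMEvalKit patch."""
    new_dict = {}
    for key, value in state_dict.items():
        if 'visual' in key:
            # vision-tower weights: wallaroo.* -> model.*
            new_dict[key.replace('wallaroo', 'model')] = value
        elif 'model' in key:
            # language-model weights: model.* -> language_model.*,
            # then the wallaroo prefix -> model
            new_dict[key.replace('model', 'language_model').replace('wallaroo', 'model')] = value
        elif 'lm_head' in key:
            new_dict['lm_head.weight'] = value
    return new_dict

toy = {
    'wallaroo.visual.patch_embed.weight': 1,
    'wallaroo.model.layers.0.mlp.up_proj.weight': 2,
    'lm_head.weight': 3,
}
print(sorted(remap_keys(toy)))
# -> ['lm_head.weight',
#     'model.language_model.layers.0.mlp.up_proj.weight',
#     'model.visual.patch_embed.weight']
```

Note that the chain checks `'visual'` before `'model'`, so any key containing both substrings takes the vision branch.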
 
 #### Image Generation

+ ```bash
 cd scripts/evaluate
 sh test_ar_t2i.sh
 ```

 #### Image Editing

+ ```bash
 cd scripts/evaluate
 sh test_ar_i2i.sh
 ```
 
 ### Training

+ See `examples/wallaroo/ar_wallaroo_7` in the official repository. This folder contains the config YAML files and the corresponding training Python files for the different stages. Detailed commands can be found in `train_script.sh`.

 ## Citation
+ ```bibtex
 @article{Zhu2026Simple,
+ title = {A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction},
+ author = {Jie Zhu and Hanghang Ma and Jia Wang and Yayong Guan and Yanbing Zeng and Lishuai Gao and Junqiang Wu and Jie Hu and Leye Wang},
+ journal = {arXiv preprint arXiv:2603.04980},
+ year = {2026}
 }
 ```

 ## Acknowledgments

+ This work is built on Qwen2.5 VL, Show-o, and LLamaGen. Thanks for their wonderful open-source work.