sungjunhan-trl and nielsr (HF Staff) committed on
Commit 94e705e · 1 Parent(s): b5b63bd

Improve model card: Add pipeline tag, library name, and sample usage (#1)


- Improve model card: Add pipeline tag, library name, and sample usage (774aeb3713cef4f9a6f40b727bf1c19e94b24f85)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +120 -17
README.md CHANGED
@@ -1,15 +1,14 @@
  ---
- license: apache-2.0
  language:
  - en
  - ko
  ---

-
-
-
-
- # gWorld-32B 🌍📱

  <p align="center">
  <picture>
@@ -50,7 +49,7 @@ language:
  </p>

  **gWorld-32B** establishes a new **Pareto frontier** in the trade-off between model size and GUI world modeling accuracy.
- - **Efficiency:** Outperforms frontier models up to **12.6x larger** (e.g., `Llama 4 402B0-A17B`) on GUI-specific benchmarks.
  - **Accuracy:** Achieves a **+27.1% gain** in Instruction Accuracy (IAcc.) over the base Qwen3-VL model.
  - **Zero-Shot Generalization:** Demonstrated high performance on out-of-distribution benchmarks like AndroidWorld and KApps (Korean).

@@ -64,21 +63,125 @@ The model treats the mobile interface as a coordinate space and predicts how tha
  By outputting HTML/CSS, gWorld ensures that text remains perfectly sharp and layouts are responsive.
  - **High Renderability:** <1% render failure rate.
  - **Speed:** Rendering via Playwright takes ~0.3s, significantly faster than multi-step diffusion pipelines.
- - **Setup:** For rendering utilities, visit the [official GitHub](https://github.com/trillion-labs/gWorld).

  ## License and Contact
  This model is licensed under the Apache License 2.0. For inquiries, please contact: info@trillionlabs.co

-
  ## Citation
- ```
  @misc{koh2026generativevisualcodemobile,
-       title={Generative Visual Code Mobile World Models},
-       author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
-       year={2026},
-       eprint={2602.01576},
-       archivePrefix={arXiv},
-       primaryClass={cs.LG},
-       url={https://arxiv.org/abs/2602.01576},
  }
  ```
 
  ---
  language:
  - en
  - ko
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ base_model: Qwen/Qwen3-VL-32B
  ---

+ # gWorld-32B 🌍📱

  <p align="center">
  <picture>
 
  </p>

  **gWorld-32B** establishes a new **Pareto frontier** in the trade-off between model size and GUI world modeling accuracy.
+ - **Efficiency:** Outperforms frontier models up to **12.6x larger** (e.g., `Llama 4 402B-A17B`) on GUI-specific benchmarks.
  - **Accuracy:** Achieves a **+27.1% gain** in Instruction Accuracy (IAcc.) over the base Qwen3-VL model.
  - **Zero-Shot Generalization:** Demonstrated high performance on out-of-distribution benchmarks like AndroidWorld and KApps (Korean).
 
 
  By outputting HTML/CSS, gWorld ensures that text remains perfectly sharp and layouts are responsive.
  - **High Renderability:** <1% render failure rate.
  - **Speed:** Rendering via Playwright takes ~0.3s, significantly faster than multi-step diffusion pipelines.
+
+ ## Sample Usage
+
+ ### Inference with vLLM
+
+ To run the model, use the following snippet from the official repository:
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoProcessor
+ from PIL import Image
+
+ # Model configuration
+ MODEL_PATH = "trillionlabs/gWorld-32B"
+ BASE_MODEL = "Qwen/Qwen3-VL-32B"
+
+ # Image processing settings
+ MM_PROCESSOR_KWARGS = {
+     "max_pixels": 4233600,
+     "min_pixels": 3136,
+ }
+
+ # Load model
+ llm = LLM(
+     model=MODEL_PATH,
+     tokenizer=BASE_MODEL,
+     tensor_parallel_size=8,
+     gpu_memory_utilization=0.9,
+     max_model_len=19384,
+     trust_remote_code=True,
+     mm_processor_kwargs=MM_PROCESSOR_KWARGS,
+     enable_chunked_prefill=True,
+     max_num_batched_tokens=16384,
+ )
+
+ # Load processor for chat template
+ processor = AutoProcessor.from_pretrained(BASE_MODEL, trust_remote_code=True)
+
+ # Prepare input
+ image = Image.open("screenshot.png")  # Replace with your screenshot
+ if image.mode != "RGB":
+     image = image.convert("RGB")
+
+ action = '{"action_type": "TAP", "coordinates": [512, 890]}'
+
+ # World model prompt template
+ user_content = f"""You are an expert mobile UI World Model that can accurately predict the next state given an action.
+ Given a screenshot of a mobile interface and an action, you must generate clean, responsive HTML code that represents the state of the interface AFTER the action is performed.
+ First generate reasoning about what the next state should look like based on the action.
+ Afterwards, generate the HTML code representing the next state that logically follows the action.
+ You will render this HTML in a mobile viewport to see how similar it looks and acts like the mobile screenshot.
+
+ Requirements:
+ 1. Provide reasoning about what the next state should look like based on the action
+ 2. Generate complete, valid HTML5 code
+ 3. Choose between using inline CSS and utility classes from Bootstrap, Tailwind CSS, or MUI for styling, depending on which option generates the closest code to the screenshot.
+ 4. Use mobile-first design principles matching screenshot dimensions.
+ 5. For images, use inline SVG placeholders with explicit width and height attributes that match the approximate dimensions from the screenshot. Matching the approximate color is also good.
+ 6. Use modern web standards and best practices
+ 7. Return ONLY the HTML code, no explanations or markdown formatting
+ 8. The generated HTML should render properly in a mobile viewport.
+ 9. Generated HTML should look like the screen that logically follows the current screen and the action.
+
+ Action:
+ {action}
+
+ Output format:
+ # Next State Reasoning: <your reasoning about what the next state should look like>
+ # HTML: <valid_html_code>
+
+ Generate the next state reasoning and the next state in html:"""
+
+ # Build messages
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": user_content},
+         ],
+     }
+ ]
+
+ # Apply chat template
+ prompt = processor.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ # Generation parameters
+ sampling_params = SamplingParams(
+     max_tokens=15000,
+     temperature=0,
+     seed=42,
+     top_p=1.0,
+ )
+
+ # Generate
+ outputs = llm.generate(
+     [{"prompt": prompt, "multi_modal_data": {"image": image}}],
+     sampling_params=sampling_params,
+ )
+
+ print(outputs[0].outputs[0].text)
+ ```
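The prompt above asks the model to answer in a fixed layout: a `# Next State Reasoning:` section followed by `# HTML:`. A minimal sketch of post-processing that splits the raw completion along those markers (the helper name `split_world_model_output` is ours for illustration, not from the gWorld repository):

```python
# Hypothetical helper (not part of the gWorld repo): split the model's raw
# completion into reasoning text and HTML, following the
# "# Next State Reasoning: ... # HTML: ..." format requested by the prompt.
def split_world_model_output(text: str) -> tuple[str, str]:
    head, sep, html = text.partition("# HTML:")
    if not sep:
        # No marker found: treat the whole completion as HTML.
        return "", text.strip()
    reasoning = head.replace("# Next State Reasoning:", "", 1).strip()
    return reasoning, html.strip()

# Example completion in the prompted format
sample = (
    "# Next State Reasoning: Tapping [512, 890] opens the settings screen.\n"
    "# HTML: <!DOCTYPE html><html><body>Settings</body></html>"
)
reasoning, html = split_world_model_output(sample)
```

The extracted `html` string can then be rendered in a mobile viewport (e.g., via Playwright, as the card describes) and screenshotted as the predicted next state.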

  ## License and Contact
  This model is licensed under the Apache License 2.0. For inquiries, please contact: info@trillionlabs.co

  ## Citation
+ ```bibtex
  @misc{koh2026generativevisualcodemobile,
+   title={Generative Visual Code Mobile World Models},
+   author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
+   year={2026},
+   eprint={2602.01576},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG},
+   url={https://arxiv.org/abs/2602.01576},
  }
  ```