Improve model card: Add pipeline tag, library name, and sample usage

#1 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +120 -17
README.md CHANGED
@@ -1,15 +1,14 @@
 ---
-license: apache-2.0
 language:
 - en
 - ko
 ---

-
-
-
-
-# gWorld-32B 🌍📱

 <p align="center">
 <picture>
@@ -51,7 +50,7 @@ language:
 </p>

 **gWorld-32B** establishes a new **Pareto frontier** in the trade-off between model size and GUI world modeling accuracy.
-- **Efficiency:** Outperforms frontier models up to **12.6x larger** (e.g., `Llama 4 402B0-A17B`) on GUI-specific benchmarks.
 - **Accuracy:** Achieves a **+27.1% gain** in Instruction Accuracy (IAcc.) over the base Qwen3-VL model.
 - **Zero-Shot Generalization:** Demonstrated high performance on out-of-distribution benchmarks like AndroidWorld and KApps (Korean).
 
@@ -65,21 +64,125 @@ The model treats the mobile interface as a coordinate space and predicts how tha
 By outputting HTML/CSS, gWorld ensures that text remains perfectly sharp and layouts are responsive.
 - **High Renderability:** <1% render failure rate.
 - **Speed:** Rendering via Playwright takes ~0.3s, significantly faster than multi-step diffusion pipelines.
-- **Setup:** For rendering utilities, visit the [official GitHub](https://github.com/trillion-labs/gWorld).

 ## License and Contact
 This model is licensed under the Apache License 2.0. For inquiries, please contact: info@trillionlabs.co

-
 ## Citation
-```
 @misc{koh2026generativevisualcodemobile,
-      title={Generative Visual Code Mobile World Models},
-      author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
-      year={2026},
-      eprint={2602.01576},
-      archivePrefix={arXiv},
-      primaryClass={cs.LG},
-      url={https://arxiv.org/abs/2602.01576},
 }
 ```
 
 ---
 language:
 - en
 - ko
+license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
+base_model: Qwen/Qwen3-VL-32B
 ---

+# gWorld-32B 🌍📱

 <p align="center">
 <picture>
 
 </p>

 **gWorld-32B** establishes a new **Pareto frontier** in the trade-off between model size and GUI world modeling accuracy.
+- **Efficiency:** Outperforms frontier models up to **12.6x larger** (e.g., `Llama 4 402B-A17B`) on GUI-specific benchmarks.
 - **Accuracy:** Achieves a **+27.1% gain** in Instruction Accuracy (IAcc.) over the base Qwen3-VL model.
 - **Zero-Shot Generalization:** Demonstrated high performance on out-of-distribution benchmarks like AndroidWorld and KApps (Korean).
 
 
 By outputting HTML/CSS, gWorld ensures that text remains perfectly sharp and layouts are responsive.
 - **High Renderability:** <1% render failure rate.
 - **Speed:** Rendering via Playwright takes ~0.3s, significantly faster than multi-step diffusion pipelines.
+
+## Sample Usage
+
+### Inference with vLLM
+
+You can run inference with the following snippet from the official repository:
+
+```python
+from vllm import LLM, SamplingParams
+from transformers import AutoProcessor
+from PIL import Image
+
+# Model configuration
+MODEL_PATH = "trillionlabs/gWorld-32B"
+BASE_MODEL = "Qwen/Qwen3-VL-32B"
+
+# Image processing settings
+MM_PROCESSOR_KWARGS = {
+    "max_pixels": 4233600,
+    "min_pixels": 3136,
+}
+
+# Load model
+llm = LLM(
+    model=MODEL_PATH,
+    tokenizer=BASE_MODEL,
+    tensor_parallel_size=8,
+    gpu_memory_utilization=0.9,
+    max_model_len=19384,
+    trust_remote_code=True,
+    mm_processor_kwargs=MM_PROCESSOR_KWARGS,
+    enable_chunked_prefill=True,
+    max_num_batched_tokens=16384,
+)
+
+# Load processor for chat template
+processor = AutoProcessor.from_pretrained(BASE_MODEL, trust_remote_code=True)
+
+# Prepare input
+image = Image.open("screenshot.png")  # Replace with your screenshot
+if image.mode != "RGB":
+    image = image.convert("RGB")
+
+action = '{"action_type": "TAP", "coordinates": [512, 890]}'
+
+# World model prompt template
+user_content = f"""You are an expert mobile UI World Model that can accurately predict the next state given an action.
+Given a screenshot of a mobile interface and an action, you must generate clean, responsive HTML code that represents the state of the interface AFTER the action is performed.
+First generate reasoning about what the next state should look like based on the action.
+Afterwards, generate the HTML code representing the next state that logically follows the action.
+You will render this HTML in a mobile viewport to see how similar it looks and acts like the mobile screenshot.
+
+Requirements:
+1. Provide reasoning about what the next state should look like based on the action
+2. Generate complete, valid HTML5 code
+3. Choose between using inline CSS and utility classes from Bootstrap, Tailwind CSS, or MUI for styling, depending on which option generates the closest code to the screenshot.
+4. Use mobile-first design principles matching screenshot dimensions.
+5. For images, use inline SVG placeholders with explicit width and height attributes that match the approximate dimensions from the screenshot. Matching the approximate color is also good.
+6. Use modern web standards and best practices
+7. Return ONLY the HTML code, no explanations or markdown formatting
+8. The generated HTML should render properly in a mobile viewport.
+9. Generated HTML should look like the screen that logically follows the current screen and the action.
+
+Action:
+{action}
+
+Output format:
+# Next State Reasoning: <your reasoning about what the next state should look like>
+# HTML: <valid_html_code>
+
+Generate the next state reasoning and the next state in html:"""
+
+# Build messages
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": user_content},
+        ],
+    }
+]
+
+# Apply chat template
+prompt = processor.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+
+# Generation parameters
+sampling_params = SamplingParams(
+    max_tokens=15000,
+    temperature=0,
+    seed=42,
+    top_p=1.0,
+)
+
+# Generate
+outputs = llm.generate(
+    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
+    sampling_params=sampling_params,
+)
+
+print(outputs[0].outputs[0].text)
+```
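The response follows the `# Next State Reasoning:` / `# HTML:` output format requested in the prompt. A minimal sketch for splitting the two parts out of the generated text (an illustrative helper, not part of the official repository; it assumes the model adheres to the requested format):

```python
import re

def parse_world_model_output(text: str) -> tuple[str, str]:
    """Split a gWorld-style response into (reasoning, html) using the
    '# Next State Reasoning:' / '# HTML:' markers from the prompt."""
    match = re.search(
        r"#\s*Next State Reasoning:\s*(.*?)\s*#\s*HTML:\s*(.*)",
        text,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError("output did not follow the expected format")
    return match.group(1).strip(), match.group(2).strip()

# Toy response in the expected format
sample = (
    "# Next State Reasoning: Tapping the button opens the settings screen.\n"
    "# HTML: <!DOCTYPE html><html><body>Settings</body></html>"
)
reasoning, html = parse_world_model_output(sample)
```

The extracted `html` string is what would then be rendered in a mobile viewport.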
173
 
174
  ## License and Contact
175
  This model is licensed under the Apache License 2.0. For inquiries, please contact: info@trillionlabs.co
176
 
 
177
  ## Citation
178
+ ```bibtex
179
  @misc{koh2026generativevisualcodemobile,
180
+ title={Generative Visual Code Mobile World Models},
181
+ author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
182
+ year={2026},
183
+ eprint={2602.01576},
184
+ archivePrefix={arXiv},
185
+ primaryClass={cs.LG},
186
+ url={https://arxiv.org/abs/2602.01576},
187
  }
188
  ```