Improve model card: Add pipeline_tag, library_name, paper link, and sample usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +120 -9
README.md CHANGED
@@ -1,11 +1,13 @@
  ---
- license: apache-2.0
  language:
  - en
  - ko
  ---

- # gWorld-8B 🌍📱

  <p align="center">
  <picture>
@@ -21,6 +23,7 @@ language:

  **gWorld-8B 🌍📱** is the first open-weight, single self-contained Vision-Language Model (VLM) specialized for visual mobile GUI world modeling. Unlike traditional visual world models that predict pixels directly, **gWorld-8B** predicts the **next GUI state as executable web code**. This approach ensures pixel-perfect text rendering and structurally accurate layouts, overcoming the hallucination and legibility issues common in pixel-generation models.

  <p align="center">
  <picture>
@@ -28,6 +31,114 @@ language:
  </picture>
  </p>
  ## Model Summary
  - **Architecture:** Based on `Qwen3-VL-8B`
@@ -70,12 +181,12 @@ This model is licensed under the Apache License 2.0. For inquiries, please conta
  ## Citation
  ```
  @misc{koh2026generativevisualcodemobile,
-       title={Generative Visual Code Mobile World Models},
-       author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
-       year={2026},
-       eprint={2602.01576},
-       archivePrefix={arXiv},
-       primaryClass={cs.LG},
-       url={https://arxiv.org/abs/2602.01576},
  }
  ```
 
  ---
  language:
  - en
  - ko
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---

+ # gWorld-8B 🌍📱

  <p align="center">
  <picture>
 
  **gWorld-8B 🌍📱** is the first open-weight, single self-contained Vision-Language Model (VLM) specialized for visual mobile GUI world modeling. Unlike traditional visual world models that predict pixels directly, **gWorld-8B** predicts the **next GUI state as executable web code**. This approach ensures pixel-perfect text rendering and structurally accurate layouts, overcoming the hallucination and legibility issues common in pixel-generation models.

+ This model was presented in the paper [Generative Visual Code Mobile World Models](https://huggingface.co/papers/2602.01576).

  <p align="center">
  <picture>
  </picture>
  </p>
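The action the model conditions on is passed as serialized JSON. As a small illustrative sketch (not part of the card itself), a helper for building the TAP action used in the usage snippet — the `action_type`/`coordinates` schema is taken from that snippet, and other action types are an assumption not documented here:

```python
import json

def tap_action(x: int, y: int) -> str:
    """Serialize a TAP action at screenshot pixel coordinates (x, y).

    The {"action_type": ..., "coordinates": ...} shape mirrors the
    example action in the usage snippet; support for other action
    types is an assumption, not documented in this card.
    """
    return json.dumps({"action_type": "TAP", "coordinates": [x, y]})

action = tap_action(512, 890)
```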

+ ## Sample Usage
+
+ You can run inference using the `vLLM` library as follows:
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoProcessor
+ from PIL import Image
+
+ # Model configuration (choose one)
+ # For gWorld-8B:
+ MODEL_PATH = "trillionlabs/gWorld-8B"
+ BASE_MODEL = "Qwen/Qwen3-VL-8B-Instruct"
+
+ # For gWorld-32B:
+ # MODEL_PATH = "trillionlabs/gWorld-32B"
+ # BASE_MODEL = "Qwen/Qwen3-VL-32B"
+
+ # Image processing settings
+ MM_PROCESSOR_KWARGS = {
+     "max_pixels": 4233600,
+     "min_pixels": 3136,
+ }
+
+ # Load model
+ llm = LLM(
+     model=MODEL_PATH,
+     tokenizer=BASE_MODEL,
+     tensor_parallel_size=8,
+     gpu_memory_utilization=0.9,
+     max_model_len=19384,
+     trust_remote_code=True,
+     mm_processor_kwargs=MM_PROCESSOR_KWARGS,
+     enable_chunked_prefill=True,
+     max_num_batched_tokens=16384,
+ )
+
+ # Load processor for chat template
+ processor = AutoProcessor.from_pretrained(BASE_MODEL, trust_remote_code=True)
+
+ # Prepare input
+ image = Image.open("screenshot.png")
+ if image.mode != 'RGB':
+     image = image.convert('RGB')
+
+ action = '{"action_type": "TAP", "coordinates": [512, 890]}'
+
+ # World model prompt template
+ user_content = f"""You are an expert mobile UI World Model that can accurately predict the next state given an action.
+ Given a screenshot of a mobile interface and an action, you must generate clean, responsive HTML code that represents the state of the interface AFTER the action is performed.
+ First generate reasoning about what the next state should look like based on the action.
+ Afterwards, generate the HTML code representing the next state that logically follows the action.
+ You will render this HTML in a mobile viewport to see how similar it looks and acts like the mobile screenshot.
+
+ Requirements:
+ 1. Provide reasoning about what the next state should look like based on the action
+ 2. Generate complete, valid HTML5 code
+ 3. Choose between using inline CSS and utility classes from Bootstrap, Tailwind CSS, or MUI for styling, depending on which option generates the closest code to the screenshot.
+ 4. Use mobile-first design principles matching screenshot dimensions.
+ 5. For images, use inline SVG placeholders with explicit width and height attributes that match the approximate dimensions from the screenshot. Matching the approximate color is also good.
+ 6. Use modern web standards and best practices
+ 7. Return ONLY the HTML code, no explanations or markdown formatting
+ 8. The generated HTML should render properly in a mobile viewport.
+ 9. Generated HTML should look like the screen that logically follows the current screen and the action.
+
+ Action:
+ {action}
+
+ Output format:
+ # Next State Reasoning: <your reasoning about what the next state should look like>
+ # HTML: <valid_html_code>
+
+ Generate the next state reasoning and the next state in html:"""
+
+ # Build messages
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": user_content},
+         ],
+     }
+ ]
+
+ # Apply chat template
+ prompt = processor.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+
+ # Generation parameters
+ sampling_params = SamplingParams(
+     max_tokens=15000,
+     temperature=0,
+     seed=42,
+     top_p=1.0,
+ )
+
+ # Generate
+ outputs = llm.generate(
+     [{"prompt": prompt, "multi_modal_data": {"image": image}}],
+     sampling_params=sampling_params
+ )
+
+ print(outputs[0].outputs[0].text)
+ ```
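The prompt pins the response to two marked sections, `# Next State Reasoning:` and `# HTML:`. As a minimal post-processing sketch (the function and its fallback behavior are our own; only the two markers come from the prompt), a reply can be split like this:

```python
def split_world_model_output(text: str) -> tuple[str, str]:
    """Split a gWorld reply into (reasoning, html).

    Assumes the model followed the output format requested in the
    prompt: a '# Next State Reasoning:' section followed by a
    '# HTML:' section. The fallback for malformed replies is our
    own choice, not specified by the model card.
    """
    reasoning_marker = "# Next State Reasoning:"
    html_marker = "# HTML:"
    html_start = text.find(html_marker)
    if html_start == -1:
        # No HTML marker: treat the whole reply as HTML.
        return "", text.strip()
    reasoning = text[:html_start]
    if reasoning_marker in reasoning:
        reasoning = reasoning.split(reasoning_marker, 1)[1]
    html = text[html_start + len(html_marker):]
    return reasoning.strip(), html.strip()

sample = (
    "# Next State Reasoning: Tapping opens the settings menu.\n"
    "# HTML: <!DOCTYPE html><html><body>settings</body></html>"
)
reasoning, html = split_world_model_output(sample)
```

The returned `html` string can then be written to a file and rendered in a mobile viewport, as the prompt describes.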
 
  ## Model Summary
  - **Architecture:** Based on `Qwen3-VL-8B`
 
  ## Citation
  ```
  @misc{koh2026generativevisualcodemobile,
+   title={Generative Visual Code Mobile World Models},
+   author={Woosung Koh and Sungjun Han and Segyu Lee and Se-Young Yun and Jamin Shin},
+   year={2026},
+   eprint={2602.01576},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG},
+   url={https://arxiv.org/abs/2602.01576},
  }
  ```