ZhenYang21 committed
Commit 67775f0 · verified · 1 Parent(s): 2d747a1

Update README.md

Files changed (1): README.md (+100 -3)
README.md CHANGED
---
license: mit
language:
- zh
- en
base_model:
- zai-org/GLM-4.1V-9B-Base
pipeline_tag: image-text-to-text
library_name: transformers
---

<h1>WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation</h1>

- **Repository:** https://github.com/zheny2751-dotcom/WebVIA
- **Paper:** https://arxiv.org/pdf/2511.06251

<p align="center">
  <img src="https://raw.githubusercontent.com/zheny2751-dotcom/WebVIA/main/assets/WEBVIA.png" alt="abs" style="width:90%;" />
</p>

**WebVIA** is **the first agentic framework** for interactive and verifiable UI-to-Code generation. Whereas prior vision-language models produce only static HTML/CSS layouts, WebVIA generates executable, interactive web interfaces.
The framework consists of three modules (a purely illustrative sketch of how they compose follows the list):
<ul>
<li><strong>WebVIA-Agent</strong> – navigates websites and captures multi-state UI screenshots.</li>
<li><strong>WebVIA-UI2Code</strong> – generates functional HTML/CSS/JavaScript code with interactivity.</li>
<li><strong>Validation Module</strong> – verifies whether the generated UI behaves as expected.</li>
</ul>
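
The sketch below is illustrative only and is not an API shipped with this model: the function names (`capture_states`, `ui_to_code`, `validate_ui`, `run_webvia`) are hypothetical placeholders showing how the three stages are intended to chain; the real agent and validation tooling live in the GitHub repository.

```python
# Illustrative only: hypothetical placeholders for the three WebVIA stages.
# None of these functions ship with this model; see the WebVIA GitHub repo
# for the actual agent, UI2Code, and validation implementations.
from typing import List


def capture_states(url: str) -> List[bytes]:
    """WebVIA-Agent: navigate the site and return multi-state UI screenshots."""
    raise NotImplementedError


def ui_to_code(screenshots: List[bytes]) -> str:
    """WebVIA-UI2Code: generate interactive HTML/CSS/JS from the screenshots."""
    raise NotImplementedError


def validate_ui(html: str, screenshots: List[bytes]) -> bool:
    """Validation module: check that the rendered page reproduces each captured state."""
    raise NotImplementedError


def run_webvia(url: str) -> str:
    screenshots = capture_states(url)       # 1. capture multi-state screenshots
    html = ui_to_code(screenshots)          # 2. generate interactive code
    if not validate_ui(html, screenshots):  # 3. verify the generated UI's behavior
        raise RuntimeError("Generated UI failed verification")
    return html
```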


### Backbone Model

Our model is built on [GLM-4.1V-9B-Base](https://huggingface.co/zai-org/GLM-4.1V-9B-Base).


### Quick Inference

Below is a simple example of running single-image inference with the `transformers` library.
First, install `transformers` (the version requirement is quoted so the shell does not treat `>=` as a redirection):

```bash
pip install "transformers>=4.57.1"
```

Then, run the following code:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# Chat-style request: one UI screenshot plus the UI-to-code instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://raw.githubusercontent.com/zheny2751-dotcom/UI2Code-N/main/assets/example.png"
            },
            {
                "type": "text",
                "text": "Please generate the corresponding html code for the given UI screenshot."
            }
        ],
    }
]

# Load the processor and the model weights.
processor = AutoProcessor.from_pretrained("zai-org/WebVIA-Agent")
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path="zai-org/WebVIA-Agent",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Apply the chat template, tokenize, and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=16384)
output_text = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False
)
print(output_text)
```
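
As an optional follow-up that is not part of the original example, you can decode the answer without special tokens and save it for previewing in a browser. It continues from the snippet above (reusing `processor`, `inputs`, and `generated_ids`); whether the model wraps its HTML in a Markdown code fence depends on the prompt, so the unwrapping step is only a defensive assumption.

```python
# Optional continuation of the snippet above (uses processor, inputs, generated_ids).
import re

html = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,  # drop chat/special tokens to get a clean string
)

# If the answer is wrapped in a Markdown code fence, keep only its contents
# (an assumption about the output format, not a guarantee).
match = re.search(r"`{3}(?:html)?\s*(.*?)`{3}", html, flags=re.DOTALL)
if match:
    html = match.group(1)

# Write the markup to disk so it can be opened in a browser.
with open("generated_ui.html", "w", encoding="utf-8") as f:
    f.write(html)
print("Saved generated UI to generated_ui.html")
```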

See our [GitHub repo](https://github.com/zheny2751-dotcom/WebVIA) for more detailed usage.


## Citation

If you find our model useful in your work, please cite it with:

```bibtex
@article{xu2025webvia,
  title={WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation},
  author={Xu, Mingde and Yang, Zhen and Hong, Wenyi and Pan, Lihang and Fan, Xinyue and Wang, Yan and Gu, Xiaotao and Xu, Bin and Tang, Jie},
  year={2025},
  journal={arXiv preprint arXiv:2511.06251}
}
```