---
title: UI-UX UX Defect Diagnosis
emoji: 🤗
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.29.0"
app_file: app.py
pinned: false
license: mit
hardware: zero-a10g
tags:
- computer-vision
- multimodal
- image-classification
- ux
---

# UI-UX

**UI-UX** is a multimodal LLM optimized for UI/UX defect diagnosis on mobile interfaces.
Built on Qwen3.5 with task-aware GRPO reinforcement learning,
it achieves **79.63%** on UXBench, surpassing Claude-4.5-Sonnet (65.50%) and GPT-4o class models.

[CVPR 2026 Poster](https://cvpr.thecvf.com/virtual/2026/poster/41386)
[Paper](https://openaccess.thecvf.com/content/CVPR2026F/papers/Mao_Reasoning_for_Mobile_User_Experience_with_Multimodal_LLMs_Task_Benchmark_CVPRF_2026_paper.pdf)
[Demo](https://mason111-ui-ux.hf.space/)
[License](#license)

---

## Highlights

- **State-of-the-art UX reasoning**: 79.63% on UXBench
- **UXBench**: the first vision-language benchmark for UX defect diagnosis, with 2,000 samples across 8 tasks
- **8 diagnostic tasks** covering Usability, Efficiency, and Trustworthiness
- **Thinking mode**: chain-of-thought reasoning via `<think>...</think>` tokens, with reward-based mitigation of overthinking

---

## Model Details

| Property | Value |
|----------|-------|
| Base model | Qwen3.5 |
| Architecture | Qwen3_5ForConditionalGeneration (hybrid linear + full attention) |
| Training method | GRPO with Asymmetric Transition Reward |
| Context length | 16K tokens |
| Precision | bfloat16 |
| Task | Image-text-to-text (UX defect diagnosis) |

**Training data**: Multi-task UX datasets (BubbleBtn, BubbleNum, MiniPgm, ModelClose, MultiModel, TextCon, TextOcc, MarkerCons)

---

## Quick Start

### Installation

```bash
pip install transformers accelerate
```

### Inference with Transformers

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and processor; dtype="auto" keeps the precision stored in the checkpoint.
model = AutoModelForImageTextToText.from_pretrained(
    "afx-team/UI-UX", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("afx-team/UI-UX")

# One user turn: a screenshot plus a multiple-choice diagnosis prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},
            {"type": "text", "text": """Core Task: Evaluate whether the 'pop-up' offers users an explicit control for closing it.
Options:
A. No modal pop-up is present
B. The modal pop-up lacks an explicit close control
C. The modal pop-up has an explicit close control
Output Format: $\\boxed{X}$ (where X is one of A-C)."""},
        ],
    }
]

# Apply the chat template and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

# Generate, then decode only the new tokens (everything after the prompt).
output = model.generate(**inputs, max_new_tokens=8192)
response = processor.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
```
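
The prompt above asks for the answer as `$\boxed{X}$`, and thinking mode may place reasoning before it. A minimal sketch of pulling the final option letter out of `response` follows. The helper name `extract_choice` is hypothetical and not part of the released code, and it assumes the reasoning ends with a `</think>` tag when present.

```python
import re

# Matches \boxed{X} where X is a single letter, allowing stray spaces.
BOXED_RE = re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}")


def extract_choice(response: str, valid: str = "ABC"):
    """Return the last boxed option letter after the reasoning block, or None.

    Hypothetical helper: assumes any reasoning is terminated by </think>.
    """
    # Keep only the text after the final </think>, if one is present.
    answer_part = response.rsplit("</think>", 1)[-1]
    matches = BOXED_RE.findall(answer_part)
    if not matches:
        return None
    letter = matches[-1].upper()
    # Reject letters that are not among the options offered in the prompt.
    return letter if letter in valid else None
```

Calling `extract_choice(response)` on the decoded output gives a letter that can be compared with a label; pass a different `valid` string for tasks with more options.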

### Inference with vLLM (Recommended)

```bash
vllm serve afx-team/UI-UX --port 8000 --max-model-len 16384 \
  --dtype bfloat16 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="afx-team/UI-UX",
    # Fill in the image and text parts of the prompt here.
    messages=[{"role": "user", "content": [...]}],
    max_tokens=8192
)
print(response.choices[0].message.content)
```
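
The `content` list in the request above is elided. One way to fill it, sketched here under the assumption that the server accepts OpenAI-style `image_url` parts with base64 data URLs, is to encode the screenshot inline. The helper name `build_content` is hypothetical.

```python
import base64


def build_content(image_bytes: bytes, prompt: str, mime: str = "image/png") -> list:
    """Build an OpenAI-style multimodal content list (hypothetical helper).

    The image is embedded as a base64 data URL, followed by the text prompt.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        {"type": "text", "text": prompt},
    ]
```

For example, `build_content(open("screenshot.png", "rb").read(), prompt)` could be passed as the `content` value of the user message, with `prompt` holding the task text.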

---

## UXBench

UXBench is the first vision-language benchmark for UX defect diagnosis, containing **2,000** VQA samples across **8 tasks** in **3 dimensions**:

| Dimension | Tasks | Description |
|-----------|-------|-------------|
| **Usability** | BubbleOcclT, BubbleOcclBtn, PopupNoClose | Text/button occlusion, missing close controls |
| **Efficiency** | PopupBlock, PopupStack | Popup blocking interactions, dialog stacking |
| **Trustworthiness** | MismatchBadge, MismatchContent, MismatchFunc | Inconsistencies in badges, content, and ads |
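
To illustrate how the task-to-dimension grouping in the table could be used when scoring predictions, here is a minimal sketch. It is not the benchmark's official evaluation script: the function name and the `(task, predicted, gold)` record layout are assumptions, and it simply reports the fraction of exact matches per dimension.

```python
from collections import defaultdict

# Task-to-dimension mapping taken from the table above.
TASK_TO_DIMENSION = {
    "BubbleOcclT": "Usability",
    "BubbleOcclBtn": "Usability",
    "PopupNoClose": "Usability",
    "PopupBlock": "Efficiency",
    "PopupStack": "Efficiency",
    "MismatchBadge": "Trustworthiness",
    "MismatchContent": "Trustworthiness",
    "MismatchFunc": "Trustworthiness",
}


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for (task, predicted, gold) tuples.

    Hypothetical helper; assumes answers are compared by exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        dimension = TASK_TO_DIMENSION[task]
        total[dimension] += 1
        correct[dimension] += int(predicted == gold)
    return {dimension: correct[dimension] / total[dimension] for dimension in total}
```

Dimensions with no records are omitted from the result, so a partial run over a subset of tasks still produces a usable summary.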

---

## Citation

```bibtex
@inproceedings{uiux2026,
  title={Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach},
  author={Mao, Ruichao and Fang, Zhou and Guo, Teng and Yang, Hao and Li, Yaping and Peng, Shaohua and Huang, Maji and Lin, Xiaoyu and Liu, Shuoyang and Li, Xuepeng and Zhang, Yuyu and Rao, Hai},
  booktitle={CVPR Findings},
  year={2026}
}
```

---

## License

This project is licensed under the [MIT License](https://opensource.org/licenses/MIT).

---

## Acknowledgements

UI-UX is built upon [Qwen3.5](https://huggingface.co/Qwen/Qwen3.5-4B-Base) and trained with [ms-swift](https://github.com/modelscope/ms-swift).