---
datasets:
  - HuggingFaceM4/WebSight
base_model:
  - Salesforce/codet5-base
  - google/siglip2-base-patch16-512
---

# No Code, No Cloud: On-Device Mockup-to-Code with Lightweight Vision-Language AI

Bridging the gap between visual design and functional code remains a persistent challenge in modern UI workflows, especially for small teams and non-programmers. Existing solutions, such as Figma-to-code tools and recent vision-language models (VLMs), often depend on proprietary cloud APIs or large-scale architectures, limiting offline operation, privacy, and control. We present LiteViT5, a lightweight, on-device vision-language model that directly generates HTML from images of design mockups, enabling private, no-code prototyping without cloud infrastructure. Built on a compact ViT–T5 encoder–decoder framework with 235M parameters, LiteViT5 achieves competitive results on both in-distribution (WebSight) and out-of-distribution (Design2Code) benchmarks. We evaluate its performance across structure, position, color, and CLIP-based similarity metrics, finding it comparable to models 10–30× larger such as PaliGemma-3B, LLaVA-7B, and DeepSeek-VL-7B. We further assess LiteViT5 in a user study with 24 participants, who rated its perceived accuracy, code quality, and editability. Our findings show that LiteViT5 supports rapid design iteration and reduces reliance on developer handoff, making it a practical, assistive tool for democratizing web interface creation. This work highlights the potential of efficient, human-centered generative AI to empower interface design beyond expert-only workflows. To support transparency and reproducibility, we release LiteViT5 as an open-source model on Hugging Face: https://huggingface.co/LiteVit5/model.

## Model Architecture

- **Vision Encoder**: SigLIP2 (frozen)
- **Vision Processing**: Multi-view fusion (see the sketch after this list)
- **Seq2Seq Decoder**: CodeT5-based decoder with language modeling head
- **Input**: Images (5 views per sample: 4 quarter views + 1 full view)
- **Output**: Generated HTML
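
This card does not spell out the fusion step, so the following is a minimal sketch of one plausible scheme, assuming each view is encoded independently by the frozen SigLIP2 tower and the patch tokens are concatenated into a single sequence for the decoder's cross-attention. The name `fuse_views` and the concatenation strategy are illustrative assumptions, not the released API.

```python
import torch
from transformers import SiglipVisionModel

# Frozen vision tower; fixed-resolution SigLIP2 checkpoints load through
# the SigLIP classes in transformers.
vision = SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-512")
vision.eval()

@torch.no_grad()
def fuse_views(pixel_values: torch.Tensor) -> torch.Tensor:
    """Hypothetical fusion: encode each view, then concatenate patch tokens.

    pixel_values: [5, 3, 512, 512] (4 quarter views + the full view)
    returns:      [1, 5 * num_patches, hidden], usable as encoder hidden
                  states for the decoder's cross-attention
    """
    tokens = vision(pixel_values=pixel_values).last_hidden_state  # [5, P, H]
    return tokens.flatten(0, 1).unsqueeze(0)                      # [1, 5P, H]
```

At base scale the SigLIP2 hidden size matches CodeT5's d_model (768), so fused tokens of this shape could feed the decoder directly, though the actual model may insert a learned projection.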

## Installation

```bash
uv add transformers torch accelerate
```
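
If you are not using uv, installing the same packages with pip (`pip install transformers torch accelerate`) should work equally well.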

## Usage

### Loading the Model

```python
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True)

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")
```
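
Nothing LiteViT5-specific is needed for device placement; the usual Transformers idiom applies (the full example below uses `device_map="auto"` instead). A minimal sketch:

```python
import torch

# Optional: move the model to a GPU if one is available and switch to
# inference mode (disables dropout).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```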

### Inference Example

```python
from PIL import Image
import torch

from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True, device_map="auto")

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")

# Preprocess image (split into 4 parts + full image = 5 views)
def prepare_image(image_path: str, processor):
    """
    Prepare image with 5 views (4 quarters + full).

    Args:
        image_path: Path to the image file
        processor: SigLIP processor

    Returns:
        Tensor of shape [5, 3, 512, 512]
    """
    image = Image.open(image_path).convert("RGB")

    # Split into 4 quarters
    width, height = image.size
    quarters = [
        image.crop((0, 0, width//2, height//2)),           # top-left
        image.crop((width//2, 0, width, height//2)),       # top-right
        image.crop((0, height//2, width//2, height)),      # bottom-left
        image.crop((width//2, height//2, width, height)),  # bottom-right
    ]

    # Process all views
    processed = [
        processor(images=q, return_tensors="pt")["pixel_values"]
        for q in quarters
    ]
    # Add full image
    processed.append(
        processor(images=image, return_tensors="pt")["pixel_values"]
    )

    pixel_values = torch.cat(processed, dim=0)
    return pixel_values

def generate_text(model, pixel_values, tokenizer, max_length=512):
    """
    Generate text from image.

    Args:
        model: LiteVit5 model
        pixel_values: Preprocessed image tensor
        tokenizer: Tokenizer for decoding
        max_length: Maximum generation length

    Returns:
        Generated text string
    """
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length)

    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text

device = next(model.parameters()).device

# Prepare the 5-view tensor and generate HTML, reusing the helper above
# (which wraps generation in torch.no_grad())
pixel_values = prepare_image("./image_13.png", processor)
pixel_values = pixel_values.to(device)

print("\nGenerating HTML from image_13.png...")
text = generate_text(model, pixel_values, tokenizer, max_length=2048)
print(f"Generated: {text}")
```
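
Since the output is plain HTML, a quick way to inspect it is to write it to disk and open it in a browser alongside the source mockup. A small follow-on sketch, continuing from the variables above:

```python
from pathlib import Path
import webbrowser

# Save the generated markup and open it for a side-by-side visual check
out_path = Path("generated.html")
out_path.write_text(text, encoding="utf-8")
webbrowser.open(out_path.resolve().as_uri())
```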