---
datasets:
- HuggingFaceM4/WebSight
base_model:
- Salesforce/codet5-base
- google/siglip2-base-patch16-512
---

# No Code, No Cloud: On-Device Mockup-to-Code with Lightweight Vision-Language AI

Bridging the gap between visual design and functional code remains a persistent challenge in modern UI workflows, especially for small teams and non-programmers. Existing solutions, such as Figma-to-code tools and recent vision-language models (VLMs), often depend on proprietary cloud APIs or large-scale architectures, limiting offline operation, privacy, and control. We present LiteViT5, a lightweight, on-device vision-language model that generates HTML directly from images of design mockups, enabling private, no-code prototyping without cloud infrastructure. Built on a compact ViT–T5 encoder–decoder framework with 235M parameters, LiteViT5 achieves competitive results on both in-distribution (WebSight) and out-of-distribution (Design2Code) benchmarks. We evaluate it on structure, position, color, and CLIP-based similarity metrics and report performance comparable to models 10–30× larger, such as PaliGemma-3B, LLaVA-7B, and DeepSeek-VL-7B. We further evaluate LiteViT5 in a user study with 24 participants, who rated perceived accuracy, code quality, and editability. Our findings show that LiteViT5 supports rapid design iteration and reduces reliance on developer handoff, making it a practical, assistive tool for democratizing web interface creation. This work highlights the potential of efficient, human-centered generative AI to empower interface design beyond expert-only workflows. To support transparency and reproducibility, we release LiteViT5 as an open-source model on Hugging Face: https://huggingface.co/LiteVit5/model.

## Model Architecture

- **Vision Encoder**: SigLIP2 (frozen)
- **Vision Processing**: Multi-view fusion (see the sketch below)
- **Seq2Seq Decoder**: CodeT5-based decoder with language modeling head
- **Input**: Images (5 views per sample: 4 quarter views + 1 full view)
- **Output**: Generated HTML
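
Taken together, these components form an encode-fuse-decode pipeline: the frozen SigLIP2 tower encodes each of the five views, the resulting patch tokens are fused, and the CodeT5 decoder generates HTML from the fused sequence. The sketch below is only a reading aid for that wiring, not the released implementation: the `LiteViT5Sketch` class, the concatenate-and-project fusion, and the linear projection layer are illustrative assumptions; only the two base checkpoints come from this card.

```python
import torch
import torch.nn as nn
from transformers import SiglipVisionModel, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput


class LiteViT5Sketch(nn.Module):
    """Illustrative wiring only: frozen SigLIP2 views -> fusion -> CodeT5 decoder."""

    def __init__(self):
        super().__init__()
        # Frozen SigLIP2 vision tower (the fixed-resolution SigLIP2 checkpoints
        # load through the SigLIP classes, matching the processor used in this card)
        self.vision = SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-512")
        self.vision.requires_grad_(False)
        # CodeT5 supplies the seq2seq decoder and language modeling head
        self.codet5 = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
        # Assumed fusion: project each view's patch tokens to the T5 width, then
        # concatenate all views along the sequence dimension
        self.proj = nn.Linear(self.vision.config.hidden_size, self.codet5.config.d_model)

    def forward(self, pixel_values, labels=None):
        # pixel_values: [5, 3, 512, 512] -- 4 quarter views + 1 full view of one mockup
        patch_tokens = self.vision(pixel_values=pixel_values).last_hidden_state    # [5, P, H_v]
        fused = self.proj(patch_tokens).reshape(1, -1, self.codet5.config.d_model)  # [1, 5*P, d_model]
        return self.codet5(encoder_outputs=BaseModelOutput(last_hidden_state=fused), labels=labels)
```

The released model's actual fusion, projection, and positional handling may differ; treat the component list above as authoritative and the code as illustration only.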

## Installation

```bash
uv add transformers torch accelerate
```
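
If you are not using uv, the same dependencies can be installed with pip:

```bash
pip install transformers torch accelerate
```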

## Usage

### Loading the Model

```python
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True)

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")
```
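
Since LiteViT5 targets offline, on-device use, you may want to snapshot the components to disk once and then load them without network access. The following is a minimal sketch continuing from the loading example above; it assumes the repository's custom model class supports the standard `save_pretrained` / `from_pretrained` round trip, and the `./litevit5-local` paths are illustrative:

```python
# One-time snapshot to disk (network is only needed for this step)
model.save_pretrained("./litevit5-local/model")
tokenizer.save_pretrained("./litevit5-local/tokenizer")
processor.save_pretrained("./litevit5-local/processor")

# Later, load every component from the local snapshot
model = AutoModel.from_pretrained("./litevit5-local/model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./litevit5-local/tokenizer")
processor = SiglipProcessor.from_pretrained("./litevit5-local/processor")
```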

### Inference Example

```python
from PIL import Image
import torch

from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True, device_map="auto")

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")


# Preprocess image (split into 4 parts + full image = 5 views)
def prepare_image(image_path: str, processor):
    """
    Prepare image with 5 views (4 quarters + full).

    Args:
        image_path: Path to the image file
        processor: SigLIP processor

    Returns:
        Tensor of shape [5, 3, 512, 512]
    """
    image = Image.open(image_path).convert("RGB")

    # Split into 4 quarters
    width, height = image.size
    quarters = [
        image.crop((0, 0, width//2, height//2)),            # top-left
        image.crop((width//2, 0, width, height//2)),        # top-right
        image.crop((0, height//2, width//2, height)),       # bottom-left
        image.crop((width//2, height//2, width, height)),   # bottom-right
    ]

    # Process all views
    processed = [
        processor(images=q, return_tensors="pt")["pixel_values"]
        for q in quarters
    ]
    # Add full image
    processed.append(
        processor(images=image, return_tensors="pt")["pixel_values"]
    )

    pixel_values = torch.cat(processed, dim=0)
    return pixel_values


def generate_text(model, pixel_values, tokenizer, max_length=512):
    """
    Generate text from image.

    Args:
        model: LiteVit5 model
        pixel_values: Preprocessed image tensor
        tokenizer: Tokenizer for decoding
        max_length: Maximum generation length

    Returns:
        Generated text string
    """
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length)

    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text


device = next(model.parameters()).device

# Prepare the 5 views for a single mockup image
pixel_values = prepare_image("./image_13.png", processor)
pixel_values = pixel_values.to(device)

print("\nGenerating HTML from image_13.png...")
text = generate_text(model, pixel_values, tokenizer, max_length=2024)
print(f"Generated: {text}")
```
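
Because the output is plain HTML, the quickest way to inspect a result is to write `text` from the example above to a file and open it in a browser. This uses only the standard library; the file name is arbitrary:

```python
import webbrowser
from pathlib import Path

# Persist the generated markup and preview it locally (no server or cloud round trip)
out_path = Path("generated_page.html")
out_path.write_text(text, encoding="utf-8")
webbrowser.open(out_path.resolve().as_uri())
```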