---
datasets:
- HuggingFaceM4/WebSight
base_model:
- Salesforce/codet5-base
- google/siglip2-base-patch16-512
---

# No Code, No Cloud: On-Device Mockup-to-Code with Lightweight Vision-Language AI

Bridging the gap between visual design and functional code remains a persistent challenge in modern UI workflows, especially for small teams and non-programmers. Existing solutions, such as Figma-to-code tools and recent vision-language models (VLMs), often depend on proprietary cloud APIs or large-scale architectures, limiting offline operation, privacy, and control. We present LiteViT5, a lightweight, on-device vision-language model that directly generates HTML from images of design mockups, enabling private, no-code prototyping without cloud infrastructure.

Built on a compact ViT–T5 encoder–decoder framework with 235M parameters, LiteViT5 achieves competitive results on both in-distribution (WebSight) and out-of-distribution (Design2Code) benchmarks. We evaluate its performance across structure, position, color, and CLIP-based similarity metrics and report performance comparable to models 10–30× larger, such as PaliGemma-3B, LLaVA-7B, and DeepSeek-VL-7B. We further assess LiteViT5 in a user study with 24 participants, measuring perceived accuracy, code quality, and editability. Our findings show that LiteViT5 supports rapid design iteration and reduces reliance on developer handoff, making it a practical, assistive tool for democratizing web interface creation. This work highlights the potential of efficient, human-centered generative AI to empower interface design beyond expert-only workflows.

To support transparency and reproducibility, we release LiteViT5 as an open-source model on Hugging Face: https://huggingface.co/OSTswiss/LiteViT5.

## Model Architecture

- **Vision Encoder**: SigLIP2 (frozen)
- **Vision Processing**: Multi-view fusion (sketched below)
- **Seq2Seq Decoder**: CodeT5-based decoder with a language modeling head
- **Input**: Images (5 views per sample: 4 quarter views + 1 full view)
- **Output**: Generated HTML
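The multi-view design can be pictured as follows: each of the five views is encoded independently by the frozen SigLIP2 encoder, and the per-view patch tokens are joined into a single sequence for the decoder to cross-attend to. The snippet below is a minimal, hypothetical sketch of that idea under the assumption that fusion is simple sequence concatenation; `fuse_views` and its signature are illustrative, not LiteViT5's actual internals (those ship with the checkpoint via `trust_remote_code`).

```python
import torch

def fuse_views(vision_encoder, pixel_values: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of multi-view fusion (assumed: concatenation).

    vision_encoder: the frozen SigLIP2 vision tower, returning patch
        embeddings of shape [num_views, num_patches, hidden].
    pixel_values: [5, 3, 512, 512] -- 4 quarter views + 1 full view,
        as produced by prepare_image() in the inference example below.
    """
    with torch.no_grad():  # the vision encoder is frozen
        # Encode each view independently: [5, num_patches, hidden]
        patch_tokens = vision_encoder(pixel_values).last_hidden_state
    # Assumed fusion: concatenate patch tokens along the sequence axis,
    # yielding [1, 5 * num_patches, hidden] encoder states that the
    # CodeT5 decoder can consume through cross-attention.
    return patch_tokens.reshape(1, -1, patch_tokens.size(-1))
```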
## Installation

```bash
uv add transformers torch accelerate
```

## Usage

### Loading the Model

```python
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model (custom architecture, hence trust_remote_code)
model = AutoModel.from_pretrained("OSTswiss/LiteViT5", trust_remote_code=True)

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")
```

### Inference Example

```python
from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("OSTswiss/LiteViT5", trust_remote_code=True, device_map="auto")

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")


def prepare_image(image_path: str, processor):
    """
    Prepare image with 5 views (4 quarters + full).

    Args:
        image_path: Path to the image file
        processor: SigLIP processor

    Returns:
        Tensor of shape [5, 3, 512, 512]
    """
    image = Image.open(image_path).convert("RGB")

    # Split into 4 quarters
    width, height = image.size
    quarters = [
        image.crop((0, 0, width // 2, height // 2)),           # top-left
        image.crop((width // 2, 0, width, height // 2)),       # top-right
        image.crop((0, height // 2, width // 2, height)),      # bottom-left
        image.crop((width // 2, height // 2, width, height)),  # bottom-right
    ]

    # Process the quarter views
    processed = [
        processor(images=q, return_tensors="pt")["pixel_values"]
        for q in quarters
    ]

    # Add the full image as the fifth view
    processed.append(
        processor(images=image, return_tensors="pt")["pixel_values"]
    )

    pixel_values = torch.cat(processed, dim=0)
    return pixel_values


def generate_text(model, pixel_values, tokenizer, max_length=512):
    """
    Generate text from image.

    Args:
        model: LiteViT5 model
        pixel_values: Preprocessed image tensor
        tokenizer: Tokenizer for decoding
        max_length: Maximum generation length

    Returns:
        Generated text string
    """
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text


device = next(model.parameters()).device

# Preprocess the image (split into 4 quarters + full image = 5 views)
pixel_values = prepare_image("./image_13.png", processor).to(device)

print("\nGenerating HTML from image_13.png...")
text = generate_text(model, pixel_values, tokenizer, max_length=2024)
print(f"Generated: {text}")
```
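### Previewing the Output

Because the model emits plain HTML, a generated page can be checked visually with nothing beyond a browser. A minimal follow-up, assuming `text` holds the string produced by the inference example above (the output file name is arbitrary):

```python
import webbrowser
from pathlib import Path

# Write the generated markup to disk and open it locally --
# no cloud round trip, in keeping with the on-device workflow.
out_path = Path("generated_page.html")
out_path.write_text(text, encoding="utf-8")
webbrowser.open(out_path.resolve().as_uri())
```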