---
datasets:
- HuggingFaceM4/WebSight
base_model:
- Salesforce/codet5-base
- google/siglip2-base-patch16-512
---

# No Code, No Cloud: On-Device Mockup-to-Code with Lightweight Vision-Language AI

Bridging the gap between visual design and functional code remains a persistent challenge in modern UI workflows, especially for small teams and non-programmers. Existing solutions, such as Figma-to-code tools and recent vision-language models (VLMs), often depend on proprietary cloud APIs or large-scale architectures, limiting offline operation, privacy, and control. We present LiteViT5, a lightweight, on-device vision-language model that generates HTML directly from images of design mockups, enabling private, no-code prototyping without cloud infrastructure. Built on a compact ViT–T5 encoder–decoder framework with 235M parameters, LiteViT5 achieves competitive results on both in-distribution (WebSight) and out-of-distribution (Design2Code) benchmarks. We evaluate it on structure, position, color, and CLIP-based similarity metrics and report performance comparable to models 10–30× larger, such as PaliGemma-3B, LLaVA-7B, and DeepSeek-VL-7B. We further evaluate LiteViT5 in a user study with 24 participants, who rated perceived accuracy, code quality, and editability. Our findings show that LiteViT5 supports rapid design iteration and reduces reliance on developer handoff, making it a practical, assistive tool for democratizing web interface creation. This work highlights the potential of efficient, human-centered generative AI to empower interface design beyond expert-only workflows. To support transparency and reproducibility, we release LiteViT5 as an open-source model on Hugging Face: https://huggingface.co/LiteVit5/model.

## Model Architecture

- **Vision Encoder**: SigLIP2 (frozen)
- **Vision Processing**: Multi-view fusion (see the sketch below)
- **Seq2Seq Decoder**: CodeT5-based decoder with language modeling head
- **Input**: Images (5 views per sample: 4 quarter views + 1 full view)
- **Output**: Generated HTML
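
Taken together, these components form an encode-fuse-decode pipeline: the frozen SigLIP2 tower encodes each of the five views, the resulting patch tokens are fused, and the CodeT5 decoder generates HTML from the fused sequence. The sketch below is only a reading aid for that wiring, not the released implementation: the `LiteViT5Sketch` class, the concatenate-and-project fusion, and the linear projection layer are illustrative assumptions; only the two base checkpoints come from this card.

```python
import torch
import torch.nn as nn
from transformers import SiglipVisionModel, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput


class LiteViT5Sketch(nn.Module):
    """Illustrative wiring only: frozen SigLIP2 views -> fusion -> CodeT5 decoder."""

    def __init__(self):
        super().__init__()
        # Frozen SigLIP2 vision tower (the fixed-resolution SigLIP2 checkpoints
        # load through the SigLIP classes, matching the processor used in this card)
        self.vision = SiglipVisionModel.from_pretrained("google/siglip2-base-patch16-512")
        self.vision.requires_grad_(False)
        # CodeT5 supplies the seq2seq decoder and language modeling head
        self.codet5 = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
        # Assumed fusion: project each view's patch tokens to the T5 width, then
        # concatenate all views along the sequence dimension
        self.proj = nn.Linear(self.vision.config.hidden_size, self.codet5.config.d_model)

    def forward(self, pixel_values, labels=None):
        # pixel_values: [5, 3, 512, 512] -- 4 quarter views + 1 full view of one mockup
        patch_tokens = self.vision(pixel_values=pixel_values).last_hidden_state    # [5, P, H_v]
        fused = self.proj(patch_tokens).reshape(1, -1, self.codet5.config.d_model)  # [1, 5*P, d_model]
        return self.codet5(encoder_outputs=BaseModelOutput(last_hidden_state=fused), labels=labels)
```

The released model's actual fusion, projection, and positional handling may differ; treat the component list above as authoritative and the code as illustration only.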

## Installation

```bash
uv add transformers torch accelerate
```
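
If you are not using uv, the same dependencies can be installed with pip:

```bash
pip install transformers torch accelerate
```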

## Usage

### Loading the Model

```python
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True)

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")
```
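
Since LiteViT5 targets offline, on-device use, you may want to snapshot the components to disk once and then load them without network access. The following is a minimal sketch continuing from the loading example above; it assumes the repository's custom model class supports the standard `save_pretrained` / `from_pretrained` round trip, and the `./litevit5-local` paths are illustrative:

```python
# One-time snapshot to disk (network is only needed for this step)
model.save_pretrained("./litevit5-local/model")
tokenizer.save_pretrained("./litevit5-local/tokenizer")
processor.save_pretrained("./litevit5-local/processor")

# Later, load every component from the local snapshot
model = AutoModel.from_pretrained("./litevit5-local/model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./litevit5-local/tokenizer")
processor = SiglipProcessor.from_pretrained("./litevit5-local/processor")
```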

### Inference Example

```python
from PIL import Image
import torch

from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True, device_map="auto")

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")


# Preprocess image (split into 4 parts + full image = 5 views)
def prepare_image(image_path: str, processor):
    """
    Prepare image with 5 views (4 quarters + full).

    Args:
        image_path: Path to the image file
        processor: SigLIP processor

    Returns:
        Tensor of shape [5, 3, 512, 512]
    """
    image = Image.open(image_path).convert("RGB")

    # Split into 4 quarters
    width, height = image.size
    quarters = [
        image.crop((0, 0, width//2, height//2)),            # top-left
        image.crop((width//2, 0, width, height//2)),        # top-right
        image.crop((0, height//2, width//2, height)),       # bottom-left
        image.crop((width//2, height//2, width, height)),   # bottom-right
    ]

    # Process all views
    processed = [
        processor(images=q, return_tensors="pt")["pixel_values"]
        for q in quarters
    ]
    # Add full image
    processed.append(
        processor(images=image, return_tensors="pt")["pixel_values"]
    )

    pixel_values = torch.cat(processed, dim=0)
    return pixel_values


def generate_text(model, pixel_values, tokenizer, max_length=512):
    """
    Generate text from image.

    Args:
        model: LiteVit5 model
        pixel_values: Preprocessed image tensor
        tokenizer: Tokenizer for decoding
        max_length: Maximum generation length

    Returns:
        Generated text string
    """
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length)

    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text


device = next(model.parameters()).device

# Prepare the 5 views for a single mockup image
pixel_values = prepare_image("./image_13.png", processor)
pixel_values = pixel_values.to(device)

print("\nGenerating HTML from image_13.png...")
text = generate_text(model, pixel_values, tokenizer, max_length=2024)
print(f"Generated: {text}")
```
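
Because the output is plain HTML, the quickest way to inspect a result is to write `text` from the example above to a file and open it in a browser. This uses only the standard library; the file name is arbitrary:

```python
import webbrowser
from pathlib import Path

# Persist the generated markup and preview it locally (no server or cloud round trip)
out_path = Path("generated_page.html")
out_path.write_text(text, encoding="utf-8")
webbrowser.open(out_path.resolve().as_uri())
```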