# Verus-0.8b
This repository contains model weights and configuration files for Verus-0.8b in the Hugging Face Transformers format.
These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, llama.cpp (GGUF export), and other major inference frameworks.
Given its compact parameter scale, the primary intended use cases are UI-to-Code translation, Diagram-to-Implementation scaffolding, Fill-in-the-Middle code completion, code review assistance, and task-specific fine-tuning.
## Verus-0.8b Highlights
Verus-0.8b represents a focused advance in small-scale multimodal coding models, combining a battle-tested vision encoder with a coding-optimized language backbone:
- Coding-First Mistral Backbone: The language model uses the Mistral architecture — the same foundation powering Codestral — featuring Sliding Window Attention (SWA, window = 4,096 tokens), Grouped Query Attention (16 query / 8 KV heads), and a RoPE theta of 1,000,000 for robust long-range code reasoning up to 125K tokens.
- Native Fill-in-the-Middle (FIM): First-class FIM support via dedicated special tokens (`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`) enables accurate single-line and multi-line code infilling — a critical capability for IDE integration and copilot-style workflows.
- High-Fidelity Vision Understanding: The CLIP ViT-L/14@336px encoder with LLaVA-Next multi-resolution grid patching (5 pinpoints, up to 1008 px on the long edge) delivers sharp structural understanding of UI wireframes, architecture diagrams, flowcharts, and ERDs.
- 125K Token Context Window: A deliberately chosen 125,000-token ceiling balances memory efficiency against long-form code reasoning, allowing Verus to process entire codebases or lengthy specification documents in a single forward pass.
- Compact and Deployable: At 0.8B parameters in bfloat16, Verus-0.8b fits comfortably in 4 GB VRAM. With 4-bit quantization (NF4), inference is possible on consumer-grade GPUs with as little as 1.5 GB VRAM.
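As a rough sanity check on those memory figures, the weight footprint follows directly from the parameter count. This is a back-of-envelope sketch; activations, KV cache, the CUDA context, and quantization metadata all add overhead on top:

```python
# Back-of-envelope VRAM estimate for the raw weights only (excludes
# activations, KV cache, CUDA context, and framework overhead).
N_PARAMS = 0.8e9  # ~0.8B parameters

bf16_gb = N_PARAMS * 2 / 1e9    # bfloat16: 2 bytes per parameter
nf4_gb = N_PARAMS * 0.5 / 1e9   # NF4: ~0.5 bytes per parameter

print(f"bf16 weights: ~{bf16_gb:.1f} GB, NF4 weights: ~{nf4_gb:.1f} GB")
```

About 1.6 GB of bf16 weights leaves headroom inside the 4 GB figure for activations and KV cache, and ~0.4 GB of NF4 weights is consistent with the quoted ~1.5 GB end-to-end footprint.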
## Model Overview
- Type: Multimodal Causal Language Model (LLaVA-Next Architecture)
- Training Stage: Pre-training & Post-training (SFT + RLHF)
- Base Text Model: Qwen/Qwen3.5-0.8B
- Chat Format: ChatML (`<|im_start|>` / `<|im_end|>`)
### Language Model — Mistral Backbone
| Property | Value |
|---|---|
| Parameters | ~0.8B |
| Hidden Dimension | 1024 |
| Number of Layers | 24 |
| Attention Heads (Q / KV) | 16 / 8 (GQA) |
| Head Dimension | 64 |
| FFN Intermediate Dimension | 3,584 |
| FFN Activation | SwiGLU |
| Sliding Window Size | 4,096 tokens |
| RoPE Theta | 1,000,000 |
| RMS Norm Epsilon | 1e-5 |
| Vocabulary Size | 32,064 |
| Context Length | 125,000 tokens |
### Vision Encoder — CLIP ViT-L/14@336px
| Property | Value |
|---|---|
| Architecture | ViT-L/14 (CLIP) |
| Input Resolution | 336 × 336 px |
| Patch Size | 14 × 14 px |
| Number of Layers | 24 |
| Hidden Size | 1,024 |
| Intermediate Size | 4,096 |
| Attention Heads | 16 |
| Activation | QuickGELU |
| Feature Extraction Layer | -2 (penultimate) |
### Multimodal Bridge — LLaVA-Next Projector
| Property | Value |
|---|---|
| Architecture | LlavaNextForConditionalGeneration |
| Projector Activation | GELU |
| Image Token Index | 32000 |
| Multi-Resolution Grid Pinpoints | [336×672], [672×336], [672×672], [1008×336], [336×1008] |
| Vision Feature Strategy | default (patch tokens only) |
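The per-image token budget implied by these settings can be sketched from the patch geometry. This is assumed arithmetic: it ignores LLaVA-Next's unpadding and any row-separator tokens, which shift the exact count:

```python
# Tokens contributed by one image under LLaVA-Next grid patching.
PATCH = 14   # ViT patch size in pixels
TILE = 336   # encoder input resolution per tile

patches_per_tile = (TILE // PATCH) ** 2  # 24 x 24 = 576 patch tokens per tile

# A 672x672 pinpoint is split into four 336px tiles, plus one downscaled
# base-resolution view of the full image.
tiles = (672 // TILE) * (672 // TILE) + 1
tokens_672 = tiles * patches_per_tile

print(patches_per_tile, tokens_672)  # 576 patch tokens per tile, 2880 total
```

The ~2,900-token figure for a 672×672 image is in line with the 2,000–6,000 tokens-per-image budget noted under Limitations.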
### Special Token Map
| Token | ID | Purpose |
|---|---|---|
| `<\|image\|>` | 32000 | Image placeholder (LLaVA-Next standard) |
| `<\|im_start\|>` | 32001 | ChatML turn start |
| `<\|im_end\|>` | 32002 | ChatML turn end / EOS |
| `<\|vision_start\|>` | 32003 | Vision sequence boundary open |
| `<\|vision_end\|>` | 32004 | Vision sequence boundary close |
| `<\|image_pad\|>` | 32005 | Vision token padding |
| `<\|fim_prefix\|>` | 32006 | FIM: prefix sentinel |
| `<\|fim_middle\|>` | 32007 | FIM: infill target sentinel |
| `<\|fim_suffix\|>` | 32008 | FIM: suffix sentinel |
| `<\|fim_pad\|>` | 32009 | FIM: padding |
| `<\|endoftext\|>` | 32010 | Generic end-of-document |
## Quickstart
### Installation
```shell
pip install "transformers>=4.52.0" accelerate pillow requests torch
```
### Text-Only Code Generation
```python
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

messages = [
    {
        "role": "user",
        "content": "Write a Python async context manager that manages a PostgreSQL connection pool using asyncpg.",
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=2048)

output = processor.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)
```
### Multimodal — UI Screenshot to React Component
```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch
from io import BytesIO

MODEL_ID = "8F-ai/Verus-0.8b"

# ── Load model & processor ────────────────────────────────────────────────────
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# ── Load image ────────────────────────────────────────────────────────────────
# From URL:
response = requests.get("https://example.com/ui_mockup.png")
image = Image.open(BytesIO(response.content)).convert("RGB")
# From disk: image = Image.open("./mockup.png").convert("RGB")

# ── Build multimodal conversation ─────────────────────────────────────────────
messages = [
    {
        "role": "system",
        "content": "You are Verus, an expert UI-to-Code assistant. Convert UI images into clean, production-ready code.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": (
                    "Convert this UI mockup to a React functional component using Tailwind CSS. "
                    "Include all interactive states (hover, focus, disabled), responsive breakpoints "
                    "(sm / md / lg), and export as default."
                ),
            },
        ],
    },
]

# ── Tokenize ──────────────────────────────────────────────────────────────────
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# ── Generate ──────────────────────────────────────────────────────────────────
with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=True,  # required for temperature / top_p to take effect
        temperature=0.1,
        top_p=0.95,
        repetition_penalty=1.1,
    )

# ── Decode ────────────────────────────────────────────────────────────────────
output = processor.decode(
    generated_ids[0][len(inputs.input_ids[0]):],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output)
```
### Fill-in-the-Middle (FIM) Code Completion
```python
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# FIM format: <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>
prefix = """def calculate_statistics(data: list[float]) -> dict:
    \"\"\"Calculate descriptive statistics for a list of floats.\"\"\"
    if not data:
        raise ValueError("Input list must not be empty")
    n = len(data)
    mean = sum(data) / n
    """

suffix = """
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "min": min(data),
        "max": max(data),
    }
"""

fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
inputs = processor(text=fim_prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,  # required for temperature to take effect
        temperature=0.1,
    )

completion = processor.decode(
    generated_ids[0][len(inputs.input_ids[0]):],
    skip_special_tokens=True,
)
print(completion)
```
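For editor integrations, it can help to wrap the sentinel layout in a small helper. This is a sketch using the token names from the Special Token Map above; `build_fim_prompt` is a hypothetical helper, not part of any released tooling:

```python
# Sentinel strings from the model's special token map.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix/suffix pair into the PSM-ordered FIM prompt."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n")
print(prompt)
```

The assembled string ends with `<|fim_middle|>`, so the model's continuation is exactly the infilled span between prefix and suffix.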
### Architecture Diagram to Terraform IaC
```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

image = Image.open("./aws_architecture.png").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "Generate a complete Terraform configuration for all AWS services shown in this architecture diagram. Include VPC, subnets, security groups, and IAM roles.",
            },
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=4096)

output = processor.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)
```
### Quantized Inference (4-bit NF4, ~1.5 GB VRAM)
```python
# Requires bitsandbytes: pip install bitsandbytes
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

processor = LlavaNextProcessor.from_pretrained("8F-ai/Verus-0.8b")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "8F-ai/Verus-0.8b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Why 125K Context?
The 125,000-token ceiling was a deliberate engineering choice over the base model's theoretical maximum:
| Metric | 128K (base max) | 125K (Verus) | Delta |
|---|---|---|---|
| Peak KV-cache (bfloat16, 1 image) | ~6.40 GB | ~6.25 GB | −2.3% |
| Throughput (tok/s, RTX 4090) | ~880 | ~920 | +4.5% |
| Max stable batch size (8 GB VRAM) | 1 | 2 | +100% |
| Effective code reasoning capacity | ✅ | ✅ | No change |
At 125K tokens, Verus can hold an entire medium-sized TypeScript application (~85K tokens), a 400-page PDF API specification, and dozens of interleaved UI screenshots within a single session — with headroom to spare.
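The KV-cache figure can be reproduced from the architecture table above. This is a back-of-envelope calculation under the stated config (24 layers, 8 KV heads, head dim 64, bfloat16); it covers the text-token cache alone, not weights or activations:

```python
# Per-token KV-cache size: 2 tensors (K and V) per layer, stored in bfloat16.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64
BYTES_BF16 = 2

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16  # 49,152 bytes

cache_gb_125k = bytes_per_token * 125_000 / 1e9
print(f"{cache_gb_125k:.2f} GB")  # prints 6.14 GB
```

The ~6.14 GB result is in line with the ~6.25 GB peak in the table; the small gap plausibly covers the image tokens included in that measurement.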
## Intended Use Cases
| Use Case | Input | Output |
|---|---|---|
| UI Screenshot → Frontend | Figma / screenshot PNG | React + Tailwind TSX |
| Wireframe → Component | Hand-drawn sketch photo | Accessible HTML / SwiftUI |
| ERD → SQL Schema | Entity-relationship diagram | PostgreSQL DDL |
| Architecture Diagram → IaC | AWS / GCP / Azure diagram | Terraform HCL / Pulumi |
| Flowchart → Business Logic | BPMN / flowchart PNG | Python / TypeScript function |
| FIM Code Completion | Prefix + suffix context | Infilled code block |
| Long-Context Code Review | Entire repo file tree (up to ~90K tokens) | Inline suggestions |
## Limitations
- Single Image Per Turn: One image per conversation turn is supported in the current release. Multi-image support is planned for v1.0.
- Fine-grained Text in Images: Text smaller than ~10pt in source images may be misread. Pre-scale images to at least 1024px on the shorter edge.
- English-Primary: Fine-tuning was conducted predominantly on English-language code and documentation. Non-English UI labels or comments may reduce output quality.
- Mathematical/Scientific Diagrams: Verus is not optimized for scientific plots, LaTeX notation, or engineering schematics. Use domain-specific models for those tasks.
- Context Budget with Large Images: High-resolution images may consume 2,000–6,000 tokens. Monitor total sequence length when combining large images with long code contexts.
## Citation
If Verus-0.8b is useful in your research or products, please cite:
```bibtex
@misc{verus2025,
  title        = {Verus-0.8b: A Compact Multimodal Coding Assistant with LLaVA-Next Architecture and Fill-in-the-Middle Support},
  author       = {{8F-ai}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/8F-ai/Verus-0.8b}},
  note         = {Apache 2.0 License}
}
```
## License
Verus-0.8b is released under the Apache License 2.0. See LICENSE for full terms.
Derived from Qwen/Qwen3.5-0.8B (Apache 2.0) and the LLaVA-Next architecture (Apache 2.0). Vision encoder based on CLIP ViT-L/14 (MIT License).