# Verus-0.8b
This repository contains model weights and configuration files for Verus-0.8b in the Hugging Face Transformers format.
These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, llama.cpp (GGUF export), and other major inference frameworks.
Given its compact parameter scale, the primary intended use cases are UI-to-Code translation, Diagram-to-Implementation scaffolding, Fill-in-the-Middle code completion, code review assistance, and task-specific fine-tuning.
## Verus-0.8b Highlights
Verus-0.8b represents a focused advance in small-scale multimodal coding models, combining a battle-tested vision encoder with a coding-optimized language backbone:
- Coding-First Mistral Backbone: The language model uses the Mistral architecture — the same foundation powering Codestral — featuring Sliding Window Attention (SWA, window = 4,096 tokens), Grouped Query Attention (16 query / 8 KV heads), and a RoPE theta of 1,000,000 for robust long-range code reasoning up to 125K tokens.
- Native Fill-in-the-Middle (FIM): First-class FIM support via dedicated special tokens (`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`) enables accurate single-line and multi-line code infilling — a critical capability for IDE integration and copilot-style workflows.
- High-Fidelity Vision Understanding: The CLIP ViT-L/14@336px encoder with LLaVA-Next multi-resolution grid patching (5 pinpoints, up to 1008 px on the long edge) delivers sharp structural understanding of UI wireframes, architecture diagrams, flowcharts, and ERDs.
- 125K Token Context Window: A deliberately chosen 125,000-token ceiling balances memory efficiency against long-form code reasoning, allowing Verus to process entire codebases or lengthy specification documents in a single forward pass.
- Compact and Deployable: At 0.8B parameters in bfloat16, Verus-0.8b fits comfortably in 4 GB VRAM. With 4-bit quantization (NF4), inference is possible on consumer-grade GPUs with as little as 1.5 GB VRAM.
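As a rough sanity check on those memory figures, the weight footprint follows directly from the parameter count. This is a back-of-envelope sketch; activations, KV cache, the CUDA context, and quantization metadata all add overhead on top:

```python
# Back-of-envelope VRAM estimate for the raw weights only (excludes
# activations, KV cache, CUDA context, and framework overhead).
N_PARAMS = 0.8e9  # ~0.8B parameters

bf16_gb = N_PARAMS * 2 / 1e9    # bfloat16: 2 bytes per parameter
nf4_gb = N_PARAMS * 0.5 / 1e9   # NF4: ~0.5 bytes per parameter

print(f"bf16 weights: ~{bf16_gb:.1f} GB, NF4 weights: ~{nf4_gb:.1f} GB")
```

About 1.6 GB of bf16 weights leaves headroom inside the 4 GB figure for activations and KV cache, and ~0.4 GB of NF4 weights is consistent with the quoted ~1.5 GB end-to-end footprint.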
## Model Overview
- Type: Multimodal Causal Language Model (LLaVA-Next Architecture)
- Training Stage: Pre-training & Post-training (SFT + RLHF)
- Base Text Model: Qwen/Qwen3.5-0.8B
- Chat Format: ChatML (`<|im_start|>` / `<|im_end|>`)
### Language Model — Mistral Backbone
| Property | Value |
|---|---|
| Parameters | ~0.8B |
| Hidden Dimension | 1024 |
| Number of Layers | 24 |
| Attention Heads (Q / KV) | 16 / 8 (GQA) |
| Head Dimension | 64 |
| FFN Intermediate Dimension | 3,584 |
| FFN Activation | SwiGLU |
| Sliding Window Size | 4,096 tokens |
| RoPE Theta | 1,000,000 |
| RMS Norm Epsilon | 1e-5 |
| Vocabulary Size | 32,064 |
| Context Length | 125,000 tokens |
### Vision Encoder — CLIP ViT-L/14@336px
| Property | Value |
|---|---|
| Architecture | ViT-L/14 (CLIP) |
| Input Resolution | 336 × 336 px |
| Patch Size | 14 × 14 px |
| Number of Layers | 24 |
| Hidden Size | 1,024 |
| Intermediate Size | 4,096 |
| Attention Heads | 16 |
| Activation | QuickGELU |
| Feature Extraction Layer | -2 (penultimate) |
### Multimodal Bridge — LLaVA-Next Projector
| Property | Value |
|---|---|
| Architecture | LlavaNextForConditionalGeneration |
| Projector Activation | GELU |
| Image Token Index | 32000 |
| Multi-Resolution Grid Pinpoints | [336×672], [672×336], [672×672], [1008×336], [336×1008] |
| Vision Feature Strategy | default (patch tokens only) |
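The per-image token budget implied by these settings can be sketched from the patch geometry. This is assumed arithmetic: it ignores LLaVA-Next's unpadding and any row-separator tokens, which shift the exact count:

```python
# Tokens contributed by one image under LLaVA-Next grid patching.
PATCH = 14   # ViT patch size in pixels
TILE = 336   # encoder input resolution per tile

patches_per_tile = (TILE // PATCH) ** 2  # 24 x 24 = 576 patch tokens per tile

# A 672x672 pinpoint is split into four 336px tiles, plus one downscaled
# base-resolution view of the full image.
tiles = (672 // TILE) * (672 // TILE) + 1
tokens_672 = tiles * patches_per_tile

print(patches_per_tile, tokens_672)  # 576 patch tokens per tile, 2880 total
```

The ~2,900-token figure for a 672×672 image is in line with the 2,000–6,000 tokens-per-image budget noted under Limitations.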
### Special Token Map
| Token | ID | Purpose |
|---|---|---|
| `<\|image\|>` | 32000 | Image placeholder (LLaVA-Next standard) |
| `<\|im_start\|>` | 32001 | ChatML turn start |
| `<\|im_end\|>` | 32002 | ChatML turn end / EOS |
| `<\|vision_start\|>` | 32003 | Vision sequence boundary open |
| `<\|vision_end\|>` | 32004 | Vision sequence boundary close |
| `<\|image_pad\|>` | 32005 | Vision token padding |
| `<\|fim_prefix\|>` | 32006 | FIM: prefix sentinel |
| `<\|fim_middle\|>` | 32007 | FIM: infill target sentinel |
| `<\|fim_suffix\|>` | 32008 | FIM: suffix sentinel |
| `<\|fim_pad\|>` | 32009 | FIM: padding |
| `<\|endoftext\|>` | 32010 | Generic end-of-document |
## Quickstart
### Installation
```shell
pip install "transformers>=4.52.0" accelerate pillow requests torch
```
### Text-Only Code Generation
```python
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

messages = [
    {
        "role": "user",
        "content": "Write a Python async context manager that manages a PostgreSQL connection pool using asyncpg.",
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=2048)

output = processor.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)
```
### Multimodal — UI Screenshot to React Component
```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch
from io import BytesIO

MODEL_ID = "8F-ai/Verus-0.8b"

# ── Load model & processor ────────────────────────────────────────────────────
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# ── Load image ────────────────────────────────────────────────────────────────
# From URL:
response = requests.get("https://example.com/ui_mockup.png")
image = Image.open(BytesIO(response.content)).convert("RGB")
# From disk: image = Image.open("./mockup.png").convert("RGB")

# ── Build multimodal conversation ─────────────────────────────────────────────
messages = [
    {
        "role": "system",
        "content": "You are Verus, an expert UI-to-Code assistant. Convert UI images into clean, production-ready code.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": (
                    "Convert this UI mockup to a React functional component using Tailwind CSS. "
                    "Include all interactive states (hover, focus, disabled), responsive breakpoints "
                    "(sm / md / lg), and export as default."
                ),
            },
        ],
    },
]

# ── Tokenize ──────────────────────────────────────────────────────────────────
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# ── Generate ──────────────────────────────────────────────────────────────────
with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=True,  # required for temperature / top_p to take effect
        temperature=0.1,
        top_p=0.95,
        repetition_penalty=1.1,
    )

# ── Decode ────────────────────────────────────────────────────────────────────
output = processor.decode(
    generated_ids[0][len(inputs.input_ids[0]):],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output)
```
### Fill-in-the-Middle (FIM) Code Completion
```python
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# FIM format: <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>
prefix = """def calculate_statistics(data: list[float]) -> dict:
    \"\"\"Calculate descriptive statistics for a list of floats.\"\"\"
    if not data:
        raise ValueError("Input list must not be empty")
    n = len(data)
    mean = sum(data) / n
    """

suffix = """
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "min": min(data),
        "max": max(data),
    }
"""

fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
inputs = processor(text=fim_prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,  # required for temperature to take effect
        temperature=0.1,
    )

completion = processor.decode(
    generated_ids[0][len(inputs.input_ids[0]):],
    skip_special_tokens=True,
)
print(completion)
```
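For editor integrations, it can help to wrap the sentinel layout in a small helper. This is a sketch using the token names from the Special Token Map above; `build_fim_prompt` is a hypothetical helper, not part of any released tooling:

```python
# Sentinel strings from the model's special token map.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix/suffix pair into the PSM-ordered FIM prompt."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n")
print(prompt)
```

The assembled string ends with `<|fim_middle|>`, so the model's continuation is exactly the infilled span between prefix and suffix.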
### Architecture Diagram to Terraform IaC
```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

image = Image.open("./aws_architecture.png").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "Generate a complete Terraform configuration for all AWS services shown in this architecture diagram. Include VPC, subnets, security groups, and IAM roles.",
            },
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=4096)

output = processor.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)
```
### Quantized Inference (4-bit NF4, ~1.5 GB VRAM)
```python
# Requires bitsandbytes: pip install bitsandbytes
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

processor = LlavaNextProcessor.from_pretrained("8F-ai/Verus-0.8b")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "8F-ai/Verus-0.8b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Why 125K Context?
The 125,000-token ceiling was a deliberate engineering choice over the base model's theoretical maximum:
| Metric | 128K (base max) | 125K (Verus) | Delta |
|---|---|---|---|
| Peak KV-cache (bfloat16, 1 image) | ~6.40 GB | ~6.25 GB | −2.3% |
| Throughput (tok/s, RTX 4090) | ~880 | ~920 | +4.5% |
| Max stable batch size (8 GB VRAM) | 1 | 2 | +100% |
| Effective code reasoning capacity | ✅ | ✅ | No change |
At 125K tokens, Verus can hold an entire medium-sized TypeScript application (~85K tokens), a 400-page PDF API specification, and dozens of interleaved UI screenshots within a single session — with headroom to spare.
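The KV-cache figure can be reproduced from the architecture table above. This is a back-of-envelope calculation under the stated config (24 layers, 8 KV heads, head dim 64, bfloat16); it covers the text-token cache alone, not weights or activations:

```python
# Per-token KV-cache size: 2 tensors (K and V) per layer, stored in bfloat16.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64
BYTES_BF16 = 2

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16  # 49,152 bytes

cache_gb_125k = bytes_per_token * 125_000 / 1e9
print(f"{cache_gb_125k:.2f} GB")  # prints 6.14 GB
```

The ~6.14 GB result is in line with the ~6.25 GB peak in the table; the small gap plausibly covers the image tokens included in that measurement.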
## Intended Use Cases
| Use Case | Input | Output |
|---|---|---|
| UI Screenshot → Frontend | Figma / screenshot PNG | React + Tailwind TSX |
| Wireframe → Component | Hand-drawn sketch photo | Accessible HTML / SwiftUI |
| ERD → SQL Schema | Entity-relationship diagram | PostgreSQL DDL |
| Architecture Diagram → IaC | AWS / GCP / Azure diagram | Terraform HCL / Pulumi |
| Flowchart → Business Logic | BPMN / flowchart PNG | Python / TypeScript function |
| FIM Code Completion | Prefix + suffix context | Infilled code block |
| Long-Context Code Review | Entire repo file tree (up to ~90K tokens) | Inline suggestions |
## Limitations
- Single Image Per Turn: One image per conversation turn is supported in the current release. Multi-image support is planned for v1.0.
- Fine-grained Text in Images: Text smaller than ~10pt in source images may be misread. Pre-scale images to at least 1024px on the shorter edge.
- English-Primary: Fine-tuning was conducted predominantly on English-language code and documentation. Non-English UI labels or comments may reduce output quality.
- Mathematical/Scientific Diagrams: Verus is not optimized for scientific plots, LaTeX notation, or engineering schematics. Use domain-specific models for those tasks.
- Context Budget with Large Images: High-resolution images may consume 2,000–6,000 tokens. Monitor total sequence length when combining large images with long code contexts.
## Citation
If Verus-0.8b is useful in your research or products, please cite:
```bibtex
@misc{verus2025,
  title        = {Verus-0.8b: A Compact Multimodal Coding Assistant with LLaVA-Next Architecture and Fill-in-the-Middle Support},
  author       = {{8F-ai}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/8F-ai/Verus-0.8b}},
  note         = {Apache 2.0 License}
}
```
## License
Verus-0.8b is released under the Apache License 2.0. See LICENSE for full terms.
Derived from Qwen/Qwen3.5-0.8B (Apache 2.0) and the LLaVA-Next architecture (Apache 2.0). Vision encoder based on CLIP ViT-L/14 (MIT License).