# Verus-0.8b

License: Apache 2.0

This repository contains model weights and configuration files for Verus-0.8b in the Hugging Face Transformers format.

These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, llama.cpp (GGUF export), and other major inference frameworks.

Given its compact parameter scale, the primary intended use cases are UI-to-Code translation, Diagram-to-Implementation scaffolding, Fill-in-the-Middle (FIM) code completion, code review assistance, and task-specific fine-tuning.

## Verus-0.8b Highlights

Verus-0.8b represents a focused advance in small-scale multimodal coding models, combining a battle-tested vision encoder with a coding-optimized language backbone:

- Coding-First Mistral Backbone: The language model uses the Mistral architecture — the same foundation powering Codestral — featuring Sliding Window Attention (SWA, window = 4,096 tokens), Grouped Query Attention (16 Q / 8 KV heads), and a RoPE theta of 1,000,000 for robust long-range code reasoning up to 125K tokens.

- Native Fill-in-the-Middle (FIM): First-class FIM support via dedicated special tokens (`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`) enables accurate single-line and multi-line code infilling — a critical capability for IDE integration and copilot-style workflows.

- High-Fidelity Vision Understanding: The CLIP ViT-L/14@336px encoder with LLaVA-Next multi-resolution grid patching (5 pinpoints, up to 1008 px on the long side) delivers sharp structural understanding of UI wireframes, architecture diagrams, flowcharts, and ERDs.

- 125K Token Context Window: A deliberately chosen 125,000-token ceiling balances memory efficiency against long-form code reasoning, allowing Verus to process entire codebases or lengthy specification documents in a single forward pass.

- Compact and Deployable: At 0.8B parameters in bfloat16, Verus-0.8b fits comfortably in 4 GB VRAM. With 4-bit quantization (NF4), inference is possible on consumer-grade GPUs with as little as 1.5 GB VRAM.
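
As a back-of-the-envelope check on these memory claims, the full bfloat16 KV-cache for one sequence can be estimated directly from the backbone numbers in the spec table below (24 layers, 8 KV heads, head dimension 64); this is simple arithmetic, not a measurement:

```python
def kv_cache_gb(seq_len: int, layers: int = 24, kv_heads: int = 8,
                head_dim: int = 64, dtype_bytes: int = 2) -> float:
    """Size in GB of the K and V caches for a single sequence (batch size 1)."""
    # K and V each hold seq_len * kv_heads * head_dim values per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

print(round(kv_cache_gb(125_000), 2))  # ≈ 6.14 GB at the full 125K context
```

With full 16-head multi-head attention the same cache would be twice as large; GQA with 8 KV heads is what keeps the 125K window tractable on a single consumer GPU.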

## Model Overview

- Type: Multimodal Causal Language Model (LLaVA-Next Architecture)
- Training Stage: Pre-training & Post-training (SFT + RLHF)
- Base Text Model: Qwen/Qwen3.5-0.8B
- Chat Format: ChatML (`<|im_start|>` / `<|im_end|>`)
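
For reference, a single-turn exchange in this ChatML format looks like the following (hand-assembled for illustration — in practice `apply_chat_template` produces it for you, and the user text here is arbitrary):

```python
# Hand-built ChatML prompt using the turn markers listed above.
prompt = (
    "<|im_start|>user\n"
    "Write a hello-world HTTP server in Python.<|im_end|>\n"
    "<|im_start|>assistant\n"   # generation continues from here
)
print(prompt)
```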

## Language Model — Mistral Backbone

| Property | Value |
|---|---|
| Parameters | ~0.8B |
| Hidden Dimension | 1024 |
| Number of Layers | 24 |
| Attention Heads (Q / KV) | 16 / 8 (GQA) |
| Head Dimension | 64 |
| FFN Intermediate Dimension | 3,584 |
| FFN Activation | SwiGLU |
| Sliding Window Size | 4,096 tokens |
| RoPE Theta | 1,000,000 |
| RMS Norm Epsilon | 1e-5 |
| Vocabulary Size | 32,064 |
| Context Length | 125,000 tokens |
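
The same hyperparameters, written as the corresponding `config.json`-style fields for orientation (field names follow Hugging Face Mistral config conventions; this is a sketch, not the shipped configuration file):

```python
text_config = {
    "hidden_size": 1024,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,   # query heads
    "num_key_value_heads": 8,    # GQA: 2 query heads share each KV head
    "intermediate_size": 3584,
    "hidden_act": "silu",        # the SwiGLU gate uses SiLU
    "sliding_window": 4096,
    "rope_theta": 1_000_000.0,
    "rms_norm_eps": 1e-5,
    "vocab_size": 32064,
    "max_position_embeddings": 125_000,
}

# Head dimension is implied: hidden_size / num_attention_heads = 64.
assert text_config["hidden_size"] // text_config["num_attention_heads"] == 64
```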

## Vision Encoder — CLIP ViT-L/14@336px

| Property | Value |
|---|---|
| Architecture | ViT-L/14 (CLIP) |
| Input Resolution | 336 × 336 px |
| Patch Size | 14 × 14 px |
| Number of Layers | 24 |
| Hidden Size | 1,024 |
| Intermediate Size | 4,096 |
| Attention Heads | 16 |
| Activation | QuickGELU |
| Feature Extraction Layer | −2 (penultimate) |

## Multimodal Bridge — LLaVA-Next Projector

| Property | Value |
|---|---|
| Architecture | LlavaNextForConditionalGeneration |
| Projector Activation | GELU |
| Image Token Index | 32000 |
| Multi-Resolution Grid Pinpoints | [336×672], [672×336], [672×672], [1008×336], [336×1008] |
| Vision Feature Strategy | default (patch tokens only) |
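
A rough token-cost estimate for images under this grid scheme, assuming each 336×336 tile contributes (336 / 14)² = 576 patch tokens (CLS dropped under the "default" strategy) and that, as in LLaVA-Next, one base-resolution tile of the whole image is prepended; separator tokens are ignored, so treat this as a lower bound:

```python
TILE, PATCH = 336, 14
PER_TILE = (TILE // PATCH) ** 2   # 576 patch tokens per 336 px tile

def image_tokens(width: int, height: int) -> int:
    """Approximate visual tokens for an image already snapped to a grid pinpoint."""
    grid_tiles = (width // TILE) * (height // TILE)
    return (grid_tiles + 1) * PER_TILE  # +1 for the base-resolution overview tile

print(image_tokens(672, 672))   # (4 grid tiles + 1 base tile) * 576 = 2880
```

This lines up with the 2,000–6,000 token budget quoted in the Limitations section below.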

## Special Token Map

| Token | ID | Purpose |
|---|---|---|
| `<\|image\|>` | 32000 | Image placeholder (LLaVA-Next standard) |
| `<\|im_start\|>` | 32001 | ChatML turn start |
| `<\|im_end\|>` | 32002 | ChatML turn end / EOS |
| `<\|vision_start\|>` | 32003 | Vision sequence boundary open |
| `<\|vision_end\|>` | 32004 | Vision sequence boundary close |
| `<\|image_pad\|>` | 32005 | Vision token padding |
| `<\|fim_prefix\|>` | 32006 | FIM: prefix sentinel |
| `<\|fim_middle\|>` | 32007 | FIM: infill target sentinel |
| `<\|fim_suffix\|>` | 32008 | FIM: suffix sentinel |
| `<\|fim_pad\|>` | 32009 | FIM: padding |
| `<\|endoftext\|>` | 32010 | Generic end-of-document |

## Quickstart

### Installation

```shell
pip install "transformers>=4.52.0" accelerate pillow requests torch
```

### Text-Only Code Generation

```python
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

messages = [
    {
        "role": "user",
        "content": "Write a Python async context manager that manages a PostgreSQL connection pool using asyncpg."
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens (skip the prompt).
output = processor.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)
```

### Multimodal — UI Screenshot to React Component

```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch
from io import BytesIO

MODEL_ID = "8F-ai/Verus-0.8b"

# ── Load model & processor ────────────────────────────────────────────────────
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# ── Load image ────────────────────────────────────────────────────────────────
# From URL:
response = requests.get("https://example.com/ui_mockup.png")
image = Image.open(BytesIO(response.content)).convert("RGB")
# From disk:  image = Image.open("./mockup.png").convert("RGB")

# ── Build multimodal conversation ─────────────────────────────────────────────
messages = [
    {
        "role": "system",
        "content": "You are Verus, an expert UI-to-Code assistant. Convert UI images into clean, production-ready code.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": (
                    "Convert this UI mockup to a React functional component using Tailwind CSS. "
                    "Include all interactive states (hover, focus, disabled), responsive breakpoints "
                    "(sm / md / lg), and export as default."
                ),
            },
        ],
    },
]

# ── Tokenize ──────────────────────────────────────────────────────────────────
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

# ── Generate ──────────────────────────────────────────────────────────────────
with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=True,          # required for temperature/top_p to take effect
        temperature=0.1,
        top_p=0.95,
        repetition_penalty=1.1,
    )

# ── Decode ────────────────────────────────────────────────────────────────────
output = processor.decode(
    generated_ids[0][len(inputs.input_ids[0]):],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output)
```

### Fill-in-the-Middle (FIM) Code Completion

```python
from transformers import LlavaNextForConditionalGeneration, AutoProcessor
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# FIM format: <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>
prefix = """def calculate_statistics(data: list[float]) -> dict:
    \"\"\"Calculate descriptive statistics for a list of floats.\"\"\"
    if not data:
        raise ValueError("Input list must not be empty")
    n = len(data)
    mean = sum(data) / n
"""

suffix = """
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "min": min(data),
        "max": max(data),
    }
"""

fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

inputs = processor(text=fim_prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,   # required for temperature to take effect
        temperature=0.1,
    )

completion = processor.decode(
    generated_ids[0][len(inputs.input_ids[0]):],
    skip_special_tokens=True,
)
print(completion)
```
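
One easy-to-miss detail of the FIM workflow: the model emits only the middle, so the final file is spliced back together in prefix–middle–suffix order, not in the order the sentinels appear in the prompt. A self-contained illustration with toy strings standing in for the variables above (not real model output):

```python
prefix = "def add(a: int, b: int) -> int:\n"
completion = "    return a + b\n"          # stands in for the model's infill
suffix = "\nprint(add(2, 3))\n"

# Splice: prefix + middle + suffix, even though the prompt orders them
# prefix, suffix, middle.
full_source = prefix + completion + suffix
print(full_source)
```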

### Architecture Diagram to Terraform IaC

```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import torch

MODEL_ID = "8F-ai/Verus-0.8b"

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

image = Image.open("./aws_architecture.png").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "Generate a complete Terraform configuration for all AWS services shown in this architecture diagram. Include VPC, subnets, security groups, and IAM roles.",
            },
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=4096)

output = processor.decode(generated_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(output)
```

### Quantized Inference (4-bit NF4, ~1.5 GB VRAM)

```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

processor = LlavaNextProcessor.from_pretrained("8F-ai/Verus-0.8b")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "8F-ai/Verus-0.8b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
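
The VRAM figures quoted in this card are consistent with simple weight-size arithmetic (NF4 stores roughly 0.5 bytes per parameter plus small quantization constants; the vision tower, activations, and the KV-cache add overhead on top of the weights):

```python
PARAMS = 0.8e9  # parameter count from the model card

weights_bf16_gb = PARAMS * 2 / 1e9    # 2 bytes per parameter in bfloat16
weights_nf4_gb = PARAMS * 0.5 / 1e9   # ~0.5 bytes per parameter under NF4

print(round(weights_bf16_gb, 1), round(weights_nf4_gb, 1))  # 1.6 0.4
```

The gap between 0.4 GB of NF4 weights and the quoted ~1.5 GB floor is the runtime overhead noted above.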

## Why 125K Context?

The 125,000-token ceiling was a deliberate engineering choice over the base model's theoretical maximum:

| Metric | 128K (base max) | 125K (Verus) | Delta |
|---|---|---|---|
| Peak KV-cache (bfloat16, 1 image) | ~6.40 GB | ~6.25 GB | −2.3% |
| Throughput (tok/s, RTX 4090) | ~880 | ~920 | +4.5% |
| Max stable batch size (8 GB VRAM) | 1 | 2 | +100% |
| Effective code reasoning capacity | | | No change |

At 125K tokens, Verus can hold an entire medium-sized TypeScript application (~85K tokens) together with a lengthy specification excerpt or a handful of interleaved UI screenshots in a single session — with headroom to spare.

## Intended Use Cases

| Use Case | Input | Output |
|---|---|---|
| UI Screenshot → Frontend | Figma / screenshot PNG | React + Tailwind TSX |
| Wireframe → Component | Hand-drawn sketch photo | Accessible HTML / SwiftUI |
| ERD → SQL Schema | Entity-relationship diagram | PostgreSQL DDL |
| Architecture Diagram → IaC | AWS / GCP / Azure diagram | Terraform HCL / Pulumi |
| Flowchart → Business Logic | BPMN / flowchart PNG | Python / TypeScript function |
| FIM Code Completion | Prefix + suffix context | Infilled code block |
| Long-Context Code Review | Entire repo file tree (up to ~90K tokens) | Inline suggestions |

## Limitations

- Single Image Per Turn: One image per conversation turn is supported in the current release. Multi-image support is planned for v1.0.
- Fine-Grained Text in Images: Text smaller than ~10 pt in source images may be misread. Pre-scale images to at least 1024 px on the shorter edge.
- English-Primary: Fine-tuning was conducted predominantly on English-language code and documentation. Non-English UI labels or comments may reduce output quality.
- Mathematical/Scientific Diagrams: Verus is not optimized for scientific plots, LaTeX notation, or engineering schematics. Use domain-specific models for those tasks.
- Context Budget with Large Images: High-resolution images may consume 2,000–6,000 tokens. Monitor total sequence length when combining large images with long code contexts.
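
A minimal helper for the pre-scaling advice above, computing the target size for a 1024 px short-edge floor (pure arithmetic so it stays dependency-free; pass the result to Pillow's `Image.resize`):

```python
def scaled_size(width: int, height: int, min_short: int = 1024) -> tuple[int, int]:
    """Return dimensions scaled up so the shorter edge is at least min_short."""
    short = min(width, height)
    if short >= min_short:
        return (width, height)        # already large enough; leave as-is
    scale = min_short / short
    return (round(width * scale), round(height * scale))

print(scaled_size(800, 600))   # → (1365, 1024)
```

Usage with Pillow would then be `image.resize(scaled_size(*image.size), Image.LANCZOS)`.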

## Citation

If Verus-0.8b is useful in your research or products, please cite:

```bibtex
@misc{verus2025,
  title        = {Verus-0.8b: A Compact Multimodal Coding Assistant with LLaVA-Next Architecture and Fill-in-the-Middle Support},
  author       = {8F-ai},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/8F-ai/Verus-0.8b}},
  note         = {Apache 2.0 License}
}
```

## License

Verus-0.8b is released under the Apache License 2.0. See LICENSE for full terms.

Derived from Qwen/Qwen3.5-0.8B (Apache 2.0) and the LLaVA-Next architecture (Apache 2.0). Vision encoder based on CLIP ViT-L/14 (MIT License).


Built with ❤️ by the 8F-ai Team