Keural VLM: Vision-Language Model (PoC · V0.1)

Keural VLM is a proof-of-concept vision-language model developed to explore lightweight multimodal learning with a custom vision encoder. Unlike many existing VLMs, it does not rely on a pretrained CLIP backbone: the visual encoder is trained from scratch and connected to Mistral-7B-Instruct through a lightweight projection module.


Figure: Keural VLM architecture animation

Image → CNN Stem → Adaptive Token Budget → Spatial Transformer → LevelAware Projector → Mistral-7B → Answer


Highlights

From-scratch 24.7M vision encoder: 12.4× smaller than LLaVA's CLIP encoder (307M)
Adaptive Token Budget (ATB): token count is a runtime knob; dense regions get more tokens, blank regions fewer
Hierarchical Concept Tokenization (HCT): every token carries a semantic level (global / region / detail)
3-phase pipeline: vision pretraining → projector alignment → SFT + DPO
DPO improved every benchmark, e.g. VQAv2 +30.7pp, ScienceQA +14.0pp

What This Is

A complete Vision-Language Model (VLM) proof-of-concept whose vision side is built entirely from scratch: no CLIP, no pretrained vision backbone.

| Component | Details |
| --- | --- |
| Vision Encoder | 24.7M params, trained from scratch on CC3M + CC12M (~15M pairs) |
| Projector | LevelAwareProjector (384 → 2048 → 4096) |
| LLM | Mistral-7B-Instruct-v0.3 (4-bit NF4 QLoRA) |
| SFT | LLaVA-Instruct-150K, 30,000 steps |
| DPO | RLHF-V dataset, 5,733 pairs, 3,000 steps |
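
For a sense of scale, the projector dimensions listed above allow a quick parameter estimate. The sketch below assumes the projector is a plain two-layer MLP with biases (384 → 2048 → 4096). That shape is an assumption: the actual LevelAwareProjector presumably carries level-specific parameters that are not counted here, and `linear_params` is a hypothetical helper.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of one dense layer: weight matrix plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)

# Dimensions come from the component table; the two-layer MLP shape is an assumption.
ENCODER_DIM, HIDDEN_DIM, LLM_DIM = 384, 2048, 4096

projector_params = (
    linear_params(ENCODER_DIM, HIDDEN_DIM)
    + linear_params(HIDDEN_DIM, LLM_DIM)
)
print(f"{projector_params:,}")  # 9,181,184
```

Under that assumption the projector holds roughly 9.2M parameters, a bit over a third of the encoder's 24.7M.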

Architecture

Image (256×256)
    ↓
CNN Stem  →  ATB Tokenizer  →  Spatial Transformer (12 layers, embed_dim=384)
    ↓
KeuralEncoderOutput  {tokens, level_ids, spatial_metadata, saliency_scores, pooled}
    ↓
LevelAwareProjector  (384 → 2048 → 4096)
    ↓
Visual Tokens  (N_vis × 4096)
    ↓
Mistral-7B-Instruct-v0.3  +  SFT LoRA  +  DPO LoRA
    ↓
Text Response

Figure: Keural VLM detailed architecture

Key Innovations

Adaptive Token Budget (ATB) Tokenization: token count is a runtime parameter; dense regions get more tokens, blank regions fewer.

out = encoder(image, token_budget=64)    # fast / cheap
out = encoder(image, token_budget=256)   # default
out = encoder(image, token_budget=1024)  # full fidelity
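
To make the idea of a saliency-driven budget concrete, here is a minimal sketch of one way a fixed token budget could be spread over image regions in proportion to their saliency. It is not the Keural implementation: `allocate_tokens`, its `min_tokens` floor, and the largest-remainder rounding are all assumptions chosen for illustration.

```python
def allocate_tokens(saliency, token_budget, min_tokens=1):
    """Split token_budget across regions in proportion to saliency.

    Hypothetical helper, not the Keural implementation. Every region gets
    at least min_tokens; the rest is shared by largest-remainder rounding.
    """
    n = len(saliency)
    if token_budget < n * min_tokens:
        raise ValueError("token_budget too small for the number of regions")
    spare = token_budget - n * min_tokens
    total = sum(saliency)
    # Fall back to a uniform split when the saliency map is all zeros.
    if total > 0:
        shares = [s * spare / total for s in saliency]
    else:
        shares = [spare / n] * n
    counts = [min_tokens + int(share) for share in shares]
    # Hand out tokens lost to truncation, largest fractional part first.
    leftover = token_budget - sum(counts)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts

# Four regions: one busy, two moderate, one blank (arbitrary saliency weights).
saliency = [6, 2, 2, 0]
for budget in (64, 256, 1024):
    print(budget, allocate_tokens(saliency, budget))
```

Whatever the budget, the counts always sum to it exactly, the busiest region receives the most tokens, and the blank region keeps only the floor.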

Hierarchical Concept Tokenization (HCT): every token carries a semantic level tag.

out = encoder(image)
print(out.level_ids)  # {0=global, 1=region, 2=detail}
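
Level tags let downstream code treat the three levels differently, for example keeping only global tokens for a cheap summary. The sketch below uses plain Python lists as stand-ins for `out.tokens` and `out.level_ids` (the real outputs are presumably tensors), and `split_by_level` is a hypothetical helper rather than part of the released encoder.

```python
from collections import defaultdict

# Level mapping as given in the comment above.
LEVEL_NAMES = {0: "global", 1: "region", 2: "detail"}

def split_by_level(tokens, level_ids):
    """Group tokens by semantic level name (hypothetical helper)."""
    if len(tokens) != len(level_ids):
        raise ValueError("tokens and level_ids must have the same length")
    groups = defaultdict(list)
    for token, level in zip(tokens, level_ids):
        groups[LEVEL_NAMES[level]].append(token)
    return dict(groups)

# Toy stand-ins: strings instead of 384-dim embeddings.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
level_ids = [0, 1, 1, 2, 2, 2]
groups = split_by_level(tokens, level_ids)
print({name: len(members) for name, members in groups.items()})
```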

Training Pipeline

| Phase | What trains | Data | Steps | Result |
| --- | --- | --- | --- | --- |
| 1 · Vision Encoder | Encoder from scratch | CC3M + CC12M (~15.3M) | ~75,000 | 24.7M params · 1× RTX 5090 |
| 2A · Projector Align | Projector + LLM LoRA (r=64), encoder frozen | LLaVA-Instruct-150K | 10,000 | - |
| 2B · SFT | LLM LoRA (r=64, α=128) | LLaVA-Instruct-150K | 30,000 | final loss 1.022 |
| 2B · DPO | DPO LoRA (r=16, α=32) | RLHF-V (5,733 pairs) | 3,000 | loss 0.235 · reward acc 95% · margin 2.11 |
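
The LoRA hyperparameters in the training table fix two quantities: the scaling factor α/r applied to the low-rank update, and the number of trainable parameters per wrapped layer, r·(d_in + d_out). The sketch below works both out. Which modules the adapters wrap is not stated in this card, so a single 4096 × 4096 projection is used purely as an illustration, and both helper names are hypothetical.

```python
def lora_scaling(alpha: int, r: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# (r, alpha) pairs from the training table.
configs = {"SFT": (64, 128), "DPO": (16, 32)}

# Illustrative layer only: the adapter target modules are not stated in this card.
D_IN = D_OUT = 4096

for name, (r, alpha) in configs.items():
    print(name, lora_scaling(alpha, r), f"{lora_params(D_IN, D_OUT, r):,}")
```

Both stages end up with the same scaling factor of 2, while the DPO adapter is four times smaller per wrapped layer than the SFT adapter.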

Benchmark Results

Evaluated on 1,000 samples each (where applicable). The vision encoder is 12.4× smaller than LLaVA's CLIP encoder (307M).

| Benchmark | Keural SFT-30K | Keural SFT+DPO | LLaVA 1.5 (307M enc) | LLaVA 1.6 (307M enc) |
| --- | --- | --- | --- | --- |
| VQAv2 Accuracy | 12.9% | 43.6% | 78.5% | 81.8% |
| POPE F1 | 66.9% | 67.0% | 85.9% | 86.5% |
| MME Total Score | 704.3 | 838.8 | 1510.7 | 1519.3 |
| TextVQA Accuracy | 0.8% | 6.6% | 58.2% | 64.9% |
| ScienceQA Accuracy | 39.7% | 53.7% | 66.8% | 70.6% |

POPE F1 (67.0%) is the standout: 19.5pp behind LLaVA 1.6 while using a 12.4× smaller encoder. DPO improved every benchmark, most dramatically VQAv2 (+30.7pp) and ScienceQA (+14.0pp). TextVQA is low by design, since the model received no OCR training; EasyOCR integration in the GUI bridges this gap.
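
The gains and gaps quoted here can be recomputed directly from the benchmark table. The short sketch below does so; the dictionary layout and the `summarize` helper are illustrative choices, and the numbers are copied from the table without any new measurements.

```python
# Scores copied from the benchmark table (MME is a raw score, the rest are percentages).
results = {
    "VQAv2":     {"sft": 12.9,  "dpo": 43.6,  "llava16": 81.8},
    "POPE F1":   {"sft": 66.9,  "dpo": 67.0,  "llava16": 86.5},
    "MME":       {"sft": 704.3, "dpo": 838.8, "llava16": 1519.3},
    "TextVQA":   {"sft": 0.8,   "dpo": 6.6,   "llava16": 64.9},
    "ScienceQA": {"sft": 39.7,  "dpo": 53.7,  "llava16": 70.6},
}

def summarize(row):
    """Return (gain from DPO, remaining gap to LLaVA 1.6), rounded to one decimal."""
    return round(row["dpo"] - row["sft"], 1), round(row["llava16"] - row["dpo"], 1)

summary = {name: summarize(row) for name, row in results.items()}
for name, (gain, gap) in summary.items():
    print(f"{name:<10} DPO gain {gain:+7.1f}   gap to LLaVA 1.6 {gap:7.1f}")
```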

Figure: All benchmarks

Qualitative & Saliency

The ATB tokenizer concentrates tokens on salient regions. Left → right: original · saliency heatmap · token placement.

Figures: Saliency (cat) · Qualitative examples

DPO Training Curves

| Metric | Step 0 | Step 3000 |
| --- | --- | --- |
| Loss | 0.694 | 0.235 |
| Reward Accuracy | ~50% | 95% |
| Reward Margin | 0.0 | 2.11 |
Figure: DPO training curves
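
The step-0 numbers match what the standard sigmoid DPO objective predicts when the policy still equals the reference model: the margin is 0, so the per-pair loss is -log σ(0) = ln 2 ≈ 0.693. The sketch below checks this. It assumes the reported reward margin is the β-scaled implicit reward difference between chosen and rejected responses (a common logging convention), and `dpo_pair_loss` is a hypothetical helper, not the training code.

```python
import math

def dpo_pair_loss(margin: float) -> float:
    """Sigmoid DPO loss for one preference pair: -log(sigmoid(margin)).

    margin is assumed to be the beta-scaled implicit reward of the chosen
    response minus that of the rejected response.
    """
    # Both branches equal -log(sigmoid(margin)); splitting on the sign keeps exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(round(dpo_pair_loss(0.0), 3))   # 0.693, i.e. ln 2
print(round(dpo_pair_loss(2.11), 3))
```

A pair sitting exactly at the final mean margin of 2.11 would contribute a loss of about 0.11, below the reported mean of 0.235. That is consistent with the objective being convex in the margin: the mean loss exceeds the loss at the mean margin, and the roughly 5% of pairs still ranked incorrectly weigh heavily.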

Repository Structure

keural-vlm-poc/
├── vision_encoder/     # Keural encoder weights (config + safetensors)
├── sft_adapter/        # SFT LoRA (30K steps)
├── dpo_adapter/        # DPO LoRA (3K steps, RLHF-V), stacks on SFT
├── assets/             # figures, diagrams, animation
└── tokenizer.json …    # Mistral tokenizer + chat template

For inference, load: Vision Encoder → Projector → Mistral-7B + SFT LoRA + DPO LoRA.


Usage

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from alignment.projectors import LevelAwareProjector

device = "cuda"

# 1. Vision encoder (frozen)
encoder = AutoModel.from_pretrained(
    "mkd-hika/keural-vision-encoder-poc",
    trust_remote_code=True, torch_dtype=torch.bfloat16,
).to(device).eval()

# 2. Projector
projector = LevelAwareProjector(encoder_dim=384, hidden_dim=2048, llm_dim=4096)
projector.load_state_dict(torch.load("projector.pt", map_location=device))
projector = projector.to(device, dtype=torch.bfloat16).eval()

# 3. LLM + SFT LoRA + DPO LoRA
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                             bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
base_llm = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_cfg, device_map="auto", torch_dtype=torch.bfloat16,
)
llm = PeftModel.from_pretrained(base_llm, "sft_adapter")        # SFT LoRA
llm = PeftModel.from_pretrained(llm, "dpo_adapter")             # DPO LoRA
llm.eval()
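
As a rough guide to what this loading code needs in GPU memory, the sketch below estimates weight storage alone. The LLM parameter count (about 7.25 billion) is an approximate figure that does not come from this card, and `weight_gib` is a hypothetical helper. Real usage is higher: some tensors are typically kept in higher precision, quantization constants add overhead, and activations and the KV cache come on top.

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Storage for n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# Approximate figure, not from this card: Mistral-7B-Instruct-v0.3 has about 7.25e9 parameters.
LLM_PARAMS = 7.25e9
ENCODER_PARAMS = 24.7e6  # from the component table

print(f"LLM, 4-bit NF4:   {weight_gib(LLM_PARAMS, 4):.2f} GiB")
print(f"LLM, bf16:        {weight_gib(LLM_PARAMS, 16):.2f} GiB")
print(f"Encoder, bf16:    {weight_gib(ENCODER_PARAMS, 16):.3f} GiB")
```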

Roadmap

| Phase | Encoder Params | Hardware | Status |
| --- | --- | --- | --- |
| PoC (this model) | 24.7M | 1× RTX 5090 | Complete (SFT + DPO) |
| Mid-level | ~183.5M | 8× H100 80 GB | Planned |
| Commercial | ~1.1B | 64× H100 80 GB | Future |

Citation

@misc{keural_vlm_2026,
  title  = {Keural VLM: Vision-Language Model with Content-Adaptive Encoding via Saliency-Guided Token Budgets},
  author = {Hika Barki and {MKD Co., Ltd.}},
  year   = {2026},
}

License

See LICENSE for details.

Training data: CC3M, CC12M, LLaVA-Instruct-150K, RLHF-V; respective data licenses apply.

MKD Co., Ltd., 2026
