Instructions for using mkd-ai/keural-VLM with libraries and local apps.

Libraries

How to use mkd-ai/keural-VLM with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="mkd-ai/keural-VLM")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("mkd-ai/keural-VLM", dtype="auto")

Local Apps

vLLM

How to use mkd-ai/keural-VLM with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "mkd-ai/keural-VLM"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mkd-ai/keural-VLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
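
The same request can be assembled from Python using only the standard library. This is a minimal sketch that assumes a server started as above is listening on `localhost:8000`; `build_chat_request` is an illustrative helper name, not part of vLLM or any client library.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "mkd-ai/keural-VLM",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Sending needs a running server, so it is left commented out here:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

Passing `"http://localhost:30000"` as `base_url` targets a server on port 30000 instead.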

Use Docker

docker model run hf.co/mkd-ai/keural-VLM

SGLang

How to use mkd-ai/keural-VLM with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mkd-ai/keural-VLM" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mkd-ai/keural-VLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mkd-ai/keural-VLM" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mkd-ai/keural-VLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Keural VLM: Vision-Language Model (PoC · V0.1)

Keural VLM is a proof-of-concept vision-language model developed to explore lightweight multimodal learning using a custom vision encoder. Unlike many existing VLMs, the model does not rely on a pretrained CLIP backbone. Instead, the visual encoder is trained from scratch and connected to Mistral-7B-Instruct through a lightweight projection module.

Image → CNN Stem → Adaptive Token Budget → Spatial Transformer → LevelAware Projector → Mistral-7B → Answer

Highlights


| Highlight | Details |
|---|---|
| From scratch | 24.7M vision encoder, 12.4× smaller than LLaVA's CLIP encoder (307M) |
| Adaptive Token Budget (ATB) | Token count is a runtime knob: dense regions get more tokens, blank regions fewer |
| Hierarchical Concept Tokenization (HCT) | Every token carries a semantic level: global / region / detail |
| 3-phase pipeline | Vision pretraining → projector alignment → SFT + DPO |
| DPO improved every benchmark | e.g. VQAv2 +30.7pp, ScienceQA +14.0pp |

What This Is

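A complete Vision-Language Model (VLM) proof of concept whose vision side is built entirely from scratch: no CLIP and no pretrained vision backbone. The language model is a pretrained Mistral-7B-Instruct-v0.3, adapted with LoRA.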

| Component | Details |
|---|---|
| Vision Encoder | 24.7M params, trained from scratch on CC3M + CC12M (~15M pairs) |
| Projector | `LevelAwareProjector` (384 → 2048 → 4096) |
| LLM | `Mistral-7B-Instruct-v0.3` (4-bit NF4 QLoRA) |
| SFT | LLaVA-Instruct-150K, 30,000 steps |
| DPO | RLHF-V dataset, 5,733 pairs, 3,000 steps |

Architecture

Image (256×256)
    ↓
CNN Stem  →  ATB Tokenizer  →  Spatial Transformer (12 layers, embed_dim=384)
    ↓
KeuralEncoderOutput  {tokens, level_ids, spatial_metadata, saliency_scores, pooled}
    ↓
LevelAwareProjector  (384 → 2048 → 4096)
    ↓
Visual Tokens  (N_vis × 4096)
    ↓
Mistral-7B-Instruct-v0.3  +  SFT LoRA  +  DPO LoRA
    ↓
Text Response

Key Innovations

Adaptive Token Budget (ATB) Tokenization — token count is a runtime parameter; dense regions get more tokens, blank regions fewer.

out = encoder(image, token_budget=64)    # fast / cheap
out = encoder(image, token_budget=256)   # default
out = encoder(image, token_budget=1024)  # full fidelity
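
How the budget is distributed across the image is not spelled out in this card. As a rough illustration only, the sketch below splits a fixed token budget across regions in proportion to a saliency score, so busy regions receive more tokens than blank ones. The function name `allocate_tokens`, the per-region saliency list, and the largest-remainder rounding are all assumptions made for this example, not the Keural implementation.

```python
def allocate_tokens(saliency, token_budget, min_tokens=0):
    """Split token_budget across regions in proportion to saliency.

    Hypothetical helper: it illustrates the idea of an adaptive token
    budget and is not the actual ATB tokenizer.
    """
    n = len(saliency)
    if token_budget < n * min_tokens:
        raise ValueError("token_budget too small for min_tokens per region")
    total = sum(saliency)
    spare = token_budget - n * min_tokens
    if total <= 0:
        # No saliency signal: fall back to a uniform split.
        shares = [spare / n] * n
    else:
        shares = [spare * s / total for s in saliency]
    counts = [int(x) for x in shares]  # floor of each share
    # Largest-remainder rounding: leftover tokens go to the regions
    # whose fractional share was cut off the most.
    leftover = spare - sum(counts)
    order = sorted(range(n), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return [c + min_tokens for c in counts]


# Four regions: one busy, two moderate, one blank (integer saliency weights).
saliency = [6, 2, 2, 0]
for budget in (64, 256, 1024):
    print(budget, allocate_tokens(saliency, budget))
```

Whatever the budget, the counts always sum to it exactly, which is the property that lets the token count act as a runtime knob.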

Hierarchical Concept Tokenization (HCT) — every token carries a semantic level tag.

out = encoder(image)
print(out.level_ids)  # {0=global, 1=region, 2=detail}
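
To show how such level tags might be consumed downstream, here is a small sketch that groups token indices by level. The plain list of integers stands in for `out.level_ids`, whose real type and shape are not documented in this card, and `LEVEL_NAMES` and `group_by_level` are hypothetical names.

```python
from collections import defaultdict

# Mapping taken from the comment above; the constant name is illustrative.
LEVEL_NAMES = {0: "global", 1: "region", 2: "detail"}


def group_by_level(level_ids):
    """Return {level name: [token indices]} for a flat sequence of level ids."""
    groups = defaultdict(list)
    for index, level in enumerate(level_ids):
        groups[LEVEL_NAMES[level]].append(index)
    return dict(groups)


# Stand-in for out.level_ids: one global, two region, three detail tokens.
level_ids = [0, 1, 1, 2, 2, 2]
print(group_by_level(level_ids))
```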

Training Pipeline

| Phase | What trains | Data | Steps | Result |
|---|---|---|---|---|
| 1 · Vision Encoder | Encoder from scratch | CC3M + CC12M (~15.3M) | ~75,000 | 24.7M params · 1× RTX 5090 |
| 2A · Projector Align | Projector + LLM LoRA (r=64), encoder frozen | LLaVA-Instruct-150K | 10,000 | n/a |
| 2B · SFT | LLM LoRA (r=64, α=128) | LLaVA-Instruct-150K | 30,000 | final loss 1.022 |
| 2B · DPO | DPO LoRA (r=16, α=32) | RLHF-V (5,733 pairs) | 3,000 | loss 0.235 · reward acc 95% · margin 2.11 |

Benchmark Results

Evaluated on 1,000 samples each (where applicable). The vision encoder is 12.4× smaller than LLaVA's CLIP encoder (307M).

| Benchmark | Keural SFT-30K | Keural SFT+DPO | LLaVA 1.5 (307M enc) | LLaVA 1.6 (307M enc) |
|---|---|---|---|---|
| VQAv2 Accuracy | 12.9% | 43.6% | 78.5% | 81.8% |
| POPE F1 | 66.9% | 67.0% | 85.9% | 86.5% |
| MME Total Score | 704.3 | 838.8 | 1510.7 | 1519.3 |
| TextVQA Accuracy | 0.8% | 6.6% | 58.2% | 64.9% |
| ScienceQA Accuracy | 39.7% | 53.7% | 66.8% | 70.6% |

POPE F1 (67.0%) is the standout: it trails LLaVA 1.6 by 19.5pp while using a 12× smaller encoder. DPO improved every benchmark, most dramatically VQAv2 (+30.7pp) and ScienceQA (+14.0pp), while POPE F1 moved only marginally (+0.1pp). TextVQA is low because the model received no OCR training; the EasyOCR integration in the GUI is intended to cover this gap.
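
The gains quoted above follow directly from the table. The short script below recomputes them, along with the encoder size ratio; the numbers are copied from the table and the text, and the function and variable names are only illustrative.

```python
# Scores copied from the benchmark table: (SFT-30K, SFT+DPO, LLaVA 1.6).
results = {
    "VQAv2": (12.9, 43.6, 81.8),
    "POPE F1": (66.9, 67.0, 86.5),
    "MME": (704.3, 838.8, 1519.3),
    "TextVQA": (0.8, 6.6, 64.9),
    "ScienceQA": (39.7, 53.7, 70.6),
}


def dpo_gain(name):
    """Improvement of SFT+DPO over SFT alone, in the benchmark's own units."""
    sft, dpo, _ = results[name]
    return round(dpo - sft, 1)


def gap_to_llava16(name):
    """How far SFT+DPO trails LLaVA 1.6, in the benchmark's own units."""
    _, dpo, llava = results[name]
    return round(llava - dpo, 1)


for name in results:
    print(f"{name}: DPO gain {dpo_gain(name)}, gap to LLaVA 1.6 {gap_to_llava16(name)}")

# Encoder size ratio quoted in the text: 307M (CLIP) vs 24.7M (Keural).
print(round(307 / 24.7, 1))
```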

Qualitative & Saliency

The ATB tokenizer concentrates tokens on salient regions. Left → right: original · saliency heatmap · token placement.

DPO Training Curves

| Metric | Step 0 | Step 3000 |
|---|---|---|
| Loss | 0.694 | 0.235 |
| Reward Accuracy | ~50% | 95% |
| Reward Margin | 0.0 | 2.11 |
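
The step-0 values are what the standard DPO objective predicts before any learning: with a reward margin of 0 the per-pair loss is -log σ(0) = ln 2 ≈ 0.693, in line with the 0.694 reported, and a policy that cannot yet separate chosen from rejected answers scores about 50% reward accuracy. The sketch below evaluates that loss as a function of the (already β-scaled) reward margin. It assumes the standard sigmoid DPO loss, since the training code is not shown in this card, and `dpo_pair_loss` is an illustrative name.

```python
import math


def dpo_pair_loss(reward_margin):
    """Standard sigmoid DPO loss for one preference pair.

    reward_margin = reward(chosen) - reward(rejected), where each reward is
    beta * (policy log-prob - reference log-prob).
    Loss = -log(sigmoid(reward_margin)).
    """
    # log(1 + exp(-m)) is the same quantity as -log(sigmoid(m)).
    return math.log1p(math.exp(-reward_margin))


print(round(dpo_pair_loss(0.0), 3))   # margin 0, as at step 0
print(round(dpo_pair_loss(2.11), 3))  # a single pair sitting at the final mean margin
```

The reported final loss (0.235) is an average over pairs, so it does not have to equal the loss evaluated at the mean margin.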

Repository Structure

keural-vlm-poc/
├── vision_encoder/     # Keural encoder weights (config + safetensors)
├── sft_adapter/        # SFT LoRA (30K steps)
├── dpo_adapter/        # DPO LoRA (3K steps, RLHF-V) — stacks on SFT
├── assets/             # figures, diagrams, animation
└── tokenizer.json …    # Mistral tokenizer + chat template

For inference, load: Vision Encoder → Projector → Mistral-7B + SFT LoRA + DPO LoRA.

Usage

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from alignment.projectors import LevelAwareProjector

device = "cuda"

# 1. Vision encoder (frozen)
encoder = AutoModel.from_pretrained(
    "mkd-hika/keural-vision-encoder-poc",
    trust_remote_code=True, torch_dtype=torch.bfloat16,
).to(device).eval()

# 2. Projector
projector = LevelAwareProjector(encoder_dim=384, hidden_dim=2048, llm_dim=4096)
projector.load_state_dict(torch.load("projector.pt", map_location=device))
projector = projector.to(device, dtype=torch.bfloat16).eval()

# 3. LLM + SFT LoRA + DPO LoRA
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                             bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
base_llm = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_cfg, device_map="auto", torch_dtype=torch.bfloat16,
)
llm = PeftModel.from_pretrained(base_llm, "sft_adapter")        # SFT LoRA
llm = PeftModel.from_pretrained(llm, "dpo_adapter")             # DPO LoRA
llm.eval()

Roadmap

| Phase | Encoder Params | Hardware | Status |
|---|---|---|---|
| PoC (this model) | 24.7M | 1× RTX 5090 | Complete (SFT + DPO) |
| Mid-level | ~183.5M | 8× H100 80 GB | Planned |
| Commercial | ~1.1B | 64× H100 80 GB | Future |

Citation

@misc{keural_vlm_2026,
  title  = {Keural VLM: Vision-Language Model with Content-Adaptive Encoding via Saliency-Guided Token Budgets},
  author = {Hika Barki and MKD Co., Ltd.},
  year   = {2026},
}

📄 License

See LICENSE for details.

Training data: CC3M, CC12M, LLaVA-Instruct-150K, RLHF-V — respective data licenses apply.

MKD Co., Ltd. — 2026
