Mini-LLaVA v3: Korean Multilingual + OOD Detection + Slim Deploy
v3 resolves the three problems v2 left open: Korean responses, hallucination, and deployment weight. Korean is handled by retraining on mixed data, hallucination by adding an inference wrapper plus an OOD layer, and deployment weight by the Slim adapter (1045 MB → 8.28 MB). Each problem is treated separately, at the training, analysis, or inference stage. These are the trained weights of a Vision-Language Model implemented from scratch: CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).
⚠️ Size ≠ performance: the Slim adapter (8.28 MB) is the same model with the same output (greedy 7/7 bit-identical). The model did not get smarter; only the packaging became more efficient. The real capability improvements are Korean and OOD (see the Limitations section for the detailed trade-offs).
📦 Repository contents (~14 MB total)
projector.pt                      5.7 MB   ← MultiModalProjector (CLIP→LLM mapping)
lora_adapter_slim/
├─ adapter_config.json            1.1 KB   ← PEFT config (modules_to_save=None)
├─ adapter_model.safetensors      8.27 MB  ← LoRA weights (q/k/v/o, r=16)
├─ image_token_row.safetensors    7.17 KB  ← only the single <image> token row (the core of slim)
└─ README.md                               (PEFT auto-generated)
−99.21% versus v2 (1045 MB → 8.28 MB). For how the slimming works, see §Slim Adapter in the GitHub README.
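The size figures above can be re-derived with a few lines of arithmetic. This is only a sanity-check sketch: the constants are copied from the file listing, and the helper name `reduction_percent` is made up for illustration.

```python
# Sizes in MB, copied from the file listing above.
V2_ADAPTER_MB = 1045.0
SLIM_ADAPTER_MB = 8.28   # adapter_model + image_token_row + config
PROJECTOR_MB = 5.7


def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Relative size reduction, in percent (hypothetical helper)."""
    return (before_mb - after_mb) / before_mb * 100.0


adapter_reduction = reduction_percent(V2_ADAPTER_MB, SLIM_ADAPTER_MB)
repo_total_mb = PROJECTOR_MB + SLIM_ADAPTER_MB

print(f"adapter reduction: {adapter_reduction:.2f}%")  # 99.21%
print(f"repo total: {repo_total_mb:.2f} MB")           # 13.98 MB
```

The total of roughly 13.98 MB is where the "~14 MB" figure comes from.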
Quick Start
import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 src code (GitHub)
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference
image = Image.open("path/to/image.jpg").convert("RGB")
# Korean prompt: "What do you see in this image?"
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# When calling generate, pass output_scores=True to obtain first_logits,
# then call detector.score(image, first_logits).
✨ v2 → v3 changes (capability and deployment kept separate)
🟢 Capability additions (things the model can newly do; real performance improvements)
| Item | v2 | v3 (this repo) |
|---|---|---|
| Multilingual responses | ✗ English only (catastrophic forgetting) | ✓ English + Korean |
| OOD signal | ✗ Always answers (hallucination) | ✓ Added an "I'm not sure" layer (CLIP + entropy, validated on N=2; full ROC analysis is planned for v4) |
🔵 Deployment optimization (zero performance change, deployment efficiency only)
| Item | v2 | v3 |
|---|---|---|
| LoRA adapter | 1045 MB | 8.28 MB (−99.21%) |
| Total model assets | ~1051 MB | ~14 MB |
| Model output | (baseline) | bit-identical to FULL (greedy 7/7 verified) |
🟡 Unchanged
- Image understanding accuracy: v2 and v3 are at the same level because of the 0.5B LLM limit (to be addressed in v4 by scaling up the LLM)
- English VQA: v3 baseline 36.67% vs. v2 34.67% (+2.00%p; VQAv2, 50 samples, greedy decoding). Adding the inference wrapper has no effect on completely safe question sets. The wrapper's meaningful gain is in blocking POPE hallucinations (+3 to +20%p; details in the GitHub README)
🔧 Training data (Step 1, 175 minutes)
| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| KoLLaVA (LLaVA-Instruct translated into Korean with DeepL) | 4K | Korean |
| Total | 13K | Korean ratio 30.8% |
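The Korean ratio in the last row follows directly from the per-source counts. A minimal sketch, assuming "3K" and "4K" mean exactly 3,000 and 4,000 samples (the table gives only rounded figures):

```python
# (sample count, language) per source, read off the table above.
# Exact counts are an assumption; the table lists them only as 3K / 4K.
data_mix = {
    "VQAv2": (3_000, "en"),
    "LocalizedNarratives": (3_000, "en"),
    "A-OKVQA": (3_000, "en"),
    "KoLLaVA": (4_000, "ko"),
}

total_samples = sum(count for count, _ in data_mix.values())
korean_samples = sum(count for count, lang in data_mix.values() if lang == "ko")
korean_ratio = korean_samples / total_samples

print(f"total: {total_samples}, Korean ratio: {korean_ratio:.1%}")  # total: 13000, Korean ratio: 30.8%
```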
🛡️ OOD Detector (optional)
ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood = ood_score > 0.5 (default)
clip_signal: 1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats
Validation results (scripts/test_ood_integration.py): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️)
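The scoring rule above can be sketched in plain Python. This is not the repository's `OODDetector`. It is a standalone illustration that assumes the CLIP similarities and first-token logits are already available as lists of floats, and the function names are hypothetical. The formula does not mention clamping, so none is applied here.

```python
import math
from typing import Sequence

CLIP_WEIGHT = 0.6
ENTROPY_WEIGHT = 0.4
ENTROPY_SCALE_NATS = 8.0  # normalizer from the formula above


def softmax_entropy_nats(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ood_score(clip_similarities: Sequence[float],
              first_token_logits: Sequence[float]) -> float:
    """Weighted sum of the CLIP signal and the normalized entropy signal."""
    clip_signal = 1.0 - max(clip_similarities)
    entropy_signal = softmax_entropy_nats(first_token_logits) / ENTROPY_SCALE_NATS
    return CLIP_WEIGHT * clip_signal + ENTROPY_WEIGHT * entropy_signal


def is_ood(score: float, threshold: float = 0.5) -> bool:
    return score > threshold


# Toy inputs: best category similarity 0.80, uniform logits over 4 tokens.
score = ood_score([0.10, 0.80, 0.35], [0.0, 0.0, 0.0, 0.0])
print(round(score, 4), is_ood(score))
```

With the default threshold, the reported validation scores fall on opposite sides: 0.365 is in-distribution and 0.505 is flagged, though only just.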
🪶 Slim Adapter: 99% reduction (1045 MB → 8.28 MB)
Standard PEFT saves modules_to_save (embed_tokens + lm_head) in full → 1 GB.
A preliminary analysis, however, found:
saved embed_tokens vs base Qwen2.5:
  first 151,665 rows: max diff = 0.000000e+00 (exact match)
  last 1 row (<image> token): learned representation
→ Only image_token_row.safetensors (7 KB) is saved separately, and at inference only the last row of base Qwen2.5 is patched.
→ Greedy decoding: 7/7 responses are bit-identical (scripts/verify_slim_adapter.py).
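The idea can be illustrated on a toy embedding matrix. The sketch below is not the repository's code. It uses plain nested lists in place of tensors, assumes the base matrix has already been resized to include a row for the `<image>` token, and every function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def extract_image_row(trained: Matrix, base: Matrix) -> List[float]:
    """Confirm all rows but the last equal the base, then return the last row."""
    max_diff = max(
        abs(t - b)
        for t_row, b_row in zip(trained[:-1], base[:-1])
        for t, b in zip(t_row, b_row)
    )
    if max_diff != 0.0:
        raise ValueError(f"rows differ from base (max diff {max_diff:e})")
    return list(trained[-1])


def patch_last_row(base: Matrix, image_row: List[float]) -> Matrix:
    """Return a copy of base whose last row is replaced by the stored row."""
    patched = [list(row) for row in base]
    patched[-1] = list(image_row)
    return patched


# Toy stand-ins: 2 ordinary token rows + 1 <image> row, embedding dim 2.
base = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
trained = [[0.1, 0.2], [0.3, 0.4], [0.7, -0.5]]

image_row = extract_image_row(trained, base)  # the only data worth saving
patched = patch_last_row(base, image_row)     # what inference reconstructs

print(patched == trained)  # True
```

Because the reconstructed matrix equals the trained one element for element, downstream outputs should not change, which is consistent with the bit-identical greedy results reported above.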
⚠️ Limitations
- 0.5B LLM: accuracy on image content is still limited (for example, mistaking a dog for a bird)
- CLIP-ViT-B/32: 49 patches. A ViT-L/14 ablation was run, but the gain was limited, so it was not adopted
- 57 OOD categories: mostly COCO plus everyday objects. Adding categories is recommended when extending to a new domain
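The 49-patch figure for CLIP-ViT-B/32 follows from the image and patch sizes. A small sketch, assuming the standard 224×224 CLIP input resolution. The text does not state the resolution, nor which ViT-L/14 variant the ablation used, so the second number is illustrative only.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


# Assumed input resolution: 224 px (not stated in the text above).
vit_b32_patches = num_patches(224, 32)
vit_l14_patches = num_patches(224, 14)

print(vit_b32_patches, vit_l14_patches)  # 49 256
```

Under that assumption, ViT-L/14 would give the projector roughly five times as many visual tokens per image as ViT-B/32.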
Links
- Code: github.com/AD-Styles/vlm-from-scratch-v3
- Live Demo: HF Spaces (mini-llava-v3-demo)
- v2 baseline: github.com/AD-Styles/vlm-from-scratch
- 🤗 v2 weights: AD-Styles/mini-llava-stage2
- Triton/vLLM deploy: github.com/AD-Styles/nlp-triton-deployment
License
MIT, © 2026 김도윤 (AD-Styles)
Citation
@misc{kim2026minillavav3,
title = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
author = {Kim, Doyun},
year = {2026},
url = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}