Mini-LLaVA v3: Korean Multilingual + OOD Detection + Slim Deploy

v3 resolves all three problems that v2 left unsolved: Korean responses, hallucination, and deployment weight. Korean is handled by retraining on mixed data, hallucination by adding an inference wrapper plus an OOD layer, and deployment weight by the Slim adapter (1045 MB → 8.28 MB). The approach assigns each problem to a different stage: training, analysis, or inference. This repo holds the trained weights of a hand-implemented Vision-Language Model built from CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).

⚠️ Size ≠ performance: the Slim adapter (8.28 MB) is the same model with the same output (greedy decoding, 7/7 responses bit-identical). The model did not get smarter; only the packaging became more efficient. The real capability improvements are Korean and OOD (see Limitations for the detailed trade-offs).

📦 Contents of this repo (~14 MB total)

projector.pt                       5.7 MB   ← MultiModalProjector (CLIP→LLM mapping)
lora_adapter_slim/
├─ adapter_config.json             1.1 KB   ← PEFT config (modules_to_save=None)
├─ adapter_model.safetensors       8.27 MB  ← LoRA weights (q/k/v/o, r=16)
├─ image_token_row.safetensors     7.17 KB  ← the single <image> token row only (the core of the slim adapter)
└─ README.md (PEFT auto-generated)

−99.21% relative to v2 (1045 MB → 8.28 MB). For how the slimming works, see GitHub README §Slim Adapter.
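The size figures above can be reproduced with a few lines of arithmetic. This is only a sanity check on the numbers listed in this card; the helper name percent_reduction and the constant names are hypothetical.

```python
def percent_reduction(before_mb: float, after_mb: float) -> float:
    """Relative size reduction, in percent."""
    return (before_mb - after_mb) / before_mb * 100.0


# Sizes as listed in this card.
V2_ADAPTER_MB = 1045.0
V3_ADAPTER_MB = 8.28   # adapter_model.safetensors + image_token_row.safetensors
PROJECTOR_MB = 5.7

reduction = percent_reduction(V2_ADAPTER_MB, V3_ADAPTER_MB)
total_v3_mb = PROJECTOR_MB + V3_ADAPTER_MB

print(f"adapter reduction: {reduction:.2f}%")   # 99.21%
print(f"v3 total: {total_v3_mb:.2f} MB")        # 13.98 MB, i.e. ~14 MB
```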

🚀 Quick Start

import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 source code (GitHub)
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference
image = Image.open("path/to/image.jpg").convert("RGB")
# Korean prompt: "What do you see in this image?"
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# Call generate with output_scores=True to obtain first_logits, then call detector.score(image, first_logits)

✨ Changes from v2 to v3 (capability and deployment kept separate)

🟢 Capability additions (what the model can newly do: real performance improvements)

| Item | v2 | v3 (this repo) |
|---|---|---|
| Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ English + Korean |
| OOD signal | ❌ Always answers (hallucination) | ✅ "Not sure" layer added (CLIP + entropy, validated on N=2; a full ROC analysis is deferred to v4) |

🔵 Deployment optimization (zero performance change, deployment efficiency only)

| Item | v2 | v3 |
|---|---|---|
| LoRA adapter | 1045 MB | 8.28 MB (−99.21%) |
| Total model assets | ~1051 MB | ~14 MB |
| Model output | (baseline) | bit-identical to FULL (greedy decoding, 7/7 verified) |
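As a rough plausibility check on the adapter size in the table above, the number of LoRA parameters can be estimated from the layer shapes. The sketch below rests on assumptions that this card does not state: the Qwen2.5-0.5B dimensions (hidden size 896, 24 layers, key/value projection width 128) are taken from memory of the public model config, and float32 storage is assumed. The function name lora_params is hypothetical.

```python
# ASSUMED Qwen2.5-0.5B dimensions (not stated in this card).
HIDDEN = 896        # hidden size; q_proj and o_proj are HIDDEN x HIDDEN
KV_DIM = 128        # k_proj / v_proj output width (2 KV heads x 64)
NUM_LAYERS = 24
RANK = 16           # LoRA rank, as stated in this card
BYTES_PER_PARAM = 4  # ASSUMED float32 storage


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to one linear layer."""
    return rank * (in_features + out_features)


per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
total_params = per_layer * NUM_LAYERS
size_mib = total_params * BYTES_PER_PARAM / 2**20

print(f"LoRA parameters: {total_params:,}")
print(f"estimated size: {size_mib:.2f} MiB")
```

Under these assumptions the estimate lands at roughly 8.25 MiB, the same range as the 8.27 MB listed for adapter_model.safetensors. File metadata might explain the small remainder, but that is a guess rather than something verified here.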

🟡 What did not change

  • Image understanding accuracy: v2 and v3 are at the same level because of the 0.5B LLM's limits (to be addressed in v4 by scaling up the LLM)
  • English VQA: v3 baseline 36.67% vs v2 34.67% (+2.00 pp; VQAv2, 50 samples, greedy decoding). Adding the inference wrapper does not change scores on free-form questions either; the wrapper's meaningful gain is in blocking hallucination on POPE (+3 to +20 pp; see the GitHub README for details)

🧠 Training data (Step 1, 175 minutes)

| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| KoLLaVA (LLaVA-Instruct translated into Korean with DeepL) | 4K | Korean |
| Total | 13K | Korean ratio 30.8% |

🛡️ OOD Detector (optional)

ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood    = ood_score > 0.5  (default)

clip_signal:    1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats

Validation results (scripts/test_ood_integration.py): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️)
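The scoring rule above can be sketched in plain Python. This is a minimal illustration and not the repo's OODDetector: the function names (entropy_nats, ood_score, is_ood) are hypothetical, the CLIP similarities are assumed to be precomputed, and since the card does not say whether entropy_signal is clipped to 1.0, no clipping is applied here.

```python
import math

CLIP_WEIGHT = 0.6
ENTROPY_WEIGHT = 0.4
ENTROPY_SCALE_NATS = 8.0  # normalizer from the formula above


def entropy_nats(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ood_score(category_similarities, first_token_logits):
    """Weighted mix of the CLIP signal and the first-token entropy signal."""
    clip_signal = 1.0 - max(category_similarities)
    entropy_signal = entropy_nats(first_token_logits) / ENTROPY_SCALE_NATS
    return CLIP_WEIGHT * clip_signal + ENTROPY_WEIGHT * entropy_signal


def is_ood(score, threshold=0.5):
    """Flag the input as out-of-distribution when the score exceeds the threshold."""
    return score > threshold


# Toy inputs: similarities to two categories, uniform logits over four tokens.
toy_score = ood_score([0.2, 0.7], [0.0, 0.0, 0.0, 0.0])
print(round(toy_score, 4), is_ood(toy_score))
```

With the default threshold of 0.5, the two validation scores reported above give is_ood(0.365) as False and is_ood(0.505) as True, so the OOD example clears the threshold only narrowly.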

🪶 Slim Adapter: 99% reduction (1045 MB → 8.28 MB)

By default, PEFT saves modules_to_save (embed_tokens + lm_head) in full, which comes to about 1 GB. A prior analysis, however, found:

saved embed_tokens vs base Qwen2.5:
  first 151,665 rows: max diff = 0.000000e+00  (exact match)
  last 1 row (<image> token): learned representation

→ Only image_token_row.safetensors (7 KB) is saved separately, and at inference only the last row of the base Qwen2.5 embedding is patched. → With greedy decoding, 7/7 responses match bit for bit (scripts/verify_slim_adapter.py).
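The idea can be illustrated with a toy example. This is a sketch under stated assumptions, not the code in scripts/verify_slim_adapter.py: nested Python lists stand in for the embedding tensors, the matrices are tiny, and the helper names (max_abs_diff, extract_image_row, patch_last_row) are hypothetical.

```python
def max_abs_diff(base_rows, tuned_rows):
    """Largest element-wise |difference| over the rows both matrices share."""
    return max(
        abs(b - t)
        for base_row, tuned_row in zip(base_rows, tuned_rows)
        for b, t in zip(base_row, tuned_row)
    )


def extract_image_row(tuned_rows):
    """Keep only the last row, i.e. the learned <image> token embedding."""
    return list(tuned_rows[-1])


def patch_last_row(resized_base_rows, image_row):
    """Return a copy of the resized base matrix with its last row replaced."""
    patched = [list(row) for row in resized_base_rows]
    patched[-1] = list(image_row)
    return patched


# Toy stand-ins: 2 base vocabulary rows plus 1 added <image> row.
base = [[0.1, 0.2], [0.3, 0.4]]
tuned = base + [[0.9, -0.5]]          # what full fine-tuning would have saved

print(max_abs_diff(base, tuned))      # shared rows are unchanged: 0.0

image_row = extract_image_row(tuned)  # the only part worth storing
resized_base = base + [[0.0, 0.0]]    # base embedding resized for the new token
restored = patch_last_row(resized_base, image_row)
print(restored == tuned)              # True: the full matrix is recovered
```

Because every shared row is identical to the base model, storing the single extra row loses no information, which is consistent with the bit-identical outputs reported above.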

⚠️ Limitations

  • 0.5B LLM: accuracy on image content is still limited (for example, mistaking a dog for a cow)
  • CLIP-ViT-B/32: 49 patches; a ViT-L/14 ablation was run, but the gain was limited, so it was not adopted
  • 57 OOD categories: mostly COCO plus everyday objects; adding categories is recommended when extending to a new domain

🔗 Links

📜 License

MIT. © 2026 김도윤 (AD-Styles)

📚 Citation

@misc{kim2026minillavav3,
  title  = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
  author = {Kim, Doyun},
  year   = {2026},
  url    = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}