---
license: mit
language:
- en
- ko
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- clip
- qwen2.5
- lora
- peft
- llava
- korean
- ood-detection
- mini-llava
---
# Mini-LLaVA v3: Korean Multilingual + OOD Detection + Slim Deploy
> v3 resolves the three problems v2 left open: Korean responses, hallucination, and deployment size. **Korean is handled by retraining on mixed data, hallucination by adding an inference wrapper + OOD layer, and deployment size by a Slim adapter (1045 MB → 8.28 MB).** Each problem is tackled at its own stage: training, analysis, or inference.
> These are the trained weights of a Vision-Language Model implemented from scratch with CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).
>
> ⚠️ **Size ≠ performance**: the Slim adapter (8.28 MB) is the **same model with the same output** (greedy 7/7 bit-identical). The model did not get smarter; only the packaging became more efficient. The real capability improvements are the two items Korean and OOD (see the Limitations section for the detailed trade-offs).
## 📦 Repository contents (~14 MB total)
```
projector.pt                      5.7 MB    ← MultiModalProjector (CLIP→LLM mapping)
lora_adapter_slim/
├─ adapter_config.json            1.1 KB    ← PEFT config (modules_to_save=None)
├─ adapter_model.safetensors      8.27 MB   ← LoRA weights (q/k/v/o, r=16)
├─ image_token_row.safetensors    7.17 KB   ← only the single <image> token row (the core of the slim format)
└─ README.md                                (PEFT auto-generated)
```
**−99.21% vs. v2** (1045 MB → 8.28 MB). For how the slimming works, see [GitHub README §Slim Adapter](https://github.com/AD-Styles/vlm-from-scratch-v3#step-4--slim-adapter-1045-mb--828-mb-출력-변화-없음).
## 🚀 Quick Start
```python
import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 source code from GitHub and run this script from the repo root:
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference
image = Image.open("path/to/image.jpg").convert("RGB")
# Korean prompt: "What do you see in this image?"
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# Call generate with output_scores=True to obtain first_logits,
# then call detector.score(image, first_logits).
```
## ✨ v2 → v3 changes (capability vs. deployment, kept separate)
### 🟢 Capability additions (what the model can newly do: real performance gains)
| Item | v2 | **v3 (this repo)** |
|---|---|---|
| Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
| OOD signal | ❌ Always answers (hallucination) | ✅ **"Not sure" layer added** (CLIP + entropy; validated on N=2, a full ROC analysis is left for v4) |
### 🔵 Deployment optimization (zero performance change, deployment efficiency only)
| Item | v2 | v3 |
|---|---|---|
| LoRA adapter | 1045 MB | 8.28 MB (−99.21%) |
| Total model assets | ~1051 MB | ~14 MB |
| Model output | (baseline) | **bit-identical** to FULL (greedy 7/7 verified) |
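The reduction figure in the table follows directly from the two adapter sizes. A minimal sketch of that arithmetic, using only the sizes quoted above (the function name `reduction_percent` is illustrative and not part of the repo):

```python
def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Relative size reduction, in percent, going from before_mb to after_mb."""
    return (before_mb - after_mb) / before_mb * 100.0


# Adapter sizes quoted in the table above (v2 full save vs. v3 slim save).
lora_reduction = reduction_percent(1045.0, 8.28)
print(f"LoRA adapter: -{lora_reduction:.2f}%")  # prints "LoRA adapter: -99.21%"
```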
### 🟡 What did not change
- Image understanding accuracy: limited by the 0.5B LLM, so v2 and v3 are at the same level (planned fix: a larger LLM in v4)
- English VQA: v3 baseline 36.67% vs. v2 34.67% (+2.00 pp; VQAv2, 50 samples, greedy decoding). Adding the inference wrapper does not change scores on free-form questions either; the wrapper's meaningful gain is in blocking hallucinations on POPE (+3 to +20 pp, details in the GitHub README)
## 🧠 Training data (Step 1, 175 min)
| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| **KoLLaVA** (LLaVA-Instruct translated into Korean with DeepL) | **4K** | **Korean** |
| **Total** | **13K** | **Korean ratio 30.8%** |
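The total and the Korean ratio in the last row can be recomputed from the per-source counts. A short sketch, with the counts copied from the table (the dictionary layout is only an illustration):

```python
# Sample counts per source in thousands, copied from the table above.
samples_k = {
    "VQAv2": ("en", 3),
    "LocalizedNarratives": ("en", 3),
    "A-OKVQA": ("en", 3),
    "KoLLaVA": ("ko", 4),
}

total_k = sum(count for _, count in samples_k.values())
korean_k = sum(count for lang, count in samples_k.values() if lang == "ko")
korean_ratio = korean_k / total_k * 100.0

print(f"total = {total_k}K, Korean ratio = {korean_ratio:.1f}%")
```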
## 🛡️ OOD Detector (optional)
```
ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood = ood_score > 0.5 (default)
clip_signal: 1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats
```
Validation results (`scripts/test_ood_integration.py`): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️). The OOD case clears the default 0.5 threshold only narrowly.
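The scoring rule above can be sketched in plain Python. This is a minimal illustration of the formula as written, not the repo's `OODDetector`: the function names are hypothetical, the CLIP similarities are assumed to be already scaled to [0, 1], and no clipping of either signal is applied because the formula does not mention any.

```python
import math


def softmax_entropy(logits):
    """Shannon entropy, in nats, of softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ood_score(clip_similarities, first_token_logits,
              w_clip=0.6, w_entropy=0.4, entropy_scale=8.0):
    """Weighted sum of the CLIP signal and the entropy signal."""
    # clip_signal: 1 - best similarity to any in-distribution category
    clip_signal = 1.0 - max(clip_similarities)
    # entropy_signal: first-token entropy divided by 8.0 nats
    entropy_signal = softmax_entropy(first_token_logits) / entropy_scale
    return w_clip * clip_signal + w_entropy * entropy_signal


def is_ood(score, threshold=0.5):
    """Flag the input as out-of-distribution when the score exceeds the threshold."""
    return score > threshold


# Toy inputs, made up for illustration: similarities to three categories and
# first-token logits over a four-token vocabulary.
score = ood_score([0.20, 0.50, 0.35], [0.0, 0.0, 0.0, 0.0])
print(f"ood_score = {score:.3f}, is_ood = {is_ood(score)}")
```

With the 0.6 / 0.4 weights, a weak CLIP match alone can contribute at most 0.6, so the first-token entropy matters mainly for borderline images.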
## 🪶 Slim Adapter: 99% reduction (1045 MB → 8.28 MB)
By default, PEFT saves `modules_to_save` (embed_tokens + lm_head) **in full**, which comes to about 1 GB.
However, a prior analysis of the saved weights showed:
```
saved embed_tokens vs base Qwen2.5:
  first 151,665 rows: max diff = 0.000000e+00 (exact match)
  last 1 row (<image> token): learned representation
```
→ Only `image_token_row.safetensors` (7 KB) is stored separately, and at inference time just the last row of the base Qwen2.5 weights is patched.
→ **Greedy decoding: 7/7 responses bit-identical** (`scripts/verify_slim_adapter.py`).
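The idea can be sketched with toy matrices. Nested Python lists stand in for the embedding tensors, and the helper names `extract_slim_rows` and `apply_slim_rows` are hypothetical; the repo's own saving and `load_lora_adapter` logic may differ in detail.

```python
def extract_slim_rows(base, tuned):
    """Collect only the rows of `tuned` that are new or differ from `base`."""
    slim = {}
    for index, row in enumerate(tuned):
        if index >= len(base) or list(row) != list(base[index]):
            slim[index] = list(row)
    return slim


def apply_slim_rows(base, slim):
    """Rebuild the tuned matrix from the base matrix plus the stored rows."""
    width = len(base[0])
    n_rows = max(len(base), max(slim) + 1) if slim else len(base)
    # Copy the base rows, then append placeholder rows for newly added tokens.
    patched = [list(row) for row in base]
    patched += [[0.0] * width for _ in range(n_rows - len(base))]
    for index, row in slim.items():
        patched[index] = list(row)
    return patched


# Toy example: three base rows plus one learned row for an added token.
base_embed = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tuned_embed = base_embed + [[0.9, -0.7]]

slim_rows = extract_slim_rows(base_embed, tuned_embed)
restored = apply_slim_rows(base_embed, slim_rows)
print(f"stored rows: {sorted(slim_rows)}, restored == tuned: {restored == tuned_embed}")
```

Because every untouched row is taken from the base model unchanged, the restored matrix equals the fully saved one exactly, which is consistent with the bit-identical greedy outputs reported above.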
## ⚠️ Limitations
- **0.5B LLM**: accuracy on image content is still limited (for example, mistaking a dog for a cow)
- **CLIP-ViT-B/32**: 49 patches; a ViT-L/14 ablation was run, but the gain was limited, so it was not adopted
- **57 OOD categories**: mostly COCO plus everyday objects; adding categories is recommended when extending to a new domain
## 🔗 Links
- 📂 **Code**: [github.com/AD-Styles/vlm-from-scratch-v3](https://github.com/AD-Styles/vlm-from-scratch-v3)
- 🚀 **Live Demo**: [HF Spaces: mini-llava-v3-demo](https://huggingface.co/spaces/AD-Styles/mini-llava-v3-demo)
- 🔁 **v2 baseline**: [github.com/AD-Styles/vlm-from-scratch](https://github.com/AD-Styles/vlm-from-scratch)
- 🤗 **v2 weights**: [AD-Styles/mini-llava-stage2](https://huggingface.co/AD-Styles/mini-llava-stage2)
- 🚢 **Triton/vLLM deploy**: [github.com/AD-Styles/nlp-triton-deployment](https://github.com/AD-Styles/nlp-triton-deployment)
## 📜 License
MIT. © 2026 김도윤 (AD-Styles)
## 📚 Citation
```bibtex
@misc{kim2026minillavav3,
title = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
author = {Kim, Doyun},
year = {2026},
url = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}
```