---
license: mit
language:
  - en
  - ko
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: image-text-to-text
tags:
  - vision-language
  - multimodal
  - clip
  - qwen2.5
  - lora
  - peft
  - llava
  - korean
  - ood-detection
  - mini-llava
---

# Mini-LLaVA v3: Korean Multilingual + OOD Detection + Slim Deploy

> v3 resolves the three problems v2 left open: Korean responses, hallucination, and deployment weight. **Korean is handled by retraining on mixed data, hallucination by adding an inference wrapper + OOD layer, and deployment weight by the Slim adapter (1045 MB → 8.28 MB)**. Each problem is tackled at its own stage: training, analysis, or inference.
> These are the trained weights of a Vision-Language Model implemented from scratch: CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).
>
> ⚠️ **Size ≠ performance**: the Slim adapter (8.28 MB) is the **same model with the same output** (greedy 7/7 bit-identical). The model did not get smarter; only the packaging became more efficient. The real capability improvements are Korean and OOD (see the Limitations section for the trade-offs).

## 📦 Repository contents (~14 MB total)

```
projector.pt                       5.7 MB   ← MultiModalProjector (CLIP→LLM mapping)
lora_adapter_slim/
├─ adapter_config.json             1.1 KB   ← PEFT config (modules_to_save=None)
├─ adapter_model.safetensors       8.27 MB  ← LoRA weights (q/k/v/o, r=16)
├─ image_token_row.safetensors     7.17 KB  ← only the single <image> token row (the key to slimming)
└─ README.md (PEFT auto-generated)
```

**−99.21% vs. v2** (1045 MB → 8.28 MB). For how the slimming works, see [GitHub README §Slim Adapter](https://github.com/AD-Styles/vlm-from-scratch-v3#step-4--slim-adapter-1045-mb--828-mb-출력-변화-없음).

## 🚀 Quick Start

```python
import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 src code (GitHub); run the imports below from inside the clone
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference (the Korean prompt means "What do you see in this image?")
image = Image.open("path/to/image.jpg").convert("RGB")
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# When generating, pass output_scores=True to obtain first_logits,
# then call detector.score(image, first_logits)
```

## ✨ v2 → v3 changes (capability vs. deployment)

### 🟢 Capability additions (what the model can newly do: real performance improvements)

| Item | v2 | **v3 (this repo)** |
|---|---|---|
| Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
| OOD signal | ❌ Always answers (hallucination) | ✅ **"I'm not sure" layer added** (CLIP + entropy, validated on N=2; a full ROC analysis is planned for v4) |

### 🔵 Deployment optimization (zero performance change, deployment efficiency only)

| Item | v2 | v3 |
|---|---|---|
| LoRA adapter | 1045 MB | 8.28 MB (−99.21%) |
| Total model assets | ~1051 MB | ~14 MB |
| Model output | (baseline) | **bit-identical** to FULL (verified on 7/7 greedy decodes) |

### 🟡 What did not change

- Image understanding accuracy: limited by the 0.5B LLM, so v2 and v3 are at the same level (to be addressed in v4 by scaling up the LLM)
- English VQA: v3 baseline 36.67% vs. v2 34.67% (+2.00 pp; VQAv2, 50 samples, greedy decoding). Adding the inference wrapper does not affect scores on free-form questions; its meaningful improvement is in blocking hallucination on POPE (+3 to +20 pp, see the GitHub README for details)

## 🧠 Training data (Step 1, 175 minutes)

| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| **KoLLaVA** (LLaVA-Instruct translated into Korean with DeepL) | **4K** | **Korean** |
| **Total** | **13K** | **Korean ratio 30.8%** |
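
The Korean ratio in the table follows directly from the sample counts. A minimal sketch of that arithmetic, assuming "3K" and "4K" mean exactly 3,000 and 4,000 samples (the `mix` dictionary and `language_ratio` helper are illustrative names, not part of the repository):

```python
# Sample counts per source, taken from the table above.
# Assumption: "3K"/"4K" are exactly 3000/4000 samples.
mix = {
    "VQAv2": (3000, "en"),
    "LocalizedNarratives": (3000, "en"),
    "A-OKVQA": (3000, "en"),
    "KoLLaVA": (4000, "ko"),
}


def language_ratio(mix, lang):
    """Fraction of all samples whose language tag equals `lang`."""
    total = sum(count for count, _ in mix.values())
    selected = sum(count for count, tag in mix.values() if tag == lang)
    return selected / total


total_samples = sum(count for count, _ in mix.values())
korean_pct = round(100 * language_ratio(mix, "ko"), 1)
print(total_samples, korean_pct)
```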

## 🛡️ OOD Detector (optional)

```
ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood    = ood_score > 0.5  (default)

clip_signal:    1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats
```

Validation results (`scripts/test_ood_integration.py`): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️)
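
The scoring rule above can be sketched in plain Python. This is an illustration of the formula only, not the repository's `OODDetector`: the function names are hypothetical, the CLIP similarities and first-token logits are toy inputs, and the sketch assumes the entropy is taken over `softmax(logits)` and that neither signal is clamped (the formula does not say).

```python
import math


def softmax_entropy(logits):
    """Shannon entropy, in nats, of softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ood_score(clip_similarities, first_token_logits,
              w_clip=0.6, w_entropy=0.4, entropy_scale=8.0):
    """Weighted sum of the two signals from the formula above."""
    # clip_signal: 1 - best similarity to any in-distribution category
    clip_signal = 1.0 - max(clip_similarities)
    # entropy_signal: first-token entropy divided by 8.0 nats
    entropy_signal = softmax_entropy(first_token_logits) / entropy_scale
    return w_clip * clip_signal + w_entropy * entropy_signal


def is_ood(score, threshold=0.5):
    """An input is flagged as OOD when its score exceeds the threshold."""
    return score > threshold


# Toy inputs: three category similarities and a uniform 4-way logit vector.
toy_similarities = [0.20, 0.635, 0.41]
toy_logits = [0.0, 0.0, 0.0, 0.0]
toy_score = ood_score(toy_similarities, toy_logits)
print(toy_score, is_ood(toy_score))
```

With the default threshold, the two reported validation scores fall on opposite sides: 0.365 is not flagged, while 0.505 is flagged by a narrow margin.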

## 🪶 Slim Adapter: 99% reduction (1045 MB → 8.28 MB)

By default, PEFT saves `modules_to_save` (embed_tokens + lm_head) **in full**, which comes to about 1 GB.
A preliminary analysis, however, showed:

```
saved embed_tokens vs base Qwen2.5:
  first 151,665 rows: max diff = 0.000000e+00  (exact match)
  last 1 row (<image> token): learned representation
```

→ Only `image_token_row.safetensors` (7 KB) is stored separately, and at inference time only the last row of the base Qwen2.5 embedding is patched.
→ **All 7/7 greedy-decoded responses match bit for bit** (`scripts/verify_slim_adapter.py`).
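
The idea can be illustrated with a toy example. The sketch below uses plain Python lists in place of tensors and hypothetical helper names; it is not the code in `scripts/verify_slim_adapter.py`. It assumes the base table has already been resized to include the extra `<image>` row.

```python
def changed_rows(base, tuned):
    """Indices of rows whose values differ between the two tables."""
    return [i for i, (b, t) in enumerate(zip(base, tuned)) if b != t]


def extract_slim(base, tuned):
    """Keep only the rows that training changed: {row_index: row_values}."""
    return {i: list(tuned[i]) for i in changed_rows(base, tuned)}


def patch(base, slim):
    """Rebuild the tuned table from the base table plus the saved rows."""
    rebuilt = [list(row) for row in base]
    for i, row in slim.items():
        rebuilt[i] = list(row)
    return rebuilt


# Toy embedding table: 4 pre-existing token rows plus 1 appended <image> row.
base_table = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(4)] + [[0.0, 0.0, 0.0]]
tuned_table = [list(row) for row in base_table]
tuned_table[-1] = [0.5, -0.25, 0.75]  # only the <image> row was trained

slim_rows = extract_slim(base_table, tuned_table)
restored = patch(base_table, slim_rows)
print(sorted(slim_rows), restored == tuned_table)
```

Because every untouched row is copied from the base table unchanged, the rebuilt table equals the tuned one exactly, which is consistent with the bit-identical outputs reported above.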

## ⚠️ Limitations

- **0.5B LLM**: accuracy on image content is still limited (e.g., mistaking a dog for a cow)
- **CLIP-ViT-B/32**: 49 patches; a ViT-L/14 ablation was run, but the gain was limited, so it was not adopted
- **57 OOD categories**: mostly COCO and everyday objects; extending the category list is recommended when moving to a new domain
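
The patch count in the CLIP bullet can be reproduced with simple arithmetic. This sketch assumes a 224×224 input resolution, which is the usual setting for CLIP ViT-B/32 but is not stated in this card; `num_patches` is an illustrative helper.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: 224x224 input images.
vit_b32_patches = num_patches(224, 32)  # 7 x 7 grid
vit_l14_patches = num_patches(224, 14)  # 16 x 16 grid
print(vit_b32_patches, vit_l14_patches)
```

Under that assumption, ViT-L/14 would hand the projector roughly five times as many visual tokens per image as ViT-B/32.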

## 🔗 Links

- 📂 **Code**: [github.com/AD-Styles/vlm-from-scratch-v3](https://github.com/AD-Styles/vlm-from-scratch-v3)
- 🚀 **Live Demo**: [HF Spaces: mini-llava-v3-demo](https://huggingface.co/spaces/AD-Styles/mini-llava-v3-demo)
- 🔁 **v2 baseline**: [github.com/AD-Styles/vlm-from-scratch](https://github.com/AD-Styles/vlm-from-scratch)
- 🤗 **v2 weights**: [AD-Styles/mini-llava-stage2](https://huggingface.co/AD-Styles/mini-llava-stage2)
- 🚢 **Triton/vLLM deploy**: [github.com/AD-Styles/nlp-triton-deployment](https://github.com/AD-Styles/nlp-triton-deployment)

## 📜 License

MIT, © 2026 김도윤 (AD-Styles)

## 📚 Citation

```bibtex
@misc{kim2026minillavav3,
  title  = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
  author = {Kim, Doyun},
  year   = {2026},
  url    = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}
```