AD-Styles committed
Commit eae11b6 · verified · 1 Parent(s): 4fbe8cd

Add v3 model card (Korean + Slim + OOD)

  1. README.md +153 -0
README.md ADDED
---
license: mit
language:
- en
- ko
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- clip
- qwen2.5
- lora
- peft
- llava
- korean
- ood-detection
- mini-llava
---

# Mini-LLaVA v3: Korean Multilingual + Slim LoRA + OOD Detection

> An evolution of v2 that directly targets its three unresolved problems: Korean forgetting, a 1 GB adapter, and OOD hallucination.
> Trained weights for a Vision-Language Model implemented from scratch with CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).

## 📦 Repository contents (~14 MB total)

```
projector.pt                      5.7 MB   ← MultiModalProjector (CLIP→LLM mapping)
lora_adapter_slim/
├─ adapter_config.json            1.1 KB   ← PEFT config (modules_to_save=None)
├─ adapter_model.safetensors      8.27 MB  ← LoRA weights (q/k/v/o, r=16)
├─ image_token_row.safetensors    7.17 KB  ← the single <image> token row only (the key to slimming)
└─ README.md                               (PEFT auto-generated)
```

**−99.21% vs. v2** (1045 MB → 8.28 MB). For how the slimming works, see [GitHub README §Slim Adapter](https://github.com/AD-Styles/vlm-from-scratch-v3#2%EF%B8%8F%E2%83%A3-slim-adapter--1045-mb--828-mb-%EC%9E%AC%ED%95%99%EC%8A%B5-0).

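The size figures can be sanity-checked with a little arithmetic. The sketch below assumes a hidden size of 896 for Qwen2.5-0.5B, float32 storage, and that `embed_tokens` and `lm_head` are each stored as a full separate matrix; none of these is stated in this card, so treat it as a rough consistency check rather than a description of the actual files. The row count (151,665 base rows plus one `<image>` row) comes from the Slim Adapter analysis in this card.

```python
# Back-of-envelope size check. HIDDEN and BYTES_PER_PARAM are assumptions,
# not values taken from this model card.
ROWS = 151_665 + 1       # base vocabulary rows + 1 appended <image> row
HIDDEN = 896             # assumption: Qwen2.5-0.5B hidden size
BYTES_PER_PARAM = 4      # assumption: float32 storage
MIB = 1024 ** 2

# One full (ROWS x HIDDEN) matrix, then two of them (embed_tokens + lm_head).
one_matrix_mib = ROWS * HIDDEN * BYTES_PER_PARAM / MIB
full_mib = 2 * one_matrix_mib

# A single row of one matrix, which is all the slim adapter needs to keep.
one_row_bytes = HIDDEN * BYTES_PER_PARAM

# Reduction reported in this card: 1045 MB -> 8.28 MB.
reduction_pct = (1045 - 8.28) / 1045 * 100

print(f"two full matrices: {full_mib:.1f} MiB")
print(f"one embedding row: {one_row_bytes} bytes")
print(f"reduction: {reduction_pct:.2f}%")
```

Under these assumptions the two full matrices come to roughly 1037 MiB, which together with the 8.27 MB of LoRA weights is in line with the 1045 MB reported for v2, while a single row is only a few kilobytes.
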
## 🚀 Quick Start

```python
import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 source code (GitHub) and run from the repo root:
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference
image = Image.open("path/to/image.jpg").convert("RGB")
# Korean prompt: "What do you see in this image?"
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# When calling generate, pass output_scores=True to obtain first_logits,
# then call detector.score(image, first_logits).
```

## ✨ Key improvements from v2 to v3

| Item | v2 | **v3 (this repo)** |
|---|---|---|
| Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
| LoRA adapter | 1045 MB | **8.28 MB (−99.21%)** |
| OOD handling | Always answers (hallucination) | **Can answer "I'm not sure"** (CLIP + entropy) |
| Total download size | ~1051 MB | **~14 MB** |

## 🧠 Training data (Step 1, 175 min)

| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| **KoLLaVA** (LLaVA-Instruct translated into Korean with DeepL) | **4K** | **Korean** |
| **Total** | **13K** | **Korean ratio 30.8%** |

## 🛡️ OOD Detector (optional)

```
ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood    = ood_score > 0.5   (default)

clip_signal:    1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats
```

Validation results (`scripts/test_ood_integration.py`): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️)

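The scoring rule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the repository's `OODDetector`: the function names are hypothetical, the CLIP similarity is assumed to already be a number in [0, 1], and the entropy is taken over the softmax of the first-token logits. The inputs are toy values, not measurements.

```python
import math

def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ood_score(max_clip_similarity, first_token_logits,
              w_clip=0.6, w_entropy=0.4, entropy_scale=8.0):
    """Weighted sum of the CLIP signal and the normalized entropy signal."""
    clip_signal = 1.0 - max_clip_similarity
    entropy_signal = softmax_entropy(first_token_logits) / entropy_scale
    return w_clip * clip_signal + w_entropy * entropy_signal

# Toy inputs (made up for illustration).
uniform_logits = [0.0] * 1000          # maximally uncertain first token
peaked_logits = [20.0] + [0.0] * 999   # almost all mass on one token

score_ood = ood_score(0.2, uniform_logits)  # low similarity, high entropy
score_in = ood_score(0.8, peaked_logits)    # high similarity, low entropy
print(score_ood > 0.5, score_in > 0.5)
```

With the default threshold of 0.5, the first toy input would be flagged as OOD and the second would not.
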
## 🪶 Slim Adapter: core technique

Standard PEFT saves the `modules_to_save` layers (embed_tokens + lm_head) **in full**, which produces an adapter of about 1 GB.
A prior analysis, however, showed:

```
saved embed_tokens vs base Qwen2.5:
  first 151,665 rows:          max diff = 0.000000e+00 (exact match)
  last 1 row (<image> token):  learned representation
```

→ Only `image_token_row.safetensors` (7 KB) is saved separately, and at inference time only the last row of base Qwen2.5 is patched.
→ **7/7 responses match bit-for-bit under greedy decoding** (`scripts/verify_slim_adapter.py`).

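The idea can be illustrated on a toy matrix. The sketch below uses plain Python lists in place of real embedding tensors, and its helper names are hypothetical; it assumes the base matrix has already been extended by one placeholder row for the `<image>` token. It is not the code in `scripts/verify_slim_adapter.py`.

```python
def rows_that_differ(saved, base):
    """Indices of rows where the saved matrix differs from the base matrix."""
    return [i for i, (s, b) in enumerate(zip(saved, base)) if s != b]

def extract_slim_row(saved):
    """Keep only the final row (the appended <image> token)."""
    return list(saved[-1])

def patch_last_row(base, row):
    """Return a copy of base with its last row replaced by the stored row."""
    patched = [list(r) for r in base]
    patched[-1] = list(row)
    return patched

# Toy 4x3 "embedding matrix": 3 original tokens + 1 placeholder <image> row.
base = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [0.0, 0.0, 0.0]]
saved = [list(r) for r in base]
saved[-1] = [1.5, -0.5, 0.25]   # only the <image> row changed during training

# Analysis step: every row except the last matches the base exactly.
changed = rows_that_differ(saved, base)

# Slim save + load: store one row, then patch it back into the base.
slim_row = extract_slim_row(saved)
restored = patch_last_row(base, slim_row)
print(changed, restored == saved)
```

Because every other row is identical to the base, storing the one changed row is enough to reconstruct the full saved matrix exactly.
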
## ⚠️ Limitations

- **0.5B LLM**: accuracy on image content is still limited (for example, mistaking a dog for a cow)
- **CLIP-ViT-B/32**: 49 patches; a ViT-L/14 ablation was run, but the gain was limited, so it was not adopted
- **57 OOD categories**: mostly COCO and everyday objects; adding categories is recommended when extending to new domains

## 🔗 Links

- 📂 **Code**: [github.com/AD-Styles/vlm-from-scratch-v3](https://github.com/AD-Styles/vlm-from-scratch-v3)
- 🚀 **Live Demo**: [HF Spaces: mini-llava-v3-demo](https://huggingface.co/spaces/AD-Styles/mini-llava-v3-demo)
- **v2 baseline**: [github.com/AD-Styles/vlm-from-scratch](https://github.com/AD-Styles/vlm-from-scratch)
- 🤗 **v2 weights**: [AD-Styles/mini-llava-stage2](https://huggingface.co/AD-Styles/mini-llava-stage2)
- 🚢 **Triton/vLLM deploy**: [github.com/AD-Styles/nlp-triton-deployment](https://github.com/AD-Styles/nlp-triton-deployment)

## 📜 License

MIT. © 2026 김도윤 (AD-Styles)

## 📚 Citation

```bibtex
@misc{kim2026minillavav3,
  title  = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
  author = {Kim, Doyun},
  year   = {2026},
  url    = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}
```