Add v3 model card (Korean + Slim + OOD)
README.md
ADDED
@@ -0,0 +1,153 @@
---
license: mit
language:
- en
- ko
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- clip
- qwen2.5
- lora
- peft
- llava
- korean
- ood-detection
- mini-llava
---

# Mini-LLaVA v3: Korean Multilingual + Slim LoRA + OOD Detection

> An evolution of v2 that targets its three unresolved problems: Korean forgetting, the 1 GB adapter, and OOD hallucination.
> Trained weights for a Vision-Language Model implemented from scratch as CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).

## 📦 Contents of this repo (~14 MB total)

```
projector.pt                        5.7 MB   ← MultiModalProjector (CLIP→LLM mapping)
lora_adapter_slim/
├── adapter_config.json             1.1 KB   ← PEFT config (modules_to_save=None)
├── adapter_model.safetensors       8.27 MB  ← LoRA weights (q/k/v/o, r=16)
├── image_token_row.safetensors     7.17 KB  ← only the single <image> token row (the key to the slim adapter)
└── README.md                                (PEFT auto-generated)
```

**-99.21% vs. v2** (1045 MB → 8.28 MB). For how the slimming works, see [GitHub README §Slim Adapter](https://github.com/AD-Styles/vlm-from-scratch-v3#2%EF%B8%8F%E2%83%A3-slim-adapter--1045-mb--828-mb-%EC%9E%AC%ED%95%99%EC%8A%B5-0).

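The size figures above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the file sizes are copied from the listing, decimal units (1 MB = 1000 KB) are an assumption, and the variable names are illustrative rather than anything defined in the repo.

```python
# Sizes in MB, copied from the file listing (decimal units assumed).
projector_mb = 5.7
adapter_config_mb = 1.1 / 1000       # 1.1 KB
lora_weights_mb = 8.27
image_token_row_mb = 7.17 / 1000     # 7.17 KB

# Everything under lora_adapter_slim/ except the auto-generated README.
slim_adapter_mb = adapter_config_mb + lora_weights_mb + image_token_row_mb
total_mb = projector_mb + slim_adapter_mb

v2_adapter_mb = 1045.0
reduction_pct = (1 - slim_adapter_mb / v2_adapter_mb) * 100

print(f"slim adapter: {slim_adapter_mb:.2f} MB")   # 8.28 MB
print(f"repo total:   {total_mb:.2f} MB")          # 13.98 MB
print(f"reduction:    {reduction_pct:.2f}%")       # 99.21%
```

Under these assumptions the numbers agree with the stated 8.28 MB adapter, the roughly 14 MB total, and the 99.21% reduction.
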
## Quick Start

```python
import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 src code (GitHub)
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference
image = Image.open("path/to/image.jpg").convert("RGB")
# Korean prompt: "What do you see in this image?"
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# When calling generate, pass output_scores=True to obtain first_logits,
# then call detector.score(image, first_logits).
```

## ✨ Key improvements, v2 → v3

| Item | v2 | **v3 (this repo)** |
|---|---|---|
| Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
| LoRA adapter | 1045 MB | **8.28 MB (-99.21%)** |
| OOD handling | Always answers (hallucination) | **Can answer "I'm not sure"** (CLIP + entropy) |
| Total download size | ~1051 MB | **~14 MB** |

## 🧠 Training data (Step 1, 175 min)

| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| **KoLLaVA** (LLaVA-Instruct translated into Korean with DeepL) | **4K** | **Korean** |
| **Total** | **13K** | **Korean ratio 30.8%** |

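The Korean ratio in the last row follows directly from the per-source counts. A small sketch of that arithmetic, with the counts (in thousands) copied from the table and a dictionary layout that is purely illustrative:

```python
# (language, samples in thousands) per source, copied from the table.
sources = {
    "VQAv2": ("en", 3),
    "LocalizedNarratives": ("en", 3),
    "A-OKVQA": ("en", 3),
    "KoLLaVA": ("ko", 4),
}

total_k = sum(n for _, n in sources.values())
korean_k = sum(n for lang, n in sources.values() if lang == "ko")
korean_ratio = korean_k / total_k

print(f"total = {total_k}K, Korean ratio = {korean_ratio:.1%}")
# total = 13K, Korean ratio = 30.8%
```
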
## 🛡️ OOD Detector (optional)

```
ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood    = ood_score > 0.5   (default)

clip_signal:    1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats
```

Validation results (`scripts/test_ood_integration.py`): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️)

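The scoring rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repo's `OODDetector`: the function names are hypothetical, the entropy is taken over the softmax of the first-token logits, and no clipping of either signal to [0, 1] is applied because the formula does not mention one.

```python
import math

def softmax_entropy_nats(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ood_score(max_clip_similarity, first_token_logits,
              w_clip=0.6, w_entropy=0.4, entropy_scale=8.0):
    """Weighted sum of the CLIP signal and the normalized entropy signal."""
    clip_signal = 1.0 - max_clip_similarity
    entropy_signal = softmax_entropy_nats(first_token_logits) / entropy_scale
    return w_clip * clip_signal + w_entropy * entropy_signal

def is_ood(score, threshold=0.5):
    return score > threshold

# Toy inputs (made up): best category similarity 0.3, uniform logits over 4 tokens.
score = ood_score(0.3, [0.0, 0.0, 0.0, 0.0])
print(round(score, 4), is_ood(score))
```

A lower best-match similarity or a flatter first-token distribution both push the score up, which is the behavior the formula describes.
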
## 🪶 Slim Adapter: the core technique

Standard PEFT saves `modules_to_save` (embed_tokens + lm_head) **in full**, which produces a 1 GB adapter.
A preliminary analysis, however, showed:

```
saved embed_tokens vs base Qwen2.5:
  first 151,665 rows:          max diff = 0.000000e+00  (exact match)
  last 1 row (<image> token):  learned representation
```

→ Only `image_token_row.safetensors` (7 KB) is saved separately; at inference time, just the last row of base Qwen2.5 is patched.
→ **Under greedy decoding, 7/7 responses match bit for bit** (`scripts/verify_slim_adapter.py`).

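The idea of saving only the rows that changed and patching them back onto the base weights can be sketched on a toy matrix. This is an illustration under assumptions, not the repo's implementation: `changed_rows` and `patch_rows` are hypothetical helpers, plain Python lists stand in for the embedding tensors, and both matrices are assumed to have the same shape.

```python
def changed_rows(base, tuned):
    """Return {row_index: tuned_row} for rows where tuned differs from base."""
    if len(base) != len(tuned):
        raise ValueError("base and tuned must have the same number of rows")
    return {i: list(t) for i, (b, t) in enumerate(zip(base, tuned)) if b != t}

def patch_rows(base, rows):
    """Return a copy of base with the stored rows written back in."""
    patched = [list(r) for r in base]
    for i, row in rows.items():
        patched[i] = list(row)
    return patched

# Toy 3x2 "embedding table": only the last row (the new token) was trained.
base = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tuned = [[0.1, 0.2], [0.3, 0.4], [0.9, -0.7]]

slim = changed_rows(base, tuned)      # what would be saved to disk
restored = patch_rows(base, slim)     # what inference reconstructs

print(slim)                # {2: [0.9, -0.7]}
print(restored == tuned)   # True
```

Because the untouched rows are taken from the base model unchanged, the reconstructed table is identical to the fully saved one, which is consistent with the bit-exact greedy-decoding check reported above.
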
## ⚠️ Limitations

- **0.5B LLM**: accuracy on image content is still limited (e.g., mistaking a dog for a bird)
- **CLIP-ViT-B/32**: 49 patches; a ViT-L/14 ablation was run, but the gains were limited, so it was not adopted
- **57 OOD categories**: centered on COCO + everyday objects; extending the category list is recommended when moving to a new domain

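The 49-patch figure follows from the patch grid of the vision encoder. A quick sketch, assuming a 224×224 input resolution (the usual setting for these CLIP checkpoints, but not stated in this card) and a hypothetical helper name:

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side

print(num_patches(224, 32))  # ViT-B/32 -> 49
print(num_patches(224, 14))  # ViT-L/14 -> 256
```

Under that assumption, ViT-L/14 would hand the projector roughly five times as many image tokens as ViT-B/32.
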
## Links

- **Code**: [github.com/AD-Styles/vlm-from-scratch-v3](https://github.com/AD-Styles/vlm-from-scratch-v3)
- **Live Demo**: [HF Spaces: mini-llava-v3-demo](https://huggingface.co/spaces/AD-Styles/mini-llava-v3-demo)
- **v2 baseline**: [github.com/AD-Styles/vlm-from-scratch](https://github.com/AD-Styles/vlm-from-scratch)
- **v2 weights**: [AD-Styles/mini-llava-stage2](https://huggingface.co/AD-Styles/mini-llava-stage2)
- **Triton/vLLM deploy**: [github.com/AD-Styles/nlp-triton-deployment](https://github.com/AD-Styles/nlp-triton-deployment)

## License

MIT · © 2026 김도윤 (AD-Styles)

## Citation

```bibtex
@misc{kim2026minillavav3,
  title  = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
  author = {Kim, Doyun},
  year   = {2026},
  url    = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}
```