---
license: mit
language:
- en
- ko
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- clip
- qwen2.5
- lora
- peft
- llava
- korean
- ood-detection
- mini-llava
---
# Mini-LLaVA v3: Korean Multilingual + OOD Detection + Slim Deploy
> v3 addresses the three problems v2 left unsolved: Korean responses, hallucination, and deployment size. **Korean is handled by retraining on mixed data, hallucination by adding an inference wrapper plus an OOD layer, and deployment size by the Slim adapter (1045 MB → 8.28 MB)**. Each problem is tackled at its own stage: training, analysis, or inference.
> These are the trained weights of a Vision-Language Model implemented from scratch with CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).
>
> ⚠️ **Size ≠ performance**: the Slim adapter (8.28 MB) is the **same model with the same output** (7/7 greedy responses match bit for bit). The model did not get smarter; only the packaging became more efficient. The real capability gains are Korean and OOD (see the Limitations section for the trade-offs).
## Repository contents (~14 MB total)
```
projector.pt                        5.7 MB   ← MultiModalProjector (CLIP → LLM mapping)
lora_adapter_slim/
├─ adapter_config.json              1.1 KB   ← PEFT config (modules_to_save=None)
├─ adapter_model.safetensors        8.27 MB  ← LoRA weights (q/k/v/o, r=16)
├─ image_token_row.safetensors      7.17 KB  ← only the single <image> token row (the core of the slim format)
└─ README.md                        (PEFT auto-generated)
```
**−99.21% compared to v2** (1045 MB → 8.28 MB). For how the slimming works, see [GitHub README §Slim Adapter](https://github.com/AD-Styles/vlm-from-scratch-v3#step-4--slim-adapter-1045-mb--828-mb-출력-변화-없음).
## Quick Start
```python
import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 src code (GitHub)
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference
image = Image.open("path/to/image.jpg").convert("RGB")
# Korean prompt meaning "What do you see in this image?"
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# When generating, pass output_scores=True to obtain first_logits,
# then call detector.score(image, first_logits)
```
## v2 → v3 changes (capability vs. deployment)
### Capability additions (new things the model can do: real performance gains)
| Item | v2 | **v3 (this repo)** |
|---|---|---|
| Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
| OOD signal | ❌ Always answers (hallucination) | ✅ **Adds an "I'm not sure" layer** (CLIP + entropy, validated on N=2; a full ROC analysis is deferred to v4) |
### Deployment optimization (no change in performance, deployment efficiency only)
| Item | v2 | v3 |
|---|---|---|
| LoRA adapter | 1045 MB | 8.28 MB (−99.21%) |
| Total model assets | ~1051 MB | ~14 MB |
| Model output | (baseline) | **bit-identical** to FULL (verified on 7/7 greedy responses) |
### What did not change
- Image understanding accuracy: limited by the 0.5B LLM, so v2 and v3 are at the same level (to be addressed in v4 by scaling up the LLM)
- English VQA: v3 baseline 36.67% vs. v2 34.67% (+2.00%p, VQAv2, 50 samples, greedy decoding). Adding the inference wrapper has no effect on safe question sets; the wrapper's meaningful gain is in blocking hallucinations on POPE (+3 to +20%p, details in the GitHub README)
## Training data (Step 1, 175 minutes)
| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| **KoLLaVA** (LLaVA-Instruct translated into Korean with DeepL) | **4K** | **Korean** |
| **Total** | **13K** | **Korean ratio 30.8%** |
## OOD Detector (optional)
```
ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood = ood_score > 0.5 (default)
clip_signal: 1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats
```
Validation results (`scripts/test_ood_integration.py`): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️)
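
The scoring rule above can be sketched in a few lines of plain Python. This is an illustrative reconstruction from the formula only, not the repository's `OODDetector`: the function names `softmax_entropy_nats`, `ood_score`, and `is_ood` are hypothetical, the maximum CLIP similarity is assumed to be computed elsewhere, and no clipping of either signal is applied because the formula does not mention any.

```python
import math


def softmax_entropy_nats(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Skip zero probabilities, which contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ood_score(max_clip_similarity, first_token_logits,
              clip_weight=0.6, entropy_weight=0.4, entropy_scale=8.0):
    """Weighted sum of the CLIP signal and the first-token entropy signal."""
    clip_signal = 1.0 - max_clip_similarity
    entropy_signal = softmax_entropy_nats(first_token_logits) / entropy_scale
    return clip_weight * clip_signal + entropy_weight * entropy_signal


def is_ood(score, threshold=0.5):
    """An input is flagged as OOD when the score exceeds the threshold."""
    return score > threshold


# Toy input: CLIP similarity 0.8 and a uniform distribution over 4 tokens.
score = ood_score(0.8, [0.0, 0.0, 0.0, 0.0])
print(round(score, 4), is_ood(score))
```

With the default threshold, the two validation scores reported above fall on opposite sides: 0.365 is not flagged, while 0.505 is flagged by a narrow margin.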
## Slim Adapter: 99% smaller (1045 MB → 8.28 MB)
By default, PEFT saves the modules listed in `modules_to_save` (embed_tokens + lm_head) **in full**, which comes to about 1 GB.
However, a preliminary analysis showed:
```
saved embed_tokens vs base Qwen2.5:
  first 151,665 rows: max diff = 0.000000e+00 (exact match)
  last 1 row (<image> token): learned representation
```
→ Only `image_token_row.safetensors` (7 KB) is stored separately, and at inference time only the last row of the base Qwen2.5 embedding is patched.
→ **All 7/7 greedy-decoding responses match bit for bit** (`scripts/verify_slim_adapter.py`).
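
The idea can be illustrated with a toy example: find the rows that differ from the base matrix, keep only those, and patch them back at load time. This is a minimal sketch on plain Python lists, not the repository's implementation; the helper names `changed_rows`, `extract_rows`, and `patch_rows` are hypothetical, and a real version would operate on the model's embedding tensor and a safetensors file.

```python
def changed_rows(base, tuned):
    """Indices of rows where the tuned matrix differs from the base matrix."""
    return [i for i, (b, t) in enumerate(zip(base, tuned)) if b != t]


def extract_rows(tuned, indices):
    """Keep only the selected rows: this is what the slim file would store."""
    return {i: list(tuned[i]) for i in indices}


def patch_rows(base, rows):
    """Copy the base matrix and overwrite the stored rows."""
    patched = [list(row) for row in base]
    for i, row in rows.items():
        patched[i] = list(row)
    return patched


# Toy "embedding" with 4 rows in which only the last row was trained.
base = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
tuned = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, -0.8]]

slim = extract_rows(tuned, changed_rows(base, tuned))
restored = patch_rows(base, slim)
print(slim, restored == tuned)
```

Because the untouched rows are taken from the base weights verbatim, the patched matrix equals the fully saved one exactly, which is consistent with the bit-identical outputs reported above.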
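## Limitations
- **0.5B LLM**: accuracy on image content is still limited (e.g. misidentifying a dog as a different animal)
- **CLIP-ViT-B/32**: 49 patches; a ViT-L/14 ablation was run but gave limited benefit, so it was not adopted
- **57 OOD categories**: mostly COCO and everyday objects; extending the category list is recommended when moving to a new domain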
## Links
- **Code**: [github.com/AD-Styles/vlm-from-scratch-v3](https://github.com/AD-Styles/vlm-from-scratch-v3)
- **Live Demo**: [HF Spaces: mini-llava-v3-demo](https://huggingface.co/spaces/AD-Styles/mini-llava-v3-demo)
- **v2 baseline**: [github.com/AD-Styles/vlm-from-scratch](https://github.com/AD-Styles/vlm-from-scratch)
- **v2 weights**: [AD-Styles/mini-llava-stage2](https://huggingface.co/AD-Styles/mini-llava-stage2)
- **Triton/vLLM deploy**: [github.com/AD-Styles/nlp-triton-deployment](https://github.com/AD-Styles/nlp-triton-deployment)
## License
MIT. © 2026 Doyun Kim (김도윤, AD-Styles)
## Citation
```bibtex
@misc{kim2026minillavav3,
title = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
author = {Kim, Doyun},
year = {2026},
url = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}
```