Image-Text-to-Text
PEFT
Safetensors
English
Korean
vision-language
multimodal
clip
qwen2.5
lora
llava
korean
ood-detection
mini-llava
Honest messaging: separate capability gains (Korean, OOD) from deployment optimization (Slim)
README.md
CHANGED
@@ -19,10 +19,12 @@ tags:
 - mini-llava
 ---

-# Mini-LLaVA v3: Korean Multilingual +

-> v2
 > Trained weights for a Vision-Language Model implemented by hand: CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA(r=16).

 ## 📦 Contents of this repo (~14 MB total)

@@ -78,14 +80,27 @@ detector = OODDetector(threshold=0.5, device="cpu")
 # When calling generate, pass output_scores=True to get first_logits, then call detector.score(image, first_logits)
 ```

-## ✨ v2 → v3

 | Item | v2 | **v3 (this repo)** |
 |---|---|---|
 | Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
-
-
-

 ## 🧠 Training data (Step 1, 175 min)

@@ -19,10 +19,12 @@ tags:
 - mini-llava
 ---

+# Mini-LLaVA v3: Korean Multilingual + OOD Detection + Slim Deploy

+> Built on the v2 baseline: **2 capabilities added (Korean, OOD) + 1 deployment optimization (Slim packaging)**.
 > Trained weights for a Vision-Language Model implemented by hand: CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA(r=16).
+>
+> ⚠️ **Size ≠ performance**: the Slim adapter (8.28 MB) is the **same model with the same output** (greedy 7/7 bit-identical). The model did not get smarter; only the packaging became more efficient. The real capability gains are the two items Korean and OOD.

 ## 📦 Contents of this repo (~14 MB total)

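The note in the hunk above says the Slim adapter reproduces the full adapter's greedy output on 7 of 7 prompts. One way such a check could be written is sketched below. It assumes each model is wrapped in a callable that returns greedy-decoded token IDs; `count_identical`, `full_generate`, and `slim_generate` are hypothetical names used for illustration and are not part of this repo.

```python
from typing import Callable, Sequence, Tuple

TokenIds = Sequence[int]


def count_identical(
    prompts: Sequence[str],
    full_generate: Callable[[str], TokenIds],
    slim_generate: Callable[[str], TokenIds],
) -> Tuple[int, int]:
    """Return (matches, total): prompts whose two outputs agree token for token."""
    matches = 0
    for prompt in prompts:
        # Greedy decoding is deterministic, so identical weights should
        # produce identical token ID sequences for the same prompt.
        if list(full_generate(prompt)) == list(slim_generate(prompt)):
            matches += 1
    return matches, len(prompts)


def _stand_in_generate(prompt: str) -> TokenIds:
    # Stand-in for a real model call (hypothetical); a real wrapper would
    # run the model with sampling disabled and return the generated IDs.
    return [ord(ch) for ch in prompt]


if __name__ == "__main__":
    result = count_identical(["a", "b"], _stand_in_generate, _stand_in_generate)
    print(result)  # (2, 2)
```

Comparing token IDs rather than decoded strings avoids false matches caused by tokenizer normalization when the text is decoded.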
@@ -78,14 +80,27 @@ detector = OODDetector(threshold=0.5, device="cpu")
 # When calling generate, pass output_scores=True to get first_logits, then call detector.score(image, first_logits)
 ```

+## ✨ v2 → v3 changes (capability vs. deployment, separated)
+
+### 🟢 Capability additions (things the model can newly do; real performance gains)

 | Item | v2 | **v3 (this repo)** |
 |---|---|---|
 | Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
+| OOD signal | ❌ Always answers (hallucination) | ✅ **Can say "I'm not sure"** (CLIP+entropy) |
+
+### 🔵 Deployment optimization (zero change in performance; deployment efficiency only)
+
+| Item | v2 | v3 |
+|---|---|---|
+| LoRA adapter | 1045 MB | 8.28 MB (-99.21%) |
+| Total model assets | ~1051 MB | ~14 MB |
+| Model output | (baseline) | **bit-identical** to FULL (verified on 7/7 greedy outputs) |
+
+### 🟡 What did not change (stated honestly)
+
+- Image understanding accuracy: same level in v2 and v3 because of the 0.5B LLM's limits (to be addressed in v4 by scaling up the LLM)
+- English VQA head-to-head: the v2 vs. v3 comparison has not been measured

 ## 🧠 Training data (Step 1, 175 min)
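
The capability table in this hunk attributes the OOD refusal to "CLIP+entropy", and the README snippet passes first-token logits to `detector.score`. The entropy half of that idea can be sketched as follows. This is not the repo's `OODDetector`: the function names are hypothetical, the CLIP image-similarity term is left out, and using 0.5 as a cutoff on normalized entropy is an assumption borrowed from the `threshold=0.5` argument shown in the snippet.

```python
import math
from typing import Sequence


def normalized_entropy(logits: Sequence[float]) -> float:
    """Entropy of softmax(logits), scaled to [0, 1] by log(number of logits)."""
    if len(logits) < 2:
        return 0.0
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(logits))


def looks_ood(first_logits: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the input when the first-token distribution is too flat (assumed rule)."""
    return normalized_entropy(first_logits) > threshold


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]   # one token dominates
    uncertain = [0.0, 0.0, 0.0, 0.0]    # all tokens equally likely
    print(looks_ood(confident), looks_ood(uncertain))  # False True
```

A flat first-token distribution means the model has no strong preference for how to start its answer, which is the kind of signal that could trigger an "I'm not sure" response instead of a guess.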
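
The percentage in the deployment table follows directly from the listed sizes. A small worked check, where `percent_change` is a helper defined here for illustration and the totals are the approximate (~) figures from the table:

```python
def percent_change(old_mb: float, new_mb: float) -> float:
    """Relative size change in percent; a negative value means smaller."""
    return (new_mb - old_mb) / old_mb * 100.0


if __name__ == "__main__":
    # LoRA adapter: 1045 MB in v2, 8.28 MB in v3.
    print(f"LoRA adapter: {percent_change(1045, 8.28):.2f}%")  # -99.21%
    # Total assets use the rounded table values, so this figure is approximate.
    print(f"Total assets: {percent_change(1051, 14):.2f}%")    # -98.67%
```

The adapter accounts for nearly all of the reduction: the remaining assets come to roughly 6 MB in both versions (1051 - 1045 and 14 - 8.28).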