---
license: mit
language:
- en
- ko
library_name: peft
base_model: Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- clip
- qwen2.5
- lora
- peft
- llava
- korean
- ood-detection
- mini-llava
---
# Mini-LLaVA v3: Korean Multilingual + OOD Detection + Slim Deploy

> v3 addresses the three problems v2 left unsolved: Korean responses, hallucination, and deployment size. **Korean is handled by retraining on mixed data, hallucination by adding an inference wrapper plus an OOD layer, and deployment size by the Slim adapter (1045 MB → 8.28 MB)**. Each problem is tackled at a different stage: training, analysis, or inference.
> Trained weights for a Vision-Language Model implemented from scratch: CLIP-ViT-B/32 + MLP Projector + Qwen2.5-0.5B + LoRA (r=16).
>
> ⚠️ **Size ≠ performance**: the Slim adapter (8.28 MB) is the **same model with the same output** (greedy decoding, 7/7 responses bit-identical). The model did not get smarter; only the packaging became more efficient. The real capability improvements are Korean and OOD (see Limitations for the detailed trade-offs).

## Repository contents (~14 MB total)

```
projector.pt                      5.7 MB   ← MultiModalProjector (CLIP→LLM mapping)
lora_adapter_slim/
├─ adapter_config.json            1.1 KB   ← PEFT config (modules_to_save=None)
├─ adapter_model.safetensors      8.27 MB  ← LoRA weights (q/k/v/o, r=16)
├─ image_token_row.safetensors    7.17 KB  ← single row for the <image> token (the key to slimming)
└─ README.md                               (PEFT auto-generated)
```

**−99.21% vs. v2** (1045 MB → 8.28 MB). For how the slimming works, see [GitHub README §Slim Adapter](https://github.com/AD-Styles/vlm-from-scratch-v3#step-4--slim-adapter-1045-mb--828-mb-출력-변화-없음).

## Quick Start

```python
import torch
from PIL import Image
from huggingface_hub import snapshot_download

# 1) Get the v3 source code (GitHub)
#    git clone https://github.com/AD-Styles/vlm-from-scratch-v3
#    cd vlm-from-scratch-v3
from src.model import MiniLLaVA
from src.dataset import encode_for_inference
from src.ood_detection import OODDetector

# 2) Download the weights
local_dir = snapshot_download("AD-Styles/mini-llava-v3", local_dir="checkpoints/v3_step1_korean")

# 3) Load the model (the slim adapter is detected automatically)
model = MiniLLaVA(freeze_vision=True, freeze_llm=True, torch_dtype=torch.float32)
model.load_projector(f"{local_dir}/projector.pt", map_location="cpu")
model.load_lora_adapter(f"{local_dir}/lora_adapter_slim")
model.to("cpu").eval()

# 4) Inference
image = Image.open("path/to/image.jpg").convert("RGB")
# Korean prompt: "What do you see in this image?"
input_ids, attn = encode_for_inference(model.tokenizer, "이 이미지에 무엇이 보이나요?")
pixel_values = model.image_processor(image, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids.unsqueeze(0),
        attention_mask=attn.unsqueeze(0),
        pixel_values=pixel_values,
        max_new_tokens=128,
    )
print(model.tokenizer.decode(out[0], skip_special_tokens=True))

# 5) (Optional) OOD detection
detector = OODDetector(threshold=0.5, device="cpu")
# When calling generate, pass output_scores=True to obtain first_logits,
# then call detector.score(image, first_logits)
```

## v2 → v3 changes (capability vs. deployment, kept separate)

### Capability additions (things the model can newly do; the real performance improvements)

| Item | v2 | **v3 (this repo)** |
|---|---|---|
| Multilingual responses | ❌ English only (catastrophic forgetting) | ✅ **English + Korean** |
| OOD signal | ❌ Always answers (hallucination) | ✅ **Adds an "I'm not sure" layer** (CLIP + entropy; validated on N=2, full ROC analysis deferred to v4) |

### Deployment optimization (zero change in performance, deployment efficiency only)

| Item | v2 | v3 |
|---|---|---|
| LoRA adapter | 1045 MB | 8.28 MB (−99.21%) |
| Total model assets | ~1051 MB | ~14 MB |
| Model output | (baseline) | **bit-identical** to FULL (verified on greedy 7/7) |

### What did not change

- Image understanding accuracy: limited by the 0.5B LLM, so v2 and v3 are at the same level (to be addressed in v4 by scaling up the LLM)
- English VQA: v3 baseline 36.67% vs. v2 34.67% (+2.00%p; VQAv2, 50 samples, greedy decoding). Adding the inference wrapper does not affect scores on free-form questions; its meaningful gain is in blocking hallucination on POPE (+3 to +20%p; see the GitHub README for details)
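
The −99.21% figure follows directly from the two adapter sizes quoted in the table. A minimal Python check, using only those numbers (the helper name `reduction_percent` is illustrative and not part of the repo):

```python
def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Relative size reduction, expressed in percent."""
    return (before_mb - after_mb) / before_mb * 100.0


# Adapter sizes as quoted in the table: v2 full adapter vs. v3 slim adapter.
adapter_reduction = reduction_percent(1045.0, 8.28)
print(f"{adapter_reduction:.2f}%")  # 99.21%
```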

## Training data (Step 1, 175 minutes)

| Source | Samples | Language |
|---|---|---|
| VQAv2 | 3K | English |
| LocalizedNarratives | 3K | English |
| A-OKVQA | 3K | English |
| **KoLLaVA** (LLaVA-Instruct translated with DeepL) | **4K** | **Korean** |
| **Total** | **13K** | **Korean ratio 30.8%** |
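
The Korean ratio in the last row can be reproduced from the per-source counts. The sketch below assumes each "K" in the table means exactly 1,000 samples, which the table does not state; the variable names are illustrative.

```python
# Per-source sample counts from the table (assumption: "3K" == 3000, "4K" == 4000).
samples = {
    "VQAv2": 3000,
    "LocalizedNarratives": 3000,
    "A-OKVQA": 3000,
    "KoLLaVA": 4000,
}
korean_sources = {"KoLLaVA"}

total = sum(samples.values())
korean = sum(n for name, n in samples.items() if name in korean_sources)
korean_ratio = korean / total

print(total, f"{korean_ratio:.1%}")  # 13000 30.8%
```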

## OOD Detector (optional)

```
ood_score = 0.6 × clip_signal + 0.4 × entropy_signal
is_ood    = ood_score > 0.5   (default)
clip_signal:    1 - max(CLIP-ViT-B/32 similarity to 57 in-dist categories)
entropy_signal: H(LLM first-token logits) / 8.0 nats
```

Validation result (`scripts/test_ood_integration.py`): In-Dist (real dog) 0.365 (✅) · OOD (Pikachu cartoon) 0.505 (⚠️, only marginally above the 0.5 threshold)
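
The scoring rule above can be sketched in plain Python. This is not the repo's `OODDetector`; it is a minimal illustration that assumes the similarities are already scaled to [0, 1], that the entropy is taken over the softmax of the first-token logits, and that no clipping is applied to either signal. The function names and the toy inputs are hypothetical.

```python
import math


def entropy_nats(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ood_score(clip_similarities, first_token_logits,
              w_clip=0.6, w_entropy=0.4, entropy_scale=8.0):
    """Weighted sum of the two signals described in the formula above."""
    clip_signal = 1.0 - max(clip_similarities)
    entropy_signal = entropy_nats(first_token_logits) / entropy_scale
    return w_clip * clip_signal + w_entropy * entropy_signal


def is_ood(score, threshold=0.5):
    return score > threshold


# Toy inputs (made up for illustration, not measured values).
confident = ood_score([0.9, 0.2], [10.0, 0.0, 0.0])   # close category, peaked logits
uncertain = ood_score([0.1, 0.05], [0.0] * 4096)      # no close category, flat logits
print(is_ood(confident), is_ood(uncertain))
```

With the default threshold, the two validation scores quoted above fall on opposite sides: `is_ood(0.365)` is false and `is_ood(0.505)` is true, the latter by a narrow margin.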

## Slim Adapter: 99% smaller (1045 MB → 8.28 MB)

Standard PEFT saves the `modules_to_save` modules (embed_tokens + lm_head) **in full**, which comes to about 1 GB.

A prior analysis, however, showed:

```
saved embed_tokens vs base Qwen2.5:
  first 151,665 rows:          max diff = 0.000000e+00 (exact match)
  last 1 row (<image> token):  trained representation
```

→ Only `image_token_row.safetensors` (7 KB) is stored separately, and at inference time only the last row of base Qwen2.5 is patched.

→ **Greedy decoding: 7/7 responses bit-identical** (`scripts/verify_slim_adapter.py`).
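
The idea of storing only the rows that differ from the base matrix can be sketched with a toy example. This is not the repo's implementation: it uses nested Python lists instead of tensors, assumes the base matrix already has a placeholder row for the new token, and all function names are hypothetical.

```python
def changed_rows(base, tuned):
    """Indices of rows that differ between two matrices of the same shape."""
    return [i for i, (b, t) in enumerate(zip(base, tuned)) if b != t]


def extract_rows(tuned, indices):
    """Keep only the rows that differ (the 'slim' payload)."""
    return {i: list(tuned[i]) for i in indices}


def patch_rows(base, payload):
    """Rebuild the full matrix from the base plus the stored rows."""
    patched = [list(row) for row in base]
    for i, row in payload.items():
        patched[i] = list(row)
    return patched


# Toy stand-in: 5 "vocabulary" rows of width 3; only the last row was trained.
base = [[0.1 * r + 0.01 * c for c in range(3)] for r in range(5)]
tuned = [list(row) for row in base]
tuned[-1] = [0.7, -0.2, 0.05]

diff = changed_rows(base, tuned)       # which rows need to be stored
payload = extract_rows(tuned, diff)    # what gets written to disk
restored = patch_rows(base, payload)   # what is rebuilt at load time
print(diff, restored == tuned)  # [4] True
```

Because the untouched rows are taken verbatim from the base matrix, the rebuilt matrix is exactly equal to the trained one, which is consistent with the bit-identical greedy outputs reported above.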

## Limitations

- **0.5B LLM**: accuracy on image content is still limited (for example, mistaking a dog for a cow)
- **CLIP-ViT-B/32**: 49 patches; a ViT-L/14 ablation was run, but the gain was limited, so it was not adopted
- **57 OOD categories**: mostly COCO plus everyday objects; extending the category list is recommended when moving to a new domain

## Links

- **Code**: [github.com/AD-Styles/vlm-from-scratch-v3](https://github.com/AD-Styles/vlm-from-scratch-v3)
- **Live Demo**: [HF Spaces: mini-llava-v3-demo](https://huggingface.co/spaces/AD-Styles/mini-llava-v3-demo)
- **v2 baseline**: [github.com/AD-Styles/vlm-from-scratch](https://github.com/AD-Styles/vlm-from-scratch)
- **v2 weights**: [AD-Styles/mini-llava-stage2](https://huggingface.co/AD-Styles/mini-llava-stage2)
- **Triton/vLLM deploy**: [github.com/AD-Styles/nlp-triton-deployment](https://github.com/AD-Styles/nlp-triton-deployment)

## License

MIT © 2026 김도윤 (AD-Styles)

## Citation

```bibtex
@misc{kim2026minillavav3,
  title  = {Mini-LLaVA v3: Korean Multilingual + Slim LoRA Adapter + OOD Detection},
  author = {Kim, Doyun},
  year   = {2026},
  url    = {https://github.com/AD-Styles/vlm-from-scratch-v3}
}
```