---
license: mit
language:
- en
- ko
library_name: peft
tags:
- vision-language-model
- multimodal
- lora
- llava
- mini-llava
- vlm
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
- openai/clip-vit-base-patch32
pipeline_tag: image-to-text
---
# Mini-LLaVA: Stage 2 LoRA Weights
Trained weights for a multimodal LLM that implements the core LLaVA-1.5 architecture from scratch.
It does not use high-level abstractions such as HuggingFace's `LlavaForConditionalGeneration`; the fusion logic is implemented directly.
- **Live Demo:** https://huggingface.co/spaces/AD-Styles/mini-llava-demo (try it instantly, no installation)
- **Code repo:** https://github.com/AD-Styles/vlm-from-scratch
- **Detailed analysis (including the v1 to v2 retrospective):** [GitHub README](https://github.com/AD-Styles/vlm-from-scratch#readme)
- **Test A/B/C results table:** [GitHub README #-results](https://github.com/AD-Styles/vlm-from-scratch#-results)
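
The fusion step mentioned above can be pictured as expanding an image placeholder in the token sequence into one slot per projected patch. The following is a minimal sketch of that idea in plain Python, not the repository's code: the function name, the placeholder id `-200`, and the choice of 49 patches per image are all assumptions made for illustration.

```python
from typing import List

IMAGE_TOKEN_ID = -200  # hypothetical placeholder id, not taken from the repo


def splice_image_positions(input_ids: List[int], num_patches: int,
                           image_token_id: int = IMAGE_TOKEN_ID) -> List[str]:
    """Return one slot label per position of the fused sequence.

    Text tokens keep a single slot ("text:<id>"). Each image placeholder
    expands into `num_patches` slots ("image:<k>") that a real model would
    fill with projected vision features.
    """
    slots: List[str] = []
    for token_id in input_ids:
        if token_id == image_token_id:
            slots.extend(f"image:{k}" for k in range(num_patches))
        else:
            slots.append(f"text:{token_id}")
    return slots


# Three text tokens around one placeholder, assuming 49 patches per image.
fused = splice_image_positions([101, IMAGE_TOKEN_ID, 102, 103], num_patches=49)
print(len(fused))  # 3 text slots + 49 image slots
```

In a real model the labels would be embedding vectors: text slots come from `embed_tokens`, image slots from the projector output.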
## 🧩 Contents
| File | Size | Role |
|------|------|------|
| `projector.pt` | 5.7 MB | Maps CLIP-ViT-B/32 features into the Qwen2.5 embedding space (2-layer MLP) |
| `lora_adapter/adapter_config.json` | 1 KB | PEFT LoRA config |
| `lora_adapter/adapter_model.safetensors` | ~1 GB | LoRA r=16 weights + embed_tokens / lm_head (including changes made during training) |
> **Why 1 GB?** PEFT detects the embedding resize caused by adding the `<image>` special token and saves `embed_tokens` / `lm_head` together with the adapter. Simply stripping them out was attempted (see [scripts/extract_lora.py](https://github.com/AD-Styles/vlm-from-scratch/blob/main/scripts/extract_lora.py) in the code repo), but 3 of 5 test questions then produced completely different responses, which **confirmed experimentally that embed_tokens is part of the trained state**. The original 1 GB file is therefore distributed as-is.
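
A rough size estimate shows why the two embedding matrices dominate the adapter file. This is a back-of-the-envelope sketch: the row count and hidden size are the published Qwen2.5-0.5B values, while float32 storage and the exact row count after the `<image>` resize are assumptions, since the card does not state them.

```python
VOCAB_ROWS = 151_936   # Qwen2.5-0.5B embedding rows (before any resize)
HIDDEN_SIZE = 896      # Qwen2.5-0.5B hidden size
BYTES_PER_PARAM = 4    # assumption: tensors stored as float32


def matrix_bytes(rows: int, cols: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Storage needed for a dense rows x cols matrix."""
    return rows * cols * bytes_per_param


embed_bytes = matrix_bytes(VOCAB_ROWS, HIDDEN_SIZE)
both_bytes = 2 * embed_bytes  # embed_tokens + lm_head saved side by side

print(f"embed_tokens: {embed_bytes / 1e6:.1f} MB")            # 544.5 MB
print(f"embed_tokens + lm_head: {both_bytes / 1e9:.2f} GB")   # 1.09 GB
```

Under these assumptions the two matrices alone account for roughly the 1 GB listed in the table.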
## Usage
```bash
# 1. Get the code and set up the environment
git clone https://github.com/AD-Styles/vlm-from-scratch
cd vlm-from-scratch
pip install --upgrade "torch>=2.6.0" "torchvision>=0.21.0" --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# 2. Download the weights from this repo (~1 GB)
huggingface-cli download AD-Styles/mini-llava-stage2 --local-dir checkpoints/v2_stage2_lora
# 3. Run the Gradio demo
python app.py \
  --checkpoint checkpoints/v2_stage2_lora/projector.pt \
  --lora-adapter checkpoints/v2_stage2_lora/lora_adapter
```
Open `http://localhost:7860` in a browser, upload an image, and ask a question in natural language.
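
If you prefer to fetch the weights from Python instead of the CLI, `huggingface_hub.snapshot_download` does the same job. This is a small sketch assuming `huggingface_hub` is installed; the wrapper function `download_weights` is a name invented here, and the target folder simply mirrors the path used in the shell commands above.

```python
REPO_ID = "AD-Styles/mini-llava-stage2"
LOCAL_DIR = "checkpoints/v2_stage2_lora"


def download_weights(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Download every file in the model repo and return the local folder path."""
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (downloads about 1 GB):
#   path = download_weights()
#   then pass f"{path}/projector.pt" and f"{path}/lora_adapter" to app.py
```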
## Training Configuration
| Item | Value |
|------|-----|
| Base model | `openai/clip-vit-base-patch32` (frozen) + `Qwen/Qwen2.5-0.5B-Instruct` (frozen + LoRA) |
| Data | `HuggingFaceM4/the_cauldron` 3-config mix (localized_narratives + aokvqa + vqav2), 9,000 samples |
| LoRA | rank=16, alpha=32, target=q/k/v/o projection |
| Epochs | 2 |
| Batch size | 2 (grad_accum=4 → effective 8) |
| Learning rate | 2e-4 (cosine) |
| GPU | NVIDIA RTX 4060 Laptop (8GB VRAM) |
| Training time | 47 minutes |
| Trainable parameters | 3,655,424 (0.66% of total) |
| Init | Continued training from the v1 baseline projector (Stage 1 → Stage 2) |
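
The trainable-parameter figure in the table can be reproduced with a short calculation. The sketch below uses the published Qwen2.5-0.5B dimensions (hidden size 896, 24 layers, 2 key/value heads of size 64) and the CLIP ViT-B/32 vision width of 768. The projector layout (two biased linear layers, 768 to 896 to 896) is an assumption, chosen because it matches both the "2-layer MLP" description and the reported total.

```python
HIDDEN = 896        # Qwen2.5-0.5B hidden size
NUM_LAYERS = 24     # Qwen2.5-0.5B decoder layers
KV_DIM = 128        # 2 key/value heads x 64 head dim
CLIP_DIM = 768      # CLIP ViT-B/32 vision hidden size
RANK = 16           # LoRA rank from the table


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each wrapped linear layer."""
    return rank * (d_in + d_out)


def linear_params(d_in: int, d_out: int) -> int:
    """Weight matrix plus bias of a dense layer."""
    return d_in * d_out + d_out


per_layer = (lora_params(HIDDEN, HIDDEN)      # q_proj
             + lora_params(HIDDEN, KV_DIM)    # k_proj
             + lora_params(HIDDEN, KV_DIM)    # v_proj
             + lora_params(HIDDEN, HIDDEN))   # o_proj
lora_total = per_layer * NUM_LAYERS

# Assumed projector: Linear(768 -> 896), activation, Linear(896 -> 896).
projector_total = linear_params(CLIP_DIM, HIDDEN) + linear_params(HIDDEN, HIDDEN)

print(lora_total, projector_total, lora_total + projector_total)
# 2162688 1492736 3655424

# Optimizer steps, assuming all 9,000 samples are used in both epochs.
optimizer_steps = 9_000 * 2 // (2 * 4)
print(optimizer_steps)  # 2250
```

The sum equals the 3,655,424 reported above, which suggests the count covers the LoRA matrices and the projector but not the embedding matrices.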
## ✅ Validation Results (English VQA, dog photo)
| Question | Response | Correct |
|------|------|--------|
| What is in this image? | "Dog." | ✅ |
| What color is the dog? | "White." | ✅ |
| Is the dog wearing anything on its head? | "Yes." | ✅ |
| What is on the dog's head? | "Hat." | ✅ |
| Describe this image in one sentence. | "...cat on the floor." | ⚠️ (influenced by the Hello Kitty hat) |
**Test A: 4/5 correct**; the instruction format was matched automatically.
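## ⚠️ Limitations (stated honestly)
- **Korean:** catastrophic forgetting from LoRA, because the training data is 100% English. Asked "이 강아지는 머리에 무엇을 쓰고 있나요?" ("What is this dog wearing on its head?"), the model answers "개." ("Dog."): the English short-answer bias is wrongly carried over into Korean.
- **OOD (cartoons/animation):** a limit of the CLIP-ViT-B/32 representation. Pikachu → "Giraffe" (mapped to the nearest animal within the training distribution).
- **Hallucination:** the model cannot answer "I don't know" (a problem common to VLMs).
For detailed analysis, see the [Test B/C section of the GitHub README](https://github.com/AD-Styles/vlm-from-scratch#-results).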
## 🔮 Future Improvements (v3 roadmap)
1. Add 30%+ Korean instruction data → resolve catastrophic forgetting
2. Upgrade to CLIP-ViT-L/14 (576 patches) → better OOD robustness
3. OOD detection module → learn to answer "I don't know"
4. vLLM / Triton Inference Server integration → production serving
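
The patch count in roadmap item 2 follows from the image and patch sizes. The short sketch below assumes 224 px input for ViT-B/32 and 336 px input for ViT-L/14; the card only states the 576 figure, so both resolutions are assumptions.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT extracts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


print(num_patches(224, 32))  # 49  (ViT-B/32 at 224 px)
print(num_patches(336, 14))  # 576 (ViT-L/14 at 336 px)
```

Under these assumptions the upgrade gives the language model roughly twelve times as many visual tokens per image, which also lengthens every input sequence.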
## References
- LLaVA-1.5 (Liu et al., 2023): [arxiv:2310.03744](https://arxiv.org/abs/2310.03744)
- CLIP (Radford et al., 2021): [arxiv:2103.00020](https://arxiv.org/abs/2103.00020)
- LoRA (Hu et al., 2022): [arxiv:2106.09685](https://arxiv.org/abs/2106.09685)
- the_cauldron (Laurençon et al., 2024): [arxiv:2405.02246](https://arxiv.org/abs/2405.02246)
## License
MIT: free to use, modify, and distribute.
---
👤 AD-Styles · 2026