---
license: mit
language:
- en
- ko
library_name: peft
tags:
- vision-language-model
- multimodal
- lora
- llava
- mini-llava
- vlm
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
- openai/clip-vit-base-patch32
pipeline_tag: image-to-text
---
# Mini-LLaVA: Stage 2 LoRA Weights
Trained weights for a multimodal LLM that reimplements the core LLaVA-1.5 architecture from scratch.
High-level abstractions such as HuggingFace's `LlavaForConditionalGeneration` are not used; the vision-language fusion logic is implemented directly.
- 🚀 **Live Demo:** https://huggingface.co/spaces/AD-Styles/mini-llava-demo (try it immediately, no installation)
- 📂 **Code repo:** https://github.com/AD-Styles/vlm-from-scratch
- 📝 **Detailed analysis (including the v1→v2 retrospective):** [GitHub README](https://github.com/AD-Styles/vlm-from-scratch#readme)
- 📊 **Test A/B/C results table:** [GitHub README #-results](https://github.com/AD-Styles/vlm-from-scratch#-results)
## 🧩 Contents
| File | Size | Role |
|------|------|------|
| `projector.pt` | 5.7 MB | Maps CLIP-ViT-B/32 features into the Qwen2.5 embedding space (2-layer MLP) |
| `lora_adapter/adapter_config.json` | 1 KB | PEFT LoRA configuration |
| `lora_adapter/adapter_model.safetensors` | ~1 GB | LoRA r=16 weights + `embed_tokens` / `lm_head` (including the changes made during training) |
> **Why 1 GB?** PEFT detects the embedding resize caused by adding the `<image>` special token and saves `embed_tokens` / `lm_head` together with the adapter. I tried simply stripping them out (see [scripts/extract_lora.py](https://github.com/AD-Styles/vlm-from-scratch/blob/main/scripts/extract_lora.py) in the code repo), but 3 of 5 test questions then produced completely different responses, which **confirmed experimentally that `embed_tokens` is part of the trained state**. The original 1 GB file is therefore distributed unchanged.
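
To see how the adapter file's size splits between LoRA matrices and the saved embedding layers, the safetensors header can be read without loading any tensors. The following is a minimal sketch using only the standard library. The grouping by substring (`lora_`, `embed_tokens`, `lm_head`) is an assumption about how PEFT names the tensors in this file, and `read_safetensors_header` and `bytes_by_group` are hypothetical helper names, not part of the code repo.

```python
import json
import struct
from collections import defaultdict


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict.

    The format starts with an 8-byte little-endian unsigned integer giving
    the header length, followed by that many bytes of UTF-8 JSON.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" holds free-form strings, not a tensor entry.
    header.pop("__metadata__", None)
    return header


def bytes_by_group(header):
    """Sum stored bytes per tensor group, keyed by an assumed name pattern."""
    totals = defaultdict(int)
    for name, info in header.items():
        begin, end = info["data_offsets"]
        if "lora_" in name:
            group = "lora"
        elif "embed_tokens" in name:
            group = "embed_tokens"
        elif "lm_head" in name:
            group = "lm_head"
        else:
            group = "other"
        totals[group] += end - begin
    return dict(totals)
```

Calling `bytes_by_group(read_safetensors_header("checkpoints/v2_stage2_lora/lora_adapter/adapter_model.safetensors"))` after the download step should show whether the embedding layers account for most of the file.
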
## 🚀 Usage
```bash
# 1. Get the code and set up the environment
git clone https://github.com/AD-Styles/vlm-from-scratch
cd vlm-from-scratch
pip install --upgrade "torch>=2.6.0" "torchvision>=0.21.0" --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# 2. Download the weights from this repo (~1 GB)
huggingface-cli download AD-Styles/mini-llava-stage2 --local-dir checkpoints/v2_stage2_lora
# 3. Run the Gradio demo
python app.py \
    --checkpoint checkpoints/v2_stage2_lora/projector.pt \
    --lora-adapter checkpoints/v2_stage2_lora/lora_adapter
```
Open `http://localhost:7860` in a browser → upload an image → ask a question in natural language.
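
Before launching the demo, it can help to confirm that the download produced the three files listed in the table above. This is a small sketch under that assumption; `missing_files` is a hypothetical helper, not something shipped in the code repo.

```python
from pathlib import Path

# Relative paths taken from the file table in this model card.
EXPECTED_FILES = (
    "projector.pt",
    "lora_adapter/adapter_config.json",
    "lora_adapter/adapter_model.safetensors",
)


def missing_files(local_dir):
    """Return the expected files that are absent under local_dir."""
    root = Path(local_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

For example, `missing_files("checkpoints/v2_stage2_lora")` returns an empty list when the download is complete.
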
## 📊 Training setup
| Item | Value |
|------|-----|
| Base models | `openai/clip-vit-base-patch32` (frozen) + `Qwen/Qwen2.5-0.5B-Instruct` (frozen + LoRA) |
| Data | `HuggingFaceM4/the_cauldron`, mix of 3 configs (localized_narratives + aokvqa + vqav2), 9,000 samples |
| LoRA | rank=16, alpha=32, target=q/k/v/o projection |
| Epochs | 2 |
| Batch size | 2 (grad_accum=4 → effective 8) |
| Learning rate | 2e-4 (cosine) |
| GPU | NVIDIA RTX 4060 Laptop (8GB VRAM) |
| Training time | 47 min |
| Trainable parameters | 3,655,424 (0.66% of total) |
| Init | Continued from the v1 baseline projector (Stage 1 → Stage 2) |
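
The trainable-parameter count and the step count in the table can be reproduced with a little arithmetic. The sketch below assumes architecture dimensions from the public model configs rather than from this card: a hidden size of 768 for the CLIP-ViT-B/32 vision tower, and for Qwen2.5-0.5B a hidden size of 896 with 24 layers, 14 attention heads and 2 key/value heads. It also assumes the projector is `Linear(768, 896)` followed by `Linear(896, 896)`, both with biases, that LoRA adds no bias terms, and that all 9,000 samples are used for training. Under those assumptions the total matches the reported 3,655,424.

```python
# Assumed dimensions (from public model configs, not stated in this card).
CLIP_HIDDEN = 768      # CLIP-ViT-B/32 vision hidden size
LLM_HIDDEN = 896       # Qwen2.5-0.5B hidden size
NUM_LAYERS = 24
NUM_HEADS = 14
NUM_KV_HEADS = 2
HEAD_DIM = LLM_HIDDEN // NUM_HEADS   # 64
KV_DIM = NUM_KV_HEADS * HEAD_DIM     # 128 (grouped-query attention)
RANK = 16                            # from the table above


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank), with no bias."""
    return rank * (in_features + out_features)


def linear_params(in_features, out_features):
    """Weight matrix plus bias vector of a dense layer."""
    return in_features * out_features + out_features


per_layer = (
    lora_params(LLM_HIDDEN, LLM_HIDDEN, RANK)    # q_proj
    + lora_params(LLM_HIDDEN, KV_DIM, RANK)      # k_proj
    + lora_params(LLM_HIDDEN, KV_DIM, RANK)      # v_proj
    + lora_params(LLM_HIDDEN, LLM_HIDDEN, RANK)  # o_proj
)
lora_total = per_layer * NUM_LAYERS              # 2,162,688

# Assumed 2-layer MLP projector: 768 -> 896 -> 896.
projector_total = (
    linear_params(CLIP_HIDDEN, LLM_HIDDEN)
    + linear_params(LLM_HIDDEN, LLM_HIDDEN)
)                                                # 1,492,736

trainable_total = lora_total + projector_total   # 3,655,424

# Schedule: batch 2 x grad_accum 4, 9,000 samples, 2 epochs.
effective_batch = 2 * 4                          # 8
optimizer_steps = (9000 // effective_batch) * 2  # 2,250

print(trainable_total, optimizer_steps)
```

At 4 bytes per fp32 parameter, the assumed projector comes to about 5.7 MiB, which is consistent with the listed size of `projector.pt`.
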
## ✅ Validation results (English VQA, dog photo)
| Question | Response | Correct |
|------|------|--------|
| What is in this image? | "Dog." | ✅ |
| What color is the dog? | "White." | ✅ |
| Is the dog wearing anything on its head? | "Yes." | ✅ |
| What is on the dog's head? | "Hat." | ✅ |
| Describe this image in one sentence. | "...cat on the floor." | ⚠️ (influenced by the Hello Kitty hat) |
**Test A: 4/5 correct**; instruction-format matching worked automatically.
## ⚠️ Limitations (stated honestly)
- **Korean:** catastrophic forgetting from LoRA fine-tuning, since the training data is 100% English. "이 강아지는 머리에 무엇을 쓰고 있나요?" ("What is this dog wearing on its head?") → "개." ("Dog."): the English short-answer bias carries over incorrectly into Korean.
- **OOD (cartoons / animation):** representational limits of CLIP-ViT-B/32. Pikachu → "Giraffe" (mapped to the closest animal within the training distribution).
- **Hallucination:** the model cannot answer "I don't know" (a problem common to VLMs).
For a detailed analysis, see the [Test B/C section of the GitHub README](https://github.com/AD-Styles/vlm-from-scratch#-results).
## 🔮 Future improvements (v3 roadmap)
1. Add Korean instruction data (30%+ of the mix) → mitigate catastrophic forgetting
2. Upgrade to CLIP-ViT-L/14 (576 patches) → better OOD robustness
3. OOD detection module → teach the model to answer "I don't know"
4. vLLM / Triton Inference Server integration → production serving
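
The patch count in roadmap item 2 follows from the image and patch sizes. As a quick worked example, assuming square inputs of 224 px for the current CLIP-ViT-B/32 encoder and 336 px for the ViT-L/14 variant (the resolution that yields 576 patches; neither input size is stated in this card):

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


print(num_patches(224, 32))  # 49  (ViT-B/32 at an assumed 224 px)
print(num_patches(336, 14))  # 576 (ViT-L/14 at an assumed 336 px)
```

Under these assumptions the upgrade gives the language model roughly 12 times as many visual tokens per image, which also raises sequence length and memory use.
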
## 📚 References
- LLaVA-1.5 (Liu et al., 2023): [arxiv:2310.03744](https://arxiv.org/abs/2310.03744)
- CLIP (Radford et al., 2021): [arxiv:2103.00020](https://arxiv.org/abs/2103.00020)
- LoRA (Hu et al., 2022): [arxiv:2106.09685](https://arxiv.org/abs/2106.09685)
- the_cauldron (Laurençon et al., 2024): [arxiv:2405.02246](https://arxiv.org/abs/2405.02246)
## License
MIT: free to use, modify, and distribute.
---
🤖 김도윤 (AD-Styles) · 2026