---
license: mit
language:
- en
- ko
library_name: peft
tags:
- vision-language-model
- multimodal
- lora
- llava
- mini-llava
- vlm
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
- openai/clip-vit-base-patch32
pipeline_tag: image-to-text
---

# Mini-LLaVA: Stage 2 LoRA Weights

Trained weights for a multimodal LLM that implements the core LLaVA-1.5 architecture from scratch.
High-level abstractions such as HuggingFace's `LlavaForConditionalGeneration` are not used; the fusion logic is implemented directly.

- **Live Demo:** https://huggingface.co/spaces/AD-Styles/mini-llava-demo (try it instantly, no installation needed)
- **Code repo:** https://github.com/AD-Styles/vlm-from-scratch
- **Detailed analysis (including the v1→v2 retrospective):** [GitHub README](https://github.com/AD-Styles/vlm-from-scratch#readme)
- **Test A/B/C results table:** [GitHub README #-results](https://github.com/AD-Styles/vlm-from-scratch#-results)

## 🧩 Contents

| File | Size | Role |
|------|------|------|
| `projector.pt` | 5.7 MB | Maps CLIP-ViT-B/32 features into the Qwen2.5 embedding space (2-layer MLP) |
| `lora_adapter/adapter_config.json` | 1 KB | PEFT LoRA configuration |
| `lora_adapter/adapter_model.safetensors` | ~1 GB | LoRA r=16 weights + `embed_tokens` / `lm_head` (including the changes made during training) |

> **Why 1 GB?** PEFT detects the embedding resize caused by adding the `<image>` special token and saves `embed_tokens` / `lm_head` alongside the LoRA weights. An attempt to strip them out (see [scripts/extract_lora.py](https://github.com/AD-Styles/vlm-from-scratch/blob/main/scripts/extract_lora.py) in the code repo) changed 3 of 5 test answers into completely different responses, which **confirmed experimentally that `embed_tokens` is part of the trained state**. The original ~1 GB file is therefore distributed unchanged.
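
As a rough sanity check on the ~1 GB figure, the sketch below estimates how much space two full embedding-sized matrices would take. The vocabulary size, hidden size, and fp32 storage are assumptions about Qwen2.5-0.5B and the saved dtype, not values stated in this card, and the one extra row for `<image>` is ignored.

```python
# Back-of-envelope size of embed_tokens + lm_head in the adapter file.
# All three constants are ASSUMPTIONS, not values taken from this card.
VOCAB_SIZE = 151_936   # assumed number of embedding rows in Qwen2.5-0.5B
HIDDEN_SIZE = 896      # assumed hidden size of Qwen2.5-0.5B
BYTES_PER_PARAM = 4    # assumed fp32 storage


def matrix_bytes(rows: int, cols: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Storage needed for a dense rows x cols weight matrix."""
    return rows * cols * bytes_per_param


embed_tokens = matrix_bytes(VOCAB_SIZE, HIDDEN_SIZE)
lm_head = matrix_bytes(VOCAB_SIZE, HIDDEN_SIZE)
total = embed_tokens + lm_head

print(f"embed_tokens: {embed_tokens / 1e6:.0f} MB")
print(f"lm_head:      {lm_head / 1e6:.0f} MB")
print(f"together:     {total / 1e9:.2f} GB")
```

Under these assumptions the two matrices alone come to about 1.09 GB, so the LoRA matrices themselves would account for only a small fraction of the file.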

## Usage

```bash
# 1. Get the code and set up the environment
git clone https://github.com/AD-Styles/vlm-from-scratch
cd vlm-from-scratch
pip install --upgrade "torch>=2.6.0" "torchvision>=0.21.0" --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# 2. Download the weights from this repo (~1 GB)
huggingface-cli download AD-Styles/mini-llava-stage2 --local-dir checkpoints/v2_stage2_lora

# 3. Run the Gradio demo
python app.py \
  --checkpoint checkpoints/v2_stage2_lora/projector.pt \
  --lora-adapter checkpoints/v2_stage2_lora/lora_adapter
```

Open `http://localhost:7860` in a browser, upload an image, and ask a question in natural language.

## Training Setup

| Item | Value |
|------|-----|
| Base models | `openai/clip-vit-base-patch32` (frozen) + `Qwen/Qwen2.5-0.5B-Instruct` (frozen + LoRA) |
| Data | `HuggingFaceM4/the_cauldron`, mix of 3 configs (localized_narratives + aokvqa + vqav2), 9,000 samples |
| LoRA | rank=16, alpha=32, target=q/k/v/o projections |
| Epochs | 2 |
| Batch size | 2 (grad_accum=4 → effective 8) |
| Learning rate | 2e-4 (cosine) |
| GPU | NVIDIA RTX 4060 Laptop (8 GB VRAM) |
| Training time | 47 minutes |
| Trainable parameters | 3,655,424 (0.66% of total) |
| Init | Training continued from the v1 baseline projector (Stage 1 → Stage 2) |
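
The trainable-parameter count and the number of optimizer steps can be reproduced with simple arithmetic. The sketch below is one way to do so; the layer dimensions (CLIP width 768, Qwen2.5-0.5B hidden size 896, key/value width 128, 24 layers) and the projector shape (768→896→896 with biases) are assumptions that this card does not state.

```python
# Reproduce the trainable-parameter count from the LoRA settings.
# Every dimension below is an ASSUMPTION about the base models and projector.
CLIP_HIDDEN = 768   # assumed CLIP ViT-B/32 vision feature width
LLM_HIDDEN = 896    # assumed Qwen2.5-0.5B hidden size
KV_DIM = 128        # assumed k/v projection width (grouped-query attention)
NUM_LAYERS = 24     # assumed number of decoder layers
RANK = 16           # LoRA rank, from the card


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to one linear layer."""
    return rank * (d_in + d_out)


def linear_params(d_in: int, d_out: int) -> int:
    """Dense layer with bias."""
    return d_in * d_out + d_out


per_layer = (
    lora_params(LLM_HIDDEN, LLM_HIDDEN)   # q_proj
    + lora_params(LLM_HIDDEN, KV_DIM)     # k_proj
    + lora_params(LLM_HIDDEN, KV_DIM)     # v_proj
    + lora_params(LLM_HIDDEN, LLM_HIDDEN) # o_proj
)
lora_total = per_layer * NUM_LAYERS

# Hypothetical 2-layer MLP projector: CLIP width -> LLM width -> LLM width.
projector_total = linear_params(CLIP_HIDDEN, LLM_HIDDEN) + linear_params(LLM_HIDDEN, LLM_HIDDEN)

trainable = lora_total + projector_total

# Optimizer steps: 9,000 samples, batch 2 x grad_accum 4, 2 epochs.
steps_per_epoch = 9_000 // (2 * 4)
total_steps = steps_per_epoch * 2

print(f"LoRA: {lora_total:,}  projector: {projector_total:,}  trainable: {trainable:,}")
print(f"optimizer steps: {total_steps:,}")
```

Under these assumptions the sum equals the reported 3,655,424, which suggests (but does not prove) that the figure covers the LoRA matrices plus the projector and excludes `embed_tokens` / `lm_head`.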

## ✅ Validation Results (English VQA, dog photo)

| Question | Response | Correct |
|------|------|--------|
| What is in this image? | "Dog." | ✅ |
| What color is the dog? | "White." | ✅ |
| Is the dog wearing anything on its head? | "Yes." | ✅ |
| What is on the dog's head? | "Hat." | ✅ |
| Describe this image in one sentence. | "...cat on the floor." | ⚠️ (influenced by the Hello Kitty hat) |

**Test A: 4/5 correct.** The model also matched the instruction format automatically.

## ⚠️ Limitations (stated honestly)

- **Korean:** catastrophic forgetting from the LoRA fine-tuning, since the training data is 100% English. Asked in Korean "What is this dog wearing on its head?", the model returned a one-syllable Korean reply (the English short-answer bias carried over incorrectly into Korean).
- **OOD (cartoons/animation):** limited by the CLIP-ViT-B/32 representation. Pikachu → "Giraffe" (mapped to the nearest animal within the training distribution).
- **Hallucination:** the model cannot answer "I don't know" (a problem common to VLMs).

For a detailed analysis, see the [Test B/C section of the GitHub README](https://github.com/AD-Styles/vlm-from-scratch#-results).

## 🔮 Future Improvements (v3 roadmap)

1. Add 30%+ Korean instruction data → mitigate catastrophic forgetting
2. Upgrade to CLIP-ViT-L/14 (576 patches) → better OOD robustness
3. OOD detection module → learn to answer "I don't know"
4. vLLM / Triton Inference Server integration → production serving
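
To put the 576-patch figure in context, the sketch below computes how many image patches a ViT produces for a given input resolution and patch size. The input resolutions (224 px for the current ViT-B/32 encoder, 336 px for the ViT-L/14 variant) are assumptions not stated in this card, and the CLS token is not counted.

```python
# Number of patch tokens a ViT produces for a square input image.
def num_patches(image_size: int, patch_size: int) -> int:
    """Patches per image for a square input split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


current = num_patches(224, 32)    # assumed input size for ViT-B/32
upgraded = num_patches(336, 14)   # assumed input size for ViT-L/14

print(f"ViT-B/32 @ 224px: {current} patches")
print(f"ViT-L/14 @ 336px: {upgraded} patches")
print(f"ratio: {upgraded / current:.1f}x more visual tokens")
```

Under these assumptions the upgrade would give the language model roughly an order of magnitude more visual tokens per image, at a corresponding cost in sequence length and memory.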

## References

- LLaVA-1.5 (Liu et al., 2023), [arxiv:2310.03744](https://arxiv.org/abs/2310.03744)
- CLIP (Radford et al., 2021), [arxiv:2103.00020](https://arxiv.org/abs/2103.00020)
- LoRA (Hu et al., 2022), [arxiv:2106.09685](https://arxiv.org/abs/2106.09685)
- the_cauldron (Laurençon et al., 2024), [arxiv:2405.02246](https://arxiv.org/abs/2405.02246)

## License

MIT: free to use, modify, and distribute.

---

AD-Styles · 2026