---
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- vision-language
- multimodal
- llava
- qlora
---
# Mini-LLaVA v4: weights
Trained weights for a multimodal LLM assembled from scratch (`vlm-from-scratch-v4`).
- **Architecture**: CLIP-ViT-B/32 (frozen) + 2-layer MLP Projector + Qwen2.5-1.5B-Instruct + LoRA
- **Training**: QLoRA 4-bit NF4 · Stage 1 alignment → Stage 2 instruction tuning on 46K samples (balanced English + Korean mix) · RTX 4060 8GB
- **Evaluation**: raw model scores VQAv2 56.8% / POPE 71.8% (n=400, no wrapper). This is a small model trained on an 8GB GPU with about 90K samples, so its absolute performance falls short of public VLMs; see the GitHub README for details.
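
The data flow implied by the architecture above can be traced with plain Python. This is only a sketch under stated assumptions: the 224-pixel input resolution is the usual CLIP ViT-B/32 default rather than something this card states, whether the CLS token is kept is unknown (it is dropped here), and `Shapes`, `num_patches`, and `splice` are hypothetical names, not functions from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shapes:
    image_size: int = 224  # assumption: default CLIP ViT-B/32 input resolution
    patch_size: int = 32   # the "/32" in ViT-B/32
    vision_dim: int = 768  # CLIP feature width, per the file table
    llm_dim: int = 1536    # LLM hidden width, per the file table


def num_patches(s: Shapes) -> int:
    """Number of patch tokens the vision encoder emits (CLS token excluded)."""
    side = s.image_size // s.patch_size
    return side * side


def splice(token_ids, placeholder_id, n_visual):
    """Expand each image placeholder into n_visual slots.

    Returns the expanded id list and the positions whose embeddings would be
    overwritten by projected image features (one llm_dim vector per slot).
    """
    expanded, positions = [], []
    for tok in token_ids:
        if tok == placeholder_id:
            positions.extend(range(len(expanded), len(expanded) + n_visual))
            expanded.extend([placeholder_id] * n_visual)
        else:
            expanded.append(tok)
    return expanded, positions


shapes = Shapes()
n = num_patches(shapes)
# Hypothetical prompt: token 99 stands in for the image placeholder.
ids, slots = splice([1, 99, 2], placeholder_id=99, n_visual=n)
print(f"{n} visual tokens: ({n}, {shapes.vision_dim}) -> projector -> ({n}, {shapes.llm_dim})")
print(f"sequence length {len(ids)}, image slots {slots[0]}..{slots[-1]}")
```

Under these assumptions each image occupies 49 positions in the LLM input sequence, and only the projector changes the feature width from 768 to 1536.
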
## Files
| File | Description |
|---|---|
| `projector.pt` | MultiModalProjector (CLIP 768 → LLM 1536) state_dict |
| `lora_adapter/` | LoRA adapter (r=16) for the linear layers of Qwen2.5-1.5B |

The `<image>` token reuses Qwen2.5's built-in `<|image_pad|>` token, so the adapter carries no extra embedding weights (all 70 MB is LoRA).
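
The 70 MB figure can be sanity-checked with a short calculation. The sketch below assumes the layer dimensions from the public Qwen2.5-1.5B configuration (hidden size 1536, MLP width 8960, 28 layers, 2 key/value heads of size 128), that LoRA targets all seven linear projections in each decoder block, and that the adapter is stored in float32. None of these details are stated in this card, and the helper names are hypothetical.

```python
# Assumed Qwen2.5-1.5B dimensions (from the public model config, not this card).
HIDDEN = 1536
INTERMEDIATE = 8960
KV_DIM = 2 * 128  # 2 key/value heads x head dim 128
LAYERS = 28
RANK = 16         # r=16, as stated in the file table

# (in_features, out_features) for each linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


def adapter_params(rank: int = RANK) -> int:
    """Total LoRA parameters across all decoder blocks."""
    per_layer = sum(lora_params(i, o, rank) for i, o in LINEAR_SHAPES.values())
    return per_layer * LAYERS


total = adapter_params()
size_mib = total * 4 / 2**20  # assumption: 4 bytes per float32 parameter
print(f"{total:,} LoRA parameters, about {size_mib:.1f} MiB in float32")
```

Under these assumptions the adapter has 18,464,768 parameters, or about 70.4 MiB, which is in line with the size reported above and leaves no room for a resized embedding matrix.
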
## Usage
For inference code, see `src/` in [github.com/AD-Styles/vlm-from-scratch-v4](https://github.com/AD-Styles/vlm-from-scratch-v4). Demo: HF Space `AD-Styles/mini-llava-v4-demo`.