---
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
  - vision-language
  - multimodal
  - llava
  - qlora
---

# Mini-LLaVA v4: weights

์ฒ˜์Œ๋ถ€ํ„ฐ ์กฐ๋ฆฝํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ LLM (`vlm-from-scratch-v4`) ์˜ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜.

- **Architecture**: CLIP-ViT-B/32 (frozen) + 2-layer MLP Projector + Qwen2.5-1.5B-Instruct + LoRA
- **Training**: QLoRA 4-bit NF4 · Stage 1 alignment → Stage 2 instruction tuning on 46K samples (balanced English + Korean mix) · RTX 4060 8GB
- **Evaluation**: VQAv2 56.8% / POPE 71.8% on the raw model (n=400, no wrapper). This is a small model trained on an 8 GB GPU with about 90K samples, so its absolute performance falls short of public VLMs. See the GitHub README for details.

## ํŒŒ์ผ

| ํŒŒ์ผ | ์„ค๋ช… |
|---|---|
| `projector.pt` | MultiModalProjector (CLIP 768 โ†’ LLM 1536) state_dict |
| `lora_adapter/` | Qwen2.5-1.5B ์ „ linear layer LoRA ์–ด๋Œ‘ํ„ฐ (r=16) |
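
To make the projector's role concrete, here is a minimal PyTorch sketch of a 2-layer MLP that maps CLIP features (768) to the LLM hidden size (1536). The class name `ProjectorSketch`, the GELU activation, the 1536-wide hidden layer, and the 49-token input (ViT-B/32 patches at 224 px with the CLS token dropped) are all assumptions for illustration. The parameter names inside `projector.pt` may differ, so this class is not guaranteed to load that state_dict.

```python
import torch
from torch import nn


class ProjectorSketch(nn.Module):
    """Hypothetical stand-in for the card's MultiModalProjector."""

    def __init__(self, clip_dim: int = 768, llm_dim: int = 1536):
        super().__init__()
        # Two linear layers with a nonlinearity in between (activation is assumed).
        self.net = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, clip_dim) -> (batch, num_patches, llm_dim)
        return self.net(patch_features)


# Dummy CLIP output: 1 image, 49 patch tokens, 768 features each.
patches = torch.randn(1, 49, 768)
tokens = ProjectorSketch()(patches)
print(tuple(tokens.shape))
```

Each projected row has the same width as a Qwen2.5-1.5B token embedding, which is what lets image features sit alongside text embeddings in the LLM input sequence.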

`<image>` ํ† ํฐ์œผ๋กœ Qwen2.5 ๋‚ด์žฅ `<|image_pad|>` ๋ฅผ ์žฌ์‚ฌ์šฉํ•˜๋ฏ€๋กœ adapter ์—
embedding ๊ตฐ๋”๋”๊ธฐ๊ฐ€ ์—†๋‹ค (70 MB ์ „๋ถ€ LoRA).
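
As a rough sanity check on the 70 MB figure, the sketch below counts LoRA parameters for r=16. It assumes the layer shapes from the public Qwen2.5-1.5B config (hidden size 1536, MLP size 8960, 28 layers, key/value width 256), that "all linear layers" means the seven projections in each decoder block, and that the adapter is stored as float32. None of these details are stated in this card.

```python
# Assumed Qwen2.5-1.5B shapes (from the public config, not from this card).
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 256  # 2 key/value heads x head_dim 128
RANK = 16     # r=16, as listed in the file table

# (in_features, out_features) of each linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in LINEAR_SHAPES.values())
total_params = per_layer * NUM_LAYERS
size_mib = total_params * 4 / 2**20  # 4 bytes per parameter (float32 assumed)

print(total_params, round(size_mib, 1))
```

Under these assumptions the count is about 18.5M parameters, or roughly 70 MiB, which is consistent with an adapter that contains LoRA matrices only.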

## Usage

For inference code, see `src/` in [github.com/AD-Styles/vlm-from-scratch-v4](https://github.com/AD-Styles/vlm-from-scratch-v4).
Demo: HF Space `AD-Styles/mini-llava-v4-demo`.
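
For orientation only, the following sketch shows one way the two files could be loaded with `transformers` and `peft` once they are downloaded into a local directory. The helper names `weight_paths` and `load_mini_llava`, the local directory layout, and the float16 dtype are assumptions. The repository's `src/` remains the reference for actual inference, including how image features are inserted at the `<|image_pad|>` positions.

```python
from pathlib import Path


def weight_paths(weights_dir: str) -> dict:
    """Locations of the two files listed in this card (hypothetical helper)."""
    root = Path(weights_dir)
    return {
        "projector": root / "projector.pt",
        "lora_adapter": root / "lora_adapter",
    }


def load_mini_llava(weights_dir: str, device: str = "cuda"):
    """Return (tokenizer, LoRA-wrapped LLM, projector state_dict)."""
    # Heavy imports stay inside the function so importing this module is cheap.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    paths = weight_paths(weights_dir)
    base_id = "Qwen/Qwen2.5-1.5B-Instruct"  # base_model from the card metadata

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

    # Attach the LoRA adapter on top of the base model.
    model = PeftModel.from_pretrained(base, str(paths["lora_adapter"]))
    model.to(device).eval()

    # The projector is shipped as a plain state_dict; the module class that
    # consumes it lives in the GitHub repository.
    projector_state = torch.load(paths["projector"], map_location="cpu")
    return tokenizer, model, projector_state
```

A call such as `load_mini_llava("./mini-llava-v4")` would then return the pieces needed by the inference code, assuming that directory holds `projector.pt` and `lora_adapter/`.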