Mini-LLaVA: Stage 2 LoRA Weights

Trained weights for a multimodal LLM that implements the core LLaVA-1.5 architecture from scratch. High-level abstractions such as HuggingFace's LlavaForConditionalGeneration are not used; the fusion logic is implemented directly.

🚀 Live Demo: https://huggingface.co/spaces/AD-Styles/mini-llava-demo (try it instantly, no installation)
📂 Code repository: https://github.com/AD-Styles/vlm-from-scratch
📝 Detailed analysis (including the v1→v2 retrospective): GitHub README
📊 Test A/B/C results table: GitHub README #-results
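
The fusion logic mentioned above can be pictured as splicing projected image embeddings into the text embedding sequence at the <image> placeholder. The following is a minimal sketch in plain Python, assuming a LLaVA-style single placeholder; it is not the repository's code. The function name `splice_image_embeddings`, the token id 99, and the toy 1-dimensional vectors are hypothetical, and 49 is simply the patch count of a 224×224 image cut into 32×32 patches.

```python
from typing import List

Vector = List[float]


def splice_image_embeddings(
    token_ids: List[int],
    text_embeds: List[Vector],
    image_embeds: List[Vector],
    image_token_id: int,
) -> List[Vector]:
    """Replace the single <image> placeholder with the projected patch embeddings."""
    if len(token_ids) != len(text_embeds):
        raise ValueError("token_ids and text_embeds must have the same length")
    if token_ids.count(image_token_id) != 1:
        raise ValueError("expected exactly one <image> placeholder")
    pos = token_ids.index(image_token_id)
    # Text before the placeholder + image patches + text after the placeholder.
    return text_embeds[:pos] + image_embeds + text_embeds[pos + 1:]


# Toy example: hypothetical ids and 1-dimensional "embeddings".
IMAGE_TOKEN_ID = 99  # hypothetical id for <image>
token_ids = [10, IMAGE_TOKEN_ID, 11, 12]
text_embeds = [[0.1], [0.0], [0.2], [0.3]]
image_embeds = [[1.0] for _ in range(49)]  # 49 = (224 // 32) ** 2 patches

fused = splice_image_embeddings(token_ids, text_embeds, image_embeds, IMAGE_TOKEN_ID)
print(len(fused))  # 4 - 1 + 49 = 52
```

The fused sequence is what the language model would consume in place of ordinary token embeddings; attention masks and labels would need the same expansion.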

🧩 Contents

| File | Size | Role |
|---|---|---|
| projector.pt | 5.7 MB | Maps CLIP-ViT-B/32 features into the Qwen2.5 embedding space (2-layer MLP) |
| lora_adapter/adapter_config.json | 1 KB | PEFT LoRA configuration |
| lora_adapter/adapter_model.safetensors | ~1 GB | LoRA r=16 weights + embed_tokens / lm_head (including changes made during training) |

Why 1 GB? PEFT detects the embedding resize caused by adding the <image> special token and saves embed_tokens / lm_head together with the adapter. A naive attempt to strip them out (scripts/extract_lora.py in the code repository) changed 3 of 5 test responses to completely different answers, confirming experimentally that embed_tokens is part of the trained state. The original 1 GB file is therefore distributed unchanged.
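
A rough size estimate supports this explanation. The sketch below assumes the published Qwen2.5-0.5B configuration (hidden size 896, an embedding matrix of 151,936 rows) and float32 storage. The exact row count after adding <image> and the dtype of the saved tensors are assumptions, not values stated in this card.

```python
# Back-of-the-envelope size of the two full matrices saved next to the LoRA weights.
VOCAB_ROWS = 151_936   # assumption: Qwen2.5-0.5B embedding rows (may differ slightly after resize)
HIDDEN_SIZE = 896      # assumption: Qwen2.5-0.5B hidden size
BYTES_PER_PARAM = 4    # assumption: float32 storage

params_per_matrix = VOCAB_ROWS * HIDDEN_SIZE
total_bytes = 2 * params_per_matrix * BYTES_PER_PARAM  # embed_tokens + lm_head

print(f"{params_per_matrix:,} parameters per matrix")            # 136,134,656
print(f"{total_bytes / 1e9:.2f} GB for embed_tokens + lm_head")  # 1.09
```

Under these assumptions the two matrices alone come to about 1.09 GB, which is consistent with the ~1 GB adapter file.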

🚀 Usage

# 1. Prepare the code and environment
git clone https://github.com/AD-Styles/vlm-from-scratch
cd vlm-from-scratch
pip install --upgrade "torch>=2.6.0" "torchvision>=0.21.0" --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# 2. Download the weights from this repository (~1 GB)
huggingface-cli download AD-Styles/mini-llava-stage2 --local-dir checkpoints/v2_stage2_lora

# 3. Launch the Gradio demo
python app.py \
  --checkpoint checkpoints/v2_stage2_lora/projector.pt \
  --lora-adapter checkpoints/v2_stage2_lora/lora_adapter

๋ธŒ๋ผ์šฐ์ €์—์„œ http://localhost:7860 ์ ‘์† โ†’ ์ด๋ฏธ์ง€ ์—…๋กœ๋“œ โ†’ ์ž์—ฐ์–ด ์งˆ๋ฌธ.

📊 Training Configuration

| Item | Value |
|---|---|
| Base models | openai/clip-vit-base-patch32 (frozen) + Qwen/Qwen2.5-0.5B-Instruct (frozen + LoRA) |
| Data | HuggingFaceM4/the_cauldron, 3-config mix (localized_narratives + aokvqa + vqav2), 9,000 samples |
| LoRA | rank=16, alpha=32, target=q/k/v/o projections |
| Epochs | 2 |
| Batch size | 2 (grad_accum=4 → effective 8) |
| Learning rate | 2e-4 (cosine) |
| GPU | NVIDIA RTX 4060 Laptop (8 GB VRAM) |
| Training time | 47 minutes |
| Trainable parameters | 3,655,424 (0.66% of total) |
| Init | Continued from the v1 baseline projector (Stage 1 → Stage 2) |
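
The trainable-parameter figure can be reproduced with simple arithmetic. The sketch below takes its dimensions from the public model configs (Qwen2.5-0.5B: hidden size 896, 24 layers, key/value width 128; CLIP ViT-B/32: vision width 768), none of which are stated in this card. The projector layout, Linear(768 → 896) followed by Linear(896 → 896) with biases, is an assumption; it is one 2-layer MLP under which the total matches the reported count.

```python
# Assumed dimensions (from the public model configs, not stated in this card).
HIDDEN = 896      # Qwen2.5-0.5B hidden size
KV_DIM = 128      # 2 key/value heads x head_dim 64 (grouped-query attention)
NUM_LAYERS = 24   # Qwen2.5-0.5B decoder layers
CLIP_DIM = 768    # CLIP ViT-B/32 vision hidden size
RANK = 16         # LoRA rank from the training configuration


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank), with no bias."""
    return rank * (d_in + d_out)


def linear_params(d_in: int, d_out: int) -> int:
    """Dense layer with bias."""
    return d_in * d_out + d_out


per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
lora_total = per_layer * NUM_LAYERS

# Assumed projector layout: Linear(768 -> 896) + activation + Linear(896 -> 896).
projector_total = linear_params(CLIP_DIM, HIDDEN) + linear_params(HIDDEN, HIDDEN)

print(f"LoRA:      {lora_total:,}")                    # 2,162,688
print(f"Projector: {projector_total:,}")               # 1,492,736
print(f"Total:     {lora_total + projector_total:,}")  # 3,655,424

# Optimizer steps, assuming all 9,000 samples are used for training.
effective_batch = 2 * 4                       # batch size x gradient accumulation
steps_per_epoch = 9_000 // effective_batch
print(f"Optimizer steps: {steps_per_epoch * 2:,}")  # 2,250 over 2 epochs
```

Under the same assumption, 1,492,736 float32 values occupy about 5.97 million bytes (5.69 MiB), which is in line with the 5.7 MB size of projector.pt.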

✅ Validation Results (English VQA, dog photo)

| Question | Response | Correct |
|---|---|---|
| What is in this image? | "Dog." | ✅ |
| What color is the dog? | "White." | ✅ |
| Is the dog wearing anything on its head? | "Yes." | ✅ |
| What is on the dog's head? | "Hat." | ✅ |
| Describe this image in one sentence. | "...cat on the floor." | ⚠️ (influenced by the Hello Kitty hat) |

Test A: 4/5 correct; the instruction format was matched automatically.

⚠️ Limitations (stated honestly)

  • Korean: catastrophic forgetting from LoRA fine-tuning, since the training data is 100% English. "이 강아지는 머리에 무엇을 쓰고 있나요?" ("What is this dog wearing on its head?") → "개." ("Dog."); the English short-answer bias carries over incorrectly into Korean output.
  • OOD (cartoons/animation): a representational limit of CLIP-ViT-B/32. Pikachu → "Giraffe" (mapped to the nearest animal in the training distribution).
  • Hallucination: the model cannot answer "I don't know" (a problem common to VLMs).

For a detailed analysis, see the Test B/C sections of the GitHub README.

🔮 Future Improvements (v3 roadmap)

  1. Add 30%+ Korean instruction data → mitigate catastrophic forgetting
  2. Upgrade to CLIP-ViT-L/14 (576 patches) → better OOD robustness
  3. OOD detection module → learn to answer "I don't know"
  4. vLLM / Triton Inference Server integration → production serving
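
Item 2 implies a much longer visual sequence. The figure of 576 patches corresponds to a 336×336 input with 14-pixel patches; that resolution is inferred from the patch count, and the 224×224 input for the current ViT-B/32 encoder is assumed from the standard openai/clip-vit-base-patch32 setup. A short sketch of the arithmetic, with `num_patches` as a hypothetical helper:

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


current = num_patches(224, 32)  # CLIP-ViT-B/32 at 224 px
planned = num_patches(336, 14)  # CLIP-ViT-L/14 at 336 px
print(current, planned, round(planned / current, 1))  # 49 576 11.8
```

Roughly twelve times as many image positions per prompt would raise memory use during training, which is worth weighing against the 8 GB VRAM budget used for this stage.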

License

MIT. Free to use, modify, and distribute.


🤖 김도윤 (AD-Styles) · 2026
