---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- google/siglip2-so400m-patch14-384
- keeeeenw/MicroLlama
model-index:
- name: MicroLLaVA (MicroLLaMA 300M + SigLIP2-so400m-patch14-384)
  results:
  - task:
      type: visual-question-answering
      name: VQAv2
    dataset:
      name: VQAv2
      type: vqav2
    metrics:
    - name: Overall Accuracy
      type: accuracy
      value: 56.91
    - name: Yes/No Accuracy
      type: accuracy
      value: 72.32
    - name: Number Accuracy
      type: accuracy
      value: 43.89
    - name: Other Accuracy
      type: accuracy
      value: 46.65
    source:
      name: Internal Evaluation on VQAv2 test-dev
      url: https://visualqa.org/download.html
---

# MicroLLaVA

A compact vision-language model that you can pretrain and finetune on a single consumer GPU.

## 🔍 Performance & Training Highlights

- 📊 **VQAv2 Accuracy**:  
  Achieves **56.91%** on the VQAv2 test-dev split, making MicroLLaVA one of the best-performing open-source vision-language models under **700M total parameters**.

- 🧠 **Parameter Budget** (a quick sanity check is sketched after this list):
  - 🗣️ Language Model: **MicroLLaMA (300M)**
  - 👁️ Vision Encoder: **SigLIP2 (400M)**  
  → **~700M total parameters**

- πŸ† **Best in Class**:  
  According to ChatGPT’s Deep Research Agent (Aug 2025):  
  > *β€œNo known open model below ~700M currently surpasses MicroLLaVA’s VQAv2 accuracy. Models that do perform better tend to have larger language components.”*

- 🧪 **Ongoing Experiments**:
  - 🔧 **Qwen3-0.6B + SigLIP2**  
    → Training is **converging**, with promising loss curves. (Qwen3-0.6B is significantly larger than MicroLLaMA.)
  - ❌ **Gemma-3-270M-IT + SigLIP2**  
    → Training **did not converge**, likely due to instability, bugs, or poor alignment under the current hyperparameters.

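As promised in the Parameter Budget item above, here is a quick sanity check of the total parameter count once the checkpoint is downloaded. This is a minimal sketch; it assumes the repo loads via `AutoModelForCausalLM` with `trust_remote_code=True`, as in the Quick start below:

```python
from transformers import AutoModelForCausalLM

# Hypothetical check: load the full multimodal checkpoint and count parameters.
model = AutoModelForCausalLM.from_pretrained(
    "keeeeenw/MicroLlava", trust_remote_code=True
)
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")  # expect roughly ~700M
```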
## 📰 News and Updates

* 08/17/2025: This Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
* 08/17/2025: Improved the **VQAv2** average test-dev score from **44.01%** to **56.91%** by upgrading the vision tower from SigLIP to SigLIP2.
* 08/09/2025: Initial version of MicroLLaVA released.

## 🎯 TLDR

| Item            | Detail |
|-----------------|--------|
| Framework       | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM             | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower    | [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384) |
| Hardware used   | Single NVIDIA RTX 4090 |
| Training stack  | No DeepSpeed required |
| Intended tasks  | Visual Question Answering, caption-style prompts |

---

## 📋 Introduction

MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)-based model that pairs the very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP2 vision encoder.  
The goal is a vision-language model that almost anyone can train and iterate on with a single consumer GPU.

- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters  
- **Vision encoder**: [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)

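Conceptually, a TinyLLaVA-style model feeds the vision encoder's patch features through a small projector into the language model's embedding space. The sketch below illustrates that wiring only; the module names, dimensions, and two-layer MLP connector are assumptions, and the real implementation lives in TinyLLaVA Factory:

```python
import torch
import torch.nn as nn

class LlavaStyleSketch(nn.Module):
    """Illustrative LLaVA-style wiring: vision features -> projector -> LLM.

    All names and dimensions here are hypothetical; see TinyLLaVA Factory
    for the actual implementation.
    """

    def __init__(self, vision_encoder, language_model, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP2 backbone
        self.language_model = language_model  # e.g. MicroLlama
        # Small trainable connector mapping patch features into the LLM space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vision_encoder(pixel_values)  # (B, N_patches, vision_dim)
        image_tokens = self.projector(patch_feats)       # (B, N_patches, llm_dim)
        # Prepend projected image tokens to the text token embeddings
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```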
Because of its compact size, the model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.

Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on that GPU.

Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU.

---

## 🚀 Quick start

```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the multimodal processor if the repo provides a processor config
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None  # optional when images are preprocessed manually

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True  # the repo ships custom modeling code
)

# Text-only smoke test; see the image example after this block for a full VQA round trip
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

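If the processor loads successfully, an end-to-end image-plus-question round trip would look roughly like the sketch below. This is a sketch under assumptions: that the processor accepts `images` and `text` like a standard VLM processor and that the repo's custom code handles image token insertion; check the repo for the exact prompt template.

```python
import requests
from PIL import Image

# Hypothetical end-to-end VQA example, reusing `processor`, `model`,
# and `tokenizer` from the snippet above.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any test image
image = Image.open(requests.get(url, stream=True).raw)

if processor is not None:
    inputs = processor(
        images=image,
        text="What animals are in this picture?",
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```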
## πŸ† Evaluation

### VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch14-384)

| Question Type | Accuracy |
|---------------|----------|
| Yes/No | 72.32% |
| Number | 43.89% |
| Other | 46.65% |
| **Overall** | **56.91%** |

*Evaluated on VQAv2 test-dev split*

### (Previous version) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch14-384)

| Question Type | Accuracy |
|---------------|----------|
| Yes/No | 65.08% |
| Number | 28.97% |
| Other | 29.32% |
| **Overall** | **44.01%** |

*Evaluated on VQAv2 test-dev split*

More evaluation results will be added in the coming days.

Community contributions with benchmark results are welcome and encouraged.

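For context on how these numbers are computed: VQAv2 scores a predicted answer by agreement with ten human annotators, averaging over all leave-one-out subsets. A minimal sketch of that metric follows (answer normalization, which the official evaluator also applies, is omitted here):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """VQAv2-style accuracy: an answer counts as fully correct when at
    least 3 of the remaining 9 annotators gave it, averaged over all
    10 leave-one-out subsets of the human answers."""
    assert len(human_answers) == 10
    scores = []
    for i in range(10):
        subset = human_answers[:i] + human_answers[i + 1:]
        matches = sum(ans == prediction for ans in subset)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Example: 8 of 10 annotators said "yes"
print(vqa_accuracy("yes", ["yes"] * 8 + ["no"] * 2))  # 1.0
```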
---

## ✅ Intended uses and limitations

**Intended uses**
- Rapid experimentation for vision-language research on limited hardware  
- Educational demonstrations for students and hobbyists  
- Starting point for domain-specific finetuning  

**Limitations**
- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance  
- Performance can vary significantly depending on the image domain and quality  
- The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards  

> ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

---

## 📝 Citation

```bibtex
@misc{wang2025microllava,
  title        = {MicroLLaVA: a TinyLLaVA-based VLM with MicroLlama 300M for single-GPU training},
  author       = {Zixiao Ken Wang},
  year         = {2025},
  url          = {https://huggingface.co/keeeeenw/MicroLlava}
}
```

## 📄 License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).  

You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license.  
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.  

> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

---

## 🙏 Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework  
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created. Please help support my work!
- **SigLIP2** authors for the efficient vision encoder architecture  
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning  
- The Hugging Face ecosystem for hosting, tools, and community support