---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- keeeeenw/MicroLlama
- google/siglip-so400m-patch14-384
---

# MicroLLaVA (TinyLLaVA Factory based)

A compact vision-language model that you can pretrain and finetune on a single consumer GPU.

## TL;DR

| Item | Detail |
|-----------------|--------|
| Framework | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
| Hardware used | Single NVIDIA RTX 4090 |
| Training stack | No DeepSpeed required |
| Intended tasks | Visual question answering, caption-style prompts |

---

## Introduction

MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) based model that pairs the very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder. The goal is a vision-language model that almost anyone can train and iterate on with a single consumer GPU.

- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), with additional training tweaks in [my fork](https://github.com/keeeeenw/TinyLLaVA_Factory)

---

## Files included

| File | Purpose |
|----------------------------|---------|
| `config.json` | Model configuration for Transformers |
| `generation_config.json` | Generation defaults |
| `model.safetensors` | Model weights |
| `tokenizer.model` | SentencePiece tokenizer model |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Special token mapping |
| `trainer_state.json` | Trainer state |
| `training_args.bin` | Training arguments |
| `log.txt` | Training log |

If your workflow uses a custom processor, also include `preprocessor_config.json` or `processor_config.json` so that `AutoProcessor.from_pretrained` works.

Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed:

- Pretraining on **LAION-CC-SBU-558K** took about **5 hours**.
- Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU.

---

## Quick start

```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor if a processor config is available
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None  # Optional if images are preprocessed manually

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # Needed if the repo includes custom modeling code
)

# Text-only smoke test (no image); see the image example below
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
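
If the repo ships a processor config, an image-grounded query looks roughly like the snippet below. This is a minimal sketch rather than the checkpoint's verified API: it assumes a LLaVA-style processor that accepts `text` and `images` together and a prompt with an `<image>` placeholder token, and `example.jpg` is a stand-in for any local image.

```python
# Minimal sketch for image inputs. Assumptions (not verified against this
# checkpoint): the processor follows a LLaVA-style interface and the model
# expects an "<image>" placeholder in the prompt.
from PIL import Image

if processor is not None:
    image = Image.open("example.jpg").convert("RGB")  # placeholder path
    prompt = "<image>\nDescribe the image in one sentence."
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```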

## Evaluation

Evaluation results will be added in the coming days. Planned tests include:

- VQAv2-style prompts for question answering
- additional benchmarks to follow

Community contributions with benchmark results are welcome and encouraged; a minimal evaluation sketch follows.
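
As a rough starting point for contributors, the sketch below shows what a VQAv2-style check could look like. It is illustrative only: `vqa_pairs` is a hypothetical list of `(image_path, question, answer)` tuples, the prompt template is assumed rather than taken from this repo, and official VQAv2 scoring uses the 10-annotator consensus metric rather than exact match.

```python
# Illustrative VQAv2-style exact-match loop; not the official metric.
# `vqa_pairs` is a hypothetical list of (image_path, question, answer) tuples.
from PIL import Image

def vqa_exact_match(model, tokenizer, processor, vqa_pairs):
    correct = 0
    for image_path, question, answer in vqa_pairs:
        image = Image.open(image_path).convert("RGB")
        prompt = f"<image>\n{question}\nAnswer with a single word or phrase."
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=16)
        # Drop the prompt tokens and decode only the generated answer
        answer_ids = output_ids[0][inputs["input_ids"].shape[1]:]
        prediction = tokenizer.decode(answer_ids, skip_special_tokens=True)
        correct += int(prediction.strip().lower() == answer.strip().lower())
    return correct / len(vqa_pairs)
```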

---

## Intended uses and limitations

**Intended uses**
- Rapid experimentation for vision-language research on limited hardware
- Educational demonstrations for students and hobbyists
- Starting point for domain-specific finetuning

**Limitations**
- The small LLM and compact vision encoder may limit reasoning depth and OCR performance
- Performance can vary significantly with image domain and quality
- The model includes minimal safety filtering and refusal behavior, so downstream applications should implement their own safeguards

> ⚠️ This model should not be used for applications that may cause harm or that have significant safety, financial, legal, or medical implications without thorough human review.

---

## Reproducibility checklist

To reproduce training runs and results:

1. Fix all random seeds in training scripts (see the seeding sketch below)
2. Record exact dataset versions and any filtering applied
3. Log the optimizer type, learning rate schedule, precision settings, and gradient accumulation steps
4. Save the exact TinyLLaVA Factory (or fork) commit used for both pretraining and finetuning
5. Document hardware and software versions (CUDA, PyTorch, etc.)
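
For item 1, a common pattern is a single helper that seeds Python, NumPy, and PyTorch together. `seed_everything` here is an illustrative name, not a function from TinyLLaVA Factory:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch (CPU and CUDA) for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Deterministic cuDNN kernels trade speed for reproducibility
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```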

---

## Citation

```bibtex
@misc{wang2025microllava,
  title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
}
```

## License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

You are free to use, modify, and distribute this model and its derivatives, provided you comply with the terms of the license.
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.

> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

---

## Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created; please consider supporting that work as well
- **SigLIP** authors for the efficient vision encoder architecture
- Contributors to **LAION-CC-SBU-558K** and the other datasets used in pretraining and finetuning
- The Hugging Face ecosystem for hosting, tools, and community support