keeeeenw/MicroLlava

@@ -11,54 +11,42 @@ pipeline_tag: visual-question-answering
 license: apache-2.0
 base_model:
 - keeeeenw/MicroLlama
-- google/siglip-so400m-patch14-384
 ---
-# MicroLLaVA (TinyLLaVA Factory based)
 A compact vision language model that you can pretrain and finetune on a single consumer GPU.
-## TLDR
 | Item            | Detail |
 |-----------------|--------|
 | Framework       | Transformers + PyTorch |
 | Checkpoint type | `safetensors` |
 | LLM             | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
-| Vision tower    | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
 | Hardware used   | Single NVIDIA RTX 4090 |
 | Training stack  | No DeepSpeed required |
 | Intended tasks  | Visual Question Answering, caption-style prompts |
 ---
-## Introduction
 MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) based model that pairs a very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder.
 The goal is to create a vision language model that almost anyone can train and iterate on with one consumer GPU.
 - **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
-- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
 - **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)
----
-## Files included
-| File                       | Purpose |
-|----------------------------|---------|
-| `config.json`              | Model configuration for Transformers |
-| `generation_config.json`   | Generation defaults |
-| `model.safetensors`        | Weights |
-| `tokenizer.model`          | SentencePiece model |
-| `tokenizer_config.json`    | Tokenizer configuration |
-| `special_tokens_map.json`  | Special token mapping |
-| `trainer_state.json`       | Trainer state |
-| `training_args.bin`        | Training arguments |
-| `log.txt`                  | Training log |
-If your workflow uses a custom processor, also include `preprocessor_config.json` or `processor_config.json` so `AutoProcessor.from_pretrained` works.
 Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.
 Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on a single NVIDIA RTX 4090 without DeepSpeed.
@@ -73,7 +61,7 @@ Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `
 from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
 import torch
-repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"
 tokenizer = AutoTokenizer.from_pretrained(repo_id)
@@ -97,10 +85,29 @@ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
 ## Evaluation
-Evaluation results will be added in the coming days. Planned tests include:
-- VQAv2-style prompts for question answering
-- and more
 Community contributions with benchmark results are welcome and encouraged.
@@ -141,7 +148,7 @@ To reproduce results and training runs:
   title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
   author       = {Zixiao Ken Wang},
   year         = {2025},
-  url          = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
 }
 ```
@@ -162,7 +169,6 @@ This work builds upon the efforts of many in the open-source AI community:
 - **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
 - **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)** I am also the creator of MicroLlama. Please help support my work!
-- **SigLIP** authors for the efficient vision encoder architecture
 - Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
-- The Hugging Face ecosystem for hosting, tools, and community support

 license: apache-2.0
 base_model:
 - keeeeenw/MicroLlama
 ---
+# MicroLLaVA
 A compact vision language model that you can pretrain and finetune on a single consumer GPU.
+## 📰 News and Updates
+* 08/17/2025: this Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
+* 08/17/2025: improved the overall **VQAv2** test-dev accuracy from **44.01%** to **56.91%** by upgrading the vision tower from SigLIP to SigLIP2.
+* 08/09/2025: initial version of MicroLlava released
+## 🚀 TLDR
 | Item            | Detail |
 |-----------------|--------|
 | Framework       | Transformers + PyTorch |
 | Checkpoint type | `safetensors` |
 | LLM             | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
+| Vision tower    | [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384) |
 | Hardware used   | Single NVIDIA RTX 4090 |
 | Training stack  | No DeepSpeed required |
 | Intended tasks  | Visual Question Answering, caption-style prompts |
 ---
+## 📋 Introduction
 MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) based model that pairs a very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder.
 The goal is to create a vision language model that almost anyone can train and iterate on with one consumer GPU.
 - **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
+- **Vision encoder**: [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384)
 - **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)
 Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.
 Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on a single NVIDIA RTX 4090 without DeepSpeed.
 from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
 import torch
+repo_id = "keeeeenw/MicroLlava"
 tokenizer = AutoTokenizer.from_pretrained(repo_id)
 ## Evaluation
+### VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch14-384)
+| Question Type | Accuracy |
+|---------------|----------|
+| Yes/No | 72.32% |
+| Number | 43.89% |
+| Other | 46.65% |
+| **Overall** | **56.91%** |
+*Evaluated on VQAv2 test-dev split*
+### (Previous version) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch14-384)
+| Question Type | Accuracy |
+|---------------|----------|
+| Yes/No | 65.08% |
+| Number | 28.97% |
+| Other | 29.32% |
+| **Overall** | **44.01%** |
+*Evaluated on VQAv2 test-dev split*
+More evaluation results will be added in the coming days.
 Community contributions with benchmark results are welcome and encouraged.
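
As a quick check on the two result tables above, the per-category gain from swapping the SigLIP vision tower for SigLIP2 can be computed directly from the reported accuracies. This is a minimal sketch, not part of the repository: the dictionary and variable names are illustrative, and the numbers are copied from the tables.

```python
# VQAv2 test-dev accuracies (%) copied from the tables above.
# Names here are illustrative and not part of the model repo.
siglip = {"Yes/No": 65.08, "Number": 28.97, "Other": 29.32, "Overall": 44.01}
siglip2 = {"Yes/No": 72.32, "Number": 43.89, "Other": 46.65, "Overall": 56.91}

# Absolute gain in percentage points for each question type.
gains = {k: round(siglip2[k] - siglip[k], 2) for k in siglip}

for question_type, gain in gains.items():
    print(f"{question_type:<8} +{gain:.2f} points")
```

The gains are largest on the Number and Other categories (about 15 and 17 points), while Yes/No improves by about 7 points and the overall score by about 13 points.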
   title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
   author       = {Zixiao Ken Wang},
   year         = {2025},
+  url          = {https://huggingface.co/keeeeenw/MicroLlava}
 }
 ```
 - **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
 - **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)** I am also the creator of MicroLlama. Please help support my work!
+- **SigLIP2** authors for the efficient vision encoder architecture
 - Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
+- The Hugging Face ecosystem for hosting, tools, and community support