🍎 MacroVLM Base - Qwen3-VL-4B Q4_K_M

Vision-Language Model for Food Nutrition Estimation

This is the base model for the MacroVLM project - a lightweight VLM designed to estimate calories, protein, carbs, and fat from food images.


📊 Model Details

| Property | Value |
| --- | --- |
| Base Model | Qwen/Qwen3-VL-4B-Instruct |
| Quantization | Q4_K_M (4-bit, k-quant mixed) |
| Model Size | 2.33 GB |
| Vision Encoder | mmproj-F16.gguf (0.78 GB) |
| Total Size | ~3.1 GB |
| Format | GGUF (llama.cpp compatible) |
| License | Apache 2.0 |

🎯 Intended Use

This model is designed for:

  • Food image analysis - Identify foods in images
  • Nutrition estimation - Estimate calories and macros
  • On-device deployment - Runs on mobile/edge devices
  • Research - Baseline for nutrition VLM fine-tuning

Output Format

{
  "calories": 350,
  "protein_g": 25,
  "carbs_g": 30,
  "fat_g": 15
}
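Because the base model sometimes answers in free-form text instead of JSON, it is worth validating responses before using them. A minimal sketch in plain Python (no external dependencies; the schema keys are taken from the example above):

```python
import json

EXPECTED_KEYS = {"calories", "protein_g", "carbs_g", "fat_g"}

def parse_nutrition(raw: str):
    """Parse a model response; return the dict if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if set(data) != EXPECTED_KEYS:
        return None
    # All four values should be non-negative numbers.
    if not all(isinstance(v, (int, float)) and v >= 0 for v in data.values()):
        return None
    return data

# A well-formed response parses; free-form text does not.
ok = parse_nutrition('{"calories": 350, "protein_g": 25, "carbs_g": 30, "fat_g": 15}')
bad = parse_nutrition("This looks like grilled chicken with rice.")
```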

🚀 Quick Start

With llama.cpp

# Run inference
./llama-llava-cli \
  -m MacroVLM-Base-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image food.jpg \
  -p "Analyze this food. Estimate nutrition as JSON: {calories, protein_g, carbs_g, fat_g}"

With Ollama

# Create Modelfile
cat << 'EOF' > Modelfile
FROM ./MacroVLM-Base-Q4_K_M.gguf
PARAMETER temperature 0.1
SYSTEM "You are a nutrition estimation assistant. Analyze food images and provide calorie and macro estimates in JSON format."
EOF

ollama create macrovlm -f Modelfile
ollama run macrovlm "Analyze this meal"
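Once the model is created, a locally running Ollama server also exposes an HTTP API that accepts images as base64 strings. A sketch using only the standard library, assuming the default server address localhost:11434 and the macrovlm tag created above:

```python
import base64
import json
from urllib import request

def build_payload(image_bytes: bytes, prompt: str = "Analyze this meal") -> dict:
    """Build a request body for Ollama's /api/generate endpoint."""
    return {
        "model": "macrovlm",
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask_ollama(image_path: str) -> str:
    """POST an image to the local Ollama server and return the response text."""
    with open(image_path, "rb") as f:
        body = json.dumps(build_payload(f.read())).encode()
    req = request.Request("http://localhost:11434/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```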

📚 What I Learned Building This

Journey from Moondream2 to Qwen3-VL

This project started with Moondream2 (1.9B params), but we pivoted to Qwen3-VL for several reasons:

❌ Moondream2 Challenges

  1. Custom architecture - No standard training interface
  2. LoRA complications - Could apply LoRA weights but couldn't run gradient training
  3. Limited fine-tuning docs - Model designed for inference, not adaptation
  4. Baseline performance - MAE of 115 calories on Nutrition5k (not great)

✅ Why Qwen3-VL

  1. Modern architecture - Proper HuggingFace transformers support
  2. Multiple sizes - 2B, 4B, 8B options for different use cases
  3. GGUF available - Ready for on-device deployment
  4. Active development - Well-documented fine-tuning pipelines

Dataset Insights

Nutrition5k Lite (our training data):

  • 2,732 real food images from Google cafeterias
  • Lab-measured nutrition (not estimates!)
  • Calorie range: 50-1,324 cal (mean: 298)
  • Much better than Food-101's category averages

Key Learnings

  1. VLMs guess round numbers - Base models predict 150, 250, 500 cal without fine-tuning
  2. Real labels matter - Category-average nutrition data doesn't cut it
  3. Size vs accuracy tradeoff - 4B is the sweet spot for on-device + quality
  4. Quantization works - Q4_K_M loses minimal accuracy vs FP16

📈 Baseline Performance

Evaluated on 100 Nutrition5k test samples (before fine-tuning):

| Metric | Value |
| --- | --- |
| Valid JSON Response | ~95% |
| MAE (calories) | ~150-200 cal |
| Within ±20% | ~20-25% |
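The MAE and ±20% metrics above can be computed as follows. The numbers below are illustrative, not the actual evaluation data:

```python
def mae(preds, targets):
    """Mean absolute error in calories."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def within_pct(preds, targets, pct=0.20):
    """Fraction of predictions within +/- pct of the lab-measured value."""
    hits = sum(abs(p - t) <= pct * t for p, t in zip(preds, targets))
    return hits / len(preds)

# Toy predictions vs. lab-measured calories (made up for illustration):
preds   = [250, 500, 150, 400]
targets = [298, 612, 120, 390]

print(mae(preds, targets))         # → 50.0
print(within_pct(preds, targets))  # → 0.75
```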

Note: This is the base model. Fine-tuned version coming soon!


🔧 Technical Specifications

Quantization Details

  • Method: Q4_K_M (k-quants, mixed precision)
  • Benefits: 4-bit weights, 6-bit critical layers
  • Quality: ~0.5% perplexity increase vs FP16
  • Speed: ~2x faster than FP16 on CPU
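The "mixed" part of Q4_K_M shows up in the effective bit width: dividing the file size by the nominal parameter count gives more than 4 bits per weight, because the critical layers stay at 6-bit. A quick back-of-the-envelope check using the sizes from the table above (the true parameter count differs slightly from a flat 4B):

```python
# Effective bits per weight, from the sizes listed on this card.
model_bytes = 2.33e9  # MacroVLM-Base-Q4_K_M.gguf
n_params = 4.0e9      # nominal 4B parameters

bits_per_weight = model_bytes * 8 / n_params
print(round(bits_per_weight, 2))  # → 4.66
```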

Hardware Requirements

| Device | VRAM/RAM | Speed |
| --- | --- | --- |
| Mac M1/M2 | 4GB+ | ~5 tok/s |
| Mac M3/M4 | 4GB+ | ~8 tok/s |
| CPU (AVX2) | 8GB+ | ~2 tok/s |
| iPhone 15 Pro | 8GB | TBD |

Files Included

├── MacroVLM-Base-Q4_K_M.gguf    # Language model (2.33 GB)
├── mmproj-F16.gguf              # Vision encoder (0.78 GB)
└── README.md                    # This file

πŸ—ΊοΈ Roadmap

  • Base model selection (Qwen3-VL-4B)
  • Quantization to Q4_K_M
  • Upload to HuggingFace
  • Fine-tune on Nutrition5k
  • Benchmark fine-tuned model
  • iOS/Android deployment guide
  • API endpoint example

📄 Citation

If you use this model, please cite:

@misc{macrovlm2026,
  title={MacroVLM: Lightweight Vision-Language Model for Food Nutrition Estimation},
  author={Haplo LLC},
  year={2026},
  url={https://huggingface.co/HaploLLC/MacroVLM-Base-Q4_K_M}
}

Acknowledgments

  • Qwen Team - For the excellent Qwen3-VL model
  • Google Research - For the Nutrition5k dataset
  • llama.cpp - For GGUF format and quantization tools

⚠️ Limitations

  • Base model (not fine-tuned for nutrition yet)
  • Primarily trained on Western cafeteria food
  • Single-dish estimation (no multi-item breakdown)
  • Portion size affects accuracy significantly

📬 Contact


Built with 🔥 for the on-device AI revolution
