# MacroVLM Base - Qwen3-VL-4B Q4_K_M

*Vision-Language Model for Food Nutrition Estimation*
This is the base model for the MacroVLM project - a lightweight VLM designed to estimate calories, protein, carbs, and fat from food images.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-4B-Instruct |
| Quantization | Q4_K_M (4-bit, k-quant mixed) |
| Model Size | 2.33 GB |
| Vision Encoder | mmproj-F16.gguf (0.78 GB) |
| Total Size | ~3.1 GB |
| Format | GGUF (llama.cpp compatible) |
| License | Apache 2.0 |
## Intended Use
This model is designed for:
- Food image analysis - Identify foods in images
- Nutrition estimation - Estimate calories and macros
- On-device deployment - Runs on mobile/edge devices
- Research - Baseline for nutrition VLM fine-tuning
### Output Format
```json
{
  "calories": 350,
  "protein_g": 25,
  "carbs_g": 30,
  "fat_g": 15
}
```
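In practice, models sometimes wrap this JSON in markdown fences or surrounding prose, so it helps to extract and validate it rather than calling `json.loads` directly. The sketch below is a minimal example; the helper names (`parse_nutrition`, `atwater_calories`) are our own, not part of any library, and the 4/4/9 Atwater factors give only a rough cross-check on the model's calorie figure.

```python
import json
import re

REQUIRED_KEYS = {"calories", "protein_g", "carbs_g", "fat_g"}

def parse_nutrition(raw: str) -> dict:
    """Extract the first JSON object from model output and validate its keys."""
    # Grab everything between the outermost braces; tolerates markdown
    # fences and extra prose around the JSON.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return {k: float(data[k]) for k in REQUIRED_KEYS}

def atwater_calories(n: dict) -> float:
    """Rough kcal estimate from macros (4 kcal/g protein and carbs, 9 kcal/g fat)."""
    return 4 * n["protein_g"] + 4 * n["carbs_g"] + 9 * n["fat_g"]
```

A large gap between `n["calories"]` and `atwater_calories(n)` is a useful signal that the model hallucinated one of the fields.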
## Quick Start
### With llama.cpp
```bash
# Run inference
./llama-llava-cli \
  -m MacroVLM-Base-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image food.jpg \
  -p "Analyze this food. Estimate nutrition as JSON: {calories, protein_g, carbs_g, fat_g}"
```
### With Ollama
```bash
# Create Modelfile
cat << 'EOF' > Modelfile
FROM ./MacroVLM-Base-Q4_K_M.gguf
PARAMETER temperature 0.1
SYSTEM "You are a nutrition estimation assistant. Analyze food images and provide calorie and macro estimates in JSON format."
EOF

ollama create macrovlm -f Modelfile
ollama run macrovlm "Analyze this meal"
```
## What I Learned Building This
### Journey from Moondream2 to Qwen3-VL

This project started with Moondream2 (1.9B params), but we pivoted to Qwen3-VL for several reasons:
#### Moondream2 Challenges
- Custom architecture - No standard training interface
- LoRA complications - Could apply LoRA weights but couldn't run gradient training
- Limited fine-tuning docs - Model designed for inference, not adaptation
- Baseline performance - MAE of 115 calories on Nutrition5k (not great)
#### Why Qwen3-VL
- Modern architecture - Proper HuggingFace transformers support
- Multiple sizes - 2B, 4B, 8B options for different use cases
- GGUF available - Ready for on-device deployment
- Active development - Well-documented fine-tuning pipelines
### Dataset Insights

**Nutrition5k Lite** (our training data):
- 2,732 real food images from Google cafeterias
- Lab-measured nutrition (not estimates!)
- Calorie range: 50-1,324 cal (mean: 298)
- Much better than Food-101's category averages
### Key Learnings
- VLMs guess round numbers - Base models predict 150, 250, 500 cal without fine-tuning
- Real labels matter - Category-average nutrition data doesn't cut it
- Size vs accuracy tradeoff - 4B is the sweet spot for on-device + quality
- Quantization works - Q4_K_M loses minimal accuracy vs FP16
## Baseline Performance
Evaluated on 100 Nutrition5k test samples (before fine-tuning):
| Metric | Value |
|---|---|
| Valid JSON Response | ~95% |
| MAE Calories | ~150-200 cal |
| Within ±20% | ~20-25% |
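The two error metrics above are straightforward to compute from paired predictions and ground-truth calories. This sketch uses made-up numbers purely for illustration, not values from the actual eval set:

```python
def mae(preds, targets):
    """Mean absolute error in calories."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def within_pct(preds, targets, pct=0.20):
    """Fraction of predictions within +/- pct of the true calorie count."""
    hits = sum(abs(p - t) <= pct * t for p, t in zip(preds, targets))
    return hits / len(preds)

# Toy example (illustrative numbers only):
preds   = [300, 500, 150, 420]
targets = [350, 480, 250, 400]
print(mae(preds, targets))         # 47.5
print(within_pct(preds, targets))  # 0.75
```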
**Note:** This is the base model; a fine-tuned version is coming soon.
## Technical Specifications
### Quantization Details
- Method: Q4_K_M (k-quants, mixed precision)
- Benefits: 4-bit weights, 6-bit critical layers
- Quality: ~0.5% perplexity increase vs FP16
- Speed: ~2x faster than FP16 on CPU
### Hardware Requirements
| Device | VRAM/RAM | Speed |
|---|---|---|
| Mac M1/M2 | 4GB+ | ~5 tok/s |
| Mac M3/M4 | 4GB+ | ~8 tok/s |
| CPU (AVX2) | 8GB+ | ~2 tok/s |
| iPhone 15 Pro | 8GB | TBD |
### Files Included

```
├── MacroVLM-Base-Q4_K_M.gguf   # Language model (2.33 GB)
├── mmproj-F16.gguf             # Vision encoder (0.78 GB)
└── README.md                   # This file
```
## Roadmap
- [x] Base model selection (Qwen3-VL-4B)
- [x] Quantization to Q4_K_M
- [x] Upload to HuggingFace
- [ ] Fine-tune on Nutrition5k
- [ ] Benchmark fine-tuned model
- [ ] iOS/Android deployment guide
- [ ] API endpoint example
## Citation
If you use this model, please cite:
```bibtex
@misc{macrovlm2026,
  title={MacroVLM: Lightweight Vision-Language Model for Food Nutrition Estimation},
  author={Haplo LLC},
  year={2026},
  url={https://huggingface.co/HaploLLC/MacroVLM-Base-Q4_K_M}
}
```
## Acknowledgments
- Qwen Team - For the excellent Qwen3-VL model
- Google Research - For the Nutrition5k dataset
- llama.cpp - For GGUF format and quantization tools
## Limitations
- Base model (not fine-tuned for nutrition yet)
- Primarily trained on Western cafeteria food
- Single-dish estimation (no multi-item breakdown)
- Portion size affects accuracy significantly
## Contact
- GitHub: HaploLLC
- Twitter: @HaploApps
- Website: haplo.app
*Built for the on-device AI revolution.*