🍎 MacroVLM Base - Qwen3-VL-4B Q4_K_M

Vision-Language Model for Food Nutrition Estimation

This is the base model for the MacroVLM project - a lightweight VLM designed to estimate calories, protein, carbs, and fat from food images.


📊 Model Details

| Property | Value |
| --- | --- |
| Base Model | Qwen/Qwen3-VL-4B-Instruct |
| Quantization | Q4_K_M (4-bit, k-quant mixed) |
| Model Size | 2.33 GB |
| Vision Encoder | mmproj-F16.gguf (0.78 GB) |
| Total Size | ~3.1 GB |
| Format | GGUF (llama.cpp compatible) |
| License | Apache 2.0 |

🎯 Intended Use

This model is designed for:

  • Food image analysis - Identify foods in images
  • Nutrition estimation - Estimate calories and macros
  • On-device deployment - Runs on mobile/edge devices
  • Research - Baseline for nutrition VLM fine-tuning

Output Format

{
  "calories": 350,
  "protein_g": 25,
  "carbs_g": 30,
  "fat_g": 15
}
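Because the base model sometimes answers in free-form text instead of JSON, it is worth validating responses before using them. A minimal sketch in plain Python (no external dependencies; the schema keys are taken from the example above):

```python
import json

EXPECTED_KEYS = {"calories", "protein_g", "carbs_g", "fat_g"}

def parse_nutrition(raw: str):
    """Parse a model response; return the dict if it matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if set(data) != EXPECTED_KEYS:
        return None
    # All four values should be non-negative numbers.
    if not all(isinstance(v, (int, float)) and v >= 0 for v in data.values()):
        return None
    return data

# A well-formed response parses; free-form text does not.
ok = parse_nutrition('{"calories": 350, "protein_g": 25, "carbs_g": 30, "fat_g": 15}')
bad = parse_nutrition("This looks like grilled chicken with rice.")
```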

🚀 Quick Start

With llama.cpp

# Run inference
./llama-llava-cli \
  -m MacroVLM-Base-Q4_K_M.gguf \
  --mmproj mmproj-F16.gguf \
  --image food.jpg \
  -p "Analyze this food. Estimate nutrition as JSON: {calories, protein_g, carbs_g, fat_g}"

With Ollama

# Create Modelfile
cat << 'EOF' > Modelfile
FROM ./MacroVLM-Base-Q4_K_M.gguf
PARAMETER temperature 0.1
SYSTEM "You are a nutrition estimation assistant. Analyze food images and provide calorie and macro estimates in JSON format."
EOF

ollama create macrovlm -f Modelfile
ollama run macrovlm "Analyze this meal"
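Once the model is created, a locally running Ollama server also exposes an HTTP API that accepts images as base64 strings. A sketch using only the standard library, assuming the default server address localhost:11434 and the macrovlm tag created above:

```python
import base64
import json
from urllib import request

def build_payload(image_bytes: bytes, prompt: str = "Analyze this meal") -> dict:
    """Build a request body for Ollama's /api/generate endpoint."""
    return {
        "model": "macrovlm",
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

def ask_ollama(image_path: str) -> str:
    """POST an image to the local Ollama server and return the response text."""
    with open(image_path, "rb") as f:
        body = json.dumps(build_payload(f.read())).encode()
    req = request.Request("http://localhost:11434/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```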

📚 What I Learned Building This

Journey from Moondream2 to Qwen3-VL

This project started with Moondream2 (1.9B params), but we pivoted to Qwen3-VL for several reasons:

❌ Moondream2 Challenges

  1. Custom architecture - No standard training interface
  2. LoRA complications - Could apply LoRA weights but couldn't run gradient training
  3. Limited fine-tuning docs - Model designed for inference, not adaptation
  4. Baseline performance - MAE of 115 calories on Nutrition5k (not great)

✅ Why Qwen3-VL

  1. Modern architecture - Proper HuggingFace transformers support
  2. Multiple sizes - 2B, 4B, 8B options for different use cases
  3. GGUF available - Ready for on-device deployment
  4. Active development - Well-documented fine-tuning pipelines

Dataset Insights

Nutrition5k Lite (our training data):

  • 2,732 real food images from Google cafeterias
  • Lab-measured nutrition (not estimates!)
  • Calorie range: 50-1,324 cal (mean: 298)
  • Much better than Food-101's category averages

Key Learnings

  1. VLMs guess round numbers - Base models predict 150, 250, 500 cal without fine-tuning
  2. Real labels matter - Category-average nutrition data doesn't cut it
  3. Size vs accuracy tradeoff - 4B is the sweet spot for on-device + quality
  4. Quantization works - Q4_K_M loses minimal accuracy vs FP16

📈 Baseline Performance

Evaluated on 100 Nutrition5k test samples (before fine-tuning):

| Metric | Value |
| --- | --- |
| Valid JSON Response | ~95% |
| MAE (calories) | ~150-200 cal |
| Within ±20% | ~20-25% |
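The MAE and ±20% metrics above can be computed as follows. The numbers below are illustrative, not the actual evaluation data:

```python
def mae(preds, targets):
    """Mean absolute error in calories."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def within_pct(preds, targets, pct=0.20):
    """Fraction of predictions within +/- pct of the lab-measured value."""
    hits = sum(abs(p - t) <= pct * t for p, t in zip(preds, targets))
    return hits / len(preds)

# Toy predictions vs. lab-measured calories (made up for illustration):
preds   = [250, 500, 150, 400]
targets = [298, 612, 120, 390]

print(mae(preds, targets))         # → 50.0
print(within_pct(preds, targets))  # → 0.75
```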

Note: This is the base model. Fine-tuned version coming soon!


🔧 Technical Specifications

Quantization Details

  • Method: Q4_K_M (k-quants, mixed precision)
  • Benefits: 4-bit weights, 6-bit critical layers
  • Quality: ~0.5% perplexity increase vs FP16
  • Speed: ~2x faster than FP16 on CPU
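The "mixed" part of Q4_K_M shows up in the effective bit width: dividing the file size by the nominal parameter count gives more than 4 bits per weight, because the critical layers stay at 6-bit. A quick back-of-the-envelope check using the sizes from the table above (the true parameter count differs slightly from a flat 4B):

```python
# Effective bits per weight, from the sizes listed on this card.
model_bytes = 2.33e9  # MacroVLM-Base-Q4_K_M.gguf
n_params = 4.0e9      # nominal 4B parameters

bits_per_weight = model_bytes * 8 / n_params
print(round(bits_per_weight, 2))  # → 4.66
```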

Hardware Requirements

| Device | VRAM/RAM | Speed |
| --- | --- | --- |
| Mac M1/M2 | 4GB+ | ~5 tok/s |
| Mac M3/M4 | 4GB+ | ~8 tok/s |
| CPU (AVX2) | 8GB+ | ~2 tok/s |
| iPhone 15 Pro | 8GB | TBD |

Files Included

├── MacroVLM-Base-Q4_K_M.gguf    # Language model (2.33 GB)
├── mmproj-F16.gguf              # Vision encoder (0.78 GB)
└── README.md                    # This file

πŸ—ΊοΈ Roadmap

  • Base model selection (Qwen3-VL-4B)
  • Quantization to Q4_K_M
  • Upload to HuggingFace
  • Fine-tune on Nutrition5k
  • Benchmark fine-tuned model
  • iOS/Android deployment guide
  • API endpoint example

📄 Citation

If you use this model, please cite:

@misc{macrovlm2026,
  title={MacroVLM: Lightweight Vision-Language Model for Food Nutrition Estimation},
  author={Haplo LLC},
  year={2026},
  url={https://huggingface.co/HaploLLC/MacroVLM-Base-Q4_K_M}
}

Acknowledgments

  • Qwen Team - For the excellent Qwen3-VL model
  • Google Research - For the Nutrition5k dataset
  • llama.cpp - For GGUF format and quantization tools

⚠️ Limitations

  • Base model (not fine-tuned for nutrition yet)
  • Primarily trained on Western cafeteria food
  • Single-dish estimation (no multi-item breakdown)
  • Portion size affects accuracy significantly

📬 Contact


Built with 🔥 for the on-device AI revolution
