# Model Card for smolvlm2-256m-FoodExtract-Vision-v1
This model is a fine-tuned version of HuggingFaceTB/SmolVLM2-256M-Video-Instruct specialized in structured food extraction. It analyzes an image to determine whether it contains food, generates a short title, and extracts lists of the visible food and drink items in a fixed JSON format.

**Training strategy:** The model was trained with PEFT (LoRA) and a frozen vision encoder. Only the language-model LoRA adapters were updated, aligning the output with the required JSON structure while preserving the pre-trained visual features.
## Quick start

The model relies on the specific system prompt and user prompt structure below to produce the expected JSON output.
````python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel
from PIL import Image
import requests

# 1) Load base model + processor, then apply the LoRA adapter
base_model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
adapter_model_id = "berkeruveyik/smolvlm2-256m-FoodExtract-Vision-v1"

print("Loading processor and base model...")
processor = AutoProcessor.from_pretrained(base_model_id)
model = AutoModelForImageTextToText.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(model, adapter_model_id)
model.eval()
print("Model ready!")

# 2) Prompts (kept verbatim from training; the model expects this structure)
SYSTEM_MESSAGE = """You are an expert food and drink image extractor.
You provide structured data to visual inputs classifying them as edible food/drink or not.
as well as titling the image with a simple simple food/drink related caption.
Finally you extract any and all visible food/drink items to lists."""

USER_PROMPT = """Classify the given input image into food or not, and if edible food or drink items are present, extract them into lists. If no food/drink items are visible, return an empty list.
Only return valid JSON in the following form:
```json
{
  "is_food": 0,
  "image_title": "",
  "food_items": [],
  "drink_items": []
}
```"""

# 3) Load image
image_url = "https://img.freepik.com/free-psd/roasted-chicken-dinner-platter-delicious-feast_632498-25445.jpg"
print(f"\nLoading image from: {image_url}")
resp = requests.get(image_url, stream=True, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
image = Image.open(resp.raw).convert("RGB")

# 4) Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": SYSTEM_MESSAGE + "\n\n" + USER_PROMPT},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
inputs = processor(
    text=text,
    images=image,
    return_tensors="pt",
)

# Move tensors to the model device; cast float32 tensors to the model dtype
inputs = {k: v.to(model.device) for k, v in inputs.items()}
inputs = {
    k: (v.to(dtype=model.dtype) if torch.is_floating_point(v) and v.dtype == torch.float32 else v)
    for k, v in inputs.items()
}

# 5) Generate
print("\nGenerating output...")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# 6) Decode only the newly generated tokens
prompt_len = inputs["input_ids"].shape[1]
output_text = processor.batch_decode(
    generated_ids[:, prompt_len:],
    skip_special_tokens=True,
)[0]

print("\n" + "=" * 60)
print("OUTPUT:")
print("=" * 60)
print(output_text)
print("=" * 60)
````
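Even though the model is trained to emit JSON, the decoded text can carry a markdown fence or stray whitespace. A small defensive parser (a sketch; `parse_extraction` is a hypothetical helper, not part of the original script) turns the reply into a Python dict:

```python
import json
import re


def parse_extraction(output_text: str) -> dict:
    """Best-effort parse of the model's JSON reply.

    Strips an optional ```json ... ``` fence, then loads the text from the
    first "{" to the last "}". Raises ValueError if no object is found.
    """
    cleaned = re.sub(r"```(?:json)?", "", output_text).strip()
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    return json.loads(cleaned[start : end + 1])


# Example with a well-formed reply
reply = '```json\n{"is_food": 1, "image_title": "Roast chicken", "food_items": ["chicken"], "drink_items": []}\n```'
result = parse_extraction(reply)
print(result["is_food"], result["food_items"])  # → 1 ['chicken']
```

Keys beyond the four in the schema are not expected, so downstream code can index `is_food`, `image_title`, `food_items`, and `drink_items` directly after parsing.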
## Training procedure

This model was trained using Low-Rank Adaptation (LoRA). The vision encoder was frozen during training, so only the language-model parameters (via adapters) were updated. This lets the model learn the new JSON output format efficiently without forgetting its general visual understanding.
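One simple way to implement such a freeze (a sketch, not the original training code; SmolVLM2's exact attribute names may differ, so parameters are matched by name substring here) is:

```python
import torch.nn as nn


def freeze_vision_encoder(model: nn.Module, keyword: str = "vision") -> int:
    """Disable gradients for every parameter whose name contains `keyword`.

    Returns the number of parameters frozen. Name-based matching is a
    model-agnostic way to keep the vision tower untouched while LoRA
    adapters on the language model continue to train.
    """
    frozen = 0
    for name, param in model.named_parameters():
        if keyword in name:
            param.requires_grad = False
            frozen += 1
    return frozen


# Tiny stand-in model to illustrate the effect
demo = nn.ModuleDict({
    "vision_encoder": nn.Linear(4, 4),
    "language_model": nn.Linear(4, 4),
})
freeze_vision_encoder(demo)
print(sorted(n for n, p in demo.named_parameters() if p.requires_grad))
# → ['language_model.bias', 'language_model.weight']
```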
### Training hyperparameters
The following hyperparameters were used during training:
- Training Regime: PEFT (LoRA) + Frozen Vision Encoder
- Num train epochs: 3
- Learning rate: 2e-4
- Batch size per device: 8
- Gradient accumulation steps: 4
- Optimizer: adamw_torch
- LoRA Rank (r): 64
- LoRA Alpha: 128
- LoRA Dropout: 0.05
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- BF16: True
### Framework versions
- TRL: 0.27.2
- Transformers: 5.0.0
- Pytorch: 2.9.0+cu126
- Datasets: 4.0.0
- Tokenizers: 0.22.2
- PEFT: 0.10.0
## Citations

Cite TRL and PEFT as:
```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```

```bibtex
@misc{mangrulkar2022peft,
    title        = {PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods},
    author       = {Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan},
    year         = {2022},
    publisher    = {Hugging Face},
    howpublished = {\url{https://github.com/huggingface/peft}}
}
```