Model Card for Fine-Tuned CLIP Food Retrieval Model

Model Details

Model Description

This model is a fine-tuned version of OpenAI CLIP trained for food image-text retrieval tasks. The model learns semantic alignment between food images and natural language captions, enabling multimodal retrieval and similarity matching.

The model was fine-tuned using a custom food dataset containing paired food images and captions. The primary objective of this project is to improve food image understanding and text-image embedding alignment for retrieval-based applications.

This project focuses on multimodal representation learning, using contrastive learning with the CLIP architecture.

Developed by: Atharva Gaykar
Funded by: Self-funded academic/research project
Shared by: Atharva Gaykar
Model type: Vision-Language Model (CLIP Fine-Tuned Retrieval Model)
Language(s) (NLP): English
License: CC-BY-2.0
Finetuned from model: openai/clip-vit-base-patch32

Model Sources

Repository: [Add Your GitHub Repository Link]
Paper: https://arxiv.org/abs/2103.00020
Demo: [Add Demo Link if Available]

Uses

Direct Use

This model can be directly used for:

Food image-to-text retrieval
Text-to-food image retrieval
Semantic food search
Food recommendation systems
Image similarity matching
Multimodal embedding generation

Example applications:

AI-powered food search engines
Smart restaurant menu systems
Food recommendation assistants
Vision-language research projects

Downstream Use

The model can be integrated into larger systems such as:

FastAPI-based retrieval APIs
FAISS vector search systems
Food recognition applications
Mobile food search apps
Multimodal recommendation engines

The embeddings generated by the model can also be used for clustering, similarity search, and recommendation pipelines.
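As a rough illustration of the similarity-search use case, the sketch below ranks a small gallery of embeddings against a query by cosine similarity. It is a brute-force NumPy stand-in for what a vector index such as FAISS would do at scale. The function names and the toy 4-dimensional vectors are assumptions made for this example; in practice the vectors would come from the model's `get_image_features` and `get_text_features` methods.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_matches(query: np.ndarray, gallery: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k gallery rows most similar to the query."""
    query = l2_normalize(query.reshape(1, -1))
    gallery = l2_normalize(gallery)
    scores = (gallery @ query.T).ravel()  # cosine similarity per gallery item
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]       # highest similarity first
    return order, scores[order]


# Toy vectors standing in for real CLIP embeddings (hypothetical values).
gallery = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([0.2, 1.0, 0.0, 0.0])

indices, scores = top_k_matches(query, gallery, k=2)
print(indices, scores)
```

Normalizing both sides first means a plain matrix product gives cosine similarities, which is also the form an inner-product vector index expects.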

Out-of-Scope Use

This model is not intended for:

Medical diagnosis
Nutritional estimation
Sensitive biometric analysis
Surveillance applications
Safety-critical decision systems

The model may also perform poorly on:

Non-food images
Extremely low-quality images
Highly ambiguous or abstract captions
Food categories absent from the training dataset

Bias, Risks, and Limitations

The model inherits limitations from the base CLIP model and the training dataset.

Potential limitations include:

Bias toward frequently occurring food categories
Reduced performance on underrepresented cuisines
Sensitivity to caption quality
Limited generalization to unseen food domains
Dependence on image quality and lighting conditions

The retrieval quality is strongly influenced by the diversity and quality of the caption dataset.

Recommendations

Users should:

Validate retrieval results before production use
Avoid using the model for safety-critical systems
Use balanced datasets during further fine-tuning
Apply retrieval thresholding for better reliability
Consider adding multilingual captions for broader use
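The retrieval-thresholding recommendation above can be sketched as follows: results whose similarity score falls below a cutoff are dropped, so the system can return "no match" instead of a weak one. The function name, the file names, and the 0.25 cutoff are illustrative assumptions; a suitable threshold would need to be tuned on validation data for this model.

```python
from typing import List, Tuple


def filter_by_threshold(
    results: List[Tuple[str, float]], min_score: float
) -> List[Tuple[str, float]]:
    """Keep (item, score) results with score >= min_score, best first."""
    kept = [(item, score) for item, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical cosine similarities returned for one text query.
candidates = [("salad.jpg", 0.18), ("pizza.jpg", 0.31), ("car.jpg", 0.07)]

confident = filter_by_threshold(candidates, min_score=0.25)
if confident:
    print("Matches:", confident)
else:
    print("No confident match; consider returning an empty result.")
```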

How to Get Started with the Model

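from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("Gaykar/clip-food-model")
processor = CLIPProcessor.from_pretrained("Gaykar/clip-food-model")
model.eval()  # switch to inference mode

# Convert to RGB so grayscale or RGBA files do not break preprocessing
image = Image.open("food.jpg").convert("RGB")

# Score the image against several candidate captions; with a single
# caption the softmax below would always return 1.0
captions = ["a delicious pizza", "a bowl of salad", "a plate of sushi"]

inputs = processor(
    text=captions,
    images=image,
    return_tensors="pt",
    padding=True
)

# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_captions)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

for caption, prob in zip(captions, probs[0].tolist()):
    print(f"{caption}: {prob:.3f}")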

Training Details

Training Data

The model was trained on a custom food image-caption dataset containing food images paired with textual descriptions.

Dataset preparation included:

Caption generation
Dataset balancing
CSV-based annotation management
Data preprocessing and cleaning
Image-caption pairing

The dataset contains multiple food categories with semantic caption supervision.
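The card does not describe how balancing was carried out. As one possible approach, the sketch below downsamples every category in a CSV annotation table to the size of the smallest category. The column names (`image_path`, `caption`, `category`), the sample rows, and the downsampling strategy are all assumptions made for illustration.

```python
import csv
import io
import random
from collections import defaultdict


def balance_by_category(rows, label_key="category", seed=0):
    """Downsample every category to the size of the smallest one."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    if not groups:
        return []
    target = min(len(items) for items in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    return balanced


# In-memory CSV standing in for an annotation file (hypothetical contents).
annotations = io.StringIO(
    "image_path,caption,category\n"
    "img/001.jpg,a slice of pepperoni pizza,pizza\n"
    "img/002.jpg,a whole margherita pizza,pizza\n"
    "img/003.jpg,a cheese pizza on a plate,pizza\n"
    "img/004.jpg,a bowl of green salad,salad\n"
)
rows = list(csv.DictReader(annotations))
balanced_rows = balance_by_category(rows)
print([row["category"] for row in balanced_rows])
```

Downsampling discards data; oversampling rare categories or weighting the sampler are alternatives when the dataset is small.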

Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with a PyTorch backend.

Preprocessing

Preprocessing steps included:

Image resizing
CLIP image normalization
Caption tokenization
Batch collation using CLIPProcessor
Dataset balancing and filtering
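CLIPProcessor applies the resizing and normalization steps automatically. To make the normalization step concrete, the sketch below rescales an already-resized RGB array to [0, 1] and standardizes each channel with the mean and standard deviation used by the original CLIP preprocessing. The function name and the dummy input are assumptions for this example; resizing and cropping are not shown.

```python
import numpy as np

# Per-channel RGB statistics used by the original CLIP preprocessing.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def normalize_for_clip(image_uint8: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB array to a normalized 3xHxW float32 array."""
    if image_uint8.ndim != 3 or image_uint8.shape[2] != 3:
        raise ValueError("expected an HxWx3 RGB array")
    scaled = image_uint8.astype(np.float32) / 255.0  # rescale to [0, 1]
    normalized = (scaled - CLIP_MEAN) / CLIP_STD     # broadcast over channels
    return normalized.transpose(2, 0, 1)             # channels-first layout


# A black 224x224 image (the ViT-B/32 input size) stands in for a resized photo.
dummy = np.zeros((224, 224, 3), dtype=np.uint8)
pixel_values = normalize_for_clip(dummy)
print(pixel_values.shape)
```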

Training Hyperparameters

Base Model: openai/clip-vit-base-patch32
Batch Size: [Add Batch Size]
Learning Rate: [Add Learning Rate]
Epochs: [Add Epoch Count]
Optimizer: AdamW
Training regime: fp32 / mixed precision GPU training
Framework: PyTorch + Hugging Face Transformers

Speeds, Sizes, Times

Training Device: GPU
Checkpoint Format: Hugging Face Transformers
Model Size: Based on CLIP ViT-B/32 architecture

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was performed on a held-out test split from the food image-caption dataset.

The test data consisted of unseen food image-caption pairs.

Factors

Evaluation considered:

Image-text semantic similarity
Retrieval alignment quality
Caption relevance
Cross-modal embedding consistency

Metrics

The model was evaluated using:

Retrieval similarity scores
Top-k retrieval matching
Embedding similarity analysis
Accuracy-based retrieval evaluation

Future evaluations may add standard ranking metrics such as:

Recall@K
Precision@K
Mean Reciprocal Rank (MRR)
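For reference, the three ranking metrics listed above can be computed from ranked result lists as in the sketch below. This is a minimal implementation of the standard definitions, not the evaluation code used for this model; the function names and the caption IDs in the example are hypothetical.

```python
from typing import Hashable, List, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    all_ranked: List[Sequence[Hashable]], all_relevant: List[Set[Hashable]]
) -> float:
    """Average reciprocal rank over all queries."""
    if not all_ranked:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(all_ranked, all_relevant))
    return total / len(all_ranked)


# Two image queries, each with one correct caption ID (hypothetical data).
ranked_lists = [["c2", "c1", "c3"], ["c5", "c4", "c6"]]
relevant_sets = [{"c1"}, {"c5"}]

print(recall_at_k(ranked_lists[0], relevant_sets[0], k=1))
print(precision_at_k(ranked_lists[0], relevant_sets[0], k=2))
print(mean_reciprocal_rank(ranked_lists, relevant_sets))
```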

Results

On the custom dataset, the fine-tuned model showed better semantic alignment between food images and captions than zero-shot retrieval with the base openai/clip-vit-base-patch32 model.

The model successfully learned domain-specific food representations and improved retrieval consistency.

Summary

The project demonstrates effective fine-tuning of CLIP for multimodal food retrieval tasks using custom image-caption supervision.

In these evaluations, the model performed well on semantic food search and other retrieval-oriented tasks.

Model Examination

The model maps images and text into a shared embedding space learned with a contrastive objective.

Embedding similarity analysis shows that semantically related food images and captions cluster closer in vector space after fine-tuning.
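One simple way to quantify this kind of clustering is to compare the mean cosine similarity of matched image-caption pairs with that of mismatched pairs. The sketch below does this with NumPy. The function name and the toy 2-dimensional embeddings are assumptions; the analysis actually performed for this model is not detailed in the card.

```python
import numpy as np


def pair_similarity_gap(image_emb: np.ndarray, text_emb: np.ndarray):
    """Return mean cosine similarity of matched pairs and of mismatched pairs.

    Row i of image_emb is assumed to be paired with row i of text_emb,
    and at least two pairs are required.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T  # sims[i, j] = cos(image i, caption j)
    n = sims.shape[0]
    matched = float(np.trace(sims) / n)
    mismatched = float((sims.sum() - np.trace(sims)) / (n * (n - 1)))
    return matched, mismatched


# Toy embeddings (hypothetical values) for two image-caption pairs.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
text_emb = np.array([[0.9, 0.1], [0.2, 0.8]])

matched, mismatched = pair_similarity_gap(image_emb, text_emb)
print(f"matched={matched:.3f} mismatched={mismatched:.3f}")
```

Running the same comparison with embeddings from the base model and from the fine-tuned model would show whether the gap between the two means widened after fine-tuning.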

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

Hardware Type: NVIDIA T4 GPU
Cloud Provider: Google Colab / Local GPU
Compute Region: India
Carbon Emitted: Unknown

Technical Specifications

Model Architecture and Objective

This model is based on the CLIP (Contrastive Language-Image Pre-training) architecture.

Architecture components include:

Vision Transformer (ViT)
Transformer-based text encoder
Contrastive multimodal embedding learning

Training objective:

Align image embeddings with matching text embeddings
Maximize similarity for correct image-text pairs
Minimize similarity for incorrect pairs
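The objective above corresponds to the symmetric contrastive loss described in the CLIP paper: a cross-entropy over the image-to-text similarity matrix, averaged with the same loss in the text-to-image direction. The NumPy sketch below assumes that form. The function names are hypothetical, and the fixed temperature of 0.07 is an assumption for illustration (CLIP learns this value during training); the exact loss configuration used for this fine-tune is not stated in the card.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(
    image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07
) -> float:
    """Symmetric cross-entropy over a batch where row i of each input is a pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(logits.shape[0])              # correct pairs lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)


# Perfectly aligned pairs give a loss near zero; shuffled captions give a high loss.
aligned = np.eye(3)
shuffled = aligned[[1, 2, 0]]
loss_aligned = clip_contrastive_loss(aligned, aligned)
loss_shuffled = clip_contrastive_loss(aligned, shuffled)
print(loss_aligned, loss_shuffled)
```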

Compute Infrastructure

Hardware

NVIDIA GPU
Google Colab / Local training environment

Software

Python
PyTorch
Hugging Face Transformers
Pandas
NumPy
PIL
tqdm

Citation

BibTeX

@article{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  journal={arXiv preprint arXiv:2103.00020},
  year={2021}
}

APA

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.

Glossary

CLIP: Contrastive Language-Image Pre-training
Embedding: Dense vector representation of data
Multimodal Learning: Learning across multiple data modalities such as images and text
Contrastive Learning: Learning representations by comparing positive and negative pairs

More Information

This project was developed as a multimodal AI research and engineering project focused on semantic food retrieval systems using CLIP fine-tuning techniques.

Future improvements may include:

FAISS-based retrieval
LoRA fine-tuning
Hard negative mining
Better retrieval metrics
Web deployment

Model Card Authors

Atharva Gaykar

Model Card Contact

Atharva Gaykar (gaykaratharva7@gmail.com)
