Model Card for Fine-Tuned CLIP Food Retrieval Model

Model Details

Model Description

This model is a fine-tuned version of OpenAI CLIP trained for food image-text retrieval tasks. The model learns semantic alignment between food images and natural language captions, enabling multimodal retrieval and similarity matching.

The model was fine-tuned using a custom food dataset containing paired food images and captions. The primary objective of this project is to improve food image understanding and text-image embedding alignment for retrieval-based applications.

This project focuses on multimodal representation learning, using contrastive learning with the CLIP architecture.

  • Developed by: Atharva Gaykar
  • Funded by: Self-funded academic/research project
  • Shared by: Atharva Gaykar
  • Model type: Vision-Language Model (CLIP Fine-Tuned Retrieval Model)
  • Language(s) (NLP): English
  • License: CC-BY-2.0
  • Finetuned from model: openai/clip-vit-base-patch32

Uses

Direct Use

This model can be directly used for:

  • Food image-to-text retrieval
  • Text-to-food image retrieval
  • Semantic food search
  • Food recommendation systems
  • Image similarity matching
  • Multimodal embedding generation

Example applications:

  • AI-powered food search engines
  • Smart restaurant menu systems
  • Food recommendation assistants
  • Vision-language research projects

Downstream Use

The model can be integrated into larger systems such as:

  • FastAPI-based retrieval APIs
  • FAISS vector search systems
  • Food recognition applications
  • Mobile food search apps
  • Multimodal recommendation engines

The embeddings generated by the model can also be used for clustering, similarity search, and recommendation pipelines.
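As a rough illustration of similarity search over such embeddings, the sketch below ranks stored items by cosine similarity to a query vector. It is dependency-free so the arithmetic stays visible; the helper names `l2_normalize` and `top_k`, the file names, and the toy 3-dimensional vectors are all hypothetical stand-ins (CLIP ViT-B/32 produces 512-dimensional embeddings). A production system would more likely use NumPy or a vector index such as FAISS.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, index, k=3):
    """Return (item_id, cosine_similarity) pairs for the k most similar items."""
    q = l2_normalize(query)
    scored = []
    for item_id, emb in index.items():
        e = l2_normalize(emb)
        scored.append((item_id, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real image embeddings (hypothetical data).
toy_index = {
    "pizza.jpg": [0.9, 0.1, 0.0],
    "sushi.jpg": [0.0, 0.8, 0.6],
    "burger.jpg": [0.7, 0.3, 0.1],
}

# Toy vector standing in for a text-query embedding.
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)
```

Normalizing both sides first means the ranking depends only on direction in embedding space, which matches how CLIP similarity is normally computed.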


Out-of-Scope Use

This model is not intended for:

  • Medical diagnosis
  • Nutritional estimation
  • Sensitive biometric analysis
  • Surveillance applications
  • Safety-critical decision systems

The model may also perform poorly on:

  • Non-food images
  • Extremely low-quality images
  • Highly ambiguous or abstract captions
  • Food categories absent from the training dataset

Bias, Risks, and Limitations

The model inherits limitations from the base CLIP model and the training dataset.

Potential limitations include:

  • Bias toward frequently occurring food categories
  • Reduced performance on underrepresented cuisines
  • Sensitivity to caption quality
  • Limited generalization to unseen food domains
  • Dependence on image quality and lighting conditions

Retrieval quality depends strongly on the diversity and quality of the training captions.


Recommendations

Users should:

  • Validate retrieval results before production use
  • Avoid using the model for safety-critical systems
  • Use balanced datasets during further fine-tuning
  • Apply retrieval thresholding for better reliability
  • Consider adding multilingual captions for broader use
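
Retrieval thresholding, as recommended above, can be sketched as a simple filter over scored results. The helper `filter_by_threshold`, the example scores, and the cutoff of 0.25 are hypothetical; a usable cutoff would have to be tuned on held-out data for this model.

```python
def filter_by_threshold(scored_results, min_similarity=0.25):
    """Keep only results whose similarity meets the threshold.

    An empty list signals "no confident match", which a caller can surface
    instead of returning an unrelated item.
    """
    return [
        (item, score)
        for item, score in scored_results
        if score >= min_similarity
    ]


# Hypothetical (item, cosine similarity) pairs from a retrieval step.
scored = [("pizza.jpg", 0.31), ("salad.jpg", 0.27), ("car.jpg", 0.12)]

confident = filter_by_threshold(scored, min_similarity=0.25)
print(confident)  # [('pizza.jpg', 0.31), ('salad.jpg', 0.27)]
```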

How to Get Started with the Model

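import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the fine-tuned weights and the matching preprocessor.
model = CLIPModel.from_pretrained("Gaykar/clip-food-model")
processor = CLIPProcessor.from_pretrained("Gaykar/clip-food-model")
model.eval()

# Convert to RGB so grayscale or RGBA files are handled consistently.
image = Image.open("food.jpg").convert("RGB")

# Softmax over a single caption is always 1.0, so score several candidates.
captions = ["a delicious pizza", "a bowl of ramen", "a plate of sushi"]

inputs = processor(
    text=captions,
    images=image,
    return_tensors="pt",
    padding=True
)

# Disable gradient tracking for inference.
with torch.no_grad():
    outputs = model(**inputs)

# Shape (1, len(captions)): image-to-caption similarity logits.
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

print(dict(zip(captions, probs[0].tolist())))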

Training Details

Training Data

The model was trained on a custom food image-caption dataset containing food images paired with textual descriptions.

Dataset preparation included:

  • Caption generation
  • Dataset balancing
  • CSV-based annotation management
  • Data preprocessing and cleaning
  • Image-caption pairing
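
The card does not say how balancing was performed. One common approach, assumed here purely for illustration, is to downsample every category to the size of the smallest one. The column names `image_path`, `caption`, and `label`, the helper `balance_by_downsampling`, and the inline CSV rows are hypothetical.

```python
import csv
import io
import random
from collections import defaultdict


def balance_by_downsampling(rows, label_key="label", seed=0):
    """Downsample every category to the size of the smallest category."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Inline stand-in for an annotation CSV (hypothetical paths and captions).
toy_csv = io.StringIO(
    "image_path,caption,label\n"
    "img/p1.jpg,a slice of pepperoni pizza,pizza\n"
    "img/p2.jpg,a margherita pizza on a board,pizza\n"
    "img/p3.jpg,a pizza with mushrooms,pizza\n"
    "img/s1.jpg,salmon nigiri on a plate,sushi\n"
    "img/s2.jpg,a tray of maki rolls,sushi\n"
)
rows = list(csv.DictReader(toy_csv))

balanced = balance_by_downsampling(rows)
counts = {}
for row in balanced:
    counts[row["label"]] = counts.get(row["label"], 0) + 1
print(counts)
```

Downsampling discards data from the larger categories; oversampling or class-weighted sampling are alternatives when the dataset is small.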

The dataset contains multiple food categories with semantic caption supervision.


Training Procedure

The model was fine-tuned using the Hugging Face Transformers library with a PyTorch backend.

Preprocessing

Preprocessing steps included:

  • Image resizing
  • CLIP image normalization
  • Caption tokenization
  • Batch collation using CLIPProcessor
  • Dataset balancing and filtering
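
For reference, the CLIP image normalization step amounts to rescaling 8-bit pixel values to [0, 1] and then standardizing each channel with the mean and standard deviation used by the original OpenAI CLIP preprocessing. `CLIPProcessor` applies this automatically; the sketch below only makes the arithmetic explicit for a single pixel, and the helper name `normalize_pixel` is hypothetical.

```python
# Per-channel statistics used by the OpenAI CLIP image preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map an 8-bit (R, G, B) pixel to CLIP's normalized float values."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print(white, black)
```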

Training Hyperparameters

  • Base Model: openai/clip-vit-base-patch32
  • Batch Size: Not reported
  • Learning Rate: Not reported
  • Epochs: Not reported
  • Optimizer: AdamW
  • Training regime: fp32 / mixed precision GPU training
  • Framework: PyTorch + Hugging Face Transformers

Speeds, Sizes, Times

  • Training Device: GPU
  • Checkpoint Format: Hugging Face Transformers (Safetensors)
  • Model Size: approximately 0.2B parameters (CLIP ViT-B/32 architecture), stored as F32 tensors

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was performed on a held-out test split from the food image-caption dataset.

The test data consisted of unseen food image-caption pairs.


Factors

Evaluation considered:

  • Image-text semantic similarity
  • Retrieval alignment quality
  • Caption relevance
  • Cross-modal embedding consistency

Metrics

The model was evaluated using:

  • Retrieval similarity scores
  • Top-k retrieval matching
  • Embedding similarity analysis
  • Accuracy-based retrieval evaluation

Future evaluations may add standard ranking metrics:

  • Recall@K
  • Precision@K
  • Mean Reciprocal Rank (MRR)
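
A minimal sketch of how Recall@K and MRR could be computed from a query-by-candidate similarity matrix is shown below. It assumes a paired test set in which query i's single correct candidate is candidate i; the helper names and the toy scores are hypothetical and do not reflect results from this model.

```python
def rank_of_correct(similarity_row, correct_index):
    """1-based rank of the correct candidate under descending similarity.

    Ties are resolved optimistically: only strictly higher scores push
    the correct candidate down.
    """
    correct_score = similarity_row[correct_index]
    return 1 + sum(1 for s in similarity_row if s > correct_score)


def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate appears in the top k."""
    hits = sum(
        1 for i, row in enumerate(similarity) if rank_of_correct(row, i) <= k
    )
    return hits / len(similarity)


def mean_reciprocal_rank(similarity):
    """Average of 1 / rank of the correct candidate over all queries."""
    total = sum(
        1.0 / rank_of_correct(row, i) for i, row in enumerate(similarity)
    )
    return total / len(similarity)


# Rows are queries, columns are candidates (toy scores, not real results).
toy_similarity = [
    [0.9, 0.2, 0.1],  # query 0: correct candidate ranks 1st
    [0.6, 0.5, 0.1],  # query 1: correct candidate ranks 2nd
    [0.3, 0.4, 0.2],  # query 2: correct candidate ranks 3rd
]

print(recall_at_k(toy_similarity, 1))
print(recall_at_k(toy_similarity, 2))
print(mean_reciprocal_rank(toy_similarity))
```

With exactly one relevant candidate per query, Precision@K is simply Recall@K divided by K, so it needs no separate implementation in this setting.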

Results

The fine-tuned model demonstrated improved semantic alignment between food images and captions compared to baseline zero-shot CLIP retrieval on the custom dataset.

The model successfully learned domain-specific food representations and improved retrieval consistency.


Summary

The project demonstrates effective fine-tuning of CLIP for multimodal food retrieval tasks using custom image-caption supervision.

The model performs well for semantic food search and retrieval-oriented applications.


Model Examination

The model generates joint image-text embeddings using contrastive learning principles.

Embedding similarity analysis shows that semantically related food images and captions cluster closer in vector space after fine-tuning.


Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

  • Hardware Type: NVIDIA T4 GPU
  • Cloud Provider: Google Colab / Local GPU
  • Compute Region: India
  • Carbon Emitted: Unknown


Technical Specifications

Model Architecture and Objective

This model is based on the CLIP (Contrastive Language-Image Pretraining) architecture.

Architecture components include:

  • Vision Transformer (ViT)
  • Transformer-based text encoder
  • Contrastive multimodal embedding learning

Training objective:

  • Align image embeddings with matching text embeddings
  • Maximize similarity for correct image-text pairs
  • Minimize similarity for incorrect pairs
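
The objective above corresponds to the symmetric contrastive loss described in the CLIP paper: a cross-entropy over image-to-text similarities averaged with a cross-entropy over text-to-image similarities. The sketch below writes it out in plain Python for a tiny batch. It illustrates the standard formulation and is not taken from this project's training code; the function names, the toy embeddings, and the `logit_scale` default are illustrative assumptions (in CLIP the scale is a learned temperature).

```python
import math


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's target class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric contrastive loss over L2-normalized, paired embeddings."""
    # logits[i][j] = scaled similarity between image i and text j.
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
        for img in image_embs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_t)
    return 0.5 * (image_to_text + text_to_image)


images = [[1.0, 0.0], [0.0, 1.0]]

# Each image matches its own caption: loss is near zero.
aligned = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])

# Captions swapped: every image is closest to the wrong caption.
shuffled = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])

print(aligned, shuffled)
```

The loss falls as matching pairs move together and mismatched pairs move apart, which is the behavior the three bullets describe.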

Compute Infrastructure

Hardware

  • NVIDIA GPU
  • Google Colab / Local training environment

Software

  • Python
  • PyTorch
  • Hugging Face Transformers
  • Pandas
  • NumPy
  • PIL
  • tqdm

Citation

BibTeX

@article{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  journal={arXiv preprint arXiv:2103.00020},
  year={2021}
}

APA

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.


Glossary

  • CLIP: Contrastive Language-Image Pretraining
  • Embedding: Dense vector representation of data
  • Multimodal Learning: Learning across multiple data modalities such as images and text
  • Contrastive Learning: Learning representations by comparing positive and negative pairs

More Information

This project was developed as a multimodal AI research and engineering project focused on semantic food retrieval systems using CLIP fine-tuning techniques.

Future improvements may include:

  • FAISS-based retrieval
  • LoRA fine-tuning
  • Hard negative mining
  • Better retrieval metrics
  • Web deployment

Model Card Authors

Atharva Gaykar


Model Card Contact

Atharva Gaykar gaykaratharva7@gmail.com
