Model Card for Fine-Tuned CLIP Food Retrieval Model
Model Details
Model Description
This model is a fine-tuned version of OpenAI CLIP trained for food image-text retrieval tasks. The model learns semantic alignment between food images and natural language captions, enabling multimodal retrieval and similarity matching.
The model was fine-tuned using a custom food dataset containing paired food images and captions. The primary objective of this project is to improve food image understanding and text-image embedding alignment for retrieval-based applications.
This project focuses on multimodal representation learning through contrastive learning with the CLIP architecture.
- Developed by: Atharva Gaykar
- Funded by: Self-funded academic/research project
- Shared by: Atharva Gaykar
- Model type: Vision-Language Model (CLIP Fine-Tuned Retrieval Model)
- Language(s) (NLP): English
- License: CC-BY-2.0
- Finetuned from model: openai/clip-vit-base-patch32
Model Sources
- Repository: [Add Your GitHub Repository Link]
- Paper: https://arxiv.org/abs/2103.00020
- Demo: [Add Demo Link if Available]
Uses
Direct Use
This model can be directly used for:
- Food image-to-text retrieval
- Text-to-food image retrieval
- Semantic food search
- Food recommendation systems
- Image similarity matching
- Multimodal embedding generation
Example applications:
- AI-powered food search engines
- Smart restaurant menu systems
- Food recommendation assistants
- Vision-language research projects
Downstream Use
The model can be integrated into larger systems such as:
- FastAPI-based retrieval APIs
- FAISS vector search systems
- Food recognition applications
- Mobile food search apps
- Multimodal recommendation engines
The embeddings generated by the model can also be used for clustering, similarity search, and recommendation pipelines.
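A minimal sketch of the similarity-search use case, assuming the image and text embeddings have already been computed and stored as NumPy arrays. The helper names `l2_normalize` and `top_k` and the toy 3-dimensional vectors are illustrative and not part of this repository; real embeddings from this model would have a higher dimension.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def top_k(query: np.ndarray, gallery: np.ndarray, k: int = 3):
    """Return indices and cosine scores of the k gallery rows closest to query."""
    q = l2_normalize(query.reshape(1, -1))
    g = l2_normalize(gallery)
    scores = (g @ q.T).ravel()
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy stand-ins for embeddings (hypothetical values, not model output)
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.7, 0.7, 0.0],
                    [0.0, 0.0, 1.0]])
query = np.array([0.9, 0.1, 0.0])

idx, scores = top_k(query, gallery, k=2)
print(idx, scores)
```

For larger collections, the same normalized vectors could be handed to a vector index instead of the brute-force matrix product used here.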
Out-of-Scope Use
This model is not intended for:
- Medical diagnosis
- Nutritional estimation
- Sensitive biometric analysis
- Surveillance applications
- Safety-critical decision systems
The model may also perform poorly on:
- Non-food images
- Extremely low-quality images
- Highly ambiguous or abstract captions
- Food categories absent from the training dataset
Bias, Risks, and Limitations
The model inherits limitations from the base CLIP model and the training dataset.
Potential limitations include:
- Bias toward frequently occurring food categories
- Reduced performance on underrepresented cuisines
- Sensitivity to caption quality
- Limited generalization to unseen food domains
- Dependence on image quality and lighting conditions
The retrieval quality is strongly influenced by the diversity and quality of the caption dataset.
Recommendations
Users should:
- Validate retrieval results before production use
- Avoid using the model for safety-critical systems
- Use balanced datasets during further fine-tuning
- Apply retrieval thresholding for better reliability
- Consider adding multilingual captions for broader use
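Retrieval thresholding, as recommended above, could look roughly like the following. This is a sketch under assumptions: the scores are taken to be cosine similarities, and the cutoff `THRESHOLD = 0.25` is a hypothetical value that would need tuning on validation data for this model.

```python
import numpy as np

def filter_by_threshold(scores, threshold):
    """Keep (index, score) pairs whose score >= threshold, best first."""
    scores = np.asarray(scores, dtype=float)
    keep = np.flatnonzero(scores >= threshold)
    keep = keep[np.argsort(-scores[keep])]
    return [(int(i), float(scores[i])) for i in keep]

# Hypothetical similarity scores for four candidate captions
scores = [0.31, 0.18, 0.27, 0.05]
THRESHOLD = 0.25  # assumed cutoff; tune on held-out data

results = filter_by_threshold(scores, THRESHOLD)
print(results or "no confident match")
```

Returning an empty list when nothing clears the cutoff lets the calling application show "no match" instead of a low-confidence result.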
How to Get Started with the Model
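from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load the fine-tuned checkpoint and its matching preprocessor
model = CLIPModel.from_pretrained("Gaykar/clip-food-model")
processor = CLIPProcessor.from_pretrained("Gaykar/clip-food-model")
model.eval()

# CLIP expects 3-channel RGB input
image = Image.open("food.jpg").convert("RGB")

# Softmax over a single caption is always 1.0, so score several candidates;
# the captions after the first are illustrative
captions = ["a delicious pizza", "a bowl of salad", "a chocolate cake"]
inputs = processor(
    text=captions,
    images=image,
    return_tensors="pt",
    padding=True,
)

# Inference only: skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Shape (1, len(captions)): image-to-caption similarity logits
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))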
Training Details
Training Data
The model was trained on a custom food image-caption dataset containing food images paired with textual descriptions.
Dataset preparation included:
- Caption generation
- Dataset balancing
- CSV-based annotation management
- Data preprocessing and cleaning
- Image-caption pairing
The dataset contains multiple food categories with semantic caption supervision.
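The card does not state how balancing was done. One common approach is to downsample every category to the size of the smallest one, sketched below with the standard library only. The column names (`image_path`, `caption`, `label`), the sample rows, and the helper `balance_by_label` are all hypothetical.

```python
import csv
import io
import random
from collections import defaultdict

def balance_by_label(rows, label_key="label", seed=0):
    """Downsample every class to the size of the smallest one."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    target = min(len(g) for g in groups.values())
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    return balanced

# In-memory CSV standing in for an annotation file (made-up rows)
csv_text = """image_path,caption,label
img/001.jpg,a slice of pepperoni pizza,pizza
img/002.jpg,a whole margherita pizza,pizza
img/003.jpg,a pizza with mushrooms,pizza
img/004.jpg,a bowl of ramen with egg,ramen
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))
balanced = balance_by_label(rows)
print([r["label"] for r in balanced])
```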
Training Procedure
The model was fine-tuned using the Hugging Face Transformers library with a PyTorch backend.
Preprocessing
Preprocessing steps included:
- Image resizing
- CLIP image normalization
- Caption tokenization
- Batch collation using CLIPProcessor
- Dataset balancing and filtering
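In practice `CLIPProcessor` performs these steps. To make the normalization step concrete, here is a sketch that applies the per-channel mean and standard deviation published with the original CLIP release to an image already resized to 224x224. It assumes this checkpoint keeps the base model's preprocessing values; the function name `clip_normalize` is illustrative.

```python
import numpy as np

# Per-channel RGB statistics from the original CLIP preprocessing
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def clip_normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to a normalized 3xHxW float32 array."""
    if image_uint8.ndim != 3 or image_uint8.shape[-1] != 3:
        raise ValueError("expected an HxWx3 RGB array")
    x = image_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - CLIP_MEAN) / CLIP_STD              # broadcast over channels
    return np.transpose(x, (2, 0, 1))           # channels-first layout

# Uniform gray image standing in for a resized food photo
dummy = np.full((224, 224, 3), 128, dtype=np.uint8)
pixel_values = clip_normalize(dummy)
print(pixel_values.shape, pixel_values[:, 0, 0])
```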
Training Hyperparameters
- Base Model: openai/clip-vit-base-patch32
- Batch Size: [Add Batch Size]
- Learning Rate: [Add Learning Rate]
- Epochs: [Add Epoch Count]
- Optimizer: AdamW
- Training regime: fp32 / mixed precision GPU training
- Framework: PyTorch + Hugging Face Transformers
Speeds, Sizes, Times
- Training Device: GPU
- Checkpoint Format: Hugging Face Transformers
- Model Size: Same as the CLIP ViT-B/32 architecture (about 151M parameters)
Evaluation
Testing Data, Factors & Metrics
Testing Data
Evaluation was performed on a held-out test split from the food image-caption dataset.
The test data consisted of unseen food image-caption pairs.
Factors
Evaluation considered:
- Image-text semantic similarity
- Retrieval alignment quality
- Caption relevance
- Cross-modal embedding consistency
Metrics
The model was evaluated using:
- Retrieval similarity scores
- Top-k retrieval matching
- Embedding similarity analysis
- Accuracy-based retrieval evaluation
Future evaluations may add standard retrieval metrics such as:
- Recall@K
- Precision@K
- Mean Reciprocal Rank (MRR)
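The metrics listed above can be computed from an image-by-caption similarity matrix. The sketch below assumes each image has exactly one correct caption, located on the diagonal; the function name `retrieval_metrics` and the 3x3 matrix are illustrative and do not report results for this model.

```python
import numpy as np

def retrieval_metrics(similarity: np.ndarray, ks=(1, 5)) -> dict:
    """Recall@K and MRR where row i's correct match is column i."""
    correct = np.diag(similarity)[:, None]
    # Rank of the correct item: columns scoring strictly higher, plus 1
    ranks = (similarity > correct).sum(axis=1) + 1
    metrics = {f"recall@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["mrr"] = float((1.0 / ranks).mean())
    return metrics

# Hypothetical similarity scores: rows are images, columns are captions
sim = np.array([[0.9, 0.1, 0.2],
                [0.6, 0.5, 0.1],
                [0.3, 0.8, 0.7]])
m = retrieval_metrics(sim, ks=(1, 2))
print(m)
```

With a single relevant caption per image, Precision@K follows directly as Recall@K divided by K.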
Results
The fine-tuned model demonstrated improved semantic alignment between food images and captions compared to baseline zero-shot CLIP retrieval on the custom dataset.
The model successfully learned domain-specific food representations and improved retrieval consistency.
Summary
The project demonstrates effective fine-tuning of CLIP for multimodal food retrieval tasks using custom image-caption supervision.
The model performs well for semantic food search and retrieval-oriented applications.
Model Examination
The model generates joint image-text embeddings using contrastive learning principles.
Embedding similarity analysis shows that semantically related food images and captions cluster closer in vector space after fine-tuning.
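One simple way to quantify this kind of clustering is to compare the mean cosine similarity of matched pairs with that of mismatched pairs. The sketch below is an assumption about how such an analysis could be run, not the procedure used for this card; `alignment_gap` and the 2-dimensional vectors are hypothetical.

```python
import numpy as np

def alignment_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> dict:
    """Mean cosine similarity of matched (diagonal) vs. mismatched pairs."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = img @ txt.T
    n = sim.shape[0]
    matched = float(np.trace(sim) / n)
    mismatched = float((sim.sum() - np.trace(sim)) / (n * (n - 1)))
    return {"matched": matched, "mismatched": mismatched,
            "gap": matched - mismatched}

# Toy embeddings: row i of img pairs with row i of txt
img = np.array([[1.0, 0.0], [0.0, 1.0]])
txt = np.array([[0.8, 0.6], [0.6, 0.8]])
stats = alignment_gap(img, txt)
print(stats)
```

Running the same function on embeddings from the base model and from the fine-tuned model would show whether the gap widened after fine-tuning.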
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator.
Hardware Type: NVIDIA T4 GPU
Cloud Provider: Google Colab / Local GPU
Compute Region: India
Carbon Emitted: Unknown
Technical Specifications
Model Architecture and Objective
This model is based on the CLIP (Contrastive Language-Image Pretraining) architecture.
Architecture components include:
- Vision Transformer (ViT-B/32) image encoder
- Transformer-based text encoder
- Contrastive multimodal embedding learning
Training objective:
- Align image embeddings with matching text embeddings
- Maximize similarity for correct image-text pairs
- Minimize similarity for incorrect pairs
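The objective above corresponds to the symmetric cross-entropy loss described in the CLIP paper. Below is a NumPy sketch for illustration only: actual training would use PyTorch, and the default `logit_scale=100.0` is an assumed value for the temperature, which in CLIP is a learned parameter.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, logit_scale=100.0) -> float:
    """Symmetric cross-entropy over the image-text similarity matrix."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = logit_scale * img @ txt.T  # (batch, batch); diagonal = true pairs
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)

# Random vectors standing in for a batch of 4 embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(emb, emb)         # every pair matches
shuffled = clip_contrastive_loss(emb, emb[::-1])  # every pair mismatched
print(aligned < shuffled)
```

The loss is lower when each image embedding is closest to its own caption, which is the behavior the three bullet points describe.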
Compute Infrastructure
Hardware
- NVIDIA GPU
- Google Colab / Local training environment
Software
- Python
- PyTorch
- Hugging Face Transformers
- Pandas
- NumPy
- PIL
- tqdm
Citation
BibTeX
@article{radford2021learning,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
journal={arXiv preprint arXiv:2103.00020},
year={2021}
}
APA
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.
Glossary
- CLIP: Contrastive Language-Image Pretraining
- Embedding: Dense vector representation of data
- Multimodal Learning: Learning across multiple data modalities such as images and text
- Contrastive Learning: Learning representations by comparing positive and negative pairs
More Information
This project was developed as a multimodal AI research and engineering project focused on semantic food retrieval systems using CLIP fine-tuning techniques.
Future improvements may include:
- FAISS-based retrieval
- LoRA fine-tuning
- Hard negative mining
- Better retrieval metrics
- Web deployment
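As an illustration of what hard negative mining could involve, the sketch below picks, for each image, the wrong caption with the highest similarity. It assumes a square similarity matrix whose diagonal holds the true pairs; `hardest_negatives` and the scores are hypothetical and do not describe an implemented feature of this project.

```python
import numpy as np

def hardest_negatives(similarity: np.ndarray) -> np.ndarray:
    """For each image (row), return the index of the most similar wrong caption."""
    masked = similarity.astype(float)  # astype returns a copy by default
    np.fill_diagonal(masked, -np.inf)  # exclude the true pair
    return masked.argmax(axis=1)

# Hypothetical similarity scores: rows are images, columns are captions
sim = np.array([[0.9, 0.1, 0.2],
                [0.6, 0.5, 0.1],
                [0.3, 0.8, 0.7]])
negatives = hardest_negatives(sim)
print(negatives)
```

The selected captions could then be grouped into the same training batch as their images so the contrastive loss sees more difficult comparisons.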
Model Card Authors
Atharva Gaykar
Model Card Contact
Atharva Gaykar gaykaratharva7@gmail.com