PictSure: Few-Shot Image Classification with In-Context Learning

Model Description

PictSure is a few-shot learning model for image classification that leverages labeled context images to make predictions on new, unseen images. The model combines pre-trained image encoders with a transformer architecture to enable effective few-shot learning from only a handful of examples per class. More details can be found in our paper (arXiv:2506.14842) and on the project website.

Key Features

  • Few-shot Learning: Classify images with only a few examples per class
  • Context-Aware: Uses context images and labels to inform predictions
  • Flexible Architecture: Supports ResNet and Vision Transformer (ViT) backbones, as well as custom backbones
  • Transformer-Based: Employs transformer encoders for sequence processing
  • Easy Integration: Simple API for setting context and making predictions

How to Use

Installation

pip install torch torchvision
pip install huggingface_hub
pip install PictSure

Basic Usage

This example, including the referenced images, can also be found in our GitHub repository.

from PictSure import PictSure
from PIL import Image

# Load pre-trained model
model = PictSure.from_pretrained("pictsure/pictsure-vit")

# Prepare context images and labels
context_images = [
    Image.open("cat1.jpg"),
    Image.open("cat2.jpg"),
    Image.open("dog1.jpg"),
    Image.open("dog2.jpg")
]
context_labels = [0, 0, 1, 1]  # 0 for cat, 1 for dog

# Set context
model.set_context_images(context_images, context_labels)

# Make prediction on new image
test_image = Image.open("unknown_animal.jpg")
prediction = model.predict(test_image)
print(f"Predicted class: {prediction}")

Model Architecture

PictSure consists of several key components:

  1. Embedding Network: Pre-trained ResNet18 or custom Vision Transformer for feature extraction
  2. Projection Layers: Linear projections for image features and label embeddings
  3. Transformer Encoder: Multi-head attention mechanism with configurable heads and layers
  4. Classification Head: Final linear layer for class prediction

Architecture Details

  • Image Resolution: 224 × 224 pixels
  • Feature Dimension: 1024D (concatenated image + label projections)
  • Default Configuration:
    • ResNet model: 8 attention heads, 4 transformer layers, ~53M parameters
    • ViT model: 8 attention heads, 4 transformer layers, ~128M parameters
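
The exact wiring lives in the PictSure package; purely for orientation, a minimal PyTorch sketch of how the four components listed above could fit together follows. The query-placeholder token, the 512-dimensional projections, and the class count are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PictSureSketch(nn.Module):
    def __init__(self, num_classes=10, d_model=1024, nhead=8, num_layers=4):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")  # 1. embedding network
        backbone.fc = nn.Identity()                   # expose 512-d features
        self.backbone = backbone
        self.img_proj = nn.Linear(512, d_model // 2)  # 2. image projection
        # Label embedding; the last index serves as a placeholder for the query.
        self.lbl_embed = nn.Embedding(num_classes + 1, d_model // 2)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # 3. transformer encoder
        self.head = nn.Linear(d_model, num_classes)   # 4. classification head

    def forward(self, ctx_imgs, ctx_lbls, query_img):
        # ctx_imgs: (N, 3, 224, 224), ctx_lbls: (N,) long, query_img: (1, 3, 224, 224)
        imgs = torch.cat([ctx_imgs, query_img], dim=0)
        feats = self.img_proj(self.backbone(imgs))               # (N+1, d_model/2)
        query_tok = torch.tensor([self.lbl_embed.num_embeddings - 1])
        lbls = self.lbl_embed(torch.cat([ctx_lbls, query_tok]))  # (N+1, d_model/2)
        seq = torch.cat([feats, lbls], dim=-1).unsqueeze(0)      # (1, N+1, d_model)
        return self.head(self.encoder(seq)[:, -1])               # logits for the query image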

Intended Use

Primary Use Cases

  • Few-shot image classification in scenarios with limited labeled data
  • Meta-learning applications where rapid adaptation to new classes is required
  • Educational and research purposes in computer vision and machine learning
  • Prototyping classification systems with minimal training data

Limitations

  • Requires context images to be set before making predictions
  • Performance depends on the quality and representativeness of context examples
  • Limited to classification tasks (not suitable for detection or segmentation)
  • Input images must be resized to 224×224 pixels

Training Data

The released models were trained and evaluated on the following data:

  • Encoder models: ImageNet pre-trained features (ResNet18/ViT)
  • Meta-Training: ImageNet-21k
  • Validation: Standard few-shot learning benchmarks, e.g. mini-ImageNet, PlantDoc and BoneBreak

Data Preprocessing

  • Images resized to 224×224 pixels
  • Normalized per channel with the following statistics:
    • Mean: [0.4914, 0.4822, 0.4465]
    • Std: [0.2023, 0.1994, 0.2010]
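
For reference, replicating this preprocessing with standard torchvision transforms might look like the sketch below; the PictSure API may already apply these steps internally.

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                      # model input resolution
    transforms.ToTensor(),                              # PIL image -> [0, 1] float tensor
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], # statistics from this card
                         std=[0.2023, 0.1994, 0.2010]),
])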

Evaluation

Performance Metrics

The model is evaluated using standard few-shot learning metrics:

  • Accuracy: Overall classification accuracy
  • Few-shot Accuracy: Performance in 1-shot and 5-shot scenarios
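
Reproducing such numbers with the public API from Basic Usage amounts to episodic evaluation. In the sketch below, episodes is a hypothetical iterable of N-way K-shot tasks; only set_context_images and predict come from the documented API.

def episode_accuracy(model, episodes):
    # episodes: iterable of (support_imgs, support_lbls, query_imgs, query_lbls)
    correct, total = 0, 0
    for support_imgs, support_lbls, query_imgs, query_lbls in episodes:
        model.set_context_images(support_imgs, support_lbls)  # context = K-shot support set
        for img, lbl in zip(query_imgs, query_lbls):
            correct += int(model.predict(img) == lbl)
            total += 1
    return correct / total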

Model Variants

Model     | Backbone | Parameters | Model Size | Performance
ResPreAll | ResNet18 | ~53M       | ~200MB     | Balanced speed/accuracy
ViTPreAll | ViT-Base | ~128M      | ~500MB     | Higher accuracy

Ethical Considerations

Potential Biases

  • The model inherits biases from ImageNet and ImageNet-21k pre-training
  • Performance may vary across different demographic groups or geographic regions
  • Context examples significantly influence predictions and may introduce bias

Responsible Use

  • Validate performance on your specific use case and demographic groups
  • Be aware of potential biases in context image selection
  • Consider fairness implications when deploying in production systems
  • Ensure diverse and representative context examples

Limitations and Risks

Technical Limitations

  • Context Dependency: Requires good context examples for optimal performance
  • Computational Requirements: Transformer architecture requires significant memory
  • Fixed Architecture: Pre-trained models are tied to a fixed architecture and a fixed number of classes
  • Image Size: Limited to 224×224 input resolution

Potential Risks

  • Misclassification: Incorrect predictions in critical applications
  • Bias Amplification: May amplify biases present in context images
  • Overfitting to Context: May not generalize beyond provided examples

Citation

@misc{schiesser2025pictsure,
      title={PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers}, 
      author={Lukas Schiesser and Cornelius Wolff and Sophie Haas and Simon Pukrop},
      year={2025},
      eprint={2506.14842},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.14842}, 
}

Model Card Contact

For questions about this model card or the PictSure model, open an issue in the GitHub repository.
