PictSure: Few-Shot Image Classification with In-Context Learning

Model Description

PictSure is a few-shot learning model for image classification that leverages labeled context images to make predictions on new, unseen images. The model combines pre-trained image encoders with a transformer architecture to enable effective few-shot learning from only a handful of examples per class. More details can be found in our paper (arXiv:2506.14842) and on the project website.

Key Features

  • Few-shot Learning: Classify images with only a few examples per class
  • Context-Aware: Uses context images and labels to inform predictions
  • Flexible Architecture: Supports ResNet and Vision Transformer (ViT) backbones, as well as custom backbones
  • Transformer-Based: Employs transformer encoders for sequence processing
  • Easy Integration: Simple API for setting context and making predictions

How to Use

Installation

pip install torch torchvision
pip install huggingface_hub
pip install PictSure

Basic Usage

This example, including the referenced images, can also be found in our GitHub repository.

from PictSure import PictSure
from PIL import Image

# Load pre-trained model
model = PictSure.from_pretrained("pictsure/pictsure-vit")

# Prepare context images and labels
context_images = [
    Image.open("cat1.jpg"),
    Image.open("cat2.jpg"),
    Image.open("dog1.jpg"),
    Image.open("dog2.jpg")
]
context_labels = [0, 0, 1, 1]  # 0 for cat, 1 for dog

# Set context
model.set_context_images(context_images, context_labels)

# Make prediction on new image
test_image = Image.open("unknown_animal.jpg")
prediction = model.predict(test_image)
print(f"Predicted class: {prediction}")

Model Architecture

PictSure consists of several key components:

  1. Embedding Network: Pre-trained ResNet18 or custom Vision Transformer for feature extraction
  2. Projection Layers: Linear projections for image features and label embeddings
  3. Transformer Encoder: Multi-head attention mechanism with configurable heads and layers
  4. Classification Head: Final linear layer for class prediction

Architecture Details

  • Image Resolution: 224 × 224 pixels
  • Feature Dimension: 1024D (concatenated image + label projections)
  • Default Configuration:
    • ResNet model: 8 attention heads, 4 transformer layers, ~53M parameters
    • ViT model: 8 attention heads, 4 transformer layers, ~128M parameters
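
The exact wiring lives in the PictSure package; purely for orientation, a minimal PyTorch sketch of how the four components listed above could fit together follows. The query-placeholder token, the 512-dimensional projections, and the class count are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PictSureSketch(nn.Module):
    def __init__(self, num_classes=10, d_model=1024, nhead=8, num_layers=4):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")  # 1. embedding network
        backbone.fc = nn.Identity()                   # expose 512-d features
        self.backbone = backbone
        self.img_proj = nn.Linear(512, d_model // 2)  # 2. image projection
        # Label embedding; the last index serves as a placeholder for the query.
        self.lbl_embed = nn.Embedding(num_classes + 1, d_model // 2)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # 3. transformer encoder
        self.head = nn.Linear(d_model, num_classes)   # 4. classification head

    def forward(self, ctx_imgs, ctx_lbls, query_img):
        # ctx_imgs: (N, 3, 224, 224), ctx_lbls: (N,) long, query_img: (1, 3, 224, 224)
        imgs = torch.cat([ctx_imgs, query_img], dim=0)
        feats = self.img_proj(self.backbone(imgs))               # (N+1, d_model/2)
        query_tok = torch.tensor([self.lbl_embed.num_embeddings - 1])
        lbls = self.lbl_embed(torch.cat([ctx_lbls, query_tok]))  # (N+1, d_model/2)
        seq = torch.cat([feats, lbls], dim=-1).unsqueeze(0)      # (1, N+1, d_model)
        return self.head(self.encoder(seq)[:, -1])               # logits for the query image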

Intended Use

Primary Use Cases

  • Few-shot image classification in scenarios with limited labeled data
  • Meta-learning applications where rapid adaptation to new classes is required
  • Educational and research purposes in computer vision and machine learning
  • Prototyping classification systems with minimal training data

Limitations

  • Requires context images to be set before making predictions
  • Performance depends on the quality and representativeness of context examples
  • Limited to classification tasks (not suitable for detection or segmentation)
  • Input images must be resized to 224×224 pixels

Training Data

The released models were trained and evaluated on the following data:

  • Encoder models: ImageNet pre-trained features (ResNet18/ViT)
  • Meta-Training: ImageNet-21k
  • Validation: Standard few-shot learning benchmarks, e.g. mini-ImageNet, PlantDoc and BoneBreak

Data Preprocessing

  • Images resized to 224×224 pixels
  • Normalized per channel with the following statistics:
    • Mean: [0.4914, 0.4822, 0.4465]
    • Std: [0.2023, 0.1994, 0.2010]
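
For reference, replicating this preprocessing with standard torchvision transforms might look like the sketch below; the PictSure API may already apply these steps internally.

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                      # model input resolution
    transforms.ToTensor(),                              # PIL image -> [0, 1] float tensor
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], # statistics from this card
                         std=[0.2023, 0.1994, 0.2010]),
])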

Evaluation

Performance Metrics

The model is evaluated using standard few-shot learning metrics:

  • Accuracy: Overall classification accuracy
  • Few-shot Accuracy: Performance in 1-shot and 5-shot scenarios
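
Reproducing such numbers with the public API from Basic Usage amounts to episodic evaluation. In the sketch below, episodes is a hypothetical iterable of N-way K-shot tasks; only set_context_images and predict come from the documented API.

def episode_accuracy(model, episodes):
    # episodes: iterable of (support_imgs, support_lbls, query_imgs, query_lbls)
    correct, total = 0, 0
    for support_imgs, support_lbls, query_imgs, query_lbls in episodes:
        model.set_context_images(support_imgs, support_lbls)  # context = K-shot support set
        for img, lbl in zip(query_imgs, query_lbls):
            correct += int(model.predict(img) == lbl)
            total += 1
    return correct / total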

Model Variants

Model     | Backbone | Parameters | Model Size | Performance
ResPreAll | ResNet18 | ~53M       | ~200MB     | Balanced speed/accuracy
ViTPreAll | ViT-Base | ~128M      | ~500MB     | Higher accuracy

Ethical Considerations

Potential Biases

  • The model inherits biases from ImageNet and ImageNet-21k pre-training
  • Performance may vary across different demographic groups or geographic regions
  • Context examples significantly influence predictions and may introduce bias

Responsible Use

  • Validate performance on your specific use case and demographic groups
  • Be aware of potential biases in context image selection
  • Consider fairness implications when deploying in production systems
  • Ensure diverse and representative context examples

Limitations and Risks

Technical Limitations

  • Context Dependency: Requires good context examples for optimal performance
  • Computational Requirements: Transformer architecture requires significant memory
  • Fixed Architecture: Pre-trained models are tied to a fixed architecture and a fixed number of classes
  • Image Size: Limited to 224×224 input resolution

Potential Risks

  • Misclassification: Incorrect predictions in critical applications
  • Bias Amplification: May amplify biases present in context images
  • Overfitting to Context: May not generalize beyond provided examples

Citation

@misc{schiesser2025pictsure,
      title={PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers}, 
      author={Lukas Schiesser and Cornelius Wolff and Sophie Haas and Simon Pukrop},
      year={2025},
      eprint={2506.14842},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.14842}, 
}

Model Card Contact

For questions about this model card or the PictSure model, open an issue in the GitHub repository.
