photo-identifier-v3

A single fine-tuned ConvNeXt-Small model that identifies a wide range of subjects in everyday photographs — objects, animals, vehicles, food, landmarks, scenes, textures, people, aerial views, and more.

~1,900 classes across 30+ domains. One model, any photo.

⚠️ Active training — the checkpoint published here covers 1,216 classes at epoch 20. A full 300-epoch run targeting ~1,900 classes is in progress; a new version will be pushed when complete.


Model Details

| Property | Value |
|---|---|
| Backbone | ConvNeXt-Small (pretrained ImageNet-1k, torchvision) |
| Parameters | ~49.8 M |
| Input size | 224 × 224 RGB |
| Classes | ~1,900 (see full list in config.json) |
| Loss | Focal Loss γ=2 + Class-Balanced α (CVPR 2019) |
| Inference weights | EMA shadow model (decay 0.999) |
| Best val accuracy | 66.25% (1,216 classes, epoch 20; full run in progress) |
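
The "EMA shadow model" row refers to a second copy of the weights, updated as an exponential moving average after each optimizer step; that copy is what this checkpoint stores. A minimal sketch of the mechanism (illustrative only; this is not the repo's training code, and a production version would also track buffers such as BatchNorm statistics):

```python
import copy

import torch


class EMA:
    """Maintain an exponential moving average copy of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # frozen shadow copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * live weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)


# Usage: call update() after every optimizer step; save ema.shadow for inference.
net = torch.nn.Linear(4, 2)
ema = EMA(net, decay=0.999)
ema.update(net)
```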

What It Recognises

| Domain | Examples |
|---|---|
| Food | 101 foods (sushi, pizza, steak, ramen, …) |
| Animals — wildlife | 57 species: snow leopard, orca, gorilla, bison, platypus, … |
| Animals — birds | 40 species: bald eagle, painted bunting, snowy owl, roadrunner, … |
| Animals — marine | whale, dolphin, sea turtle, great white shark, octopus, coral reef |
| Animals — reptiles | king cobra, komodo dragon, chameleon, saltwater crocodile, … |
| Animals — insects | monarch butterfly, blue morpho, luna moth, honeybee, dragonfly |
| Animals — exotic birds | flamingo, toucan, penguin, peacock, parrot, albatross |
| Trees | 45 species: coast redwood, joshua tree, weeping willow, ginkgo, … |
| Vehicles | 196 car models (Stanford Cars), 100 aircraft types (FGVC-Aircraft), buses, trains, motorcycles, helicopters |
| Landmarks | Eiffel Tower, Colosseum, Taj Mahal, Machu Picchu, Burj Khalifa, … |
| Named skyscrapers | Empire State, Chrysler, One WTC, Petronas Towers, Taipei 101, … |
| Architecture styles | Victorian, Art Deco, Gothic, Modernist, log cabin, castle, mosque, … |
| Home styles | Farmhouse, craftsman, bungalow, A-frame, adobe, Tudor revival, … |
| Scenes — outdoor | 397 SUN397 scenes + Intel scenes (forest, glacier, mountain, sea, …) |
| Scenes — aerial | 45 RESISC45 overhead/satellite classes (bridge, stadium, airport, …) |
| American Southwest | 26 locations: Arches, Zion, Antelope Canyon, Horseshoe Bend, Wave, … |
| Sky & weather | Sunset, aurora, fog, blizzard + tornado, hurricane, lightning, flood |
| Clouds | Cumulus, cirrus, cumulonimbus, lenticular, mammatus |
| Mountains | Everest, K2, Matterhorn, Denali, Kilimanjaro, Mont Blanc, Rainier |
| Night scenes | City at night, neon signs, fireworks, bioluminescence |
| Space | Rocket launch, astronaut, Earth from space, moon, Milky Way |
| Sports | 17 action sports: basketball, surfing, skiing, cycling, gymnastics, … |
| Musical instruments | Piano, violin, guitar, drums, saxophone, cello, harp, banjo, … |
| Flowers | 102 Oxford flowers + 12 wildflower species |
| Rocks & minerals | Granite, obsidian, quartz, amethyst, geode, malachite |
| Mushrooms | Chanterelle, morel, oyster, fly agaric, lion's mane |
| Textures | 47 DTD texture classes (rippled, braided, knitted, cracked, …) |
| Traffic signs | 43 GTSRB German traffic sign types |
| People | FairFace 18 age × gender classes + full-body people/crowd/wedding |
| Medical | 4 Alzheimer MRI stages, 7 skin lesion types |
| Documents | 6 document type classes |

Quick Start

Google Colab / fresh environment: run !pip install -q transformers torchvision safetensors Pillow huggingface_hub first.

from transformers import AutoModelForImageClassification
from PIL import Image

# Load from HuggingFace Hub (trust_remote_code required for custom backbone)
model = AutoModelForImageClassification.from_pretrained(
    "BlakePeavy/photo-identifier-v3",
    trust_remote_code=True,
)
model.eval()

# Run inference
img = Image.open("my_photo.jpg").convert("RGB")
results = model.predict(img, top_k=5)

for label, score in results:
    print(f"{score:.1%}  {label}")

Example output:

82.4%  animals--snow_leopard
 9.1%  animals--cheetah
 4.2%  animals--jaguar
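
As the output shows, labels follow a domain--class_name pattern. A small display helper can split them (the function name is ours, not part of the repo):

```python
def pretty_label(raw: str) -> str:
    """Turn 'animals--snow_leopard' into 'snow leopard (animals)'."""
    domain, _, name = raw.partition("--")
    if not name:  # label has no domain prefix
        return domain
    return f"{name.replace('_', ' ')} ({domain})"


print(pretty_label("animals--snow_leopard"))  # snow leopard (animals)
```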

Using transformers pipeline

The image processor must be loaded explicitly because this model uses a custom model_type not registered in the default transformers auto-registry.

from transformers import pipeline, AutoImageProcessor

# Load the image processor from the repo's preprocessor_config.json
processor = AutoImageProcessor.from_pretrained(
    "BlakePeavy/photo-identifier-v3",
    use_fast=False,
)
pipe = pipeline(
    "image-classification",
    model="BlakePeavy/photo-identifier-v3",
    image_processor=processor,
    trust_remote_code=True,
)
results = pipe("my_photo.jpg", top_k=5)
for r in results:
    print(f"{r['score']:.1%}  {r['label']}")

Loading the Model Manually

Useful when you want plain PyTorch with no transformers dependency. The weights are stored as model.safetensors (not pytorch_model.bin). Keys have a convnext. prefix that must be stripped before loading into a bare torchvision.models.convnext_small.

# !pip install -q torch torchvision safetensors Pillow huggingface_hub
import torch
import json
from torchvision import models, transforms
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from PIL import Image

REPO = "BlakePeavy/photo-identifier-v3"

# Download model files
config_path  = hf_hub_download(REPO, "config.json")
weights_path = hf_hub_download(REPO, "model.safetensors")

# Load label map
with open(config_path) as f:
    cfg = json.load(f)
classes = cfg["id2label"]  # {"0": "class_name", ...}

# Rebuild the backbone (weights=None — we load from safetensors below)
model = models.convnext_small(weights=None)
in_f = model.classifier[-1].in_features
model.classifier[-1] = torch.nn.Linear(in_f, len(classes))

# Load from safetensors — strip the "convnext." wrapper prefix
sd = load_file(weights_path)
sd = {k.replace("convnext.", "", 1): v for k, v in sd.items()
      if k.startswith("convnext.")}
model.load_state_dict(sd)
model.eval()

# Preprocess
tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("my_photo.jpg").convert("RGB")
with torch.no_grad():
    logits = model(tf(img).unsqueeze(0))
    probs  = logits.softmax(-1)[0]
    top5   = probs.topk(5)

for score, idx in zip(top5.values, top5.indices):
    print(f"{score:.1%}  {classes[str(idx.item())]}")

Training Recipe

15 techniques applied together, 14 of them from published papers:

| Technique | Paper |
|---|---|
| ConvNeXt-Small backbone | Liu et al., CVPR 2022 |
| Differential LR (1e-5 backbone / 5e-4 head) | Standard transfer-learning practice |
| Warmup + cosine annealing | Loshchilov & Hutter, ICLR 2017 |
| Label smoothing ε=0.1 | Müller, Kornblith & Hinton, NeurIPS 2019 |
| Mixup α=0.2 | Zhang et al., ICLR 2018 |
| CutMix α=1.0 | Yun et al., ICCV 2019 |
| RandAugment N=2, M=9 | Cubuk et al., NeurIPS 2020 |
| Random Erasing p=0.25 (post-normalize) | Zhong et al., AAAI 2020 |
| EMA decay=0.999 | Tarvainen & Valpola, NeurIPS 2017 |
| Gradient clipping max_norm=1.0 | |
| Progressive resize 160→224 px | Tan & Le, ICML 2021 |
| Focal Loss γ=2 | Lin et al., ICCV 2017 |
| Class-Balanced α weights β=0.9999 | Cui et al., CVPR 2019 |
| AdamW weight_decay=0.05 | Loshchilov & Hutter, ICLR 2019 |
| Automatic Mixed Precision (AMP) | Micikevicius et al., ICLR 2018 |
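
The two loss rows work together: Class-Balanced weighting (Cui et al.) derives per-class α values from the effective number of samples, and those weights scale the focal term (Lin et al.). A sketch under the β=0.9999, γ=2 settings listed above — an illustration of the published formulas, not necessarily the repo's exact implementation:

```python
import torch
import torch.nn.functional as F


def cb_focal_loss(logits, targets, samples_per_class, beta=0.9999, gamma=2.0):
    """Class-Balanced Focal Loss.

    alpha_c ∝ (1 - beta) / (1 - beta**n_c), normalized so weights average to 1,
    then applied to the focal term (1 - p_t)**gamma * CE.
    """
    n = torch.as_tensor(samples_per_class, dtype=torch.float32)
    alpha = (1.0 - beta) / (1.0 - torch.pow(beta, n))  # effective-number weights
    alpha = alpha / alpha.sum() * len(alpha)           # normalize: mean weight = 1

    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, reduction="none")  # per-sample cross-entropy
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    focal = (1.0 - p_t) ** gamma * ce                  # down-weight easy examples
    return (alpha[targets] * focal).mean()
```

With uniform class counts and γ=0 this reduces to plain cross-entropy, which makes it easy to sanity-check.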

Data Sources

Trained on a mix of public research datasets and openly licensed photos. Two sources carry licence terms worth noting:

  • iNaturalist — species observation photos. Individual observations are licensed by their contributors; a subset is CC BY-NC. This model is accordingly released for non-commercial use.
  • Wikimedia Commons — CC-licensed landscape and subject photography. Some images are CC BY-SA (share-alike).

Limitations

  • Resolution: Trained at 224×224. Very small subjects in high-resolution photos may not be detected.
  • Rare classes: Classes with fewer than 100 training images (some Wikimedia groups) have higher error rates. Focal Loss mitigates but does not eliminate this.
  • Medical classes (Alzheimer MRI, skin lesions) are for demonstration only — not for clinical use.
  • Class overlap: Some visually similar classes (leopard / cheetah / jaguar, Victorian / Gothic architecture) may be confused near their decision boundaries.
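
One common workaround for the resolution limitation is to average predictions over the full frame plus a grid of crops, so small subjects are seen at closer to native scale. A sketch (the helper and grid scheme are ours; model and tf are the objects built in the manual-loading example):

```python
import torch
from PIL import Image


def tiled_predict(model, tf, img: Image.Image, grid: int = 2):
    """Average class probabilities over the full image plus grid x grid crops."""
    w, h = img.size
    views = [img]
    for i in range(grid):
        for j in range(grid):
            views.append(img.crop((j * w // grid, i * h // grid,
                                   (j + 1) * w // grid, (i + 1) * h // grid)))
    batch = torch.stack([tf(v) for v in views])  # (1 + grid*grid, C, H, W)
    with torch.no_grad():
        return model(batch).softmax(-1).mean(0)  # averaged probabilities
```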

Citation

If you use this model in your work:

@misc{photo-identifier-v3-2026,
  title  = {photo-identifier-v3: A Single Model for ~1900-class Open-World Image Classification},
  year   = {2026},
  url    = {https://huggingface.co/BlakePeavy/photo-identifier-v3},
  note   = {ConvNeXt-Small fine-tuned on 20+ datasets with Focal Loss and Class-Balanced weighting}
}

Licence

Code: MIT
Model weights: Non-Commercial — a subset of iNaturalist training data is CC BY-NC.
See iNaturalist licensing and Wikimedia Commons reuse terms for details.
