photo-identifier-v3

A single fine-tuned ConvNeXt-Small model that identifies a wide range of subjects in everyday photographs — objects, animals, vehicles, food, landmarks, scenes, textures, people, aerial views, and more.

~1,900 classes across 30+ domains. One model, any photo.

⚠️ Active training — the checkpoint published here covers 1,216 classes at epoch 20. A full 300-epoch run targeting ~1,900 classes is in progress; a new version will be pushed when complete.


Model Details

| Property | Value |
|---|---|
| Backbone | ConvNeXt-Small (pretrained ImageNet-1k, torchvision) |
| Parameters | ~49.8 M |
| Input size | 224 × 224 RGB |
| Classes | ~1,900 (see full list in config.json) |
| Loss | Focal Loss γ=2 + Class-Balanced α (CVPR 2019) |
| Inference weights | EMA shadow model (decay 0.999) |
| Best val accuracy | 66.25% (1,216 classes, epoch 20; full run in progress) |
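
The "EMA shadow model" row refers to a second copy of the weights, updated as an exponential moving average after each optimizer step; that copy is what this checkpoint stores. A minimal sketch of the mechanism (illustrative only; this is not the repo's training code, and a production version would also track buffers such as BatchNorm statistics):

```python
import copy

import torch


class EMA:
    """Maintain an exponential moving average copy of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # frozen shadow copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # shadow <- decay * shadow + (1 - decay) * live weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)


# Usage: call update() after every optimizer step; save ema.shadow for inference.
net = torch.nn.Linear(4, 2)
ema = EMA(net, decay=0.999)
ema.update(net)
```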

What It Recognises

| Domain | Examples |
|---|---|
| Food | 101 foods (sushi, pizza, steak, ramen, …) |
| Animals — wildlife | 57 species: snow leopard, orca, gorilla, bison, platypus, … |
| Animals — birds | 40 species: bald eagle, painted bunting, snowy owl, roadrunner, … |
| Animals — marine | whale, dolphin, sea turtle, great white shark, octopus, coral reef |
| Animals — reptiles | king cobra, komodo dragon, chameleon, saltwater crocodile, … |
| Animals — insects | monarch butterfly, blue morpho, luna moth, honeybee, dragonfly |
| Animals — exotic birds | flamingo, toucan, penguin, peacock, parrot, albatross |
| Trees | 45 species: coast redwood, joshua tree, weeping willow, ginkgo, … |
| Vehicles | 196 car models (Stanford Cars), 100 aircraft types (FGVC-Aircraft), buses, trains, motorcycles, helicopters |
| Landmarks | Eiffel Tower, Colosseum, Taj Mahal, Machu Picchu, Burj Khalifa, … |
| Named skyscrapers | Empire State, Chrysler, One WTC, Petronas Towers, Taipei 101, … |
| Architecture styles | Victorian, Art Deco, Gothic, Modernist, log cabin, castle, mosque, … |
| Home styles | Farmhouse, craftsman, bungalow, A-frame, adobe, Tudor revival, … |
| Scenes — outdoor | 397 SUN397 scenes + Intel scenes (forest, glacier, mountain, sea, …) |
| Scenes — aerial | 45 RESISC45 overhead/satellite classes (bridge, stadium, airport, …) |
| American Southwest | 26 locations: Arches, Zion, Antelope Canyon, Horseshoe Bend, Wave, … |
| Sky & weather | Sunset, aurora, fog, blizzard + tornado, hurricane, lightning, flood |
| Clouds | Cumulus, cirrus, cumulonimbus, lenticular, mammatus |
| Mountains | Everest, K2, Matterhorn, Denali, Kilimanjaro, Mont Blanc, Rainier |
| Night scenes | City at night, neon signs, fireworks, bioluminescence |
| Space | Rocket launch, astronaut, Earth from space, moon, Milky Way |
| Sports | 17 action sports: basketball, surfing, skiing, cycling, gymnastics, … |
| Musical instruments | Piano, violin, guitar, drums, saxophone, cello, harp, banjo, … |
| Flowers | 102 Oxford flowers + 12 wildflower species |
| Rocks & minerals | Granite, obsidian, quartz, amethyst, geode, malachite |
| Mushrooms | Chanterelle, morel, oyster, fly agaric, lion's mane |
| Textures | 47 DTD texture classes (rippled, braided, knitted, cracked, …) |
| Traffic signs | 43 GTSRB German traffic sign types |
| People | FairFace 18 age × gender classes + full-body people/crowd/wedding |
| Medical | 4 Alzheimer MRI stages, 7 skin lesion types |
| Documents | 6 document type classes |

Quick Start

Google Colab / fresh environment: run !pip install -q transformers torchvision safetensors Pillow huggingface_hub first.

from transformers import AutoModelForImageClassification
from PIL import Image

# Load from HuggingFace Hub (trust_remote_code required for custom backbone)
model = AutoModelForImageClassification.from_pretrained(
    "BlakePeavy/photo-identifier-v3",
    trust_remote_code=True,
)
model.eval()

# Run inference
img = Image.open("my_photo.jpg").convert("RGB")
results = model.predict(img, top_k=5)

for label, score in results:
    print(f"{score:.1%}  {label}")

Example output:

82.4%  animals--snow_leopard
 9.1%  animals--cheetah
 4.2%  animals--jaguar
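
As the output shows, labels follow a domain--class_name pattern. A small display helper can split them (the function name is ours, not part of the repo):

```python
def pretty_label(raw: str) -> str:
    """Turn 'animals--snow_leopard' into 'snow leopard (animals)'."""
    domain, _, name = raw.partition("--")
    if not name:  # label has no domain prefix
        return domain
    return f"{name.replace('_', ' ')} ({domain})"


print(pretty_label("animals--snow_leopard"))  # snow leopard (animals)
```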

Using transformers pipeline

The image processor must be loaded explicitly because this model uses a custom model_type not registered in the default transformers auto-registry.

from transformers import pipeline, AutoImageProcessor

# Load the image processor from the repo's preprocessor_config.json
processor = AutoImageProcessor.from_pretrained(
    "BlakePeavy/photo-identifier-v3",
    use_fast=False,
)
pipe = pipeline(
    "image-classification",
    model="BlakePeavy/photo-identifier-v3",
    image_processor=processor,
    trust_remote_code=True,
)
results = pipe("my_photo.jpg", top_k=5)
for r in results:
    print(f"{r['score']:.1%}  {r['label']}")

Loading the Model Manually

Useful when you want plain PyTorch with no transformers dependency. The weights are stored as model.safetensors (not pytorch_model.bin). Keys have a convnext. prefix that must be stripped before loading into a bare torchvision.models.convnext_small.

# !pip install -q torch torchvision safetensors Pillow huggingface_hub
import torch
import json
from torchvision import models, transforms
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from PIL import Image

REPO = "BlakePeavy/photo-identifier-v3"

# Download model files
config_path  = hf_hub_download(REPO, "config.json")
weights_path = hf_hub_download(REPO, "model.safetensors")

# Load label map
with open(config_path) as f:
    cfg = json.load(f)
classes = cfg["id2label"]  # {"0": "class_name", ...}

# Rebuild the backbone (weights=None — we load from safetensors below)
model = models.convnext_small(weights=None)
in_f = model.classifier[-1].in_features
model.classifier[-1] = torch.nn.Linear(in_f, len(classes))

# Load from safetensors — strip the "convnext." wrapper prefix
sd = load_file(weights_path)
sd = {k.replace("convnext.", "", 1): v for k, v in sd.items()
      if k.startswith("convnext.")}
model.load_state_dict(sd)
model.eval()

# Preprocess
tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("my_photo.jpg").convert("RGB")
with torch.no_grad():
    logits = model(tf(img).unsqueeze(0))
    probs  = logits.softmax(-1)[0]
    top5   = probs.topk(5)

for score, idx in zip(top5.values, top5.indices):
    print(f"{score:.1%}  {classes[str(idx.item())]}")

Training Recipe

15 techniques applied together, 14 of them from published papers:

| Technique | Paper |
|---|---|
| ConvNeXt-Small backbone | Liu et al., CVPR 2022 |
| Differential LR (1e-5 backbone / 5e-4 head) | Standard transfer-learning practice |
| Warmup + cosine annealing | Loshchilov & Hutter, ICLR 2017 |
| Label smoothing ε=0.1 | Müller, Kornblith & Hinton, NeurIPS 2019 |
| Mixup α=0.2 | Zhang et al., ICLR 2018 |
| CutMix α=1.0 | Yun et al., ICCV 2019 |
| RandAugment N=2, M=9 | Cubuk et al., NeurIPS 2020 |
| Random Erasing p=0.25 (post-normalize) | Zhong et al., AAAI 2020 |
| EMA decay=0.999 | Tarvainen & Valpola, NeurIPS 2017 |
| Gradient clipping max_norm=1.0 | |
| Progressive resize 160→224 px | Tan & Le, ICML 2021 |
| Focal Loss γ=2 | Lin et al., ICCV 2017 |
| Class-Balanced α weights β=0.9999 | Cui et al., CVPR 2019 |
| AdamW weight_decay=0.05 | Loshchilov & Hutter, ICLR 2019 |
| Automatic Mixed Precision (AMP) | Micikevicius et al., ICLR 2018 |
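
The two loss rows work together: Class-Balanced weighting (Cui et al.) derives per-class α values from the effective number of samples, and those weights scale the focal term (Lin et al.). A sketch under the β=0.9999, γ=2 settings listed above — an illustration of the published formulas, not necessarily the repo's exact implementation:

```python
import torch
import torch.nn.functional as F


def cb_focal_loss(logits, targets, samples_per_class, beta=0.9999, gamma=2.0):
    """Class-Balanced Focal Loss.

    alpha_c ∝ (1 - beta) / (1 - beta**n_c), normalized so weights average to 1,
    then applied to the focal term (1 - p_t)**gamma * CE.
    """
    n = torch.as_tensor(samples_per_class, dtype=torch.float32)
    alpha = (1.0 - beta) / (1.0 - torch.pow(beta, n))  # effective-number weights
    alpha = alpha / alpha.sum() * len(alpha)           # normalize: mean weight = 1

    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, reduction="none")  # per-sample cross-entropy
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    focal = (1.0 - p_t) ** gamma * ce                  # down-weight easy examples
    return (alpha[targets] * focal).mean()
```

With uniform class counts and γ=0 this reduces to plain cross-entropy, which makes it easy to sanity-check.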

Data Sources

Trained on a mix of public research datasets and openly licensed photos. Two sources carry licence terms worth noting:

  • iNaturalist — species observation photos. Individual observations are licensed by their contributors; a subset is CC BY-NC. This model is accordingly released for non-commercial use.
  • Wikimedia Commons — CC-licensed landscape and subject photography. Some images are CC BY-SA (share-alike).

Limitations

  • Resolution: Trained at 224×224. Very small subjects in high-resolution photos may not be detected.
  • Rare classes: Classes with fewer than 100 training images (some Wikimedia groups) have higher error rates. Focal Loss mitigates but does not eliminate this.
  • Medical classes (Alzheimer MRI, skin lesions) are for demonstration only — not for clinical use.
  • Class overlap: Some visually similar classes (leopard / cheetah / jaguar, Victorian / Gothic architecture) may be confused near their decision boundaries.
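
One common workaround for the resolution limitation is to average predictions over the full frame plus a grid of crops, so small subjects are seen at closer to native scale. A sketch (the helper and grid scheme are ours; model and tf are the objects built in the manual-loading example):

```python
import torch
from PIL import Image


def tiled_predict(model, tf, img: Image.Image, grid: int = 2):
    """Average class probabilities over the full image plus grid x grid crops."""
    w, h = img.size
    views = [img]
    for i in range(grid):
        for j in range(grid):
            views.append(img.crop((j * w // grid, i * h // grid,
                                   (j + 1) * w // grid, (i + 1) * h // grid)))
    batch = torch.stack([tf(v) for v in views])  # (1 + grid*grid, C, H, W)
    with torch.no_grad():
        return model(batch).softmax(-1).mean(0)  # averaged probabilities
```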

Citation

If you use this model in your work:

@misc{photo-identifier-v3-2026,
  title  = {photo-identifier-v3: A Single Model for ~1900-class Open-World Image Classification},
  year   = {2026},
  url    = {https://huggingface.co/BlakePeavy/photo-identifier-v3},
  note   = {ConvNeXt-Small fine-tuned on 20+ datasets with Focal Loss and Class-Balanced weighting}
}

Licence

Code: MIT
Model weights: Non-Commercial — a subset of iNaturalist training data is CC BY-NC.
See iNaturalist licensing and Wikimedia Commons reuse terms for details.
