TwinCar: ConvNeXt-Tiny (Stanford Cars)

A fine-grained vehicle make/model classifier built on ConvNeXt-Tiny, fully fine-tuned on the Stanford Cars dataset (195 classes). This is the best-performing model from the TwinCar project, which compares several CNN and Vision Transformer architectures for automotive classification.

Given an image of a car, the model predicts the vehicle's make, model, and year as a single fine-grained class (e.g. BMW_X5_SUV_2007).

Model details

| Property | Value |
| --- | --- |
| Architecture | ConvNeXt-Tiny (torchvision.models.convnext_tiny) |
| Initialization | ImageNet-1k pretrained weights (ConvNeXt_Tiny_Weights.IMAGENET1K_V1) |
| Fine-tuning | Full network (single-phase, no frozen layers) |
| Classes | 195 fine-grained make/model/year categories |
| Input size | 224 × 224 RGB |
| Framework | PyTorch |
| Weights file | best_model.pt (raw state_dict, ~112 MB) |

The classifier head is the standard ConvNeXt-Tiny head with its final Linear layer replaced by nn.Linear(in_features, 195). Only that layer changes; the LayerNorm and Flatten layers are kept as in the original head.

Files in this repository

| File | Description |
| --- | --- |
| best_model.pt | Model weights, saved as a plain state_dict |
| idx_to_class.json | Maps class index → label string (e.g. 42 → "BMW_X5_SUV_2007") |
| train_config.json | Training hyperparameters and preprocessing constants |
| README.md | This model card |

Note on the weights format. best_model.pt is a raw state_dict saved with torch.save(model.state_dict(), ...), not a wrapped checkpoint. Load it directly with model.load_state_dict(...) as shown below.
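To illustrate the difference between the two formats, here is a minimal sketch of a helper that accepts either one. The function name `unwrap_state_dict` is hypothetical, and the wrapper keys `model_state_dict` and `state_dict` are assumed common conventions; neither is used by this repository, which ships the raw format.

```python
from collections.abc import Mapping

# Hypothetical wrapper keys; this repository ships a raw state_dict instead.
WRAPPER_KEYS = ("model_state_dict", "state_dict")


def unwrap_state_dict(obj):
    """Return the parameter mapping from a raw state_dict or a wrapped checkpoint."""
    if not isinstance(obj, Mapping):
        raise TypeError(f"expected a mapping, got {type(obj).__name__}")
    for key in WRAPPER_KEYS:
        inner = obj.get(key)
        if isinstance(inner, Mapping):
            return inner  # wrapped checkpoint: parameters live one level down
    return obj  # raw state_dict: parameter names map directly to tensors


# Placeholder values stand in for tensors so the sketch runs without PyTorch.
raw = {"classifier.2.weight": "tensor placeholder"}
wrapped = {"epoch": 7, "model_state_dict": raw}
print(unwrap_state_dict(raw) is raw)
print(unwrap_state_dict(wrapped) is raw)
```

For best_model.pt the helper simply returns its input unchanged, so calling model.load_state_dict(...) directly is equivalent.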

How to use

import json
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms
from huggingface_hub import hf_hub_download

REPO_ID = "cherky15/twincars_convnext"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Download files from the Hub
weights_path = hf_hub_download(REPO_ID, "best_model.pt")
labels_path  = hf_hub_download(REPO_ID, "idx_to_class.json")
config_path  = hf_hub_download(REPO_ID, "train_config.json")

with open(labels_path) as f:
    idx_to_class = {int(k): v for k, v in json.load(f).items()}
with open(config_path) as f:
    config = json.load(f)

num_classes = len(idx_to_class)  # 195

# 2. Rebuild the architecture and load the weights
model = models.convnext_tiny(weights=None)
in_features = model.classifier[2].in_features
model.classifier[2] = nn.Linear(in_features, num_classes)

# weights_only=True restricts unpickling to tensors and plain containers
state_dict = torch.load(weights_path, map_location=device, weights_only=True)
model.load_state_dict(state_dict)
model.to(device).eval()

# 3. Preprocess (must match validation/test transforms used in training)
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(config["img_size"]),          # 224
    transforms.ToTensor(),
    transforms.Normalize(mean=config["imagenet_mean"],  # ImageNet stats
                         std=config["imagenet_std"]),
])

# 4. Predict
image = Image.open("car.jpg").convert("RGB")
x = transform(image).unsqueeze(0).to(device)

with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)
    top5_prob, top5_idx = probs.topk(5, dim=1)

for prob, idx in zip(top5_prob[0], top5_idx[0]):
    label = idx_to_class[int(idx)].replace("_", " ")
    print(f"{label:40s} {prob.item():.3f}")

For batch inference over a folder of images (with make/model/year parsing and CSV export), see batch_predict.py in the TwinCar repository.
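The make/model/year parsing and CSV export mentioned here could look roughly like the sketch below. It is not the batch_predict.py implementation: the helper names are hypothetical, the probability value is made up, and the parser assumes a single-token make followed by a trailing four-digit year, which would not hold for multi-word makes.

```python
import csv
import io


def parse_label(label):
    """Split a label such as 'BMW_X5_SUV_2007' into (make, model, year).

    Assumes a single-token make and a trailing four-digit year; multi-word
    makes would need a lookup table instead.
    """
    tokens = label.split("_")
    if len(tokens) < 3 or not tokens[-1].isdigit():
        raise ValueError(f"unexpected label format: {label!r}")
    make, year = tokens[0], int(tokens[-1])
    model = " ".join(tokens[1:-1])
    return make, model, year


def predictions_to_csv(rows):
    """Render (filename, label, probability) rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "make", "model", "year", "probability"])
    for filename, label, prob in rows:
        make, model, year = parse_label(label)
        writer.writerow([filename, make, model, year, f"{prob:.3f}"])
    return buffer.getvalue()


# Illustrative row only; the probability is not a real model output.
print(predictions_to_csv([("car.jpg", "BMW_X5_SUV_2007", 0.9312)]))
```

Writing to an in-memory buffer keeps the sketch free of file-system side effects; swapping io.StringIO for an opened file gives an on-disk export.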

Intended use

Intended: fine-grained recognition of car make/model/year on clean, well-framed images similar in distribution to Stanford Cars (single centered vehicle, mostly clear backgrounds), and as a research/educational baseline for comparing architectures on fine-grained classification.

Out of scope: the model only knows the 195 Stanford Cars categories. It cannot recognize makes/models outside this set, and will still return a confident class for non-car images or unseen vehicles. It is not validated for safety-critical or surveillance use.
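One partial mitigation is to reject predictions whose top softmax probability falls below a threshold. The sketch below shows the idea in plain Python; the function names and the 0.5 threshold are hypothetical placeholders, not values tuned for this model. Since the model can be confidently wrong on non-car images, a threshold like this reduces but does not remove the problem.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(logits, threshold=0.5):
    """Return (class_index, probability), or (None, probability) if below threshold.

    The 0.5 default is an arbitrary placeholder, not a tuned value.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return best, probs[best]


print(predict_or_reject([4.0, 0.5, 0.2]))  # peaked distribution: index 0 is returned
print(predict_or_reject([1.0, 0.9, 1.1]))  # flat distribution: rejected
```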

Training data

  • Dataset: Stanford Cars (naufalso/stanford_cars)
  • Classes: 195 fine-grained make/model/year categories
  • Test split: 8,000 images
  • Class imbalance was handled during training with a WeightedRandomSampler.
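As a rough illustration of sampler-based rebalancing, the sketch below gives each sample a weight inversely proportional to its class frequency and passes those weights to `WeightedRandomSampler`. The inverse-frequency scheme and both helper names are assumptions; the card does not state how TwinCar computed its weights.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """One weight per sample, equal to 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def build_sampler(labels):
    """Wrap the weights in a WeightedRandomSampler (requires PyTorch)."""
    # Imported lazily so the weight helper above runs without PyTorch installed.
    import torch
    from torch.utils.data import WeightedRandomSampler

    weights = torch.tensor(inverse_frequency_weights(labels), dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)


# Toy labels: class 0 has three samples, class 1 has one.
print(inverse_frequency_weights([0, 0, 0, 1]))
```

With these weights each class carries the same total sampling mass, so rare classes are drawn about as often as common ones. The sampler would be passed to a DataLoader through its sampler argument, in place of shuffle=True.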

Training procedure

Fine-tuned end-to-end from ImageNet-pretrained weights using mixed-precision training (AMP) on a CUDA GPU.

Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Weight decay | 1e-4 |
| Loss | CrossEntropy with label smoothing = 0.1 |
| LR scheduler | ReduceLROnPlateau (mode=min, factor=0.5, patience=2) |
| Batch size | 32 |
| Max epochs | 30 |
| Early stopping | patience = 5 (on validation loss) |
| Image size | 224 × 224 |
| Seed | 42 |
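To make the label-smoothing entry concrete, the sketch below computes the smoothed loss for a single example in plain Python. It follows the uniform-mixture formulation documented for `torch.nn.CrossEntropyLoss(label_smoothing=...)`; that TwinCar used exactly this class is an assumption. The function name and the three-class logits are illustrative only.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target for a single example.

    The target distribution puts (1 - smoothing) on the true class and
    spreads `smoothing` uniformly over all classes (true class included).
    """
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_norm for v in logits]
    num_classes = len(logits)
    nll_true = -log_probs[target]                 # ordinary cross-entropy term
    nll_uniform = -sum(log_probs) / num_classes   # cross-entropy against a uniform target
    return (1.0 - smoothing) * nll_true + smoothing * nll_uniform


logits = [2.0, 0.0, -1.0]  # toy logits, not model outputs
print(smoothed_cross_entropy(logits, target=0, smoothing=0.0))  # plain cross-entropy
print(smoothed_cross_entropy(logits, target=0, smoothing=0.1))  # larger for this example
```

The uniform term penalizes very peaked output distributions, which discourages overconfident predictions on visually similar classes.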

Data augmentation (training)

RandomResizedCrop(224, scale=(0.8, 1.0)), RandomHorizontalFlip(p=0.5), RandomRotation(10), ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, hue=0.02), then ToTensor and ImageNet normalization. Validation/test images use Resize(256) → CenterCrop(224) → ToTensor → ImageNet normalization.
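Written as torchvision pipelines, the two transform chains might look like the sketch below. The function name `build_transforms` is hypothetical, and the mean/std constants are the standard ImageNet statistics, assumed to match the values stored in train_config.json.

```python
IMG_SIZE = 224
# Standard ImageNet statistics; assumed to match the values in train_config.json.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def build_transforms():
    """Return (train_transform, eval_transform) as torchvision pipelines."""
    # Imported lazily so the constants above can be reused without torchvision.
    from torchvision import transforms

    normalize = transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(IMG_SIZE, scale=(0.8, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(10),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, hue=0.02),
        transforms.ToTensor(),
        normalize,
    ])
    eval_transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(IMG_SIZE),
        transforms.ToTensor(),
        normalize,
    ])
    return train_transform, eval_transform
```

Keeping the random augmentations out of the evaluation pipeline makes validation and test metrics deterministic for a given checkpoint.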

The checkpoint corresponds to the epoch with the lowest validation loss.
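Checkpoint selection under patience-based early stopping can be sketched as follows. The strict-improvement rule with no minimum delta is an assumption about the training script, the function name is hypothetical, and the loss values are made up for illustration.

```python
def select_best_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) under patience-based early stopping.

    Epochs are 0-indexed. Training stops once `patience` consecutive epochs
    fail to improve on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # A training loop would save the checkpoint here.
            best_loss, best_epoch = loss, epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses for illustration only.
losses = [1.9, 1.2, 0.9, 0.95, 0.92, 0.93, 0.97, 0.99, 0.80]
print(select_best_epoch(losses, patience=5))
```

In this toy run the best checkpoint comes from epoch 2 and training halts at epoch 7, so the lower loss at epoch 8 is never reached. That is the usual trade-off of early stopping.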

Evaluation results

Evaluated on the Stanford Cars test split (8,000 images, 195 classes).

| Metric | Value |
| --- | --- |
| Test loss | 0.6942 |
| Top-1 accuracy | 87.08% |
| Top-5 accuracy | 96.81% |
| Macro Precision | 0.8765 |
| Macro Recall | 0.8696 |
| Macro F1 | 0.8697 |
| Weighted F1 | 0.8705 |
| Make accuracy | 93.03% |
| Make + Model accuracy | 87.58% |
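The make-level and make + model metrics can be derived from fine-grained predictions by comparing only part of each label. The sketch below assumes labels of the form make_model_year with a single-token make; the helper names and the example labels are hypothetical, and this is not the project's evaluation code.

```python
def split_label(label):
    """Return (make, model) from a label such as 'BMW_X5_SUV_2007'.

    Assumes a single-token make and a trailing year token.
    """
    tokens = label.split("_")
    return tokens[0], " ".join(tokens[1:-1])


def coarse_accuracies(true_labels, pred_labels):
    """Compute exact, make-only, and make+model accuracy as fractions."""
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and the same length")
    exact = make = make_model = 0
    for truth, pred in zip(true_labels, pred_labels):
        exact += truth == pred
        t_make, t_model = split_label(truth)
        p_make, p_model = split_label(pred)
        make += t_make == p_make
        make_model += (t_make, t_model) == (p_make, p_model)
    n = len(true_labels)
    return {"top1": exact / n, "make": make / n, "make_model": make_model / n}


# Made-up labels for illustration, not drawn from the test split.
truth = ["BMW_X5_SUV_2007", "BMW_X5_SUV_2007", "Audi_S4_Sedan_2012", "Audi_S4_Sedan_2012"]
preds = ["BMW_X5_SUV_2007", "BMW_X5_SUV_2008", "Audi_S5_Coupe_2012", "BMW_X5_SUV_2007"]
print(coarse_accuracies(truth, preds))
```

Under this scheme a prediction that is wrong only in the year still counts as a make + model hit. This would explain why that metric (87.58%) sits slightly above Top-1 accuracy (87.08%).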

Comparison with other TwinCar models

| Model | Top-1 | Top-5 | Weighted F1 | Make + Model Acc | Size |
| --- | --- | --- | --- | --- | --- |
| EfficientNet-B0 v1 | 83.08% | 95.20% | 0.8308 | 83.45% | 17.33 MB |
| EfficientNet-B0 v2 | 81.64% | 94.50% | 0.8184 | 82.19% | 17.33 MB |
| ConvNeXt-Tiny (this model) | 87.08% | 96.81% | 0.8705 | 87.58% | 111.95 MB |

ConvNeXt-Tiny is the strongest model for the core TwinCar goal of distinguishing visually similar makes/models (e.g. Audi S4 vs S5, BMW 3 Series vs M3). Its stronger backbone comes at the cost of a larger model size. EfficientNet-B0 v1 remains a useful lightweight baseline (~6.5× smaller, 17.33 MB vs 111.95 MB) when deployment size matters.

Limitations and bias

Results are measured on the Stanford Cars test split. Accuracy will likely drop on real-world images, e.g. those taken from drones or ground robots, with varied angles, shadows, reflections, occlusion, or cluttered parking-lot backgrounds. The reported numbers are benchmark performance on a curated dataset, not guaranteed production performance. The label space and any demographic/geographic biases are inherited from Stanford Cars (predominantly US-market vehicles up to ~2012).

References

  • ConvNeXt: A ConvNet for the 2020s (Liu et al., 2022)
  • Stanford Cars: 3D Object Representations for Fine-Grained Categorization (Krause et al., 2013)
  • Project repository: github.com/dragicakostoska/TwinCar

License

Released under CC BY-NC 4.0 for educational and research purposes. Note that use is also subject to the terms of the Stanford Cars dataset and the ImageNet-pretrained base weights.
