TwinCar — ConvNeXt-Tiny (Stanford Cars)

A fine-grained vehicle make/model classifier built on ConvNeXt-Tiny, fully fine-tuned on the Stanford Cars dataset (195 classes). This is the best-performing model from the TwinCar project, which compares several CNN and Vision Transformer architectures for automotive classification.

Given an image of a car, the model predicts the vehicle's make, model, and year as a single fine-grained class (e.g. BMW_X5_SUV_2007).

Model details


Property	Value
Architecture	ConvNeXt-Tiny (`torchvision.models.convnext_tiny`)
Initialization	ImageNet-1k pretrained weights (`ConvNeXt_Tiny_Weights.IMAGENET1K_V1`)
Fine-tuning	Full network (single-phase, no frozen layers)
Classes	195 fine-grained make/model/year categories
Input size	224 × 224 RGB
Framework	PyTorch
Weights file	`best_model.pt` (raw `state_dict`, ~112 MB)

The classifier head is the standard ConvNeXt-Tiny head (LayerNorm → Flatten → Linear) with the final Linear layer replaced by nn.Linear(in_features, 195), where in_features is 768 for ConvNeXt-Tiny. Only this final layer is replaced; the LayerNorm and Flatten layers are kept as in the original head.

Files in this repository

File	Description
`best_model.pt`	Model weights, saved as a plain `state_dict`
`idx_to_class.json`	Maps class index → label string (e.g. `42 → "BMW_X5_SUV_2007"`)
`train_config.json`	Training hyperparameters and preprocessing constants
`README.md`	This model card

Note on the weights format. best_model.pt is a raw state_dict saved with torch.save(model.state_dict(), ...), not a wrapped checkpoint. Load it directly with model.load_state_dict(...) as shown below.

How to use

import json
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms
from huggingface_hub import hf_hub_download

REPO_ID = "cherky15/twincars_convnext"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Download files from the Hub
weights_path = hf_hub_download(REPO_ID, "best_model.pt")
labels_path  = hf_hub_download(REPO_ID, "idx_to_class.json")
config_path  = hf_hub_download(REPO_ID, "train_config.json")

with open(labels_path) as f:
    idx_to_class = {int(k): v for k, v in json.load(f).items()}
with open(config_path) as f:
    config = json.load(f)

num_classes = len(idx_to_class)  # 195

# 2. Rebuild the architecture and load the weights
model = models.convnext_tiny(weights=None)
in_features = model.classifier[2].in_features
model.classifier[2] = nn.Linear(in_features, num_classes)

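# weights_only=True (PyTorch >= 1.13) restricts unpickling to tensors and plain containers
state_dict = torch.load(weights_path, map_location=device, weights_only=True)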
model.load_state_dict(state_dict)
model.to(device).eval()

# 3. Preprocess (must match validation/test transforms used in training)
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(config["img_size"]),          # 224
    transforms.ToTensor(),
    transforms.Normalize(mean=config["imagenet_mean"],  # ImageNet stats
                         std=config["imagenet_std"]),
])

# 4. Predict
image = Image.open("car.jpg").convert("RGB")
x = transform(image).unsqueeze(0).to(device)

with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)
    top5_prob, top5_idx = probs.topk(5, dim=1)

for prob, idx in zip(top5_prob[0], top5_idx[0]):
    label = idx_to_class[int(idx)].replace("_", " ")
    print(f"{label:40s} {prob.item():.3f}")

For batch inference over a folder of images (with make/model/year parsing and CSV export), see batch_predict.py in the TwinCar repository.
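The label strings in idx_to_class.json pack make, model, and year into a single underscore-joined token. The following is a minimal sketch of splitting such a label. It assumes the last token is a four-digit year and that the make is the first token unless it matches a hand-maintained list of multi-word makes. The names parse_label, CarLabel, and MULTI_WORD_MAKES are hypothetical and are not taken from batch_predict.py, whose actual parsing may differ.

```python
from typing import NamedTuple

# Hypothetical list: makes that span more than one underscore-separated token.
# Extend it to match the labels actually present in idx_to_class.json.
MULTI_WORD_MAKES = ("Aston_Martin", "Land_Rover", "AM_General")


class CarLabel(NamedTuple):
    make: str
    model: str
    year: int


def parse_label(label: str) -> CarLabel:
    """Split a label such as 'BMW_X5_SUV_2007' into make, model, and year."""
    body, _, year = label.rpartition("_")
    if not (year.isdigit() and len(year) == 4):
        raise ValueError(f"label does not end in a 4-digit year: {label!r}")
    # Prefer a known multi-word make; otherwise fall back to the first token.
    for make in MULTI_WORD_MAKES:
        if body.startswith(make + "_"):
            break
    else:
        make = body.split("_", 1)[0]
    model = body[len(make):].lstrip("_").replace("_", " ")
    return CarLabel(make.replace("_", " "), model, int(year))


print(parse_label("BMW_X5_SUV_2007"))
```

The fallback rule will mis-split any multi-word make that is missing from the list, so it is worth checking the parsed makes against the full label file before relying on them.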

Intended use

Intended: fine-grained recognition of car make/model/year on clean, well-framed images similar in distribution to Stanford Cars (single centered vehicle, mostly clear backgrounds), and as a research/educational baseline for comparing architectures on fine-grained classification.

Out of scope: the model only knows the 195 Stanford Cars categories. It cannot recognize makes/models outside this set, and will still return a confident class for non-car images or unseen vehicles. It is not validated for safety-critical or surveillance use.
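Because softmax always spreads probability over the known classes, one common mitigation is to abstain when the top-1 probability is low. The sketch below illustrates that idea only. The function predict_or_abstain and the 0.5 threshold are hypothetical, not part of TwinCar, and the toy labels stand in for the real 195-way output. Softmax confidence is not a reliable out-of-distribution detector, so a threshold like this filters only the most obviously uncertain inputs.

```python
from typing import Optional, Sequence, Tuple


def predict_or_abstain(
    probs: Sequence[float],
    idx_to_class: dict,
    threshold: float = 0.5,  # hypothetical value; tune on held-out data
) -> Tuple[Optional[str], float]:
    """Return (label, confidence), or (None, confidence) when below threshold."""
    best_idx = max(range(len(probs)), key=probs.__getitem__)
    confidence = float(probs[best_idx])
    if confidence < threshold:
        return None, confidence
    return idx_to_class[best_idx], confidence


# Toy 3-class example standing in for the 195-way output.
labels = {0: "BMW_X5_SUV_2007", 1: "Audi_S4_Sedan_2012", 2: "Audi_S5_Coupe_2012"}
print(predict_or_abstain([0.90, 0.06, 0.04], labels))  # confident prediction
print(predict_or_abstain([0.40, 0.35, 0.25], labels))  # abstains
```

With the inference code above, the call would be predict_or_abstain(probs[0].tolist(), idx_to_class). Since training used label smoothing, peak probabilities tend to sit lower than for a model trained on hard targets, which is one more reason to choose the threshold empirically.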

Training data

Dataset: Stanford Cars (naufalso/stanford_cars)
Classes: 195 fine-grained make/model/year categories
Test split: 8,000 images
Class imbalance was handled during training with a WeightedRandomSampler.
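A WeightedRandomSampler draws each training sample with probability proportional to a per-sample weight. The card does not state which weighting was used. A common choice, assumed here, is the inverse frequency of each sample's class, which makes every class equally likely to be drawn. The helper name inverse_frequency_weights and the train_dataset variable in the comment are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(targets: Sequence[int]) -> List[float]:
    """One weight per sample: 1 / (number of samples sharing its class)."""
    counts = Counter(targets)
    return [1.0 / counts[t] for t in targets]


# Toy targets: class 0 has four samples, class 1 has one.
targets = [0, 0, 0, 0, 1]
weights = inverse_frequency_weights(targets)
print(weights)

# With PyTorch installed, the weights would be passed on like this:
#   from torch.utils.data import DataLoader, WeightedRandomSampler
#   sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
#   loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```

In the toy example the four class-0 weights sum to the same total as the single class-1 weight, so both classes are sampled at the same rate on average.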

Training procedure

Fine-tuned end-to-end from ImageNet-pretrained weights with mixed-precision (AMP) on a CUDA GPU.

Hyperparameters

Hyperparameter	Value
Optimizer	AdamW
Learning rate	3e-4
Weight decay	1e-4
Loss	CrossEntropy with label smoothing = 0.1
LR scheduler	ReduceLROnPlateau (mode=min, factor=0.5, patience=2)
Batch size	32
Max epochs	30
Early stopping	patience = 5 (on validation loss)
Image size	224 × 224
Seed	42

Data augmentation (training)

RandomResizedCrop(224, scale=(0.8, 1.0)), RandomHorizontalFlip(p=0.5), RandomRotation(10), ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, hue=0.02), then ToTensor and ImageNet normalization. Validation/test images use Resize(256) → CenterCrop(224) → normalize.

The checkpoint corresponds to the epoch with the lowest validation loss.
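Early stopping with patience 5 and best-checkpoint selection interact as follows: training continues until five consecutive epochs fail to beat the lowest validation loss so far, and the saved weights are those from the best epoch, not the last one. The sketch below shows that bookkeeping in plain Python. The function name and the loss values are made up for illustration, and the actual TwinCar loop may differ in details such as a minimum-improvement margin.

```python
from typing import Sequence, Tuple


def best_epoch_with_early_stopping(
    val_losses: Sequence[float], patience: int = 5
) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on the
    lowest validation loss seen so far; the checkpoint kept is the best epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        last_epoch = epoch
        if loss < best_loss:
            # A real loop would save the model's state_dict at this point.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, last_epoch


# Made-up validation losses, not the values from the actual TwinCar run.
losses = [1.20, 0.95, 0.80, 0.82, 0.81, 0.83, 0.85, 0.84, 0.90, 0.70]
print(best_epoch_with_early_stopping(losses))
```

In this toy run the loop stops before reaching the final, lower loss value, which illustrates the trade-off patience controls: a small patience saves compute but can end training before a late improvement.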

Evaluation results

Evaluated on the Stanford Cars test split (8,000 images, 195 classes).

Metric	Value
Test loss	0.6942
Top-1 accuracy	87.08%
Top-5 accuracy	96.81%
Macro Precision	0.8765
Macro Recall	0.8696
Macro F1	0.8697
Weighted F1	0.8705
Make accuracy	93.03%
Make + Model accuracy	87.58%
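Macro F1 averages the per-class F1 scores with equal weight, while weighted F1 weights each class by its number of test samples. The sketch below computes both on a toy example to make the difference concrete. It is illustrative only: the card does not say how the reported metrics were computed, and the function name f1_scores is hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def f1_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (macro_f1, weighted_f1) for integer class labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    support = Counter(y_true)
    classes = set(y_true) | set(y_pred)
    # Per-class F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero
    # because every class here appears in y_true or y_pred at least once.
    per_class = {c: 2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes}
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted


# Toy labels: class 0 has three samples, class 1 two, class 2 one.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
print(f"macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
```

The two averages diverge when class sizes are uneven. In the table above they are close (0.8697 vs 0.8705), which is what one would expect if the per-class test counts are roughly balanced.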

Comparison with other TwinCar models

Model	Top-1	Top-5	Weighted F1	Make + Model Acc	Size
EfficientNet-B0 v1	83.08%	95.20%	0.8308	83.45%	17.33 MB
EfficientNet-B0 v2	81.64%	94.50%	0.8184	82.19%	17.33 MB
ConvNeXt-Tiny (this model)	87.08%	96.81%	0.8705	87.58%	111.95 MB

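ConvNeXt-Tiny is the strongest model for the core TwinCar goal of distinguishing visually similar makes/models (e.g. Audi S4 vs S5, BMW 3 Series vs M3). Relative to EfficientNet-B0 v1 it gains 4.00 points of Top-1 accuracy (87.08% vs 83.08%) and 4.13 points of Make + Model accuracy (87.58% vs 83.45%), which the project attributes to the stronger backbone. The cost is a weights file roughly 6.5× larger (111.95 MB vs 17.33 MB), so EfficientNet-B0 v1 remains a useful lightweight baseline when deployment size matters.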

Limitations and bias

Results are measured on the Stanford Cars test split. Real-world images — e.g. from drones or ground robots, with varied angles, shadows, reflections, occlusion, or cluttered parking-lot backgrounds — will likely degrade accuracy. The reported numbers are benchmark performance on a curated dataset, not guaranteed production performance. The label space and any demographic/geographic biases are inherited from Stanford Cars (predominantly US-market vehicles up to ~2012).

References

ConvNeXt — A ConvNet for the 2020s (Liu et al., 2022)
Stanford Cars — 3D Object Representations for Fine-Grained Categorization (Krause et al., 2013)
Project repository — github.com/dragicakostoska/TwinCar

License

Released under CC BY-NC 4.0 for educational and research purposes. Note that use is also subject to the terms of the Stanford Cars dataset and the ImageNet-pretrained base weights.
