TwinCar: ConvNeXt-Tiny (Stanford Cars)
A fine-grained vehicle make/model classifier built on ConvNeXt-Tiny, fully fine-tuned on the Stanford Cars dataset (195 classes). This is the best-performing model from the TwinCar project, which compares several CNN and Vision Transformer architectures for automotive classification.
Given an image of a car, the model predicts the vehicle's make, model, and year as a single fine-grained class (e.g. BMW_X5_SUV_2007).
Model details
| | |
|---|---|
| Architecture | ConvNeXt-Tiny (torchvision.models.convnext_tiny) |
| Initialization | ImageNet-1k pretrained weights (ConvNeXt_Tiny_Weights.IMAGENET1K_V1) |
| Fine-tuning | Full network (single-phase, no frozen layers) |
| Classes | 195 fine-grained make/model/year categories |
| Input size | 224 × 224 RGB |
| Framework | PyTorch |
| Weights file | best_model.pt (raw state_dict, ~112 MB) |
The classifier head is the standard ConvNeXt-Tiny head with the final Linear layer replaced by a Linear(in_features, 195). Only the final layer is reshaped; LayerNorm and Flatten are kept as in the original head.
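As a rough sanity check on the ~112 MB weights file, the size can be estimated from the parameter count. The sketch below assumes torchvision's published figure of 28,589,128 parameters for the 1000-class ConvNeXt-Tiny, 768 input features on the final Linear layer, and float32 storage. None of these numbers come from the TwinCar repository itself.

```python
# Rough size estimate for ConvNeXt-Tiny with a 195-class head.
# Assumptions (not taken from the TwinCar repo): torchvision reports
# 28,589,128 parameters for the 1000-class model, the final Linear layer
# has 768 input features, and weights are stored as float32 (4 bytes).
IMAGENET_PARAMS = 28_589_128
IN_FEATURES = 768
BYTES_PER_PARAM = 4


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def estimate_params(num_classes: int) -> int:
    """Swap the 1000-class head for a num_classes head."""
    return (IMAGENET_PARAMS
            - linear_params(IN_FEATURES, 1000)
            + linear_params(IN_FEATURES, num_classes))


params = estimate_params(195)
size_mb = params * BYTES_PER_PARAM / 1e6
print(f"{params:,} parameters, about {size_mb:.1f} MB")
```

Under these assumptions the estimate lands close to the reported file size; the small remainder is plausibly serialization overhead.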
Files in this repository
| File | Description |
|---|---|
| best_model.pt | Model weights, saved as a plain state_dict |
| idx_to_class.json | Maps class index → label string (e.g. 42 → "BMW_X5_SUV_2007") |
| train_config.json | Training hyperparameters and preprocessing constants |
| README.md | This model card |
Note on the weights format. best_model.pt is a raw state_dict saved with torch.save(model.state_dict(), ...), not a wrapped checkpoint. Load it directly with model.load_state_dict(...) as shown below.
How to use
import json
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms
from huggingface_hub import hf_hub_download
REPO_ID = "cherky15/twincars_convnext"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 1. Download files from the Hub
weights_path = hf_hub_download(REPO_ID, "best_model.pt")
labels_path = hf_hub_download(REPO_ID, "idx_to_class.json")
config_path = hf_hub_download(REPO_ID, "train_config.json")
with open(labels_path) as f:
    idx_to_class = {int(k): v for k, v in json.load(f).items()}
with open(config_path) as f:
    config = json.load(f)
num_classes = len(idx_to_class) # 195
# 2. Rebuild the architecture and load the weights
model = models.convnext_tiny(weights=None)
in_features = model.classifier[2].in_features
model.classifier[2] = nn.Linear(in_features, num_classes)
state_dict = torch.load(weights_path, map_location=device)
model.load_state_dict(state_dict)
model.to(device).eval()
# 3. Preprocess (must match validation/test transforms used in training)
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(config["img_size"]),  # 224
    transforms.ToTensor(),
    transforms.Normalize(mean=config["imagenet_mean"],  # ImageNet stats
                         std=config["imagenet_std"]),
])
# 4. Predict
image = Image.open("car.jpg").convert("RGB")
x = transform(image).unsqueeze(0).to(device)
with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)
    top5_prob, top5_idx = probs.topk(5, dim=1)
for prob, idx in zip(top5_prob[0], top5_idx[0]):
    label = idx_to_class[int(idx)].replace("_", " ")
    print(f"{label:40s} {prob.item():.3f}")
For batch inference over a folder of images (with make/model/year parsing and CSV export), see batch_predict.py in the TwinCar repository.
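For readers who only need the make/model/year split, a minimal sketch follows. It is not the logic of batch_predict.py. The helper name parse_label and the CarLabel type are hypothetical, and the sketch assumes the make is a single token and the year is a trailing four-digit token, as in BMW_X5_SUV_2007.

```python
import re
from typing import NamedTuple, Optional


class CarLabel(NamedTuple):
    make: str
    model: str
    year: Optional[int]


def parse_label(label: str) -> CarLabel:
    """Split 'Make_Model_..._Year' into its parts.

    Assumption: the make is the first token and the year, if present,
    is a trailing 4-digit token. Multi-word makes would need a lookup
    table, which this sketch does not provide.
    """
    tokens = label.split("_")
    year = None
    if len(tokens) > 1 and re.fullmatch(r"\d{4}", tokens[-1]):
        year = int(tokens[-1])
        tokens = tokens[:-1]
    make, model = tokens[0], " ".join(tokens[1:])
    return CarLabel(make, model, year)


print(parse_label("BMW_X5_SUV_2007"))
```

Labels whose make spans several tokens would be split incorrectly by this rule, so any real use should check the output against idx_to_class.json first.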
Intended use
Intended: fine-grained recognition of car make/model/year on clean, well-framed images similar in distribution to Stanford Cars (single centered vehicle, mostly clear backgrounds), and as a research/educational baseline for comparing architectures on fine-grained classification.
Out of scope: the model only knows the 195 Stanford Cars categories. It cannot recognize makes/models outside this set, and will still return a confident class for non-car images or unseen vehicles. It is not validated for safety-critical or surveillance use.
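One simple mitigation is to abstain when the top probability is low. The sketch below is an assumption, not part of the released model: the function name and the 0.5 threshold are hypothetical and have not been validated on this classifier. Since the model can be confidently wrong on non-car images, thresholding only filters the uncertain cases.

```python
from typing import Optional, Sequence, Tuple


def predict_or_abstain(
    probs: Sequence[float], threshold: float = 0.5
) -> Tuple[Optional[int], float]:
    """Return (class_index, confidence), or (None, confidence) if unsure.

    `probs` is one row of softmax output; `threshold` is a hypothetical
    cut-off that would need tuning on held-out data.
    """
    best_idx = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best_idx]
    if confidence < threshold:
        return None, confidence
    return best_idx, confidence


print(predict_or_abstain([0.10, 0.70, 0.20]))  # confident prediction
print(predict_or_abstain([0.40, 0.35, 0.25]))  # below threshold, abstains
```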
Training data
- Dataset: Stanford Cars (naufalso/stanford_cars)
- Classes: 195 fine-grained make/model/year categories
- Test split: 8,000 images
- Class imbalance was handled during training with a WeightedRandomSampler.
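The card does not say how the sampler weights were computed. A common choice, shown here purely as an assumption, is inverse class frequency: each sample gets a weight of one over the number of samples in its class. The function name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[int]) -> List[float]:
    """One weight per sample: 1 / (number of samples in that class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
# Each class now carries the same total weight (1.0), so sampling with
# replacement draws the two classes with roughly equal probability.
print(weights)
```

Such a list could then be passed to torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights), replacement=True) and handed to the DataLoader through its sampler argument.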
Training procedure
Fine-tuned end-to-end from ImageNet-pretrained weights with mixed-precision (AMP) on a CUDA GPU.
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Weight decay | 1e-4 |
| Loss | CrossEntropy with label smoothing = 0.1 |
| LR scheduler | ReduceLROnPlateau (mode=min, factor=0.5, patience=2) |
| Batch size | 32 |
| Max epochs | 30 |
| Early stopping | patience = 5 (on validation loss) |
| Image size | 224 × 224 |
| Seed | 42 |
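To make the loss row concrete, here is a pure-Python sketch of cross-entropy with label smoothing for a single example. It assumes the usual definition, in which the one-hot target is mixed with a uniform distribution over the K classes; this is also how PyTorch documents the label_smoothing argument of CrossEntropyLoss. The function name is hypothetical.

```python
import math
from typing import Sequence


def smoothed_cross_entropy(
    logits: Sequence[float], target: int, smoothing: float = 0.1
) -> float:
    """Cross-entropy against (1 - smoothing) * one_hot + smoothing / K."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]

    nll_target = -log_probs[target]                 # standard CE term
    nll_uniform = -sum(log_probs) / len(log_probs)  # CE against uniform
    return (1.0 - smoothing) * nll_target + smoothing * nll_uniform


print(smoothed_cross_entropy([2.0, 0.0, -1.0], target=0))
```

With smoothing = 0 the function reduces to plain cross-entropy. With smoothing > 0 a confident correct prediction is still penalized slightly, which discourages over-confident logits.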
Data augmentation (training)
RandomResizedCrop(224, scale=(0.8, 1.0)), RandomHorizontalFlip(p=0.5), RandomRotation(10), ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1, hue=0.02), then ToTensor and ImageNet normalization. Validation/test images use Resize(256) → CenterCrop(224) → normalize.
The checkpoint corresponds to the epoch with the lowest validation loss.
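The interaction between the LR scheduler, early stopping, and best-checkpoint selection can be sketched as a small simulation over a list of validation losses. This is a simplified model, not the TwinCar training loop: it treats any strict decrease as an improvement (PyTorch's ReduceLROnPlateau also applies a small threshold), and it assumes early stopping triggers after 5 consecutive epochs without improvement. The function name and the loss values are made up.

```python
from typing import List, Optional, Tuple


def simulate_training(
    val_losses: List[float],
    lr: float = 3e-4,
    factor: float = 0.5,
    lr_patience: int = 2,
    stop_patience: int = 5,
) -> Tuple[List[float], int, Optional[int]]:
    """Return (lr after each epoch, best epoch, stop epoch or None)."""
    best_loss = float("inf")
    best_epoch = 0
    bad_lr = 0    # epochs without improvement since last LR cut
    bad_stop = 0  # epochs without improvement since the best epoch
    lr_history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: this is the epoch whose weights would be saved.
            best_loss, best_epoch = loss, epoch
            bad_lr = bad_stop = 0
        else:
            bad_lr += 1
            bad_stop += 1
        if bad_lr > lr_patience:
            lr *= factor  # plateau detected: cut the learning rate
            bad_lr = 0
        lr_history.append(lr)
        if bad_stop >= stop_patience:
            return lr_history, best_epoch, epoch
    return lr_history, best_epoch, None


# Illustrative loss curve (made-up values).
losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95]
lr_history, best_epoch, stop_epoch = simulate_training(losses)
print(lr_history, best_epoch, stop_epoch)
```

In this toy run the learning rate is halved once, training stops before the last listed epoch, and the saved checkpoint would be the one from the second epoch.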
Evaluation results
Evaluated on the Stanford Cars test split (8,000 images, 195 classes).
| Metric | Value |
|---|---|
| Test loss | 0.6942 |
| Top-1 accuracy | 87.08% |
| Top-5 accuracy | 96.81% |
| Macro Precision | 0.8765 |
| Macro Recall | 0.8696 |
| Macro F1 | 0.8697 |
| Weighted F1 | 0.8705 |
| Make accuracy | 93.03% |
| Make + Model accuracy | 87.58% |
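The make-level and make + model metrics can be derived from the fine-grained predictions by comparing only part of each label. The sketch below assumes labels follow the Make_Model_..._Year pattern with a single-token make and a trailing year. The helper names and the three example labels are hypothetical, and this is not necessarily how the TwinCar evaluation script computes them.

```python
from typing import Callable, Sequence


def accuracy(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    key: Callable[[str], str] = lambda s: s,
) -> float:
    """Fraction of pairs whose labels match after applying `key`."""
    hits = sum(key(t) == key(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def make_of(label: str) -> str:
    """First token, assumed to be the make."""
    return label.split("_")[0]


def make_model_of(label: str) -> str:
    """Everything except the trailing token, assumed to be the year."""
    return label.rsplit("_", 1)[0]


y_true = ["BMW_X5_SUV_2007", "Audi_S4_Sedan_2012", "Audi_S5_Coupe_2012"]
y_pred = ["BMW_X5_SUV_2007", "Audi_S4_Sedan_2007", "Audi_S4_Sedan_2012"]

print("top-1      :", accuracy(y_true, y_pred))
print("make       :", accuracy(y_true, y_pred, key=make_of))
print("make+model :", accuracy(y_true, y_pred, key=make_model_of))
```

A prediction with the wrong year still counts as a make + model hit, which is why that metric can exceed Top-1 accuracy, as it does in the table above.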
Comparison with other TwinCar models
| Model | Top-1 | Top-5 | Weighted F1 | Make + Model Acc | Size |
|---|---|---|---|---|---|
| EfficientNet-B0 v1 | 83.08% | 95.20% | 0.8308 | 83.45% | 17.33 MB |
| EfficientNet-B0 v2 | 81.64% | 94.50% | 0.8184 | 82.19% | 17.33 MB |
| ConvNeXt-Tiny (this model) | 87.08% | 96.81% | 0.8705 | 87.58% | 111.95 MB |
ConvNeXt-Tiny is the strongest model for the core TwinCar goal of distinguishing visually similar makes/models (e.g. Audi S4 vs S5, BMW 3 Series vs M3), thanks to a stronger backbone, at the cost of a larger model size. EfficientNet-B0 v1 remains a useful lightweight baseline (~6× smaller) when deployment size matters.
Limitations and bias
Results are measured on the Stanford Cars test split. Accuracy will likely degrade on real-world images (e.g. from drones or ground robots) with varied angles, shadows, reflections, occlusion, or cluttered parking-lot backgrounds. The reported numbers are benchmark performance on a curated dataset, not guaranteed production performance. The label space and any demographic/geographic biases are inherited from Stanford Cars (predominantly US-market vehicles up to ~2012).
References
- ConvNeXt: A ConvNet for the 2020s (Liu et al., 2022)
- Stanford Cars: 3D Object Representations for Fine-Grained Categorization (Krause et al., 2013)
- Project repository: github.com/dragicakostoska/TwinCar
License
Released under CC BY-NC 4.0 for educational and research purposes. Note that use is also subject to the terms of the Stanford Cars dataset and the ImageNet-pretrained base weights.