MCTrack Re-ID Model
Person re-identification model trained on Market-1501 as part of the MSML 640 (Computer Vision) Final Project at the University of Maryland. This model is the appearance backbone used in our multi-camera object tracking system.
Model Variants
This repo contains two checkpoints:
- best_60ep.pth (primary) - trained for 60 epochs. Used as the deployed Re-ID model in our final cross-camera demos due to qualitatively cleaner cluster outputs.
- best_120ep.pth (ablation) - same recipe, trained for 120 epochs. Slightly higher Re-ID accuracy (+1.42 mAP, +1.34 Rank-1) but essentially unchanged single-camera tracking metrics and only a modest cross-camera gain (see "Performance" below).
Both checkpoints contain only the model weights (stored under the "state_dict" key), with no optimizer or scheduler state.
Architecture
- Backbone: ResNet-50 (ImageNet-initialized; final stride changed from 2 to 1 to retain higher-resolution features)
- Pooling: Global average pooling
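- Neck: BNNeck (batch-normalization neck) - the triplet loss is applied to the features before the BN layer and the classification loss to the features after it, so the two objectives do not constrain the same embedding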
- Embedding dimension: 256
- Total parameters: ~25M
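To see what the final-stride change buys, the spatial size of the last feature map can be worked out by hand. The sketch below assumes the standard ResNet-50 layout (7x7 stem convolution, 3x3 max-pool, four stages) and the 256x128 input size from the training recipe. `conv_out` and `resnet50_feature_map` are illustrative helpers, not part of the project code.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet50_feature_map(height, width, last_stride):
    """Spatial size of the final ResNet-50 feature map before global pooling."""
    # (kernel, stride, padding) of the layers that can change resolution
    stages = [
        (7, 2, 3),            # conv1 stem
        (3, 2, 1),            # max-pool
        (3, 1, 1),            # layer1 (no downsampling)
        (3, 2, 1),            # layer2
        (3, 2, 1),            # layer3
        (3, last_stride, 1),  # layer4 (stride 2 by default, 1 in this model)
    ]
    for kernel, stride, padding in stages:
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
    return height, width


print(resnet50_feature_map(256, 128, last_stride=2))  # (8, 4)
print(resnet50_feature_map(256, 128, last_stride=1))  # (16, 8)
```

With the final stride set to 1, global average pooling therefore averages over four times as many spatial positions (16x8 instead of 8x4).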
Training Recipe
| Setting | Value |
|---|---|
| Dataset | Market-1501 (12,936 train images, 751 train IDs) |
| Identity sampler | P=16 IDs x K=4 instances per batch |
| Batch size | 64 |
| Optimizer | Adam, lr=3.5e-4, weight_decay=5e-4 |
| LR schedule | Linear warmup (10 epochs), step decay (x0.1 at epochs 40 and 70; the 60-epoch run ends before the second decay) |
| Loss | Combined CE (label smoothing 0.1) + triplet (soft margin) |
| Image size | 256x128 |
| Augmentation | Random horizontal flip, random erasing |
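The learning-rate schedule in the table can be sketched as a plain function of the epoch index. This is a minimal illustration, not the project's training code. It assumes zero-indexed epochs, a warmup that starts at 10% of the base rate (the hypothetical `warmup_start_factor`), and decays that take effect at the start of the milestone epoch.

```python
def lr_at_epoch(epoch, base_lr=3.5e-4, warmup_epochs=10,
                warmup_start_factor=0.1, milestones=(40, 70), gamma=0.1):
    """Learning rate for a given epoch: linear warmup, then step decay."""
    if epoch < warmup_epochs:
        # Linearly interpolate from warmup_start_factor * base_lr up to base_lr
        progress = epoch / warmup_epochs
        factor = warmup_start_factor + (1.0 - warmup_start_factor) * progress
        return base_lr * factor
    # Multiply by gamma once for every milestone already reached
    decays = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** decays


for epoch in (0, 5, 10, 39, 40, 59, 70, 119):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.2e}")
```

Under these assumptions a 60-epoch run (epochs 0 to 59) only ever sees the first decay at epoch 40, while a 120-epoch run also reaches the second decay at epoch 70.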
Performance
Standalone Re-ID (Market-1501)
| Variant | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|
| 60-epoch | 73.73 | 89.32 | 96.04 | 97.74 |
| 120-epoch | 75.15 | 90.66 | 96.79 | 98.04 |
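For readers unfamiliar with the columns above, the following is a simplified sketch of how Rank-k and average precision are computed for a single query. It assumes the gallery has already been sorted by descending similarity to the query (`ranked_ids` is a hypothetical list of gallery identity labels), and it omits the Market-1501 protocol's filtering of same-camera and junk images.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """Return True if the query identity appears among the top-k gallery results."""
    return query_id in ranked_ids[:k]


def average_precision(ranked_ids, query_id):
    """Average precision for one query over a similarity-ranked gallery."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)  # precision at each correct match
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking for a query of identity "a" (most similar gallery item first)
ranking = ["b", "a", "c", "a"]
print(rank_k_hit(ranking, "a", k=1))    # False
print(rank_k_hit(ranking, "a", k=5))    # True
print(average_precision(ranking, "a"))  # 0.5
```

Rank-k is the fraction of queries for which `rank_k_hit` is true, and mAP is the mean of `average_precision` over all queries.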
Downstream - Single-camera tracking (MOT17, with DeepSORT)
| Variant | HOTA | MOTA | IDF1 | IDSW |
|---|---|---|---|---|
| 60-epoch | 41.1 | 36.5 | 48.6 | 260 |
| 120-epoch | 41.0 | 36.6 | 48.8 | 246 |
Downstream - Cross-camera tracking (Wildtrack, with ground-plane filter)
| Variant | IDF1 | IDP | IDR |
|---|---|---|---|
| 60-epoch | 16.99 | 23.80 | 13.21 |
| 120-epoch | 18.72 | 26.41 | 14.50 |
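IDF1 is the harmonic mean of identity precision (IDP) and identity recall (IDR), so the first column of the table can be recomputed from the other two. The helper name `idf1_from_idp_idr` is illustrative.

```python
def idf1_from_idp_idr(idp, idr):
    """Harmonic mean of identity precision and identity recall."""
    if idp + idr == 0:
        return 0.0
    return 2.0 * idp * idr / (idp + idr)


# Values taken from the Wildtrack table, in percent
print(round(idf1_from_idp_idr(23.80, 13.21), 2))  # 16.99 (60-epoch)
print(round(idf1_from_idp_idr(26.41, 14.50), 2))  # 18.72 (120-epoch)
```

Because IDR is well below IDP for both variants, the low IDF1 is driven mainly by missed identity matches rather than by wrong ones.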
Usage
Download and load with huggingface_hub:
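```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub (cached locally after the first call)
ckpt_path = hf_hub_download(
    repo_id="blank4hd/mctrack-reid",
    filename="best_60ep.pth",
)
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model_state = state["state_dict"]  # model weights only; no optimizer or scheduler state
```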
To load these weights into a model, you need the ReIDModel class from
the project repository. A minimal standalone loader (load_reid.py) is
provided alongside this model card with the architecture definition inlined,
so the model can be used without cloning the full project.
Intended Use
This model produces 256-dimensional appearance embeddings for cropped person images. Two crops of the same person are expected to produce embeddings with high cosine similarity; crops of different people produce embeddings with low similarity.
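As a minimal sketch of how two embeddings would be compared, the function below computes cosine similarity in plain Python. The short vectors are toy stand-ins for the 256-dimensional embeddings, and any decision threshold on the score would have to be tuned on your own data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-D "embeddings" standing in for real 256-D model outputs
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

If embeddings are L2-normalized once after extraction, the same score reduces to a plain dot product, which is cheaper when comparing many crops.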
Suitable for:
- Person re-identification in research / academic settings
- Appearance feature extraction in tracking pipelines (e.g., DeepSORT)
- Educational demonstration of metric learning
Not suitable for:
- Surveillance applications without explicit consent
- Identification of individuals across populations (high false-positive rate in cross-domain settings)
- Any use where reliability is safety-critical
Limitations
- Domain gap. Trained on Market-1501 (Tsinghua University campus, 6 surveillance cameras). Performance degrades on scenes that differ from the training data, such as the Wildtrack pedestrian square, where cross-camera IDF1 reaches only ~17-19%.
- Person crops only. Expects the input to be a tightly-cropped person image. Whole scenes or non-person inputs produce meaningless embeddings.
- Resolution sensitive. Trained at 256x128. Significantly different input resolutions will degrade quality.
- No fairness audit. Not evaluated for performance disparities across demographic groups.
Training Details (compute and time)
- Hardware: Apple M4 Pro (MPS backend)
- Per-epoch time: ~46 seconds
- Total training time: 60-epoch ~46 min; 120-epoch ~92 min
- Memory usage: ~3 GB unified memory at batch size 64
Citation
This work was completed for the MSML 640 final project, Spring 2026.
Group 9 - MSML 640 Final Project
Multi-Camera Object Tracking with Person Re-Identification
Acknowledgments
- Architecture inspired by Luo et al. ("Bag of Tricks and a Strong Baseline for Deep Person Re-Identification", CVPR Workshop 2019)
- BNNeck design from the same paper
- Triplet loss formulation from Hermans et al. ("In Defense of the Triplet Loss for Person Re-Identification", arXiv 2017)
- Market-1501 dataset from Zheng et al. ("Scalable Person Re-Identification: A Benchmark", ICCV 2015)