---
license: mit
library_name: pytorch
tags:
- person-reidentification
- person-reid
- computer-vision
- pytorch
- resnet
- bnneck
- triplet-loss
- multi-camera-tracking
datasets:
- Market-1501
metrics:
- mAP
- Rank-1
---

# MCTrack Re-ID Model

Person re-identification model trained on Market-1501 as part of the **MSML 640 (Computer Vision) Final Project** at the University of Maryland. This model is the appearance backbone used in our multi-camera object tracking system.

## Model Variants

This repo contains two checkpoints:

- **best_60ep.pth** (primary) - trained for 60 epochs. Used as the deployed Re-ID model in our final cross-camera demos due to qualitatively cleaner cluster outputs.
- **best_120ep.pth** (ablation) - same recipe, trained for 120 epochs. Slightly higher Re-ID accuracy but only marginal improvement on downstream tasks (see "Performance" below).

Both checkpoints contain only the model `state_dict`, with no optimizer or scheduler state.

## Architecture

- **Backbone:** ResNet-50 (ImageNet-initialized; final stride changed from 2 to 1 to retain higher-resolution features)
- **Pooling:** Global average pooling
- **Neck:** BNNeck (Batch Normalization neck) - separates triplet-loss features from classification features
- **Embedding dimension:** 256
- **Total parameters:** ~25M

## Training Recipe

| Setting          | Value                                                            |
| ---------------- | ---------------------------------------------------------------- |
| Dataset          | Market-1501 (12,936 train images, 751 train IDs)                 |
| Identity sampler | P=16 IDs x K=4 instances per batch                               |
| Batch size       | 64                                                               |
| Optimizer        | Adam, lr=3.5e-4, weight_decay=5e-4                               |
| LR schedule      | Linear warmup (10 epochs), step decay (x0.1 at epochs 40 and 70) |
| Loss             | Combined CE (label smoothing 0.1) + triplet (soft margin)        |
| Image size       | 256x128                                                          |
| Augmentation     | Random horizontal flip, random erasing                           |

## Performance

### Standalone Re-ID (Market-1501)

| Variant   | mAP   | Rank-1 | Rank-5 | Rank-10 |
| --------- | ----- | ------ | ------ | ------- |
| 60-epoch  | 73.73 | 89.32  | 96.04  | 97.74   |
| 120-epoch | 75.15 | 90.66  | 96.79  | 98.04   |

### Downstream - Single-camera tracking (MOT17, with DeepSORT)

| Variant   | HOTA | MOTA | IDF1 | IDSW |
| --------- | ---- | ---- | ---- | ---- |
| 60-epoch  | 41.1 | 36.5 | 48.6 | 260  |
| 120-epoch | 41.0 | 36.6 | 48.8 | 246  |

### Downstream - Cross-camera tracking (Wildtrack, with ground-plane filter)

| Variant   | IDF1  | IDP   | IDR   |
| --------- | ----- | ----- | ----- |
| 60-epoch  | 16.99 | 23.80 | 13.21 |
| 120-epoch | 18.72 | 26.41 | 14.50 |

## Usage

Download and load with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download
import torch

# Fetch the checkpoint from the Hub (cached locally after the first call).
ckpt_path = hf_hub_download(
    repo_id="blank4hd/mctrack-reid",
    filename="best_60ep.pth",
)

# Load on CPU; the model weights sit under the "state_dict" key.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model_state = state["state_dict"]
```

To use these weights with the model architecture, you need the `ReIDModel` class from the project repository. A minimal standalone loader (`load_reid.py`) is provided alongside this model card with the architecture definition inlined, so the model can be used without cloning the full project.

## Intended Use

This model produces 256-dimensional appearance embeddings for cropped person images. Two crops of the same person are expected to produce embeddings with high cosine similarity; crops of different people are expected to produce embeddings with low similarity.

**Suitable for:**

- Person re-identification in research / academic settings
- Appearance feature extraction in tracking pipelines (e.g., DeepSORT)
- Educational demonstration of metric learning

**Not suitable for:**

- Surveillance applications without explicit consent
- Identification of individuals across populations (high false-positive rate in cross-domain settings)
- Any use where reliability is safety-critical

## Limitations

- **Domain gap.** Trained on Market-1501 (Tsinghua University campus, 6 surveillance cameras). Performance degrades on outdoor pedestrian-square scenes (e.g., Wildtrack), where IDF1 drops to ~17-19% in cross-camera matching.
- **Person crops only.** Expects the input to be a tightly cropped person image. Whole scenes or non-person inputs produce meaningless embeddings.
- **Resolution sensitive.** Trained at 256x128. Significantly different input resolutions will degrade quality.
- **No fairness audit.** Not evaluated for performance disparities across demographic groups.

## Training Details (compute and time)

- **Hardware:** Apple M4 Pro (MPS backend)
- **Per-epoch time:** ~46 seconds
- **Total training time:** 60-epoch ~46 min; 120-epoch ~92 min
- **Memory usage:** ~3 GB unified memory at batch size 64

## Citation

This work was completed for the MSML 640 final project, Spring 2026.

```
Group 9 - MSML 640 Final Project
Multi-Camera Object Tracking with Person Re-Identification
```

## Acknowledgments

- Architecture inspired by Luo et al. ("Bag of Tricks and a Strong Baseline for Deep Person Re-Identification", CVPR Workshop 2019)
- BNNeck design from the same paper
- Triplet loss formulation from Hermans et al. ("In Defense of the Triplet Loss for Person Re-Identification", arXiv 2017)
- Market-1501 dataset from Zheng et al. ("Scalable Person Re-Identification: A Benchmark", ICCV 2015)
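
## Example: Comparing Embeddings

The Intended Use section describes matching by cosine similarity between embeddings. The dependency-free sketch below shows one way that comparison and a gallery ranking could look. The helper names (`l2_normalize`, `cosine_similarity`, `rank_gallery`) are hypothetical and not part of the project code, and the 3-dimensional toy vectors stand in for the real 256-dimensional embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity in [-1, 1]; higher means more alike."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for embeddings (real ones would be 256-dimensional).
query = [1.0, 0.0, 0.0]
gallery = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.1, 0.0],   # nearly parallel to the query
    [-1.0, 0.0, 0.0],  # opposite direction
]
ranking = rank_gallery(query, gallery)
```

In a real pipeline the same ranking would usually be computed in batch with tensor operations; the plain-Python version is only meant to make the arithmetic explicit.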
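
## Example: Learning-Rate Schedule

The schedule listed in the Training Recipe table (base rate 3.5e-4, linear warmup over 10 epochs, x0.1 step decay at epochs 40 and 70) can be written as a plain function. This is a sketch under assumptions: the helper name `lr_at_epoch`, the 0-indexed epoch convention, and the warmup starting at one tenth of the base rate are illustrative choices, not details taken from the project code.

```python
def lr_at_epoch(epoch, base_lr=3.5e-4, warmup_epochs=10,
                milestones=(40, 70), gamma=0.1):
    """Learning rate for a 0-indexed epoch (hypothetical helper)."""
    if epoch < warmup_epochs:
        # Assumed warmup shape: ramp linearly so the last warmup
        # epoch reaches base_lr exactly.
        return base_lr * (epoch + 1) / warmup_epochs
    # Multiply by gamma once for every milestone already reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Rates at a few epochs of interest.
schedule = {e: lr_at_epoch(e) for e in (0, 9, 39, 40, 69, 70)}
```

Under this reading, a 60-epoch run only ever passes the first milestone, so the second decay step would apply to the 120-epoch run alone.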
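
## Example: P x K Identity Sampling

The Training Recipe lists an identity sampler with P=16 IDs and K=4 instances per batch, which gives the batch size of 64. The sketch below illustrates the general idea with the standard library only. The function name `pk_batches`, the choice to drop identities that do not fill a whole batch, and sampling with replacement for identities with fewer than K images are all assumptions made for illustration; the project's actual sampler may differ.

```python
import random
from collections import defaultdict


def pk_batches(labels, p=16, k=4, seed=0):
    """Yield lists of dataset indices, each holding p identities x k images."""
    rng = random.Random(seed)

    # Group dataset indices by person ID.
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)

    ids = list(by_id)
    rng.shuffle(ids)

    # Walk the shuffled IDs in groups of p; leftover IDs are dropped.
    for start in range(0, len(ids) - p + 1, p):
        batch = []
        for pid in ids[start:start + p]:
            pool = by_id[pid]
            if len(pool) >= k:
                batch.extend(rng.sample(pool, k))
            else:
                # Too few images: sample with replacement (an assumption).
                batch.extend(rng.choices(pool, k=k))
        yield batch


# Toy label list: 32 identities with 5 images each.
toy_labels = [i // 5 for i in range(160)]
batches = list(pk_batches(toy_labels))
```

Grouping several images of each identity into every batch is what lets the triplet loss find positive and negative pairs without looking outside the batch.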