---
license: mit
library_name: pytorch
tags:
  - person-reidentification
  - person-reid
  - computer-vision
  - pytorch
  - resnet
  - bnneck
  - triplet-loss
  - multi-camera-tracking
datasets:
  - Market-1501
metrics:
  - mAP
  - Rank-1
---

# MCTrack Re-ID Model

Person re-identification model trained on Market-1501 as part of the
**MSML 640 (Computer Vision) Final Project** at the University of Maryland.
This model is the appearance backbone used in our multi-camera object tracking
system.

## Model Variants

This repo contains two checkpoints:

- **best_60ep.pth** (primary) - trained for 60 epochs.
  Used as the deployed Re-ID model in our final cross-camera demos due to
  qualitatively cleaner cluster outputs.
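- **best_120ep.pth** (ablation) - same recipe, trained for 120 epochs.
  Slightly higher Re-ID accuracy (+1.4 mAP, +1.3 Rank-1 on Market-1501) but
  only marginal gains downstream: MOT17 scores are essentially unchanged and
  Wildtrack IDF1 rises by about 1.7 points (see "Performance" below).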

Both checkpoints contain only the model `state_dict`; no optimizer or
scheduler state is included.

## Architecture

- **Backbone:** ResNet-50 (ImageNet-initialized; final stride changed from
  2 to 1 to retain higher-resolution features)
- **Pooling:** Global average pooling
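- **Neck:** BNNeck (batch-normalization neck) - a BatchNorm layer placed
  between the embedding and the classifier, so the triplet loss acts on the
  pre-BN feature and the classification loss on the post-BN feature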
- **Embedding dimension:** 256
- **Total parameters:** ~25M

## Training Recipe

| Setting          | Value                                                            |
| ---------------- | ---------------------------------------------------------------- |
| Dataset          | Market-1501 (12,936 train images, 751 train IDs)                 |
| Identity sampler | P=16 IDs x K=4 instances per batch                               |
| Batch size       | 64                                                               |
| Optimizer        | Adam, lr=3.5e-4, weight_decay=5e-4                               |
| LR schedule      | Linear warmup (10 epochs), step decay (x0.1 at epochs 40 and 70) |
| Loss             | Combined CE (label smoothing 0.1) + triplet (soft margin)        |
| Image size       | 256x128                                                          |
| Augmentation     | Random horizontal flip, random erasing                           |

## Performance

### Standalone Re-ID (Market-1501)

| Variant   | mAP   | Rank-1 | Rank-5 | Rank-10 |
| --------- | ----- | ------ | ------ | ------- |
| 60-epoch  | 73.73 | 89.32  | 96.04  | 97.74   |
| 120-epoch | 75.15 | 90.66  | 96.79  | 98.04   |
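
For readers unfamiliar with these metrics, the sketch below shows how mAP and Rank-k are typically computed from ranked gallery lists. It is an illustration, not the project's evaluation script: the function names are hypothetical, and it omits the Market-1501 protocol step of discarding same-camera and junk gallery images before ranking.

```python
def average_precision(matches):
    """AP for one query.

    matches[i] is True if the i-th ranked gallery image has the query's ID.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    return precision_sum / hits if hits else 0.0


def rank_k_hit(matches, k):
    """True if at least one correct match appears in the top k results."""
    return any(matches[:k])


def evaluate(all_matches, ks=(1, 5, 10)):
    """Average per-query AP and Rank-k hits, reported as percentages."""
    n = len(all_matches)
    results = {"mAP": 100 * sum(average_precision(m) for m in all_matches) / n}
    for k in ks:
        hits = sum(rank_k_hit(m, k) for m in all_matches)
        results[f"Rank-{k}"] = 100 * hits / n
    return results


# Two toy queries, each with a ranked list of three gallery images.
toy = [[True, False, True], [False, True, False]]
print(evaluate(toy))
```

In the toy example the first query has AP = (1/1 + 2/3) / 2 and the second has AP = 1/2, so only one of the two queries counts as a Rank-1 hit.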

### Downstream - Single-camera tracking (MOT17, with DeepSORT)

| Variant   | HOTA | MOTA | IDF1 | IDSW |
| --------- | ---- | ---- | ---- | ---- |
| 60-epoch  | 41.1 | 36.5 | 48.6 | 260  |
| 120-epoch | 41.0 | 36.6 | 48.8 | 246  |

### Downstream - Cross-camera tracking (Wildtrack, with ground-plane filter)

| Variant   | IDF1  | IDP   | IDR   |
| --------- | ----- | ----- | ----- |
| 60-epoch  | 16.99 | 23.80 | 13.21 |
| 120-epoch | 18.72 | 26.41 | 14.50 |
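
IDF1 is the harmonic mean of ID precision (IDP) and ID recall (IDR), so the three columns can be cross-checked against each other. The short sketch below (the helper name `idf1` is only illustrative) reproduces the IDF1 column from the other two.

```python
def idf1(idp, idr):
    """Harmonic mean of ID precision and ID recall (same units as inputs)."""
    return 2 * idp * idr / (idp + idr)


for name, idp, idr in [("60-epoch", 23.80, 13.21), ("120-epoch", 26.41, 14.50)]:
    print(name, round(idf1(idp, idr), 2))
# 60-epoch 16.99
# 120-epoch 18.72
```

In both rows IDR is well below IDP, so recall is the term that holds the cross-camera IDF1 down.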

## Usage

Download and load with `huggingface_hub`:

```python
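from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub (cached locally after the first call).
ckpt_path = hf_hub_download(
    repo_id="blank4hd/mctrack-reid",
    filename="best_60ep.pth",
)

# weights_only=False unpickles arbitrary objects; only load files you trust.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
# The model weights are stored under the "state_dict" key.
model_state = state["state_dict"]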
```

To run inference, load these weights into the `ReIDModel` class from the
project repository. A minimal standalone loader (`load_reid.py`), with the
architecture definition inlined, is provided alongside this model card so the
model can be used without cloning the full project.

## Intended Use

This model produces 256-dimensional appearance embeddings for cropped person
images. Two crops of the same person are expected to produce embeddings with
high cosine similarity; crops of different people produce embeddings with low
similarity.
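
Comparing two embeddings by cosine similarity can be sketched in plain Python as follows. The function names and the toy 3-dimensional vectors are illustrative only; real embeddings from this model have 256 dimensions, and the card does not prescribe a particular similarity threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy vectors: gallery[1] points in nearly the same direction as the query.
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(rank_gallery(query, gallery))  # [1, 0, 2]
```

In a tracking pipeline the same ranking is usually done in batch on L2-normalized embeddings, where cosine similarity reduces to a dot product.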

**Suitable for:**

- Person re-identification in research / academic settings
- Appearance feature extraction in tracking pipelines (e.g., DeepSORT)
- Educational demonstration of metric learning

**Not suitable for:**

- Surveillance applications without explicit consent
- Identification of individuals across populations (high false-positive rate
  in cross-domain settings)
- Any use where reliability is safety-critical

## Limitations

- **Domain gap.** Trained on Market-1501 (Tsinghua University campus, 6
  surveillance cameras). Performance degrades on outdoor pedestrian-square
  scenes (e.g., Wildtrack), where IDF1 drops to ~17-19% in cross-camera
  matching.
- **Person crops only.** Expects a tightly cropped person image as input.
  Whole scenes or non-person inputs produce meaningless embeddings.
- **Resolution sensitive.** Trained at 256x128. Significantly different input
  resolutions will degrade quality.
- **No fairness audit.** Not evaluated for performance disparities across
  demographic groups.

## Training Details (compute and time)

- **Hardware:** Apple M4 Pro (MPS backend)
- **Per-epoch time:** ~46 seconds
- **Total training time:** 60-epoch ~46 min; 120-epoch ~92 min
- **Memory usage:** ~3 GB unified memory at batch size 64

## Citation

This work was completed for the MSML 640 final project, Spring 2026.

```
Group 9 - MSML 640 Final Project
Multi-Camera Object Tracking with Person Re-Identification
```

## Acknowledgments

- Architecture inspired by Luo et al. ("Bag of Tricks and a Strong Baseline
  for Deep Person Re-Identification", CVPR Workshop 2019)
- BNNeck design from the same paper
- Triplet loss formulation from Hermans et al. ("In Defense of the Triplet
  Loss for Person Re-Identification", arXiv 2017)
- Market-1501 dataset from Zheng et al. ("Scalable Person Re-Identification:
  A Benchmark", ICCV 2015)