---
license: mit
library_name: pytorch
tags:
- computer-vision
- object-tracking
- rgb-event
- event-camera
- cvpr-2024
- sdstrack
datasets:
- krisspy39/visevent
---

# SDSTrack — RGB-E Checkpoint for VisEvent

This repository contains the **RGB-E (RGB-Event)** checkpoint for [SDSTrack](https://github.com/hoqolo/SDSTrack), a multi-modal single-object tracker built on self-distillation symmetric adapter learning (CVPR 2024).

## Model Details

| Attribute | Value |
|-----------|-------|
| **Tracker** | SDSTrack (Self-Distillation Symmetric Adapter Learning) |
| **Backbone** | ViT-B (Vision Transformer, base size) with MAE pretraining |
| **Modality** | RGB-E (RGB + Event camera) |
| **Dataset** | VisEvent |
| **Config** | `cvpr2024_rgbe` |
| **Training epochs** | 50 |
| **Batch size** | 16 |
| **Learning rate** | 1e-4 |
| **Expected Success AUC** | ~0.62 |
| **Expected Precision @ 20px** | ~0.74 |

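The two expected metrics in the table follow the usual single-object-tracking definitions: success is the fraction of frames whose IoU with the ground truth exceeds a threshold, averaged over thresholds in [0, 1], and precision is the fraction of frames whose center error is at most 20 pixels. The sketch below illustrates those definitions in plain Python. The function names, the `(x, y, w, h)` box layout, and the 21 evenly spaced thresholds are assumptions for illustration; the upstream evaluation toolkit may differ in details.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance in pixels between the two box centers."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def success_auc(preds, gts, num_thresholds=21):
    """Mean success rate over IoU thresholds 0, 0.05, ..., 1 (assumed spacing)."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def precision_at(preds, gts, pixels=20.0):
    """Fraction of frames whose center error is at most `pixels`."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= pixels for e in errors) / len(errors)
```

Both functions take per-frame lists of predicted and ground-truth boxes for one sequence; benchmark-level numbers are usually obtained by averaging over sequences.
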
## Architecture

- **Symmetric adapters**: Lightweight parameter-efficient modules for each modality branch
- **Cross-modal fusion**: Symmetric fusion before the transformer head
- **Self-distillation**: Complementary masked patch distillation between RGB and event tokens
- **Foundation model**: OSTrack (One-stream transformer tracker) with ViT-B backbone

## Files

| File | Size | Description |
|------|------|-------------|
| `SDSTrack_cvpr2024_rgbe.pth.tar` | ~490 MB | Trained checkpoint for RGB-E evaluation on VisEvent |

## Usage

### Loading the checkpoint in Python

```python
from huggingface_hub import hf_hub_download
import torch

checkpoint_path = hf_hub_download(
    repo_id="krisspy39/sdstrack-rgbe",
    filename="SDSTrack_cvpr2024_rgbe.pth.tar",
    repo_type="model"
)

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
```
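
After loading, it can help to check what the checkpoint actually contains before handing it to the tracker. The helper below is a minimal sketch: the candidate keys (`"net"`, `"state_dict"`, `"model"`) are assumptions based on common PyTorch training code, not a documented layout of this file, and both function names are hypothetical.

```python
def extract_state_dict(checkpoint, candidate_keys=("net", "state_dict", "model")):
    """Return the weight mapping from a loaded checkpoint object.

    The candidate keys are guesses; if none matches, the checkpoint
    itself is assumed to be the state dict.
    """
    if not isinstance(checkpoint, dict):
        raise TypeError(
            f"expected a dict-like checkpoint, got {type(checkpoint).__name__}"
        )
    for key in candidate_keys:
        if isinstance(checkpoint.get(key), dict):
            return checkpoint[key]
    return checkpoint


def summarize(state_dict, limit=5):
    """List the first `limit` entries as 'name: shape' strings."""
    lines = []
    for name, value in list(state_dict.items())[:limit]:
        # Tensors expose .shape; anything else is reported by type name.
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        lines.append(f"{name}: {shape}")
    return lines
```

With the `checkpoint` object from the snippet above, `print("\n".join(summarize(extract_state_dict(checkpoint))))` prints the first few parameter names and shapes.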

### Evaluation with upstream code

1. Clone [SDSTrack](https://github.com/hoqolo/SDSTrack):
```bash
git clone https://github.com/hoqolo/SDSTrack.git
cd SDSTrack
```

2. Download the pretrained OSTrack foundation model to `./pretrained/vitb_256_mae_ce_32x4_ep300/OSTrack_ep0300.pth.tar`

3. Symlink or copy this checkpoint to `./models/SDSTrack_cvpr2024_rgbe.pth.tar`

4. Run evaluation:
```bash
python ./RGBE_workspace/test_rgbe_mgpus.py \
  --script_name sdstrack \
  --num_gpus 1 \
  --threads 4 \
  --epoch 50 \
  --yaml_name cvpr2024_rgbe
```
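
Before launching the evaluation, a quick check that the files from steps 2 to 4 are in place can save a failed run. This is a minimal sketch that uses only the paths named in the steps above; the function name `missing_files` is hypothetical, and the script may need other files (dataset paths, local config) that are not checked here.

```python
from pathlib import Path

# Paths taken from the steps above, relative to the SDSTrack checkout.
REQUIRED_FILES = (
    "pretrained/vitb_256_mae_ce_32x4_ep300/OSTrack_ep0300.pth.tar",
    "models/SDSTrack_cvpr2024_rgbe.pth.tar",
    "RGBE_workspace/test_rgbe_mgpus.py",
)


def missing_files(repo_root=".", required=REQUIRED_FILES):
    """Return the required relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).exists()]
```

Calling `missing_files()` from the repository root returns an empty list when everything is in place. A symlinked checkpoint counts as present only if its target exists.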

## Dataset

This checkpoint is trained and evaluated on **[VisEvent](https://github.com/wangxiao5791509/VisEvent_SOT_Benchmark)**, a large-scale RGB-Event single object tracking benchmark.

- **Train**: 120 sequences
- **Test**: ~302 sequences
- **Data format**: Each sequence contains `vis_imgs/` (RGB frames), `event_imgs/` (event frames), `groundtruth.txt`, and `absent_label.txt`

The VisEvent dataset is also available as a webdataset on Hugging Face: [krisspy39/visevent](https://huggingface.co/datasets/krisspy39/visevent)

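The per-sequence layout listed above can be read with a few lines of standard-library Python. This is a sketch under stated assumptions: one box per line in `groundtruth.txt` with four numbers in `x, y, w, h` order (comma or whitespace separated), and one 0/1 flag per line in `absent_label.txt`. Neither detail is confirmed by this card, and `load_sequence` is a hypothetical helper, not part of the upstream code.

```python
from pathlib import Path


def _parse_numbers(line):
    # Accept comma-, tab-, or space-separated values.
    return [float(tok) for tok in line.replace(",", " ").split()]


def _frames(folder):
    # Sorted file list, or an empty list if the folder is missing.
    return sorted(p for p in folder.iterdir() if p.is_file()) if folder.is_dir() else []


def load_sequence(seq_dir):
    """Collect frame paths and annotations for one sequence directory."""
    seq = Path(seq_dir)
    gt_lines = (seq / "groundtruth.txt").read_text().splitlines()
    boxes = [_parse_numbers(line) for line in gt_lines if line.strip()]

    absent = []
    absent_file = seq / "absent_label.txt"
    if absent_file.exists():
        # Assumed format: one flag per frame.
        absent = [int(float(line)) for line in absent_file.read_text().splitlines()
                  if line.strip()]

    return {
        "rgb": _frames(seq / "vis_imgs"),
        "event": _frames(seq / "event_imgs"),
        "boxes": boxes,
        "absent": absent,
    }
```

A useful sanity check on the result is that `rgb`, `event`, and `boxes` have the same length for every sequence.
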
## Citation

If you use this model or the SDSTrack tracker, please cite:

```bibtex
@inproceedings{hou2024sdstrack,
  title={SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking},
  author={Hou, Xiaojun and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```

## License

This checkpoint is provided for research purposes. Please refer to the original [SDSTrack repository](https://github.com/hoqolo/SDSTrack) for licensing terms.

## Acknowledgments

- Original SDSTrack implementation by [hoqolo](https://github.com/hoqolo)
- VisEvent dataset by [wangxiao5791509](https://github.com/wangxiao5791509/VisEvent_SOT_Benchmark)
- This checkpoint was reproduced as part of a university Pattern Recognition course project (Topic #65: Event-camera-based object tracking)