---
license: mit
---
# 🌊 CascadeFormer: Two-stage Cascading Transformer for Human Action Recognition

## News

- [August 31, 2025] paper available on [arXiv](https://arxiv.org/abs/2509.00692)!
- [July 19, 2025] model checkpoints are publicly available on [HuggingFace](https://huggingface.co/YusenPeng/CascadeFormerCheckpoints) for further analysis/application!

## CascadeFormer

![pretraining](figures/pretraining_illustration.png)

Overview of the masked pretraining component in CascadeFormer. A fixed percentage of joints are randomly masked across all frames in each video. The partially masked skeleton sequence is passed through a feature extraction module to produce frame-level embeddings, which are then input into a temporal transformer (T1). A lightweight linear decoder is applied to reconstruct the masked joints, and the model is optimized using mean squared error over the masked positions. This stage enables the model to learn generalizable spatiotemporal representations prior to supervised finetuning.
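The masking-and-reconstruction objective above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's code: the zero-fill strategy, the example mask ratio, and the `(frames, joints, coords)` array layout are assumptions made here for clarity, and the feature extractor, T1, and linear decoder are omitted.

```python
import numpy as np

def mask_joints(seq, mask_ratio=0.4, rng=None):
    """Randomly mask a fixed percentage of joints across all frames.

    seq: (T, J, C) skeleton sequence -- T frames, J joints, C coordinates.
    Returns the masked sequence and a boolean mask of shape (J,)
    marking which joints were zeroed out in every frame.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, J, C = seq.shape
    n_masked = int(round(mask_ratio * J))
    mask = np.zeros(J, dtype=bool)
    mask[rng.choice(J, size=n_masked, replace=False)] = True
    masked_seq = seq.copy()
    masked_seq[:, mask, :] = 0.0  # zero-fill masked joints in all frames
    return masked_seq, mask

def masked_mse(pred, target, mask):
    """Mean squared error computed only over the masked joint positions."""
    diff = (pred - target)[:, mask, :]
    return float(np.mean(diff ** 2))
```

In the full pipeline, `pred` would be the linear decoder's reconstruction of the skeleton from T1's frame embeddings; only the masked positions contribute to the loss.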

![finetuning](figures/finetuning_illustration.png)

Overview of the cascading finetuning component in CascadeFormer. The frame embeddings produced by the pretrained temporal transformer backbone (T1) are passed into a task-specific transformer (T2) for hierarchical refinement. The output of T2 is fused with the original embeddings via a cross-attention module. The resulting fused representations are aggregated through frame-level average pooling and passed to a lightweight classification head. The entire model, including T1, T2, and the classification head, is optimized using cross-entropy loss on action labels during finetuning.
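The finetuning path (T2 refinement, cross-attention fusion, average pooling, classification) can likewise be sketched. This NumPy sketch uses a single-head cross-attention without learned projections for brevity; which stream serves as query versus key/value, and the actual multi-head attention used by the model, are assumptions here and follow the paper and checkpoints rather than this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    """Single-head cross-attention without learned projections:
    each query frame attends over all key/value frames."""
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)    # (T, T) attention scores
    return softmax(scores, axis=-1) @ key_value  # (T, d) fused embeddings

def classify(t1_emb, t2_emb, w_cls, b_cls):
    """Fuse T2 output with the original T1 embeddings via cross-attention,
    average-pool over frames, and produce class logits."""
    fused = cross_attention(t2_emb, t1_emb)  # query/key roles assumed here
    pooled = fused.mean(axis=0)              # frame-level average pooling
    return pooled @ w_cls + b_cls            # logits for cross-entropy loss
```

During finetuning, the cross-entropy loss over these logits backpropagates through the classification head, T2, and T1 jointly.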

## Evaluation

![evaluation](figures/evaluation_table.png)

Overall accuracy evaluation results of CascadeFormer variants on three datasets. CascadeFormer 1.0 consistently achieves the highest accuracy on Penn Action and both NTU60 splits, while CascadeFormer 1.1 excels on N-UCLA. All checkpoints are open-sourced for reproducibility.

## Citation

Please cite our work if you find it useful/helpful:

```bibtex
@misc{peng2025cascadeformerfamilytwostagecascading,
      title={CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition},
      author={Yusen Peng and Alper Yilmaz},
      year={2025},
      eprint={2509.00692},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.00692},
}
```

## Contacts

If you have any questions or suggestions, feel free to contact:

- Yusen Peng (peng.1007@osu.edu)
- Alper Yilmaz (yilmaz.15@osu.edu)

Alternatively, you can open an issue in this repository.