---
license: mit
---
# 🌊 CascadeFormer: Two-stage Cascading Transformer for Human Action Recognition

## News

- [August 31, 2025] paper available on [arXiv](https://arxiv.org/abs/2509.00692)!
- [July 19, 2025] model checkpoints are publicly available on [HuggingFace](https://huggingface.co/YusenPeng/CascadeFormerCheckpoints) for further analysis/application!

## CascadeFormer

![pretraining](figures/pretraining_illustration.png)

Overview of the masked pretraining component in CascadeFormer. A fixed percentage of joints are randomly masked across all frames in each video. The partially masked skeleton sequence is passed through a feature extraction module to produce frame-level embeddings, which are then input into a temporal transformer (T1). A lightweight linear decoder is applied to reconstruct the masked joints, and the model is optimized using mean squared error over the masked positions. This stage enables the model to learn generalizable spatiotemporal representations prior to supervised finetuning.
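The masking-and-reconstruction objective above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's code: the zero-fill strategy, the example mask ratio, and the `(frames, joints, coords)` array layout are assumptions made here for clarity, and the feature extractor, T1, and linear decoder are omitted.

```python
import numpy as np

def mask_joints(seq, mask_ratio=0.4, rng=None):
    """Randomly mask a fixed percentage of joints across all frames.

    seq: (T, J, C) skeleton sequence -- T frames, J joints, C coordinates.
    Returns the masked sequence and a boolean mask of shape (J,)
    marking which joints were zeroed out in every frame.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, J, C = seq.shape
    n_masked = int(round(mask_ratio * J))
    mask = np.zeros(J, dtype=bool)
    mask[rng.choice(J, size=n_masked, replace=False)] = True
    masked_seq = seq.copy()
    masked_seq[:, mask, :] = 0.0  # zero-fill masked joints in all frames
    return masked_seq, mask

def masked_mse(pred, target, mask):
    """Mean squared error computed only over the masked joint positions."""
    diff = (pred - target)[:, mask, :]
    return float(np.mean(diff ** 2))
```

In the full pipeline, `pred` would be the linear decoder's reconstruction of the skeleton from T1's frame embeddings; only the masked positions contribute to the loss.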

![finetuning](figures/finetuning_illustration.png)

Overview of the cascading finetuning component in CascadeFormer. The frame embeddings produced by the pretrained temporal transformer backbone (T1) are passed into a task-specific transformer (T2) for hierarchical refinement. The output of T2 is fused with the original embeddings via a cross-attention module. The resulting fused representations are aggregated through frame-level average pooling and passed to a lightweight classification head. The entire model, including T1, T2, and the classification head, is optimized using cross-entropy loss on action labels during finetuning.
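The finetuning path (T2 refinement, cross-attention fusion, average pooling, classification) can likewise be sketched. This NumPy sketch uses a single-head cross-attention without learned projections for brevity; which stream serves as query versus key/value, and the actual multi-head attention used by the model, are assumptions here and follow the paper and checkpoints rather than this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    """Single-head cross-attention without learned projections:
    each query frame attends over all key/value frames."""
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)    # (T, T) attention scores
    return softmax(scores, axis=-1) @ key_value  # (T, d) fused embeddings

def classify(t1_emb, t2_emb, w_cls, b_cls):
    """Fuse T2 output with the original T1 embeddings via cross-attention,
    average-pool over frames, and produce class logits."""
    fused = cross_attention(t2_emb, t1_emb)  # query/key roles assumed here
    pooled = fused.mean(axis=0)              # frame-level average pooling
    return pooled @ w_cls + b_cls            # logits for cross-entropy loss
```

During finetuning, the cross-entropy loss over these logits backpropagates through the classification head, T2, and T1 jointly.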

## Evaluation

![evaluation](figures/evaluation_table.png)

Overall accuracy evaluation results of CascadeFormer variants on three datasets. CascadeFormer 1.0 consistently achieves the highest accuracy on Penn Action and both NTU60 splits, while CascadeFormer 1.1 excels on N-UCLA. All checkpoints are open-sourced for reproducibility.

## Citation

Please cite our work if you find it useful/helpful:

```bibtex
@misc{peng2025cascadeformerfamilytwostagecascading,
      title={CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition},
      author={Yusen Peng and Alper Yilmaz},
      year={2025},
      eprint={2509.00692},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.00692},
}
```

## Contacts

If you have any questions or suggestions, feel free to contact:

- Yusen Peng (peng.1007@osu.edu)
- Alper Yilmaz (yilmaz.15@osu.edu)

Alternatively, you can open an issue in this repository.