---
license: mit
---
# 🌊 CascadeFormer: Two-stage Cascading Transformer for Human Action Recognition

## News

- [August 31, 2025] Paper available on [arXiv](https://arxiv.org/abs/2509.00692)!
- [July 19, 2025] Model checkpoints are publicly available on [HuggingFace](https://huggingface.co/YusenPeng/CascadeFormerCheckpoints) for further analysis and application! See the loading sketch below.
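
A minimal sketch of how a checkpoint from this repository could be fetched and loaded, assuming the files are PyTorch weight files; the filename below is a hypothetical placeholder, so list the repository files first to find the real names:

```python
# Sketch: download a checkpoint from the Hub and load it with PyTorch.
# Assumption: checkpoints are torch-loadable files; the filename is a placeholder.
from huggingface_hub import hf_hub_download, list_repo_files
import torch

repo_id = "YusenPeng/CascadeFormerCheckpoints"

# Inspect which checkpoint files the repository actually contains.
print(list_repo_files(repo_id))

# Download one checkpoint (hypothetical filename) and load its weights on CPU.
path = hf_hub_download(repo_id=repo_id, filename="cascadeformer_checkpoint.pt")  # placeholder name
state_dict = torch.load(path, map_location="cpu")
```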

## CascadeFormer

![Masked pretraining in CascadeFormer](docs/CascadeFormer_pretrain.png)

Overview of the masked pretraining component in CascadeFormer. A fixed percentage of joints is randomly masked across all frames in each video. The partially masked skeleton sequence is passed through a feature extraction module to produce frame-level embeddings, which are then input into a temporal transformer (T1). A lightweight linear decoder reconstructs the masked joints, and the model is optimized with mean squared error over the masked positions. This stage enables the model to learn generalizable spatiotemporal representations prior to supervised finetuning.
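
As a concrete illustration of this objective, here is a minimal PyTorch sketch; the module sizes, mask ratio, and joint layout are illustrative assumptions, not the released implementation:

```python
# Illustrative sketch of the masked pretraining objective (dims/names are assumptions).
# x has shape (batch, frames, joints, coords).
import torch
import torch.nn as nn

class MaskedPretrainer(nn.Module):
    def __init__(self, num_joints=25, coords=3, d_model=256, mask_ratio=0.4):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(num_joints * coords, d_model)    # frame-level feature extraction
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.t1 = nn.TransformerEncoder(layer, num_layers=4)    # temporal transformer T1
        self.decoder = nn.Linear(d_model, num_joints * coords)  # lightweight linear decoder

    def forward(self, x):
        b, t, j, c = x.shape
        # Randomly mask a fixed fraction of joints across all frames of each clip.
        joint_mask = torch.rand(b, 1, j, 1, device=x.device) < self.mask_ratio
        x_masked = x.masked_fill(joint_mask, 0.0)
        frames = self.embed(x_masked.reshape(b, t, j * c))      # frame embeddings
        recon = self.decoder(self.t1(frames)).reshape(b, t, j, c)
        # Mean squared error computed only over the masked joint positions.
        mask = joint_mask.expand_as(x)
        return ((recon - x) ** 2)[mask].mean()
```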

![Cascading finetuning in CascadeFormer](docs/CascadeFormer_finetune.png)

Overview of the cascading finetuning component in CascadeFormer. The frame embeddings produced by the pretrained temporal transformer backbone (T1) are passed into a task-specific transformer (T2) for hierarchical refinement. The output of T2 is fused with the original embeddings via a cross-attention module. The resulting fused representations are aggregated through frame-level average pooling and passed to a lightweight classification head. The entire model, including T1, T2, and the classification head, is optimized with cross-entropy loss on action labels during finetuning.
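
A minimal PyTorch sketch of this cascading head under the same caveats; the query/key arrangement of the cross-attention and all dimensions are assumptions rather than the released design:

```python
# Illustrative sketch of the cascading finetuning head (dims/names are assumptions).
import torch
import torch.nn as nn

class CascadeHead(nn.Module):
    def __init__(self, d_model=256, num_classes=60):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.t2 = nn.TransformerEncoder(layer, num_layers=2)  # task-specific transformer T2
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)     # lightweight classification head

    def forward(self, t1_frames):
        # t1_frames: (batch, frames, d_model) embeddings from the pretrained backbone T1.
        refined = self.t2(t1_frames)                           # hierarchical refinement
        # Fuse T2 output with the original embeddings via cross-attention
        # (one plausible arrangement: original embeddings attend to refined ones).
        fused, _ = self.cross_attn(query=t1_frames, key=refined, value=refined)
        pooled = fused.mean(dim=1)                             # frame-level average pooling
        return self.classifier(pooled)                         # train with nn.CrossEntropyLoss
```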

## Evaluation

![Evaluation results of CascadeFormer variants](docs/eval_results.png)

Overall accuracy of CascadeFormer variants on three datasets. CascadeFormer 1.0 consistently achieves the highest accuracy on Penn Action and both NTU60 splits, while CascadeFormer 1.1 excels on N-UCLA. All checkpoints are open-sourced for reproducibility.

## Citation

Please cite our work if you find it useful or helpful:

```bibtex
@misc{peng2025cascadeformerfamilytwostagecascading,
      title={CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition},
      author={Yusen Peng and Alper Yilmaz},
      year={2025},
      eprint={2509.00692},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.00692},
}
```

## Contacts

If you have any questions or suggestions, feel free to contact:

- Yusen Peng (peng.1007@osu.edu)
- Alper Yilmaz (yilmaz.15@osu.edu)

Or open an issue on the repository.