SAM-MM-HAR
SAM-MM-HAR is a compact multimodal model for Human Activity Recognition (HAR) on privacy-preserving, non-RGB sensors. It fuses depth, skeleton, inertial (IMU), thermal, infrared and mmWave-radar signals to classify 40 everyday activities, and is designed to run on-device within a strict edge budget.
It was built by AMEFORGE Lab (Amega Mike) on a proprietary sparse Transformer architecture and developed for the CUHK-X Multimodal Human Activity Challenge (Small Model Track, co-located with UbiComp 2026).
⚠️ The architecture internals are proprietary and intentionally not disclosed. This card describes the model's behaviour, interface and results — not its design.
Highlights
| Property | Value |
|---|---|
| Task | 40-class HAR, cross-subject |
| Parameters | ~18M (18,043,939) |
| Size on disk | 72.2 MB (FP32) |
| Constraint | ≤ 100 MB, no pretrained backbone |
| Modalities | Depth · Skeleton · IMU · Thermal · IR · Radar |
| Deployment | CPU / edge, offline (no cloud, no API) |
| CPU latency | |
| Input resolution | 64×64 frames, 8 sampled per clip |
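The on-disk size in the table follows from the parameter count: FP32 stores 4 bytes per weight. The short check below reproduces that arithmetic and compares it with the 100 MB track limit. It assumes the checkpoint holds little beyond the raw weights; the helper names are illustrative and not part of the repo.

```python
# Sanity check: raw weight size implied by the parameter count.
PARAMS = 18_043_939        # parameter count from the Highlights table
BYTES_FP32 = 4             # one FP32 weight occupies 4 bytes
LIMIT_MB = 100             # Small Model Track constraint


def size_mb(n_params: int, bytes_per_param: int = BYTES_FP32) -> float:
    """Raw weight size in decimal megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


def size_mib(n_params: int, bytes_per_param: int = BYTES_FP32) -> float:
    """Raw weight size in binary mebibytes (1 MiB = 2**20 bytes)."""
    return n_params * bytes_per_param / 2**20


print(f"{size_mb(PARAMS):.1f} MB")    # 72.2 MB
print(f"{size_mib(PARAMS):.1f} MiB")  # 68.8 MiB
print(size_mb(PARAMS) <= LIMIT_MB)    # True
```

The model stays under the limit whether the limit is read in decimal or binary units.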
The model tolerates missing modalities: any available subset of sensors works at inference, and absent modalities are skipped.
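One generic way to skip absent modalities is late fusion: average per-class scores over whichever sensors are present. The sketch below illustrates that idea only. The actual fusion mechanism of SAM-MM-HAR is proprietary and undisclosed, so the averaging rule, the function names and the modality keys are all assumptions.

```python
from typing import Dict, List, Optional


def fuse_available(scores: Dict[str, Optional[List[float]]]) -> List[float]:
    """Average per-class scores over the modalities that are present.

    ASSUMPTION: plain late-fusion averaging, chosen for illustration.
    It is not the model's undisclosed fusion mechanism.
    """
    present = [v for v in scores.values() if v is not None]
    if not present:
        raise ValueError("at least one modality is required")
    n_classes = len(present[0])
    if any(len(v) != n_classes for v in present):
        raise ValueError("all modalities must score the same classes")
    return [sum(v[c] for v in present) / len(present) for c in range(n_classes)]


def predict(scores: Dict[str, Optional[List[float]]]) -> int:
    """Return the index of the highest fused score."""
    fused = fuse_available(scores)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy 3-class example; radar is absent and is skipped.
example = {
    "depth": [0.0, 2.0, 1.0],
    "imu": [1.0, 0.0, 3.0],
    "radar": None,
}
print(fuse_available(example))  # [0.5, 1.0, 2.0]
print(predict(example))         # 2
```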
Why non-RGB / privacy-preserving?
The model never sees a colour camera image. It perceives a scene only through physical measurements — depth maps, skeleton keypoints, inertial motion, heat and radar reflections. This preserves visual privacy (no identifiable faces or footage) while retaining enough signal to recognise the activity, which is the deployment reality for healthcare, elderly-care and smart-home monitoring.
Modalities & encoders
| Modality | Real format (CUHK-X) | Encoder |
|---|---|---|
| Depth (colorized) | RGB frames 640×480 | patch conv + sparse attention |
| IR | grayscale frames 640×480 | patch conv + sparse attention |
| Thermal | RGB frames 320×240 | patch conv + sparse attention |
| Skeleton | COCO-17 keypoints (x,y,z,score) per frame | joint encoder + sparse attention |
| IMU | 5 body sensors × 9 features (accel/gyro/angle) | temporal Conv1D |
| Radar (mmWave) | sparse point cloud (often empty) | BEV projection + patch conv |
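The radar row mentions a bird's-eye-view (BEV) projection of a sparse point cloud that is often empty. A minimal sketch of such a projection follows: it drops the height coordinate and counts points per grid cell, so an empty cloud yields an all-zero image. The axis convention (x lateral, y forward, z height), the grid size and the metric ranges are assumptions for illustration, not the model's actual preprocessing.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z); axis meaning is an assumption


def bev_occupancy(
    points: Iterable[Point],
    grid: int = 16,
    x_range: Tuple[float, float] = (-2.0, 2.0),  # hypothetical lateral extent (m)
    y_range: Tuple[float, float] = (0.0, 4.0),   # hypothetical forward extent (m)
) -> List[List[int]]:
    """Project 3D points onto a grid x grid top-down count image."""
    image = [[0] * grid for _ in range(grid)]
    x_min, x_max = x_range
    y_min, y_max = y_range
    for x, y, _z in points:
        # Points outside the covered area are ignored.
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(grid - 1, int((x - x_min) / (x_max - x_min) * grid))
        row = min(grid - 1, int((y - y_min) / (y_max - y_min) * grid))
        image[row][col] += 1
    return image


empty = bev_occupancy([])  # an empty cloud gives an all-zero image
one = bev_occupancy([(0.0, 2.0, 1.0)], grid=4)
print(sum(map(sum, empty)))  # 0
print(one[2][2])             # 1
```

Because an empty frame maps to a valid all-zero image, downstream code needs no special case for missing radar returns.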
A temporal world-model component models the dynamics of human motion across frames, which is the key contributor to recognising movement-based activities.
The 40 activity classes
The classes are daily-living activities centred on autonomy, self-care and household tasks, for example 0_Wash_face, 1_Brush_teeth, 36_Walk, 37_Take_medicine and 39_Take_body_temperature, plus cooking, cleaning, exercise and desk activities.
See class_mapping.csv from the CUHK-X dataset for the full list.
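The label strings shown above combine a numeric id with a name, joined by underscores. The helper below splits such a string into its two parts. It works on label strings only, because the column layout of class_mapping.csv is not described here; the function name is illustrative.

```python
from typing import Tuple


def parse_label(label: str) -> Tuple[int, str]:
    """Split a label such as '36_Walk' into (36, 'Walk')."""
    idx, sep, name = label.partition("_")
    if not sep or not idx.isdigit():
        raise ValueError(f"unexpected label format: {label!r}")
    # Remaining underscores separate the words of the activity name.
    return int(idx), name.replace("_", " ")


print(parse_label("36_Walk"))                   # (36, 'Walk')
print(parse_label("39_Take_body_temperature"))  # (39, 'Take body temperature')
```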
Results
Cross-subject evaluation (held-out subjects never seen in training).
| Metric | Value |
|---|---|
| Local validation accuracy | 43.4% |
| Kaggle public leaderboard | 0.303 |
| Kaggle rank (at submission) | 14th |
| Parameters | ~18M (18,043,939) |
| Model size | 72.2 MB (FP32) |
| Input | 64×64 frames, 8 per clip |
| Inference latency | |
The CPU-only latency is the notable figure here: the model runs in near real-time on a general-purpose CPU with no GPU, which makes on-device / edge deployment realistic (e.g. a small home box or a single-board computer).
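Latency depends heavily on the target hardware, so it is worth measuring on the device itself. The harness below is a generic sketch: it times any zero-argument callable after a few warm-up runs and reports the median and an approximate 95th percentile. The placeholder workload stands in for a model forward pass; how that call looks in inference.py is not shown here, so wiring it in is left as an assumption.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time fn() and return median and approximate p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (caches, lazy initialisation)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = int(0.95 * (len(samples) - 1))  # simple nearest-rank style index
    return {"median_ms": statistics.median(samples), "p95_ms": samples[p95_index]}


# Placeholder workload (ASSUMPTION): replace with one forward pass on one clip.
stats = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting the median alongside a tail percentile gives a more honest picture than a single run, since background load on small devices can skew individual timings.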
- Strengths: large-motion activities (walking, jumping, squats) are recognised with high accuracy thanks to the temporal world-model.
- Known limitations: fine seated/hand activities (using a phone, taking medicine, typing) are harder to disambiguate, as they differ only in subtle hand-level detail. Radar is frequently empty and contributes little.
Intended use
- Privacy-preserving activity monitoring for elderly care and healthcare (fall/inactivity detection, autonomy tracking) on low-power edge devices.
- Research on multimodal sensor fusion for HAR without RGB.
- A reusable feature/representation backbone for related sensor-fusion tasks (robotics perception, ambient smart-home understanding).
Out of scope / cautions
- Not a medical device; predictions must not be used for clinical decisions.
- Trained on a specific lab sensor configuration; real-world deployment on different sensors requires re-training or fine-tuning on the target rig.
- Cross-subject generalization is inherently limited by the number of training subjects; expect a drop on populations dissimilar to the training set.
Usage
import torch
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download("AMFORGE/sam-mm-har", "best.pt")
# Use inference.py from the repo, which inlines the architecture and
# the exact preprocessing for each modality:
# python inference.py --checkpoint best.pt --clip path/to/clip_folder
# python inference.py --checkpoint best.pt --data <root> --test-csv test.csv --out submission.csv
Each test clip is a folder with per-modality sub-folders
(Depth_Color/, IR/, Thermal/, Skeleton/, IMU/, Radar/).
The model aggregates per-clip and outputs one activity id (0–39).
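Given the folder layout above, a caller can check which modalities a clip actually contains before running inference. The sketch below lists the sub-folders that exist and hold at least one entry. Treating an empty sub-folder as a missing modality is an assumption, and the function is illustrative, not part of inference.py.

```python
from pathlib import Path
from typing import List, Union

# Sub-folder names as listed in the clip layout above.
SUBFOLDERS = ("Depth_Color", "IR", "Thermal", "Skeleton", "IMU", "Radar")


def available_modalities(clip_dir: Union[str, Path]) -> List[str]:
    """Return the modality sub-folders that exist and are non-empty."""
    clip = Path(clip_dir)
    found = []
    for name in SUBFOLDERS:
        sub = clip / name
        # ASSUMPTION: an empty folder counts as a missing modality.
        if sub.is_dir() and any(sub.iterdir()):
            found.append(name)
    return found


# Returns an empty list when the folder does not exist.
print(available_modalities("path/to/clip_folder"))
```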
Training setup
- From scratch (no pretrained weights), single GPU.
- Input: 12 frames/clip at 96×96; skeleton COCO-17; IMU resampled to fixed length.
- Cross-subject split; label smoothing; data augmentation (flip, brightness/contrast, spatial shift, noise).
- Checkpoints pushed to the Hub every 500 steps with auto-resume.
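A cross-subject split, as used in the setup above, assigns every clip of a given person to exactly one side, so validation subjects are never seen in training. The sketch below shows the idea on toy records. The record fields (`subject`, `clip`, `label`) and the subject ids are hypothetical and do not reflect the CUHK-X file format or the actual split.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Dict[str, object]  # hypothetical record: {"subject", "clip", "label"}


def cross_subject_split(
    samples: Iterable[Sample], held_out_subjects: Iterable[str]
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that held-out subjects appear only in validation."""
    held = set(held_out_subjects)
    train, val = [], []
    for sample in samples:
        (val if sample["subject"] in held else train).append(sample)
    return train, val


# Toy data with made-up subject ids.
data = [
    {"subject": "S01", "clip": "a", "label": 36},
    {"subject": "S01", "clip": "b", "label": 0},
    {"subject": "S02", "clip": "c", "label": 36},
    {"subject": "S03", "clip": "d", "label": 1},
]
train, val = cross_subject_split(data, held_out_subjects=["S03"])
train_subjects = {s["subject"] for s in train}
val_subjects = {s["subject"] for s in val}
print(len(train), len(val))           # 3 1
print(train_subjects & val_subjects)  # set()
```

Splitting by clip instead of by subject would leak a person's motion style into validation and inflate the reported accuracy.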
Citation
@misc{sam_mm_har_2026,
  title  = {SAM-MM-HAR: Compact Multimodal Human Activity Recognition on Privacy-Preserving Sensors},
  author = {Amega Mike},
  year   = {2026},
  note   = {AMEFORGE Lab. Built on a proprietary sparse Transformer architecture. Developed for the CUHK-X Multimodal Human Activity Challenge (UbiComp 2026).}
}
Dataset: CUHK-X Small Model Track — CUHK AIoT Lab.
Architecture internals are proprietary and not disclosed. © AMEFORGE Lab 2026.