SAM-MM-HAR

SAM-MM-HAR is a compact multimodal model for Human Activity Recognition (HAR) on privacy-preserving, non-RGB sensors. It fuses depth, skeleton, inertial (IMU), thermal, infrared and mmWave-radar signals to classify 40 everyday activities, and is designed to run on-device within a strict edge budget.

SAM-MM-HAR was built by AMEFORGE Lab (Amega Mike) on a proprietary sparse Transformer architecture and developed for the CUHK-X Multimodal Human Activity Challenge (Small Model Track, co-located with UbiComp 2026).

⚠️ The architecture internals are proprietary and intentionally not disclosed. This card describes the model's behaviour, interface and results — not its design.

Highlights

Property	Value
Task	40-class HAR, cross-subject
Parameters	~18M (18,043,939)
Size on disk	72.2 MB (FP32)
Constraint	≤ 100 MB, no pretrained backbone
Modalities	Depth · Skeleton · IMU · Thermal · IR · Radar
Deployment	CPU / edge, offline (no cloud, no API)
CPU latency	~157 ms/clip (~6.4 clips/s), no GPU
Input resolution	64×64 frames, 8 sampled per clip

The model handles missing modalities gracefully — any available subset of sensors works at inference; absent modalities are simply skipped.
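
Because the fusion mechanism is not disclosed, the following is only an illustrative sketch of one common way to tolerate missing sensors: average the embeddings of whichever modalities are present. The function name `fuse_available`, the modality keys and the list-of-floats embedding layout are all assumptions made for illustration, not the model's actual design.

```python
from typing import Dict, List, Optional

# Hypothetical modality keys, named after the sensors listed in this card.
MODALITIES = ["depth", "skeleton", "imu", "thermal", "ir", "radar"]

def fuse_available(embeddings: Dict[str, Optional[List[float]]]) -> List[float]:
    """Average the embeddings of the modalities that are present (not None)."""
    present = [v for k, v in embeddings.items() if k in MODALITIES and v is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    if any(len(v) != dim for v in present):
        raise ValueError("all embeddings must share one dimension")
    # Element-wise mean over the available modalities only.
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]

# Radar is absent here, so only depth and IMU contribute.
fused = fuse_available({"depth": [1.0, 3.0], "imu": [3.0, 5.0], "radar": None})
```

Averaging keeps the fused vector on the same scale regardless of how many sensors report, which is one reason this pattern is often used when inputs can drop out.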

Why non-RGB / privacy-preserving?

The model never sees a colour camera image. It perceives a scene only through physical measurements — depth maps, skeleton keypoints, inertial motion, heat and radar reflections. This preserves visual privacy (no identifiable faces or footage) while retaining enough signal to recognise the activity, which is the deployment reality for healthcare, elderly-care and smart-home monitoring.

Modalities & encoders

Modality	Real format (CUHK-X)	Encoder
Depth (colorized)	RGB frames 640×480	patch conv + sparse attention
IR	grayscale frames 640×480	patch conv + sparse attention
Thermal	RGB frames 320×240	patch conv + sparse attention
Skeleton	COCO-17 keypoints (x,y,z,score) per frame	joint encoder + sparse attention
IMU	5 body sensors × 9 features (accel/gyro/angle)	temporal Conv1D
Radar (mmWave)	sparse point cloud (often empty)	BEV projection + patch conv
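
To make the radar row concrete, here is a minimal sketch of a bird's-eye-view (BEV) projection: each (x, y, z) point is binned into a top-down occupancy grid, and an empty point cloud simply yields an all-zero grid. The grid size, the metric ranges and the function name `bev_project` are assumptions; the actual preprocessing lives in the repo's inference.py and may differ.

```python
from typing import List, Sequence, Tuple

def bev_project(points: Sequence[Tuple[float, float, float]],
                grid: int = 8,
                x_range: Tuple[float, float] = (-2.0, 2.0),
                y_range: Tuple[float, float] = (0.0, 4.0)) -> List[List[int]]:
    """Count radar points per cell of a grid x grid top-down map; z is dropped."""
    bev = [[0] * grid for _ in range(grid)]
    for x, y, _z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # outside the assumed field of view
        col = int((x - x_range[0]) / (x_range[1] - x_range[0]) * grid)
        row = int((y - y_range[0]) / (y_range[1] - y_range[0]) * grid)
        # Clamp to guard against floating-point rounding at the upper edge.
        bev[min(row, grid - 1)][min(col, grid - 1)] += 1
    return bev

empty = bev_project([])                 # empty cloud -> all-zero grid
one = bev_project([(0.0, 1.0, 0.5)])    # single point lands in row 2, col 4
```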

A temporal world-model component captures the dynamics of human motion across frames; it is the main contributor to recognising movement-based activities.

The 40 activity classes

Daily-living activities centred on autonomy, self-care and household tasks (e.g. 0_Wash_face, 1_Brush_teeth, 36_Walk, 37_Take_medicine, 39_Take_body_temperature, plus cooking, cleaning, exercise and desk activities). See class_mapping.csv from the CUHK-X dataset for the full list.
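
The class names shown above follow an `<id>_<Name>` pattern. As a small sketch, assuming every label follows that pattern, the id and a readable name can be split apart as below. The helper `parse_label` is hypothetical, and the column layout of class_mapping.csv is not stated here, so the example works on label strings only.

```python
from typing import Tuple

def parse_label(label: str) -> Tuple[int, str]:
    """Split '<id>_<Name_with_underscores>' into (id, 'Name with spaces')."""
    class_id, _, name = label.partition("_")
    return int(class_id), name.replace("_", " ")

# Only the labels quoted in this card; the full list is in class_mapping.csv.
examples = ["0_Wash_face", "1_Brush_teeth", "36_Walk",
            "37_Take_medicine", "39_Take_body_temperature"]
parsed = dict(parse_label(s) for s in examples)
```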

Results

Cross-subject evaluation (held-out subjects never seen in training).

Metric	Value
Local validation accuracy	43.4%
Kaggle public leaderboard	0.303
Kaggle rank (at submission)	14th
Parameters	~18M (18,043,939)
Model size	72.2 MB (FP32)
Input	64×64 frames, 8 per clip
Inference latency	~157 ms/clip on CPU (~6.4 clips/s)

The CPU-only latency is the notable figure here: the model runs in near real-time on a plain processor with no GPU, which is what makes on-device / edge deployment realistic (e.g. a small home box, a single-board computer).
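
The size and throughput figures can be cross-checked from the parameter count and latency alone. This is a back-of-the-envelope sketch that assumes 4 bytes per FP32 parameter, decimal megabytes (10^6 bytes) and weights only, ignoring any checkpoint metadata.

```python
PARAMS = 18_043_939      # parameter count reported in this card
BYTES_PER_PARAM = 4      # FP32
LATENCY_MS = 157         # approximate CPU latency per clip reported in this card
SIZE_LIMIT_MB = 100      # challenge constraint

size_mb = PARAMS * BYTES_PER_PARAM / 1e6   # decimal megabytes, weights only
clips_per_s = 1000 / LATENCY_MS            # throughput at one clip at a time

print(f"{size_mb:.1f} MB, {clips_per_s:.1f} clips/s")   # 72.2 MB, 6.4 clips/s
```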

Strengths: large-motion activities (walking, jumping, squats) are recognised with high accuracy thanks to the temporal world-model.

Known limitations: fine seated/hand activities (using a phone, taking medicine, typing) are harder to disambiguate, as they differ only in subtle hand-level detail. Radar is frequently empty and contributes little.

Intended use

Privacy-preserving activity monitoring for elderly care and healthcare (fall/inactivity detection, autonomy tracking) on low-power edge devices.
Research on multimodal sensor fusion for HAR without RGB.
A reusable feature/representation backbone for related sensor-fusion tasks (robotics perception, ambient smart-home understanding).

Out of scope / cautions

Not a medical device; predictions must not be used for clinical decisions.
Trained on a specific lab sensor configuration; real-world deployment on different sensors requires re-training or fine-tuning on the target rig.
Cross-subject generalization is inherently limited by the number of training subjects; expect a drop on populations dissimilar to the training set.

Usage

import torch
from huggingface_hub import hf_hub_download

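# Download the checkpoint; hf_hub_download returns its local path in the HF cache.
ckpt = hf_hub_download(repo_id="AMFORGE/sam-mm-har", filename="best.pt")
print(ckpt)
# Use inference.py from the repo, which inlines the architecture and
# the exact preprocessing for each modality. Pass the path printed above
# (or a local copy named best.pt) as --checkpoint:
#   python inference.py --checkpoint best.pt --clip path/to/clip_folder
#   python inference.py --checkpoint best.pt --data <root> --test-csv test.csv --out submission.csv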

Each test clip is a folder with per-modality sub-folders (Depth_Color/, IR/, Thermal/, Skeleton/, IMU/, Radar/). The model aggregates per-clip and outputs one activity id (0–39).
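
Given that folder layout, a quick way to see which sensors a clip actually provides is to check which sub-folders exist and are non-empty. This is a hedged sketch: the helper `available_modalities` is hypothetical and is not part of inference.py, which may detect modalities differently.

```python
from pathlib import Path
from typing import List

# Sub-folder names as listed in this card.
MODALITY_DIRS = ["Depth_Color", "IR", "Thermal", "Skeleton", "IMU", "Radar"]

def available_modalities(clip_dir: str) -> List[str]:
    """Return the modality sub-folders that exist and contain at least one entry."""
    root = Path(clip_dir)
    return [m for m in MODALITY_DIRS
            if (root / m).is_dir() and any((root / m).iterdir())]
```

For example, `available_modalities("path/to/clip_folder")` would return `["IMU"]` for a clip whose only populated sub-folder is IMU/.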

Training setup

From scratch (no pretrained weights), single GPU.
Input: 12 frames/clip at 96×96; skeleton COCO-17; IMU resampled to fixed length.
Cross-subject split; label smoothing; data augmentation (flip, brightness/contrast, spatial shift, noise).
Checkpoints pushed to the Hub every 500 steps with auto-resume.
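
For readers unfamiliar with label smoothing, the sketch below builds the smoothed target distribution for one of the 40 classes, using the common convention of spreading a fraction eps uniformly over all classes. The smoothing value used in training is not stated here, so `eps=0.1` is purely an assumed placeholder, and `smoothed_target` is a hypothetical helper.

```python
from typing import List

def smoothed_target(label: int, num_classes: int = 40, eps: float = 0.1) -> List[float]:
    """Label-smoothed target: eps spread uniformly, the remainder on the true class."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    base = eps / num_classes          # mass given to every class
    target = [base] * num_classes
    target[label] += 1.0 - eps        # the true class keeps the remaining mass
    return target

t = smoothed_target(36)   # class id 36 (36_Walk)
```

With 40 classes and the assumed eps, the true class receives 0.9025 and every other class 0.0025, which discourages over-confident predictions on a small cross-subject training set.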

Citation

@misc{sam_mm_har_2026,
  title  = {SAM-MM-HAR: Compact Multimodal Human Activity Recognition
            on Privacy-Preserving Sensors},
  author = {Amega Mike},
  year   = {2026},
  note   = {AMEFORGE Lab. Built on a proprietary sparse Transformer architecture.
            Developed for the CUHK-X Multimodal Human Activity Challenge (UbiComp 2026).}
}

Dataset: CUHK-X Small Model Track — CUHK AIoT Lab.
