PedSense -- KeypointLSTM Pedestrian Crossing Intent Classifier (JAAD, 50 epochs)

Overview

A lightweight LSTM classifier that predicts pedestrian crossing intent from normalized skeleton sequences. Each input is a sequence of T=5 frames where each frame contains 17 COCO body keypoints (x, y) relative to the pedestrian's bounding box center and height.

Keypoint sequences were extracted from JAAD dashcam frames using a fine-tuned YOLO26m-Pose model. Sequences are anchored to each pedestrian's crossing_point annotation with a 1-second prediction horizon: the model predicts intent at least 1 second before the crossing event occurs.

Model Details

| Property | Value |
|---|---|
| Architecture | KeypointLSTM (linear projection + 2-layer LSTM + classifier) |
| Task | Binary classification (not-crossing / crossing) |
| Input | (T=5, 34): 5 frames × 17 keypoints × 2 coordinates (flattened) |
| Dataset | JAAD |
| Epochs | 50 |
| Batch size | 16 |
| Hidden size | 128 |
| LSTM layers | 2 |
| Dropout | 0.3 |
| Learning rate | 1e-3 (AdamW + CosineAnnealingLR) |
| Train/val split | 80/20 (video-level) |
| Prediction horizon | 1 second before crossing |
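
The exact implementation lives in `pedsense.train.resnet_lstm`; the sketch below only illustrates what the listed hyperparameters imply. The projection width, last-timestep readout, and two-logit classifier head are assumptions, not details taken from the source.

```python
import torch
import torch.nn as nn

class KeypointLSTM(nn.Module):
    """Sketch consistent with the table above; the real class may differ in detail."""

    def __init__(self, input_size=34, hidden_size=128, num_layers=2, dropout=0.3, num_classes=2):
        super().__init__()
        self.proj = nn.Linear(input_size, hidden_size)  # per-frame linear projection
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                   # x: (B, T, 34)
        h = self.proj(x)                    # (B, T, hidden)
        out, _ = self.lstm(h)               # (B, T, hidden)
        return self.classifier(out[:, -1])  # logits from the last timestep: (B, 2)

model = KeypointLSTM()
logits = model(torch.zeros(1, 5, 34))
print(logits.shape)  # torch.Size([1, 2])
```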

Performance (Best checkpoint by validation F1)

| Metric | Value |
|---|---|
| Best val F1 (crossing) | 0.476 |

Keypoint Normalization

Each joint (kx, ky) is normalized relative to the pedestrian bounding box:

```
kx_norm = (kx - cx) / h
ky_norm = (ky - cy) / h
```

where (cx, cy) is the bounding box center and h is the bounding box height. This makes sequences translation- and scale-invariant, so the model sees comparable poses regardless of where the pedestrian appears in the frame or how close they are to the camera.
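
The normalization above can be sketched as a small helper; the (x1, y1, x2, y2) bounding-box convention here is an assumption for illustration:

```python
import numpy as np

def normalize_keypoints(kpts, bbox):
    """Normalize (17, 2) pixel keypoints by bbox center and height, per the formulas above."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    h = y2 - y1
    out = kpts.astype(np.float32).copy()
    out[:, 0] = (out[:, 0] - cx) / h
    out[:, 1] = (out[:, 1] - cy) / h
    return out

# Box center is (120, 100) and height is 100, so a point at (120, 80)
# maps to (0.0, -0.2).
kpts = np.full((17, 2), [120.0, 80.0])
norm = normalize_keypoints(kpts, (100, 50, 140, 150))
```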

COCO Keypoint Order (17 points)

| Index | Keypoint |
|---|---|
| 0 | Nose |
| 1 | Left eye |
| 2 | Right eye |
| 3 | Left ear |
| 4 | Right ear |
| 5 | Left shoulder |
| 6 | Right shoulder |
| 7 | Left elbow |
| 8 | Right elbow |
| 9 | Left wrist |
| 10 | Right wrist |
| 11 | Left hip |
| 12 | Right hip |
| 13 | Left knee |
| 14 | Right knee |
| 15 | Left ankle |
| 16 | Right ankle |
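
In code, the table above maps to a name list, and each (17, 2) frame flattens to 34 values. The row-major flattening (x at index 2i, y at index 2i + 1) is an assumption consistent with the (T=5, 34) input shape:

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Locate a joint's (x, y) slots in the flattened 34-dim frame vector.
i = COCO_KEYPOINTS.index("left_ankle")
print(i, 2 * i, 2 * i + 1)  # 15 30 31
```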

Usage

```python
import torch
import numpy as np
from huggingface_hub import hf_hub_download

# Download weights
weights_path = hf_hub_download(repo_id="JcProg/pedsense-keypoint-lstm-jaad-50e", filename="best.pt")

# Load model
from pedsense.train.resnet_lstm import KeypointLSTM
model = KeypointLSTM(input_size=34, hidden_size=128, num_layers=2, dropout=0.3)
model.load_state_dict(torch.load(weights_path, map_location="cpu", weights_only=True))
model.eval()

# Run inference on a normalized keypoint sequence.
# keypoints_normalized: np.ndarray of shape (5, 34), built as described above.
seq = torch.from_numpy(keypoints_normalized).float().unsqueeze(0)  # (1, 5, 34)
with torch.no_grad():
    probs = torch.softmax(model(seq), dim=1)[0]
    label = "crossing" if probs[1] > 0.5 else "not-crossing"
```

Training

Sequences were built with `pedsense preprocess keypoints`, then the model was trained using the PedSense-AI framework:

```bash
# Step 1: Extract frames
uv run pedsense preprocess frames

# Step 2: Build keypoint sequences (YOLO-Pose → JAAD alignment → (T,17,2) arrays)
uv run pedsense preprocess keypoints --sequence-length 5 --prediction-horizon 1.0

# Step 3: Train KeypointLSTM
uv run pedsense train -m keypoint-lstm -n my_lstm -e 50 -b 16
```

License

AGPL-3.0
