PedSense -- KeypointLSTM Pedestrian Crossing Intent Classifier (JAAD, 50 epochs)

Overview

A lightweight LSTM classifier that predicts pedestrian crossing intent from normalized skeleton sequences. Each input is a sequence of T=5 frames where each frame contains 17 COCO body keypoints (x, y) relative to the pedestrian's bounding box center and height.

Keypoint sequences were extracted from JAAD dashcam frames using a fine-tuned YOLO26m-Pose model. Sequences are anchored to each pedestrian's crossing_point annotation with a 1-second prediction horizon: the model predicts intent at least 1 second before the crossing event occurs.

Model Details

| Property | Value |
|---|---|
| Architecture | KeypointLSTM (linear projection + 2-layer LSTM + classifier) |
| Task | Binary classification (not-crossing / crossing) |
| Input | (T=5, 34): 5 frames × 17 keypoints × 2 coordinates (flattened) |
| Dataset | JAAD |
| Epochs | 50 |
| Batch size | 16 |
| Hidden size | 128 |
| LSTM layers | 2 |
| Dropout | 0.3 |
| Learning rate | 1e-3 (AdamW + CosineAnnealingLR) |
| Train/val split | 80/20 (video-level) |
| Prediction horizon | 1 second before crossing |
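
The exact implementation lives in `pedsense.train.resnet_lstm`; the sketch below only illustrates what the listed hyperparameters imply. The projection width, last-timestep readout, and two-logit classifier head are assumptions, not details taken from the source.

```python
import torch
import torch.nn as nn

class KeypointLSTM(nn.Module):
    """Sketch consistent with the table above; the real class may differ in detail."""

    def __init__(self, input_size=34, hidden_size=128, num_layers=2, dropout=0.3, num_classes=2):
        super().__init__()
        self.proj = nn.Linear(input_size, hidden_size)  # per-frame linear projection
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                   # x: (B, T, 34)
        h = self.proj(x)                    # (B, T, hidden)
        out, _ = self.lstm(h)               # (B, T, hidden)
        return self.classifier(out[:, -1])  # logits from the last timestep: (B, 2)

model = KeypointLSTM()
logits = model(torch.zeros(1, 5, 34))
print(logits.shape)  # torch.Size([1, 2])
```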

Performance (Best checkpoint by validation F1)

| Metric | Value |
|---|---|
| Best val F1 (crossing) | 0.476 |

Keypoint Normalization

Each joint (kx, ky) is normalized relative to the pedestrian bounding box:

```
kx_norm = (kx - cx) / h
ky_norm = (ky - cy) / h
```

where (cx, cy) is the bounding box center and h is the bounding box height. This makes sequences translation- and scale-invariant, so the model sees comparable poses regardless of where the pedestrian appears in the frame or how close they are to the camera.
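
The normalization above can be sketched as a small helper; the (x1, y1, x2, y2) bounding-box convention here is an assumption for illustration:

```python
import numpy as np

def normalize_keypoints(kpts, bbox):
    """Normalize (17, 2) pixel keypoints by bbox center and height, per the formulas above."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    h = y2 - y1
    out = kpts.astype(np.float32).copy()
    out[:, 0] = (out[:, 0] - cx) / h
    out[:, 1] = (out[:, 1] - cy) / h
    return out

# Box center is (120, 100) and height is 100, so a point at (120, 80)
# maps to (0.0, -0.2).
kpts = np.full((17, 2), [120.0, 80.0])
norm = normalize_keypoints(kpts, (100, 50, 140, 150))
```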

COCO Keypoint Order (17 points)

| Index | Keypoint |
|---|---|
| 0 | Nose |
| 1 | Left eye |
| 2 | Right eye |
| 3 | Left ear |
| 4 | Right ear |
| 5 | Left shoulder |
| 6 | Right shoulder |
| 7 | Left elbow |
| 8 | Right elbow |
| 9 | Left wrist |
| 10 | Right wrist |
| 11 | Left hip |
| 12 | Right hip |
| 13 | Left knee |
| 14 | Right knee |
| 15 | Left ankle |
| 16 | Right ankle |
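
In code, the table above maps to a name list, and each (17, 2) frame flattens to 34 values. The row-major flattening (x at index 2i, y at index 2i + 1) is an assumption consistent with the (T=5, 34) input shape:

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Locate a joint's (x, y) slots in the flattened 34-dim frame vector.
i = COCO_KEYPOINTS.index("left_ankle")
print(i, 2 * i, 2 * i + 1)  # 15 30 31
```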

Usage

```python
import torch
import numpy as np
from huggingface_hub import hf_hub_download

# Download weights
weights_path = hf_hub_download(repo_id="JcProg/pedsense-keypoint-lstm-jaad-50e", filename="best.pt")

# Load model
from pedsense.train.resnet_lstm import KeypointLSTM
model = KeypointLSTM(input_size=34, hidden_size=128, num_layers=2, dropout=0.3)
model.load_state_dict(torch.load(weights_path, map_location="cpu", weights_only=True))
model.eval()

# Run inference on a normalized keypoint sequence.
# keypoints_normalized: np.ndarray of shape (5, 34), built as described above.
seq = torch.from_numpy(keypoints_normalized).float().unsqueeze(0)  # (1, 5, 34)
with torch.no_grad():
    probs = torch.softmax(model(seq), dim=1)[0]
    label = "crossing" if probs[1] > 0.5 else "not-crossing"
```

Training

Sequences were built with `pedsense preprocess keypoints`, then the model was trained using the PedSense-AI framework:

```bash
# Step 1: Extract frames
uv run pedsense preprocess frames

# Step 2: Build keypoint sequences (YOLO-Pose → JAAD alignment → (T,17,2) arrays)
uv run pedsense preprocess keypoints --sequence-length 5 --prediction-horizon 1.0

# Step 3: Train KeypointLSTM
uv run pedsense train -m keypoint-lstm -n my_lstm -e 50 -b 16
```

License

AGPL-3.0
