two_stream_attn_v1_finetune_20260510T160508Z

A real-time hand gesture classifier trained on the IPN Hand dataset.

This model is part of the Maestro pipeline, which enables touchless control of presentation and meeting software through hand gestures captured from a standard webcam, with MediaPipe providing landmark extraction.

Model Description

  • Architecture: EnhancedTwoStreamLSTM (BiLSTM h=128×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
  • Parameters: 2,099,434
  • Input: (batch, 32, 147), a 32-frame sliding window at 30 FPS (≈ 1067 ms)
  • Output: raw logits over 10 gesture classes (apply softmax for probabilities)
  • Inference latency: < 1 ms per call (CPU, single sample); see the micro-benchmark sketch below
  • Feature schema: feature-schema-v5
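
The latency figure can be sanity-checked with a short micro-benchmark. A minimal sketch, assuming `artifact` has already been loaded as shown in the "How to Use" section below:

```python
import time

import torch

model = artifact.model.eval()    # `artifact` loaded as in "How to Use" below
dummy = torch.randn(1, 32, 147)  # one 32-frame window of 147-dim features

with torch.no_grad():
    for _ in range(10):          # warm-up iterations
        model(dummy)
    n = 1000
    start = time.perf_counter()
    for _ in range(n):
        model(dummy)
    print(f"mean latency: {(time.perf_counter() - start) / n * 1e3:.3f} ms")
```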

Architecture

EnhancedTwoStreamLSTM splits the 147-dim feature vector into two parallel streams and processes them through a BiLSTM + self-attention + MLP-gate pipeline:

```
Input (B, T=32, 147)
    │
    ├─ Stream A – Pose/Shape (73 dims)
    │   Linear+LN+GELU → 96
    │   2-layer BiLSTM (h=128) → (B, T, 256)
    │   LayerNorm → Self-MHA (8 heads) + residual + post-LN
    │   mean+max pool → pool_LN → ctx_a (B, 256)
    │
    ├─ Stream B – Motion/Dynamics (74 dims)
    │   (identical structure) → ctx_b (B, 256)
    │
    ├─ MLP cross-stream gate
    │   gate_a = Sigmoid(
    │     Linear(128→256)(
    │       Tanh(Linear(256→128)(ctx_b))))
    │   ctx_a  = LN(ctx_a × gate_a + ctx_a)
    │   gate_b = Sigmoid(
    │     Linear(128→256)(
    │       Tanh(Linear(256→128)(ctx_a))))
    │   ctx_b  = LN(ctx_b × gate_b + ctx_b)
    │
    └─ cat(ctx_a, ctx_b) → (512,)
       LN → Linear(512→256) → GELU → Dropout → Linear(256→10)
```

Design rationale:

  • BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
  • Mean+Max pooling captures both sustained gesture shape (mean) and transient click events (max).
  • The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query); a minimal sketch follows this list.
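
For concreteness, here is a minimal PyTorch sketch of the gate as drawn in the diagram above; class and variable names are illustrative, not the actual Maestro implementation:

```python
import torch
import torch.nn as nn

class CrossStreamGate(nn.Module):
    """Sketch of the 2-layer MLP gate: one stream's context recalibrates the other."""

    def __init__(self, dim: int = 256, bottleneck: int = 128) -> None:
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, ctx: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Gated residual: scale ctx by a gate computed from the *other* stream.
        return self.norm(ctx * self.gate(other) + ctx)

ctx_a, ctx_b = torch.randn(4, 256), torch.randn(4, 256)
ctx_a = CrossStreamGate()(ctx_a, ctx_b)  # (4, 256)
```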

Gesture Classes

| Class | Description |
| --- | --- |
| unknown | Background / transition / no gesture |
| point_one | Single-finger pointing gesture (continuous laser-pointer control) |
| point_two | Two-finger pointing gesture (continuous annotation-pen control) |
| open_palm_hold | Static open palm facing camera |
| swipe_right | Horizontal swipe from left to right |
| swipe_left | Horizontal swipe from right to left |
| swipe_up | Vertical swipe upward |
| swipe_down | Vertical swipe downward |
| zoom_in | Pinch-open (spread fingers away from each other) |
| zoom_out | Pinch-close (bring fingers together) |

Gesture Usage in the Presentation System

| Class | Mode | Command | Runtime handling |
| --- | --- | --- | --- |
| unknown | discrete | no_action | No-op background class |
| point_one | continuous | – | Continuous tracker: LaserPointerTracker (bypasses discrete dispatcher) |
| point_two | continuous | – | Continuous tracker: AnnotationPenTracker (bypasses discrete dispatcher) |
| open_palm_hold | discrete | erase_annotations | Discrete command via GestureActivationController → CommandDispatcher |
| swipe_right | discrete | next_slide | Discrete command via GestureActivationController → CommandDispatcher |
| swipe_left | discrete | previous_slide | Discrete command via GestureActivationController → CommandDispatcher |
| swipe_up | discrete | start_presentation | Discrete command via GestureActivationController → CommandDispatcher |
| swipe_down | discrete | stop_presentation | Discrete command via GestureActivationController → CommandDispatcher |
| zoom_in | discrete | zoom_in_view | Discrete command via GestureActivationController → CommandDispatcher |
| zoom_out | discrete | zoom_out_view | Discrete command via GestureActivationController → CommandDispatcher |
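
The routing above amounts to a small dispatch rule. The sketch below is illustrative only; the actual GestureActivationController, CommandDispatcher, and tracker classes live in the Maestro pipeline:

```python
# Illustrative routing table; the real Maestro classes are not shown here.
CONTINUOUS = {"point_one": "LaserPointerTracker", "point_two": "AnnotationPenTracker"}
DISCRETE = {
    "open_palm_hold": "erase_annotations",
    "swipe_right": "next_slide",
    "swipe_left": "previous_slide",
    "swipe_up": "start_presentation",
    "swipe_down": "stop_presentation",
    "zoom_in": "zoom_in_view",
    "zoom_out": "zoom_out_view",
}

def route(pred_class: str) -> str:
    if pred_class in CONTINUOUS:
        # Continuous gestures bypass the discrete dispatcher entirely.
        return f"hand off to {CONTINUOUS[pred_class]}"
    if pred_class in DISCRETE:
        return f"dispatch '{DISCRETE[pred_class]}'"
    return "no_action"  # 'unknown' background class

print(route("swipe_right"))  # dispatch 'next_slide'
```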

Feature Schema (feature-schema-v5)

| Block | Dims | Description |
| --- | --- | --- |
| position | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
| fingertip_spread | 63–67 | 5 inter-fingertip Euclidean distances |
| wrist_trajectory | 68–70 | Net wrist displacement from oldest frame in the window |
| velocity | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
| joint_angles | 134–143 | 10 MCP + PIP joint angles in radians |
| wrist_vel_raw | 144–146 | Camera-normalised wrist velocity (x, y, z); key directional signal |
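
A feature window can be split back into these blocks by index. A minimal sketch of the v5 layout, with slice bounds taken from the table above:

```python
import numpy as np

# feature-schema-v5 block boundaries (Python slices; end index exclusive).
SCHEMA_V5 = {
    "position":         slice(0, 63),    # 21 landmarks x (x, y, z)
    "fingertip_spread": slice(63, 68),   # 5 inter-fingertip distances
    "wrist_trajectory": slice(68, 71),   # net wrist displacement over the window
    "velocity":         slice(71, 134),  # 21 per-landmark velocity vectors
    "joint_angles":     slice(134, 144), # 10 MCP + PIP angles (radians)
    "wrist_vel_raw":    slice(144, 147), # camera-normalised wrist velocity
}

window = np.zeros((32, 147), dtype=np.float32)  # one 32-frame feature window
blocks = {name: window[:, s] for name, s in SCHEMA_V5.items()}
assert sum(b.shape[1] for b in blocks.values()) == 147
```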

How to Use

```python
import torch
from huggingface_hub import hf_hub_download
from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after the first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm",
    filename="two_stream_attn_v1_finetune_20260510T160508Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build 147-dim feature vectors with LandmarkFeatureTransformer and fill a
# 32-frame SlidingWindowSequenceBuffer; `window_np` below is the resulting
# (32, 147) float array drained from that buffer.
with torch.no_grad():
    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)  # (1, 32, 147)
    logits = artifact.model(window_tensor)
    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
```
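
Because the model returns raw logits, runtime predictions are filtered through softmax confidence before a gesture is acted on. Continuing from the snippet above; the threshold value here is illustrative (production uses per-class thresholds from production_ipn.yaml):

```python
probs = torch.softmax(logits, dim=1).squeeze(0)  # (10,) class probabilities
conf, idx = probs.max(dim=0)

THRESHOLD = 0.70  # illustrative; not the production per-class values
if conf.item() >= THRESHOLD:
    pred_class = artifact.class_labels[idx.item()]
else:
    pred_class = "unknown"  # treat low-confidence frames as background
```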

Training Dataset

  • Source: IPN Hand database, multi-subject, multi-session gesture recordings
  • Used classes: 10 (9 active gestures + unknown background)
  • Dataset split: 70% train / 15% val / 15% test (stratified by class)
  • Augmentation: temporal scale ±20%, spatial jitter σ=0.005; label-aware horizontal mirror (swipe_left ↔ swipe_right), sketched below
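
A minimal sketch of the label-aware mirror, assuming the (x, y, z) interleaving of feature-schema-v5 above; the actual transform lives in the training pipeline. Distance and angle blocks are mirror-invariant and left untouched:

```python
import numpy as np

MIRROR_LABELS = {"swipe_left": "swipe_right", "swipe_right": "swipe_left"}

def mirror(window: np.ndarray, label: str) -> tuple[np.ndarray, str]:
    """Horizontally mirror a (32, 147) feature window and swap direction labels."""
    out = window.copy()
    # Negate x components of the position and velocity blocks (x assumed first
    # in each (x, y, z) triple, i.e. at stride 3 within the block).
    for start, stop in ((0, 63), (71, 134)):
        out[:, start:stop:3] *= -1.0
    out[:, 68] *= -1.0    # wrist_trajectory x
    out[:, 144] *= -1.0   # wrist_vel_raw x
    return out, MIRROR_LABELS.get(label, label)

aug, lbl = mirror(np.random.randn(32, 147).astype(np.float32), "swipe_left")
assert lbl == "swipe_right"
```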

Training Strategy

Two-phase transfer learning pipeline:

  • Phase 1 (pretraining): backbone pretrained (external checkpoint two_stream_attn_v1_20260510T155938Z.pt) to learn generic gesture dynamics.
  • Phase 2 (fine-tuning): head replaced and the model adapted to the IPN 10-gesture production subset, in two stages (sketched after this list):
  • Stage A (frozen backbone): 10-epoch head-only warmup.
  • Stage B (full model): up to 75 epochs of joint fine-tuning with LR scheduling and early stopping.
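
A minimal sketch of the two stages; the head submodule name and the freezing logic are assumptions, since the actual stage implementation lives in the Maestro training pipeline:

```python
import torch

# `model` is the EnhancedTwoStreamLSTM being fine-tuned; the classifier head
# is assumed to be a submodule named "head" (real module names may differ).
def set_backbone_trainable(model: torch.nn.Module, trainable: bool) -> None:
    for name, param in model.named_parameters():
        if not name.startswith("head"):
            param.requires_grad = trainable

set_backbone_trainable(model, False)  # Stage A: 10-epoch head-only warmup
# ... train the head, then ...
set_backbone_trainable(model, True)   # Stage B: joint fine-tuning, up to 75 epochs
```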

Training Configuration

| Parameter | Value |
| --- | --- |
| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=128×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| Input size | 147 |
| Hidden size | 128/stream (BiLSTM output: 256) |
| Projection dim | 96 |
| Num layers | 4 |
| MHA heads | 8 (head dim: 32) |
| Dropout | 0.2 |
| Learning rate | 0.0002 |
| Weight decay | 0.0005 |
| Batch size | 64 |
| Max epochs | 100 |
| Early stopping patience | 25 |
| Label smoothing | 0.05 |
| Class weighting | disabled |
| Max samples per class | 2400 |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=8) |
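
These settings map onto standard PyTorch components. A minimal sketch; the optimizer type and the scheduler's monitored metric are assumptions, since the card lists only the hyperparameter values:

```python
import torch

# `model` is the network under training (see the stage sketch above).
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.05)  # class weighting disabled
optimizer = torch.optim.Adam(  # optimizer type assumed; lr/weight decay from the table
    model.parameters(), lr=2e-4, weight_decay=5e-4
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=8  # assumes a validation metric to maximise
)
# After each validation pass: scheduler.step(val_metric)
```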

Evaluation Results (Test Set)

| Metric | Value |
| --- | --- |
| Accuracy | 86.2% |
| Macro F1 | 85.8% |

Per-Class Recall

| Class | Recall |
| --- | --- |
| unknown | 75.4% |
| point_one | 97.4% |
| point_two | 95.3% |
| open_palm_hold | 92.3% |
| swipe_right | 76.0% |
| swipe_left | 84.3% |
| swipe_up | 84.8% |
| swipe_down | 84.5% |
| zoom_in | 87.1% |
| zoom_out | 80.7% |

Comparison with Previous Architecture

| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
| --- | --- | --- |
| LSTM direction | Unidirectional | Bidirectional |
| Attention | Bahdanau (scalar) | MHA Q/K/V (8 heads) |
| Feature projection | No | Yes (→96) |
| Temporal pooling | Mean only | Mean + Max |
| Cross-stream fusion | Concat only | 2-layer MLP gate |
| Parameters | ~182 K | 2,099,434 (~2.1 M) |

Limitations and Risks

  • Trained on IPN Hand subjects only. Performance may degrade with unusual hand sizes, skin tones, or lighting conditions not represented in training data.
  • The unknown class represents background/transition frames. At runtime, predictions are filtered through per-class confidence thresholds defined in production_ipn.yaml.
  • Requires mediapipe>=0.10.14 for landmark extraction at inference time.
  • Not intended for safety-critical or accessibility-critical applications.
  • Performance was measured on a held-out test split from the same dataset; real-world generalisation may differ.

Environmental Impact

Training was performed on CPU/MPS. Estimated training time: ~10 minutes. Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).


Generated by the Maestro training pipeline on 2026-05-10.
