# two_stream_attn_v1_finetune_20260510T160508Z
A real-time hand gesture classifier trained on the IPN Hand dataset.
This model is part of the Maestro pipeline, which enables touchless control of presentation and meeting software through hand gestures captured from a standard webcam, with MediaPipe used for landmark extraction.
## Model Description
- Architecture: EnhancedTwoStreamLSTM (BiLSTM h=128×2, MHA 8 heads, proj=96, mean+max pool, MLP gate)
- Parameters: 2,099,434
- Input: `(batch, 32, 147)`, a 32-frame sliding window at 30 FPS (≈ 1067 ms)
- Output: logits over 10 gesture classes (apply softmax to obtain probabilities)
- Inference latency: < 1 ms per call (CPU, single sample)
- Feature schema: `feature-schema-v5`
## Architecture
EnhancedTwoStreamLSTM splits the 147-dim feature vector into two parallel streams and
processes them through a BiLSTM + self-attention + MLP-gate pipeline:
```
Input (B, T=32, 147)
│
├─ Stream A ← Pose/Shape (73 dims)
│    Linear+LN+GELU → 96
│    2-layer BiLSTM (h=128) → (B, T, 256)
│    LayerNorm → Self-MHA (8 heads) + residual + post-LN
│    mean+max pool → pool_LN → ctx_a (B, 256)
│
├─ Stream B ← Motion/Dynamics (74 dims)
│    (identical structure) → ctx_b (B, 256)
│
├─ MLP cross-stream gate
│    gate_a = Sigmoid(Linear(128→256)(Tanh(Linear(256→128)(ctx_b))))
│    ctx_a  = LN(ctx_a × gate_a + ctx_a)
│    gate_b = Sigmoid(Linear(128→256)(Tanh(Linear(256→128)(ctx_a))))
│    ctx_b  = LN(ctx_b × gate_b + ctx_b)
│
└─ cat(ctx_a, ctx_b) → (512,)
     LN → Linear(512→256) → GELU → Dropout → Linear(256→10)
```
Design rationale:
- BiLSTMs encode temporal order via their recurrent cell state, so no positional encoding is needed.
- Mean+max pooling captures both sustained gesture shape (mean) and transient click events (max).
- The 2-layer MLP gate provides non-linear cross-modal recalibration at ~37 K params (vs ~263 K for full MHA cross-attention with a degenerate mean-pooled query); a minimal PyTorch sketch of the gate follows this list.
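The sketch below follows the gate equations in the diagram above; the module name, attribute names, and default dimensions are illustrative, not the actual Maestro implementation.

```python
import torch
import torch.nn as nn

class CrossStreamGate(nn.Module):
    """Sketch of the 2-layer MLP cross-stream gate (names are illustrative)."""

    def __init__(self, dim: int = 256, bottleneck: int = 128) -> None:
        super().__init__()
        # The gate for ctx_a is computed from ctx_b, and vice versa.
        self.gate_a = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.Tanh(),
            nn.Linear(bottleneck, dim), nn.Sigmoid(),
        )
        self.gate_b = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.Tanh(),
            nn.Linear(bottleneck, dim), nn.Sigmoid(),
        )
        self.ln_a = nn.LayerNorm(dim)
        self.ln_b = nn.LayerNorm(dim)

    def forward(self, ctx_a: torch.Tensor, ctx_b: torch.Tensor):
        # Gated residual recalibration, applied sequentially as in the diagram.
        ctx_a = self.ln_a(ctx_a * self.gate_a(ctx_b) + ctx_a)
        ctx_b = self.ln_b(ctx_b * self.gate_b(ctx_a) + ctx_b)
        return ctx_a, ctx_b
```

A quick shape check: `CrossStreamGate()(torch.randn(4, 256), torch.randn(4, 256))` returns two `(4, 256)` tensors.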
## Gesture Classes
| Class | Description |
|---|---|
| `unknown` | Background / transition / no gesture |
| `point_one` | Single-finger pointing gesture (continuous laser-pointer control) |
| `point_two` | Two-finger pointing gesture (continuous annotation-pen control) |
| `open_palm_hold` | Static open palm facing camera |
| `swipe_right` | Horizontal swipe from left to right |
| `swipe_left` | Horizontal swipe from right to left |
| `swipe_up` | Vertical swipe upward |
| `swipe_down` | Vertical swipe downward |
| `zoom_in` | Pinch-open (spread fingers away from each other) |
| `zoom_out` | Pinch-close (bring fingers together) |
## Gesture Usage in the Presentation System
| Class | Mode | Command | Runtime handling |
|---|---|---|---|
| `unknown` | discrete | `no_action` | No-op background class |
| `point_one` | continuous | n/a | Continuous tracker: `LaserPointerTracker` (bypasses discrete dispatcher) |
| `point_two` | continuous | n/a | Continuous tracker: `AnnotationPenTracker` (bypasses discrete dispatcher) |
| `open_palm_hold` | discrete | `erase_annotations` | Discrete command via `GestureActivationController` → `CommandDispatcher` |
| `swipe_right` | discrete | `next_slide` | Discrete command via `GestureActivationController` → `CommandDispatcher` |
| `swipe_left` | discrete | `previous_slide` | Discrete command via `GestureActivationController` → `CommandDispatcher` |
| `swipe_up` | discrete | `start_presentation` | Discrete command via `GestureActivationController` → `CommandDispatcher` |
| `swipe_down` | discrete | `stop_presentation` | Discrete command via `GestureActivationController` → `CommandDispatcher` |
| `zoom_in` | discrete | `zoom_in_view` | Discrete command via `GestureActivationController` → `CommandDispatcher` |
| `zoom_out` | discrete | `zoom_out_view` | Discrete command via `GestureActivationController` → `CommandDispatcher` |
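As a hedged sketch of the continuous-vs-discrete routing in the table above, the snippet below shows one way the split could be wired; everything except the class labels and command strings is an illustrative stand-in, not the actual Maestro API.

```python
from typing import Callable, Dict, Set

# Continuous gestures bypass the discrete dispatcher and feed a tracker
# (laser pointer / annotation pen) on every frame.
CONTINUOUS_CLASSES: Set[str] = {"point_one", "point_two"}

# Discrete class -> command mapping, taken from the table above.
COMMANDS: Dict[str, str] = {
    "open_palm_hold": "erase_annotations",
    "swipe_right": "next_slide",
    "swipe_left": "previous_slide",
    "swipe_up": "start_presentation",
    "swipe_down": "stop_presentation",
    "zoom_in": "zoom_in_view",
    "zoom_out": "zoom_out_view",
}

def dispatch(label: str,
             on_track: Callable[[str], None],
             on_command: Callable[[str], None]) -> None:
    """Route one predicted class to a continuous tracker or a discrete command."""
    if label in CONTINUOUS_CLASSES:
        on_track(label)              # continuous control path
    elif label in COMMANDS:
        on_command(COMMANDS[label])  # one-shot command via the dispatcher
    # `unknown` maps to no_action: fall through silently

# Example: dispatch("swipe_right", print, print) prints "next_slide".
```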
## Feature Schema (`feature-schema-v5`)
| Block | Dims | Description |
|---|---|---|
| `position` | 0–62 | 21 wrist-relative, scale-normalised landmark positions (x, y, z) |
| `fingertip_spread` | 63–67 | 5 inter-fingertip Euclidean distances |
| `wrist_trajectory` | 68–70 | Net wrist displacement from the oldest frame in the window |
| `velocity` | 71–133 | 21 per-landmark wrist-relative velocity vectors (Δposition per unit time) |
| `joint_angles` | 134–143 | 10 MCP + PIP joint angles in radians |
| `wrist_vel_raw` | 144–146 | Camera-normalised wrist velocity (x, y, z); the key directional signal |
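The block boundaries above translate directly into index slices. A small sketch; the dictionary is an illustrative helper derived from the table, not part of the Maestro feature transformer.

```python
import numpy as np

# Slice boundaries taken from the feature-schema-v5 table above.
FEATURE_BLOCKS = {
    "position":         slice(0, 63),    # 21 landmarks x (x, y, z)
    "fingertip_spread": slice(63, 68),   # 5 inter-fingertip distances
    "wrist_trajectory": slice(68, 71),   # net wrist displacement
    "velocity":         slice(71, 134),  # 21 per-landmark velocity vectors
    "joint_angles":     slice(134, 144), # 10 MCP + PIP angles (radians)
    "wrist_vel_raw":    slice(144, 147), # camera-normalised wrist velocity
}

frame = np.zeros(147, dtype=np.float32)             # one feature vector
wrist_vel = frame[FEATURE_BLOCKS["wrist_vel_raw"]]  # shape (3,)

# Sanity check: the blocks tile the full 147-dim vector.
assert sum(s.stop - s.start for s in FEATURE_BLOCKS.values()) == 147
```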
## How to Use
```python
import torch
from huggingface_hub import hf_hub_download
from maestro.infrastructure.model.checkpoint_loader import load_inference_artifact

# Download the artifact (cached after first call)
local_path = hf_hub_download(
    repo_id="ntsrigaud/maestro-lstm",
    filename="two_stream_attn_v1_finetune_20260510T160508Z_inference.pt",
)

# Load the artifact (includes model, class labels, and feature schema)
artifact = load_inference_artifact(
    artifact_path=local_path,
    device=torch.device("cpu"),
)
artifact.model.eval()

# Build a 147-dim feature vector using LandmarkFeatureTransformer
# and fill a 32-frame SlidingWindowSequenceBuffer, then:
with torch.no_grad():
    # tensor shape: (batch=1, T=32, F=147)
    window_tensor = torch.tensor(window_np, dtype=torch.float32).unsqueeze(0)
    logits = artifact.model(window_tensor)
    pred_class = artifact.class_labels[logits.argmax(dim=1).item()]
```
## Training Dataset
- Source: IPN Hand database, multi-subject, multi-session gesture recordings
- Used classes: 10 (9 active gestures + `unknown` background)
- Dataset split: 70% train / 15% val / 15% test (stratified by class)
- Augmentation: temporal scale ±20%, spatial jitter σ=0.005; label-aware horizontal mirror (`swipe_left` ↔ `swipe_right`); a sketch of the mirror augmentation follows this list.
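The sketch below illustrates the label-aware mirror under two assumptions: the (x, y, z) components are interleaved within each block, and mirroring means negating every x component; the actual transform in the training pipeline may differ.

```python
import numpy as np

# Class labels that must be swapped when the window is mirrored left<->right.
MIRRORED_LABELS = {"swipe_left": "swipe_right", "swipe_right": "swipe_left"}

# (start, stop) of the blocks containing (x, y, z) triplets, per the feature
# schema above: position, wrist_trajectory, velocity, wrist_vel_raw.
XYZ_BLOCKS = [(0, 63), (68, 71), (71, 134), (144, 147)]

def mirror_horizontally(window: np.ndarray, label: str) -> tuple[np.ndarray, str]:
    """Flip a (T, 147) feature window left<->right and remap swipe labels."""
    out = window.copy()
    for start, stop in XYZ_BLOCKS:
        out[:, start:stop:3] = -out[:, start:stop:3]  # negate each x component
    # Distances (fingertip_spread) and joint angles are mirror-invariant.
    return out, MIRRORED_LABELS.get(label, label)
```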
## Training Strategy
Two-phase transfer learning pipeline (a staging sketch follows this list):

- Phase 1 (pretraining): backbone pretrained from the external checkpoint `two_stream_attn_v1_20260510T155938Z.pt` to learn generic gesture dynamics.
- Phase 2 (fine-tuning): head replaced and the model adapted to the IPN 10-gesture production subset.
  - Stage A (frozen backbone): 10 epoch(s) of head-only warmup.
  - Stage B (full model): up to 75 epoch(s) of joint fine-tuning with LR scheduling and early stopping.
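A minimal sketch of the Stage A / Stage B split; the tiny stand-in model and the `head` attribute name are assumptions for illustration, not the actual Maestro classes.

```python
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Stand-in for EnhancedTwoStreamLSTM, just to show the freezing logic."""

    def __init__(self) -> None:
        super().__init__()
        self.backbone = nn.LSTM(147, 128, batch_first=True)
        self.head = nn.Linear(128, 10)  # replaced head for the 10 IPN classes

    def forward(self, x):
        out, _ = self.backbone(x)
        return self.head(out[:, -1])

def set_backbone_trainable(model: nn.Module, trainable: bool) -> None:
    """Freeze/unfreeze everything except the classification head."""
    head_ids = {id(p) for p in model.head.parameters()}
    for p in model.parameters():
        p.requires_grad = trainable or id(p) in head_ids

model = TinyClassifier()
set_backbone_trainable(model, False)  # Stage A: head-only warmup (10 epochs)
set_backbone_trainable(model, True)   # Stage B: joint fine-tuning (up to 75 epochs)
```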
## Training Configuration
| Parameter | Value |
|---|---|
| Architecture | EnhancedTwoStreamLSTM (BiLSTM h=128×2, MHA 8 heads, proj=96, mean+max pool, MLP gate) |
| Input size | 147 |
| Hidden size | 128/stream (BiLSTM output: 256) |
| Projection dim | 96 |
| Num layers | 4 |
| MHA heads | 8 (head dim: 32) |
| Dropout | 0.2 |
| Learning rate | 0.0002 |
| Weight decay | 0.0005 |
| Batch size | 64 |
| Max epochs | 100 |
| Early stopping patience | 25 |
| Label smoothing | 0.05 |
| Class weighting | disabled |
| Max samples per class | 2400 |
| LR scheduler | ReduceLROnPlateau (factor=0.5, patience=8) |
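As a rough illustration, the optimiser-side values in the table map onto PyTorch as follows; AdamW and the toy model are assumptions, since the card does not name the optimiser.

```python
import torch
import torch.nn as nn

model = nn.Linear(147, 10)  # stand-in for EnhancedTwoStreamLSTM

# lr, weight decay, scheduler, and label smoothing taken from the table above.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=8
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.05)

# After each validation epoch: scheduler.step(val_loss)
```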
## Evaluation Results (Test Set)
| Metric | Value |
|---|---|
| Accuracy | 86.2% |
| Macro F1 | 85.8% |
### Per-Class Recall
| Class | Recall |
|---|---|
| `unknown` | 75.4% |
| `point_one` | 97.4% |
| `point_two` | 95.3% |
| `open_palm_hold` | 92.3% |
| `swipe_right` | 76.0% |
| `swipe_left` | 84.3% |
| `swipe_up` | 84.8% |
| `swipe_down` | 84.5% |
| `zoom_in` | 87.1% |
| `zoom_out` | 80.7% |
## Comparison with Previous Architecture
| Feature | TwoStreamGestureLSTM | EnhancedTwoStreamLSTM |
|---|---|---|
| LSTM direction | Unidirectional | Bidirectional |
| Attention | Bahdanau (scalar) | MHA Q/K/V (8 heads) |
| Feature projection | No | Yes (→96) |
| Temporal pooling | Mean only | Mean + Max |
| Cross-stream fusion | Concat only | 2-layer MLP gate |
| Parameters | ~182 K | 2,099,434 (~2.1 M) |
## Limitations and Risks
- Trained on IPN Hand subjects only. Performance may degrade with unusual hand sizes, skin tones, or lighting conditions not represented in the training data.
- The `unknown` class represents background/transition frames. At runtime, predictions are filtered through per-class confidence thresholds defined in `production_ipn.yaml` (a filtering sketch follows this list).
- Requires `mediapipe>=0.10.14` for landmark extraction at inference time.
- Not intended for safety-critical or accessibility-critical applications.
- Performance was measured on a held-out test split from the same dataset; real-world generalisation may differ.
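A hedged sketch of per-class confidence filtering; the threshold values and the helper are illustrative and do not reflect the actual `production_ipn.yaml` contents.

```python
import torch

THRESHOLDS = {"swipe_left": 0.80, "swipe_right": 0.80}  # example values only
DEFAULT_THRESHOLD = 0.70  # assumed fallback for classes without an entry

def filter_prediction(logits: torch.Tensor, class_labels: list[str]) -> str:
    """Return the predicted label, or `unknown` if confidence is too low."""
    probs = torch.softmax(logits, dim=1)
    conf, idx = probs.max(dim=1)
    label = class_labels[idx.item()]
    if conf.item() < THRESHOLDS.get(label, DEFAULT_THRESHOLD):
        return "unknown"  # below threshold: treat as background
    return label
```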
## Environmental Impact
Training was performed on CPU/MPS. Estimated training time: ~10 minutes. Estimated CO₂ equivalent: negligible (<0.001 kg CO₂eq).
Generated by the Maestro training pipeline on 2026-05-10.