How2Sign Pose-Based CSLR

Pose-based Continuous Sign Language Recognition (ASL video -> gloss) trained on the How2Sign dataset with LLM-generated pseudo-glosses.

Architecture

Component    Spec
Input        153-D per frame: 150-D MediaPipe Holistic keypoints + 3-D presence mask
Pose layout  8 pose + 21 left-hand + 21 right-hand landmarks, xyz each
Frame rate   8 fps effective (stride 3 over ~24 fps source video)
Stem         Linear(153 -> 256) + LayerNorm + GELU + Dropout
Position     Sinusoidal positional encoding
Encoder      4 x TransformerEncoderLayer (d=256, heads=4, FF=1024, pre-norm, GELU)
Head         Linear(256 -> 2239)
Loss         CTC (blank id = 1)
Parameters   ~3.8M
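
The input width and the parameter figure can be sanity-checked with a little arithmetic. The sketch below is only an estimate: it assumes every Linear layer has a bias, that each encoder layer has the standard attention, feed-forward, and two-LayerNorm layout, and that there is no extra final LayerNorm. None of these details are confirmed by the training notebook.

```python
# Back-of-the-envelope parameter count for the architecture listed above.
# Assumptions (not confirmed by the source): all Linear layers have biases,
# encoder layers use the standard attention + feed-forward + two LayerNorms,
# and no final LayerNorm follows the encoder stack.

D_IN = (8 + 21 + 21) * 3 + 3   # 150 keypoint coordinates + 3 presence flags
D_MODEL, D_FF, N_LAYERS, VOCAB = 256, 1024, 4, 2239


def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * d


stem = linear(D_IN, D_MODEL) + layer_norm(D_MODEL)
attention = 4 * linear(D_MODEL, D_MODEL)   # Q, K, V and output projections
feed_forward = linear(D_MODEL, D_FF) + linear(D_FF, D_MODEL)
encoder_layer = attention + feed_forward + 2 * layer_norm(D_MODEL)
head = linear(D_MODEL, VOCAB)

# Sinusoidal positional encodings are fixed, so they add no parameters.
total = stem + N_LAYERS * encoder_layer + head
print(f"input dim: {D_IN}, parameters: {total:,} (~{total / 1e6:.1f}M)")
# prints: input dim: 153, parameters: 3,774,399 (~3.8M)
```

Under these assumptions the count agrees with the ~3.8M figure in the table.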

Training data

  • Source: How2Sign realigned train split, ~31k sentence clips.
  • Pose extraction: MediaPipe Holistic (8 pose + 42 hand landmarks per frame).
  • Glosses: LLM-generated pseudo-glosses, not human-annotated. Noisier supervision than e.g. PHOENIX14T.
  • Vocabulary: 2239 tokens (full vocab, no <unk> cap).
  • Internal val: ~10% of training videos held out (group-by-video, seed=13). Official How2Sign val/test stay reserved for end-to-end pipeline evaluation since they have no glosses.
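
A group-by-video hold-out can be sketched as follows. This is a minimal illustration and not the splitter used for this model: the function name `split_by_video`, the clip id format, and the use of Python's `random` module are all assumptions. Only the 10% fraction, the grouping by video, and seed=13 come from the description above.

```python
import random
from collections import defaultdict


def split_by_video(clip_to_video, val_fraction=0.10, seed=13):
    """Hold out whole videos so no video contributes clips to both splits."""
    clips_by_video = defaultdict(list)
    for clip_id, video_id in clip_to_video.items():
        clips_by_video[video_id].append(clip_id)

    video_ids = sorted(clips_by_video)   # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(video_ids)
    n_val = max(1, round(len(video_ids) * val_fraction))
    val_videos = set(video_ids[:n_val])

    train, val = [], []
    for video_id, clips in clips_by_video.items():
        (val if video_id in val_videos else train).extend(clips)
    return train, val


# Hypothetical clip ids: 50 videos with 6 sentence clips each.
clip_to_video = {f"vid{v:02d}_{c}": f"vid{v:02d}" for v in range(50) for c in range(6)}
train_clips, val_clips = split_by_video(clip_to_video)
print(len(train_clips), len(val_clips))  # 270 30
```

Splitting at the video level rather than the clip level keeps sentences from the same recording out of both sets, which would otherwise make the internal validation score look better than it is.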

Evaluation

Metric                                                  Value
Internal val WER (overall, full vocab)                  95.39%
Internal val WER (sentences using only top-500 tokens)  92.70%

The overall WER is high because ~10k of the 18k vocab tokens occur <=3 times in the training set (long fingerspelled words, rare proper nouns). The head-only WER (sentences using only the top-500 tokens) is the more representative number for whether the pipeline learns the common signs.
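
For reference, the two numbers can be computed along the following lines. This is a sketch under assumptions, not the evaluation code behind the table: it assumes corpus-level WER (total edit distance over total reference length) and that the head-only subset is chosen by the reference gloss. The helper names and toy glosses are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def head_only(refs, hyps, head_tokens):
    """Keep only pairs whose reference uses head tokens exclusively."""
    kept = [(r, h) for r, h in zip(refs, hyps) if set(r) <= head_tokens]
    return [r for r, _ in kept], [h for _, h in kept]


# Toy glosses, invented for illustration.
refs = [["I", "LIKE", "COFFEE"], ["TODAY", "I", "MAKE", "RISOTTO"]]
hyps = [["I", "LIKE", "COFFEE"], ["TODAY", "I", "MAKE", "PASTA"]]
head = {"I", "LIKE", "COFFEE", "TODAY", "MAKE"}

print(f"overall WER:   {corpus_wer(refs, hyps):.2%}")
print(f"head-only WER: {corpus_wer(*head_only(refs, hyps, head)):.2%}")
```

In the toy data the only error is a rare token, so filtering to head-only sentences removes it. The same mechanism explains the gap between the two reported figures.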

Usage

import json, numpy as np, torch
from huggingface_hub import hf_hub_download
# Requires the PoseTransformerCSLR class from the training notebook.
from your_pkg import PoseTransformerCSLR

model = PoseTransformerCSLR.from_pretrained("manohonsy/how2sign-pose-cslr").eval()

vocab_path = hf_hub_download("manohonsy/how2sign-pose-cslr", "vocab.json")
with open(vocab_path) as f:
    vocab = json.load(f)["token_to_id"]
id_to_token = {i: t for t, i in vocab.items()}
blank_id = vocab["<blank>"]

# Run on a .npz produced by how2sign_prep (or your own MediaPipe Holistic pipeline)
data = np.load("your_sample.npz")
features, mask = data["features"].astype("float32"), data["mask"].astype(bool)

# (apply normalize_features + concat mask as in the training notebook)
# x: torch.FloatTensor (1, T, 153)
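# Inference only, so skip gradient tracking.
with torch.no_grad():
    log_probs, lens = model(x, torch.tensor([x.shape[1]]))

# Greedy CTC decoding: best token per frame, merge repeats, then drop blanks.
preds = log_probs.argmax(-1)[0].tolist()
collapsed, prev = [], None
for p in preds:
    if p != prev:
        collapsed.append(p)
    prev = p
gloss_ids = [p for p in collapsed if p != blank_id]
print(" ".join(id_to_token[i] for i in gloss_ids))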
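
The step that builds `x` is left to the training notebook. As a rough sketch of the concatenation part only, the snippet below appends the presence flags to the keypoints and adds a batch axis. The array shapes, (T, 150) for features and (T, 3) for the mask, and the helper name `assemble_input` are assumptions. The `normalize_features` step is deliberately omitted because its definition is not given here.

```python
import numpy as np


def assemble_input(features, mask):
    """Append per-frame presence flags to the keypoint coordinates.

    features: (T, 150) float array, mask: (T, 3) boolean array (assumed shapes).
    Returns a (1, T, 153) float32 array with a leading batch axis.
    """
    if features.ndim != 2 or features.shape[1] != 150:
        raise ValueError(f"expected (T, 150) features, got {features.shape}")
    if mask.shape != (features.shape[0], 3):
        raise ValueError(f"expected (T, 3) mask, got {mask.shape}")
    x = np.concatenate([features, mask.astype(np.float32)], axis=-1)
    return x[None].astype(np.float32)


# Random stand-in for one clip of 40 frames.
rng = np.random.default_rng(0)
features = rng.standard_normal((40, 150)).astype(np.float32)
mask = rng.random((40, 3)) > 0.2
x = assemble_input(features, mask)
print(x.shape)  # (1, 40, 153)
```

The result can be passed through `torch.from_numpy` to get the tensor the model expects. Any normalization from the training notebook would need to be applied to `features` before this step.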

Caveats

  • Pseudo-gloss supervision. Glosses are LLM-generated. Reported WER reflects fit to pseudo-glosses, not to ground-truth ASL gloss.
  • Long-tail vocabulary. ~10k tokens occur <=3 times in the training set, which makes them effectively unlearnable at this corpus size. Head-only WER is the meaningful signal.
  • Pose-only input. Misses appearance cues (fine handshape, non-manual markers). Trade-off documented in ADR-003.
  • MediaPipe sensitivity. Landmark quality drops in low light, partial occlusion, or non-frontal camera angles.
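
The long-tail caveat can be checked on any gloss corpus with a frequency count. The following is a minimal sketch using an invented toy corpus. The helper name `tail_stats` is hypothetical, and the threshold of 3 mirrors the "<=3 times" figure above.

```python
from collections import Counter


def tail_stats(gloss_sentences, max_count=3):
    """Return (rare types, total types, share of token occurrences that are rare)."""
    counts = Counter(tok for sent in gloss_sentences for tok in sent)
    rare = {tok for tok, c in counts.items() if c <= max_count}
    rare_occurrences = sum(counts[tok] for tok in rare)
    return len(rare), len(counts), rare_occurrences / sum(counts.values())


# Toy corpus, invented for illustration.
corpus = [["I", "LIKE", "COFFEE"]] * 5 + [
    ["I", "VISIT", "ALBUQUERQUE"],
    ["I", "LIKE", "RISOTTO"],
]
n_rare, n_types, rare_share = tail_stats(corpus)
print(f"{n_rare}/{n_types} types occur <= 3 times, covering {rare_share:.1%} of tokens")
# prints: 3/6 types occur <= 3 times, covering 14.3% of tokens
```

Rare types can make up a large share of the vocabulary while covering a small share of running tokens, yet a single rare token in a sentence is enough to add an error to its WER.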

License

MIT.

Related

Implements the project's ADR-003 (Pose-Based CSLR Front-End). See ADR-001 (gloss-based pipeline) and ADR-002 (mBART-50 translator) for the broader architecture.
