ksl-sign-keypoint-embedder

Korean Sign Language Keypoint Embedding Model

A lightweight Transformer encoder that maps Korean Sign Language (KSL) keypoint sequences to 128-dimensional L2-normalized embeddings, trained with Triplet Loss and Semi-hard Negative Mining. Designed for CPU-only deployment in closed-network environments (kiosks, hospitals, government offices).


Model Details

| Property | Value |
|---|---|
| Architecture | Transformer Encoder (CLS token, 2 layers, 4 heads, FF=1024) |
| Input | Keypoint sequence of shape (T=64, D=140) |
| Output | 128-dim L2-normalized embedding |
| Training objective | Triplet Loss, Semi-hard Negative Mining (A-P-Z-M) |
| Parameters | ~350K |
| Inference | CPU-only (no GPU required) |

Input keypoints (140 dims per frame): 42 hand landmarks (left + right, x/y) + 8 upper-body pose landmarks + 19 facial keypoints, extracted with MediaPipe Holistic, then shoulder-centered and normalized.
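The shoulder-centering step can be sketched as follows. This is an illustration only, not the repo's preprocessing code: the landmark indices (42/43 as the two shoulders in a 70-point x/y layout) and the shoulder-width scaling are assumptions.

```python
import numpy as np

# Assumed layout: 70 (x, y) points per frame; indices 42/43 stand in for the
# left/right shoulders (illustrative, not the repo's actual landmark order).
L_SHOULDER, R_SHOULDER = 42, 43

def normalize_keypoints(seq: np.ndarray) -> np.ndarray:
    """Shoulder-center each frame and scale by shoulder width (sketch)."""
    T, D = seq.shape
    pts = seq.reshape(T, D // 2, 2)                     # (T, 70, 2)
    center = (pts[:, L_SHOULDER] + pts[:, R_SHOULDER]) / 2
    pts = pts - center[:, None, :]                      # translate to shoulder midpoint
    width = np.linalg.norm(pts[:, L_SHOULDER] - pts[:, R_SHOULDER], axis=-1)
    pts = pts / np.maximum(width, 1e-6)[:, None, None]  # scale so shoulder width == 1
    return pts.reshape(T, D).astype(np.float32)
```

Centering and scaling per frame makes the embedding invariant to the signer's position and distance from the camera.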


Training Data

Trained on the AI Hub Korean Sign Language Video Dataset, a large-scale KSL corpus of 536K+ sign videos with pre-extracted keypoints spanning the directions, transportation, and address domains.

  • Vocabulary: 3,000 Korean sign words
  • Split: by signer (signer-independent); no signer appears in both train and test
  • Samples: train 35,978 / val 9,000 / test 9,000

๋ฐ์ดํ„ฐ ์ถœ์ฒ˜: ๊ตญ๋ฆฝ๊ตญ์–ด์›ยท๊ณผํ•™๊ธฐ์ˆ ์ •๋ณดํ†ต์‹ ๋ถ€ AI Hub ํ•œ๊ตญ์ˆ˜์–ด ์˜์ƒ ๋ฐ์ดํ„ฐ์…‹ (aihub.or.kr)


Performance

Evaluated on 9,000 test queries across 3,000 Korean sign words (FAISS flat L2 index, gallery = training set embeddings):

| Metric | Score |
|---|---|
| Top-1 Accuracy | 86.3% |
| Top-5 Accuracy | 97.3% |
| Top-10 Accuracy | 98.6% |
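For reference, Top-K accuracy over a flat L2 gallery can be reproduced with exact brute-force search, which is what a FAISS flat index computes. `top_k_accuracy` is a hypothetical helper written for this sketch, not part of the repo:

```python
import numpy as np

def top_k_accuracy(gallery, gallery_labels, queries, query_labels, ks=(1, 5, 10)):
    """Exact L2 retrieval accuracy (equivalent to a FAISS flat L2 index).

    Fine as-is for small galleries; chunk the queries for the full
    36K-vector gallery to bound memory.
    """
    # Squared L2 distances between every query and every gallery vector: (Q, G)
    d2 = ((queries[:, None, :] - gallery[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)[:, : max(ks)]
    hits = {k: 0 for k in ks}
    for row, true_label in zip(order, query_labels):
        retrieved = [gallery_labels[i] for i in row]
        for k in ks:
            hits[k] += true_label in retrieved[:k]
    return {k: hits[k] / len(query_labels) for k in ks}
```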

Training: 100 epochs, batch 32, lr=5e-4, triplet margin=0.1, cosine LR schedule with 3-epoch warmup. Best checkpoint at epoch 97.
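The training objective can be sketched as batch-wise triplet loss with semi-hard negative mining: for each anchor-positive pair, pick a negative that is farther than the positive but still inside the margin band. This is an illustration of the technique under those standard definitions, not the exact training code (the fallback to the hardest negative when no semi-hard one exists is also an assumption):

```python
import torch

def semi_hard_triplet_loss(emb: torch.Tensor, labels: torch.Tensor,
                           margin: float = 0.1) -> torch.Tensor:
    """Triplet loss with batch-wise semi-hard negative mining (sketch).

    emb:    (B, 128) L2-normalized embeddings
    labels: (B,) integer word ids
    """
    dist = torch.cdist(emb, emb)                 # (B, B) pairwise L2 distances
    same = labels[:, None] == labels[None, :]
    idx = torch.arange(emb.size(0))
    losses = []
    for a in range(emb.size(0)):
        positives = idx[same[a] & (idx != a)]
        negatives = idx[~same[a]]
        if len(positives) == 0 or len(negatives) == 0:
            continue
        for p in positives:
            d_ap = dist[a, p]
            d_an = dist[a, negatives]
            # Semi-hard: farther than the positive, but inside the margin band.
            band = d_an[(d_an > d_ap) & (d_an < d_ap + margin)]
            d_n = band.min() if len(band) > 0 else d_an.min()  # else hardest negative
            losses.append(torch.clamp(d_ap - d_n + margin, min=0.0))
    return torch.stack(losses).mean() if losses else emb.sum() * 0.0
```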


Usage

1. Load the model

```python
import torch
from model import SignEncoder

encoder = SignEncoder(input_dim=140)  # preset B: pose + hands + face, xy only
state = torch.load("best_model.pt", map_location="cpu")
encoder.load_state_dict(state)
encoder.eval()
```
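`SignEncoder` ships with the repo. For readers without the code, a minimal sketch consistent with the stated spec (CLS token, 2 layers, 4 heads, FF=1024, 128-dim L2-normalized output) looks like this; the model width `d_model=64` is an assumption chosen so the parameter count lands near the stated ~350K:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignEncoderSketch(nn.Module):
    """Minimal stand-in for SignEncoder (illustrative, not the released code)."""

    def __init__(self, input_dim=140, d_model=64, nhead=4, num_layers=2,
                 dim_ff=1024, emb_dim=128, max_len=64):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))          # learned CLS token
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, emb_dim)

    def forward(self, x):                            # x: (B, T, 140)
        h = self.proj(x)                             # (B, T, d_model)
        cls = self.cls.expand(h.size(0), -1, -1)     # prepend CLS to the sequence
        h = torch.cat([cls, h], dim=1) + self.pos[:, : x.size(1) + 1]
        h = self.encoder(h)
        return F.normalize(self.head(h[:, 0]), dim=-1)  # (B, 128), L2-normalized
```

Reading the embedding off the CLS position gives a fixed-size summary of the whole sequence regardless of clip content.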

2. Embed a keypoint sequence

```python
import numpy as np
from preprocess import load_npz_keypoints

# Load a single sign clip (.npz from AI Hub)
seq = load_npz_keypoints("path/to/sign.npz", preset="B", target_length=64)
# seq: (64, 140) float32

with torch.no_grad():
    x = torch.from_numpy(seq).unsqueeze(0)        # (1, 64, 140)
    emb = encoder(x)                              # (1, 128), L2-normalized
    emb_np = emb.cpu().numpy().astype("float32")  # FAISS requires float32
```

3. Build a FAISS index (gallery)

```python
from faiss_db import build_index

# dataset: {word: {signer: [np.ndarray(64, 140), ...]}}
index, labels = build_index(encoder, dataset, d=128, save_path="kg/sign_kg.faiss")
```

4. Top-K retrieval

```python
from faiss_db import load_index, search

index, labels = load_index("kg/sign_kg.faiss")
results = search(index, labels, emb_np, top_k=5)
# [("hospital", 0.021), ("clinic", 0.134), ...]

for word, dist in results:
    print(f"{word}: L2={dist:.4f}")
```

Intended Use

  • Sign Language Recognition (SLR): encode a segmented sign clip → FAISS Top-K search → word candidates
  • RAG-based vocabulary expansion: add new sign words to the index without retraining
  • Downstream pipeline: feeds into an sLLM (EXAONE 3.5 2.4B) for KSL โ†’ Korean grammar conversion

Limitations

  • Vocabulary is limited to the 3,000 words present in the AI Hub Sign Language Video Dataset used during training
  • Left/right mirror augmentation was intentionally excluded โ€” handedness is semantically meaningful in KSL
  • Performance may degrade on signers not represented in the training split (signer diversity is bounded by the dataset)