# ksl-sign-keypoint-embedder

**Korean Sign Language Keypoint Embedding Model**
A lightweight Transformer encoder that maps Korean Sign Language (KSL) keypoint sequences to 128-dimensional L2-normalized embeddings, trained with Triplet Loss and Semi-hard Negative Mining. Designed for CPU-only deployment in closed-network environments (kiosks, hospitals, government offices).
## Model Details
| Property | Value |
|---|---|
| Architecture | Transformer Encoder (CLS token, 2 layers, 4 heads, FF=1024) |
| Input | Keypoint sequence of shape (T=64, D=140) |
| Output | 128-dim L2-normalized embedding |
| Training objective | Triplet loss with semi-hard negative mining (anchor-positive-negative triplets) |
| Parameters | ~350K |
| Inference | CPU-only (no GPU required) |
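
A rough PyTorch sketch of an encoder matching this table (the class name and internal choices such as `d_model=128` and learned positional embeddings are assumptions for illustration, not the repository's exact `SignEncoder`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignEncoderSketch(nn.Module):
    """Illustrative only: CLS-token Transformer encoder matching the
    spec above (2 layers, 4 heads, FF=1024, 128-dim output)."""

    def __init__(self, input_dim=140, d_model=128, n_layers=2,
                 n_heads=4, ff_dim=1024, max_len=64):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)          # per-frame projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                                  # x: (B, T, D)
        h = self.proj(x)                                   # (B, T, d_model)
        cls = self.cls_token.expand(h.size(0), -1, -1)     # (B, 1, d_model)
        h = torch.cat([cls, h], dim=1)                     # prepend CLS token
        h = h + self.pos[:, : h.size(1)]                   # learned positions
        h = self.encoder(h)
        return F.normalize(h[:, 0], dim=-1)                # CLS embedding, L2-normalized
```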
Input keypoints (140-dim per frame): 42 hand landmarks (left + right, x/y), 8 upper-body pose landmarks, and 19 facial keypoints, extracted with MediaPipe Holistic, then shoulder-centered and normalized.
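
The shipped `preprocess` module performs the shoulder-centering; as a hedged sketch of the idea (the function name and the shoulder landmark indices below are hypothetical):

```python
import numpy as np

def shoulder_normalize(frames, l_sh=0, r_sh=1):
    """Center each frame on the shoulder midpoint and scale by shoulder
    width. frames: (T, num_landmarks, 2); the left/right shoulder
    landmark indices (l_sh, r_sh) are illustrative."""
    left, right = frames[:, l_sh], frames[:, r_sh]         # (T, 2) each
    center = (left + right) / 2.0                          # shoulder midpoint
    width = np.linalg.norm(left - right, axis=-1, keepdims=True)
    width = np.maximum(width, 1e-6)                        # avoid divide-by-zero
    out = (frames - center[:, None, :]) / width[:, None, :]
    return out.reshape(frames.shape[0], -1).astype("float32")  # (T, 2*num_landmarks)
```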
## Training Data
Trained on the AI Hub Korean Sign Language Video Dataset, a large-scale KSL corpus of 536K+ sign videos with pre-extracted keypoints across the directions, transportation, and address domains.
- Vocabulary: 3,000 Korean sign words
- Split: by signer (signer-independent); no signer appears in both train and test (see the split sketch below)
- Samples: train 35,978 / val 9,000 / test 9,000
Data source: AI Hub Korean Sign Language Video Dataset, National Institute of Korean Language · Ministry of Science and ICT (aihub.or.kr)
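
Since the split is signer-disjoint, whole signers are assigned to a single partition. A minimal sketch of such a split (the `"signer"` metadata key and fractions are assumptions):

```python
import random

def split_by_signer(samples, val_frac=0.17, test_frac=0.17, seed=0):
    """Assign whole signers to train/val/test so no signer crosses
    splits. `samples` is a list of dicts with a 'signer' key
    (a hypothetical metadata layout)."""
    signers = sorted({s["signer"] for s in samples})
    random.Random(seed).shuffle(signers)
    n_val = int(len(signers) * val_frac)
    n_test = int(len(signers) * test_frac)
    val_set = set(signers[:n_val])
    test_set = set(signers[n_val:n_val + n_test])
    train = [s for s in samples if s["signer"] not in val_set | test_set]
    val = [s for s in samples if s["signer"] in val_set]
    test = [s for s in samples if s["signer"] in test_set]
    return train, val, test
```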
## Performance
Evaluated on 9,000 test queries across 3,000 Korean sign words (FAISS flat L2 index, gallery = training set embeddings):
| Metric | Score |
|---|---|
| Top-1 Accuracy | 86.3% |
| Top-5 Accuracy | 97.3% |
| Top-10 Accuracy | 98.6% |
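
Top-K accuracy counts a query as correct when its ground-truth word appears among the labels of the K nearest gallery embeddings. A hedged sketch of that computation against a FAISS flat index (the helper name is illustrative):

```python
import numpy as np
import faiss

def top_k_accuracy(index, gallery_labels, query_embs, query_labels, k=5):
    """query_embs: (N, 128) float32, L2-normalized; labels are word strings."""
    _, idx = index.search(query_embs, k)          # (N, k) gallery row ids
    hits = sum(
        true in {gallery_labels[j] for j in row}
        for row, true in zip(idx, query_labels)
    )
    return hits / len(query_labels)
```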
Training: 100 epochs, batch size 32, lr=5e-4, triplet margin=0.1, cosine LR schedule with a 3-epoch warmup. Best checkpoint at epoch 97.
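
For reference, semi-hard mining picks negatives that are farther from the anchor than the positive but still inside the margin. A compact sketch of the objective (not the repository's exact trainer; the fallback to the hardest negative when no semi-hard one exists is an assumption):

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(emb, labels, margin=0.1):
    """emb: (B, 128) L2-normalized embeddings; labels: (B,) int word ids.
    For each anchor-positive pair, choose a negative n with
    d(a, p) < d(a, n) < d(a, p) + margin when one exists,
    otherwise fall back to the hardest (closest) negative."""
    d = torch.cdist(emb, emb)                     # (B, B) pairwise L2 distances
    same = labels[:, None] == labels[None, :]     # same-word mask
    B = emb.size(0)
    losses = []
    for a in range(B):
        pos = [p for p in range(B) if same[a, p] and p != a]
        neg = [n for n in range(B) if not same[a, n]]
        if not pos or not neg:
            continue
        dn_all = d[a, neg]
        for p in pos:
            dp = d[a, p]
            semi = dn_all[(dn_all > dp) & (dn_all < dp + margin)]
            dn = semi.min() if semi.numel() else dn_all.min()
            losses.append(F.relu(dp - dn + margin))
    return torch.stack(losses).mean() if losses else d.sum() * 0.0
```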
## Usage

### 1. Load the model
```python
import torch
from model import SignEncoder

encoder = SignEncoder(input_dim=140)  # preset B: pose+hands+face, xy only
state = torch.load("best_model.pt", map_location="cpu")
encoder.load_state_dict(state)
encoder.eval()
```
### 2. Embed a keypoint sequence
```python
import numpy as np
import torch
from preprocess import load_npz_keypoints

# Load a single sign clip (.npz from AI Hub)
seq = load_npz_keypoints("path/to/sign.npz", preset="B", target_length=64)
# seq: (64, 140) float32

with torch.no_grad():
    x = torch.from_numpy(seq).unsqueeze(0)        # (1, 64, 140)
    emb = encoder(x)                              # (1, 128), L2-normalized

emb_np = emb.cpu().numpy().astype("float32")      # FAISS requires float32
```
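
Clips vary in length, so they must be brought to T=64 before embedding. One common approach, assumed here rather than confirmed for `load_npz_keypoints`, is uniform temporal resampling:

```python
import numpy as np

def resample_sequence(seq, target_length=64):
    """Uniformly resample a (T, 140) keypoint sequence to a fixed
    number of frames by nearest-frame index selection."""
    t = seq.shape[0]
    idx = np.linspace(0, t - 1, target_length).round().astype(int)
    return seq[idx].astype("float32")             # (target_length, 140)
```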
### 3. Build a FAISS index (gallery)
```python
from faiss_db import build_index

# dataset: {word: {signer: [np.ndarray(64, 140), ...]}}
index, labels = build_index(encoder, dataset, d=128, save_path="kg/sign_kg.faiss")
```
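
Under the hood, `build_index` amounts to embedding every gallery clip and adding the vectors to an exact L2 index; a minimal sketch with raw FAISS calls (the function name is illustrative, and the nested `dataset` layout follows the comment above):

```python
import faiss
import numpy as np
import torch

def build_flat_index(encoder, dataset, d=128):
    """Embed every clip and add it to an exact (flat) L2 index.
    Returns the index plus a parallel list mapping row -> word label."""
    embs, labels = [], []
    with torch.no_grad():
        for word, signers in dataset.items():
            for clips in signers.values():
                for seq in clips:                 # seq: (64, 140) float32
                    x = torch.from_numpy(seq).unsqueeze(0)
                    embs.append(encoder(x).numpy())
                    labels.append(word)
    index = faiss.IndexFlatL2(d)
    index.add(np.concatenate(embs).astype("float32"))
    return index, labels
```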
### 4. Top-K retrieval
```python
from faiss_db import load_index, search

index, labels = load_index("kg/sign_kg.faiss")
results = search(index, labels, emb_np, top_k=5)
# [("hospital", 0.021), ("clinic", 0.134), ...]

for word, dist in results:
    print(f"{word}: L2={dist:.4f}")
```
## Intended Use
- Sign Language Recognition (SLR): encode a segmented sign clip → FAISS Top-K search → word candidates
- RAG-based vocabulary expansion: add new sign words to the index without retraining (see the sketch after this list)
- Downstream pipeline: feeds into an sLLM (EXAONE 3.5 2.4B) for KSL → Korean grammar conversion
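
For the vocabulary-expansion use case above, new words can be appended to the live index without touching the encoder; a hypothetical sketch (`new_seq` is a preprocessed clip of the new word, reusing `encoder`, `index`, and `labels` from the Usage section):

```python
# Hypothetical sketch: registering a new sign word without retraining.
import numpy as np
import torch

with torch.no_grad():
    emb = encoder(torch.from_numpy(new_seq).unsqueeze(0))   # (1, 128)
index.add(emb.numpy().astype("float32"))   # append one row to the gallery
labels.append("new_word")                  # keep the label list in sync
```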
## Limitations
- Vocabulary is limited to the 3,000 words present in the AI Hub Sign Language Video Dataset used during training
- Left/right mirror augmentation was intentionally excluded; handedness is semantically meaningful in KSL
- Performance may degrade on signers not represented in the training split (signer diversity is bounded by the dataset)