# ksl-sign-keypoint-embedder

**Korean Sign Language Keypoint Embedding Model**
A lightweight Transformer encoder that maps Korean Sign Language (KSL) keypoint sequences to 128-dimensional L2-normalized embeddings, trained with Triplet Loss and Semi-hard Negative Mining. Designed for CPU-only deployment in closed-network environments (kiosks, hospitals, government offices).
## Model Details
| Property | Value |
|---|---|
| Architecture | Transformer Encoder (CLS token, 2 layers, 4 heads, FF=1024) |
| Input | Keypoint sequence of shape (T=64, D=140) |
| Output | 128-dim L2-normalized embedding |
| Training objective | Triplet loss with semi-hard negative mining (anchor-positive-negative triplets) |
| Parameters | ~350K |
| Inference | CPU-only (no GPU required) |
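
A rough PyTorch sketch of an encoder matching this table (the class name and internal choices such as `d_model=128` and learned positional embeddings are assumptions for illustration, not the repository's exact `SignEncoder`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignEncoderSketch(nn.Module):
    """Illustrative only: CLS-token Transformer encoder matching the
    spec above (2 layers, 4 heads, FF=1024, 128-dim output)."""

    def __init__(self, input_dim=140, d_model=128, n_layers=2,
                 n_heads=4, ff_dim=1024, max_len=64):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)          # per-frame projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                                  # x: (B, T, D)
        h = self.proj(x)                                   # (B, T, d_model)
        cls = self.cls_token.expand(h.size(0), -1, -1)     # (B, 1, d_model)
        h = torch.cat([cls, h], dim=1)                     # prepend CLS token
        h = h + self.pos[:, : h.size(1)]                   # learned positions
        h = self.encoder(h)
        return F.normalize(h[:, 0], dim=-1)                # CLS embedding, L2-normalized
```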
Input keypoints (140-dim per frame): 42 hand landmarks (left + right, x/y), 8 upper-body pose landmarks, and 19 facial keypoints, extracted with MediaPipe Holistic, then shoulder-centered and normalized.
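
The shipped `preprocess` module performs the shoulder-centering; as a hedged sketch of the idea (the function name and the shoulder landmark indices below are hypothetical):

```python
import numpy as np

def shoulder_normalize(frames, l_sh=0, r_sh=1):
    """Center each frame on the shoulder midpoint and scale by shoulder
    width. frames: (T, num_landmarks, 2); the left/right shoulder
    landmark indices (l_sh, r_sh) are illustrative."""
    left, right = frames[:, l_sh], frames[:, r_sh]         # (T, 2) each
    center = (left + right) / 2.0                          # shoulder midpoint
    width = np.linalg.norm(left - right, axis=-1, keepdims=True)
    width = np.maximum(width, 1e-6)                        # avoid divide-by-zero
    out = (frames - center[:, None, :]) / width[:, None, :]
    return out.reshape(frames.shape[0], -1).astype("float32")  # (T, 2*num_landmarks)
```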
## Training Data
Trained on the AI Hub Korean Sign Language Video Dataset, a large-scale KSL corpus of 536K+ sign videos with pre-extracted keypoints across the directions, transportation, and address domains.
- Vocabulary: 3,000 Korean sign words
- Split: by signer (signer-independent); no signer appears in both train and test (see the split sketch below)
- Samples: train 35,978 / val 9,000 / test 9,000
Data source: AI Hub Korean Sign Language Video Dataset, National Institute of Korean Language · Ministry of Science and ICT (aihub.or.kr)
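
Since the split is signer-disjoint, whole signers are assigned to a single partition. A minimal sketch of such a split (the `"signer"` metadata key and fractions are assumptions):

```python
import random

def split_by_signer(samples, val_frac=0.17, test_frac=0.17, seed=0):
    """Assign whole signers to train/val/test so no signer crosses
    splits. `samples` is a list of dicts with a 'signer' key
    (a hypothetical metadata layout)."""
    signers = sorted({s["signer"] for s in samples})
    random.Random(seed).shuffle(signers)
    n_val = int(len(signers) * val_frac)
    n_test = int(len(signers) * test_frac)
    val_set = set(signers[:n_val])
    test_set = set(signers[n_val:n_val + n_test])
    train = [s for s in samples if s["signer"] not in val_set | test_set]
    val = [s for s in samples if s["signer"] in val_set]
    test = [s for s in samples if s["signer"] in test_set]
    return train, val, test
```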
## Performance
Evaluated on 9,000 test queries across 3,000 Korean sign words (FAISS flat L2 index, gallery = training set embeddings):
| Metric | Score |
|---|---|
| Top-1 Accuracy | 86.3% |
| Top-5 Accuracy | 97.3% |
| Top-10 Accuracy | 98.6% |
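
Top-K accuracy counts a query as correct when its ground-truth word appears among the labels of the K nearest gallery embeddings. A hedged sketch of that computation against a FAISS flat index (the helper name is illustrative):

```python
import numpy as np
import faiss

def top_k_accuracy(index, gallery_labels, query_embs, query_labels, k=5):
    """query_embs: (N, 128) float32, L2-normalized; labels are word strings."""
    _, idx = index.search(query_embs, k)          # (N, k) gallery row ids
    hits = sum(
        true in {gallery_labels[j] for j in row}
        for row, true in zip(idx, query_labels)
    )
    return hits / len(query_labels)
```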
Training: 100 epochs, batch size 32, lr=5e-4, triplet margin=0.1, cosine LR schedule with a 3-epoch warmup. Best checkpoint at epoch 97.
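
For reference, semi-hard mining picks negatives that are farther from the anchor than the positive but still inside the margin. A compact sketch of the objective (not the repository's exact trainer; the fallback to the hardest negative when no semi-hard one exists is an assumption):

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(emb, labels, margin=0.1):
    """emb: (B, 128) L2-normalized embeddings; labels: (B,) int word ids.
    For each anchor-positive pair, choose a negative n with
    d(a, p) < d(a, n) < d(a, p) + margin when one exists,
    otherwise fall back to the hardest (closest) negative."""
    d = torch.cdist(emb, emb)                     # (B, B) pairwise L2 distances
    same = labels[:, None] == labels[None, :]     # same-word mask
    B = emb.size(0)
    losses = []
    for a in range(B):
        pos = [p for p in range(B) if same[a, p] and p != a]
        neg = [n for n in range(B) if not same[a, n]]
        if not pos or not neg:
            continue
        dn_all = d[a, neg]
        for p in pos:
            dp = d[a, p]
            semi = dn_all[(dn_all > dp) & (dn_all < dp + margin)]
            dn = semi.min() if semi.numel() else dn_all.min()
            losses.append(F.relu(dp - dn + margin))
    return torch.stack(losses).mean() if losses else d.sum() * 0.0
```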
## Usage

### 1. Load the model
```python
import torch
from model import SignEncoder

encoder = SignEncoder(input_dim=140)  # preset B: pose+hands+face, xy only
state = torch.load("best_model.pt", map_location="cpu")
encoder.load_state_dict(state)
encoder.eval()
```
### 2. Embed a keypoint sequence
```python
import numpy as np
import torch
from preprocess import load_npz_keypoints

# Load a single sign clip (.npz from AI Hub)
seq = load_npz_keypoints("path/to/sign.npz", preset="B", target_length=64)
# seq: (64, 140) float32

with torch.no_grad():
    x = torch.from_numpy(seq).unsqueeze(0)        # (1, 64, 140)
    emb = encoder(x)                              # (1, 128), L2-normalized

emb_np = emb.cpu().numpy().astype("float32")      # FAISS requires float32
```
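
Clips vary in length, so they must be brought to T=64 before embedding. One common approach, assumed here rather than confirmed for `load_npz_keypoints`, is uniform temporal resampling:

```python
import numpy as np

def resample_sequence(seq, target_length=64):
    """Uniformly resample a (T, 140) keypoint sequence to a fixed
    number of frames by nearest-frame index selection."""
    t = seq.shape[0]
    idx = np.linspace(0, t - 1, target_length).round().astype(int)
    return seq[idx].astype("float32")             # (target_length, 140)
```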
### 3. Build a FAISS index (gallery)
```python
from faiss_db import build_index

# dataset: {word: {signer: [np.ndarray(64, 140), ...]}}
index, labels = build_index(encoder, dataset, d=128, save_path="kg/sign_kg.faiss")
```
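
Under the hood, `build_index` amounts to embedding every gallery clip and adding the vectors to an exact L2 index; a minimal sketch with raw FAISS calls (the function name is illustrative, and the nested `dataset` layout follows the comment above):

```python
import faiss
import numpy as np
import torch

def build_flat_index(encoder, dataset, d=128):
    """Embed every clip and add it to an exact (flat) L2 index.
    Returns the index plus a parallel list mapping row -> word label."""
    embs, labels = [], []
    with torch.no_grad():
        for word, signers in dataset.items():
            for clips in signers.values():
                for seq in clips:                 # seq: (64, 140) float32
                    x = torch.from_numpy(seq).unsqueeze(0)
                    embs.append(encoder(x).numpy())
                    labels.append(word)
    index = faiss.IndexFlatL2(d)
    index.add(np.concatenate(embs).astype("float32"))
    return index, labels
```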
### 4. Top-K retrieval
```python
from faiss_db import load_index, search

index, labels = load_index("kg/sign_kg.faiss")
results = search(index, labels, emb_np, top_k=5)
# [("hospital", 0.021), ("clinic", 0.134), ...]

for word, dist in results:
    print(f"{word}: L2={dist:.4f}")
```
## Intended Use
- Sign Language Recognition (SLR): encode a segmented sign clip → FAISS Top-K search → word candidates
- RAG-based vocabulary expansion: add new sign words to the index without retraining (see the sketch after this list)
- Downstream pipeline: feeds into an sLLM (EXAONE 3.5 2.4B) for KSL → Korean grammar conversion
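
For the vocabulary-expansion use case above, new words can be appended to the live index without touching the encoder; a hypothetical sketch (`new_seq` is a preprocessed clip of the new word, reusing `encoder`, `index`, and `labels` from the Usage section):

```python
# Hypothetical sketch: registering a new sign word without retraining.
import numpy as np
import torch

with torch.no_grad():
    emb = encoder(torch.from_numpy(new_seq).unsqueeze(0))   # (1, 128)
index.add(emb.numpy().astype("float32"))   # append one row to the gallery
labels.append("new_word")                  # keep the label list in sync
```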
## Limitations
- Vocabulary is limited to the 3,000 words present in the AI Hub Sign Language Video Dataset used during training
- Left/right mirror augmentation was intentionally excluded; handedness is semantically meaningful in KSL
- Performance may degrade on signers not represented in the training split (signer diversity is bounded by the dataset)