KSL Word Recognition Model Repo

This folder is a Hugging Face-compatible model package for word-level Korean Sign Language recognition. It contains the model structure, inference contract, and the MediaPipe MVP runtime used by the backend.

Current MVP Artifact

The repository loads a lightweight MediaPipe MVP artifact when mediapipe_mvp.joblib is present. This artifact is trained for a controlled one-person proof-of-concept, not for production-quality Korean Sign Language translation.

Current MVP vocabulary:

수어, 좋다, 감사, 괜찮다, 싫다, 이해, 부탁, 모르다, 맞다, 힘

Measured locally on the development machine:

Validation accuracy: 98.0% on a 50-sample held-out signer split
Training samples: 900 balanced samples, 10 labels, 5 camera angles
Mean inference latency: about 112.3 ms/frame on CPU
Approximate throughput: about 8.9 fps

Important limitation: this metric is from controlled AIHub word clips and a small 10-word vocabulary. It is suitable for a one-person demo, but it is not evidence of robust real-world meeting performance.

Training command used for the current artifact:

python training\train_mediapipe_mvp.py --data-root "D:\수어 영상" --cache-dir "D:\ksl_cache\mediapipe_mvp_features" --max-per-label 90 --sequence-length 16 --angles F D U L R --classifier extra_trees --confidence-threshold 0.45

Benchmark command:

python training\benchmark_mediapipe_mvp.py --data-root "D:\수어 영상" --cache-dir "D:\ksl_cache\mediapipe_mvp_features" --max-frames 120

The trained mediapipe_mvp.joblib file is intentionally not tracked in GitHub. It should be distributed through Hugging Face.

Hugging Face model repo used by the backend:

TechieMoon/realtime-ksl-captioning-mediapipe-mvp

Backend environment:

$env:MODEL_BACKEND = "huggingface"
$env:HF_MODEL_ID = "TechieMoon/realtime-ksl-captioning-mediapipe-mvp"
$env:MODEL_DEVICE = "cpu"
uvicorn app.main:app --host 0.0.0.0 --port 8000

This MVP artifact does not contain the AIHub video dataset. It only contains the trained lightweight classifier and inference code.

Runtime Structure

RGB webcam frame
  -> MediaPipe MVP recognizer when mediapipe_mvp.joblib exists
     - resizes large RGB frames to max width 640 before MediaPipe
     - extracts pose and hand landmarks
     - classifies a short keypoint sequence with a lightweight classifier
  -> otherwise YoloRoiExtractor
     - uses yolo/yolo.pt when present and enabled
     - falls back to the full frame when no detector checkpoint exists
  -> VideoMAEWordClassifier
     - keeps a short frame clip buffer
     - loads a transformers VideoMAE checkpoint from videomae/ when present
     - returns unknown until enough clip frames and a checkpoint are available
  -> KslWordRecognizer
     - confidence thresholding
     - short-window smoothing
     - word-level caption JSON

The current repository intentionally does not include trained checkpoints. Add them later with this layout:

ai_model/
  inference.py
  modeling.py
  labels.json
  model_config.json
  preprocessor_config.json
  yolo/
    yolo.pt
  videomae/
    config.json
    model.safetensors
    preprocessor_config.json

Contract

def load_model(model_dir: str, device: str):
    ...

def predict(model, frames: list, timestamps_ms: list[int | None]) -> list[dict]:
    ...

Each frame must be an RGB numpy.ndarray with shape (height, width, 3) and dtype uint8.

Output:

{
  "text": "안녕하세요",
  "words": [
    {
      "text": "안녕하세요",
      "confidence": 0.91,
      "start_ms": 1200,
      "end_ms": 1700
    }
  ],
  "is_final": true
}

If no checkpoint exists, inference stays conservative and returns an empty caption:

{
  "text": "",
  "words": [],
  "is_final": false
}

Config

model_config.json controls the optional detector and classifier:

{
  "roi_detector": {
    "enabled": false,
    "weights_path": "yolo/yolo.pt"
  },
  "classifier": {
    "checkpoint_path": "videomae",
    "clip_size": 8
  }
}

Set roi_detector.enabled to true after adding YOLO weights. The classifier automatically loads videomae/ when that checkpoint directory exists.

AIHub Training Fit

For AIHub 수어 영상 데이터, the later training side should create word-level clips from video annotations and fine-tune a VideoMAE classifier with label IDs aligned to labels.json. The resulting Hugging Face checkpoint can be saved into videomae/; no backend changes are required.

YOLO can be trained or fine-tuned separately for signer/hand/upper-body ROI detection, then exported as yolo/yolo.pt.

Local Smoke Test

From the repository root:

cd server
$env:MODEL_BACKEND = "huggingface"
$env:HF_MODEL_ID = "..\ai_model"
$env:MODEL_DEVICE = "cpu"
pytest tests/test_ai_model_contract.py tests/test_huggingface_adapter.py

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Video Classification

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support