Hand Gesture Recognition - ViT-B/16 (Clean Baseline)

Model Description

This Vision Transformer (ViT-B/16) model is fine-tuned on the HaGRID dataset for real-time hand gesture recognition. It is a clean baseline implementation with no experimental augmentation or architectural modifications.

Model Version: clean-baseline-v1.0
Architecture: ViT-B/16 (224×224 input)
Parameters: ~86M
Framework: PyTorch + timm

Performance Metrics

Metric                         Value
Test Accuracy                  0.8327 (83.27%)
Validation Accuracy            0.8255 (82.55%)
Validation F1 Score            0.8330 (83.30%)
Training Accuracy              1.0000 (100.00%)
Confusion Matrix Diagonality   0.8327 (83.27%)

Training Details

Dataset

  • HaGRID (Hand Gesture Recognition Image Dataset), 18 gesture classes

Training Configuration

  • Batch Size: 32
  • Learning Rate: 0.0001
  • Weight Decay: 0.01
  • Optimizer: AdamW
  • Scheduler: CosineAnnealingLR
  • Epochs Trained: 15/15
  • Label Smoothing: 0.0 (disabled)
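
In PyTorch terms, this configuration maps onto a standard timm + AdamW setup. A minimal sketch, assuming the training notebook follows the usual pattern (variable names here are illustrative, not taken from the notebook):

import torch
import timm

# Pretrained ViT-B/16 backbone with an 18-class head; the exact pretrained
# weight tag is an assumption (see Acknowledgments for the backbone source)
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=18)

NUM_EPOCHS = 15
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.0)  # label smoothing disabled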

Data Processing

  • Preprocessing: Single deterministic pipeline (no augmentation)
  • Resize: 224×224
  • Normalization: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  • Augmentation: none (clean baseline principle)
  • Test-Time Augmentation: False (disabled)

Architecture Details

  • Type: standard ViT-B/16 (no modifications)
  • Classification Head: single_linear (768 → 18 classes)
  • Hidden Dimensions: None
  • Dropout: 0.0

Gesture Classes (18 total)

  • call
  • dislike
  • fist
  • four
  • like
  • mute
  • ok
  • one
  • palm
  • peace
  • peace_inverted
  • rock
  • stop
  • stop_inverted
  • three
  • three2
  • two_up
  • two_up_inverted

Usage

Installation

pip install torch timm pillow opencv-python mediapipe

Quick Start

import torch
import timm
from PIL import Image
from torchvision import transforms

# Load model
model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=18)
checkpoint = torch.load('best_model.pth', map_location='cpu')

# Handle both full checkpoint and state_dict formats
if 'model_state_dict' in checkpoint:
    model.load_state_dict(checkpoint['model_state_dict'])
else:
    model.load_state_dict(checkpoint)

model.eval()

# Prepare transform (MUST match training exactly)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Inference
image = Image.open('hand_gesture.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
    probabilities = torch.softmax(output, dim=1)
    predicted_class = output.argmax(dim=1).item()
    confidence = probabilities[0, predicted_class].item()

class_names = ['call', 'dislike', 'fist', 'four', 'like', 'mute', 'ok', 'one', 'palm', 'peace', 'peace_inverted', 'rock', 'stop', 'stop_inverted', 'three', 'three2', 'two_up', 'two_up_inverted']
print(f"Predicted: {class_names[predicted_class]} ({confidence:.2%})")

Download from Hugging Face

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="YOUR_USERNAME/YOUR_REPO",
    filename="best_model.pth"
)
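
The downloaded checkpoint can then be loaded exactly as in Quick Start (assuming model is the timm ViT created there):

checkpoint = torch.load(model_path, map_location='cpu')
state_dict = checkpoint.get('model_state_dict', checkpoint)  # handle both formats
model.load_state_dict(state_dict)
model.eval()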

Model Architecture

Based on Vision Transformer (ViT-B/16) with standard architecture:

  • Input: 224×224 RGB image
  • Patch Size: 16×16 (196 patches)
  • Embedding Dim: 768
  • Layers: 12 transformer blocks
  • Attention Heads: 12 per layer
  • MLP Ratio: 4
  • Classification Head: Single linear layer (768 → 18)

Note: This is a clean baseline with no architectural modifications, custom heads, or ensemble methods.
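
The ~86M parameter figure can be checked directly against the timm model (a quick sanity check, not part of the training notebook):

import timm

model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=18)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 86M for ViT-B/16 with an 18-class head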

Limitations

  • Single preprocessing pipeline: No augmentation during training or inference
  • Standard architecture: No custom attention mechanisms or ensemble methods
  • Dataset bias: Trained on HaGRID which primarily contains Western gestures
  • Performance constraints:
    • Requires good lighting conditions
    • Sensitive to heavy occlusion (>40% of hand)
    • Best results with clear, centered hand poses
    • Minimum resolution: ~100px for reliable detection
  • Hardware: Real-time performance requires GPU (15-30 FPS on modern GPUs)

Clean Baseline Principles

This model adheres to strict baseline principles:

  1. No data augmentation - Single deterministic transform pipeline
  2. No test-time augmentation - Direct inference only
  3. Standard architecture - No custom heads or modifications
  4. Reproducible - Fixed random seeds and deterministic operations
  5. Minimal complexity - Simplest possible training configuration

These principles ensure:

  • Clear cause-and-effect relationships
  • Reproducible results
  • Fair comparison baselines
  • Academic rigor
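
In practice, principle 4 (fixed seeds and deterministic operations) corresponds to seeding every relevant library before training. A minimal sketch assuming standard PyTorch practice; the exact seed value used in the notebook is not specified here:

import random
import numpy as np
import torch

def set_seed(seed: int = 42):  # 42 is an illustrative default, not necessarily the notebook's
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()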

Reproducibility

All hyperparameters and configurations are documented in the training notebook:

  • notebooks/finetune_vit_hagrid.ipynb

To reproduce:

# 1. Download dataset
python scripts/download_hagrid_kaggle.py

# 2. Train model
jupyter notebook notebooks/finetune_vit_hagrid.ipynb

# 3. Upload to HF
python scripts/upload_model_to_huggingface.py --repo-id username/repo-name
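
If the upload script is unavailable, an equivalent manual upload via huggingface_hub looks roughly like the following (a sketch; the repo id and filename are placeholders):

from huggingface_hub import HfApi

api = HfApi()
api.create_repo("username/repo-name", exist_ok=True)
api.upload_file(
    path_or_fileobj="best_model.pth",
    path_in_repo="best_model.pth",
    repo_id="username/repo-name",
)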

Citation

@misc{hagrid-vit-clean-baseline-2025,
  title={Hand Gesture Recognition with Vision Transformer (Clean Baseline)},
  author={CSU Capstone Project},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/neilrigaud/hagrid-vit-gesture}}
}

License

MIT License

Acknowledgments

  • Dataset: HaGRID (Hand Gesture Recognition Image Dataset)
  • Base Model: ViT-B/16 pretrained on ImageNet-21k (timm library)
  • Framework: PyTorch, timm, torchvision