Hand Gesture Recognition - ViT-B/16 (Clean Baseline)

Model Description

This Vision Transformer (ViT-B/16) model is fine-tuned on the HaGRID dataset for real-time hand gesture recognition. It is a clean baseline implementation with no experimental augmentation or architectural modifications.

Model Version: clean-baseline-v1.0
Architecture: ViT-B/16 (224×224 input)
Parameters: ~86M
Framework: PyTorch + timm

Performance Metrics

Metric                         Value
Test Accuracy                  0.8327 (83.27%)
Validation Accuracy            0.8255 (82.55%)
Validation F1 Score            0.8330 (83.30%)
Training Accuracy              1.0000 (100.00%)
Confusion Matrix Diagonality   0.8327 (83.27%)

Training Details

Dataset

  • HaGRID (Hand Gesture Recognition Image Dataset), 18 gesture classes

Training Configuration

  • Batch Size: 32
  • Learning Rate: 0.0001
  • Weight Decay: 0.01
  • Optimizer: AdamW
  • Scheduler: CosineAnnealingLR
  • Epochs Trained: 15/15
  • Label Smoothing: 0.0 (disabled)
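
In PyTorch terms, this configuration maps onto a standard timm + AdamW setup. A minimal sketch, assuming the training notebook follows the usual pattern (variable names here are illustrative, not taken from the notebook):

import torch
import timm

# Pretrained ViT-B/16 backbone with an 18-class head; the exact pretrained
# weight tag is an assumption (see Acknowledgments for the backbone source)
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=18)

NUM_EPOCHS = 15
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.0)  # label smoothing disabled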

Data Processing

  • Preprocessing: Single deterministic pipeline (no augmentation)
  • Resize: 224×224
  • Normalization: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  • Augmentation: none (clean baseline principle)
  • Test-Time Augmentation: False (disabled)

Architecture Details

  • Type: standard ViT-B/16 (no modifications)
  • Classification Head: single_linear (768 → 18 classes)
  • Hidden Dimensions: None
  • Dropout: 0.0

Gesture Classes (18 total)

  • call
  • dislike
  • fist
  • four
  • like
  • mute
  • ok
  • one
  • palm
  • peace
  • peace_inverted
  • rock
  • stop
  • stop_inverted
  • three
  • three2
  • two_up
  • two_up_inverted

Usage

Installation

pip install torch timm pillow opencv-python mediapipe

Quick Start

import torch
import timm
from PIL import Image
from torchvision import transforms

# Load model
model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=18)
checkpoint = torch.load('best_model.pth', map_location='cpu')

# Handle both full checkpoint and state_dict formats
if 'model_state_dict' in checkpoint:
    model.load_state_dict(checkpoint['model_state_dict'])
else:
    model.load_state_dict(checkpoint)

model.eval()

# Prepare transform (MUST match training exactly)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Inference
image = Image.open('hand_gesture.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
    probabilities = torch.softmax(output, dim=1)
    predicted_class = output.argmax(dim=1).item()
    confidence = probabilities[0, predicted_class].item()

class_names = ['call', 'dislike', 'fist', 'four', 'like', 'mute', 'ok', 'one', 'palm', 'peace', 'peace_inverted', 'rock', 'stop', 'stop_inverted', 'three', 'three2', 'two_up', 'two_up_inverted']
print(f"Predicted: {class_names[predicted_class]} ({confidence:.2%})")

Download from Hugging Face

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="YOUR_USERNAME/YOUR_REPO",
    filename="best_model.pth"
)
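
The downloaded checkpoint can then be loaded exactly as in Quick Start (assuming model is the timm ViT created there):

checkpoint = torch.load(model_path, map_location='cpu')
state_dict = checkpoint.get('model_state_dict', checkpoint)  # handle both formats
model.load_state_dict(state_dict)
model.eval()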

Model Architecture

Based on Vision Transformer (ViT-B/16) with standard architecture:

  • Input: 224×224 RGB image
  • Patch Size: 16×16 (196 patches)
  • Embedding Dim: 768
  • Layers: 12 transformer blocks
  • Attention Heads: 12 per layer
  • MLP Ratio: 4
  • Classification Head: Single linear layer (768 → 18)

Note: This is a clean baseline with no architectural modifications, custom heads, or ensemble methods.
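
The ~86M parameter figure can be checked directly against the timm model (a quick sanity check, not part of the training notebook):

import timm

model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=18)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 86M for ViT-B/16 with an 18-class head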

Limitations

  • Single preprocessing pipeline: No augmentation during training or inference
  • Standard architecture: No custom attention mechanisms or ensemble methods
  • Dataset bias: Trained on HaGRID which primarily contains Western gestures
  • Performance constraints:
    • Requires good lighting conditions
    • Sensitive to heavy occlusion (>40% of hand)
    • Best results with clear, centered hand poses
    • Minimum resolution: ~100px for reliable detection
  • Hardware: Real-time performance requires GPU (15-30 FPS on modern GPUs)

Clean Baseline Principles

This model adheres to strict baseline principles:

  1. No data augmentation - Single deterministic transform pipeline
  2. No test-time augmentation - Direct inference only
  3. Standard architecture - No custom heads or modifications
  4. Reproducible - Fixed random seeds and deterministic operations
  5. Minimal complexity - Simplest possible training configuration

These principles ensure:

  • Clear cause-and-effect relationships
  • Reproducible results
  • Fair comparison baselines
  • Academic rigor
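
In practice, principle 4 (fixed seeds and deterministic operations) corresponds to seeding every relevant library before training. A minimal sketch assuming standard PyTorch practice; the exact seed value used in the notebook is not specified here:

import random
import numpy as np
import torch

def set_seed(seed: int = 42):  # 42 is an illustrative default, not necessarily the notebook's
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()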

Reproducibility

All hyperparameters and configurations are documented in the training notebook:

  • notebooks/finetune_vit_hagrid.ipynb

To reproduce:

# 1. Download dataset
python scripts/download_hagrid_kaggle.py

# 2. Train model
jupyter notebook notebooks/finetune_vit_hagrid.ipynb

# 3. Upload to HF
python scripts/upload_model_to_huggingface.py --repo-id username/repo-name
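
If the upload script is unavailable, an equivalent manual upload via huggingface_hub looks roughly like the following (a sketch; the repo id and filename are placeholders):

from huggingface_hub import HfApi

api = HfApi()
api.create_repo("username/repo-name", exist_ok=True)
api.upload_file(
    path_or_fileobj="best_model.pth",
    path_in_repo="best_model.pth",
    repo_id="username/repo-name",
)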

Citation

@misc{hagrid-vit-clean-baseline-2025,
  title={Hand Gesture Recognition with Vision Transformer (Clean Baseline)},
  author={CSU Capstone Project},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/neilrigaud/hagrid-vit-gesture}}
}

License

MIT License

Acknowledgments

  • Dataset: HaGRID (Hand Gesture Recognition Image Dataset)
  • Base Model: ViT-B/16 pretrained on ImageNet-21k (timm library)
  • Framework: PyTorch, timm, torchvision