Hand Gesture Recognition - ViT-B/16 (Clean Baseline)
Model Description
This Vision Transformer (ViT-B/16) model is fine-tuned on the HaGRID dataset for real-time hand gesture recognition. This is a clean baseline implementation with no experimental augmentation or architectural modifications.
Model Version: clean-baseline-v1.0
Architecture: ViT-B/16 (224×224 input)
Parameters: ~86M
Framework: PyTorch + timm
Performance Metrics
| Metric | Value |
|---|---|
| Test Accuracy | 0.8327 (83.27%) |
| Validation Accuracy | 0.8255 (82.55%) |
| Validation F1 Score | 0.8330 (83.30%) |
| Training Accuracy | 1.0000 (100.00%) |
| Confusion Matrix Diagonality | 0.8327 (83.27%) |
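The reported numbers can be recomputed from raw predictions in a few lines of scikit-learn. The snippet below is an illustrative sketch: the `report_metrics` helper and the macro averaging for F1 are assumptions, and "diagonality" is interpreted as the fraction of samples on the confusion-matrix diagonal (equivalent to overall accuracy).

```python
# Illustrative sketch of the reported metrics; the helper name and macro F1
# averaging are assumptions, not taken from the training notebook.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def report_metrics(y_true, y_pred):
    """Accuracy, macro F1, and confusion-matrix diagonality (trace / total)."""
    cm = confusion_matrix(y_true, y_pred)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),
        # Diagonality: fraction of samples on the diagonal, equivalent to accuracy
        "diagonality": np.trace(cm) / cm.sum(),
    }

# Toy example with 3 of the 18 classes
print(report_metrics([0, 1, 2, 2], [0, 1, 2, 1]))
```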
Training Details
Dataset
- Name: HaGRID Sample 30K (384p)
- Source: https://www.kaggle.com/datasets/innominate817/hagrid-sample-30k-384p
- Train samples: 22,283
- Validation samples: 4,774
- Test samples: 4,776
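The three splits sum to 31,833 images, i.e. roughly a 70/15/15 partition of the sample. The sketch below shows one way such a split could be produced; the dataset folder path and the seed are assumptions rather than values taken from the training notebook.

```python
# Hypothetical 70/15/15 split reproducing the reported sample counts; the dataset
# path and the seed (42) are assumptions.
import torch
from torchvision import datasets, transforms

dataset = datasets.ImageFolder("hagrid-sample-30k-384p", transform=transforms.ToTensor())

n_total = len(dataset)              # 31,833 images in this sample
n_train = int(0.70 * n_total)       # 22,283
n_val = int(0.15 * n_total)         # 4,774
n_test = n_total - n_train - n_val  # 4,776

train_set, val_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, n_val, n_test], generator=torch.Generator().manual_seed(42)
)
```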
Training Configuration
- Batch Size: 32
- Learning Rate: 0.0001
- Weight Decay: 0.01
- Optimizer: AdamW
- Scheduler: CosineAnnealingLR
- Epochs Trained: 15/15
- Label Smoothing: 0.0 (disabled)
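A minimal sketch of this configuration in PyTorch is shown below; the loop skeleton and the explicit `CrossEntropyLoss` construction are assumptions based on the listed hyperparameters, not the notebook's verbatim code.

```python
# Sketch of the training setup implied by the hyperparameters above (assumed, not
# copied from the notebook).
import timm
import torch
from torch import nn

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=18)

criterion = nn.CrossEntropyLoss(label_smoothing=0.0)  # label smoothing disabled
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)

for epoch in range(15):
    # ... iterate the training DataLoader (batch_size=32), compute
    # criterion(outputs, targets), backpropagate, optimizer.step() ...
    scheduler.step()
```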
Data Processing
- Preprocessing: Single deterministic pipeline (no augmentation)
- Resize: 224×224
- Normalization: ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
- Augmentation: none (clean baseline principle)
- Test-Time Augmentation: False (disabled)
Architecture Details
- Type: standard ViT-B/16 (no modifications)
- Classification Head: single_linear (768 → 18 classes)
- Hidden Dimensions: None
- Dropout: 0.0
Gesture Classes (18 total)
- call
- dislike
- fist
- four
- like
- mute
- ok
- one
- palm
- peace
- peace_inverted
- rock
- stop
- stop_inverted
- three
- three2
- two_up
- two_up_inverted
Usage
Installation
```bash
pip install torch timm pillow opencv-python mediapipe
```
Quick Start
```python
import torch
import timm
from PIL import Image
from torchvision import transforms

# Load model
model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=18)
checkpoint = torch.load('best_model.pth', map_location='cpu')

# Handle both full checkpoint and state_dict formats
if 'model_state_dict' in checkpoint:
    model.load_state_dict(checkpoint['model_state_dict'])
else:
    model.load_state_dict(checkpoint)
model.eval()

# Prepare transform (MUST match training exactly)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Inference
image = Image.open('hand_gesture.jpg').convert('RGB')
input_tensor = transform(image).unsqueeze(0)
with torch.no_grad():
    output = model(input_tensor)
    probabilities = torch.softmax(output, dim=1)
    predicted_class = output.argmax(dim=1).item()
    confidence = probabilities[0, predicted_class].item()

class_names = ['call', 'dislike', 'fist', 'four', 'like', 'mute', 'ok', 'one', 'palm',
               'peace', 'peace_inverted', 'rock', 'stop', 'stop_inverted', 'three',
               'three2', 'two_up', 'two_up_inverted']
print(f"Predicted: {class_names[predicted_class]} ({confidence:.2%})")
```
Download from Hugging Face
```python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="YOUR_USERNAME/YOUR_REPO",
    filename="best_model.pth"
)
```
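The downloaded checkpoint then loads the same way as in the Quick Start, for example:

```python
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=18)
checkpoint = torch.load(model_path, map_location="cpu")
state_dict = checkpoint.get("model_state_dict", checkpoint)  # handle both formats
model.load_state_dict(state_dict)
model.eval()
```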
Model Architecture
Based on Vision Transformer (ViT-B/16) with standard architecture:
- Input: 224×224 RGB image
- Patch Size: 16×16 (196 patches)
- Embedding Dim: 768
- Layers: 12 transformer blocks
- Attention Heads: 12 per layer
- MLP Ratio: 4
- Classification Head: Single linear layer (768 → 18)
Note: This is a clean baseline with no architectural modifications, custom heads, or ensemble methods.
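As a quick sanity check, these figures can be read directly off the timm model used in the Quick Start; the short sketch below simply inspects the instantiated module.

```python
# Inspect the standard timm ViT-B/16 used by this model card.
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=18)

print(sum(p.numel() for p in model.parameters()) / 1e6)  # ~86M parameters
print(model.patch_embed.num_patches)                      # 196 patches (16x16 each)
print(model.head)                                         # Linear(in_features=768, out_features=18)
```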
Limitations
- Single preprocessing pipeline: No augmentation during training or inference
- Standard architecture: No custom attention mechanisms or ensemble methods
- Dataset bias: Trained on HaGRID, which primarily contains Western gestures
- Performance constraints:
- Requires good lighting conditions
- Sensitive to heavy occlusion (>40% of hand)
- Best results with clear, centered hand poses
- Minimum resolution: ~100px for reliable detection
- Hardware: Real-time performance requires a GPU (15-30 FPS on modern GPUs)
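For an idea of real-time use, here is a hedged sketch of a webcam loop built on OpenCV. It assumes the `model`, `transform`, and `class_names` objects from the Quick Start above; the camera index and overlay details are arbitrary choices, not part of the released code.

```python
# Sketch of a real-time webcam loop; assumes `model`, `transform`, and `class_names`
# from the Quick Start are already defined.
import cv2
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

cap = cv2.VideoCapture(0)  # default webcam (assumed index)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)       # OpenCV frames are BGR
    x = transform(Image.fromarray(rgb)).unsqueeze(0).to(device)
    with torch.no_grad():
        pred = model(x).argmax(dim=1).item()
    cv2.putText(frame, class_names[pred], (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("gesture", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):              # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```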
Clean Baseline Principles
This model adheres to strict baseline principles:
- No data augmentation - Single deterministic transform pipeline
- No test-time augmentation - Direct inference only
- Standard architecture - No custom heads or modifications
- Reproducible - Fixed random seeds and deterministic operations (see the seeding sketch below)
- Minimal complexity - Simplest possible training configuration
These principles ensure:
- Clear cause-and-effect relationships
- Reproducible results
- Fair comparison baselines
- Academic rigor
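A typical seeding and determinism setup consistent with these principles looks like the sketch below; the seed value and the exact set of calls are assumptions rather than the notebook's verbatim code.

```python
# Assumed seeding/determinism helper; the seed value (42) is illustrative.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```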
Reproducibility
All hyperparameters and configurations are documented in the training notebook:
notebooks/finetune_vit_hagrid.ipynb
To reproduce:
```bash
# 1. Download dataset
python scripts/download_hagrid_kaggle.py

# 2. Train model
jupyter notebook notebooks/finetune_vit_hagrid.ipynb

# 3. Upload to HF
python scripts/upload_model_to_huggingface.py --repo-id username/repo-name
```
Citation
```bibtex
@misc{hagrid-vit-clean-baseline-2025,
  title={Hand Gesture Recognition with Vision Transformer (Clean Baseline)},
  author={CSU Capstone Project},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/neilrigaud/hagrid-vit-gesture}}
}
```
License
MIT License
Acknowledgments
- Dataset: HaGRID (Hand Gesture Recognition Image Dataset)
- Base Model: ViT-B/16 pretrained on ImageNet-21k (timm library)
- Framework: PyTorch, timm, torchvision