# ViG-Tiny on CIFAR-100 @ 224×224
Vision GNN (Graph Neural Network) model trained from scratch on CIFAR-100 upsampled to 224×224. ViG represents images as graphs and uses graph convolutions instead of traditional convolutions or attention mechanisms.
## Model Details
- Architecture: ViG-Tiny (Vision GNN)
- Parameters: ~7.6M
- Dataset: CIFAR-100 (100 classes, 32×32 → 224×224)
- Training: From scratch, no pretraining
- Key Innovation: Graph-based image representation with k-NN dynamic graphs
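The dynamic k-NN graph mentioned above can be sketched in a few lines: patch features are compared pairwise and each node keeps its k nearest neighbors. This is an illustrative sketch, not the authors' implementation (which batches this across images).

```python
import torch

def dense_knn_graph(x, k=9):
    """Build a k-NN graph over patch features.

    x: (N, C) node features; returns (N, k) neighbor indices.
    Illustrative sketch of the dynamic graph step, recomputed
    from the current features at every forward pass.
    """
    # Pairwise Euclidean distances between all nodes.
    dist = torch.cdist(x, x)            # (N, N)
    # Exclude self-loops by masking the diagonal.
    dist.fill_diagonal_(float("inf"))
    # Each node connects to its k nearest neighbors.
    _, idx = dist.topk(k, dim=-1, largest=False)
    return idx

# Example: 196 patch tokens (14x14 grid) with 192 channels.
feats = torch.randn(196, 192)
neighbors = dense_knn_graph(feats, k=9)
print(neighbors.shape)  # torch.Size([196, 9])
```

Because the graph is rebuilt from the evolving features at each block, semantically similar patches can become neighbors even when spatially distant.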
## Architecture
ViG-Tiny uses:
- Channels: 192
- Blocks: 12 (Grapher + FFN pairs)
- k-NN: 9 neighbors (increasing to 18 in deeper blocks)
- Graph Conv: Max-Relative (MR) convolution
- Stem: Multi-scale CNN stem (224 → 14×14)
- Head: 1024-dim classifier
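The Grapher + FFN pair and the Max-Relative convolution listed above can be sketched as follows. This is a minimal, hypothetical rendering of the block structure (neighbor indices are passed in; the real model also uses per-channel convolutions, normalization, and drop path):

```python
import torch
import torch.nn as nn

class MRGraphConv(nn.Module):
    """Max-Relative (MR) graph convolution, as a sketch.

    Each node aggregates the element-wise max of (neighbor - node)
    feature differences, concatenates it with its own features, and
    projects back down to the channel dimension.
    """
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Linear(2 * channels, channels)

    def forward(self, x, idx):
        # x: (N, C) node features; idx: (N, k) neighbor indices.
        neighbors = x[idx]                 # (N, k, C)
        rel = neighbors - x.unsqueeze(1)   # relative features
        max_rel = rel.max(dim=1).values    # (N, C)
        return self.proj(torch.cat([x, max_rel], dim=-1))

class ViGBlock(nn.Module):
    """One Grapher + FFN pair with residual connections (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.grapher = MRGraphConv(channels)
        self.ffn = nn.Sequential(
            nn.Linear(channels, 4 * channels),
            nn.GELU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, x, idx):
        x = x + self.grapher(x, idx)   # graph aggregation + residual
        x = x + self.ffn(x)            # position-wise FFN + residual
        return x
```

ViG-Tiny stacks 12 such blocks at 192 channels; the FFN is what prevents the node features from over-smoothing as graph aggregation repeats.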
## Training Setup
CIFAR-friendly hyperparameters with light augmentation and regularization:
- Optimizer: AdamW (lr=0.001, wd=0.0005)
- Scheduler: Cosine with 5 epoch warmup
- Epochs: 100
- Batch Size: 128
- Augmentation: RandAugment + Light Mixup (0.2)
- Regularization: No label smoothing, No drop path
- Normalization: CIFAR-100 statistics
- Mixed Precision: Enabled
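The optimizer and schedule above could be set up as follows; this is a sketch under stated assumptions (a placeholder module stands in for `vig_tiny` from the notebook, and the warmup is implemented with `LinearLR` chained into `CosineAnnealingLR`, which may differ from the notebook's exact schedule):

```python
import torch
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

# Placeholder for vig_tiny(num_classes=100) from the notebook.
model = torch.nn.Linear(192, 100)

# AdamW with the hyperparameters listed above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)

# 5-epoch linear warmup, then cosine decay over the remaining epochs.
warmup_epochs, total_epochs = 5, 100
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)
```

One `scheduler.step()` per epoch walks the learning rate up through warmup and then down the cosine curve.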
## Results
- Best Test Acc@1: 76.98%
- Best Test Acc@5: 93.66%
- Final Test Acc@1: 76.98%
- Final Test Acc@5: 93.40%
- Training Time: 5.60 hours
## Available Checkpoints
- `best_model.pth` - Best performing model (76.98% Acc@1)
- `final_model.pth` - Final model after 100 epochs
- `checkpoint_epoch_X.pth` - Saved every 20 epochs
## Usage
```python
import torch

# Load model (copy model definition from notebook)
model = vig_tiny(num_classes=100)

# Load trained weights
checkpoint = torch.load('best_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Inference
with torch.no_grad():
    output = model(image_tensor)
    probs = torch.softmax(output, dim=1)
```
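To turn the probabilities above into class predictions, `topk` gives the five most likely classes; the Acc@5 numbers in the results count a hit whenever the true label appears among them. Random logits stand in for `model(image_tensor)` here:

```python
import torch

# Dummy logits standing in for `output = model(image_tensor)`;
# shape is (batch, num_classes).
output = torch.randn(1, 100)
probs = torch.softmax(output, dim=1)

# Five most likely classes, highest probability first.
top5_prob, top5_idx = probs.topk(5, dim=1)
print(top5_idx[0].tolist())
```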
## Citation
```bibtex
@inproceedings{han2022vision,
  title={Vision GNN: An Image is Worth Graph of Nodes},
  author={Han, Kai and Wang, Yunhe and Guo, Jianyuan and Tang, Yehui and Wu, Enhua},
  booktitle={NeurIPS},
  year={2022}
}
```
## Notes
This model uses CIFAR-friendly hyperparameters optimized for the smaller dataset, with reduced augmentation and regularization compared to ImageNet training.