# DINOv2 with Spatial Refinement for Tulsi Leaf Disease Classification
This model implements a hybrid architecture combining DINOv2 vision transformer with spatial refinement modules for Tulsi (Holy Basil) plant disease classification. The architecture achieves 85.67% test accuracy with 1.74M trainable parameters.
Fig. 1: Schematic overview of the proposed Spatial-Refined DINOv2 (SR-DINOv2) architecture.
The framework is designed to bridge the semantic gap between generic self-supervised foundation features and specific agricultural pathology representation. By leveraging a frozen, pre-trained backbone for robust feature extraction and coupling it with lightweight, trainable spatial refinement modules, the architecture achieves high parameter efficiency while effectively capturing subtle lesion textures under varying environmental conditions.
## Architecture
Backbone: DINOv2-Small (ViT-S/14) - frozen pretrained weights
Novel Components:
- Token Reshaping Layer: Converts patch tokens (1×256×384) to spatial format (16×16×384)
- Feature Projection: 1×1 convolution with batch normalization
- Spatial Refinement Block (SRB):
  - Two depthwise separable convolutions (3×3)
  - Channel attention mechanism (reduction ratio: 4)
  - Residual connection
- Multi-Scale Module (MSM):
  - Three parallel branches: 3×3 DW conv, 5×5 DW conv, global pooling
  - Concatenation followed by 1×1 fusion convolution
- Dual Pooling: Average and max pooling concatenation
- Classification Head: MLP with dropout (512→256→4 classes)
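The Spatial Refinement Block above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the bullet points, not the repository's implementation; the class name, argument names, and layer ordering are assumptions:

```python
import torch
import torch.nn as nn

class SpatialRefinementBlock(nn.Module):
    """Sketch of the SRB: two depthwise separable 3x3 convs,
    channel attention (reduction 4), and a residual connection."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        def dw_separable(c):
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # depthwise 3x3
                nn.Conv2d(c, c, 1, bias=False),                       # pointwise 1x1
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True),
            )
        self.convs = nn.Sequential(dw_separable(channels), dw_separable(channels))
        # Squeeze-and-excitation style channel attention
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        out = self.convs(x)
        out = out * self.attn(out)  # channel-wise reweighting
        return x + out              # residual connection

x = torch.randn(1, 384, 16, 16)
print(SpatialRefinementBlock(384)(x).shape)  # torch.Size([1, 384, 16, 16])
```

The residual connection keeps the refined features shape-compatible with the DINOv2 feature map, so the block can be stacked (`num_refinement_blocks=2` in the usage example below).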
Architecture Flow:
Input (224×224×3) → DINOv2 Backbone → Reshape (16×16×384) → Feature Projection
→ Spatial Refinement Block → Multi-Scale Module → Dual Pooling → Classifier
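The token-reshaping step in this flow is a pure tensor rearrangement. A minimal shape sketch, assuming the CLS token has already been dropped:

```python
import torch

# DINOv2-S/14 on a 224x224 input yields 16x16 = 256 patch tokens of
# dimension 384. Reshaping turns the token sequence into a feature map
# that convolutional refinement modules can operate on.
tokens = torch.randn(1, 256, 384)                         # (B, N, D) patch tokens
spatial = tokens.transpose(1, 2).reshape(1, 384, 16, 16)  # (B, D, H, W)
print(tuple(spatial.shape))  # (1, 384, 16, 16)
```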
## Performance
| Metric | Train | Validation | Test |
|---|---|---|---|
| Accuracy | 85.39% | 86.66% | 85.67% |
| Loss | 0.3641 | 0.3289 | 0.4007 |
### Per-Class Metrics (Test Set)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Bacterial | 0.8526 | 0.9878 | 0.9153 | 82 |
| Fungal | 0.9355 | 0.7073 | 0.8056 | 82 |
| Healthy | 0.8395 | 0.8293 | 0.8344 | 82 |
| Pests | 0.8222 | 0.9024 | 0.8605 | 82 |
Aggregate Metrics:
- Macro Average: Precision 0.8625, Recall 0.8567, F1-Score 0.8539
- Weighted Average: Precision 0.8625, Recall 0.8567, F1-Score 0.8539
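Because every class has the same support (82 images), the macro and weighted averages coincide. The aggregates can be reproduced directly from the per-class table:

```python
# Per-class values copied from the table above (support = 82 for each class).
precision = [0.8526, 0.9355, 0.8395, 0.8222]
recall    = [0.9878, 0.7073, 0.8293, 0.9024]
f1        = [0.9153, 0.8056, 0.8344, 0.8605]

macro = lambda xs: sum(xs) / len(xs)  # equal weights = macro average
print(f"P={macro(precision):.4f} R={macro(recall):.4f} F1={macro(f1):.4f}")
```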
## Training Configuration
| Parameter | Value |
|---|---|
| Total Parameters | 22,484,580 |
| Trainable Parameters | 1,739,588 (7.7%) |
| Frozen Parameters | 20,744,992 (DINOv2 backbone) |
| Batch Size | 32 |
| Optimizer | AdamW (lr=0.001, weight_decay=1e-4) |
| Scheduler | CosineAnnealingLR (eta_min=1e-6) |
| Training Epochs | 32 (early stopped from 40) |
| Early Stopping Patience | 5 epochs |
| Mixed Precision | FP16 |
| Training Time | 33.66 minutes |
| Hardware | NVIDIA T4 GPU (15GB VRAM) |
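The configuration above maps onto a fairly standard PyTorch training loop. The sketch below is a hedged reconstruction, not the project's actual training script; `evaluate` and `train_loader` are placeholders:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def train(model, train_loader, evaluate, device="cuda",
          max_epochs=40, patience=5):
    """Sketch of the table's setup: AdamW + cosine annealing,
    FP16 mixed precision, and early stopping on validation loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=max_epochs, eta_min=1e-6)
    scaler = GradScaler()  # FP16 mixed precision
    criterion = torch.nn.CrossEntropyLoss()
    best_val, bad_epochs = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad(set_to_none=True)
            with autocast():
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()

        val_loss = evaluate(model)  # placeholder validation callback
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pth")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # early stopping after 5 stale epochs
                break
```

With `patience=5`, a run that stops at epoch 32 of 40 (as in the table) means the best validation loss was reached around epoch 27.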
## Usage

```python
import torch
from models import HybridDINOv2Classifier

# Instantiate the model with the configuration used in training
model = HybridDINOv2Classifier(
    num_classes=4,
    freeze_backbone=True,
    use_refinement=True,
    num_refinement_blocks=2,
    use_multiscale=True,
)
model.load_state_dict(torch.load('best_model.pth', map_location='cpu'))
model.eval()
```
## Model Files

- `best_model.pth`: Trained model weights (95.3 MB)
- `checkpoint.pth`: Full checkpoint with optimizer state (109 MB)
- `config.json`: Model hyperparameters
- `metrics.json`: Final evaluation metrics
- `history.json`: Training history (32 epochs)
- `training_results.png`: Loss and accuracy curves
## Key Features
- Efficient Training: Only 7.7% of parameters are trainable due to frozen backbone
- Spatial Inductive Bias: Convolutional refinement modules enhance ViT features
- Multi-Scale Processing: Captures disease patterns at different receptive fields
- Early Stopping: Automatic training halt after 5 epochs without improvement
- Robust Performance: Balanced accuracy across all disease categories
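The parameter-efficiency figures quoted above are internally consistent and can be checked with simple arithmetic:

```python
# Figures from the training-configuration table.
total, trainable = 22_484_580, 1_739_588
frozen = total - trainable
print(frozen)                             # 20744992, matching the table
print(f"{100 * trainable / total:.1f}%")  # 7.7%

# For a live model, the same split comes from the requires_grad flags:
# trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```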
## Model tree for LeafNet75/SR-dinov2-H-tulsi

Base model: facebook/dinov2-base