# DINOv2 with Spatial Refinement for Tulsi Leaf Disease Classification
This model implements a hybrid architecture combining DINOv2 vision transformer with spatial refinement modules for Tulsi (Holy Basil) plant disease classification. The architecture achieves 85.67% test accuracy with 1.74M trainable parameters.
Fig. 1: Schematic overview of the proposed Spatial-Refined DINOv2 (SR-DINOv2) architecture.
The framework is designed to bridge the semantic gap between generic self-supervised foundation features and specific agricultural pathology representation. By leveraging a frozen, pre-trained backbone for robust feature extraction and coupling it with lightweight, trainable spatial refinement modules, the architecture achieves high parameter efficiency while effectively capturing subtle lesion textures under varying environmental conditions.
## Architecture
Backbone: DINOv2-Small (ViT-S/14) - frozen pretrained weights
Novel Components:
- Token Reshaping Layer: Converts patch tokens (1×256×384) to spatial format (16×16×384)
- Feature Projection: 1×1 convolution with batch normalization
- Spatial Refinement Block (SRB):
  - Two depthwise separable convolutions (3×3)
  - Channel attention mechanism (reduction ratio: 4)
  - Residual connection
- Multi-Scale Module (MSM):
  - Three parallel branches: 3×3 DW conv, 5×5 DW conv, global pooling
  - Concatenation followed by 1×1 fusion convolution
- Dual Pooling: Average and max pooling concatenation
- Classification Head: MLP with dropout (512→256→4 classes)
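The Spatial Refinement Block above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the bullet points, not the repository's implementation; the class name, argument names, and layer ordering are assumptions:

```python
import torch
import torch.nn as nn

class SpatialRefinementBlock(nn.Module):
    """Sketch of the SRB: two depthwise separable 3x3 convs,
    channel attention (reduction 4), and a residual connection."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        def dw_separable(c):
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # depthwise 3x3
                nn.Conv2d(c, c, 1, bias=False),                       # pointwise 1x1
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True),
            )
        self.convs = nn.Sequential(dw_separable(channels), dw_separable(channels))
        # Squeeze-and-excitation style channel attention
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        out = self.convs(x)
        out = out * self.attn(out)  # channel-wise reweighting
        return x + out              # residual connection

x = torch.randn(1, 384, 16, 16)
print(SpatialRefinementBlock(384)(x).shape)  # torch.Size([1, 384, 16, 16])
```

The residual connection keeps the refined features shape-compatible with the DINOv2 feature map, so the block can be stacked (`num_refinement_blocks=2` in the usage example below).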
Architecture Flow:
Input (224×224×3) → DINOv2 Backbone → Reshape (16×16×384) → Feature Projection
→ Spatial Refinement Block → Multi-Scale Module → Dual Pooling → Classifier
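The token-reshaping step in this flow is a pure tensor rearrangement. A minimal shape sketch, assuming the CLS token has already been dropped:

```python
import torch

# DINOv2-S/14 on a 224x224 input yields 16x16 = 256 patch tokens of
# dimension 384. Reshaping turns the token sequence into a feature map
# that convolutional refinement modules can operate on.
tokens = torch.randn(1, 256, 384)                         # (B, N, D) patch tokens
spatial = tokens.transpose(1, 2).reshape(1, 384, 16, 16)  # (B, D, H, W)
print(tuple(spatial.shape))  # (1, 384, 16, 16)
```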
## Performance
| Metric | Train | Validation | Test |
|---|---|---|---|
| Accuracy | 85.39% | 86.66% | 85.67% |
| Loss | 0.3641 | 0.3289 | 0.4007 |
### Per-Class Metrics (Test Set)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Bacterial | 0.8526 | 0.9878 | 0.9153 | 82 |
| Fungal | 0.9355 | 0.7073 | 0.8056 | 82 |
| Healthy | 0.8395 | 0.8293 | 0.8344 | 82 |
| Pests | 0.8222 | 0.9024 | 0.8605 | 82 |
Aggregate Metrics:
- Macro Average: Precision 0.8625, Recall 0.8567, F1-Score 0.8539
- Weighted Average: Precision 0.8625, Recall 0.8567, F1-Score 0.8539
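Because every class has the same support (82 images), the macro and weighted averages coincide. The aggregates can be reproduced directly from the per-class table:

```python
# Per-class values copied from the table above (support = 82 for each class).
precision = [0.8526, 0.9355, 0.8395, 0.8222]
recall    = [0.9878, 0.7073, 0.8293, 0.9024]
f1        = [0.9153, 0.8056, 0.8344, 0.8605]

macro = lambda xs: sum(xs) / len(xs)  # equal weights = macro average
print(f"P={macro(precision):.4f} R={macro(recall):.4f} F1={macro(f1):.4f}")
```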
## Training Configuration
| Parameter | Value |
|---|---|
| Total Parameters | 22,484,580 |
| Trainable Parameters | 1,739,588 (7.7%) |
| Frozen Parameters | 20,744,992 (DINOv2 backbone) |
| Batch Size | 32 |
| Optimizer | AdamW (lr=0.001, weight_decay=1e-4) |
| Scheduler | CosineAnnealingLR (eta_min=1e-6) |
| Training Epochs | 32 (early stopped from 40) |
| Early Stopping Patience | 5 epochs |
| Mixed Precision | FP16 |
| Training Time | 33.66 minutes |
| Hardware | NVIDIA T4 GPU (15GB VRAM) |
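The configuration above maps onto a fairly standard PyTorch training loop. The sketch below is a hedged reconstruction, not the project's actual training script; `evaluate` and `train_loader` are placeholders:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def train(model, train_loader, evaluate, device="cuda",
          max_epochs=40, patience=5):
    """Sketch of the table's setup: AdamW + cosine annealing,
    FP16 mixed precision, and early stopping on validation loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=max_epochs, eta_min=1e-6)
    scaler = GradScaler()  # FP16 mixed precision
    criterion = torch.nn.CrossEntropyLoss()
    best_val, bad_epochs = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad(set_to_none=True)
            with autocast():
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()

        val_loss = evaluate(model)  # placeholder validation callback
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pth")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # early stopping after 5 stale epochs
                break
```

With `patience=5`, a run that stops at epoch 32 of 40 (as in the table) means the best validation loss was reached around epoch 27.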
## Usage

```python
import torch
from models import HybridDINOv2Classifier

# Instantiate the model with the configuration used in training
model = HybridDINOv2Classifier(
    num_classes=4,
    freeze_backbone=True,
    use_refinement=True,
    num_refinement_blocks=2,
    use_multiscale=True,
)
model.load_state_dict(torch.load('best_model.pth', map_location='cpu'))
model.eval()
```
## Model Files

- `best_model.pth`: Trained model weights (95.3 MB)
- `checkpoint.pth`: Full checkpoint with optimizer state (109 MB)
- `config.json`: Model hyperparameters
- `metrics.json`: Final evaluation metrics
- `history.json`: Training history (32 epochs)
- `training_results.png`: Loss and accuracy curves
## Key Features
- Efficient Training: Only 7.7% of parameters are trainable due to frozen backbone
- Spatial Inductive Bias: Convolutional refinement modules enhance ViT features
- Multi-Scale Processing: Captures disease patterns at different receptive fields
- Early Stopping: Automatic training halt after 5 epochs without improvement
- Robust Performance: Balanced accuracy across all disease categories
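The parameter-efficiency figures quoted above are internally consistent and can be checked with simple arithmetic:

```python
# Figures from the training-configuration table.
total, trainable = 22_484_580, 1_739_588
frozen = total - trainable
print(frozen)                             # 20744992, matching the table
print(f"{100 * trainable / total:.1f}%")  # 7.7%

# For a live model, the same split comes from the requires_grad flags:
# trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```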
## Model tree for LeafNet75/SR-dinov2-H-tulsi

Base model: facebook/dinov2-base