DINOv2 with Spatial Refinement for Tulsi Leaf Disease Classification

This model implements a hybrid architecture that combines a DINOv2 vision transformer with lightweight spatial refinement modules for Tulsi (Holy Basil) leaf disease classification. The architecture achieves 85.67% test accuracy with only 1.74M trainable parameters.

SR-DINOv2 Architecture

Fig. 1: Schematic overview of the proposed Spatial-Refined DINOv2 (SR-DINOv2) architecture.

The framework is designed to bridge the semantic gap between generic self-supervised foundation features and task-specific agricultural pathology representations. By leveraging a frozen, pre-trained backbone for robust feature extraction and coupling it with lightweight, trainable spatial refinement modules, the architecture achieves high parameter efficiency while effectively capturing subtle lesion textures under varying environmental conditions.

Architecture

Backbone: DINOv2-Small (ViT-S/14) - frozen pretrained weights

Novel Components:

  1. Token Reshaping Layer: Converts patch tokens (1×256×384) to spatial format (16×16×384)
  2. Feature Projection: 1×1 convolution with batch normalization
  3. Spatial Refinement Block (SRB):
    • Two depthwise separable convolutions (3×3)
    • Channel attention mechanism (reduction ratio: 4)
    • Residual connection
  4. Multi-Scale Module (MSM):
    • Three parallel branches: 3×3 DW conv, 5×5 DW conv, global pooling
    • Concatenation followed by a 1×1 fusion convolution
  5. Dual Pooling: Average and max pooling concatenation
  6. Classification Head: MLP with dropout (512 → 256 → 4 classes)
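The Spatial Refinement Block described above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the bullet points (two depthwise separable 3×3 convolutions, squeeze-and-excitation-style channel attention with reduction ratio 4, and a residual connection); the class and layer names are not the repository's actual identifiers.

```python
import torch
import torch.nn as nn

class SpatialRefinementBlock(nn.Module):
    """Illustrative sketch of the SRB: two depthwise separable 3x3 convs,
    channel attention (reduction 4), and a residual connection."""
    def __init__(self, channels: int = 384, reduction: int = 4):
        super().__init__()
        def dw_separable(c: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),  # depthwise 3x3
                nn.Conv2d(c, c, 1, bias=False),                       # pointwise 1x1
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True),
            )
        self.conv1 = dw_separable(channels)
        self.conv2 = dw_separable(channels)
        # Squeeze-and-excitation style channel attention
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.conv1(x))
        out = out * self.attn(out)   # channel-wise reweighting
        return x + out               # residual connection

x = torch.randn(1, 384, 16, 16)
y = SpatialRefinementBlock()(x)      # shape is preserved: (1, 384, 16, 16)
```

Because the block is shape-preserving, multiple SRBs can be stacked (the configuration below uses `num_refinement_blocks=2`).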

Architecture Flow:

Input (224×224×3) → DINOv2 Backbone → Reshape (16×16×384) → Feature Projection
→ Spatial Refinement Block → Multi-Scale Module → Dual Pooling → Classifier
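The reshape and dual-pooling steps of this flow are simple tensor operations. The sketch below assumes the standard ViT-S/14 geometry: a 224×224 input yields 224/14 = 16 patches per side, i.e. 256 patch tokens of dimension 384. Function names are hypothetical.

```python
import torch

def tokens_to_map(patch_tokens: torch.Tensor) -> torch.Tensor:
    """Reshape DINOv2 patch tokens (B, 256, 384) into a spatial
    feature map (B, 384, 16, 16) for the convolutional stages."""
    b, n, c = patch_tokens.shape     # n = 256 tokens, c = 384 channels
    h = w = int(n ** 0.5)            # 16 x 16 grid for ViT-S/14 at 224px
    return patch_tokens.transpose(1, 2).reshape(b, c, h, w)

def dual_pool(feat: torch.Tensor) -> torch.Tensor:
    """Concatenate global average and max pooling into a (B, 2*C) vector."""
    avg = feat.mean(dim=(2, 3))
    mx = feat.amax(dim=(2, 3))
    return torch.cat([avg, mx], dim=1)

tokens = torch.randn(2, 256, 384)
fmap = tokens_to_map(tokens)         # (2, 384, 16, 16)
vec = dual_pool(fmap)                # (2, 768)
```

The doubled channel count after dual pooling (2 × 384 = 768) is what the MLP head then reduces toward the 4 output classes.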

Performance

| Metric   | Train  | Validation | Test   |
|----------|--------|------------|--------|
| Accuracy | 85.39% | 86.66%     | 85.67% |
| Loss     | 0.3641 | 0.3289     | 0.4007 |


Per-Class Metrics (Test Set)

| Class     | Precision | Recall | F1-Score | Support |
|-----------|-----------|--------|----------|---------|
| Bacterial | 0.8526    | 0.9878 | 0.9153   | 82      |
| Fungal    | 0.9355    | 0.7073 | 0.8056   | 82      |
| Healthy   | 0.8395    | 0.8293 | 0.8344   | 82      |
| Pests     | 0.8222    | 0.9024 | 0.8605   | 82      |

Aggregate Metrics:

  • Macro Average: Precision 0.8625, Recall 0.8567, F1-Score 0.8539
  • Weighted Average: Precision 0.8625, Recall 0.8567, F1-Score 0.8539
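Because every class has the same support (82 test images), the weighted averages coincide with the macro averages. The macro figures can be re-derived directly from the per-class table:

```python
# Per-class values copied from the table above.
precision = [0.8526, 0.9355, 0.8395, 0.8222]
recall    = [0.9878, 0.7073, 0.8293, 0.9024]
f1        = [0.9153, 0.8056, 0.8344, 0.8605]

def macro(values):
    """Unweighted mean across classes; equals the weighted mean
    here because all four supports are identical (82 each)."""
    return sum(values) / len(values)

macro_p, macro_r, macro_f1 = macro(precision), macro(recall), macro(f1)
# macro_p ~ 0.8625, macro_r ~ 0.8567, macro_f1 ~ 0.8539
```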

Training Configuration

| Parameter               | Value                              |
|-------------------------|------------------------------------|
| Total Parameters        | 22,484,580                         |
| Trainable Parameters    | 1,739,588 (7.7%)                   |
| Frozen Parameters       | 20,744,992 (DINOv2 backbone)       |
| Batch Size              | 32                                 |
| Optimizer               | AdamW (lr=0.001, weight_decay=1e-4)|
| Scheduler               | CosineAnnealingLR (eta_min=1e-6)   |
| Training Epochs         | 32 (early stopped from 40)         |
| Early Stopping Patience | 5 epochs                           |
| Mixed Precision         | FP16                               |
| Training Time           | 33.66 minutes                      |
| Hardware                | NVIDIA T4 GPU (15 GB VRAM)         |
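The early-stopping rule from the table (halt after 5 epochs without validation improvement, which ended training at epoch 32 of 40) can be sketched as a small standalone helper. This is an illustrative class, not the training script's actual code.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs
    without an improvement in validation loss."""
    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=5)
# A run whose validation loss stalls after epoch 2 triggers the stop:
losses = [0.9, 0.7, 0.5, 0.52, 0.51, 0.55, 0.53, 0.56]
stopped_at = next(i for i, l in enumerate(losses) if stopper.step(l))
# stopped_at == 7: the fifth non-improving epoch after the best (0.5)
```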

Usage

```python
from models import HybridDINOv2Classifier
import torch

model = HybridDINOv2Classifier(
    num_classes=4,
    freeze_backbone=True,
    use_refinement=True,
    num_refinement_blocks=2,
    use_multiscale=True,
)

# map_location='cpu' allows loading on machines without a GPU
state_dict = torch.load('best_model.pth', map_location='cpu')
model.load_state_dict(state_dict)
model.eval()
```
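For single-image inference, a helper along these lines can sit on top of the loaded model. The class order here is an assumption inferred from the alphabetical per-class table above; check `config.json` for the actual label mapping.

```python
import torch

# Assumed label order (alphabetical, as in the per-class metrics table).
CLASS_NAMES = ["Bacterial", "Fungal", "Healthy", "Pests"]

@torch.no_grad()
def predict(model, image_tensor: torch.Tensor) -> str:
    """Classify one preprocessed image tensor of shape (3, 224, 224)."""
    logits = model(image_tensor.unsqueeze(0))   # add batch dim -> (1, 4)
    return CLASS_NAMES[logits.argmax(dim=1).item()]
```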

Model Files

  • best_model.pth: Trained model weights (95.3 MB)
  • checkpoint.pth: Full checkpoint with optimizer state (109 MB)
  • config.json: Model hyperparameters
  • metrics.json: Final evaluation metrics
  • history.json: Training history (32 epochs)
  • training_results.png: Loss and accuracy curves

Key Features

  1. Efficient Training: Only 7.7% of parameters are trainable because the backbone is frozen
  2. Spatial Inductive Bias: Convolutional refinement modules enhance ViT features
  3. Multi-Scale Processing: Captures disease patterns at different receptive fields
  4. Early Stopping: Automatic training halt after 5 epochs without improvement
  5. Robust Performance: Balanced accuracy across all disease categories
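The parameter figures quoted in the Training Configuration section and in feature 1 are internally consistent, as a quick arithmetic check shows:

```python
# Figures copied from the Training Configuration table.
total = 22_484_580
trainable = 1_739_588
frozen = 20_744_992

assert trainable + frozen == total       # counts add up exactly
trainable_ratio = trainable / total      # ~0.0774, i.e. the quoted 7.7%
```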