Update README.md
README.md
CHANGED
|
@@ -13,4 +13,89 @@ tags:
|
|
| 13 |
datasets:
|
| 14 |
- enalis/LeafNet
|
| 15 |
library_name: transformers
|
| 16 |
-
---
|
| 13 |
datasets:
|
| 14 |
- enalis/LeafNet
|
| 15 |
library_name: transformers
|
| 16 |
+
---
|
| 17 |
+
---
|
| 18 |
+
license: mit
|
| 19 |
+
tags:
|
| 20 |
+
- vision-language
|
| 21 |
+
- image-encoder
|
| 22 |
+
- text-encoder
|
| 23 |
+
- multimodal
|
| 24 |
+
- contrastive-learning
|
| 25 |
+
- explainable-ai
|
| 26 |
+
- few-shot-learning
|
| 27 |
+
- agriculture
|
| 28 |
+
library_name: transformers
|
| 29 |
+
datasets:
|
| 30 |
+
- your-dataset-name
|
| 31 |
+
pipeline_tag: feature-extraction
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
# 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification
|
| 35 |
+
|
| 36 |
+
**SCOLD** (Leaf Diseases Vision-Language) is a multimodal model that maps **images** and **text descriptions** into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the **image encoder** and [RoBERTa](https://huggingface.co/roberta-base) as the **text encoder**, projected to a 512-dimensional common space.
|
| 37 |
+
|
| 38 |
+
This model is developed for **cross-modal retrieval**, **few-shot classification**, and **explainable AI in agriculture**, especially for plant disease diagnosis from both images and domain-specific text prompts.
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## 🚀 Model Details
|
| 43 |
+
|
| 44 |
+
| Component | Architecture |
|
| 45 |
+
|------------------|-------------------------------------------|
|
| 46 |
+
| Image Encoder | Swin Base (patch4, window7, 224 resolution) |
|
| 47 |
+
| Text Encoder | RoBERTa-base |
|
| 48 |
+
| Projection Head | Linear layer (to 512-D space) |
|
| 49 |
+
| Normalization | L2 on both embeddings |
|
| 50 |
+
| Training Task | Contrastive learning |
|
| 51 |
+
|
| 52 |
+
The final embeddings from image and text encoders are aligned using cosine similarity.
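As a rough illustration of the normalization and alignment rows above, the sketch below L2-normalizes two vectors and compares them with cosine similarity, which reduces to a dot product once both vectors have unit length. It uses plain Python lists and toy 4-D values in place of the 512-D projected embeddings; the helper names `l2_normalize` and `cosine_similarity` and all numbers are assumptions made for illustration, not part of the released model code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: the dot product of the two L2-normalized vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Toy 4-D vectors standing in for the 512-D projected embeddings (made-up values).
image_emb = [0.5, 1.0, -0.5, 2.0]
matching_text_emb = [1.0, 2.0, -1.0, 4.0]   # same direction, different length
unrelated_text_emb = [2.0, -1.0, 0.0, 0.0]  # orthogonal to image_emb

print(round(cosine_similarity(image_emb, matching_text_emb), 4))
print(round(cosine_similarity(image_emb, unrelated_text_emb), 4))
```

Because both embeddings are normalized first, the score depends only on direction, not on vector length, which is why the first pair scores close to 1 and the orthogonal pair scores close to 0.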
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## 🧩 Intended Uses & Limitations
|
| 57 |
+
|
| 58 |
+
### ✅ Intended Use
|
| 59 |
+
- Vision-language embedding for classification or retrieval tasks
|
| 60 |
+
- Few-shot learning in agricultural or medical datasets
|
| 61 |
+
- Multimodal interpretability or zero-shot transfer
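
One way the classification and zero-shot uses listed above could look in practice is to embed one text prompt per class, embed the query image, and pick the prompt with the highest cosine similarity. The sketch below assumes the embeddings have already been computed; the 3-D vectors, the class prompts, and the helper `zero_shot_classify` are hypothetical stand-ins, not outputs or APIs of the model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, prompt_embs):
    """Return the best-matching prompt label and the score for every prompt."""
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical precomputed text embeddings, one per class prompt (toy 3-D values).
prompt_embs = {
    "healthy maize leaf": [0.9, 0.1, 0.0],
    "maize leaf with bacterial blight": [0.1, 0.9, 0.2],
    "maize leaf with rust": [0.0, 0.2, 0.9],
}
# Hypothetical image embedding for the query leaf photo.
image_emb = [0.2, 0.8, 0.3]

label, scores = zero_shot_classify(image_emb, prompt_embs)
print(label)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {name}")
```

For few-shot use, the same ranking step could be applied to class prototypes averaged from a handful of labeled image embeddings instead of text prompts.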
|
| 62 |
+
|
| 63 |
+
### ❌ Limitations
|
| 64 |
+
- Not optimized for real-time inference
|
| 65 |
+
- Trained on the LeafNet dataset
|
| 66 |
+
- May not generalize well to non-agricultural tasks without fine-tuning
|
| 67 |
+
|
| 68 |
+
---
|
| 69 |
+
|
| 70 |
+
## 🧪 How to Use
|
| 71 |
+
|
| 72 |
+
```python
|
| 73 |
+
import torch
|
| 74 |
+
from transformers import RobertaTokenizer
|
| 75 |
+
from torchvision import transforms
|
| 76 |
+
from PIL import Image
|
| 77 |
+
from modeling_lvl import LVL # Replace with your module or package
|
| 78 |
+
|
| 79 |
+
# Load model
|
| 80 |
+
model = LVL()
|
| 81 |
+
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
|
| 82 |
+
model.eval()
|
| 83 |
+
|
| 84 |
+
# Text preprocessing
|
| 85 |
+
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
|
| 86 |
+
text = "A maize leaf with bacterial blight"
|
| 87 |
+
inputs = tokenizer(text, return_tensors="pt")
|
| 88 |
+
|
| 89 |
+
# Image preprocessing
|
| 90 |
+
image = Image.open("path_to_leaf.jpg").convert("RGB")
|
| 91 |
+
transform = transforms.Compose([
|
| 92 |
+
    transforms.Resize((224, 224)),
|
| 93 |
+
    transforms.ToTensor()
|
| 94 |
+
])
|
| 95 |
+
image_tensor = transform(image).unsqueeze(0)
|
| 96 |
+
|
| 97 |
+
# Inference
|
| 98 |
+
with torch.no_grad():
|
| 99 |
+
    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
|
| 100 |
+
    similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
|
| 101 |
+
    print(f"Similarity score: {similarity.item():.4f}")
|