 datasets:
 - enalis/LeafNet
 library_name: transformers
+---
+---
+license: mit
+tags:
+  - vision-language
+  - image-encoder
+  - text-encoder
+  - multimodal
+  - contrastive-learning
+  - explainable-ai
+  - few-shot-learning
+  - agriculture
+library_name: transformers
+datasets:
+  - enalis/LeafNet
+pipeline_tag: feature-extraction
+---
+# 🌿 SCOLD: A Vision-Language Foundation Model for Leaf Disease Identification
+**SCOLD** (Leaf Diseases Vision-Language) is a multimodal model that maps **images** and **text descriptions** into a shared embedding space. It combines a [Swin Transformer](https://huggingface.co/timm/swin_tiny_patch4_window7_224) as the **image encoder** and [RoBERTa](https://huggingface.co/roberta-base) as the **text encoder**, with both outputs projected into a common 512-dimensional space.
+This model is developed for **cross-modal retrieval**, **few-shot classification**, and **explainable AI in agriculture**, especially for plant disease diagnosis from both images and domain-specific text prompts.
+---
+## 🚀 Model Details
+| Component        | Architecture                             |
+|------------------|-------------------------------------------|
+| Image Encoder    | Swin Base (patch4, window7, 224 resolution) |
+| Text Encoder     | RoBERTa-base                              |
+| Projection Head  | Linear layer (to 512-D space)             |
+| Normalization    | L2 on both embeddings                     |
+| Training Task    | Contrastive learning                      |
+The final embeddings from image and text encoders are aligned using cosine similarity.
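
The alignment step described above can be illustrated with a minimal, dependency-free sketch. It is not the SCOLD implementation: the helper names (`l2_normalize`, `cosine_similarity`, `rank_texts`) are hypothetical, and the toy 3-D vectors are made-up stand-ins for the 512-D embeddings. It only shows that, once both embeddings are L2-normalized, cosine similarity reduces to a dot product that can be used to rank captions against an image.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def rank_texts(image_emb, text_embs):
    """Return text indices sorted from most to least similar to the image."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 3-D vectors standing in for the 512-D embeddings (made-up values).
image_emb = [0.9, 0.1, 0.0]
text_embs = [
    [0.0, 1.0, 0.0],   # mostly unrelated caption
    [1.0, 0.0, 0.0],   # closely matching caption
    [-1.0, 0.0, 0.0],  # pointing in the opposite direction
]

ranking = rank_texts(image_emb, text_embs)
print(ranking)
```

In a retrieval setting the same ranking would be applied to the model's actual image and text embeddings; the contrastive training objective itself is not specified in this card and is not sketched here.
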
+---
+## 🧩 Intended Uses & Limitations
+### ✅ Intended Use
+- Vision-language embedding for classification or retrieval tasks
+- Few-shot learning in agricultural or medical datasets
+- Multimodal interpretability or zero-shot transfer
+### ❌ Limitations
+- Not optimized for real-time inference
+- Trained on the LeafNet dataset (`enalis/LeafNet`), so coverage is bounded by the crops and diseases it contains
+- May not generalize well to non-agricultural tasks without fine-tuning
+---
+## 🧪 How to Use
+```python
+import torch
+from transformers import RobertaTokenizer
+from torchvision import transforms
+from PIL import Image
+from modeling_lvl import LVL  # Replace with your module or package
+# Load model
+model = LVL()
+model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
+model.eval()
+# Text preprocessing
+tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+text = "A maize leaf with bacterial blight"
+inputs = tokenizer(text, return_tensors="pt")
+# Image preprocessing
+image = Image.open("path_to_leaf.jpg").convert("RGB")
+transform = transforms.Compose([
+    transforms.Resize((224, 224)),
+    transforms.ToTensor()
+])
+image_tensor = transform(image).unsqueeze(0)
+# Inference
+with torch.no_grad():
+    image_emb, text_emb = model(image_tensor, inputs["input_ids"], inputs["attention_mask"])
+    similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
+    print(f"Similarity score: {similarity.item():.4f}")