# DistilBERT Model for Crop Recommendation Based on Environmental Parameters
This repository contains a fine-tuned DistilBERT model trained for crop recommendation using structured agricultural data. By converting numerical environmental features into text format, the model leverages transformer-based NLP techniques to classify the most suitable crop type.
## Problem Statement
The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem. Here, we explore an alternative approach: applying an NLP model (DistilBERT) to serialized tabular data.
---
## Dataset
- **Source:** Crop Recommendation Dataset
- **Features:**
- N: Nitrogen content in soil
- P: Phosphorus content in soil
- K: Potassium content in soil
- Temperature: in Celsius
- Humidity: %
- pH: Acidity of soil
- Rainfall: mm
- **Target:** Crop label (22 crop types)
The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
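The serialization step described above can be sketched with pandas as follows. The column names and sample row are assumptions based on the feature list; the key idea is simply joining the stringified numeric values with spaces.

```python
import pandas as pd

# Hypothetical row in the same schema as the Crop Recommendation Dataset
df = pd.DataFrame({
    "N": [90], "P": [42], "K": [43],
    "temperature": [20.879744], "humidity": [82.002744],
    "ph": [6.502985], "rainfall": [202.935536],
})

feature_cols = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]

# Serialize each row into a single space-separated string for the tokenizer
df["text"] = df[feature_cols].astype(str).agg(" ".join, axis=1)
print(df["text"].iloc[0])  # → "90 42 43 20.879744 82.002744 6.502985 202.935536"
```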
---
## Model Details
- **Architecture:** DistilBERT
- **Tokenizer:** `DistilBertTokenizerFast`
- **Model:** `DistilBertForSequenceClassification`
- **Task Type:** Multi-Class Classification (22 classes)
---
## Installation
```bash
pip install transformers datasets pandas scikit-learn torch
```
---
## Loading the Model
```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch
# Load model and tokenizer
model_path = "model_fp32_dir"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)
# Sample input
sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
inputs = tokenizer(sample_text, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = torch.argmax(outputs.logits, dim=1).item()
print("Predicted class index:", predicted_class)
```
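To turn the raw class index into something more interpretable, you can softmax the logits into probabilities and, if the label mapping was saved with the model, look the index up in `model.config.id2label` (an assumption; a freshly fine-tuned model may only have generic `LABEL_0`-style names). A minimal sketch with hypothetical logits:

```python
import torch

# Hypothetical logits for a 22-class head; in practice use outputs.logits
logits = torch.tensor([[0.1] * 21 + [2.5]])  # shape (1, 22)

probs = torch.softmax(logits, dim=1)          # convert logits to probabilities
idx = int(torch.argmax(probs, dim=1))         # predicted class index

# If the fine-tuned model saved its label names:
# crop_name = model.config.id2label[idx]
print(idx, float(probs[0, idx]))
```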
---
## Performance Metrics
- **Accuracy:** 0.7636
- **Precision:** 0.7738
- **Recall:** 0.7636
- **F1 Score:** 0.7343
---
## Fine-Tuning Details
### Dataset
The dataset is sourced from the publicly available **Crop Recommendation Dataset**. It consists of structured features such as:
- Nitrogen (N)
- Phosphorus (P)
- Potassium (K)
- Temperature (°C)
- Humidity (%)
- pH
- Rainfall (mm)
All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.
The dataset was split using an 80/20 ratio for training and testing.
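The factorization and split steps can be sketched as below. The toy rows and crop names are illustrative stand-ins, not actual dataset entries; `pd.factorize` assigns integer IDs in order of first appearance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative serialized rows (not actual dataset entries)
df = pd.DataFrame({
    "text": ["90 42 43 20.9 82.0 6.5 202.9", "85 58 41 21.8 80.3 7.0 226.7",
             "20 67 20 22.7 92.0 5.7 110.1", "40 72 77 17.0 16.9 7.5 88.5"],
    "label": ["rice", "rice", "banana", "chickpea"],
})

# Factorize string labels into integer class indices
df["label_id"], label_names = pd.factorize(df["label"])

# 80/20 train/test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))  # → 3 1
```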
---
### Training Configuration
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`
- **Model Base:** DistilBERT (`distilbert-base-uncased`)
- **Framework:** Hugging Face Transformers + PyTorch
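The configuration above maps onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch: the output directory name is an assumption based on the loading example, and newer `transformers` releases rename `evaluation_strategy` to `eval_strategy`.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="model_fp32_dir",      # assumed; matches the loading example
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",      # `eval_strategy` in recent versions
)
# `args` is then passed to a `Trainer` along with the model and datasets.
```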
---
## Quantization
Post-training quantization was applied by casting the model to half precision (FP16) with PyTorch's `half()` method.
This halves the model's storage footprint and can speed up inference with minimal impact on performance.
The quantized model can be loaded with:
```python
model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16)
```
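The conversion itself presumably amounts to `model.half()` followed by `save_pretrained("quantized_model_fp16")`. The storage effect can be demonstrated on a stand-in linear layer, since the same cast applies to every parameter of the full model:

```python
import torch

# Stand-in layer with DistilBERT-like dimensions (768 hidden, 22 classes)
layer = torch.nn.Linear(768, 22)
fp32_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())

layer = layer.half()  # casts all parameters to torch.float16
fp16_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())

print(fp32_bytes, fp16_bytes)  # FP16 storage is exactly half
```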
---
## Repository Structure
```text
.
├── quantized-model/            # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                   # Model documentation
```
---
## Limitations
- Uses text conversion of tabular data, which may miss deeper feature interactions.
- Trained on a specific dataset; may not generalize to different regions or conditions.
- FP16 quantization may slightly reduce accuracy in rare cases.
---
## Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.