# DistilBERT Model for Crop Recommendation Based on Environmental Parameters
This repository contains a fine-tuned DistilBERT model trained for crop recommendation using structured agricultural data. By converting numerical environmental features into text format, the model leverages transformer-based NLP techniques to classify the most suitable crop type.
## 🌾 Problem Statement
The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem; here, we instead apply an NLP model (DistilBERT) to a serialized text form of the same data.
---
## 📊 Dataset
- **Source:** Crop Recommendation Dataset
- **Features:**
  - N: nitrogen content in the soil
  - P: phosphorus content in the soil
  - K: potassium content in the soil
  - Temperature: ambient temperature in °C
  - Humidity: relative humidity in %
  - pH: soil pH value
  - Rainfall: rainfall in mm
- **Target:** Crop label (22 crop types)
The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
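As a minimal sketch of that serialization (the column names below are assumptions about the CSV header, not confirmed by this repository):

```python
import pandas as pd

# Assumed column names for the Crop Recommendation CSV.
FEATURES = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]

df = pd.read_csv("Crop_recommendation.csv")  # hypothetical filename

# One space-separated string per row, e.g.
# "90 42 43 20.879744 82.002744 6.502985 202.935536"
df["text"] = df[FEATURES].astype(str).agg(" ".join, axis=1)
```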
---
## 🧠 Model Details
- **Architecture:** DistilBERT
- **Tokenizer:** `DistilBertTokenizerFast`
- **Model:** `DistilBertForSequenceClassification`
- **Task Type:** Multi-Class Classification (22 classes)
---
## 🔧 Installation
```bash
pip install transformers datasets pandas scikit-learn torch
```
---
## Loading the Model
```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch
# Load model and tokenizer
model_path = "model_fp32_dir"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)
# Sample input
sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
inputs = tokenizer(sample_text, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
print("Predicted class index:", predicted_class)
```
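If the saved config carries an `id2label` mapping (standard in Hugging Face checkpoints, though not verified for this one), the index can be turned back into a crop name:

```python
# Falls back to the raw index if no mapping was stored at save time.
label = model.config.id2label.get(predicted_class, str(predicted_class))
print("Predicted crop:", label)
```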
---
## 📈 Performance Metrics
- **Accuracy:** 0.7636
- **Precision:** 0.7738
- **Recall:** 0.7636
- **F1 Score:** 0.7343
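For reference, these four metrics can be computed with scikit-learn; the weighted averaging below is an assumption, chosen because it handles class imbalance across the 22 crops:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Metric callback in the format the Hugging Face Trainer expects."""
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```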
---
## πŸ‹οΈ Fine-Tuning Details
### πŸ“š Dataset
The dataset is sourced from the publicly available **Crop Recommendation Dataset**. It consists of structured features such as:
- Nitrogen (N)
- Phosphorus (P)
- Potassium (K)
- Temperature (Β°C)
- Humidity (%)
- pH
- Rainfall (mm)
All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.
The dataset was split using an 80/20 ratio for training and testing.
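A sketch of the label handling and split, assuming pandas `factorize` and a scikit-learn split (the `label` column name is illustrative, and stratification is added here for illustration; the README only states the 80/20 ratio):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Crop_recommendation.csv")  # hypothetical filename

# Map crop names to integer class indices (0..21); keep the name
# array so predicted indices can be mapped back to crops later.
df["labels"], crop_names = pd.factorize(df["label"])

# 80/20 train/test split.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["labels"], random_state=42
)
```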
---
### 🔧 Training Configuration
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`
- **Model Base:** DistilBERT (`distilbert-base-uncased`)
- **Framework:** Hugging Face Transformers + PyTorch
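A sketch of a `Trainer` setup matching these hyperparameters (the actual training script is not part of this repository; `train_dataset` and `eval_dataset` are assumed to be tokenized text/label datasets prepared as in the preprocessing above):

```python
from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=22
)

training_args = TrainingArguments(
    output_dir="model_fp32_dir",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized beforehand (assumed)
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # see the metrics sketch above
)
trainer.train()
```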
---
## 🔄 Quantization
Post-training quantization was applied by casting the model weights to FP16 with PyTorch's `half()`.
This reduces the model size and speeds up inference with minimal impact on performance.
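A minimal sketch of how such an FP16 checkpoint can be produced (directory names follow the other snippets in this README):

```python
from transformers import DistilBertForSequenceClassification

# Load the FP32 checkpoint, cast all weights to FP16, and save.
model = DistilBertForSequenceClassification.from_pretrained("model_fp32_dir")
model = model.half()
model.save_pretrained("quantized_model_fp16")
```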
The quantized model can be loaded with:
```python
import torch
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "quantized_model_fp16", torch_dtype=torch.float16
)
```
---
## Repository Structure
```text
.
├── quantized-model/          # Quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                 # Model documentation
```
---
## Limitations
- Serializing tabular features as text may discard numeric structure and deeper feature interactions that tabular models capture.
- Trained on a specific dataset; may not generalize to different regions or conditions.
- FP16 quantization may slightly reduce accuracy in rare cases.
---
## Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.