# DistilBERT Model for Crop Recommendation Based on Environmental Parameters

This repository contains a fine-tuned DistilBERT model trained for crop recommendation using structured agricultural data. By converting numerical environmental features into text format, the model leverages transformer-based NLP techniques to classify the most suitable crop type.

## 🌾 Problem Statement

The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem; here, we explore an alternative approach that applies an NLP model (DistilBERT) to serialized tabular data.

---

## 📊 Dataset

- **Source:** Crop Recommendation Dataset
- **Features:**
  - N: Nitrogen content in soil
  - P: Phosphorus content in soil
  - K: Potassium content in soil
  - Temperature: °C
  - Humidity: %
  - pH: Soil acidity
  - Rainfall: mm

- **Target:** Crop label (22 crop types)

The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
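As a rough sketch of this preprocessing step (the column names below are assumed from the feature list above, not taken from the original code):

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's feature columns
df = pd.DataFrame({
    "N": [90], "P": [42], "K": [43],
    "temperature": [20.879744], "humidity": [82.002744],
    "ph": [6.502985], "rainfall": [202.935536],
})

feature_cols = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]

# Serialize each row into one space-separated string for the tokenizer
df["text"] = df[feature_cols].astype(str).agg(" ".join, axis=1)
print(df["text"].iloc[0])  # "90 42 43 20.879744 82.002744 6.502985 202.935536"
```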

---

## 🧠 Model Details

- **Architecture:** DistilBERT
- **Tokenizer:** `DistilBertTokenizerFast`
- **Model:** `DistilBertForSequenceClassification`
- **Task Type:** Multi-Class Classification (22 classes)

---

## 🔧 Installation

```bash
pip install transformers datasets pandas scikit-learn torch
```

---

## Loading the Model

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load model and tokenizer
model_path = "model_fp32_dir"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)

# Sample input
sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
inputs = tokenizer(sample_text, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
print("Predicted class index:", predicted_class)
# If a label mapping was saved with the model, model.config.id2label
# can translate this index back to a crop name.
```

---

## 📈 Performance Metrics

- **Accuracy:** 0.7636
- **Precision:** 0.7738
- **Recall:** 0.7636
- **F1 Score:** 0.7343
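
Since precision, recall, and F1 differ from each other, these are presumably averaged over the 22 classes; a sketch of how such metrics are typically computed with scikit-learn (toy labels for illustration only):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted class indices, for illustration only
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```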

---

## πŸ‹οΈ Fine-Tuning Details

### 📚 Dataset

The dataset is sourced from the publicly available **Crop Recommendation Dataset**. It consists of structured features such as:
- Nitrogen (N)
- Phosphorus (P)
- Potassium (K)
- Temperature (°C)
- Humidity (%)
- pH
- Rainfall (mm)

All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.

The dataset was split using an 80/20 ratio for training and testing.
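
The label factorization and 80/20 split can be sketched as follows (the `text` and `label` column names and sample values are assumptions, not taken from the original script):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical serialized rows with crop-name labels
df = pd.DataFrame({
    "text": [
        "90 42 43 20.88 82.00 6.50 202.94",
        "85 58 41 21.77 80.32 7.04 226.66",
        "60 55 44 23.00 82.32 7.84 263.96",
        "74 35 40 26.49 80.16 6.98 242.86",
        "78 42 42 20.13 81.60 7.63 262.72",
    ],
    "label": ["rice", "rice", "rice", "maize", "maize"],
})

# Factorize crop names into integer class indices
df["label_id"], label_names = pd.factorize(df["label"])

# 80/20 train/test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```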

---

### 🔧 Training Configuration

- **Epochs:** 3  
- **Batch size:** 8  
- **Learning rate:** 2e-5  
- **Evaluation strategy:** `epoch`  
- **Model Base:** DistilBERT (`distilbert-base-uncased`)  
- **Framework:** Hugging Face Transformers + PyTorch  
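
These hyperparameters map onto Hugging Face `TrainingArguments` roughly as below (the output directory is a placeholder, and the keyword is `eval_strategy` in newer `transformers` releases versus `evaluation_strategy` in older ones):

```python
from transformers import TrainingArguments

# Sketch of the configuration listed above; not the original training script
args = TrainingArguments(
    output_dir="distilbert-crop-out",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",              # "evaluation_strategy" on older versions
)
```

A `Trainer` built from these arguments, the tokenized dataset, and `DistilBertForSequenceClassification` with `num_labels=22` completes the setup.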

---

## 🔄 Quantization

Post-training quantization was applied by casting the model to FP16 half precision with PyTorch’s `half()` method.  
This reduces the model size and speeds up inference with minimal impact on performance.
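
A minimal sketch of this cast (here a freshly initialized model stands in for the fine-tuned FP32 checkpoint, which would normally be loaded with `from_pretrained("model_fp32_dir")`):

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Stand-in model with this card's 22-class head; weights are random here
config = DistilBertConfig(num_labels=22)
model = DistilBertForSequenceClassification(config)

# Cast all parameters to FP16 and save the halved checkpoint
model = model.half()
print(next(model.parameters()).dtype)  # torch.float16
model.save_pretrained("quantized_model_fp16")
```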

The quantized model can be loaded with:

```python
model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16)
```

---

## Repository Structure

```text
.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation
```

---

## Limitations

- Uses text conversion of tabular data, which may miss deeper feature interactions.
- Trained on a specific dataset; may not generalize to different regions or conditions.
- FP16 quantization may slightly reduce accuracy in rare cases.

---

## Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.