# DistilBERT Model for Crop Recommendation Based on Environmental Parameters

This repository contains a fine-tuned DistilBERT model trained for crop recommendation using structured agricultural data. By converting numerical environmental features into text format, the model leverages transformer-based NLP techniques to classify the most suitable crop type.

## 🌾 Problem Statement

The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem; here, we explore an alternative approach that applies an NLP model (DistilBERT) to serialized tabular data.

---

## 📊 Dataset

- **Source:** Crop Recommendation Dataset
- **Features:**
  - N: Nitrogen content in soil
  - P: Phosphorus content in soil
  - K: Potassium content in soil
  - Temperature: °C
  - Humidity: %
  - pH: Soil acidity
  - Rainfall: mm

- **Target:** Crop label (22 crop types)

The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
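As a rough sketch of this preprocessing step (the column names below are assumed from the feature list above, not taken from the original code):

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's feature columns
df = pd.DataFrame({
    "N": [90], "P": [42], "K": [43],
    "temperature": [20.879744], "humidity": [82.002744],
    "ph": [6.502985], "rainfall": [202.935536],
})

feature_cols = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]

# Serialize each row into one space-separated string for the tokenizer
df["text"] = df[feature_cols].astype(str).agg(" ".join, axis=1)
print(df["text"].iloc[0])  # "90 42 43 20.879744 82.002744 6.502985 202.935536"
```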

---

## 🧠 Model Details

- **Architecture:** DistilBERT
- **Tokenizer:** `DistilBertTokenizerFast`
- **Model:** `DistilBertForSequenceClassification`
- **Task Type:** Multi-Class Classification (22 classes)

---

## 🔧 Installation

```bash
pip install transformers datasets pandas scikit-learn torch
```

---

## Loading the Model

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load model and tokenizer
model_path = "model_fp32_dir"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)

# Sample input
sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
inputs = tokenizer(sample_text, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
print("Predicted class index:", predicted_class)
# If a label mapping was saved with the model, model.config.id2label
# can translate this index back to a crop name.
```

---

## 📈 Performance Metrics

- **Accuracy:** 0.7636
- **Precision:** 0.7738
- **Recall:** 0.7636
- **F1 Score:** 0.7343
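
Since precision, recall, and F1 differ from each other, these are presumably averaged over the 22 classes; a sketch of how such metrics are typically computed with scikit-learn (toy labels for illustration only):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted class indices, for illustration only
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```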

---

## πŸ‹οΈ Fine-Tuning Details

### 📚 Dataset

The dataset is sourced from the publicly available **Crop Recommendation Dataset**. It consists of structured features such as:
- Nitrogen (N)
- Phosphorus (P)
- Potassium (K)
- Temperature (°C)
- Humidity (%)
- pH
- Rainfall (mm)

All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.

The dataset was split using an 80/20 ratio for training and testing.
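
The label factorization and 80/20 split can be sketched as follows (the `text` and `label` column names and sample values are assumptions, not taken from the original script):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical serialized rows with crop-name labels
df = pd.DataFrame({
    "text": [
        "90 42 43 20.88 82.00 6.50 202.94",
        "85 58 41 21.77 80.32 7.04 226.66",
        "60 55 44 23.00 82.32 7.84 263.96",
        "74 35 40 26.49 80.16 6.98 242.86",
        "78 42 42 20.13 81.60 7.63 262.72",
    ],
    "label": ["rice", "rice", "rice", "maize", "maize"],
})

# Factorize crop names into integer class indices
df["label_id"], label_names = pd.factorize(df["label"])

# 80/20 train/test split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```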

---

### 🔧 Training Configuration

- **Epochs:** 3  
- **Batch size:** 8  
- **Learning rate:** 2e-5  
- **Evaluation strategy:** `epoch`  
- **Model Base:** DistilBERT (`distilbert-base-uncased`)  
- **Framework:** Hugging Face Transformers + PyTorch  
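
These hyperparameters map onto Hugging Face `TrainingArguments` roughly as below (the output directory is a placeholder, and the keyword is `eval_strategy` in newer `transformers` releases versus `evaluation_strategy` in older ones):

```python
from transformers import TrainingArguments

# Sketch of the configuration listed above; not the original training script
args = TrainingArguments(
    output_dir="distilbert-crop-out",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",              # "evaluation_strategy" on older versions
)
```

A `Trainer` built from these arguments, the tokenized dataset, and `DistilBertForSequenceClassification` with `num_labels=22` completes the setup.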

---

## 🔄 Quantization

Post-training quantization was applied by casting the model to FP16 half precision with PyTorch’s `half()` method.  
This reduces the model size and speeds up inference with minimal impact on performance.
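
A minimal sketch of this cast (here a freshly initialized model stands in for the fine-tuned FP32 checkpoint, which would normally be loaded with `from_pretrained("model_fp32_dir")`):

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Stand-in model with this card's 22-class head; weights are random here
config = DistilBertConfig(num_labels=22)
model = DistilBertForSequenceClassification(config)

# Cast all parameters to FP16 and save the halved checkpoint
model = model.half()
print(next(model.parameters()).dtype)  # torch.float16
model.save_pretrained("quantized_model_fp16")
```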

The quantized model can be loaded with:

```python
model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16)
```

---

## Repository Structure

```text
.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation
```

---

## Limitations

- Uses text conversion of tabular data, which may miss deeper feature interactions.
- Trained on a specific dataset; may not generalize to different regions or conditions.
- FP16 quantization may slightly reduce accuracy in rare cases.

---

## Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.