	# DistilBERT Model for Crop Recommendation Based on Environmental Parameters

	This repository contains a fine-tuned DistilBERT model trained for crop recommendation using structured agricultural data. By converting numerical environmental features into text format, the model leverages transformer-based NLP techniques to classify the most suitable crop type.

	## 🌾 Problem Statement

	The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem. Here, we instead explore applying an NLP model (DistilBERT) to tabular data serialized as text.

	---

	## 📊 Dataset

	- Source: Crop Recommendation Dataset
	- Features:
	  - N: Nitrogen content in soil
	  - P: Phosphorus content in soil
	  - K: Potassium content in soil
	  - Temperature: in Celsius
	  - Humidity: %
	  - pH: Acidity of soil
	  - Rainfall: mm

	- Target: Crop label (22 crop types)

	The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
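
The serialization step described above can be sketched as follows. The helper name `row_to_text` and the dictionary keys are hypothetical; the column order is an assumption taken from the feature list above.

```python
# Assumed column order, following the feature list above (key names are hypothetical).
FEATURE_ORDER = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]


def row_to_text(row: dict) -> str:
    """Join the numeric features of one record into a space-separated string."""
    return " ".join(str(row[name]) for name in FEATURE_ORDER)


example_row = {
    "N": 90, "P": 42, "K": 43,
    "temperature": 20.879744, "humidity": 82.002744,
    "ph": 6.502985, "rainfall": 202.935536,
}
text = row_to_text(example_row)
print(text)
```

The resulting string is what gets passed to the tokenizer in place of a natural-language sentence.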

	---

	## 🧠 Model Details

	- Architecture: DistilBERT
	- Tokenizer: `DistilBertTokenizerFast`
	- Model: `DistilBertForSequenceClassification`
	- Task Type: Multi-Class Classification (22 classes)

	---

	## 🔧 Installation

	```bash
	pip install transformers datasets pandas scikit-learn torch
	```

	---

	## Loading the Model

	```python
	from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
	import torch

	# Load model and tokenizer
	model_path = "model_fp32_dir"
	tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
	model = DistilBertForSequenceClassification.from_pretrained(model_path)

	# Sample input
	sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
	inputs = tokenizer(sample_text, return_tensors="pt")

	# Predict
	with torch.no_grad():
	    outputs = model(**inputs)
	predicted_class = torch.argmax(outputs.logits, dim=1).item()
	print("Predicted class index:", predicted_class)
	```

	---

	## 📈 Performance Metrics

	- Accuracy: 0.7636
	- Precision: 0.7738
	- Recall: 0.7636
	- F1 Score: 0.7343
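
This README does not state how precision, recall, and F1 were averaged across the 22 classes. A recall equal to the accuracy is what support-weighted averaging produces, because weighted recall reduces to overall accuracy. The dependency-free sketch below illustrates that identity on made-up labels; the function name and the toy data are hypothetical, and this is not the evaluation script behind the numbers above.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total         # class weight = share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (not taken from the crop dataset).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

With scikit-learn installed, `precision_recall_fscore_support(y_true, y_pred, average="weighted")` computes the same precision, recall, and F1.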

	---

	## 🏋️ Fine-Tuning Details

	### 📚 Dataset

	The dataset is sourced from the publicly available Crop Recommendation Dataset. It consists of structured features such as:
	- Nitrogen (N)
	- Phosphorus (P)
	- Potassium (K)
	- Temperature (°C)
	- Humidity (%)
	- pH
	- Rainfall (mm)

	All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.

	The dataset was split using an 80/20 ratio for training and testing.
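
A dependency-free sketch of the two preparation steps above: mapping crop names to integer class indices and holding out 20% of the rows for testing. The helper names, the seed, and the toy labels are hypothetical. `pandas.factorize` and scikit-learn's `train_test_split` are the usual tools for this; the README does not say which were used or whether the split was stratified.

```python
import random


def factorize(labels):
    """Map each distinct label to an integer index in order of first appearance."""
    mapping = {}
    codes = [mapping.setdefault(label, len(mapping)) for label in labels]
    return codes, mapping


def split_80_20(items, seed=42):
    """Shuffle a copy of the items and split it 80/20 into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5   # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Toy labels for illustration only.
crops = ["rice", "maize", "rice", "coffee", "maize",
         "rice", "jute", "coffee", "rice", "maize"]
codes, label_to_id = factorize(crops)
train, test = split_80_20(list(zip(crops, codes)))
print(label_to_id, len(train), len(test))
```

Keeping `label_to_id` around matters at inference time, since the model only returns the class index.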

	---

	### 🔧 Training Configuration

	- Epochs: 3
	- Batch size: 8
	- Learning rate: 2e-5
	- Evaluation strategy: `epoch`
	- Model Base: DistilBERT (`distilbert-base-uncased`)
	- Framework: Hugging Face Transformers + PyTorch
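
The settings above map onto Hugging Face `TrainingArguments` keyword arguments roughly as follows. This is a sketch, not the original training script: `output_dir` is a placeholder, and reading "batch size" as the per-device training batch size is an assumption. The evaluation keyword is `eval_strategy` in recent Transformers releases and `evaluation_strategy` in older ones.

```python
# Keyword arguments intended for transformers.TrainingArguments(**training_config).
training_config = {
    "output_dir": "./results",           # placeholder path (assumption)
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,    # assumes "batch size" means per-device
    "learning_rate": 2e-5,
    "eval_strategy": "epoch",            # "evaluation_strategy" on older releases
}
print(training_config)
```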

	---

	## 🔄 Quantization

	The model was converted to half precision (FP16) after training using PyTorch's `half()`.
	This roughly halves the model size and can speed up inference on hardware with native FP16 support, with minimal impact on accuracy.

	The quantized model can be loaded with:

	```python
	model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16)
	```
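
To make the size reduction concrete, the sketch below casts a single stand-in layer to FP16 and compares parameter memory before and after. The helper `param_bytes` is hypothetical, and a lone `nn.Linear` is used only so the example runs without the fine-tuned checkpoint.

```python
import torch
from torch import nn


def param_bytes(module: nn.Module) -> int:
    """Total bytes occupied by a module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


# Stand-in layer; 768 is DistilBERT's hidden size.
layer = nn.Linear(768, 768)
fp32_bytes = param_bytes(layer)

layer = layer.half()   # casts floating-point parameters to torch.float16
fp16_bytes = param_bytes(layer)

print(fp32_bytes, fp16_bytes)
```

Applied to the full model, the equivalent would be `model.half()` followed by `model.save_pretrained(...)`; the exact steps used to produce this repository's files are not documented here.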

	---

	## Repository Structure

	```text
	.
	├── quantized-model/            # Contains the quantized model files
	│   ├── config.json
	│   ├── model.safetensors
	│   ├── tokenizer_config.json
	│   ├── vocab.txt
	│   └── special_tokens_map.json
	└── README.md                   # Model documentation
	```

	---

	## Limitations

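	- Serializes tabular data as text: the WordPiece tokenizer splits values such as `20.879744` into subword pieces, so the model may miss numeric magnitudes and feature interactions that tabular models capture directly.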
	- Trained on a specific dataset; may not generalize to different regions or conditions.
	- FP16 quantization may slightly reduce accuracy in rare cases.

	---

	## Contributing

	Feel free to open issues or submit pull requests to improve the model or documentation.