| # DistilBERT Model for Crop Recommendation Based on Environmental Parameters | |
| This repository contains a fine-tuned DistilBERT model trained for crop recommendation using structured agricultural data. By converting numerical environmental features into text format, the model leverages transformer-based NLP techniques to classify the most suitable crop type. | |
| ## Problem Statement | |
| The goal is to recommend the most suitable crop to cultivate from soil nutrient and weather parameters. Traditional ML models treat this as a tabular classification problem. Here, we instead explore applying an NLP model (DistilBERT) to serialized tabular data. | |
| --- | |
| ## Dataset | |
| - **Source:** Crop Recommendation Dataset | |
| - **Features:** | |
|   - N: Nitrogen content in soil | |
|   - P: Phosphorus content in soil | |
|   - K: Potassium content in soil | |
|   - Temperature: in °C | |
|   - Humidity: % | |
|   - pH: Acidity of soil | |
|   - Rainfall: mm | |
| - **Target:** Crop label (22 crop types) | |
| The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization. | |
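
The serialization step described above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing code: the function name `serialize_row`, the dictionary keys, and the feature order (taken from the feature list) are assumptions. The values are the sample input used in the loading example.

```python
# Hypothetical feature order; the dataset's real column names may differ.
FEATURE_ORDER = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]


def serialize_row(row):
    """Join the numeric features of one sample into a space-separated string."""
    return " ".join(str(row[name]) for name in FEATURE_ORDER)


sample = {
    "N": 90, "P": 42, "K": 43,
    "temperature": 20.879744, "humidity": 82.002744,
    "ph": 6.502985, "rainfall": 202.935536,
}
text = serialize_row(sample)
print(text)
```

The resulting string is what gets passed to the tokenizer, so the feature order at inference time has to match the order used during training.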
| --- | |
| ## Model Details | |
| - **Architecture:** DistilBERT | |
| - **Tokenizer:** `DistilBertTokenizerFast` | |
| - **Model:** `DistilBertForSequenceClassification` | |
| - **Task Type:** Multi-Class Classification (22 classes) | |
| --- | |
| ## Installation | |
| ```bash | |
| pip install transformers datasets pandas scikit-learn torch | |
| ``` | |
| --- | |
| ## Loading the Model | |
| ```python | |
| from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification | |
| import torch | |
| # Load model and tokenizer | |
| model_path = "model_fp32_dir" | |
| tokenizer = DistilBertTokenizerFast.from_pretrained(model_path) | |
| model = DistilBertForSequenceClassification.from_pretrained(model_path) | |
| # Sample input | |
| sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536" | |
| inputs = tokenizer(sample_text, return_tensors="pt") | |
| # Predict | |
| with torch.no_grad(): | |
|     outputs = model(**inputs) | |
|     predicted_class = torch.argmax(outputs.logits, dim=1).item() | |
| print("Predicted class index:", predicted_class) | |
| ``` | |
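
The snippet above reports only the top class index. If a ranked shortlist is more useful, the logits can be turned into probabilities first. The sketch below uses plain Python and made-up logits; `softmax` and `top_k` are illustrative helpers, not part of this repository. With the real model, `outputs.logits[0].tolist()` would supply the input list.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most likely (class_index, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for a 4-class toy case; the real model emits 22 values.
toy_logits = [0.2, 2.5, -1.0, 1.1]
best = top_k(toy_logits, k=2)
print(best)
```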
| --- | |
| ## Performance Metrics | |
| - **Accuracy:** 0.7636 | |
| - **Precision:** 0.7738 | |
| - **Recall:** 0.7636 | |
| - **F1 Score:** 0.7343 | |
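
Recall equals accuracy in the figures above, which is what support-weighted averaging over classes always produces. The averaging mode is not stated here, so treat that reading as an assumption. A minimal pure-Python sketch of weighted metrics, run on toy labels rather than this model's predictions:

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)    # true samples per class
    predicted = Counter(y_pred)  # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n  # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Toy labels, not results from this model.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(acc, prec, rec, f1)
```

Because each class's recall is weighted by its share of the true labels, the weighted sum collapses to the fraction of correct predictions, which is the accuracy.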
| --- | |
| ## Fine-Tuning Details | |
| ### Dataset | |
| The dataset is sourced from the publicly available **Crop Recommendation Dataset**. It consists of structured features such as: | |
| - Nitrogen (N) | |
| - Phosphorus (P) | |
| - Potassium (K) | |
| - Temperature (°C) | |
| - Humidity (%) | |
| - pH | |
| - Rainfall (mm) | |
| All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training. | |
| The dataset was split using an 80/20 ratio for training and testing. | |
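
The label factorization and the 80/20 split can be sketched in plain Python as below. The actual training code is not shown in this README, so the helper names, the seed, and the toy labels are all assumptions. The `factorize` helper assigns indices in order of first appearance, which is also the default behavior of `pandas.factorize`.

```python
import random


def factorize(labels):
    """Map each distinct label to an integer in order of first appearance."""
    index = {}
    codes = []
    for label in labels:
        if label not in index:
            index[label] = len(index)
        codes.append(index[label])
    uniques = list(index)  # position i holds the label for class index i
    return codes, uniques


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # hypothetical seed for repeatability
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


# Toy labels for illustration only.
labels = ["rice", "maize", "rice", "coffee", "maize"]
codes, uniques = factorize(labels)
train_idx, test_idx = train_test_split_indices(10)
print(codes, uniques, len(train_idx), len(test_idx))
```

Keeping `uniques` around matters at inference time: `uniques[predicted_class]` recovers the crop name from a predicted class index.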
| --- | |
| ### Training Configuration | |
| - **Epochs:** 3 | |
| - **Batch size:** 8 | |
| - **Learning rate:** 2e-5 | |
| - **Evaluation strategy:** `epoch` | |
| - **Model Base:** DistilBERT (`distilbert-base-uncased`) | |
| - **Framework:** Hugging Face Transformers + PyTorch | |
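
To get a feel for what these settings imply, the number of optimizer steps follows directly from the split ratio, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation. The sample count is a placeholder, not the size of the actual dataset.

```python
import math


def training_steps(n_samples, test_fraction=0.2, batch_size=8, epochs=3):
    """Return (steps_per_epoch, total_steps) for a single-device run."""
    n_test = int(round(n_samples * test_fraction))
    n_train = n_samples - n_test
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 1000 is a placeholder sample count, not the size of the actual dataset.
per_epoch, total = training_steps(1000)
print(per_epoch, total)
```

With the evaluation strategy set to `epoch`, evaluation runs once after each block of `steps_per_epoch` steps, so three times over the whole run.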
| --- | |
| ## Quantization | |
| Post-training quantization was applied by casting the model weights to half precision (FP16) with PyTorch's `half()`. | |
| This roughly halves the model size and can speed up inference on hardware with FP16 support, with minimal impact on predictive performance. | |
| The quantized model can be loaded with: | |
| ```python | |
| model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16) | |
| ``` | |
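
The size reduction is simple arithmetic: FP32 stores each weight in 4 bytes and FP16 in 2. A rough estimate is sketched below, assuming the commonly cited figure of about 66 million parameters for DistilBERT base. The exact count for this checkpoint, including its classification head, is not given here.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_size_mb(n_params, dtype):
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


# Approximate parameter count (assumption); the classification head adds
# a small amount that is ignored here.
n_params = 66_000_000
fp32_mb = weights_size_mb(n_params, "fp32")
fp16_mb = weights_size_mb(n_params, "fp16")
print(fp32_mb, fp16_mb)
```

The saved files also carry tokenizer and config data, so on-disk sizes will not match these numbers exactly.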
| --- | |
| ## Repository Structure | |
| ``` | |
| . | |
| ├── quantized-model/ # Contains the quantized model files | |
| │   ├── config.json | |
| │   ├── model.safetensors | |
| │   ├── tokenizer_config.json | |
| │   ├── vocab.txt | |
| │   └── special_tokens_map.json | |
| └── README.md # Model documentation | |
| ``` | |
| --- | |
| ## Limitations | |
| - Serializes tabular data as text, so numeric values are split into subword tokens and the model may miss feature interactions that tabular models capture directly. | |
| - Trained on a specific dataset; may not generalize to different regions or conditions. | |
| - FP16 quantization may slightly reduce accuracy in rare cases. | |
| --- | |
| ## Contributing | |
| Feel free to open issues or submit pull requests to improve the model or documentation. | |