# DistilBERT Model for Crop Recommendation Based on Environmental Parameters

This repository contains a fine-tuned DistilBERT model for crop recommendation using structured agricultural data. By converting numerical environmental features into text, the model leverages transformer-based NLP techniques to classify the most suitable crop type.

## 🌾 Problem Statement

The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem. Here, we explore an alternative approach: applying an NLP model (DistilBERT) to serialized tabular data.

---

## 📊 Dataset

- **Source:** Crop Recommendation Dataset
- **Features:**
  - N: Nitrogen content in soil
  - P: Phosphorus content in soil
  - K: Potassium content in soil
  - Temperature: °C
  - Humidity: %
  - pH: Soil acidity
  - Rainfall: mm
- **Target:** Crop label (22 crop types)

The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making each row suitable for transformer-based tokenization.

---

## 🧠 Model Details

- **Architecture:** DistilBERT
- **Tokenizer:** `DistilBertTokenizerFast`
- **Model:** `DistilBertForSequenceClassification`
- **Task Type:** Multi-class classification (22 classes)

---

## 🔧 Installation

```bash
pip install transformers datasets pandas scikit-learn torch
```

---

## Loading the Model

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load model and tokenizer
model_path = "model_fp32_dir"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)

# Sample input: N P K temperature humidity pH rainfall
sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
inputs = tokenizer(sample_text, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()

print("Predicted class index:", predicted_class)
```
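The sample string above is simply the seven feature values joined by single spaces, in the dataset's column order (N, P, K, temperature, humidity, pH, rainfall). Below is a minimal sketch of that serialization step; the DataFrame and column names here are illustrative assumptions, not files shipped with this repository:

```python
import pandas as pd

# Hypothetical rows in the dataset's column order
df = pd.DataFrame(
    [[90, 42, 43, 20.879744, 82.002744, 6.502985, 202.935536]],
    columns=["N", "P", "K", "temperature", "humidity", "ph", "rainfall"],
)

# Serialize each row into one space-separated string for the tokenizer
texts = df.astype(str).apply(" ".join, axis=1).tolist()
print(texts[0])  # 90 42 43 20.879744 82.002744 6.502985 202.935536
```

Each resulting string can then be tokenized and classified exactly as in the loading example above.

---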
## 📈 Performance Metrics

- **Accuracy:** 0.7636
- **Precision:** 0.7738
- **Recall:** 0.7636
- **F1 Score:** 0.7343

---

## 🏋️ Fine-Tuning Details

### 📚 Dataset

The dataset is sourced from the publicly available **Crop Recommendation Dataset**. It consists of structured features:

- Nitrogen (N)
- Phosphorus (P)
- Potassium (K)
- Temperature (°C)
- Humidity (%)
- pH
- Rainfall (mm)

All numerical features were converted into a single textual input string for the DistilBERT tokenizer, and labels were factorized into class indices for training. The dataset was split 80/20 into training and test sets.

---

### 🔧 Training Configuration

- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`
- **Base model:** DistilBERT (`distilbert-base-uncased`)
- **Framework:** Hugging Face Transformers + PyTorch

---

## 🔄 Quantization

Post-training quantization was applied by casting the model to half precision (FP16) with PyTorch's `half()` method. This roughly halves the model size and speeds up inference with minimal impact on accuracy. The quantized model can be loaded with:

```python
model = DistilBertForSequenceClassification.from_pretrained(
    "quantized-model", torch_dtype=torch.float16
)
```

A sketch of how such a checkpoint can be produced is included at the end of this README.

---

## Repository Structure

```
.
├── quantized-model/          # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                 # Model documentation
```

---

## Limitations

- Relies on a text serialization of tabular data, which may miss deeper feature interactions.
- Trained on a single dataset; may not generalize to other regions or growing conditions.
- FP16 quantization may slightly reduce accuracy in rare cases.

---

## Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.
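---

## Appendix: Producing the FP16 Checkpoint

A minimal sketch of how the FP16 checkpoint from the Quantization section might be produced, assuming the fine-tuned FP32 model lives in `model_fp32_dir` and the output directory is `quantized-model` (both names taken from the sections above; adjust to your layout):

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# Load the fine-tuned FP32 model and tokenizer (paths assumed, as above)
model = DistilBertForSequenceClassification.from_pretrained("model_fp32_dir")
tokenizer = DistilBertTokenizerFast.from_pretrained("model_fp32_dir")

# Cast all weights to half precision (FP16)
model = model.half()

# Save the FP16 weights alongside the tokenizer files
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```

Note that FP16 inference is best run on GPU; PyTorch's CPU support for half precision is limited, so for CPU inference you may want to cast back with `model.float()` after loading.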