KshitizTayal committed on
Commit
464aca0
verified
1 Parent(s): 93c4765

Update README.md

  1. README.md +149 -3
README.md CHANGED
@@ -1,3 +1,149 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # DistilBERT Model for Crop Recommendation Based on Environmental Parameters
2
+
3
+ This repository contains a DistilBERT model fine-tuned for crop recommendation on structured agricultural data. Numerical environmental features are converted to text so that a transformer-based NLP model can classify the most suitable crop.
4
+
5
+ ## 🌾 Problem Statement
6
+
7
+ The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem. Here, we instead explore applying an NLP model (DistilBERT) to serialized tabular data.
8
+
9
+ ---
10
+
11
+ ## 📊 Dataset
12
+
13
+ - **Source:** Crop Recommendation Dataset
14
+ - **Features:**
15
+   - N: Nitrogen content in soil
16
+   - P: Phosphorus content in soil
17
+   - K: Potassium content in soil
18
+   - Temperature: in Celsius
19
+   - Humidity: %
20
+   - pH: Acidity of soil
21
+   - Rainfall: mm
22
+
23
+ - **Target:** Crop label (22 crop types)
24
+
25
+ The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
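The serialization step described above could look roughly like the following sketch. The field names (`N`, `P`, `K`, `temperature`, `humidity`, `ph`, `rainfall`) and the helper `serialize_row` are assumptions for illustration; the README only states that the numeric features are joined into one space-separated string.

```python
# Hypothetical column order: the README lists these seven features,
# but not the exact column names used in the data file.
FEATURE_ORDER = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]


def serialize_row(row: dict) -> str:
    """Join the numeric features of one sample into a space-separated string."""
    return " ".join(str(row[name]) for name in FEATURE_ORDER)


sample = {
    "N": 90,
    "P": 42,
    "K": 43,
    "temperature": 20.879744,
    "humidity": 82.002744,
    "ph": 6.502985,
    "rainfall": 202.935536,
}
sample_text = serialize_row(sample)
print(sample_text)  # 90 42 43 20.879744 82.002744 6.502985 202.935536
```

The resulting string has the same shape as the `sample_text` used in the model-loading example.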
26
+
27
+ ---
28
+
29
+ ## 🧠 Model Details
30
+
31
+ - **Architecture:** DistilBERT
32
+ - **Tokenizer:** `DistilBertTokenizerFast`
33
+ - **Model:** `DistilBertForSequenceClassification`
34
+ - **Task Type:** Multi-Class Classification (22 classes)
35
+
36
+ ---
37
+
38
+ ## 🔧 Installation
39
+
40
+ ```bash
41
+ pip install transformers datasets pandas scikit-learn torch
42
+ ```
43
+
44
+ ---
45
+
46
+ ## Loading the Model
47
+
48
+ ```python
49
+ from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
50
+ import torch
51
+
52
+ # Load model and tokenizer
53
+ model_path = "path/to/your/saved_model"
54
+ tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
55
+ model = DistilBertForSequenceClassification.from_pretrained(model_path)
56
+
57
+ # Sample input
58
+ sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
59
+ inputs = tokenizer(sample_text, return_tensors="pt")
60
+
61
+ # Predict
62
+ with torch.no_grad():
63
+     outputs = model(**inputs)
64
+ predicted_class = torch.argmax(outputs.logits, dim=1).item()
65
+ print("Predicted class index:", predicted_class)
66
+ ```
67
+
68
+ ---
69
+
70
+ ## 📈 Performance Metrics
71
+
72
+ *Note: These are placeholders. Replace with actual results after evaluation.*
73
+
74
+ - **Accuracy:** 0.0477
75
+ - **Precision:** 0.0023
76
+ - **Recall:** 0.0477
77
+ - **F1 Score:** 0.0043
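As a sanity check on how to read such numbers: with 22 classes, uniform random guessing scores about 1/22 ≈ 0.045 accuracy. The sketch below shows that the four values listed are what one would obtain if a classifier assigned every test sample to a single class whose share of the test set is 0.0477. This assumes support-weighted averaging with undefined per-class precision counted as 0, as in scikit-learn's `average="weighted"` with its default zero-division handling. Both the averaging mode and the single-class reading are assumptions, not something the README states.

```python
# Assumption: every sample is predicted as one class that makes up
# a fraction `share` of the test set; metrics are support-weighted.
share = 0.0477  # equals the reported accuracy

# Only the predicted class has non-zero per-class scores.
class_precision = share  # correct predictions / all predictions
class_recall = 1.0       # every sample of that class is retrieved
class_f1 = 2 * class_precision * class_recall / (class_precision + class_recall)

# Weighted averages: each class contributes score * support fraction;
# the remaining 21 classes contribute 0.
accuracy = share
weighted_precision = share * class_precision
weighted_recall = share * class_recall
weighted_f1 = share * class_f1

print(round(accuracy, 4), round(weighted_precision, 4),
      round(weighted_recall, 4), round(weighted_f1, 4))
# 0.0477 0.0023 0.0477 0.0043
```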
78
+
79
+ ---
80
+
81
+ ## 🏋️ Fine-Tuning Details
82
+
83
+ ### 📚 Dataset
84
+
85
+ The dataset is sourced from the publicly available **Crop Recommendation Dataset**. It consists of structured features such as:
86
+ - Nitrogen (N)
87
+ - Phosphorus (P)
88
+ - Potassium (K)
89
+ - Temperature (°C)
90
+ - Humidity (%)
91
+ - pH
92
+ - Rainfall (mm)
93
+
94
+ All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.
95
+
96
+ The dataset was split using an 80/20 ratio for training and testing.
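A minimal sketch of the two preparation steps above, written with the standard library only so that it runs without extra dependencies. `factorize` mirrors the first-appearance ordering of `pandas.factorize`, and `split_80_20` is a simple shuffled split. Both helper names, the seed, and the toy rows are hypothetical; the original training script may well have used pandas and scikit-learn instead.

```python
import random


def factorize(labels):
    """Map each label to an integer index in order of first appearance."""
    index = {}
    codes = [index.setdefault(label, len(index)) for label in labels]
    return codes, list(index)


def split_80_20(items, seed=42):
    """Shuffle a copy of `items` and split it 80/20 into train and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Toy data for illustration only (not taken from the real dataset).
texts = [f"{n} 42 43 20.8 82.0 6.5 202.9" for n in range(10)]
crops = ["rice", "maize", "rice", "jute", "maize",
         "rice", "jute", "rice", "maize", "jute"]

codes, classes = factorize(crops)
train, test = split_80_20(list(zip(texts, codes)))
print(classes, len(train), len(test))  # ['rice', 'maize', 'jute'] 8 2
```

Keeping `classes` around makes it possible to turn a predicted class index back into a crop name via `classes[index]`.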
97
+
98
+ ---
99
+
100
+ ### 🔧 Training Configuration
101
+
102
+ - **Epochs:** 3
103
+ - **Batch size:** 8
104
+ - **Learning rate:** 2e-5
105
+ - **Evaluation strategy:** `epoch`
106
+ - **Model Base:** DistilBERT (`distilbert-base-uncased`)
107
+ - **Framework:** Hugging Face Transformers + PyTorch
108
+
109
+ ---
110
+
111
+ ## 🔄 Quantization
112
+
113
+ Post-training quantization was applied by casting the weights to half precision (FP16) with PyTorch's `half()`.
114
+ This roughly halves the size of the stored weights and can speed up inference on hardware with FP16 support, usually with little impact on accuracy.
115
+
116
+ The quantized model can be loaded with:
117
+
118
+ ```python
119
+ model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16)
120
+ ```
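To illustrate the size effect of FP16 casting, the sketch below compares the parameter memory of a small stand-in layer before and after `half()`. The `nn.Linear(768, 22)` layer and the `param_bytes` helper are illustrative assumptions (768 is DistilBERT's hidden size, 22 the number of crop classes); the same measurement can be applied to the full model.

```python
import torch
from torch import nn


def param_bytes(module: nn.Module) -> int:
    """Total number of bytes occupied by the module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


layer = nn.Linear(768, 22)       # stand-in for a classification head
fp32_bytes = param_bytes(layer)  # float32: 4 bytes per parameter
layer.half()                     # casts the parameters to float16 in place
fp16_bytes = param_bytes(layer)  # float16: 2 bytes per parameter

print(fp32_bytes, fp16_bytes)
print(next(layer.parameters()).dtype == torch.float16)
```

Measuring before and after the cast, rather than on two separate objects, avoids accidentally comparing differently initialized modules.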
121
+
122
+ ---
123
+
124
+ ## Repository Structure
125
+
126
+ ```text
127
+ .
128
+ ├── quantized-model/ # Contains the quantized model files
129
+ │   ├── config.json
130
+ │   ├── model.safetensors
131
+ │   ├── tokenizer_config.json
132
+ │   ├── vocab.txt
133
+ │   └── special_tokens_map.json
134
+ ├── README.md # Model documentation
135
+ ```
136
+
137
+ ---
138
+
139
+ ## Limitations
140
+
141
+ - The model is trained specifically for 22-class crop classification from the seven soil and weather features listed above.
142
+ - FP16 quantization may result in slight numerical instability in edge cases.
143
+ - Performance may degrade on inputs that fall outside the value ranges or input format of the Crop Recommendation Dataset.
144
+
145
+ ---
146
+
147
+ ## Contributing
148
+
149
+ Feel free to open issues or submit pull requests to improve the model or documentation.