chenhaoq87 committed on
Commit
e4c8531
·
verified ·
1 Parent(s): 176a01e

Upload README.md with huggingface_hub

  1. README.md +293 -10
README.md CHANGED
@@ -1,10 +1,293 @@
1
- ---
2
- title: PreharvestRiskModel
3
- emoji:
4
- colorFrom: pink
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- ---
9
-
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ ---
2
+ title: PreharvestRiskModel
3
+ emoji: 🦠
4
+ colorFrom: green
5
+ colorTo: blue
6
+ sdk: docker
7
+ app_file: app.py
8
+ pinned: false
9
+ ---
10
+
11
+ # E. coli Preharvest Risk Prediction Model
12
+
13
+ ## Model Description
14
+
15
+ This machine learning model predicts the risk of E. coli contamination in preharvest produce from farm characteristics and weather conditions. It was developed to replicate and improve upon the R-based analysis from the original preharvest risk modeling study.
16
+
17
+ ## Model Selection
18
+
19
+ Three tree-based ensemble algorithms were trained and compared:
20
+
21
+ 1. **Random Forest** - Ensemble method with bootstrap aggregating
22
+ 2. **XGBoost** - Gradient boosting with advanced regularization
23
+ 3. **LightGBM** - Gradient boosting optimized for speed and efficiency
24
+
25
+ Each model was trained with:
26
+ - **5-fold stratified cross-validation**
27
+ - **Hyperparameter tuning** using RandomizedSearchCV
28
+ - **Two-stage class balancing**:
29
+   - Undersampling: reduce the majority class to a 100:1 majority-to-minority ratio
30
+   - SMOTE: oversample the minority class to a 1:1 ratio
31
+
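The arithmetic behind the two balancing stages can be sketched as follows. This is an illustration only, not the code in `train_model.py`: the function name `two_stage_counts` and the example counts are hypothetical, and only the 100:1 and 1:1 ratios come from the list above.

```python
def two_stage_counts(n_negative, n_positive, under_ratio=100):
    """Return class counts after undersampling and after SMOTE."""
    if n_positive <= 0:
        raise ValueError("need at least one positive sample")
    # Stage 1: randomly drop negatives until there are at most
    # under_ratio negatives per positive.
    negatives_kept = min(n_negative, under_ratio * n_positive)
    # Stage 2: synthesize positives until both classes are the same size.
    synthetic_positives = max(negatives_kept - n_positive, 0)
    return {
        "negatives_kept": negatives_kept,
        "positives_total": n_positive + synthetic_positives,
        "synthetic_positives": synthetic_positives,
    }


# Hypothetical counts, not taken from preharvest_data_modeling.csv.
counts = two_stage_counts(n_negative=50_000, n_positive=100)
print(counts)
# {'negatives_kept': 10000, 'positives_total': 10000, 'synthetic_positives': 9900}
```

With imbalanced-learn, these stages would typically correspond to `RandomUnderSampler(sampling_strategy=0.01)` followed by `SMOTE(sampling_strategy=1.0)`, where a float `sampling_strategy` is the minority-to-majority ratio after resampling. Whether the training script uses exactly these calls is an assumption.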
32
+ The best model was selected based on **ROC AUC score** from cross-validation.
33
+
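ROC AUC, the selection criterion named above, is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison and then picks the highest-scoring candidate. The names `roc_auc`, `model_a`, and `model_b` and all scores are placeholders for illustration, not results from this project; in practice `sklearn.metrics.roc_auc_score` would be used.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (placeholders, not project data).
y_true = [0, 0, 1, 1]
candidate_scores = {
    "model_a": [0.1, 0.4, 0.35, 0.8],
    "model_b": [0.2, 0.3, 0.6, 0.9],
}
auc_by_model = {name: roc_auc(y_true, s) for name, s in candidate_scores.items()}
best_name = max(auc_by_model, key=auc_by_model.get)
print(auc_by_model, best_name)
# {'model_a': 0.75, 'model_b': 1.0} model_b
```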
34
+ ## Training Data
35
+
36
+ **Dataset**: `preharvest_data_modeling.csv`
37
+
38
+ **Features** (145 total after preprocessing):
39
+ - Farm characteristics: organic/conventional, acreage, location (lat/lon), season
40
+ - Weather variables for multiple time periods (day 0 and 1, 3, and 7 days before sampling):
41
+   - Temperature (avg, max, min)
42
+   - Humidity (avg, max, min)
43
+   - Wind (speed, direction, chill)
44
+   - Precipitation (rain, rain rate)
45
+   - Solar radiation
46
+   - Evapotranspiration (ET)
47
+   - Heating/cooling degree days
48
+
49
+ **Target Variable**: `e_coli_positive` (Binary: Positive/Negative)
50
+
51
+ **Class Distribution**: Highly imbalanced dataset (majority class: Negative)
52
+
53
+ ## Model Performance
54
+
55
+ ### Winning Algorithm
56
+
57
+ **[Algorithm will be determined after training]**
58
+
59
+ ### Cross-Validation Metrics
60
+
61
+ Model comparison results will be saved to `model/model_comparison.json` after training.
62
+
63
+ ### Training Metrics
64
+
65
+ Performance metrics will be saved to `model/model_metrics.json` after training, including:
66
+ - ROC AUC
67
+ - Accuracy
68
+ - Precision
69
+ - Recall (Sensitivity)
70
+ - F1 Score
71
+ - Confusion Matrix
72
+
73
+ ### Feature Importance
74
+
75
+ The top 10 most important features for prediction will be available in `model/model_metrics.json`.
76
+
77
+ ## Usage
78
+
79
+ ### Training the Model
80
+
81
+ To train the model and compare all algorithms:
82
+
83
+ ```bash
84
+ python train_model.py
85
+ ```
86
+
87
+ This will:
88
+ 1. Load the data from `preharvest_data_modeling.csv`
89
+ 2. Preprocess features (imputation, encoding, scaling)
90
+ 3. Train Random Forest, XGBoost, and LightGBM with hyperparameter tuning
91
+ 4. Select the best model based on ROC AUC
92
+ 5. Save model artifacts to the `model/` directory
93
+
94
+ ### Starting the API
95
+
96
+ To start the FastAPI inference server:
97
+
98
+ ```bash
99
+ uvicorn app:app --host 0.0.0.0 --port 8000
100
+ ```
101
+
102
+ Or run directly:
103
+
104
+ ```bash
105
+ python app.py
106
+ ```
107
+
108
+ ### Making Predictions
109
+
110
+ #### Health Check
111
+
112
+ ```bash
113
+ curl http://localhost:8000/health
114
+ ```
115
+
116
+ #### Get Model Information
117
+
118
+ ```bash
119
+ curl http://localhost:8000/model_info
120
+ ```
121
+
122
+ #### Get Model Comparison
123
+
124
+ ```bash
125
+ curl http://localhost:8000/model_comparison
126
+ ```
127
+
128
+ #### Single Prediction
129
+
130
+ ```bash
131
+ curl -X POST "http://localhost:8000/predict" \
132
+   -H "Content-Type: application/json" \
133
+   -d '{
134
+     "org_conv_kiptraq": "Conventional",
135
+     "acres_kiptraq": 10.0,
136
+     "lat": 36.5,
137
+     "lon": -121.5,
138
+     "season": "Fall",
139
+     "temperature_avg_d0": 70.0,
140
+     "temperature_max_d0": 85.0,
141
+     "temperature_min_d0": 55.0,
142
+     ...
143
+   }'
144
+ ```
145
+
146
+ Response:
147
+ ```json
148
+ {
149
+   "prediction": "Negative",
150
+   "probability_positive": 0.15,
151
+   "probability_negative": 0.85,
152
+   "risk_level": "Low"
153
+ }
154
+ ```
155
+
156
+ #### Batch Prediction
157
+
158
+ ```bash
159
+ curl -X POST "http://localhost:8000/predict_batch" \
160
+   -H "Content-Type: application/json" \
161
+   -d '[{...}, {...}, {...}]'
162
+ ```
163
+
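The same batch call can be made from Python with only the standard library. This is a minimal sketch: the helper names `build_batch_request` and `predict_batch` are hypothetical, the record shown is deliberately partial (a real request needs the full feature set), and the shape of the batch response is not documented here, so it is returned as parsed JSON without further assumptions.

```python
import json
import urllib.request


def build_batch_request(records, base_url="http://localhost:8000"):
    """Build a POST request carrying a JSON array of feature records."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict_batch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_batch(records, base_url="http://localhost:8000", timeout=30):
    """Send the records to a running server and return the parsed JSON reply."""
    request = build_batch_request(records, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Partial record for illustration; field names are from the single-prediction example.
example_records = [{"org_conv_kiptraq": "Conventional", "season": "Fall"}]
example_request = build_batch_request(example_records)
print(example_request.get_method(), example_request.full_url)
```

Calling `predict_batch(example_records)` requires the API server from "Starting the API" to be running.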
164
+ ### Interactive API Documentation
165
+
166
+ FastAPI provides automatic interactive documentation:
167
+ - **Swagger UI**: http://localhost:8000/docs
168
+ - **ReDoc**: http://localhost:8000/redoc
169
+
170
+ ## API Endpoints
171
+
172
+ | Endpoint | Method | Description |
173
+ |----------|--------|-------------|
174
+ | `/` | GET | API information |
175
+ | `/health` | GET | Health check |
176
+ | `/model_info` | GET | Model metadata and performance metrics |
177
+ | `/model_comparison` | GET | Comparison of all trained models |
178
+ | `/predict` | POST | Single prediction |
179
+ | `/predict_batch` | POST | Batch predictions |
180
+
181
+ ## Model Artifacts
182
+
183
+ After training, the following files are saved in the `model/` directory:
184
+
185
+ - `best_model.joblib` - Trained model (winning algorithm)
186
+ - `preprocessor.joblib` - Preprocessing pipeline
187
+ - `feature_names.json` - List of feature names
188
+ - `model_metrics.json` - Performance metrics and feature importance
189
+ - `model_comparison.json` - Comparison results for all algorithms
190
+
191
+ ## Installation
192
+
193
+ ```bash
194
+ pip install -r requirements.txt
195
+ ```
196
+
197
+ ## Dependencies
198
+
199
+ - Python ≥ 3.8
200
+ - pandas ≥ 1.5.0
201
+ - numpy ≥ 1.23.0
202
+ - scikit-learn ≥ 1.3.0
203
+ - imbalanced-learn ≥ 0.11.0
204
+ - xgboost ≥ 2.0.0
205
+ - lightgbm ≥ 4.0.0
206
+ - fastapi ≥ 0.104.0
207
+ - uvicorn ≥ 0.24.0
208
+ - pydantic ≥ 2.0.0
209
+ - joblib ≥ 1.3.0
210
+
211
+ ## Deployment on Hugging Face
212
+
213
+ ### Option 1: Hugging Face Spaces (Recommended)
214
+
215
+ 1. Create a new Space on Hugging Face
216
+ 2. Select "Docker" as the Space SDK
217
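+ 3. Upload all files including `Dockerfile` (see below). Docker Spaces route traffic to port 7860 by default, so add `app_port: 8000` to the README front matter to match the port the Dockerfile serves on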
218
+ 4. The Space will automatically build and deploy
219
+
220
+ ### Option 2: Push to Model Hub
221
+
222
+ ```bash
223
+ # Install huggingface_hub
224
+ pip install huggingface_hub
225
+
226
+ # Login
227
+ huggingface-cli login
228
+
229
+ # Push model
230
+ python -c "
231
+ from huggingface_hub import HfApi
232
+ api = HfApi()
233
+ api.upload_folder(
234
+     folder_path='model/',
235
+     repo_id='your-username/ecoli-risk-model',
236
+     repo_type='model'
237
+ )
238
+ "
239
+ ```
240
+
241
+ ### Dockerfile for Deployment
242
+
243
+ ```dockerfile
244
+ FROM python:3.10-slim
245
+
246
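+ WORKDIR /app
+
+ # Assumption: LightGBM's prebuilt wheel needs the OpenMP runtime (libgomp1), which slim images omit
+ RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 && rm -rf /var/lib/apt/lists/*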
247
+
248
+ COPY requirements.txt .
249
+ RUN pip install --no-cache-dir -r requirements.txt
250
+
251
+ COPY . .
252
+
253
+ EXPOSE 8000
254
+
255
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
256
+ ```
257
+
258
+ ## Limitations and Considerations
259
+
260
+ 1. **Class Imbalance**: The dataset is highly imbalanced. The model uses two-stage balancing (undersampling + SMOTE) during training.
261
+
262
+ 2. **Temporal Validity**: The model is trained on historical data and may need retraining with new data to maintain performance.
263
+
264
+ 3. **Geographic Scope**: Model performance may vary for farms outside the geographic range of the training data.
265
+
266
+ 4. **Weather Data Dependency**: Predictions expect weather data for the sampling day (day 0) and for 1, 3, and 7 days before sampling.
267
+
268
+ 5. **Missing Values**: The model handles missing values through imputation, but predictions may be less reliable with extensive missing data.
269
+
270
+ 6. **Risk Level Interpretation**:
271
+    - Low Risk: P(Positive) < 0.3
272
+    - Medium Risk: 0.3 ≤ P(Positive) < 0.7
273
+    - High Risk: P(Positive) ≥ 0.7
274
+
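The risk-level thresholds above can be expressed as a small helper. This is a minimal sketch rather than the code in `app.py`: the function name `risk_level` is hypothetical, and only the 0.3 and 0.7 cut-offs come from the list above.

```python
def risk_level(p_positive: float) -> str:
    """Map P(Positive) to a risk label using the documented cut-offs."""
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_positive < 0.3:
        return "Low"
    if p_positive < 0.7:
        return "Medium"
    return "High"


print(risk_level(0.15))  # Low
```

Both boundaries belong to the higher band, so a probability of exactly 0.3 is "Medium" and exactly 0.7 is "High".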
275
+ ## Citation
276
+
277
+ If you use this model, please cite the original R-based analysis:
278
+
279
+ ```
280
+ [Original analysis citation to be added]
281
+ ```
282
+
283
+ ## License
284
+
285
+ [License to be specified]
286
+
287
+ ## Contact
288
+
289
+ For questions or issues, please open an issue on the repository.
290
+
291
+ ## Version History
292
+
293
+ - **v1.0.0** (2026-01-28): Initial release with Random Forest, XGBoost, and LightGBM comparison