Space: chenhaoq87/PreharvestRiskModel

chenhaoq87 committed on Jan 28

Commit e4c8531 (verified) · 1 parent: 176a01e

Upload README.md with huggingface_hub

Files changed (1)

README.md +293 -10

@@ -1,10 +1,293 @@
----
-title: PreharvestRiskModel
-emoji: ⚡
-colorFrom: pink
-colorTo: purple
-sdk: docker
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+---
+title: PreharvestRiskModel
+emoji: 🦠
+colorFrom: green
+colorTo: blue
+sdk: docker
+app_file: app.py
+pinned: false
+---
+# E. coli Preharvest Risk Prediction Model
+## Model Description
+This machine learning model predicts the risk of E. coli contamination in preharvest produce from farm characteristics and weather conditions. It was developed to replicate and improve upon the R-based analysis from the original preharvest risk modeling study.
+## Model Selection
+Three tree-based ensemble algorithms were trained and compared:
+1. **Random Forest** - Ensemble method with bootstrap aggregating
+2. **XGBoost** - Gradient boosting with advanced regularization
+3. **LightGBM** - Gradient boosting optimized for speed and efficiency
+Each model was trained with:
+- **5-fold stratified cross-validation**
+- **Hyperparameter tuning** using RandomizedSearchCV
+- **Two-stage class balancing**:
+  - Undersampling: reduce the majority class to a 100:1 majority-to-minority ratio
+  - SMOTE: oversample the minority class to a 1:1 ratio
+The best model was selected based on **ROC AUC score** from cross-validation.
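The two-stage balancing described above can be illustrated with plain arithmetic. This is a minimal sketch, not the project's training code: the function name `two_stage_class_counts` and the sample counts are hypothetical, and it assumes the Negative class is the majority. With imbalanced-learn, the same targets would typically correspond to `sampling_strategy=0.01` for `RandomUnderSampler` and `sampling_strategy=1.0` for `SMOTE`, since both express the minority-to-majority ratio after resampling.

```python
def two_stage_class_counts(n_negative: int, n_positive: int, max_ratio: int = 100) -> dict:
    """Return (negative, positive) counts after each balancing stage."""
    if n_negative <= 0 or n_positive <= 0:
        raise ValueError("both classes need at least one sample")
    # Stage 1: cap the majority class at max_ratio negatives per positive.
    neg_after_under = min(n_negative, max_ratio * n_positive)
    # Stage 2: SMOTE adds synthetic minority samples until the classes are 1:1.
    pos_after_smote = max(n_positive, neg_after_under)
    return {
        "after_undersampling": (neg_after_under, n_positive),
        "after_smote": (neg_after_under, pos_after_smote),
        "synthetic_positives": pos_after_smote - n_positive,
    }


# Hypothetical counts for illustration only, not taken from the dataset.
counts = two_stage_class_counts(n_negative=50_000, n_positive=40)
print(counts)
```

Resampling of this kind is normally applied to the training folds only, so that validation folds keep the original class distribution.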
+## Training Data
+**Dataset**: `preharvest_data_modeling.csv`
+**Features** (145 total after preprocessing):
+- Farm characteristics: organic/conventional, acreage, location (lat/lon), season
+- Weather variables for multiple time windows (0, 1, 3, and 7 days before sampling):
+  - Temperature (avg, max, min)
+  - Humidity (avg, max, min)
+  - Wind (speed, direction, chill)
+  - Precipitation (rain, rain rate)
+  - Solar radiation
+  - Evapotranspiration (ET)
+  - Heating/cooling degree days
+**Target Variable**: `e_coli_positive` (Binary: Positive/Negative)
+**Class Distribution**: Highly imbalanced dataset (majority class: Negative)
+## Model Performance
+### Winning Algorithm
+**[Algorithm will be determined after training]**
+### Cross-Validation Metrics
+Model comparison results will be saved to `model/model_comparison.json` after training.
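As a rough illustration of how the winner could be read back from the comparison file, the sketch below picks the entry with the highest ROC AUC. The JSON layout, the `roc_auc` key, and the function name `pick_best_model` are assumptions, since the README does not document the file's schema, and the numbers are placeholders rather than results.

```python
import json

# Hypothetical layout: {"<model name>": {"roc_auc": <float>}, ...}.
# The values are placeholders for illustration, not results from this project.
example_comparison = json.loads("""
{
  "random_forest": {"roc_auc": 0.50},
  "xgboost": {"roc_auc": 0.60},
  "lightgbm": {"roc_auc": 0.55}
}
""")


def pick_best_model(comparison: dict) -> str:
    """Return the model name with the highest cross-validated ROC AUC."""
    if not comparison:
        raise ValueError("comparison is empty")
    return max(comparison, key=lambda name: comparison[name]["roc_auc"])


best = pick_best_model(example_comparison)
print(best)
```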
+### Training Metrics
+Performance metrics will be saved to `model/model_metrics.json` after training, including:
+- ROC AUC
+- Accuracy
+- Precision
+- Recall (Sensitivity)
+- F1 Score
+- Confusion Matrix
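For reference, the metrics listed above can be derived from confusion-matrix counts and predicted scores as sketched below. This is a standard-library illustration under assumed inputs, not the project's evaluation code; the function names and the example numbers are hypothetical.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true: list, scores: list) -> float:
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up numbers for illustration only.
metrics = confusion_metrics(tp=8, fp=2, tn=85, fn=5)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(metrics, auc)
```

On a highly imbalanced dataset, accuracy alone can look strong even when recall on the Positive class is poor, which is one reason to report the full set.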
+### Feature Importance
+The top 10 most important features for prediction will be available in `model/model_metrics.json`.
+## Usage
+### Training the Model
+To train the model and compare all algorithms:
+```bash
+python train_model.py
+```
+This will:
+1. Load the data from `preharvest_data_modeling.csv`
+2. Preprocess features (imputation, encoding, scaling)
+3. Train Random Forest, XGBoost, and LightGBM with hyperparameter tuning
+4. Select the best model based on ROC AUC
+5. Save model artifacts to the `model/` directory
+### Starting the API
+To start the FastAPI inference server:
+```bash
+uvicorn app:app --host 0.0.0.0 --port 8000
+```
+Or run directly:
+```bash
+python app.py
+```
+### Making Predictions
+#### Health Check
+```bash
+curl http://localhost:8000/health
+```
+#### Get Model Information
+```bash
+curl http://localhost:8000/model_info
+```
+#### Get Model Comparison
+```bash
+curl http://localhost:8000/model_comparison
+```
+#### Single Prediction
+```bash
+curl -X POST "http://localhost:8000/predict" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "org_conv_kiptraq": "Conventional",
+    "acres_kiptraq": 10.0,
+    "lat": 36.5,
+    "lon": -121.5,
+    "season": "Fall",
+    "temperature_avg_d0": 70.0,
+    "temperature_max_d0": 85.0,
+    "temperature_min_d0": 55.0,
+    ...
+  }'
+```
+Response:
+```json
+{
+  "prediction": "Negative",
+  "probability_positive": 0.15,
+  "probability_negative": 0.85,
+  "risk_level": "Low"
+}
+```
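The same call can be made from Python. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:8000`, as in the curl example; the helper names `build_predict_request` and `predict` are hypothetical, and the payload shown is deliberately partial because the full feature list is elided above.

```python
import json
import urllib.request


def build_predict_request(features: dict, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features: dict, base_url: str = "http://localhost:8000", timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response; needs a running server."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Partial payload for illustration; a real request needs every feature field.
example = {
    "org_conv_kiptraq": "Conventional",
    "acres_kiptraq": 10.0,
    "lat": 36.5,
    "lon": -121.5,
    "season": "Fall",
}
request = build_predict_request(example)
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the payload easy to inspect before the API is running.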
+#### Batch Prediction
+```bash
+curl -X POST "http://localhost:8000/predict_batch" \
+  -H "Content-Type: application/json" \
+  -d '[{...}, {...}, {...}]'
+```
+### Interactive API Documentation
+FastAPI provides automatic interactive documentation:
+- **Swagger UI**: http://localhost:8000/docs
+- **ReDoc**: http://localhost:8000/redoc
+## API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | API information |
+| `/health` | GET | Health check |
+| `/model_info` | GET | Model metadata and performance metrics |
+| `/model_comparison` | GET | Comparison of all trained models |
+| `/predict` | POST | Single prediction |
+| `/predict_batch` | POST | Batch predictions |
+## Model Artifacts
+After training, the following files are saved in the `model/` directory:
+- `best_model.joblib` - Trained model (winning algorithm)
+- `preprocessor.joblib` - Preprocessing pipeline
+- `feature_names.json` - List of feature names
+- `model_metrics.json` - Performance metrics and feature importance
+- `model_comparison.json` - Comparison results for all algorithms
+## Installation
+```bash
+pip install -r requirements.txt
+```
+## Dependencies
+- Python ≥ 3.8
+- pandas ≥ 1.5.0
+- numpy ≥ 1.23.0
+- scikit-learn ≥ 1.3.0
+- imbalanced-learn ≥ 0.11.0
+- xgboost ≥ 2.0.0
+- lightgbm ≥ 4.0.0
+- fastapi ≥ 0.104.0
+- uvicorn ≥ 0.24.0
+- pydantic ≥ 2.0.0
+- joblib ≥ 1.3.0
+## Deployment on Hugging Face
+### Option 1: Hugging Face Spaces (Recommended)
+1. Create a new Space on Hugging Face
+2. Select "Docker" as the Space SDK
+3. Upload all files including `Dockerfile` (see below)
+4. The Space will automatically build and deploy
+### Option 2: Push to Model Hub
+```bash
+# Install huggingface_hub
+pip install huggingface_hub
+# Login
+huggingface-cli login
+# Push model
+python -c "
+from huggingface_hub import HfApi
+api = HfApi()
+api.upload_folder(
+    folder_path='model/',
+    repo_id='your-username/ecoli-risk-model',
+    repo_type='model'
+)
+"
+```
+### Dockerfile for Deployment
+```dockerfile
+FROM python:3.10-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
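+# Docker Spaces route traffic to port 7860 by default; to keep port 8000,
+# set `app_port: 8000` in the README front matter.
+EXPOSE 8000
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]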
+```
+## Limitations and Considerations
+1. **Class Imbalance**: The dataset is highly imbalanced. The model uses two-stage balancing (undersampling + SMOTE) during training.
+2. **Temporal Validity**: The model is trained on historical data and may need retraining with new data to maintain performance.
+3. **Geographic Scope**: Model performance may vary for farms outside the geographic range of the training data.
+4. **Weather Data Dependency**: Predictions require complete weather data for 0, 1, 3, and 7 days before sampling.
+5. **Missing Values**: The model handles missing values through imputation, but predictions may be less reliable with extensive missing data.
+6. **Risk Level Interpretation**:
+   - Low Risk: P(Positive) < 0.3
+   - Medium Risk: 0.3 ≤ P(Positive) < 0.7
+   - High Risk: P(Positive) ≥ 0.7
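The thresholds above translate directly into a small mapping function. This is a sketch of the stated rule only; the function name `risk_level` is hypothetical and the actual implementation in `app.py` may differ.

```python
def risk_level(probability_positive: float) -> str:
    """Map P(Positive) to a risk label using the documented thresholds."""
    if not 0.0 <= probability_positive <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability_positive < 0.3:
        return "Low"
    if probability_positive < 0.7:
        return "Medium"
    return "High"


levels = [risk_level(p) for p in (0.15, 0.3, 0.69, 0.7)]
print(levels)
```

Note that both boundaries are inclusive on the upper band: a probability of exactly 0.3 is Medium and exactly 0.7 is High.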
+## Citation
+If you use this model, please cite the original R-based analysis:
+```
+[Original analysis citation to be added]
+```
+## License
+[License to be specified]
+## Contact
+For questions or issues, please open an issue on the repository.
+## Version History
+- **v1.0.0** (2026-01-28): Initial release with Random Forest, XGBoost, and LightGBM comparison