# File Upload Guide: Where Each File Goes

This guide shows exactly which files are uploaded to which location (Hugging Face or GitHub) and when.

---

## Overview: Four Upload Destinations

1. **Hugging Face Dataset Repo** (`ananttripathiak/engine-maintenance-dataset`)
2. **Hugging Face Model Repo** (`ananttripathiak/engine-maintenance-model`)
3. **Hugging Face Space** (`ananttripathiak/engine-maintenance-space`)
4. **GitHub Repository** (`ananttripathi/engine-predictive-maintenance`)
---

## 1. Hugging Face Dataset Repo

**Repo ID**: `ananttripathiak/engine-maintenance-dataset`
**Created by**: `src/data_register.py` and `src/data_prep.py`

### Files Uploaded:

#### A. Raw Data (via `src/data_register.py`)

- **File**: `data/engine_data.csv`
- **Uploaded to**: `data/engine_data.csv` in the dataset repo
- **When**: Run `python src/data_register.py`

#### B. Processed Data (via `src/data_prep.py`)

- **File**: `data/processed/train.csv`
- **Uploaded to**: `data/train.csv` in the dataset repo
- **When**: Run `python src/data_prep.py`
- **File**: `data/processed/test.csv`
- **Uploaded to**: `data/test.csv` in the dataset repo
- **When**: Run `python src/data_prep.py`

**Scripts that upload here:**
- `src/data_register.py` → uploads raw data
- `src/data_prep.py` → uploads train/test splits
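Under the hood, both scripts follow the standard `huggingface_hub` upload pattern (the shared helpers live in `src/hf_data_utils.py`). The sketch below is illustrative only, not a copy of the real code: it assumes the repo ID is hard-coded here rather than read from `src/config.py`, and that an `HF_TOKEN` environment variable holds a write token.

```python
# Illustrative sketch of the dataset-repo uploads (assumed, not the real scripts).
import os
from huggingface_hub import HfApi

DATASET_REPO = "ananttripathiak/engine-maintenance-dataset"
api = HfApi(token=os.environ["HF_TOKEN"])  # assumes a write token in HF_TOKEN

# Create the dataset repo on first run; a no-op if it already exists.
api.create_repo(repo_id=DATASET_REPO, repo_type="dataset", exist_ok=True)

# Raw data (what data_register.py pushes).
api.upload_file(
    path_or_fileobj="data/engine_data.csv",
    path_in_repo="data/engine_data.csv",
    repo_id=DATASET_REPO,
    repo_type="dataset",
)

# Train/test splits (what data_prep.py pushes after splitting).
for local_path, repo_path in [
    ("data/processed/train.csv", "data/train.csv"),
    ("data/processed/test.csv", "data/test.csv"),
]:
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=repo_path,
        repo_id=DATASET_REPO,
        repo_type="dataset",
    )
```

Because `exist_ok=True` is set and each upload is simply a new commit on the repo, re-running either script is safe; the dataset repo keeps the full upload history.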
---

## 2. Hugging Face Model Repo

**Repo ID**: `ananttripathiak/engine-maintenance-model`
**Created by**: `src/train.py`

### Files Uploaded:

- **File**: `models/best_model.joblib`
- **Uploaded to**: `model.joblib` in the model repo
- **When**: Run `python src/train.py` (after training completes)

**Scripts that upload here:**
- `src/train.py` → uploads the trained model
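The model upload at the end of training likely boils down to a single `huggingface_hub` call along the lines of the sketch below (the actual logic sits in `src/train.py` / `src/hf_model_utils.py`; the token handling here is an assumption).

```python
# Illustrative sketch of the model upload (assumed, not the real train.py code).
import os
from huggingface_hub import HfApi

MODEL_REPO = "ananttripathiak/engine-maintenance-model"
api = HfApi(token=os.environ["HF_TOKEN"])  # assumes a write token in HF_TOKEN

api.create_repo(repo_id=MODEL_REPO, repo_type="model", exist_ok=True)

# Note the rename: the local best_model.joblib becomes model.joblib in the repo.
api.upload_file(
    path_or_fileobj="models/best_model.joblib",
    path_in_repo="model.joblib",
    repo_id=MODEL_REPO,
    repo_type="model",
)
```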
---

## 3. Hugging Face Space (Streamlit App)

**Repo ID**: `ananttripathiak/engine-maintenance-space`
**Created by**: `src/deploy_to_hf.py`

### Files Uploaded:

The `src/deploy_to_hf.py` script uploads the entire project folder **except**:

- `data/` (ignored - too large)
- `mlruns/` (ignored - MLflow tracking data)
- `models/` (ignored - model is in model repo)
- `.github/` (ignored - GitHub-specific)

**Files that ARE uploaded to Space:**

- `src/app.py` → **Main Streamlit app**
- `src/inference.py` → Inference utilities
- `src/config.py` → Configuration
- `Dockerfile` → Container definition
- `requirements.txt` → Python dependencies
- `README.md` → Documentation
- Other `src/*.py` files (if needed by app)

**Scripts that upload here:**
- `src/deploy_to_hf.py` → uploads deployment files
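A whole-folder deploy with exclusions maps naturally onto `upload_folder` and its `ignore_patterns` argument. The sketch below shows that shape; it is an assumption about how `src/deploy_to_hf.py` works (including the Docker SDK choice for the Space), not a copy of it.

```python
# Illustrative sketch of the Space deploy (assumed, not the real deploy_to_hf.py).
import os
from huggingface_hub import HfApi

SPACE_REPO = "ananttripathiak/engine-maintenance-space"
api = HfApi(token=os.environ["HF_TOKEN"])  # assumes a write token in HF_TOKEN

# Docker SDK assumed because the project ships a Dockerfile.
api.create_repo(repo_id=SPACE_REPO, repo_type="space", space_sdk="docker", exist_ok=True)

# Push the project folder, skipping the directories listed above.
api.upload_folder(
    folder_path=".",
    repo_id=SPACE_REPO,
    repo_type="space",
    ignore_patterns=["data/*", "mlruns/*", "models/*", ".github/*", ".git/*"],
)
```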
---

## 4. GitHub Repository

**Repo URL**: `https://github.com/ananttripathi/engine-predictive-maintenance`
**Created by**: You (manually via `git push`)

### Files Uploaded:

**Everything in the `mlops/` folder**, including:

- ✅ `data/` (including `engine_data.csv`, `processed/train.csv`, `processed/test.csv`)
- ✅ `src/` (all Python scripts)
- ✅ `notebooks/` (EDA notebooks, etc.)
- ✅ `.github/workflows/pipeline.yml` → **GitHub Actions workflow**
- ✅ `requirements.txt`
- ✅ `Dockerfile`
- ✅ `README.md`
- ✅ `models/` (if you want to track model versions in git)
- ✅ `mlruns/` (MLflow tracking data - optional)
- ✅ All other project files

**How to upload:**

```bash
cd /Users/ananttripathi/Desktop/mlops
git init
git add .
git commit -m "Initial commit: Predictive maintenance MLOps pipeline"
git remote add origin https://github.com/ananttripathi/engine-predictive-maintenance.git
git push -u origin main
```
---

## Upload Workflow Summary

### Step-by-Step Upload Process:

1. **Data Registration** → Hugging Face Dataset Repo
   ```bash
   python src/data_register.py
   ```
   - Uploads: `data/engine_data.csv` → HF Dataset Repo

2. **Data Preparation** → Hugging Face Dataset Repo
   ```bash
   python src/data_prep.py
   ```
   - Uploads: `data/processed/train.csv` and `test.csv` → HF Dataset Repo

3. **Model Training** → Hugging Face Model Repo
   ```bash
   python src/train.py
   ```
   - Uploads: `models/best_model.joblib` → HF Model Repo

4. **Deploy App** → Hugging Face Space
   ```bash
   python src/deploy_to_hf.py
   ```
   - Uploads: `src/app.py`, `Dockerfile`, `requirements.txt`, etc. → HF Space

5. **Push to GitHub** → GitHub Repository
   ```bash
   git add .
   git commit -m "Complete MLOps pipeline"
   git push origin main
   ```
   - Uploads: Everything → GitHub Repo
---

## What Gets Uploaded Automatically vs Manually

### Automatic (via Scripts):

- ✅ Hugging Face Dataset Repo → `src/data_register.py` and `src/data_prep.py`
- ✅ Hugging Face Model Repo → `src/train.py`
- ✅ Hugging Face Space → `src/deploy_to_hf.py`
- ✅ GitHub Actions → Runs automatically when you push to GitHub

### Manual:

- ⚠️ **GitHub Repository** → You need to run `git push` yourself

---

## File Size Considerations

### Large Files (may be ignored):

- `data/engine_data.csv` → Uploaded to HF Dataset, but you might want to add to `.gitignore` for GitHub
- `mlruns/` → MLflow tracking data (can be large) - ignored by HF Space deploy
- `models/best_model.joblib` → Uploaded to HF Model Repo, but you might want to add to `.gitignore` for GitHub
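The reason it is safe to keep `models/best_model.joblib` out of git is that any consumer can pull it back from the Model Repo on demand, roughly as in this sketch (the Space app presumably does something similar in `src/inference.py`):

```python
# Illustrative sketch: re-fetching the model instead of tracking it in git.
import joblib
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="ananttripathiak/engine-maintenance-model",
    filename="model.joblib",
)
model = joblib.load(model_path)  # ready for predictions
```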
### Recommended `.gitignore`:

```
# Large data files
data/*.csv
data/processed/*.csv

# MLflow tracking
mlruns/

# Model files (already in HF Model Repo)
models/*.joblib

# Python cache
__pycache__/
*.pyc
.venv/
```
---

## Quick Reference Table

| File/Folder | HF Dataset | HF Model | HF Space | GitHub |
|------------|------------|----------|----------|--------|
| `data/engine_data.csv` | ✅ | ❌ | ❌ | ⚠️ Optional |
| `data/processed/train.csv` | ✅ | ❌ | ❌ | ⚠️ Optional |
| `data/processed/test.csv` | ✅ | ❌ | ❌ | ⚠️ Optional |
| `models/best_model.joblib` | ❌ | ✅ | ❌ | ⚠️ Optional |
| `src/app.py` | ❌ | ❌ | ✅ | ✅ |
| `src/train.py` | ❌ | ❌ | ✅ | ✅ |
| `src/data_prep.py` | ❌ | ❌ | ✅ | ✅ |
| `Dockerfile` | ❌ | ❌ | ✅ | ✅ |
| `requirements.txt` | ❌ | ❌ | ✅ | ✅ |
| `.github/workflows/pipeline.yml` | ❌ | ❌ | ❌ | ✅ |
| `README.md` | ❌ | ❌ | ✅ | ✅ |

**Legend:**
- ✅ = Uploaded automatically or should be uploaded
- ❌ = Not uploaded to this location
- ⚠️ Optional = Can be uploaded but might want to exclude from GitHub due to size
---

## Need Help?

- **Hugging Face Dataset**: Check `src/hf_data_utils.py`
- **Hugging Face Model**: Check `src/hf_model_utils.py`
- **Hugging Face Space**: Check `src/deploy_to_hf.py`
- **GitHub**: Standard git commands