File Upload Guide: Where Each File Goes

This guide shows exactly which files are uploaded to which location (Hugging Face or GitHub) and when.


Overview: Four Upload Destinations

  1. Hugging Face Dataset Repo (ananttripathiak/engine-maintenance-dataset)
  2. Hugging Face Model Repo (ananttripathiak/engine-maintenance-model)
  3. Hugging Face Space (ananttripathiak/engine-maintenance-space)
  4. GitHub Repository (ananttripathi/engine-predictive-maintenance)

1. Hugging Face Dataset Repo

Repo ID: ananttripathiak/engine-maintenance-dataset
Created by: src/data_register.py and src/data_prep.py

Files Uploaded:

A. Raw Data (via src/data_register.py)

  • File: data/engine_data.csv
  • Uploaded to: data/engine_data.csv in the dataset repo
  • When: Run python src/data_register.py

B. Processed Data (via src/data_prep.py)

  • File: data/processed/train.csv
  • Uploaded to: data/train.csv in the dataset repo
  • When: Run python src/data_prep.py

  • File: data/processed/test.csv
  • Uploaded to: data/test.csv in the dataset repo
  • When: Run python src/data_prep.py

Scripts that upload here:

  • src/data_register.py → uploads raw data
  • src/data_prep.py → uploads train/test splits
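
The local-to-repo path mapping above can be sketched in Python. This is a minimal illustration and not the actual contents of src/data_register.py or src/data_prep.py: the names DATASET_UPLOADS and upload_dataset_files are hypothetical, and the sketch assumes the huggingface_hub package is installed and that you are already authenticated (for example through an HF_TOKEN environment variable).

```python
from pathlib import Path

DATASET_REPO_ID = "ananttripathiak/engine-maintenance-dataset"

# Local path -> path inside the dataset repo, as listed in this guide.
DATASET_UPLOADS = {
    "data/engine_data.csv": "data/engine_data.csv",
    "data/processed/train.csv": "data/train.csv",
    "data/processed/test.csv": "data/test.csv",
}


def upload_dataset_files(api, project_root=".", uploads=DATASET_UPLOADS):
    """Upload every mapped file that exists locally; return the repo paths uploaded.

    `api` is expected to be a huggingface_hub.HfApi instance (hypothetical helper).
    """
    uploaded = []
    for local, remote in uploads.items():
        local_path = Path(project_root) / local
        if not local_path.is_file():
            continue  # e.g. train/test splits not generated yet
        api.upload_file(
            path_or_fileobj=str(local_path),
            path_in_repo=remote,
            repo_id=DATASET_REPO_ID,
            repo_type="dataset",
        )
        uploaded.append(remote)
    return uploaded
```

To use it, create the client with `from huggingface_hub import HfApi; api = HfApi()`, optionally call `api.create_repo(DATASET_REPO_ID, repo_type="dataset", exist_ok=True)`, and then call `upload_dataset_files(api)` from the project root.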

2. Hugging Face Model Repo

Repo ID: ananttripathiak/engine-maintenance-model
Created by: src/train.py

Files Uploaded:

  • File: models/best_model.joblib
  • Uploaded to: model.joblib in the model repo
  • When: Run python src/train.py (after training completes)

Scripts that upload here:

  • src/train.py → uploads the trained model
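
Since the models/ folder is not sent to the Space, a consumer such as the Streamlit app would presumably fetch model.joblib from this repo at startup. The sketch below shows one way to do that with huggingface_hub and joblib. It is an assumption about usage, not the code in src/inference.py; load_model_from_hub and its download/load parameters are hypothetical and exist only so the function can be exercised offline.

```python
MODEL_REPO_ID = "ananttripathiak/engine-maintenance-model"
MODEL_FILENAME = "model.joblib"  # file name inside the model repo, per this guide


def load_model_from_hub(download=None, load=None):
    """Download model.joblib from the model repo and deserialize it.

    `download` defaults to huggingface_hub.hf_hub_download and `load` to
    joblib.load; both are imported lazily so this module imports without them.
    """
    if download is None:
        from huggingface_hub import hf_hub_download as download
    if load is None:
        from joblib import load
    local_path = download(
        repo_id=MODEL_REPO_ID,
        filename=MODEL_FILENAME,
        repo_type="model",
    )
    return load(local_path)
```

Calling `load_model_from_hub()` with no arguments uses the real libraries and caches the downloaded file locally, so repeated calls do not download it again.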

3. Hugging Face Space (Streamlit App)

Repo ID: ananttripathiak/engine-maintenance-space
Created by: src/deploy_to_hf.py

Files Uploaded:

The src/deploy_to_hf.py script uploads the entire project folder except:

  • data/ (ignored - too large)
  • mlruns/ (ignored - MLflow tracking data)
  • models/ (ignored - model is in model repo)
  • .github/ (ignored - GitHub-specific)

Files that ARE uploaded to Space:

  • src/app.py ← Main Streamlit app
  • src/inference.py ← Inference utilities
  • src/config.py ← Configuration
  • Dockerfile ← Container definition
  • requirements.txt ← Python dependencies
  • README.md ← Documentation
  • Other src/*.py files (if needed by app)

Scripts that upload here:

  • src/deploy_to_hf.py → uploads deployment files
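
The exclusion list above maps naturally onto the ignore_patterns argument of HfApi.upload_folder. The following is a hedged sketch rather than the real src/deploy_to_hf.py: the pattern list and the helpers preview_space_upload and deploy_space are assumptions for illustration. The preview helper uses fnmatch-style matching, which to my understanding is also how huggingface_hub applies these patterns.

```python
from fnmatch import fnmatch

SPACE_REPO_ID = "ananttripathiak/engine-maintenance-space"

# Folders this guide lists as excluded from the Space upload.
IGNORE_PATTERNS = ["data/*", "mlruns/*", "models/*", ".github/*"]


def preview_space_upload(relative_paths, ignore_patterns=IGNORE_PATTERNS):
    """Return the paths that survive the ignore patterns (offline preview only)."""
    return [
        path
        for path in relative_paths
        if not any(fnmatch(path, pattern) for pattern in ignore_patterns)
    ]


def deploy_space(api, folder_path="."):
    """Upload the project folder to the Space, skipping the ignored folders.

    `api` is expected to be a huggingface_hub.HfApi instance.
    """
    return api.upload_folder(
        folder_path=folder_path,
        repo_id=SPACE_REPO_ID,
        repo_type="space",
        ignore_patterns=IGNORE_PATTERNS,
    )
```

Running preview_space_upload on a list of project-relative paths before deploying is a cheap way to confirm that nothing under data/ or models/ would be sent to the Space.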

4. GitHub Repository

Repo URL: https://github.com/ananttripathi/engine-predictive-maintenance
Created by: You (manually via git push)

Files Uploaded:

Everything in the mlops/ folder, including:

  • ✅ data/ (including engine_data.csv, processed/train.csv, processed/test.csv)
  • ✅ src/ (all Python scripts)
  • ✅ notebooks/ (EDA notebooks, etc.)
  • ✅ .github/workflows/pipeline.yml ← GitHub Actions workflow
  • ✅ requirements.txt
  • ✅ Dockerfile
  • ✅ README.md
  • ✅ models/ (if you want to track model versions in git)
  • ✅ mlruns/ (MLflow tracking data - optional)
  • ✅ All other project files

How to upload:

cd /Users/ananttripathi/Desktop/mlops
git init
git add .
git commit -m "Initial commit: Predictive maintenance MLOps pipeline"
# Name the branch main so the push below matches (git init may default to master)
git branch -M main
git remote add origin https://github.com/ananttripathi/engine-predictive-maintenance.git
git push -u origin main

Upload Workflow Summary

Step-by-Step Upload Process:

  1. Data Registration → Hugging Face Dataset Repo

    python src/data_register.py

    • Uploads: data/engine_data.csv → HF Dataset Repo
  2. Data Preparation → Hugging Face Dataset Repo

    python src/data_prep.py

    • Uploads: data/processed/train.csv and test.csv → HF Dataset Repo
  3. Model Training → Hugging Face Model Repo

    python src/train.py

    • Uploads: models/best_model.joblib → HF Model Repo
  4. Deploy App → Hugging Face Space

    python src/deploy_to_hf.py

    • Uploads: src/app.py, Dockerfile, requirements.txt, etc. → HF Space
  5. Push to GitHub → GitHub Repository

    git add .
    git commit -m "Complete MLOps pipeline"
    git push origin main

    • Uploads: Everything → GitHub Repo
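
The four script steps above can also be chained from one small Python driver. This is a sketch under the assumption that each script is runnable from the project root and exits non-zero on failure; run_pipeline and PIPELINE_STEPS are hypothetical names, not files in the project.

```python
import subprocess
import sys

# Script order taken from the step-by-step process in this guide.
PIPELINE_STEPS = [
    "src/data_register.py",
    "src/data_prep.py",
    "src/train.py",
    "src/deploy_to_hf.py",
]


def run_pipeline(steps=PIPELINE_STEPS, dry_run=False):
    """Run each script with the current interpreter, stopping at the first failure.

    Returns the list of commands; with dry_run=True nothing is executed.
    """
    commands = [[sys.executable, step] for step in steps]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the script exits non-zero
            subprocess.run(command, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` first shows the exact commands without uploading anything. The git push in step 5 is left out on purpose, since this guide treats it as a manual step.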

What Gets Uploaded Automatically vs Manually

Automatic (via Scripts):

  • ✅ Hugging Face Dataset Repo → src/data_register.py and src/data_prep.py
  • ✅ Hugging Face Model Repo → src/train.py
  • ✅ Hugging Face Space → src/deploy_to_hf.py
  • ✅ GitHub Actions → Runs automatically when you push to GitHub

Manual:

  • ⚠️ GitHub Repository → You need to run git push yourself

File Size Considerations

Large Files (may be ignored):

  • data/engine_data.csv → Uploaded to the HF Dataset Repo; consider adding it to .gitignore for GitHub
  • mlruns/ → MLflow tracking data (can be large); ignored by the HF Space deploy
  • models/best_model.joblib → Uploaded to the HF Model Repo; consider adding it to .gitignore for GitHub

Recommended .gitignore:

# Large data files
data/*.csv
data/processed/*.csv

# MLflow tracking
mlruns/

# Model files (already in HF Model Repo)
models/*.joblib

# Python cache
__pycache__/
*.pyc
.venv/
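
To decide which of the files discussed above actually need a .gitignore entry, it can help to list the large files before the first push. GitHub warns about files larger than 50 MiB and rejects files larger than 100 MiB. The helper below is a hypothetical sketch (find_large_files is not part of the project) that reports files above a configurable threshold.

```python
from pathlib import Path

GITHUB_WARN_BYTES = 50 * 1024 * 1024  # GitHub warns above 50 MiB


def find_large_files(root=".", limit_bytes=GITHUB_WARN_BYTES):
    """Return (relative_path, size_in_bytes) for files above the limit, largest first.

    The .git directory itself is skipped.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit_bytes:
            hits.append((relative.as_posix(), size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Running `find_large_files()` from the project root returns an empty list when nothing exceeds the threshold. Anything it does report is a candidate for the .gitignore patterns above.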

Quick Reference Table

| File/Folder | HF Dataset | HF Model | HF Space | GitHub |
|---|---|---|---|---|
| data/engine_data.csv | ✅ | ❌ | ❌ | ⚠️ Optional |
| data/processed/train.csv | ✅ | ❌ | ❌ | ⚠️ Optional |
| data/processed/test.csv | ✅ | ❌ | ❌ | ⚠️ Optional |
| models/best_model.joblib | ❌ | ✅ | ❌ | ⚠️ Optional |
| src/app.py | ❌ | ❌ | ✅ | ✅ |
| src/train.py | ❌ | ❌ | ❌ | ✅ |
| src/data_prep.py | ❌ | ❌ | ❌ | ✅ |
| Dockerfile | ❌ | ❌ | ✅ | ✅ |
| requirements.txt | ❌ | ❌ | ✅ | ✅ |
| .github/workflows/pipeline.yml | ❌ | ❌ | ❌ | ✅ |
| README.md | ❌ | ❌ | ✅ | ✅ |

Legend:

  • ✅ = Uploaded automatically or should be uploaded
  • ❌ = Not uploaded to this location
  • ⚠️ Optional = Can be uploaded but might want to exclude from GitHub due to size

Need Help?

  • Hugging Face Dataset: Check src/hf_data_utils.py
  • Hugging Face Model: Check src/hf_model_utils.py
  • Hugging Face Space: Check src/deploy_to_hf.py
  • GitHub: Standard git commands