| # Hugging Face Data/Code Split Workflow |
|
|
| Use two Hugging Face repos: |
|
|
| - **Data repo**: a Dataset repo, e.g. `YOUR_USERNAME/decode-iblend-data` |
| - **Code repo**: a Model repo by default, e.g. `YOUR_USERNAME/decode-iblend-code` |
|
|
| This keeps the large CSV data separate from the notebook/script code. |
|
|
| ## 1. Upload From This Machine |
|
|
| ```bash |
| python3 -m pip install huggingface_hub |
| export HF_TOKEN=hf_your_token_here |
| |
| python3 scripts/upload_to_huggingface.py \ |
| --data-repo-id YOUR_USERNAME/decode-iblend-data \ |
| --code-repo-id YOUR_USERNAME/decode-iblend-code |
| ``` |
|
|
| Optional: create the repos as private with `--private`: |
|
|
| ```bash |
| python3 scripts/upload_to_huggingface.py \ |
| --data-repo-id YOUR_USERNAME/decode-iblend-data \ |
| --code-repo-id YOUR_USERNAME/decode-iblend-code \ |
| --private |
| ``` |
|
|
| Dry run: |
|
|
| ```bash |
| python3 scripts/upload_to_huggingface.py \ |
| --data-repo-id YOUR_USERNAME/decode-iblend-data \ |
| --code-repo-id YOUR_USERNAME/decode-iblend-code \ |
| --dry-run |
| ``` |
|
|
| To make the code repo a Dataset repo as well, pass `--code-repo-type dataset`: |
|
|
| ```bash |
| python3 scripts/upload_to_huggingface.py \ |
| --data-repo-id YOUR_USERNAME/decode-iblend-data \ |
| --code-repo-id YOUR_USERNAME/decode-iblend-code \ |
| --code-repo-type dataset |
| ``` |
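`scripts/upload_to_huggingface.py` is not reproduced in this document. As a rough sketch of what a two-repo upload could look like with `huggingface_hub`'s `HfApi.create_repo` and `HfApi.upload_folder`, the example below plans one upload per repo and supports a dry run. The `UploadJob` type, the `plan_uploads` and `run_uploads` helpers, and the local folder arguments are assumptions for illustration, not the script's actual internals.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class UploadJob:
    """One folder-to-repo upload (hypothetical helper type)."""
    folder: Path
    repo_id: str
    repo_type: str  # "dataset" or "model"


def plan_uploads(data_dir, code_dir, data_repo_id, code_repo_id, code_repo_type="model"):
    """Data always targets a dataset repo; code targets a model repo unless overridden."""
    if code_repo_type not in {"model", "dataset"}:
        raise ValueError(f"unsupported code repo type: {code_repo_type!r}")
    return [
        UploadJob(Path(data_dir), data_repo_id, "dataset"),
        UploadJob(Path(code_dir), code_repo_id, code_repo_type),
    ]


def run_uploads(jobs, private=False, dry_run=False):
    """Create each repo if needed and upload its folder; returns one log line per job."""
    log = []
    api = None
    for job in jobs:
        log.append(f"{job.folder} -> {job.repo_type}:{job.repo_id}")
        if dry_run:
            continue  # a dry run only reports what would be uploaded
        if api is None:
            # Imported lazily so that a dry run needs neither the package nor a token.
            from huggingface_hub import HfApi
            api = HfApi(token=os.environ.get("HF_TOKEN"))
        api.create_repo(job.repo_id, repo_type=job.repo_type, private=private, exist_ok=True)
        api.upload_folder(folder_path=str(job.folder), repo_id=job.repo_id, repo_type=job.repo_type)
    return log
```

Calling `run_uploads(plan_uploads(...), dry_run=True)` only builds the log lines, which makes the repo-type mapping easy to check before anything is pushed.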
|
|
| ## 2. Download Data And Code In A Fresh Notebook |
|
|
| ```python |
| import os |
| import subprocess |
| import sys |
| from pathlib import Path |
| |
| subprocess.run([sys.executable, "-m", "pip", "install", "-q", "huggingface_hub"], check=True) |
| |
| from huggingface_hub import snapshot_download |
| |
| HF_DATA_REPO_ID = "YOUR_USERNAME/decode-iblend-data" |
| HF_CODE_REPO_ID = "YOUR_USERNAME/decode-iblend-code" |
| CODE_REPO_TYPE = "model" # change to "dataset" if you uploaded code as a dataset repo |
| |
| DATA_DIR = Path("hf_data") |
| CODE_DIR = Path("hf_code") |
| |
| # Download each repo snapshot into a plain local folder. |
| snapshot_download( |
|     repo_id=HF_DATA_REPO_ID, |
|     repo_type="dataset", |
|     local_dir=str(DATA_DIR), |
| ) |
| |
| snapshot_download( |
|     repo_id=HF_CODE_REPO_ID, |
|     repo_type=CODE_REPO_TYPE, |
|     local_dir=str(CODE_DIR), |
| ) |
| |
| os.environ["IBLEND_DATA_ROOT"] = str(DATA_DIR.resolve()) |
| SCRIPT_PATH = str((CODE_DIR / "scripts" / "decode_reimplementation.py").resolve()) |
| |
| print("Data root:", os.environ["IBLEND_DATA_ROOT"]) |
| print("Script:", SCRIPT_PATH) |
| ``` |
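Before launching anything, it can help to confirm that both snapshots contain what the later steps expect. The sketch below checks for the four data folders and the main script listed in the repo layouts of this document; the `missing_paths` helper and the two constants are hypothetical names introduced here, not part of the repo.

```python
from pathlib import Path

# Folder and file names taken from the repo layouts described in this document.
EXPECTED_DATA_DIRS = [
    "energy_dataset",
    "IIITD_occupancy_dataset",
    "iiitd_calender_schedule",
    "weather_comparison",
]
EXPECTED_SCRIPT = Path("scripts") / "decode_reimplementation.py"


def missing_paths(data_dir, code_dir):
    """Return the expected folders/files that are absent after the download."""
    data_dir, code_dir = Path(data_dir), Path(code_dir)
    missing = [
        str(data_dir / name)
        for name in EXPECTED_DATA_DIRS
        if not (data_dir / name).is_dir()
    ]
    script = code_dir / EXPECTED_SCRIPT
    if not script.is_file():
        missing.append(str(script))
    return missing
```

In the notebook, `missing_paths(DATA_DIR, CODE_DIR)` should return an empty list; anything else points to an incomplete upload or a wrong repo ID.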
|
|
| Then run the script right away in the same notebook session with `subprocess`; it reuses `SCRIPT_PATH` and the `IBLEND_DATA_ROOT` variable set above: |
|
|
| ```python |
| import os |
| import subprocess |
| import sys |
| |
| TRAIN_ENV = os.environ.copy() |
| TRAIN_ENV["DECODE_DISABLE_TENSORFLOW"] = "1" |
| TRAIN_ENV.pop("CUDA_VISIBLE_DEVICES", None) |
| |
| subprocess.run([ |
|     sys.executable, SCRIPT_PATH, |
|     "--mode", "paper_buildings", |
|     "--target", "Academic", |
|     "--test-span-days", "7", |
|     "--skip-lstm", |
| ], check=True, env=TRAIN_ENV) |
| ``` |
|
|
| For the deep-learning models, run each one in its own subprocess: |
|
|
| ```python |
| for model_name in ["lstm", "cnn", "tcn"]: |
|     subprocess.run([ |
|         sys.executable, SCRIPT_PATH, |
|         "--mode", "paper_buildings", |
|         "--target", "Academic", |
|         "--test-span-days", "7", |
|         "--dl-models", model_name, |
|         "--epochs", "20", |
|         "--batch-size", "64", |
|     ], check=True, env=TRAIN_ENV) |
| ``` |
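With `check=True`, the loop above stops at the first model whose run fails. If you would rather let the remaining models run and inspect the failures afterwards, one possible variant records each exit code instead. This is a sketch only: `build_command` and `run_models` are hypothetical helpers, and the `runner` parameter exists purely so the loop can be exercised without launching real training.

```python
import subprocess
import sys


def build_command(script_path, model_name, epochs=20, batch_size=64):
    """Assemble the same command line as the loop above for one model."""
    return [
        sys.executable, script_path,
        "--mode", "paper_buildings",
        "--target", "Academic",
        "--test-span-days", "7",
        "--dl-models", model_name,
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
    ]


def run_models(script_path, model_names, env=None, runner=subprocess.run):
    """Run each model in its own subprocess and map model name to exit code."""
    results = {}
    for name in model_names:
        # No check=True here: a failing model is recorded, not raised.
        completed = runner(build_command(script_path, name), env=env)
        results[name] = completed.returncode
    return results
```

In the notebook this would be called as `run_models(SCRIPT_PATH, ["lstm", "cnn", "tcn"], env=TRAIN_ENV)`; any non-zero value in the returned dict marks a model that needs a rerun.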
|
|
| ## Data Repo Layout |
|
|
| ```text |
| energy_dataset/ |
| IIITD_occupancy_dataset/ |
| iiitd_calender_schedule/ |
| weather_comparison/ |
| ``` |
|
|
| ## Code Repo Layout |
|
|
| ```text |
| DECODE_Reimplementation.ipynb |
| HUGGINGFACE.md |
| scripts/decode_reimplementation.py |
| scripts/upload_to_huggingface.py |
| scripts/preprocess_and_eda_by_building.py |
| decode_reimplementation_outputs/README.md |
| ``` |
|
|
| `decode_reimplementation.py` reads data from `IBLEND_DATA_ROOT`, so code and data can live in separate folders. |
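How the script resolves that variable is not shown in this document. A minimal sketch of one way a script could do it, assuming it should fail loudly when the variable is unset or points somewhere invalid, is given below; `resolve_data_root` is a hypothetical helper, not the script's actual function.

```python
import os
from pathlib import Path


def resolve_data_root(env=None):
    """Return the data root named by IBLEND_DATA_ROOT as an absolute path."""
    env = os.environ if env is None else env
    raw = env.get("IBLEND_DATA_ROOT")
    if not raw:
        raise RuntimeError("IBLEND_DATA_ROOT is not set")
    root = Path(raw).expanduser().resolve()
    if not root.is_dir():
        raise FileNotFoundError(f"IBLEND_DATA_ROOT is not a directory: {root}")
    return root
```

Dataset folders can then be addressed relative to the result, for example `resolve_data_root() / "energy_dataset"`, regardless of where the code repo was downloaded.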
|
|