# Hugging Face Data/Code Split Workflow

Use two Hugging Face repos:

- **Data repo**: Dataset repo, e.g. `YOUR_USERNAME/decode-iblend-data`
- **Code repo**: Model repo by default, e.g. `YOUR_USERNAME/decode-iblend-code`

This keeps the large CSV data separate from the notebook/script code.

## 1. Upload From This Machine

```bash
python3 -m pip install huggingface_hub
export HF_TOKEN=hf_your_token_here

python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code
```

Optional private repos:

```bash
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --private
```

Dry run:

```bash
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --dry-run
```

If you also want the code repo to be a Dataset repo:

```bash
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --code-repo-type dataset
```

## 2. Download Data And Code In A Fresh Notebook

```python
import os
import subprocess
import sys
from pathlib import Path

subprocess.run([sys.executable, "-m", "pip", "install", "-q", "huggingface_hub"], check=True)

from huggingface_hub import snapshot_download

HF_DATA_REPO_ID = "YOUR_USERNAME/decode-iblend-data"
HF_CODE_REPO_ID = "YOUR_USERNAME/decode-iblend-code"
CODE_REPO_TYPE = "model"  # change to "dataset" if you uploaded code as a dataset repo

DATA_DIR = Path("hf_data")
CODE_DIR = Path("hf_code")

snapshot_download(
    repo_id=HF_DATA_REPO_ID,
    repo_type="dataset",
    local_dir=str(DATA_DIR),
    local_dir_use_symlinks=False,
)

snapshot_download(
    repo_id=HF_CODE_REPO_ID,
    repo_type=CODE_REPO_TYPE,
    local_dir=str(CODE_DIR),
    local_dir_use_symlinks=False,
)

os.environ["IBLEND_DATA_ROOT"] = str(DATA_DIR.resolve())
SCRIPT_PATH = str((CODE_DIR / "scripts" / "decode_reimplementation.py").resolve())

print("Data root:", os.environ["IBLEND_DATA_ROOT"])
print("Script:", SCRIPT_PATH)
```

Then run immediately in the same notebook with `subprocess`:

```python
import subprocess
import sys

TRAIN_ENV = os.environ.copy()
TRAIN_ENV["DECODE_DISABLE_TENSORFLOW"] = "1"
TRAIN_ENV.pop("CUDA_VISIBLE_DEVICES", None)

subprocess.run([
    sys.executable,
    SCRIPT_PATH,
    "--mode", "paper_buildings",
    "--target", "Academic",
    "--test-span-days", "7",
    "--skip-lstm",
], check=True, env=TRAIN_ENV)
```

For deep learning, run each model separately:

```python
for model_name in ["lstm", "cnn", "tcn"]:
    subprocess.run([
        sys.executable,
        SCRIPT_PATH,
        "--mode", "paper_buildings",
        "--target", "Academic",
        "--test-span-days", "7",
        "--dl-models", model_name,
        "--epochs", "20",
        "--batch-size", "64",
    ], check=True, env=TRAIN_ENV)
```

## Data Repo Layout

```text
energy_dataset/
IIITD_occupancy_dataset/
iiitd_calender_schedule/
weather_comparison/
```

## Code Repo Layout

```text
DECODE_Reimplementation.ipynb
HUGGINGFACE.md
scripts/decode_reimplementation.py
scripts/upload_to_huggingface.py
scripts/preprocess_and_eda_by_building.py
decode_reimplementation_outputs/README.md
```

`decode_reimplementation.py` reads data from `IBLEND_DATA_ROOT`, so code and data can live in separate folders.
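
Before starting a long run, it can help to confirm that `IBLEND_DATA_ROOT` points at a folder with the expected layout. The following is a minimal sketch, not the lookup logic actually used by `decode_reimplementation.py`: the helper names `resolve_data_root` and `missing_data_folders` are hypothetical, and falling back to the current directory when the variable is unset is an assumption. The folder names are taken from the data repo layout.

```python
import os
from pathlib import Path

# Top-level folders listed in the data repo layout.
EXPECTED_DATA_FOLDERS = (
    "energy_dataset",
    "IIITD_occupancy_dataset",
    "iiitd_calender_schedule",
    "weather_comparison",
)


def resolve_data_root(default="."):
    """Return the data root from IBLEND_DATA_ROOT, or `default` if it is unset.

    Hypothetical helper: the fallback to `default` is an assumption, not
    documented behaviour of decode_reimplementation.py.
    """
    return Path(os.environ.get("IBLEND_DATA_ROOT", default)).resolve()


def missing_data_folders(root):
    """List the expected top-level folders that are absent under `root`."""
    return [name for name in EXPECTED_DATA_FOLDERS if not (Path(root) / name).is_dir()]


if __name__ == "__main__":
    root = resolve_data_root()
    missing = missing_data_folders(root)
    if missing:
        print(f"Missing under {root}: {', '.join(missing)}")
    else:
        print(f"All expected data folders found under {root}")
```

In a notebook, calling `missing_data_folders(resolve_data_root())` right after the download cell should return an empty list if the snapshot matches the layout above.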