Hugging Face Data/Code Split Workflow
Use two Hugging Face repos:
- Data repo: Dataset repo, e.g. YOUR_USERNAME/decode-iblend-data
- Code repo: Model repo by default, e.g. YOUR_USERNAME/decode-iblend-code
This keeps the large CSV data separate from the notebook/script code.
1. Upload From This Machine
python3 -m pip install huggingface_hub
export HF_TOKEN=hf_your_token_here
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code
Optional private repos:
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --private
Dry run:
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --dry-run
If you also want the code repo to be a Dataset repo:
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --code-repo-type dataset
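The split upload described above could be sketched with the `huggingface_hub` Python API as follows. This is not the contents of scripts/upload_to_huggingface.py, which is not shown here; `upload_split` and its parameters are hypothetical names chosen to mirror the command-line flags, and the folder paths are placeholders.

```python
from pathlib import Path


def upload_split(data_dir, code_dir, data_repo_id, code_repo_id,
                 code_repo_type="model", private=False, dry_run=False):
    """Upload a data folder and a code folder to two separate Hub repos.

    Hypothetical helper for illustration; not the real upload script.
    """
    plan = [
        {"folder": str(Path(data_dir)), "repo_id": data_repo_id,
         "repo_type": "dataset"},
        {"folder": str(Path(code_dir)), "repo_id": code_repo_id,
         "repo_type": code_repo_type},
    ]
    if dry_run:
        # Report what would be uploaded without contacting the Hub.
        return plan

    # Imported lazily so a dry run works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()  # uses the token from HF_TOKEN or a cached login
    for item in plan:
        # exist_ok=True makes repeated uploads reuse an existing repo.
        api.create_repo(repo_id=item["repo_id"], repo_type=item["repo_type"],
                        private=private, exist_ok=True)
        api.upload_folder(folder_path=item["folder"], repo_id=item["repo_id"],
                          repo_type=item["repo_type"])
    return plan
```

Calling it with `dry_run=True` returns the two planned uploads without touching the network, which is a cheap way to confirm the repo IDs and repo types before a real run.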
2. Download Data And Code In A Fresh Notebook
import os
import subprocess
import sys
from pathlib import Path
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "huggingface_hub"], check=True)
from huggingface_hub import snapshot_download
HF_DATA_REPO_ID = "YOUR_USERNAME/decode-iblend-data"
HF_CODE_REPO_ID = "YOUR_USERNAME/decode-iblend-code"
CODE_REPO_TYPE = "model" # change to "dataset" if you uploaded code as a dataset repo
DATA_DIR = Path("hf_data")
CODE_DIR = Path("hf_code")
snapshot_download(
    repo_id=HF_DATA_REPO_ID,
    repo_type="dataset",
    local_dir=str(DATA_DIR),
    local_dir_use_symlinks=False,
)
snapshot_download(
    repo_id=HF_CODE_REPO_ID,
    repo_type=CODE_REPO_TYPE,
    local_dir=str(CODE_DIR),
    local_dir_use_symlinks=False,
)
os.environ["IBLEND_DATA_ROOT"] = str(DATA_DIR.resolve())
SCRIPT_PATH = str((CODE_DIR / "scripts" / "decode_reimplementation.py").resolve())
print("Data root:", os.environ["IBLEND_DATA_ROOT"])
print("Script:", SCRIPT_PATH)
Then run the script in the same notebook with subprocess:
import subprocess
import sys
TRAIN_ENV = os.environ.copy()
TRAIN_ENV["DECODE_DISABLE_TENSORFLOW"] = "1"
TRAIN_ENV.pop("CUDA_VISIBLE_DEVICES", None)
subprocess.run([
    sys.executable, SCRIPT_PATH,
    "--mode", "paper_buildings",
    "--target", "Academic",
    "--test-span-days", "7",
    "--skip-lstm",
], check=True, env=TRAIN_ENV)
For deep learning, run each model separately:
for model_name in ["lstm", "cnn", "tcn"]:
    subprocess.run([
        sys.executable, SCRIPT_PATH,
        "--mode", "paper_buildings",
        "--target", "Academic",
        "--test-span-days", "7",
        "--dl-models", model_name,
        "--epochs", "20",
        "--batch-size", "64",
    ], check=True, env=TRAIN_ENV)
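The loop above uses check=True, so the first failing model stops the remaining runs. If you would rather let every model run and inspect the failures afterwards, one possible variation is sketched below. `run_each` is a hypothetical helper, and the demo commands are stand-ins so the sketch runs on its own.

```python
import subprocess
import sys


def run_each(commands, env=None):
    """Run each named command in turn and return {name: return code}."""
    results = {}
    for name, argv in commands.items():
        # check=False so one failing command does not abort the others.
        completed = subprocess.run(argv, env=env, check=False)
        results[name] = completed.returncode
    return results


# Stand-in commands; in the notebook each entry would be the
# decode_reimplementation.py call with "--dl-models", <model name>.
demo = {
    "ok": [sys.executable, "-c", "raise SystemExit(0)"],
    "fails": [sys.executable, "-c", "raise SystemExit(3)"],
}
print(run_each(demo))
```

A non-zero return code marks a run that needs attention; passing env=TRAIN_ENV keeps the same environment as the loop above.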
Data Repo Layout
energy_dataset/
IIITD_occupancy_dataset/
iiitd_calender_schedule/
weather_comparison/
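A quick way to confirm that a download matches this layout is to check for the four top-level folders. The sketch below uses only the folder names listed above; `missing_data_dirs` is a hypothetical helper, not part of the repo's scripts.

```python
from pathlib import Path

# Folder names copied from the data repo layout above.
EXPECTED_DATA_DIRS = [
    "energy_dataset",
    "IIITD_occupancy_dataset",
    "iiitd_calender_schedule",
    "weather_comparison",
]


def missing_data_dirs(data_root):
    """Return the expected top-level folders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DATA_DIRS if not (root / name).is_dir()]
```

For example, `missing_data_dirs("hf_data")` returns an empty list when the snapshot is complete, and the names of any absent folders otherwise.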
Code Repo Layout
DECODE_Reimplementation.ipynb
HUGGINGFACE.md
scripts/decode_reimplementation.py
scripts/upload_to_huggingface.py
scripts/preprocess_and_eda_by_building.py
decode_reimplementation_outputs/README.md
decode_reimplementation.py reads the data location from the IBLEND_DATA_ROOT environment variable, so code and data can live in separate folders.
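How the script resolves this location internally is not shown here. As a minimal sketch of the general pattern, a script might read the variable like this; `resolve_data_root` and its fallback to the current directory are assumptions for illustration.

```python
import os
from pathlib import Path


def resolve_data_root(default="."):
    """Return the data root from IBLEND_DATA_ROOT, or `default` if unset."""
    raw = os.environ.get("IBLEND_DATA_ROOT", default)
    root = Path(raw).expanduser().resolve()
    if not root.is_dir():
        # Fail early with a clear message instead of a later missing-file error.
        raise FileNotFoundError(f"Data root does not exist: {root}")
    return root
```

Data paths are then built relative to the returned root, for example `resolve_data_root() / "energy_dataset"`, so the code folder never needs to contain the CSV files.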