# Hugging Face Data/Code Split Workflow
Use two Hugging Face repos:
- **Data repo**: Dataset repo, e.g. `YOUR_USERNAME/decode-iblend-data`
- **Code repo**: Model repo by default, e.g. `YOUR_USERNAME/decode-iblend-code`
This keeps the large CSV data separate from the notebook/script code.
## 1. Upload From This Machine
```bash
python3 -m pip install huggingface_hub
export HF_TOKEN=hf_your_token_here
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code
```
Optional private repos:
```bash
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --private
```
Dry run:
```bash
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --dry-run
```
If you also want the code repo to be a Dataset repo:
```bash
python3 scripts/upload_to_huggingface.py \
  --data-repo-id YOUR_USERNAME/decode-iblend-data \
  --code-repo-id YOUR_USERNAME/decode-iblend-code \
  --code-repo-type dataset
```
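The contents of `scripts/upload_to_huggingface.py` are not shown in this file, so the following is only a sketch of how a two-repo upload with a dry-run mode could be written with `huggingface_hub`. The function names `plan_uploads` and `run_uploads` and the local folder defaults (`data`, `.`) are assumptions made for illustration; `HfApi.create_repo` and `HfApi.upload_folder` are the library calls it relies on.

```python
from pathlib import Path


def plan_uploads(data_repo_id, code_repo_id, code_repo_type="model",
                 data_dir="data", code_dir="."):
    """Describe the two uploads without touching the network.

    The folder defaults are hypothetical; point them at your own layout.
    """
    return [
        {"repo_id": data_repo_id, "repo_type": "dataset", "folder": Path(data_dir)},
        {"repo_id": code_repo_id, "repo_type": code_repo_type, "folder": Path(code_dir)},
    ]


def run_uploads(plan, private=False, dry_run=False):
    """Create each repo if needed and upload its folder.

    With dry_run=True the plan is only printed and nothing is uploaded.
    """
    if dry_run:
        for job in plan:
            print(f"[dry-run] {job['folder']} -> {job['repo_type']}:{job['repo_id']}")
        return

    # Imported lazily so that a dry run works without huggingface_hub installed
    from huggingface_hub import HfApi

    api = HfApi()  # uses the token from HF_TOKEN or a cached login
    for job in plan:
        api.create_repo(job["repo_id"], repo_type=job["repo_type"],
                        private=private, exist_ok=True)
        api.upload_folder(folder_path=str(job["folder"]),
                          repo_id=job["repo_id"],
                          repo_type=job["repo_type"])
```

Calling `run_uploads(plan_uploads("YOUR_USERNAME/decode-iblend-data", "YOUR_USERNAME/decode-iblend-code"), dry_run=True)` prints the planned uploads and exits, which mirrors the intent of the `--dry-run` flag above.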
## 2. Download Data And Code In A Fresh Notebook
```python
import os
import subprocess
import sys
from pathlib import Path

# Install the Hub client into the notebook's own interpreter before importing it
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "huggingface_hub"], check=True)

from huggingface_hub import snapshot_download

HF_DATA_REPO_ID = "YOUR_USERNAME/decode-iblend-data"
HF_CODE_REPO_ID = "YOUR_USERNAME/decode-iblend-code"
CODE_REPO_TYPE = "model"  # change to "dataset" if you uploaded code as a dataset repo

DATA_DIR = Path("hf_data")
CODE_DIR = Path("hf_code")

# Recent huggingface_hub releases write real files into local_dir by default,
# so the deprecated local_dir_use_symlinks argument is omitted.
snapshot_download(
    repo_id=HF_DATA_REPO_ID,
    repo_type="dataset",
    local_dir=str(DATA_DIR),
)
snapshot_download(
    repo_id=HF_CODE_REPO_ID,
    repo_type=CODE_REPO_TYPE,
    local_dir=str(CODE_DIR),
)

# Tell the script where the downloaded data lives
os.environ["IBLEND_DATA_ROOT"] = str(DATA_DIR.resolve())
SCRIPT_PATH = str((CODE_DIR / "scripts" / "decode_reimplementation.py").resolve())

print("Data root:", os.environ["IBLEND_DATA_ROOT"])
print("Script:", SCRIPT_PATH)
```
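If the repos were uploaded with `--private`, the download calls above also need a token. A minimal sketch, assuming the token is available in the notebook as the `HF_TOKEN` environment variable (as in section 1): `snapshot_download` accepts it through its `token` parameter. The helper name `download_kwargs` is hypothetical.

```python
import os


def download_kwargs(repo_id, repo_type, local_dir, env=None):
    """Build keyword arguments for snapshot_download.

    A token is added only when HF_TOKEN is set, so public repos
    keep working without any credentials.
    """
    env = os.environ if env is None else env
    kwargs = {
        "repo_id": repo_id,
        "repo_type": repo_type,
        "local_dir": str(local_dir),
    }
    token = env.get("HF_TOKEN")
    if token:
        kwargs["token"] = token
    return kwargs
```

It would be used as `snapshot_download(**download_kwargs(HF_DATA_REPO_ID, "dataset", DATA_DIR))`. Avoid hard-coding the token in a notebook cell, since notebooks are often shared together with their outputs.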
Then run the script right away from the same notebook with `subprocess`:
```python
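import os
import subprocess
import sys

# Copy the notebook environment (including IBLEND_DATA_ROOT) for the child process
TRAIN_ENV = os.environ.copy()
TRAIN_ENV["DECODE_DISABLE_TENSORFLOW"] = "1"
# Drop any CUDA_VISIBLE_DEVICES override inherited from the notebook
TRAIN_ENV.pop("CUDA_VISIBLE_DEVICES", None)

subprocess.run([
    sys.executable, SCRIPT_PATH,
    "--mode", "paper_buildings",
    "--target", "Academic",
    "--test-span-days", "7",
    "--skip-lstm",
], check=True, env=TRAIN_ENV)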
```
For deep learning, run each model separately:
```python
# One subprocess per model; check=True stops the loop at the first failure
for model_name in ["lstm", "cnn", "tcn"]:
    subprocess.run([
        sys.executable, SCRIPT_PATH,
        "--mode", "paper_buildings",
        "--target", "Academic",
        "--test-span-days", "7",
        "--dl-models", model_name,
        "--epochs", "20",
        "--batch-size", "64",
    ], check=True, env=TRAIN_ENV)
```
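The two invocations above share most of their arguments, so the command list can be built by one helper. The sketch below is an illustration rather than part of the repo: `build_command` and `run_models` are hypothetical names, while every flag is taken from the commands shown above. Unlike the loop above, it uses `check=False` and records each exit code, so one failing model does not stop the remaining runs.

```python
import subprocess
import sys


def build_command(script_path, dl_model=None, target="Academic",
                  test_span_days=7, epochs=20, batch_size=64):
    """Return the argument list for one run of the script.

    With dl_model=None the command matches the first run (--skip-lstm);
    otherwise it trains the single named deep learning model.
    """
    cmd = [
        sys.executable, str(script_path),
        "--mode", "paper_buildings",
        "--target", target,
        "--test-span-days", str(test_span_days),
    ]
    if dl_model is None:
        cmd.append("--skip-lstm")
    else:
        cmd += [
            "--dl-models", dl_model,
            "--epochs", str(epochs),
            "--batch-size", str(batch_size),
        ]
    return cmd


def run_models(script_path, model_names, env=None):
    """Run each model in its own process and return {model_name: exit_code}."""
    results = {}
    for name in model_names:
        # check=False: keep going even if this model's run fails
        proc = subprocess.run(build_command(script_path, name), env=env, check=False)
        results[name] = proc.returncode
    return results
```

For example, `run_models(SCRIPT_PATH, ["lstm", "cnn", "tcn"], env=TRAIN_ENV)` returns a dictionary in which any non-zero value marks a model that needs to be rerun.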
## Data Repo Layout
```text
energy_dataset/
IIITD_occupancy_dataset/
iiitd_calender_schedule/
weather_comparison/
```
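After downloading, it can help to confirm that the data folder matches this layout before starting a long run. A small sketch, assuming the four folders listed above sit directly under the download directory; the names `EXPECTED_DATA_DIRS` and `missing_data_dirs` are hypothetical.

```python
from pathlib import Path

# Top-level folders from the data repo layout above
EXPECTED_DATA_DIRS = (
    "energy_dataset",
    "IIITD_occupancy_dataset",
    "iiitd_calender_schedule",
    "weather_comparison",
)


def missing_data_dirs(data_root):
    """Return the expected folder names that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DATA_DIRS if not (root / name).is_dir()]
```

In the notebook from section 2, `missing_data_dirs(DATA_DIR)` should return an empty list once the dataset snapshot has finished downloading.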
## Code Repo Layout
```text
DECODE_Reimplementation.ipynb
HUGGINGFACE.md
scripts/decode_reimplementation.py
scripts/upload_to_huggingface.py
scripts/preprocess_and_eda_by_building.py
decode_reimplementation_outputs/README.md
```
`decode_reimplementation.py` reads data from `IBLEND_DATA_ROOT`, so code and data can live in separate folders.
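
How the script resolves that variable is not shown here. As an illustration only, a script can read an environment-configured data root along these lines; the function name `resolve_data_root` and the fallback to the current directory are assumptions, not the script's actual behavior.

```python
import os
from pathlib import Path


def resolve_data_root(env=None, default="."):
    """Return the data root as an absolute path.

    Reads IBLEND_DATA_ROOT from env (os.environ when env is None) and
    falls back to `default` when the variable is unset (an assumed fallback).
    """
    env = os.environ if env is None else env
    root = Path(env.get("IBLEND_DATA_ROOT", default)).expanduser().resolve()
    if not root.is_dir():
        raise FileNotFoundError(f"IBLEND_DATA_ROOT does not point to a directory: {root}")
    return root
```

Failing early with a clear error is a design choice: a wrong path is reported once at startup instead of surfacing later as a missing CSV deep inside the run.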