Initial commit: ICLR 2026 Representational Alignment Challenge

AGENTS.md - Engaging (OpenMind/BCS) Cluster Guide (Project-Agnostic)

This repo runs on MIT Engaging (EO) BCS resources. Use this file as the source of truth for job submission, storage, and path hygiene.

Login nodes and OS

  • Rocky 8 login nodes for BCS and MIT Rocky 8 partitions: orcd-login001..orcd-login004 (or OOD Engaging Shell Access).
  • CentOS 7 login nodes only for sched_mit_hill: orcd-vlogin001..orcd-vlogin004 (OOD Engaging Legacy Shell).
  • Do not run training or large jobs on login nodes; submit everything to Slurm.
  • Do not rely on interactive sessions (salloc/srun); use sbatch for all compute, installs, and environment verification.

BCS GPU partitions (Rocky 8)

  • ou_bcs_high: 4h walltime, up to 1 GPU, 32 CPUs, 1 node; reserved for interactive use only.
  • ou_bcs_normal: 12h walltime, up to 8 GPUs, 256 CPUs, 2 nodes; use for all batch jobs.
  • ou_bcs_low: 12h walltime, up to 16 GPUs, 512 CPUs, 4 nodes; preemptible, so jobs must checkpoint and resume.
  • No Slurm account string is required for ou_bcs_* partitions; omit #SBATCH --account.
  • Always use ou_bcs_normal for installs, experiments, and verification jobs.

MIT partitions (Rocky 8, optional overflow)

  • mit_normal, mit_normal_gpu, mit_preemptible (Rocky 8 only).
  • Use sched_mit_hill only from CentOS 7 login nodes.

Storage layout (summary)

  • /home/<user>: small files, source code, 200 GB, snapshot backup.
  • /home/<user>/orcd/pool: medium I/O, 1 TB, no backup.
  • /home/<user>/orcd/scratch: fast scratch, 1 TB, no backup.
  • /orcd/data/<PI>/001: shared lab storage, medium I/O, no backup.
  • /orcd/scratch/bcs/001 and /orcd/scratch/bcs/002: shared fast scratch, no backup.
  • /orcd/compute/bcs/001 and /orcd/datasets/001: public datasets (read-only).

Path hygiene (important)

Avoid hard-coded absolute paths. Centralize via environment variables so scripts are portable across nodes.

Recommended pattern:

  • PROJECT_ROOT=/orcd/data/<PI>/001/<user>/<project>
  • DATA_ROOT=/orcd/data/<PI>/001/<user>/datasets
  • OUTPUT_ROOT=/orcd/scratch/bcs/001/<user>/<project>
  • HF_HOME=/orcd/scratch/bcs/001/<user>/.cache/huggingface
  • TORCH_HOME=/orcd/scratch/bcs/001/<user>/.cache/torch
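A minimal sketch of this pattern as a sourceable env file. "mylab" and "myproject" below are placeholder names, not values from this repo; substitute your PI directory and project, then source the file from each sbatch script.

```shell
# env.sh - centralize paths; source this instead of hard-coding them.
# "mylab" and "myproject" are placeholders; replace with your own values.
USER="${USER:-$(whoami)}"                  # guard for minimal shells
PI_DIR="/orcd/data/mylab/001"              # assumption: your PI's shared dir
export PROJECT_ROOT="${PI_DIR}/${USER}/myproject"
export DATA_ROOT="${PI_DIR}/${USER}/datasets"
export OUTPUT_ROOT="/orcd/scratch/bcs/001/${USER}/myproject"
export HF_HOME="/orcd/scratch/bcs/001/${USER}/.cache/huggingface"
export TORCH_HOME="/orcd/scratch/bcs/001/${USER}/.cache/torch"
```

Sourcing one file keeps every script portable: moving the project means editing two variables, not grepping for absolute paths.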

Slurm template (BCS)

#!/bin/bash
#SBATCH --job-name=experiment
#SBATCH --partition=ou_bcs_normal
#SBATCH --time=12:00:00
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/orcd/data/<PI>/001/<user>/<project>/logs/%x_%j.out
#SBATCH --error=/orcd/data/<PI>/001/<user>/<project>/logs/%x_%j.err

set -euo pipefail
set +u
source "$HOME/miniconda3/etc/profile.d/conda.sh"
set -u
conda activate <env-name>

export HF_HOME=/orcd/scratch/bcs/001/$USER/.cache/huggingface
export TORCH_HOME=/orcd/scratch/bcs/001/$USER/.cache/torch
export OUTPUT_ROOT=/orcd/scratch/bcs/001/$USER/<project>

python -u <script>.py --out_dir "${OUTPUT_ROOT}/runs/run_${SLURM_JOB_ID}"

For sweeps, use arrays and include %A_%a in log paths:

#SBATCH --array=0-65%8
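One common way to use the array index is to map SLURM_ARRAY_TASK_ID onto a hyperparameter grid inside the job script. The grid below (learning rates and seeds) is purely illustrative, not from this repo:

```shell
# Map the array index onto a 3x3 grid of (lr, seed) pairs.
# Outside Slurm, SLURM_ARRAY_TASK_ID is unset, so default to task 0.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
LRS=(1e-4 3e-4 1e-3)
SEEDS=(0 1 2)
LR="${LRS[$(( TASK_ID % ${#LRS[@]} ))]}"      # inner loop over learning rates
SEED="${SEEDS[$(( TASK_ID / ${#LRS[@]} ))]}"  # outer loop over seeds
echo "task ${TASK_ID}: lr=${LR} seed=${SEED}"
```

With --array=0-8%4 this would cover the full 3x3 grid while keeping at most 4 tasks running at once.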

Common Slurm commands

  • sbatch <script.sh>
  • squeue -u $USER
  • sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS,AllocTRES
  • scontrol show job <job_id>
  • scancel <job_id>

Checkpointing and preemption

  • Jobs on ou_bcs_low can be preempted. Always enable checkpoint writes and resume logic.
  • Store checkpoints and final outputs on /orcd/data/<PI>/001 (longer term) or /home/<user>/orcd/pool.
  • Use /orcd/scratch/bcs/001 or /orcd/scratch/bcs/002 for intermediate outputs and large I/O.
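A sketch of resume logic for preemptible jobs, assuming your training script accepts a --resume flag and writes .pt checkpoints (both are assumptions about your code, not conventions of the cluster):

```shell
# Pick up the newest checkpoint, if any, before launching training.
# In a real job, point CKPT_DIR at ${OUTPUT_ROOT}/checkpoints; mktemp is
# only a safe default for trying this snippet outside Slurm.
CKPT_DIR="${CKPT_DIR:-$(mktemp -d)}"
mkdir -p "${CKPT_DIR}"
# Newest checkpoint by mtime; empty string on a fresh run.
LATEST=$(ls -1t "${CKPT_DIR}"/*.pt 2>/dev/null | head -n 1 || true)
if [ -n "${LATEST}" ]; then
  RESUME_ARGS="--resume ${LATEST}"   # assumed flag of your training script
else
  RESUME_ARGS=""
fi
echo "resume args: '${RESUME_ARGS}'"
# python -u train.py ${RESUME_ARGS} ...   # hypothetical invocation
```

Because ou_bcs_low requeues preempted jobs, the same script then works for both the first launch and every restart.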

Launch tips

  • Prefer ou_bcs_normal for installs and experiments; ou_bcs_high is often limited to one concurrent job.
  • Do large conda installs on compute nodes (Slurm job), not on login nodes.
  • Avoid source ~/.bashrc when using set -u (it can error on unbound vars); source conda.sh directly.
  • Use /bin/mkdir -p in batch scripts to avoid shell aliases or unexpected failures.
  • Set CONDA_PKGS_DIRS=/orcd/scratch/bcs/001/$USER/.conda/pkgs to speed installs and avoid home quota.
  • Chain setup → run with sbatch --dependency=afterok:<jobid> <script.sh> to ensure env is ready.
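The dependency chaining in the last tip can be scripted with sbatch --parsable, which prints only the job ID. setup.sh and run.sh are placeholder script names:

```shell
# Submit the setup job, capture its ID, and gate the run job on its success.
# afterok means run.sh starts only if setup.sh exits 0.
setup_id=$(sbatch --parsable setup.sh)
sbatch --dependency=afterok:"${setup_id}" run.sh
```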

Local environment check

  • Login node OS varies; check with hostname (orcd-login* is Rocky 8, orcd-vlogin* is CentOS 7).
  • Default base Python does not include GPU libraries; verify with python -c "import torch" via a short sbatch job after activating your env.
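The verification bullet above can be packaged as a short batch script. <env-name> is left as a placeholder, matching the template earlier in this guide; submit with sbatch verify_env.sh:

```shell
#!/bin/bash
#SBATCH --job-name=verify-env
#SBATCH --partition=ou_bcs_normal
#SBATCH --time=00:05:00
#SBATCH --gpus=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G

set -euo pipefail
set +u
source "$HOME/miniconda3/etc/profile.d/conda.sh"
set -u
conda activate <env-name>   # replace with your env

# Prints the torch version and whether a GPU is visible from the compute node.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

If this prints True, the environment and GPU allocation are both working; any import error here is caught in a five-minute job rather than mid-experiment.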