Initial commit: ICLR 2026 Representational Alignment Challenge

AGENTS.md - Engaging (OpenMind/BCS) Cluster Guide (Project-Agnostic)

This repo runs on MIT Engaging (EO) BCS resources. Use this file as the source of truth for job submission, storage, and path hygiene.

Login nodes and OS

  • Rocky 8 login nodes for BCS and MIT Rocky 8 partitions: orcd-login001..orcd-login004 (or OOD Engaging Shell Access).
  • CentOS 7 login nodes only for sched_mit_hill: orcd-vlogin001..orcd-vlogin004 (OOD Engaging Legacy Shell).
  • Do not run training or large jobs on login nodes; submit everything to Slurm.
  • Do not rely on interactive sessions (salloc/srun); use sbatch for all compute, installs, and environment verification.

BCS GPU partitions (Rocky 8)

  • ou_bcs_high: 4h walltime, up to 1 GPU, 32 CPUs, 1 node; reserved for interactive use only.
  • ou_bcs_normal: 12h walltime, up to 8 GPUs, 256 CPUs, 2 nodes; use for all batch jobs.
  • ou_bcs_low: 12h walltime, up to 16 GPUs, 512 CPUs, 4 nodes; preemptible, so jobs must checkpoint and resume.
  • No Slurm account string is required for ou_bcs_* partitions; omit #SBATCH --account.
  • Always use ou_bcs_normal for installs, experiments, and verification jobs.

MIT partitions (Rocky 8, optional overflow)

  • mit_normal, mit_normal_gpu, mit_preemptible (Rocky 8 only).
  • Use sched_mit_hill only from CentOS 7 login nodes.

Storage layout (summary)

  • /home/<user>: small files, source code, 200 GB, snapshot backup.
  • /home/<user>/orcd/pool: medium I/O, 1 TB, no backup.
  • /home/<user>/orcd/scratch: fast scratch, 1 TB, no backup.
  • /orcd/data/<PI>/001: shared lab storage, medium I/O, no backup.
  • /orcd/scratch/bcs/001 and /orcd/scratch/bcs/002: shared fast scratch, no backup.
  • /orcd/compute/bcs/001 and /orcd/datasets/001: public datasets (read-only).

Path hygiene (important)

Avoid hard-coded absolute paths. Centralize via environment variables so scripts are portable across nodes.

Recommended pattern:

  • PROJECT_ROOT=/orcd/data/<PI>/001/<user>/<project>
  • DATA_ROOT=/orcd/data/<PI>/001/<user>/datasets
  • OUTPUT_ROOT=/orcd/scratch/bcs/001/<user>/<project>
  • HF_HOME=/orcd/scratch/bcs/001/<user>/.cache/huggingface
  • TORCH_HOME=/orcd/scratch/bcs/001/<user>/.cache/torch
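A minimal sketch of this pattern as a sourceable env file. "mylab" and "myproject" below are placeholder names, not values from this repo; substitute your PI directory and project, then source the file from each sbatch script.

```shell
# env.sh - centralize paths; source this instead of hard-coding them.
# "mylab" and "myproject" are placeholders; replace with your own values.
USER="${USER:-$(whoami)}"                  # guard for minimal shells
PI_DIR="/orcd/data/mylab/001"              # assumption: your PI's shared dir
export PROJECT_ROOT="${PI_DIR}/${USER}/myproject"
export DATA_ROOT="${PI_DIR}/${USER}/datasets"
export OUTPUT_ROOT="/orcd/scratch/bcs/001/${USER}/myproject"
export HF_HOME="/orcd/scratch/bcs/001/${USER}/.cache/huggingface"
export TORCH_HOME="/orcd/scratch/bcs/001/${USER}/.cache/torch"
```

Sourcing one file keeps every script portable: moving the project means editing two variables, not grepping for absolute paths.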

Slurm template (BCS)

#!/bin/bash
#SBATCH --job-name=experiment
#SBATCH --partition=ou_bcs_normal
#SBATCH --time=12:00:00
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/orcd/data/<PI>/001/<user>/<project>/logs/%x_%j.out
#SBATCH --error=/orcd/data/<PI>/001/<user>/<project>/logs/%x_%j.err

set -euo pipefail
set +u
source "$HOME/miniconda3/etc/profile.d/conda.sh"
set -u
conda activate <env-name>

export HF_HOME=/orcd/scratch/bcs/001/$USER/.cache/huggingface
export TORCH_HOME=/orcd/scratch/bcs/001/$USER/.cache/torch
export OUTPUT_ROOT=/orcd/scratch/bcs/001/$USER/<project>

python -u <script>.py --out_dir "${OUTPUT_ROOT}/runs/run_${SLURM_JOB_ID}"

For sweeps, use arrays and include %A_%a in log paths:

#SBATCH --array=0-65%8
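One common way to use the array index is to map SLURM_ARRAY_TASK_ID onto a hyperparameter grid inside the job script. The grid below (learning rates and seeds) is purely illustrative, not from this repo:

```shell
# Map the array index onto a 3x3 grid of (lr, seed) pairs.
# Outside Slurm, SLURM_ARRAY_TASK_ID is unset, so default to task 0.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
LRS=(1e-4 3e-4 1e-3)
SEEDS=(0 1 2)
LR="${LRS[$(( TASK_ID % ${#LRS[@]} ))]}"      # inner loop over learning rates
SEED="${SEEDS[$(( TASK_ID / ${#LRS[@]} ))]}"  # outer loop over seeds
echo "task ${TASK_ID}: lr=${LR} seed=${SEED}"
```

With --array=0-8%4 this would cover the full 3x3 grid while keeping at most 4 tasks running at once.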

Common Slurm commands

  • sbatch <script.sh>
  • squeue -u $USER
  • sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS,AllocTRES
  • scontrol show job <job_id>
  • scancel <job_id>

Checkpointing and preemption

  • Jobs on ou_bcs_low can be preempted. Always enable checkpoint writes and resume logic.
  • Store checkpoints and final outputs on /orcd/data/<PI>/001 (longer term) or /home/<user>/orcd/pool.
  • Use /orcd/scratch/bcs/001 or /orcd/scratch/bcs/002 for intermediate outputs and large I/O.
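A sketch of resume logic for preemptible jobs, assuming your training script accepts a --resume flag and writes .pt checkpoints (both are assumptions about your code, not conventions of the cluster):

```shell
# Pick up the newest checkpoint, if any, before launching training.
# In a real job, point CKPT_DIR at ${OUTPUT_ROOT}/checkpoints; mktemp is
# only a safe default for trying this snippet outside Slurm.
CKPT_DIR="${CKPT_DIR:-$(mktemp -d)}"
mkdir -p "${CKPT_DIR}"
# Newest checkpoint by mtime; empty string on a fresh run.
LATEST=$(ls -1t "${CKPT_DIR}"/*.pt 2>/dev/null | head -n 1 || true)
if [ -n "${LATEST}" ]; then
  RESUME_ARGS="--resume ${LATEST}"   # assumed flag of your training script
else
  RESUME_ARGS=""
fi
echo "resume args: '${RESUME_ARGS}'"
# python -u train.py ${RESUME_ARGS} ...   # hypothetical invocation
```

Because ou_bcs_low requeues preempted jobs, the same script then works for both the first launch and every restart.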

Launch tips

  • Prefer ou_bcs_normal for installs and experiments; ou_bcs_high is often limited to one concurrent job.
  • Do large conda installs on compute nodes (Slurm job), not on login nodes.
  • Avoid source ~/.bashrc when using set -u (it can error on unbound vars); source conda.sh directly.
  • Use /bin/mkdir -p in batch scripts to avoid shell aliases or unexpected failures.
  • Set CONDA_PKGS_DIRS=/orcd/scratch/bcs/001/$USER/.conda/pkgs to speed installs and avoid home quota.
  • Chain setup → run with sbatch --dependency=afterok:<jobid> <script.sh> to ensure env is ready.
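The dependency chaining in the last tip can be scripted with sbatch --parsable, which prints only the job ID. setup.sh and run.sh are placeholder script names:

```shell
# Submit the setup job, capture its ID, and gate the run job on its success.
# afterok means run.sh starts only if setup.sh exits 0.
setup_id=$(sbatch --parsable setup.sh)
sbatch --dependency=afterok:"${setup_id}" run.sh
```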

Local environment check

  • Login node OS varies; check with hostname (orcd-login* is Rocky 8, orcd-vlogin* is CentOS 7).
  • Default base Python does not include GPU libraries; verify with python -c "import torch" via a short sbatch job after activating your env.
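The verification bullet above can be packaged as a short batch script. <env-name> is left as a placeholder, matching the template earlier in this guide; submit with sbatch verify_env.sh:

```shell
#!/bin/bash
#SBATCH --job-name=verify-env
#SBATCH --partition=ou_bcs_normal
#SBATCH --time=00:05:00
#SBATCH --gpus=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G

set -euo pipefail
set +u
source "$HOME/miniconda3/etc/profile.d/conda.sh"
set -u
conda activate <env-name>   # replace with your env

# Prints the torch version and whether a GPU is visible from the compute node.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```

If this prints True, the environment and GPU allocation are both working; any import error here is caught in a five-minute job rather than mid-experiment.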