# AGENTS.md - Engaging (OpenMind/BCS) Cluster Guide (Project-Agnostic)

This repo runs on MIT Engaging (EO) BCS resources. Use this file as the source of truth for job submission, storage, and path hygiene.
## Login nodes and OS

- Rocky 8 login nodes for BCS and MIT Rocky 8 partitions: `orcd-login001`..`orcd-login004` (or OOD Engaging Shell Access).
- CentOS 7 login nodes only for `sched_mit_hill`: `orcd-vlogin001`..`orcd-vlogin004` (OOD Engaging Legacy Shell).
- Do not run training or large jobs on login nodes; submit everything to Slurm.
- Do not use interactive nodes; use `sbatch` for all compute, installs, and env verification.
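The login-node rule above can be enforced with a small guard at the top of heavy scripts. This is a sketch, assuming login hostnames follow the `orcd-login*`/`orcd-vlogin*` patterns listed above; the function name is illustrative.

```shell
# Sketch: refuse to run heavy work on a login node.
# Assumption: login hostnames match orcd-login* or orcd-vlogin*.
on_login_node() {
  case "$(hostname)" in
    orcd-login*|orcd-vlogin*) return 0 ;;
    *) return 1 ;;
  esac
}

if on_login_node; then
  echo "This is a login node; submit heavy work with sbatch instead." >&2
else
  echo "Not a login node; heavy work is OK here."
fi
```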
## BCS GPU partitions (Rocky 8)

- `ou_bcs_high`: 4 h walltime, up to 1 GPU, 32 CPUs, 1 node; reserved for interactive use only.
- `ou_bcs_normal`: 12 h walltime, up to 8 GPUs, 256 CPUs, 2 nodes; use for all batch jobs.
- `ou_bcs_low`: 12 h walltime, up to 16 GPUs, 512 CPUs, 4 nodes; preemptible, requires checkpointing.
- No Slurm account string is required for `ou_bcs_*` partitions; omit `#SBATCH --account`.
- Always use `ou_bcs_normal` for installs, experiments, and verification jobs.
## MIT partitions (Rocky 8, optional overflow)

- `mit_normal`, `mit_normal_gpu`, `mit_preemptible` (Rocky 8 only).
- Use `sched_mit_hill` only from CentOS 7 login nodes.
## Storage layout (summary)

- `/home/<user>`: small files, source code, 200 GB, snapshot backup.
- `/home/<user>/orcd/pool`: medium I/O, 1 TB, no backup.
- `/home/<user>/orcd/scratch`: fast scratch, 1 TB, no backup.
- `/orcd/data/<PI>/001`: shared lab storage, medium I/O, no backup.
- `/orcd/scratch/bcs/001` and `/orcd/scratch/bcs/002`: shared fast scratch, no backup.
- `/orcd/compute/bcs/001` and `/orcd/datasets/001`: public datasets (read-only).
## Path hygiene (important)

Avoid hard-coded absolute paths. Centralize paths via environment variables so scripts are portable across nodes.

Recommended pattern:

```bash
PROJECT_ROOT=/orcd/data/<PI>/001/<user>/<project>
DATA_ROOT=/orcd/data/<PI>/001/<user>/datasets
OUTPUT_ROOT=/orcd/scratch/bcs/001/<user>/<project>
HF_HOME=/orcd/scratch/bcs/001/<user>/.cache/huggingface
TORCH_HOME=/orcd/scratch/bcs/001/<user>/.cache/torch
```
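One way to apply this pattern is a single env file that every batch script sources. The sketch below is hypothetical (the `env.sh` name and the fallback defaults are illustrative only, so it can run anywhere); on the cluster, fill in the real `<PI>`/`<user>`/`<project>` values.

```shell
# env.sh - hypothetical shared environment file; source it from each batch script.
# The ${VAR:-default} fallbacks are only so this sketch runs outside the cluster.
export PROJECT_ROOT="${PROJECT_ROOT:-$HOME/project}"
export DATA_ROOT="${DATA_ROOT:-$PROJECT_ROOT/datasets}"
export OUTPUT_ROOT="${OUTPUT_ROOT:-$PROJECT_ROOT/outputs}"
export HF_HOME="${HF_HOME:-$OUTPUT_ROOT/.cache/huggingface}"
export TORCH_HOME="${TORCH_HOME:-$OUTPUT_ROOT/.cache/torch}"

# Create output/cache dirs up front with the absolute mkdir path,
# so alias definitions cannot interfere.
/bin/mkdir -p "$OUTPUT_ROOT" "$HF_HOME" "$TORCH_HOME"
```

A batch script then just does `source env.sh` and refers only to `$DATA_ROOT`, `$OUTPUT_ROOT`, etc.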
## Slurm template (BCS)

```bash
#!/bin/bash
#SBATCH --job-name=experiment
#SBATCH --partition=ou_bcs_normal
#SBATCH --time=12:00:00
#SBATCH --gpus=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --output=/orcd/data/<PI>/001/<user>/<project>/logs/%x_%j.out
#SBATCH --error=/orcd/data/<PI>/001/<user>/<project>/logs/%x_%j.err

set -euo pipefail

# conda.sh can reference unbound variables, so relax -u while sourcing it.
set +u
source "$HOME/miniconda3/etc/profile.d/conda.sh"
set -u
conda activate <env-name>

export HF_HOME=/orcd/scratch/bcs/001/$USER/.cache/huggingface
export TORCH_HOME=/orcd/scratch/bcs/001/$USER/.cache/torch
export OUTPUT_ROOT=/orcd/scratch/bcs/001/$USER/<project>

python -u <script>.py --out_dir "${OUTPUT_ROOT}/runs/run_${SLURM_JOB_ID}"
```

For sweeps, use job arrays and include `%A_%a` in log paths:

```bash
#SBATCH --array=0-65%8
```
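Inside an array job, `SLURM_ARRAY_TASK_ID` selects the configuration. A minimal sketch, assuming a grid of hyperparameters (the learning rates, seeds, and job name below are illustrative, not from this repo):

```shell
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --partition=ou_bcs_normal
#SBATCH --array=0-5%2
#SBATCH --output=logs/%x_%A_%a.out

# Hypothetical 3x2 grid: index the task id into it.
LRS=(1e-3 3e-4 1e-4)
SEEDS=(0 1)
i=${SLURM_ARRAY_TASK_ID:-0}   # fallback only so the sketch runs off-cluster
LR=${LRS[$((i % 3))]}
SEED=${SEEDS[$((i / 3))]}
echo "task ${i}: lr=${LR} seed=${SEED}"
```

With `--array=0-5%2`, all six combinations run, at most two at a time.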
## Common Slurm commands

```bash
sbatch <script.sh>
squeue -u $USER
sacct -j <job_id> --format=JobID,State,Elapsed,MaxRSS,AllocTRES
scontrol show job <job_id>
scancel <job_id>
```
## Checkpointing and preemption

- Jobs on `ou_bcs_low` can be preempted. Always enable checkpoint writes and resume logic.
- Store checkpoints and final outputs on `/orcd/data/<PI>/001` (longer term) or `/home/<user>/orcd/pool`.
- Use `/orcd/scratch/bcs/001` or `/orcd/scratch/bcs/002` for intermediate outputs and large I/O.
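The resume logic can be sketched as a wrapper that reuses a fixed run directory, so a requeued job finds the checkpoint its predecessor wrote. The `--resume` flag and file names are hypothetical; your trainer must actually support them.

```shell
# Sketch: reuse a fixed run dir so a requeued ou_bcs_low job can resume.
# OUTPUT_ROOT fallback and the run name are illustrative only.
RUN_DIR="${OUTPUT_ROOT:-/tmp/${USER:-user}}/runs/sweep_a"
/bin/mkdir -p "$RUN_DIR"
CKPT="$RUN_DIR/latest.ckpt"

RESUME_ARGS=""
if [ -f "$CKPT" ]; then
  RESUME_ARGS="--resume $CKPT"   # hypothetical flag; adapt to your trainer
fi
echo "python train.py --out_dir $RUN_DIR $RESUME_ARGS"
```

The key point is that the run directory is derived from something stable (a sweep name or array job id), not from `$SLURM_JOB_ID`, which changes on requeue.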
## Launch tips

- Prefer `ou_bcs_normal` for installs and experiments; `ou_bcs_high` is often limited to one concurrent job.
- Do large conda installs on compute nodes (as a Slurm job), not on login nodes.
- Avoid `source ~/.bashrc` when using `set -u` (it can error on unbound variables); source `conda.sh` directly.
- Use `/bin/mkdir -p` in batch scripts to avoid shell aliases or unexpected failures.
- Set `CONDA_PKGS_DIRS=/orcd/scratch/bcs/001/$USER/.conda/pkgs` to speed up installs and avoid the home quota.
- Chain setup → run with `sbatch --dependency=afterok:<jobid> <script.sh>` to ensure the env is ready.
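Chaining can be scripted with `sbatch --parsable`, which prints just the job id. A sketch, using a stand-in `submit` function so it runs off-cluster (the script names are illustrative); on Engaging, replace `submit` with `sbatch` itself.

```shell
# Stand-in for `sbatch --parsable` so this sketch runs anywhere;
# on the cluster, use sbatch directly.
submit() { echo "12345"; }

# Capture the setup job's id, then submit the run job with a dependency
# that fires only if setup exits cleanly (afterok).
setup_id=$(submit --parsable setup_env.sh)
echo "next: sbatch --dependency=afterok:${setup_id} run_experiment.sh"
```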
## Local environment check

- Login nodes vary; check with `hostname`.
- The default base Python does not include GPU libraries; verify with `python -c "import torch"` via a short `sbatch` job after activating your env.
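A minimal verification job might look like the fragment below. This is a config sketch, not a tested script: the `check_env.sh` name and `<env-name>` placeholder are illustrative, and the resource lines mirror the template above.

```shell
#!/bin/bash
# check_env.sh - hypothetical 5-minute GPU env check; submit with sbatch.
#SBATCH --job-name=env-check
#SBATCH --partition=ou_bcs_normal
#SBATCH --time=00:05:00
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4

set +u
source "$HOME/miniconda3/etc/profile.d/conda.sh"
set -u
conda activate <env-name>
python -c "import torch; print(torch.__version__, 'cuda:', torch.cuda.is_available())"
```

If the last line prints `cuda: True`, the env is ready for GPU jobs.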