AGENTS Guide - audio-embeddings

This file is for coding agents working in this repository. Follow these repo-specific rules over generic defaults.

1) Environment Snapshot

Python: >=3.12 (from pyproject.toml).
Dependency manager: uv.
Main stack: PyTorch, PyTorch Lightning, Hydra, OmegaConf.
Project root marker: .project-root.
Main entrypoint: src/train.py.

2) Cursor / Copilot Rule Files

Checked .cursor/rules/: not present.
Checked .cursorrules: not present.
Checked .github/copilot-instructions.md: not present.
Therefore, no additional Cursor/Copilot rule files are currently enforced.

3) Install / Setup Commands

uv sync
uv run <command>
uv add <package>

4) Build / Train / Eval Commands

There is no separate "build" step (this is a training codebase). Use quick-run training as the integration sanity check.

uv run src/train.py
uv run src/train.py trainer.fast_dev_run=True
uv run src/train.py trainer=cpu trainer.fast_dev_run=True
uv run src/train.py experiment=local/audio_jepa
uv run src/train.py trainer.max_epochs=10 data.batch_size=32 model.optimizer.lr=1e-4

Cluster-style execution (existing project pattern):

srun .venv/bin/python -u -O src/train.py experiment=cluster_jepa_audioset_rope +trainer.max_time="00:19:50:00"

5) Lint / Formatting / Static Checks

Use the commands below as pragmatic checks:

uv run pre-commit run --all-files
uv run pre-commit run ruff --all-files
uv run pre-commit run ruff-format --all-files
uv run python -m compileall src

Ruff is configured via .pre-commit-config.yaml and runs both lint fixes and formatting.

6) Test Commands (Including Single Test)

Primary validation in this repo is script-based verification under tests/. Run these scripts directly with uv run rather than through pytest:

uv run tests/verify_rope.py
uv run tests/verify_custom_rope.py
uv run tests/verify_data.py

Useful single-file checks (native execution):

uv run src/train.py trainer.fast_dev_run=True
uv run src/train.py trainer=cpu trainer.fast_dev_run=True
uv run scripts/verify_shapes.py
uv run scripts/verify_scheduler.py

Notes:

Files matching tests/test_*.py are pytest-style and are not part of the default script-based workflow.
Prefer tests/verify_*.py and scripts/verify_*.py for lightweight checks.

7) Repository Architecture Expectations

configs/: Hydra composition (trainer/data/model/logger/callbacks/experiment).
src/train.py: orchestration only (instantiate and run).
src/models/: LightningModules (high-level training logic).
src/models/components/: reusable nn.Module building blocks.
src/data/: DataModules/Datasets and collate logic.
src/utils/: logging, instantiation, wrappers, scheduler helpers.

When possible, prefer config changes over hardcoded Python changes.

8) Code Style Guidelines

Imports

Group imports as: standard library -> third-party -> local src.*.
Keep one import per line unless importing multiple names from same module.
Avoid wildcard imports.
Prefer absolute imports from src.* over relative imports.

Formatting

Use 4-space indentation and readable line lengths.
Keep functions small; extract helpers for complex logic.
Do not introduce unrelated reformatting in touched files.
Keep comments for non-obvious intent, not obvious mechanics.

Typing

Type hints are expected for function arguments and return values.
Use concrete tensor/container types when practical.
Use either Optional[T] or T | None, and keep the choice consistent within a file.
For dict-like configs, type as DictConfig when passing Hydra config objects.

Naming

snake_case: functions, variables, module filenames.
PascalCase: classes (AudioJEPAModule, AudioSetDataModule).
UPPER_SNAKE_CASE: constants.
Prefer descriptive names (mask_indices) over short names (m2) except local math temporaries.

PyTorch / Lightning / Hydra Conventions

Keep heavy compute out of __init__ where possible.
Use forward() for inference logic; keep training behavior in training_step().
Use self.log(...) with explicit flags (on_step, on_epoch, prog_bar, batch_size).
Instantiate components through Hydra (hydra.utils.instantiate).
Expose tunable parameters in config files, not hardcoded literals.

Error Handling and Validation

Raise informative ValueError / RuntimeError for invalid config/state.
Validate critical tensor assumptions with assertions or explicit checks.
Prefer logger/warnings over bare print() in new code.
For file I/O, prefer pathlib.Path and existence checks.
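
The rules above might look like this in practice. This is a sketch under stated assumptions: mask_ratio, the manifest file format, and both function names are hypothetical and do not refer to existing repository code.

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)


def validate_mask_ratio(mask_ratio: float) -> float:
    """Reject mask ratios outside the open interval (0, 1)."""
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError(f"mask_ratio must be in (0, 1), got {mask_ratio!r}")
    return mask_ratio


def load_manifest(manifest_path: Path) -> list[str]:
    """Read one entry per non-empty line, failing loudly if the file is missing."""
    if not manifest_path.is_file():
        raise FileNotFoundError(f"Manifest not found: {manifest_path}")
    lines = manifest_path.read_text().splitlines()
    entries = [line.strip() for line in lines if line.strip()]
    if not entries:
        # Use the logger rather than print() so the message respects log config.
        log.warning("Manifest %s is empty", manifest_path)
    return entries
```

The error messages include the offending value or path, so a failed run points directly at the config field or file that needs fixing.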

Data and Paths

Do not hardcode absolute machine paths.
Use rootutils.setup_root(..., indicator=".project-root", pythonpath=True) in entrypoints/scripts when needed.
Respect cfg.paths.* outputs for logs/checkpoints/artifacts.
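
To show what the .project-root indicator is for, here is a standard-library sketch of a marker-based root search. It mimics only the search step and is not the rootutils implementation; find_project_root is a hypothetical name. In repository code, keep calling rootutils.setup_root as described above.

```python
from pathlib import Path


def find_project_root(start: Path, indicator: str = ".project-root") -> Path:
    """Return the nearest directory at or above `start` that contains `indicator`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / indicator).exists():
            return candidate
    raise FileNotFoundError(f"No {indicator!r} marker found at or above {start}")
```

Because the root is discovered from the marker file, paths built from it stay valid on any machine, which is why absolute machine paths are unnecessary.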

9) Agent Workflow Rules

Reuse existing components before adding new abstractions.
Keep src/train.py generic; place model/data logic in dedicated modules.
Prefer minimal, focused diffs.
Update configs and docs when behavior changes.
Validate with the smallest meaningful command first (fast_dev_run, single test), then broader checks.

10) Git / Change Hygiene

Do not revert unrelated local changes.
Keep commits scoped to one concern.
Write clear commit messages describing intent.
Prefer Conventional Commit-like format: type(scope): intent.
Common types in this repo: feat, fix, conf, build, docs, style, chore.
Never commit secrets, credentials, or environment-specific absolute paths.
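
The commit format can be checked with a short regular expression. The sketch below is one possible reading of the rule, not an enforced hook: it assumes the scope is required and that the listed types are exhaustive, both of which are stricter than "common types" implies. The function name and example messages are hypothetical.

```python
import re

# Types listed in this guide; treated as exhaustive here (an assumption).
COMMIT_TYPES = ("feat", "fix", "conf", "build", "docs", "style", "chore")

_TYPES = "|".join(COMMIT_TYPES)
# type(scope): intent, with a non-empty scope and a non-empty intent.
COMMIT_PATTERN = re.compile(rf"^({_TYPES})\(([^()\s]+)\): (\S.*)$")


def is_valid_subject(message: str) -> bool:
    """Check only the first line (the subject) of a commit message."""
    lines = message.splitlines()
    return bool(lines) and COMMIT_PATTERN.match(lines[0]) is not None
```

For example, "fix(data): handle empty manifest" passes, while "update stuff" and "feature(model): add encoder" do not.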

11) Practical Agent Defaults

Prefer reusing existing modules over creating new abstractions.
Keep edits local to the requested change; avoid drive-by refactors.
Run the smallest useful verification command after changes.
If you touch training logic, run at least one fast training sanity check.
If you touch model components, run relevant verify script(s) in tests/.
If you touch Hydra config wiring, run a config-backed entry command via uv run src/train.py ....
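
The three "if you touch" rules above can be read as a mapping from changed files to verification commands. The sketch below is illustrative: select_checks is a hypothetical helper, the path-to-rule mapping is an assumption based on the architecture section, and pairing every change under src/models/components/ with the RoPE verify scripts is a simplification, since other components may need other scripts.

```python
from pathlib import PurePosixPath

# Commands copied from this guide.
FAST_DEV_RUN = "uv run src/train.py trainer.fast_dev_run=True"
CPU_FAST_DEV_RUN = "uv run src/train.py trainer=cpu trainer.fast_dev_run=True"
ROPE_CHECKS = ("uv run tests/verify_rope.py", "uv run tests/verify_custom_rope.py")


def select_checks(changed_files: list[str]) -> list[str]:
    """Map changed repo-relative paths to the smallest relevant checks."""
    checks: list[str] = []

    def add(*commands: str) -> None:
        # Preserve first-seen order and avoid duplicates.
        for command in commands:
            if command not in checks:
                checks.append(command)

    for name in changed_files:
        path = PurePosixPath(name)
        if path.is_relative_to("src/models/components"):
            add(*ROPE_CHECKS)  # model components -> verify scripts
        elif path.is_relative_to("src/models") or name == "src/train.py":
            add(FAST_DEV_RUN)  # training logic -> fast training sanity check
        elif path.is_relative_to("configs"):
            add(CPU_FAST_DEV_RUN)  # Hydra wiring -> config-backed entry command
    return checks
```

Files outside these areas return an empty list, in which case the general rule still applies: run the smallest useful verification command for the change.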

12) Common Pitfalls

Avoid hardcoding data paths; use config (cfg.paths, data config fields).
Avoid printing in new code paths; use ranked loggers/warnings.
Avoid putting heavy tensor compute in constructors.
Avoid bypassing Hydra by manually instantiating configurable components.
Avoid changing unrelated formatting in files you touch.