
	# H100 Jupyter Notebook Setup

	This guide walks you through setting up the OpenEnv Bio Experiment environment on an NVIDIA H100 Jupyter notebook instance (e.g., Jupiter Labs, Lambda Labs, RunPod, or similar).

	## Prerequisites

	- Python 3.10, 3.11, or 3.12 (3.12 recommended for H100). Python 3.13 is not supported because numba, vllm, and other dependencies require <3.13.
	- uv – fast Python package manager ([install instructions](#installing-uv))
	- NVIDIA driver ≥ 535.104.05 (usually pre-installed on H100 instances)
	- CUDA – H100 uses CUDA 12.x; PyTorch wheels bundle the runtime, so a separate CUDA Toolkit is not required
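
A quick way to check these prerequisites from Python is sketched below. It is a coarse check under stated assumptions: the helper names are illustrative, and it only looks for `uv` and `nvidia-smi` on `PATH` rather than verifying their versions.

```python
import shutil
import sys

# Python 3.10-3.12 are supported per the list above; 3.13 is not.
SUPPORTED_MINORS = {10, 11, 12}


def python_supported(major: int, minor: int) -> bool:
    """Return True if the given Python version is in the supported range."""
    return major == 3 and minor in SUPPORTED_MINORS


def check_prerequisites() -> dict:
    """Collect a coarse prerequisite report for the current interpreter."""
    major, minor = sys.version_info.major, sys.version_info.minor
    return {
        "python": f"{major}.{minor}",
        "python_supported": python_supported(major, minor),
        # Presence on PATH only; this does not check tool or driver versions.
        "uv_on_path": shutil.which("uv") is not None,
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


print(check_prerequisites())
```

If `python_supported` is false or `uv_on_path` is false, fix that before continuing with the setup steps.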

	## Installing uv

	If `uv` is not already installed:

	```bash
	# Unix/Linux (including Jupyter notebook terminals)
	curl -LsSf https://astral.sh/uv/install.sh | sh

	# Or with pip
	pip install uv
	```

	Verify:

	```bash
	uv --version
	```

	## Quick Setup (Recommended)

	### 1. Clone and enter the project

	```bash
	git clone <repository-url> OpenENV-Hackathon
	cd OpenENV-Hackathon
	```

	### 2. Use uv's auto PyTorch backend

	The project pins Python 3.12 (see `.python-version`), so uv creates a 3.12 virtual environment. On H100 (CUDA 12.x), select the cu128 wheels:

	```bash
	# Install everything: core + training (TRL, transformers, torch) + Jupyter
	UV_TORCH_BACKEND=cu128 uv sync --extra train

	# Add Unsloth for training_unsloth.py (skips trl downgrade; Unsloth works with TRL 0.29)
	uv pip install unsloth unsloth_zoo --no-deps

	# (ipykernel is included in --extra train)
	```

	If `UV_TORCH_BACKEND=cu128` fails (e.g., cu128 wheels not available yet), try:

	```bash
	UV_TORCH_BACKEND=cu126 uv sync --extra train
	```

	### 3. Register the environment as a Jupyter kernel

	```bash
	uv run python -m ipykernel install --user --name openenv-bio-312 --display-name "OpenEnv Bio (Python 3.12)"
	```

	Or run the helper script (from project root):

	```bash
	bash scripts/register_kernel_312.sh
	```

	Then select "OpenEnv Bio (Python 3.12)" in the notebook kernel picker.
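
To confirm the kernel was registered, you can inspect the output of `jupyter kernelspec list --json`. The sketch below parses that JSON and looks for the `openenv-bio-312` name used above. The helper name and the sample payload are illustrative, not taken from the project.

```python
import json


def kernel_registered(kernelspec_json: str, name: str = "openenv-bio-312") -> bool:
    """Return True if `name` appears in `jupyter kernelspec list --json` output."""
    specs = json.loads(kernelspec_json).get("kernelspecs", {})
    return name in specs


# Illustrative sample only. Real output comes from running:
#   jupyter kernelspec list --json
sample = json.dumps({
    "kernelspecs": {
        "python3": {
            "resource_dir": "/path/to/python3",
            "spec": {"display_name": "Python 3"},
        },
        "openenv-bio-312": {
            "resource_dir": "/path/to/openenv-bio-312",
            "spec": {"display_name": "OpenEnv Bio (Python 3.12)"},
        },
    }
})

print(kernel_registered(sample))
```

In practice you would pass the captured command output instead of `sample`.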

	### 4. Verify CUDA

	In a new Jupyter notebook, select the "OpenEnv Bio (Python 3.12)" kernel and run:

	```python
	import torch
	print(f"PyTorch: {torch.__version__}")
	print(f"CUDA available: {torch.cuda.is_available()}")
	print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
	```

	Expected output (or similar):

	```
	PyTorch: 2.x.x+cu128
	CUDA available: True
	GPU: NVIDIA H100 ...
	```
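
The suffix after `+` in the version string identifies the wheel build. A minimal sketch for reading it is shown below. The function names are illustrative, and a missing or non-`cu` tag only hints at a CPU build; `torch.version.cuda` and `torch.cuda.is_available()` remain the direct checks.

```python
from typing import Optional


def build_tag(torch_version: str) -> Optional[str]:
    """Return the local build tag of a PyTorch version string, if any."""
    _, sep, local = torch_version.partition("+")
    return local if sep else None


def looks_like_cuda_build(torch_version: str) -> bool:
    """Heuristic: CUDA wheels from the PyTorch index carry a 'cuXYZ' tag."""
    tag = build_tag(torch_version)
    return tag is not None and tag.startswith("cu")


for version in ("2.10.0+cu128", "2.10.0+cpu", "2.10.0"):
    print(version, "->", build_tag(version), looks_like_cuda_build(version))
```

In a notebook, pass `torch.__version__` to `build_tag` to see which backend was actually installed.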

	### 5. Sanity check the environment

	```bash
	uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
	```

	## Manual PyTorch CUDA Configuration

	If you need explicit control over the PyTorch index (e.g., for reproducibility), pin the index in `pyproject.toml` and re-sync.

	### Add to `pyproject.toml`

	```toml
	# After [tool.uv], add:

	[[tool.uv.index]]
	name = "pytorch-cu128"
	url = "https://download.pytorch.org/whl/cu128"
	explicit = true

	[tool.uv.sources]
	torch = [{ index = "pytorch-cu128" }]
	torchvision = [{ index = "pytorch-cu128" }]
	```

	Then run:

	```bash
	uv sync --extra train
	```

	For CUDA 12.6 instead of 12.8, replace `cu128` with `cu126` in the index name, the index URL, and the `[tool.uv.sources]` entries.

	## Dependency Groups

	| `uv sync` flag | Contents |
	|---|---|
	| (default) | Core: `openenv-core`, `numpy`, `scipy`, `pydantic` |
	| `--extra dev` | Testing: `pytest`, `pytest-cov` |
	| `--extra train` | Training: `torch`, `transformers`, `trl`, `accelerate`, `peft`, `unsloth`, etc. |
	| `--extra bio` | Bioinformatics: `scanpy`, `biopython`, `gseapy` |
	| `--extra train --extra dev` | Combined for development + training |

	## Preferred H100 Workflow

	On H100, use the quantized Unsloth entrypoints:

	```bash
	uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run
	uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b
	uv run python run_agent_unsloth.py
	```

	The checked-in `inference.ipynb` notebook uses `training_unsloth.py` helpers with 4-bit loading. vLLM fast inference is disabled to avoid dependency conflicts.

	## Running Training in a Jupyter Notebook

	Example cell:

	```python
	# In a notebook with the OpenEnv Bio (Python 3.12) kernel
	!uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run
	```

	Or launch the same command from Python with `subprocess`:

	```python
	import subprocess
	subprocess.run(
	    [
	        "uv", "run", "python", "training_unsloth.py",
	        "--model-id", "Qwen/Qwen3-4B-Base",
	        "--output-dir", "training/grpo-unsloth-qwen3-4b",
	    ],
	    check=True,  # raise CalledProcessError on a non-zero exit code
	)
	```
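
For long runs it can help to see output as it is produced rather than after the process exits. The sketch below streams a child process's output line by line. `run_streaming` is an illustrative helper, not part of the project, and the command shown is a stand-in so the example runs anywhere.

```python
import subprocess
import sys


def run_streaming(cmd: list[str]) -> int:
    """Run `cmd`, echoing each output line as it arrives; return the exit code."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered in text mode
    ) as proc:
        for line in proc.stdout:
            print(line, end="")
    # Leaving the `with` block waits for the process, so returncode is set.
    return proc.returncode


# Stand-in command; replace it with the training command list shown above.
exit_code = run_streaming([sys.executable, "-c", "print('dry run ok')"])
print("exit code:", exit_code)
```

Unlike `check=True`, this helper returns the exit code instead of raising, so check the returned value yourself.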

	## Requirements Summary

	| Component | Version / Notes |
	|---|---|
	| Python | 3.10–3.12 (3.12 recommended; 3.13 not supported) |
	| uv | ≥ 0.5.3 (for PyTorch index support) |
	| torch | ≥ 2.10.0 (cu128 or cu126 for H100) |
	| transformers | ≥ 4.57 (with unsloth ≥ 2025.10.14) |
	| trl | ≥ 0.29.0 |
	| accelerate | ≥ 1.13.0 |
	| Jupyter | Optional, for notebook workflows |

	## Troubleshooting

	### `RuntimeError: Cannot install on Python version 3.13.x` or numba / setup.py errors

	Python 3.13 is not supported (numba, vllm, and other deps require <3.13). Use Python 3.12:

	```bash
	# With uv: ensure Python 3.12 is available, then sync
	uv python install 3.12
	uv sync --extra train

	# Or create venv explicitly with 3.12
	uv venv --python 3.12
	UV_TORCH_BACKEND=cu128 uv sync --extra train
	```

	The project's `.python-version` file pins 3.12; uv will use it when creating the venv.

	### `torch.cuda.is_available()` is False

	- Confirm the Jupyter kernel is the one where you ran `uv sync` (the one with `ipykernel`).
	- Ensure no CPU-only PyTorch is overriding the CUDA build (e.g., from a different conda/pip env).
	- Run `uv run python -c "import torch; print(torch.__file__)"` to verify PyTorch comes from your project venv.
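
The first check above can also be run inside the notebook itself. The sketch below compares the kernel's interpreter prefix with the project's virtual environment. It assumes uv's default `.venv` directory and that the notebook's working directory is the project root; `inside_venv` is an illustrative helper.

```python
import sys
from pathlib import Path


def inside_venv(prefix: str, venv_dir: str) -> bool:
    """Return True if the interpreter prefix is the given venv directory."""
    return Path(prefix).resolve() == Path(venv_dir).resolve()


print("interpreter:", sys.executable)
# Assumes the default uv location (.venv) relative to the working directory.
print("in project venv:", inside_venv(sys.prefix, ".venv"))
```

If this prints `False`, the notebook is running a kernel from a different environment than the one `uv sync` populated.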

	### Flash Attention / causal-conv fallback warnings

	These are common and usually harmless; execution continues with a slower path. For best H100 performance, ensure `transformers` and `torch` are recent versions that support Flash Attention 2.

	### HuggingFace symlink warnings

	Set:

	```bash
	export HF_HUB_DISABLE_SYMLINKS_WARNING=1
	```

	### Out-of-memory during training

	- Reduce `--num-generations` or `--rollout-steps`.
	- Use a smaller model (e.g., `Qwen/Qwen3.5-0.8B`) for experiments.
	- Leave 4-bit loading enabled (do not pass `--disable-4bit`) unless you explicitly need higher-precision weights.
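
Before changing flags, it can help to see how much GPU memory is actually free. The sketch below reports it with `torch.cuda.mem_get_info()`, and falls back to a message when PyTorch or CUDA is unavailable. The helper names are illustrative.

```python
def format_gib(num_bytes: int) -> str:
    """Format a byte count as GiB with one decimal place."""
    return f"{num_bytes / 1024**3:.1f} GiB"


def gpu_memory_report() -> str:
    """Describe free and total memory on the current CUDA device."""
    try:
        import torch
    except (ImportError, OSError):
        return "torch could not be imported in this environment"
    if not torch.cuda.is_available():
        return "CUDA is not available"
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return f"free {format_gib(free_bytes)} of {format_gib(total_bytes)}"


print(gpu_memory_report())
```

If little memory is free before training starts, another process or notebook kernel may still be holding it.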

	### `ModuleNotFoundError: No module named 'vllm.lora.models'`

	Unsloth's `unsloth_zoo` imports vLLM at load time and expects `vllm.lora.models`, which some vLLM versions don't have. Fix by installing a compatible vLLM:

	```bash
	pip install "vllm==0.8.2"
	# or
	pip install "vllm==0.7.3"
	```

	Note: vLLM 0.8.2 pins `torch==2.6.0`, which conflicts with this project's `torch>=2.10.0`. If you hit that conflict:

	1. Use a separate environment with torch 2.6–2.8 + vllm 0.8.2 + unsloth.
	2. Or use the non-Unsloth path (`training_script.py` / `train.ipynb`) which doesn't depend on vLLM.
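
To see whether the installed vLLM provides the module Unsloth expects, you can probe for it without importing it directly. The sketch below uses `importlib.util.find_spec`; `has_module` is an illustrative helper. Note that `find_spec` imports parent packages to locate a submodule, so probing `vllm.lora.models` does import `vllm` when it is installed.

```python
import importlib.util


def has_module(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located by the import system."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or cannot be imported.
        return False


for name in ("vllm", "vllm.lora.models"):
    print(name, "->", has_module(name))
```

If `vllm` is found but `vllm.lora.models` is not, the installed vLLM version is the likely cause of the error.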

	### `KeyError: 'qwen3_5'` / Qwen3.5 not supported

	Qwen3.5 requires transformers 5.x. With transformers 4.57, use a Qwen2.5 model instead, for example:

	- `unsloth/Qwen2.5-3B-Instruct-bnb-4bit`
	- `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
	- `Qwen/Qwen2.5-3B-Instruct`

	### `NameError: name 'PreTrainedConfig' is not defined` / `check_model_inputs` ImportError

	Use `unsloth` ≥ 2025.10.14 (which includes the `PreTrainedConfig` fix) together with `transformers` ≥ 4.57 (which provides `check_model_inputs`). Run `uv sync --extra train` to get compatible versions.
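
To check what is installed against these minimums, you can read package metadata from the active environment. The sketch below is a rough comparison under stated assumptions: it looks only at the leading numeric components and ignores pre-release markers, so `packaging.version` is the more robust choice where available. The helper names are illustrative.

```python
import re
from importlib import metadata

# Minimum versions named in this section.
MINIMUMS = {"transformers": "4.57", "unsloth": "2025.10.14"}


def release_tuple(version: str) -> tuple[int, ...]:
    """Leading numeric components, e.g. '2.10.0+cu128' gives (2, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


for name, minimum in MINIMUMS.items():
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"needs >= {minimum}"
    print(f"{name} {installed}: {status}")
```

Run it with the project kernel selected so that it inspects the same environment the notebook uses.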

	### `ImportError: cannot import name 'ConstantLengthDataset' from 'trl.trainer.utils'`

	unsloth_zoo expects TRL <0.20. The project pins `trl>=0.19.0,<0.20`. If you see this error, ensure you've run `uv sync --extra train` so the locked trl version is used. Alternatively, try:

	```bash
	pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo
	```

	(A newer unsloth_zoo may fix this and allow TRL 0.20+.)

	### Unsloth import order warning

	If you see "Unsloth should be imported before trl, transformers, peft", ensure `training_unsloth` is imported before `training_script` in your notebook:

	```python
	from training_unsloth import make_training_args, run_training # first
	import training_script as base
	```
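
If the warning persists, some earlier cell may already have imported one of those libraries. The sketch below lists which of them are present in `sys.modules`. Run it at the top of the notebook, before the Unsloth import. The helper name is illustrative.

```python
import sys

# Libraries the warning says must be imported after Unsloth.
LATE_IMPORT_SENSITIVE = ("trl", "transformers", "peft")


def already_imported(modules=None) -> list[str]:
    """Return the sensitive libraries that are already loaded."""
    modules = sys.modules if modules is None else modules
    return [name for name in LATE_IMPORT_SENSITIVE if name in modules]


print(already_imported())
```

A non-empty list means the kernel should be restarted so that `training_unsloth` is the first of these imports.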

	## See Also

	- Main [README.md](README.md) for project overview, APIs, and usage
	- [uv PyTorch guide](https://docs.astral.sh/uv/guides/integration/pytorch/) for advanced PyTorch configuration