
H100 Jupyter Notebook Setup

This guide walks you through setting up the OpenEnv Bio Experiment environment on an NVIDIA H100 Jupyter notebook instance (e.g., on Lambda Labs, RunPod, or a similar GPU cloud provider).

Prerequisites

  • Python 3.10, 3.11, or 3.12 (3.12 recommended for H100; 3.13 is not supported—numba, vllm, and others require <3.13)
  • uv – a fast Python package manager (see Installing uv below)
  • NVIDIA driver ≥ 535.104.05 (usually pre-installed on H100 instances)
  • CUDA – H100 uses CUDA 12.x; PyTorch wheels bundle the runtime, so a separate CUDA Toolkit is not required
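
Before syncing, it can save time to confirm the interpreter meets the version constraint. A minimal sketch (the helper name is ours, not part of the project):

```python
import sys

# Supported minor versions; 3.13+ breaks numba, vllm, and other dependencies.
SUPPORTED_MINORS = {(3, 10), (3, 11), (3, 12)}

def python_supported(major: int, minor: int) -> bool:
    """Return True if (major, minor) is a supported Python version."""
    return (major, minor) in SUPPORTED_MINORS

if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "OK" if python_supported(major, minor) else "unsupported (use 3.10-3.12)"
    print(f"Python {major}.{minor}: {status}")
```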

Installing uv

If uv is not already installed:

# Unix/Linux (including Jupyter notebook terminals)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with pip
pip install uv

Verify:

uv --version

Quick Setup (Recommended)

1. Clone and enter the project

git clone <repository-url> OpenENV-Hackathon
cd OpenENV-Hackathon

2. Use uv's auto PyTorch backend

The project uses Python 3.12 (see .python-version). uv will create a 3.12 venv. For H100 (CUDA 12.x):

# Install everything: core + training (TRL, transformers, torch) + Jupyter
UV_TORCH_BACKEND=cu128 uv sync --extra train

# Add Unsloth for training_unsloth.py (skips trl downgrade; Unsloth works with TRL 0.29)
uv pip install unsloth unsloth_zoo --no-deps

# (ipykernel is included in --extra train)

If UV_TORCH_BACKEND=cu128 fails (e.g., cu128 wheels not available yet), try:

UV_TORCH_BACKEND=cu126 uv sync --extra train

3. Register the environment as a Jupyter kernel

uv run python -m ipykernel install --user --name openenv-bio-312 --display-name "OpenEnv Bio (Python 3.12)"

Or run the helper script (from project root):

bash scripts/register_kernel_312.sh

Then select "OpenEnv Bio (Python 3.12)" in the notebook kernel picker.

4. Verify CUDA

In a new Jupyter notebook, select the "OpenEnv Bio (Python 3.12)" kernel and run:

import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")

Expected output (or similar):

PyTorch: 2.x.x+cu128
CUDA available: True
GPU: NVIDIA H100 ...
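
For a slightly stronger check than torch.cuda.is_available(), a tiny matmul exercises the device end to end. This sketch (the function name is ours) degrades gracefully when torch or a GPU is absent:

```python
def cuda_smoke_test() -> str:
    """Multiply a small matrix on the GPU; return a status string
    instead of raising, so it is safe to run on any machine."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "cuda unavailable"
    x = torch.randn(64, 64, device="cuda")
    y = (x @ x).sum().item()  # forces the kernel to actually execute
    return f"ok (checksum {y:.1f})"

print(cuda_smoke_test())
```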

5. Sanity check the environment

uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q

Manual PyTorch CUDA Configuration

If you need explicit control over the PyTorch index (e.g., for reproducibility), add the following to pyproject.toml:

# After [tool.uv], add:

[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true

[tool.uv.sources]
torch = [{ index = "pytorch-cu128" }]
torchvision = [{ index = "pytorch-cu128" }]

Then run:

uv sync --extra train

For CUDA 12.6 instead of 12.8, use cu126 in the index URL and source names.

Dependency Groups

| uv sync flag | Contents |
| --- | --- |
| (default) | Core: openenv-core, numpy, scipy, pydantic |
| --extra dev | Testing: pytest, pytest-cov |
| --extra train | Training: torch, transformers, trl, accelerate, peft, unsloth, etc. |
| --extra bio | Bioinformatics: scanpy, biopython, gseapy |
| --extra train --extra dev | Combined for development + training |

Preferred H100 Workflow

On H100, use the quantized Unsloth entrypoints:

uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run
uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b
uv run python run_agent_unsloth.py

The checked-in inference.ipynb notebook uses training_unsloth.py helpers with 4-bit loading. vLLM fast inference is disabled to avoid dependency conflicts.

Running Training in a Jupyter Notebook

Example cell:

# In a notebook with the OpenEnv Bio (Python 3.12) kernel
!uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run

Or run interactively from Python:

import subprocess
subprocess.run([
    "uv", "run", "python", "training_unsloth.py",
    "--model-id", "Qwen/Qwen3-4B-Base",
    "--output-dir", "training/grpo-unsloth-qwen3-4b",
], check=True)
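
subprocess.run buffers output until the process exits, so long training runs show nothing until the end. A minimal streaming variant (the helper name is ours) echoes logs into the notebook as they arrive:

```python
import subprocess
import sys

def run_streaming(cmd: list[str]) -> int:
    """Run cmd, echoing stdout/stderr line by line; return the exit code."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        print(line, end="")
    return proc.wait()

# Demo with a trivial command; substitute the uv training invocation above.
exit_code = run_streaming([sys.executable, "-c", "print('hello')"])
```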

Requirements Summary

| Component | Version / Notes |
| --- | --- |
| Python | 3.10–3.12 (3.12 recommended; 3.13 not supported) |
| uv | ≥ 0.5.3 (for PyTorch index support) |
| torch | ≥ 2.10.0 (cu128 or cu126 for H100) |
| transformers | ≥ 4.57 (with unsloth ≥ 2025.10.14) |
| trl | ≥ 0.29.0 |
| accelerate | ≥ 1.13.0 |
| Jupyter | Optional, for notebook workflows |

Troubleshooting

RuntimeError: Cannot install on Python version 3.13.x or numba / setup.py errors

Python 3.13 is not supported (numba, vllm, and other deps require <3.13). Use Python 3.12:

# With uv: ensure Python 3.12 is available, then sync
uv python install 3.12
uv sync --extra train

# Or create venv explicitly with 3.12
uv venv --python 3.12
UV_TORCH_BACKEND=cu128 uv sync --extra train

The project's .python-version file pins 3.12; uv will use it when creating the venv.

torch.cuda.is_available() is False

  • Confirm the Jupyter kernel is the one where you ran uv sync (the one with ipykernel).
  • Ensure no CPU-only PyTorch is overriding the CUDA build (e.g., from a different conda/pip env).
  • Run uv run python -c "import torch; print(torch.__file__)" to verify PyTorch comes from your project venv.
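
The same diagnosis can be run from inside the notebook. This sketch (the helper name is ours) reports which interpreter and environment the kernel is actually using; uv creates the project venv in .venv:

```python
import sys

def kernel_env_report() -> dict:
    """Summarize where the current interpreter lives; 'in_venv' is True
    when running inside any virtual environment."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_venv": sys.prefix != sys.base_prefix,
    }

for key, value in kernel_env_report().items():
    print(f"{key}: {value}")
```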

Flash Attention / causal-conv fallback warnings

These are common and usually harmless; execution continues with a slower path. For best H100 performance, ensure transformers and torch are recent versions that support Flash Attention 2.

HuggingFace symlink warnings

Set:

export HF_HUB_DISABLE_SYMLINKS_WARNING=1

Out-of-memory during training

  • Reduce --num-generations or --rollout-steps.
  • Use a smaller model (e.g., Qwen/Qwen2.5-0.5B-Instruct) for experiments.
  • Leave 4-bit loading enabled (i.e., do not pass --disable-4bit) unless you explicitly need full-precision weights.
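
As rough intuition for why 4-bit loading matters: weight memory alone scales linearly with bit width (activations, optimizer state, and KV cache come on top). A back-of-the-envelope helper (illustrative only):

```python
def approx_weight_memory_gib(n_params_billions: float, bits: int) -> float:
    """Approximate GPU memory (GiB) consumed by model weights alone."""
    n_bytes = n_params_billions * 1e9 * bits / 8
    return n_bytes / 1024**3

# A 4B-parameter model: ~7.5 GiB in bf16 vs ~1.9 GiB at 4-bit.
print(f"bf16:  {approx_weight_memory_gib(4, 16):.1f} GiB")
print(f"4-bit: {approx_weight_memory_gib(4, 4):.1f} GiB")
```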

ModuleNotFoundError: No module named 'vllm.lora.models'

Unsloth's unsloth_zoo imports vLLM at load time and expects vllm.lora.models, which some vLLM versions don't have. Fix by installing a compatible vLLM:

pip install "vllm==0.8.2"
# or
pip install "vllm==0.7.3"

Note: vLLM 0.8.2 pins torch==2.6.0, which conflicts with this project's torch>=2.10.0. If you hit that conflict:

  1. Use a separate environment with torch 2.6–2.8 + vllm 0.8.2 + unsloth.
  2. Or use the non-Unsloth path (training_script.py / train.ipynb) which doesn't depend on vLLM.

KeyError: 'qwen3_5' / Qwen3.5 not supported

Qwen3.5 requires transformers 5.x. With transformers 4.57, use Qwen2.5 instead:

  • unsloth/Qwen2.5-3B-Instruct-bnb-4bit
  • unsloth/Qwen2.5-7B-Instruct-bnb-4bit
  • Qwen/Qwen2.5-3B-Instruct

NameError: name 'PreTrainedConfig' is not defined / check_model_inputs ImportError

Use unsloth≥2025.10.14 (PreTrainedConfig fix) with transformers≥4.57 (check_model_inputs). Run uv sync --extra train to get compatible versions.

ImportError: cannot import name 'ConstantLengthDataset' from 'trl.trainer.utils'

unsloth_zoo releases that still expect TRL <0.20 conflict with the TRL 0.29 this project uses. If you see this error, ensure you've run uv sync --extra train so the locked, compatible versions are used. Alternatively, try:

pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

(A newer unsloth_zoo may fix this and allow TRL 0.20+.)

Unsloth import order warning

If you see "Unsloth should be imported before trl, transformers, peft", ensure training_unsloth is imported before training_script in your notebook:

from training_unsloth import make_training_args, run_training  # first
import training_script as base
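
A small guard can catch the wrong order before it bites. This sketch (the function name is ours) raises if any of the patched libraries were loaded before unsloth:

```python
import sys

def check_unsloth_import_order() -> None:
    """Raise if trl/transformers/peft are already loaded but unsloth is not,
    i.e. unsloth can no longer patch them at import time."""
    early = [m for m in ("trl", "transformers", "peft") if m in sys.modules]
    if early and "unsloth" not in sys.modules:
        raise RuntimeError(f"Import unsloth before: {', '.join(early)}")

check_unsloth_import_order()  # call at the top of the notebook
```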

See Also