| # H100 Jupyter Notebook Setup | |
| This guide walks you through setting up the OpenEnv Bio Experiment environment on an **NVIDIA H100** Jupyter notebook instance (e.g., Jupiter Labs, Lambda Labs, RunPod, or similar). | |
| ## Prerequisites | |
| - **Python** 3.10, 3.11, or **3.12** (3.12 recommended for H100; 3.13 is not supported—numba, vllm, and others require <3.13) | |
| - **uv** – fast Python package manager ([install instructions](#installing-uv)) | |
| - **NVIDIA driver** ≥ 535.104.05 (usually pre-installed on H100 instances) | |
| - **CUDA** – H100 uses CUDA 12.x; PyTorch wheels bundle the runtime, so a separate CUDA Toolkit is not required | |
| ## Installing uv | |
| If `uv` is not already installed: | |
| ```bash | |
| # Unix/Linux (including Jupiter notebook terminals) | |
| curl -LsSf https://astral.sh/uv/install.sh | sh | |
| # Or with pip | |
| pip install uv | |
| ``` | |
| Verify: | |
| ```bash | |
| uv --version | |
| ``` | |
| ## Quick Setup (Recommended) | |
| ### 1. Clone and enter the project | |
| ```bash | |
| git clone <repository-url> OpenENV-Hackathon | |
| cd OpenENV-Hackathon | |
| ``` | |
| ### 2. Use uv's auto PyTorch backend | |
| The project uses Python 3.12 (see `.python-version`). uv will create a 3.12 venv. For H100 (CUDA 12.x): | |
| ```bash | |
| # Install everything: core + training (TRL, transformers, torch) + Jupyter | |
| UV_TORCH_BACKEND=cu128 uv sync --extra train | |
| # Add Unsloth for training_unsloth.py (skips trl downgrade; Unsloth works with TRL 0.29) | |
| uv pip install unsloth unsloth_zoo --no-deps | |
| # (ipykernel is included in --extra train) | |
| ``` | |
| If `UV_TORCH_BACKEND=cu128` fails (e.g., cu128 wheels not available yet), try: | |
| ```bash | |
| UV_TORCH_BACKEND=cu126 uv sync --extra train | |
| ``` | |
| ### 3. Register the environment as a Jupyter kernel | |
| ```bash | |
| uv run python -m ipykernel install --user --name openenv-bio-312 --display-name "OpenEnv Bio (Python 3.12)" | |
| ``` | |
| Or run the helper script (from project root): | |
| ```bash | |
| bash scripts/register_kernel_312.sh | |
| ``` | |
| Then select **"OpenEnv Bio (Python 3.12)"** in the notebook kernel picker. | |
| ### 4. Verify CUDA | |
| In a new Jupyter notebook, select the **"OpenEnv Bio (Python 3.12)"** kernel and run: | |
| ```python | |
| import torch | |
| print(f"PyTorch: {torch.__version__}") | |
| print(f"CUDA available: {torch.cuda.is_available()}") | |
| print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}") | |
| ``` | |
| Expected output (or similar): | |
| ``` | |
| PyTorch: 2.x.x+cu128 | |
| CUDA available: True | |
| GPU: NVIDIA H100 ... | |
| ``` | |
| ### 5. Sanity check the environment | |
| ```bash | |
| uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q | |
| ``` | |
| ## Manual PyTorch CUDA Configuration | |
| If you need explicit control over the PyTorch index (e.g., for reproducibility), add the following to `pyproject.toml`: | |
| ### Add to `pyproject.toml` | |
| ```toml | |
| # After [tool.uv], add: | |
| [[tool.uv.index]] | |
| name = "pytorch-cu128" | |
| url = "https://download.pytorch.org/whl/cu128" | |
| explicit = true | |
| [tool.uv.sources] | |
| torch = [{ index = "pytorch-cu128" }] | |
| torchvision = [{ index = "pytorch-cu128" }] | |
| ``` | |
| Then run: | |
| ```bash | |
| uv sync --extra train | |
| ``` | |
| For CUDA 12.6 instead of 12.8, use `cu126` in the index URL and source names. | |
| ## Dependency Groups | |
| | uv sync flag | Contents | | |
| |-------------------|--------------------------------------------------------------------------| | |
| | *(default)* | Core: `openenv-core`, `numpy`, `scipy`, `pydantic` | | |
| | `--extra dev` | Testing: `pytest`, `pytest-cov` | | |
| | `--extra train` | Training: `torch`, `transformers`, `trl`, `accelerate`, `peft`, `unsloth`, etc. | | |
| | `--extra bio` | Bioinformatics: `scanpy`, `biopython`, `gseapy` | | |
| | `--extra train --extra dev` | Combined for development + training | | |
| ## Preferred H100 Workflow | |
| On H100, use the quantized Unsloth entrypoints: | |
| ```bash | |
| uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run | |
| uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b | |
| uv run python run_agent_unsloth.py | |
| ``` | |
| The checked-in `inference.ipynb` notebook uses `training_unsloth.py` helpers with 4-bit loading. vLLM fast inference is disabled to avoid dependency conflicts. | |
| ## Running Training in a Jupyter Notebook | |
| Example cell: | |
| ```python | |
| # In a notebook with the OpenEnv Bio (Python 3.12) kernel | |
| !uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run | |
| ``` | |
| Or run interactively from Python: | |
| ```python | |
| import subprocess | |
| subprocess.run([ | |
| "uv", "run", "python", "training_unsloth.py", | |
| "--model-id", "Qwen/Qwen3-4B-Base", | |
| "--output-dir", "training/grpo-unsloth-qwen3-4b", | |
| ], check=True) | |
| ``` | |
| ## Requirements Summary | |
| | Component | Version / Notes | | |
| |----------------|------------------------------------------------------| | |
| | Python | 3.10–3.12 (3.12 recommended; 3.13 not supported) | | |
| | uv | ≥ 0.5.3 (for PyTorch index support) | | |
| | torch | ≥ 2.10.0 (cu128 or cu126 for H100) | | |
| | transformers | ≥4.57 (with unsloth≥2025.10.14) | | |
| | trl | ≥ 0.29.0 | | |
| | accelerate | ≥ 1.13.0 | | |
| | Jupyter | Optional, for notebook workflows | | |
| ## Troubleshooting | |
| ### `RuntimeError: Cannot install on Python version 3.13.x` or numba / setup.py errors | |
| Python 3.13 is not supported (numba, vllm, and other deps require <3.13). Use Python 3.12: | |
| ```bash | |
| # With uv: ensure Python 3.12 is available, then sync | |
| uv python install 3.12 | |
| uv sync --extra train | |
| # Or create venv explicitly with 3.12 | |
| uv venv --python 3.12 | |
| UV_TORCH_BACKEND=cu128 uv sync --extra train | |
| ``` | |
| The project's `.python-version` file pins 3.12; uv will use it when creating the venv. | |
| ### `torch.cuda.is_available()` is False | |
| - Confirm the Jupyter kernel is the one where you ran `uv sync` (the one with `ipykernel`). | |
| - Ensure no CPU-only PyTorch is overriding the CUDA build (e.g., from a different conda/pip env). | |
| - Run `uv run python -c "import torch; print(torch.__file__)"` to verify PyTorch comes from your project venv. | |
| ### Flash Attention / causal-conv fallback warnings | |
| These are common and usually harmless; execution continues with a slower path. For best H100 performance, ensure `transformers` and `torch` are recent versions that support Flash Attention 2. | |
| ### HuggingFace symlink warnings | |
| Set: | |
| ```bash | |
| export HF_HUB_DISABLE_SYMLINKS_WARNING=1 | |
| ``` | |
| ### Out-of-memory during training | |
| - Reduce `--num-generations` or `--rollout-steps`. | |
| - Use a smaller model (e.g., `Qwen/Qwen3.5-0.8B`) for experiments. | |
| - Keep `--disable-4bit` off unless you explicitly need wider weights. | |
| ### `ModuleNotFoundError: No module named 'vllm.lora.models'` | |
| Unsloth's `unsloth_zoo` imports vLLM at load time and expects `vllm.lora.models`, which some vLLM versions don't have. Fix by installing a compatible vLLM: | |
| ```bash | |
| pip install "vllm==0.8.2" | |
| # or | |
| pip install "vllm==0.7.3" | |
| ``` | |
| **Note:** vLLM 0.8.2 pins `torch==2.6.0`, which conflicts with this project's `torch>=2.10.0`. If you hit that conflict: | |
| 1. Use a **separate environment** with torch 2.6–2.8 + vllm 0.8.2 + unsloth. | |
| 2. Or use the non-Unsloth path (`training_script.py` / `train.ipynb`) which doesn't depend on vLLM. | |
| ### `KeyError: 'qwen3_5'` / Qwen3.5 not supported | |
| Qwen3.5 requires transformers 5.x. With transformers 4.57, use **Qwen2.5** instead: | |
| - `unsloth/Qwen2.5-3B-Instruct-bnb-4bit` | |
| - `unsloth/Qwen2.5-7B-Instruct-bnb-4bit` | |
| - `Qwen/Qwen2.5-3B-Instruct` | |
| ### `NameError: name 'PreTrainedConfig' is not defined` / `check_model_inputs` ImportError | |
| Use unsloth≥2025.10.14 (PreTrainedConfig fix) with transformers≥4.57 (check_model_inputs). Run `uv sync --extra train` to get compatible versions. | |
| ### `ImportError: cannot import name 'ConstantLengthDataset' from 'trl.trainer.utils'` | |
| unsloth_zoo expects TRL <0.20. The project pins `trl>=0.19.0,<0.20`. If you see this error, ensure you've run `uv sync --extra train` so the locked trl version is used. Alternatively, try: | |
| ```bash | |
| pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo | |
| ``` | |
| (A newer unsloth_zoo may fix this and allow TRL 0.20+.) | |
| ### Unsloth import order warning | |
| If you see "Unsloth should be imported before trl, transformers, peft", ensure `training_unsloth` is imported before `training_script` in your notebook: | |
| ```python | |
| from training_unsloth import make_training_args, run_training # first | |
| import training_script as base | |
| ``` | |
| ## See Also | |
| - Main [README.md](README.md) for project overview, APIs, and usage | |
| - [uv PyTorch guide](https://docs.astral.sh/uv/guides/integration/pytorch/) for advanced PyTorch configuration | |