# H100 Jupyter Notebook Setup
This guide walks you through setting up the OpenEnv Bio Experiment environment on an **NVIDIA H100** Jupyter notebook instance (e.g., a hosted JupyterLab, Lambda Labs, RunPod, or similar).
## Prerequisites
- **Python** 3.10, 3.11, or **3.12** (3.12 recommended for H100; 3.13 is not supported—numba, vllm, and others require <3.13)
- **uv** – fast Python package manager ([install instructions](#installing-uv))
- **NVIDIA driver** ≥ 535.104.05 (usually pre-installed on H100 instances)
- **CUDA** – H100 uses CUDA 12.x; PyTorch wheels bundle the runtime, so a separate CUDA Toolkit is not required
## Installing uv
If `uv` is not already installed:
```bash
# Unix/Linux (including Jupyter notebook terminals)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or with pip
pip install uv
```
Verify:
```bash
uv --version
```
## Quick Setup (Recommended)
### 1. Clone and enter the project
```bash
git clone <repository-url> OpenENV-Hackathon
cd OpenENV-Hackathon
```
### 2. Use uv's auto PyTorch backend
The project uses Python 3.12 (see `.python-version`). uv will create a 3.12 venv. For H100 (CUDA 12.x):
```bash
# Install everything: core + training (TRL, transformers, torch) + Jupyter
UV_TORCH_BACKEND=cu128 uv sync --extra train
# Add Unsloth for training_unsloth.py (skips trl downgrade; Unsloth works with TRL 0.29)
uv pip install unsloth unsloth_zoo --no-deps
# (ipykernel is included in --extra train)
```
If `UV_TORCH_BACKEND=cu128` fails (e.g., cu128 wheels not available yet), try:
```bash
UV_TORCH_BACKEND=cu126 uv sync --extra train
```
### 3. Register the environment as a Jupyter kernel
```bash
uv run python -m ipykernel install --user --name openenv-bio-312 --display-name "OpenEnv Bio (Python 3.12)"
```
Or run the helper script (from project root):
```bash
bash scripts/register_kernel_312.sh
```
Then select **"OpenEnv Bio (Python 3.12)"** in the notebook kernel picker.
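If the kernel does not show up in the picker, you can inspect the per-user kernelspec directory directly. The sketch below assumes the default Linux Jupyter data directory (`~/.local/share/jupyter/kernels`, i.e. `JUPYTER_DATA_DIR` unset); `list_user_kernels` is a hypothetical helper for illustration:

```python
import json
from pathlib import Path

def list_user_kernels(base: Path = Path.home() / ".local/share/jupyter/kernels") -> dict:
    """Return {kernel_name: display_name} for per-user Jupyter kernelspecs."""
    kernels = {}
    for spec in sorted(base.glob("*/kernel.json")):
        info = json.loads(spec.read_text())
        kernels[spec.parent.name] = info.get("display_name", spec.parent.name)
    return kernels

if __name__ == "__main__":
    for name, display in list_user_kernels().items():
        print(f"{name}: {display}")
```

After step 3 succeeds, `openenv-bio-312` should appear in this listing (equivalently, run `jupyter kernelspec list`).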
### 4. Verify CUDA
In a new Jupyter notebook, select the **"OpenEnv Bio (Python 3.12)"** kernel and run:
```python
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
```
Expected output (or similar):
```
PyTorch: 2.x.x+cu128
CUDA available: True
GPU: NVIDIA H100 ...
```
### 5. Sanity check the environment
```bash
uv run pytest tests/test_environment.py tests/test_literature_benchmark.py -q
```
## Manual PyTorch CUDA Configuration
If you need explicit control over the PyTorch index (e.g., for reproducibility), add the following to `pyproject.toml`:
### Add to `pyproject.toml`
```toml
# After [tool.uv], add:
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
[tool.uv.sources]
torch = [{ index = "pytorch-cu128" }]
torchvision = [{ index = "pytorch-cu128" }]
```
Then run:
```bash
uv sync --extra train
```
For CUDA 12.6 instead of 12.8, use `cu126` in the index URL and source names.
## Dependency Groups
| uv sync flag | Contents |
|-------------------|--------------------------------------------------------------------------|
| *(default)* | Core: `openenv-core`, `numpy`, `scipy`, `pydantic` |
| `--extra dev` | Testing: `pytest`, `pytest-cov` |
| `--extra train` | Training: `torch`, `transformers`, `trl`, `accelerate`, `peft`, `unsloth`, etc. |
| `--extra bio` | Bioinformatics: `scanpy`, `biopython`, `gseapy` |
| `--extra train --extra dev` | Combined for development + training |
## Preferred H100 Workflow
On H100, use the quantized Unsloth entrypoints:
```bash
uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run
uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b
uv run python run_agent_unsloth.py
```
The checked-in `inference.ipynb` notebook uses `training_unsloth.py` helpers with 4-bit loading. vLLM fast inference is disabled to avoid dependency conflicts.
## Running Training in a Jupyter Notebook
Example cell:
```python
# In a notebook with the OpenEnv Bio (Python 3.12) kernel
!uv run python training_unsloth.py --model-id Qwen/Qwen3-4B-Base --output-dir training/grpo-unsloth-qwen3-4b --dry-run
```
Or run interactively from Python:
```python
import subprocess
subprocess.run([
"uv", "run", "python", "training_unsloth.py",
"--model-id", "Qwen/Qwen3-4B-Base",
"--output-dir", "training/grpo-unsloth-qwen3-4b",
], check=True)
```
## Requirements Summary
| Component | Version / Notes |
|----------------|------------------------------------------------------|
| Python | 3.10–3.12 (3.12 recommended; 3.13 not supported) |
| uv | ≥ 0.5.3 (for PyTorch index support) |
| torch | ≥ 2.10.0 (cu128 or cu126 for H100) |
| transformers   | ≥ 4.57 (with unsloth ≥ 2025.10.14)                   |
| trl | ≥ 0.29.0 |
| accelerate | ≥ 1.13.0 |
| Jupyter | Optional, for notebook workflows |
## Troubleshooting
### `RuntimeError: Cannot install on Python version 3.13.x` or numba / setup.py errors
Python 3.13 is not supported (numba, vllm, and other deps require <3.13). Use Python 3.12:
```bash
# With uv: ensure Python 3.12 is available, then sync
uv python install 3.12
uv sync --extra train
# Or create venv explicitly with 3.12
uv venv --python 3.12
UV_TORCH_BACKEND=cu128 uv sync --extra train
```
The project's `.python-version` file pins 3.12; uv will use it when creating the venv.
### `torch.cuda.is_available()` is False
- Confirm the Jupyter kernel is the one where you ran `uv sync` (the one with `ipykernel`).
- Ensure no CPU-only PyTorch is overriding the CUDA build (e.g., from a different conda/pip env).
- Run `uv run python -c "import torch; print(torch.__file__)"` to verify PyTorch comes from your project venv.
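The first two checks can be collapsed into one cell: in a virtual environment, `sys.prefix` differs from `sys.base_prefix`, and `sys.executable` shows which interpreter the kernel is actually using. A minimal sketch:

```python
import sys

def in_virtualenv() -> bool:
    """True when running inside a venv (sys.prefix diverges from the base install)."""
    return sys.prefix != sys.base_prefix

if __name__ == "__main__":
    print("venv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If `venv active` is `False`, or the interpreter path points outside the project's `.venv`, the kernel was registered against the wrong environment.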
### Flash Attention / causal-conv fallback warnings
These are common and usually harmless; execution continues with a slower path. For best H100 performance, ensure `transformers` and `torch` are recent versions that support Flash Attention 2.
### HuggingFace symlink warnings
Set:
```bash
export HF_HUB_DISABLE_SYMLINKS_WARNING=1
```
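In a notebook, setting the variable from Python is equivalent, as long as it runs before the first `huggingface_hub`/`transformers` import:

```python
import os

# Same effect as the shell export above; must run before importing
# huggingface_hub or transformers, which read it at import time.
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
```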
### Out-of-memory during training
- Reduce `--num-generations` or `--rollout-steps`.
- Use a smaller model (e.g., `Qwen/Qwen3.5-0.8B`) for experiments.
- Keep `--disable-4bit` off unless you explicitly need full-precision (16-bit) weights; 4-bit loading substantially reduces GPU memory use.
### `ModuleNotFoundError: No module named 'vllm.lora.models'`
Unsloth's `unsloth_zoo` imports vLLM at load time and expects `vllm.lora.models`, which some vLLM versions don't have. Fix by installing a compatible vLLM:
```bash
pip install "vllm==0.8.2"
# or
pip install "vllm==0.7.3"
```
**Note:** vLLM 0.8.2 pins `torch==2.6.0`, which conflicts with this project's `torch>=2.10.0`. If you hit that conflict:
1. Use a **separate environment** with torch 2.6–2.8 + vllm 0.8.2 + unsloth.
2. Or use the non-Unsloth path (`training_script.py` / `train.ipynb`) which doesn't depend on vLLM.
### `KeyError: 'qwen3_5'` / Qwen3.5 not supported
Qwen3.5 requires transformers 5.x. With transformers 4.57, use **Qwen2.5** instead:
- `unsloth/Qwen2.5-3B-Instruct-bnb-4bit`
- `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`
- `Qwen/Qwen2.5-3B-Instruct`
### `NameError: name 'PreTrainedConfig' is not defined` / `check_model_inputs` ImportError
Use unsloth≥2025.10.14 (PreTrainedConfig fix) with transformers≥4.57 (check_model_inputs). Run `uv sync --extra train` to get compatible versions.
### `ImportError: cannot import name 'ConstantLengthDataset' from 'trl.trainer.utils'`
unsloth_zoo expects TRL <0.20. The project pins `trl>=0.19.0,<0.20`. If you see this error, ensure you've run `uv sync --extra train` so the locked trl version is used. Alternatively, try:
```bash
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo
```
(A newer unsloth_zoo may fix this and allow TRL 0.20+.)
### Unsloth import order warning
If you see "Unsloth should be imported before trl, transformers, peft", ensure `training_unsloth` is imported before `training_script` in your notebook:
```python
from training_unsloth import make_training_args, run_training # first
import training_script as base
```
## See Also
- Main [README.md](README.md) for project overview, APIs, and usage
- [uv PyTorch guide](https://docs.astral.sh/uv/guides/integration/pytorch/) for advanced PyTorch configuration