replicalab / docs /max /training_connection.md
maxxie114's picture
Initial HF Spaces deployment
80d8c84
# Training Connection Guide
This note closes `TRN 11`: how notebook-based training code should connect to
the ReplicaLab environment, which URLs to use, which client transport to
prefer, which secrets matter, and what to check first when a connection fails.
## Preferred Connection Order
Use the environment in this order:
1. Local backend for smoke tests and fast debugging
2. Hosted Hugging Face Space for shared team validation
3. H100 notebook runtime for training compute
The notebook runtime and the environment server are separate concerns. The
notebook supplies compute; the environment server supplies `reset`, `step`,
`state`, and `replay`.
## Base URLs
### Local
- REST base URL: `http://localhost:7860`
- WebSocket URL: `ws://localhost:7860/ws`
### Hosted
- Space page: `https://huggingface.co/spaces/ayushozha/replicalab`
- REST base URL: `https://ayushozha-replicalab.hf.space`
- WebSocket URL: `wss://ayushozha-replicalab.hf.space/ws`
## Which Transport To Use
Prefer `transport="rest"` first in notebooks:
- easier to debug with plain responses
- simpler error handling
- easier to reproduce single-step failures
Use `transport="websocket"` when you specifically want:
- long-lived per-connection sessions
- parity with frontend interactive behavior
- lower-overhead repeated `step()` calls after reset
## Required Secrets
### For environment access
No secret is required to talk to the current deterministic environment when it
is publicly reachable.
### For model downloads in notebook training
- `HF_TOKEN`
- needed for gated model downloads and authenticated Hugging Face access
- `REPLICALAB_URL`
- optional convenience variable for the environment base URL
- defaults can still be hardcoded in a notebook cell
### Important security note
Do not commit notebook URLs, notebook passwords, or temporary runtime access
links to the repo. Keep notebook credentials out-of-band.
## Minimal Client Usage
### Direct environment client
```python
import os
from replicalab.agents import build_baseline_scientist_action
from replicalab.client import ReplicaLabClient
base_url = os.getenv("REPLICALAB_URL", "http://localhost:7860")
with ReplicaLabClient(base_url, transport="rest") as client:
observation = client.reset(seed=42, scenario="ml_benchmark", difficulty="easy")
result = client.step(build_baseline_scientist_action(observation.scientist))
print(result.reward, result.done, result.info.verdict)
```
### Rollout worker
```python
import os
from replicalab.agents import build_baseline_scientist_action
from replicalab.client import ReplicaLabClient
from replicalab.training import RolloutWorker
base_url = os.getenv("REPLICALAB_URL", "http://localhost:7860")
with ReplicaLabClient(base_url, transport="rest") as client:
worker = RolloutWorker(client)
episode = worker.rollout(
build_baseline_scientist_action,
seed=42,
scenario="ml_benchmark",
difficulty="easy",
)
print(episode.total_reward, episode.verdict, episode.rounds_used)
```
## Troubleshooting
### `GET /` returns 404 or a simple landing page
That is not the training interface. The environment lives behind:
- `/health`
- `/scenarios`
- `/reset`
- `/step`
- `/ws`
### `Call reset() before step()`
The client has no active session yet. Always call `reset()` first.
### `404` on `/step`
Usually means the `session_id` is stale or the server restarted. Call `reset()`
again and start a fresh episode.
### WebSocket disconnects or times out
Retry with REST first. If REST works and WebSocket does not, the problem is
usually transport-specific rather than environment-specific.
### Space is up but root path looks broken
Check `GET /health` and `GET /scenarios` directly. The Space can be healthy
even if the root route is only a small landing page.
### Hugging Face Space is slow on the first request
Cold starts are expected on the free tier. Retry after the Space has fully
started.
### Notebook can download models but cannot reach the env
Verify:
1. `REPLICALAB_URL` points to the correct server
2. local server is running on port `7860` or the HF Space is healthy
3. you are using the matching transport (`rest` vs `websocket`)
## Relationship To Other Docs
- Deployment and hosted verification: [deployment.md](C:\Users\ayush\Desktop\Hackathons\replicalab-ai\docs\max\deployment.md)
- Client implementation: [client.py](C:\Users\ayush\Desktop\Hackathons\replicalab-ai\replicalab\client.py)
- Rollout implementation: [rollout.py](C:\Users\ayush\Desktop\Hackathons\replicalab-ai\replicalab\training\rollout.py)
This file is the notebook-facing connection note. Deployment-specific secret
management and HF Space operations remain in `docs/max/deployment.md`.