
Training Connection Guide

This note closes TRN 11: how notebook-based training code should connect to the ReplicaLab environment, which URLs to use, which client transport to prefer, which secrets matter, and what to check first when a connection fails.

Preferred Connection Order

Work through these setups in order:

  1. Local backend for smoke tests and fast debugging
  2. Hosted Hugging Face Space for shared team validation
  3. H100 notebook runtime for training compute

The notebook runtime and the environment server are separate concerns. The notebook supplies compute; the environment server supplies reset, step, state, and replay.

Base URLs

Local

  • REST base URL: http://localhost:7860
  • WebSocket URL: ws://localhost:7860/ws

Hosted

  • Space page: https://huggingface.co/spaces/ayushozha/replicalab
  • REST base URL: https://ayushozha-replicalab.hf.space
  • WebSocket URL: wss://ayushozha-replicalab.hf.space/ws
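
The REST and WebSocket URLs above differ only in scheme and the /ws path, so one can be derived from the other. A minimal sketch of that mapping, using only the standard library; the helper name websocket_url is hypothetical and not part of the ReplicaLab client:

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_url(rest_base_url: str, path: str = "/ws") -> str:
    """Map an http(s) REST base URL to the matching ws(s) URL."""
    parts = urlsplit(rest_base_url)
    # http pairs with ws, https pairs with wss (TLS on both sides).
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"expected an http or https URL, got {rest_base_url!r}")
    # Keep host and port, replace any existing path with the WebSocket path.
    return urlunsplit((scheme, parts.netloc, path, "", ""))


print(websocket_url("http://localhost:7860"))
print(websocket_url("https://ayushozha-replicalab.hf.space"))
```

Deriving the WebSocket URL from a single base URL keeps local and hosted configurations from drifting apart.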

Which Transport To Use

Prefer transport="rest" first in notebooks:

  • easier to debug with plain responses
  • simpler error handling
  • easier to reproduce single-step failures

Use transport="websocket" when you specifically want:

  • long-lived per-connection sessions
  • parity with frontend interactive behavior
  • lower-overhead repeated step() calls after reset

Required Secrets

For environment access

No secret is required to talk to the current deterministic environment when it is publicly reachable.

For notebook training

  • HF_TOKEN
    • needed for gated model downloads and authenticated Hugging Face access
  • REPLICALAB_URL
    • optional convenience variable for the environment base URL
    • a default can still be hardcoded in a notebook cell when the variable is unset
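
One way to read both variables in a notebook cell is sketched below. The NotebookConfig class and load_config helper are hypothetical names used for illustration; only the HF_TOKEN and REPLICALAB_URL variable names and the local default URL come from this guide.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional

# Local backend default from this guide; used when REPLICALAB_URL is unset.
DEFAULT_REPLICALAB_URL = "http://localhost:7860"


@dataclass(frozen=True)
class NotebookConfig:
    replicalab_url: str
    # repr=False keeps the token out of accidental prints and cell outputs.
    hf_token: Optional[str] = field(default=None, repr=False)


def load_config(env: Optional[Mapping[str, str]] = None) -> NotebookConfig:
    """Read connection settings from the environment (or a mapping for tests)."""
    env = os.environ if env is None else env
    url = env.get("REPLICALAB_URL", DEFAULT_REPLICALAB_URL).rstrip("/")
    token = env.get("HF_TOKEN") or None  # treat an empty string as "not set"
    return NotebookConfig(replicalab_url=url, hf_token=token)


config = load_config()
print("environment URL:", config.replicalab_url)
print("HF_TOKEN set:", config.hf_token is not None)
```

Printing only whether the token is set, never its value, keeps it out of saved notebook outputs.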

Important security note

Do not commit notebook URLs, notebook passwords, or temporary runtime access links to the repo. Keep notebook credentials out-of-band.

Minimal Client Usage

Direct environment client

import os

from replicalab.agents import build_baseline_scientist_action
from replicalab.client import ReplicaLabClient

# Fall back to the local backend when REPLICALAB_URL is not set.
base_url = os.getenv("REPLICALAB_URL", "http://localhost:7860")

with ReplicaLabClient(base_url, transport="rest") as client:
    # reset() starts a session and returns the first observation.
    observation = client.reset(seed=42, scenario="ml_benchmark", difficulty="easy")
    # Take a single step using the baseline scientist policy.
    result = client.step(build_baseline_scientist_action(observation.scientist))
    print(result.reward, result.done, result.info.verdict)

Rollout worker

import os

from replicalab.agents import build_baseline_scientist_action
from replicalab.client import ReplicaLabClient
from replicalab.training import RolloutWorker

# Fall back to the local backend when REPLICALAB_URL is not set.
base_url = os.getenv("REPLICALAB_URL", "http://localhost:7860")

with ReplicaLabClient(base_url, transport="rest") as client:
    worker = RolloutWorker(client)
    # Run one rollout with the baseline scientist policy.
    episode = worker.rollout(
        build_baseline_scientist_action,
        seed=42,
        scenario="ml_benchmark",
        difficulty="easy",
    )
    # Summarize the episode outcome.
    print(episode.total_reward, episode.verdict, episode.rounds_used)

Troubleshooting

GET / returns 404 or a simple landing page

That is not the training interface. The environment lives behind:

  • /health
  • /scenarios
  • /reset
  • /step
  • /ws

Call reset() before step()

This error means the client has no active session yet. Call reset() once before the first step() of every episode.

404 on /step

This usually means the session_id is stale or the server restarted. Call reset() again and start a fresh episode.
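
The recovery can be sketched as a small wrapper. This is an illustration under assumptions: the helper name step_or_restart is hypothetical, and the exception type the client raises on a 404 is not documented here, so the caller passes it in as stale_errors. Only reset() and step() are taken from the client usage shown earlier.

```python
from typing import Any, Callable, Tuple, Type


def step_or_restart(
    client: Any,
    action: Any,
    policy: Callable[[Any], Any],
    stale_errors: Tuple[Type[BaseException], ...],
    **reset_kwargs: Any,
) -> Tuple[Any, bool]:
    """Try step(action); on a stale-session error, reset and step a fresh episode.

    Returns (step result, restarted), where restarted is True when the
    episode had to be started over.
    """
    try:
        return client.step(action), False
    except stale_errors:
        # The old session is gone, so the old action no longer applies.
        # Start a new episode and let the policy act on the new observation.
        observation = client.reset(**reset_kwargs)
        return client.step(policy(observation)), True
```

With the baseline agent, policy could be a lambda that applies build_baseline_scientist_action to observation.scientist. Callers should treat a restart as a new episode and discard any partial trajectory from the old one.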

WebSocket disconnects or times out

Retry with REST first. If REST works and WebSocket does not, the problem is usually transport-specific rather than environment-specific.

Space is up but root path looks broken

Check GET /health and GET /scenarios directly. The Space can be healthy even if the root route is only a small landing page.
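
A minimal sketch of that check from a notebook cell, using only the standard library. The function name probe is hypothetical; it reports the HTTP status for each path without assuming anything about the response body.

```python
import urllib.error
import urllib.request
from typing import Dict, Sequence, Union


def probe(
    base_url: str,
    paths: Sequence[str] = ("/health", "/scenarios"),
    timeout: float = 10.0,
) -> Dict[str, Union[int, str]]:
    """Return {path: HTTP status code, or an error description if unreachable}."""
    results: Dict[str, Union[int, str]] = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                results[path] = response.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with a 4xx/5xx status.
            results[path] = exc.code
        except (urllib.error.URLError, OSError) as exc:
            # DNS failure, refused connection, timeout, and similar.
            results[path] = f"unreachable: {exc}"
    return results
```

For example, probe("https://ayushozha-replicalab.hf.space") returns one entry per path. A numeric status means the server responded; an "unreachable" string points at the network or the URL rather than the environment.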

Hugging Face Space is slow on the first request

Cold starts are expected on the free tier. Retry after the Space has fully started.
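
Retrying can be automated with a short backoff loop. The sketch below is generic: wait_until_ready is a hypothetical helper, is_ready is any zero-argument callable you supply (for example, one that requests /health and checks for a 200), and the attempt count and delays are illustrative values, not measured Space start-up times.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    attempts: int = 6,
    initial_delay: float = 2.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() with exponential backoff; return True once it succeeds."""
    delay = initial_delay
    for attempt in range(attempts):
        if is_ready():
            return True
        # Sleep between attempts, but not after the final failed one.
        if attempt < attempts - 1:
            sleep(delay)
            delay *= backoff
    return False
```

With the defaults, the loop waits 2, 4, 8, 16, and 32 seconds between its six attempts, so it gives up after roughly a minute of waiting plus request time.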

Notebook can download models but cannot reach the env

Verify:

  1. REPLICALAB_URL points to the correct server
  2. local server is running on port 7860 or the HF Space is healthy
  3. you are using the matching transport (rest vs websocket)
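
For the local case in item 2, a quick TCP check tells you whether anything is listening on port 7860 before you debug the client. This is a minimal sketch with a hypothetical helper name, port_is_open; it only confirms that the port accepts connections, not that the process behind it is the ReplicaLab server.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 7860, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


print("local server reachable:", port_is_open())
```

For the hosted Space, an HTTP request to /health is the more meaningful check, since the port is always served by the Hugging Face front end.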

Relationship To Other Docs

This file is the notebook-facing connection note. Deployment-specific secret management and HF Space operations remain in docs/max/deployment.md.