planetary-rover-navigation / inference.py

Bhaskar

Final submission: Meta OpenEnv Hackathon

b163d5f about 1 month ago


"""
inference.py — LLM-Driven Inference Script
Planetary Rover Navigation Simulator · Meta PyTorch Hackathon Round 1
=====================================================================

This script connects to the running OpenEnv Docker container, runs
one episode per task (easy → medium → hard), and uses an LLM via the
OpenAI-compatible API to choose an action at every step.

Environment variables (only HF_TOKEN is required; the others have defaults)
---------------------------------------------------------------------------
  API_BASE_URL      Base URL of the OpenAI-compatible endpoint
                    default: "https://api-inference.huggingface.co/v1"
  HF_TOKEN          Bearer token for the LLM endpoint (required)
  MODEL_NAME        Model identifier sent in every chat-completion request
                    default: "meta-llama/Llama-3.3-70B-Instruct"
  LOCAL_IMAGE_NAME  Base URL of the running rover environment server
                    default: "http://localhost:7860"
                    An image name such as "rover-env:latest" only applies
                    when using openenv_core.MyEnvV4Env.from_docker_image,
                    which this script does not do.

Logging format (mandated by hackathon judges)
---------------------------------------------
  [START] task=<task_id> env=<LOCAL_IMAGE_NAME> model=<MODEL_NAME>
  [STEP]  step=<n> action=<json> reward=<float> done=<bool> error=<str|null>
  [END]   success=<bool> steps=<n> score=<float> rewards=<csv>

Exit codes
----------
  0   all three tasks returned score > 0.0
  1   at least one task scored 0.0 (smoke-test failure)
  2   configuration error (HF_TOKEN missing or env server unreachable)
"""

from __future__ import annotations

import asyncio
import json
import math
import os
import re
import sys
import time
from typing import Any

import aiohttp
from openai import AsyncOpenAI

# =============================================================================
# Environment variable resolution
# =============================================================================

API_BASE_URL = os.getenv("API_BASE_URL", "https://api-inference.huggingface.co/v1")
HF_TOKEN = os.getenv("HF_TOKEN")
MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.3-70B-Instruct")
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "http://localhost:7860")

# Strip trailing slash so we can always append a path safely
_BASE_URL: str = LOCAL_IMAGE_NAME.rstrip("/")

# Task execution order (fixed: easy first, hard last)
TASKS: list[str] = ["easy", "medium", "hard"]

# LLM generation parameters
LLM_MAX_TOKENS: int = 256      # action JSON is short; 256 is generous
LLM_TEMPERATURE: float = 0.2   # low temperature → more deterministic navigation
LLM_TIMEOUT: float = 30.0      # seconds before we fall back to a safe action

# Fallback action used when the LLM fails or returns unparseable JSON.
# thrust=0.5 / steering=0.0 / brake=0 is a conservative default:
# it moves the rover straight ahead at half thrust without braking.
FALLBACK_ACTION: dict[str, Any] = {
    "thrust": 0.5,
    "steering": 0.0,
    "brake": 0,
    "vertical_thruster": 0.0,
}

# =============================================================================
# System prompt
# =============================================================================
# Written as a single-shot instruction set. The model must understand:
#   1. Its identity as a rover navigation controller.
#   2. The exact JSON schema it must output — no prose, no markdown.
#   3. The physical meaning of each action field and its bounds.
#   4. Task-specific strategies baked in so it doesn't need to reason
#      from scratch on every step.
#
# Deliberate choices:
#   - "Respond ONLY with a JSON object" is repeated twice — once in the
#     identity block, once at the end — because LLMs tend to add prose
#     when they feel like explaining themselves.
#   - Bounds are stated as hard limits, not suggestions, to prevent the
#     LLM from generating out-of-range floats that would fail Pydantic
#     validation on the server.
#   - The three task strategies are embedded here so the model has policy
#     knowledge at inference time without needing chain-of-thought.

SYSTEM_PROMPT = """\
You are the autonomous navigation controller for a planetary rover.
Your sole responsibility is to output a single JSON action object every turn.
Respond ONLY with a JSON object — no explanation, no markdown, no extra text.

## Action space (exact JSON schema)
{
  "thrust": <float in [0.0, 1.0]>,              // forward drive power
  "steering": <float in [-1.0, 1.0]>,           // -1=hard left, 0=straight, 1=hard right
  "brake": <int 0 or 1>,                        // 1=apply regen braking, 0=drive/coast
  "vertical_thruster": <float in [-0.2, 0.2]>   // vertical adjust (ignored on flat terrain)
}

## Physics you must know
- heading_error = atan2(target_dy, target_dx) - rover_heading
- Normalise heading_error to (-π, π] before using it.
- steering = clamp(heading_error * 2.5, -1.0, 1.0) → P-controller
- Rover only steers when thrust > 0; steering at thrust=0 has no effect.
- Battery depletes every step; brake=1 halves speed and recovers a tiny
  amount of battery — only useful when you would otherwise overshoot.

## Task strategies

EASY (Flat Plains Transit)
- Compute heading to target, set thrust=1.0, steer to correct heading.
- Battery is abundant; never brake unless target_distance < 3 m.
- Scoring: proximity*0.85 + step_efficiency*0.15 → arrive fast.

MEDIUM (Crater Avoidance)
- A ring of obstacles sits between you and the waypoint.
- If nearest_obstacle_distance < 28 m: steer 90° perpendicular to the
  waypoint direction (pick left or right and hold it) until
  nearest_obstacle_distance > 35 m, then resume beeline.
- Each collision costs -0.06 from the final score.
- Scoring: proximity*0.75 + step_efficiency*0.25 - collision_penalty.

HARD (Battery Sprint)
- Starting battery is only 35%. Drain multiplier is ×4.
- On step 1: compute atan2(target_dy, target_dx), lock that heading, NEVER change it.
- Use thrust=1.0 every step. NEVER brake. NEVER deviate.
- Scoring: proximity*0.65 + battery_efficiency*0.35.

Respond ONLY with the JSON object. Nothing else.\
"""

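
# -----------------------------------------------------------------------------
# Illustrative sketch (not called anywhere in this script)
# -----------------------------------------------------------------------------
# The MEDIUM strategy in SYSTEM_PROMPT is a two-threshold (hysteresis) rule:
# start a detour below 28 m and hold it until the nearest obstacle is more
# than 35 m away. The prompt leaves that bookkeeping to the LLM; the function
# below is a minimal sketch of the same rule written out in code. The name
# `_example_detour_step` and its parameters are hypothetical and exist only
# for this illustration.

def _example_detour_step(in_detour: bool, nearest_obstacle_distance: float,
                         enter_below: float = 28.0,
                         exit_above: float = 35.0) -> bool:
    """Return the next detour state given the current state and sensor reading."""
    if not in_detour and nearest_obstacle_distance < enter_below:
        return True       # obstacle too close: begin the detour
    if in_detour and nearest_obstacle_distance > exit_above:
        return False      # clear of the obstacle: resume the beeline
    return in_detour      # between the thresholds: keep the current state

# The gap between the two thresholds is what stops the rover from flipping
# between "detour" and "beeline" on consecutive steps near a single cutoff.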

# =============================================================================
# User prompt builder
# =============================================================================
# Called once per step. Feeds the LLM the minimum state it needs:
#   - Which task (determines which strategy to apply)
#   - Distance and direction to waypoint (primary navigation signal)
#   - Battery (critical for hard; informational for easy/medium)
#   - Nearest obstacle (determines whether to trigger detour for medium)
#   - Current heading and step budget remaining
#
# We deliberately omit the full obstacle_map array (8×3 floats) from the
# prompt because it adds ~200 tokens and the scalar
# nearest_obstacle_distance is sufficient for the FSM-style detour policy
# we describe in the system prompt. If you want the full map, add:
#   f"obstacle_map: {obs['obstacle_map']}\n"

def build_user_prompt(
    task_id: str,
    obs: dict[str, Any],
    step_num: int,
    max_steps: int,
) -> str:
    """
    Build the per-step user message sent to the LLM.

    Parameters
    ----------
    task_id   : "easy" | "medium" | "hard"
    obs       : the Observation dict returned by /reset or /step
    step_num  : current step index (1-based)
    max_steps : step budget for this task

    Returns
    -------
    A compact plain-text string. JSON was considered but plain text is
    more token-efficient and models handle it well for numeric inputs.
    """
    # Extract the fields we feed to the model.
    # target_relative gives (dx, dy) — the vector from rover to waypoint.
    # We compute the exact heading error here so the model only needs to
    # clamp and multiply rather than doing trig from scratch.
    dx = obs["target_relative"]["x"]
    dy = obs["target_relative"]["y"]

    # Heading error in radians, normalised to (-π, π]
    target_heading = math.atan2(dy, dx)
    raw_error = target_heading - obs["rover_heading"]
    while raw_error > math.pi:
        raw_error -= 2 * math.pi
    while raw_error <= -math.pi:
        raw_error += 2 * math.pi

    # Pre-compute the P-controller steering value so the model can adopt
    # it directly or nudge it based on obstacle proximity.
    suggested_steering = max(-1.0, min(1.0, raw_error * 2.5))

    terrain_names = {0: "flat/sand", 1: "rocky", 2: "crater_floor", 3: "crater_rim"}
    terrain_label = terrain_names.get(obs["terrain_type"], "unknown")

    return (
        f"TASK: {task_id}\n"
        f"STEP: {step_num}/{max_steps} "
        f"steps_remaining_norm={obs['steps_remaining_norm']:.3f}\n"
        f"\n"
        f"NAVIGATION\n"
        f" target_distance = {obs['target_distance']:.2f} m\n"
        f" target_dx = {dx:.2f} m\n"
        f" target_dy = {dy:.2f} m\n"
        f" rover_heading = {obs['rover_heading']:.4f} rad\n"
        f" heading_error = {raw_error:.4f} rad\n"
        f" suggested_steering = {suggested_steering:.4f} "
        f"(P-control, clamp to [-1,1])\n"
        f"\n"
        f"POWER\n"
        f" battery_level = {obs['battery_level']:.4f} "
        f"(0=dead, 1=full)\n"
        f" battery_drain_rate = {obs['battery_drain_rate']:.6f} per step\n"
        f"\n"
        f"OBSTACLES\n"
        f" nearest_obstacle_distance = {obs['nearest_obstacle_distance']:.2f} m "
        f"(sensor range=50 m; collision at 0.5 m)\n"
        f" obstacle_count = {obs['obstacle_count']}\n"
        f"\n"
        f"TERRAIN\n"
        f" terrain_type = {obs['terrain_type']} ({terrain_label})\n"
        f" terrain_slope = {obs['terrain_slope']}\n"
        f"\n"
        f"Output your action JSON now.\n"
        f"Remember: ONLY a JSON object, no explanation."
    )


# =============================================================================
# LLM action parser
# =============================================================================
# The LLM is instructed to return raw JSON but will sometimes:
#   (a) wrap it in a markdown code block ```json { ... } ```
#   (b) add a preamble sentence before the JSON
#   (c) return a partial JSON (truncated at max_tokens)
#   (d) use wrong field names (e.g. "steer" instead of "steering")
#   (e) return floats outside the declared bounds
#
# The parser recovers from (a), (b), (d) and (e). Case (c) is recoverable
# only when the object is complete but missing fields; output cut off before
# the closing brace cannot be extracted and yields FALLBACK_ACTION.

def _clamp(v: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, v))


def parse_llm_action(raw_text: str) -> tuple[dict[str, Any], str | None]:
    """
    Parse the LLM's raw text response into a valid Action dict.

    Returns
    -------
    (action_dict, error_str)
        action_dict : always a valid action (fallback if parsing failed)
        error_str   : None if parsing succeeded; human-readable error string
                      if we fell back (this goes into [STEP] error=<str>)

    Strategy
    --------
    Step 1 — Strip markdown fences.
        Models trained with RLHF often wrap JSON in ```json ... ```.
        We remove those first.

    Step 2 — Extract the first { ... } block.
        If the model prepended prose ("Sure, here is my action:"), this
        regex finds the JSON object regardless of what came before it.

    Step 3 — Parse JSON.
        Standard json.loads(). If it fails we try a light repair:
        replace single quotes with double quotes (common LLM mistake).

    Step 4 — Field normalisation.
        Accept common aliases (e.g. "steer" → "steering", "gas" → "thrust").
        Any missing required field is filled from FALLBACK_ACTION.

    Step 5 — Bounds clamping.
        Every float/int is clamped to its declared range so the server's
        Pydantic validation never rejects our action.
    """

    # ── Step 1: strip markdown code fences ───────────────────────────────
    # Handles: ```json\n{...}\n``` and ```\n{...}\n```
    stripped = re.sub(r"```(?:json)?\s*", "", raw_text).replace("```", "").strip()

    # ── Step 2: extract first JSON object ────────────────────────────────
    # [^{}] also matches newlines, so a JSON object spanning several lines
    # is found. Nested objects are not supported (the action schema is flat).
    match = re.search(r"\{[^{}]*\}", stripped)
    if not match:
        # No JSON object found at all — return fallback immediately.
        return FALLBACK_ACTION.copy(), f"no JSON object found in: {raw_text[:80]!r}"

    candidate = match.group(0)

    # ── Step 3: JSON parse with single-quote repair ───────────────────────
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        # Common LLM mistake: single-quoted strings ("'thrust': 0.9")
        repaired = candidate.replace("'", '"')
        try:
            parsed = json.loads(repaired)
        except json.JSONDecodeError as e:
            return FALLBACK_ACTION.copy(), f"JSON parse failed: {e} | text: {candidate[:80]!r}"

    if not isinstance(parsed, dict):
        return FALLBACK_ACTION.copy(), f"parsed JSON is not a dict: {type(parsed)}"

    # ── Step 4: field normalisation / alias resolution ────────────────────
    # Map common LLM hallucinated field names to canonical ones.
    ALIASES: dict[str, str] = {
        "steer": "steering",
        "turn": "steering",
        "yaw": "steering",
        "gas": "thrust",
        "throttle": "thrust",
        "accelerate": "thrust",
        "speed": "thrust",
        "brakes": "brake",
        "braking": "brake",
        "vert": "vertical_thruster",
        "vertical": "vertical_thruster",
        "vertical_thrust": "vertical_thruster",
        "vthruster": "vertical_thruster",
    }
    normalised: dict[str, Any] = {}
    for key, val in parsed.items():
        canonical = ALIASES.get(key.lower().strip(), key.lower().strip())
        normalised[canonical] = val

    # Fill any missing required fields from FALLBACK_ACTION.
    # This makes the parser tolerant of partial JSON outputs.
    action: dict[str, Any] = {}
    missing_fields: list[str] = []
    required_fields = ["thrust", "steering", "brake", "vertical_thruster"]

    for field in required_fields:
        if field in normalised:
            action[field] = normalised[field]
        else:
            action[field] = FALLBACK_ACTION[field]
            missing_fields.append(field)

    # ── Step 5: type coercion and bounds clamping ─────────────────────────
    # The LLM may output "1" (string) instead of 1 (int) for brake,
    # or "0.95" (string) for floats. We coerce first, then clamp.
    coerce_errors: list[str] = []
    try:
        action["thrust"] = _clamp(float(action["thrust"]), 0.0, 1.0)
    except (TypeError, ValueError) as e:
        action["thrust"] = FALLBACK_ACTION["thrust"]
        coerce_errors.append(f"thrust coerce: {e}")

    try:
        action["steering"] = _clamp(float(action["steering"]), -1.0, 1.0)
    except (TypeError, ValueError) as e:
        action["steering"] = FALLBACK_ACTION["steering"]
        coerce_errors.append(f"steering coerce: {e}")

    try:
        # brake must be int 0 or 1. Accept True/False from JSON booleans.
        raw_brake = action["brake"]
        if isinstance(raw_brake, bool):
            action["brake"] = 1 if raw_brake else 0
        else:
            action["brake"] = int(round(float(raw_brake)))
        action["brake"] = max(0, min(1, action["brake"]))
    except (TypeError, ValueError, OverflowError) as e:
        # OverflowError covers round(inf), which json.loads can produce
        # from the non-standard literal Infinity.
        action["brake"] = FALLBACK_ACTION["brake"]
        coerce_errors.append(f"brake coerce: {e}")

    try:
        action["vertical_thruster"] = _clamp(float(action["vertical_thruster"]), -0.2, 0.2)
    except (TypeError, ValueError) as e:
        action["vertical_thruster"] = FALLBACK_ACTION["vertical_thruster"]
        coerce_errors.append(f"vertical_thruster coerce: {e}")

    # Build error string for [STEP] log — null if everything parsed cleanly.
    error_parts: list[str] = []
    if missing_fields:
        error_parts.append(f"missing_fields={missing_fields}")
    if coerce_errors:
        error_parts.append(f"coerce_errors={coerce_errors}")

    error_str = "; ".join(error_parts) if error_parts else None
    return action, error_str
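

# Illustrative sketch (not called anywhere in this script). It repeats the
# fence-stripping, extraction and json.loads steps of parse_llm_action in
# isolation, to show which raw replies survive extraction and which do not.
# `_example_extract_json` is a hypothetical name; the patterns mirror the
# ones used above, with the fence built from single backticks.

import json
import re


def _example_extract_json(raw_text: str):
    """Return the first flat JSON object in raw_text as a dict, or None."""
    fence = "`" * 3
    stripped = re.sub(fence + r"(?:json)?\s*", "", raw_text).strip()
    match = re.search(r"\{[^{}]*\}", stripped)
    if match is None:
        return None       # e.g. output truncated before the closing brace
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# A fenced reply and a reply with a prose preamble both yield a dict, while a
# reply cut off mid-object such as '{"thrust": 0.9, "steer' yields None, which
# is the situation in which parse_llm_action returns FALLBACK_ACTION.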


# =============================================================================
# Logging helpers — exact judge-mandated format
# =============================================================================
# All log lines go to stdout (not stderr) so they are captured by the
# OpenEnv harness. We flush after every write so lines appear immediately
# even when stdout is block-buffered (e.g. piped, or inside Docker without a TTY).

def log_start(task_id: str) -> None:
    """[START] task=<task> env=<LOCAL_IMAGE_NAME> model=<MODEL_NAME>"""
    print(f"[START] task={task_id} env={LOCAL_IMAGE_NAME} model={MODEL_NAME}", flush=True)


def log_step(
    step_num: int,
    action: dict[str, Any],
    reward: float,
    done: bool,
    error: str | None,
) -> None:
    """[STEP] step=<n> action=<json> reward=<float> done=<bool> error=<str|null>"""
    action_json = json.dumps(action, separators=(",", ":"))
    error_val = f'"{error}"' if error else "null"
    print(
        f"[STEP] step={step_num} "
        f"action={action_json} "
        f"reward={reward:.4f} "
        f"done={str(done).lower()} "
        f"error={error_val}",
        flush=True,
    )


def log_end(
    success: bool,
    steps: int,
    score: float,
    rewards: list[float],
) -> None:
    """[END] success=<bool> steps=<n> score=<float> rewards=<csv>"""
    rewards_csv = ",".join(f"{r:.4f}" for r in rewards)
    print(
        f"[END] success={str(success).lower()} "
        f"steps={steps} "
        f"score={score:.4f} "
        f"rewards={rewards_csv}",
        flush=True,
    )
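

# Illustrative sketch (not called anywhere in this script). It shows how a
# [STEP] line in the format above could be read back, for example when
# post-processing a run log. `_example_parse_step_line` and its regular
# expression are assumptions made for this example, not part of the
# hackathon tooling. The pattern relies on the action being a flat JSON
# object and on the reward being a plain decimal number.

import json
import re

_EXAMPLE_STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\{[^{}]*\}) "
    r"reward=(-?\d+\.\d+) done=(true|false) error=(.*)$"
)


def _example_parse_step_line(line: str):
    """Return the fields of one [STEP] line as a dict, or None if it does not match."""
    m = _EXAMPLE_STEP_RE.match(line.strip())
    if m is None:
        return None
    step, action, reward, done, error = m.groups()
    return {
        "step": int(step),
        "action": json.loads(action),
        "reward": float(reward),
        "done": done == "true",
        "error": None if error == "null" else error.strip('"'),
    }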


# =============================================================================
# HTTP helpers — thin async wrappers around aiohttp
# =============================================================================
# We use aiohttp rather than the requests library so the LLM call and
# env poll can be interleaved with asyncio without blocking the event loop.
# Each call is made once with a 30-second total timeout. There is no retry,
# so network and HTTP status errors propagate to the caller.

async def _http_get(session: aiohttp.ClientSession, path: str, **params) -> dict[str, Any]:
    url = f"{_BASE_URL}{path}"
    async with session.get(url, params=params, timeout=aiohttp.ClientTimeout(total=30)) as r:
        r.raise_for_status()
        return await r.json()


async def _http_post(
    session: aiohttp.ClientSession,
    path: str,
    body: dict[str, Any],
    **params,
) -> dict[str, Any]:
    url = f"{_BASE_URL}{path}"
    async with session.post(
        url, json=body, params=params,
        timeout=aiohttp.ClientTimeout(total=30),
    ) as r:
        r.raise_for_status()
        return await r.json()
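

# Illustrative sketch (not called anywhere in this script). _http_get and
# _http_post issue each request exactly once. If a single retry on transient
# failures is wanted, a wrapper along these lines could be layered on top.
# The name `_example_retry_once`, the default exception types and the delay
# are assumptions made for this example; with aiohttp, callers would likely
# pass aiohttp.ClientError in `retry_on` as well.

import asyncio


async def _example_retry_once(make_call, delay: float = 0.5,
                              retry_on: tuple = (OSError, asyncio.TimeoutError)):
    """Await make_call(); on a listed exception, wait `delay` seconds and try once more."""
    try:
        return await make_call()
    except retry_on:
        await asyncio.sleep(delay)
        return await make_call()   # a second failure propagates to the caller

# Possible usage inside run_task (hypothetical):
#   step_resp = await _example_retry_once(
#       lambda: _http_post(session, "/step", action, episode_id=episode_id))
# Retrying /step is only safe if the server treats a repeated request as
# harmless, which this script does not verify.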


# =============================================================================
# LLM call — one async chat completion per step
# =============================================================================

async def llm_action(
    client: AsyncOpenAI,
    task_meta: dict[str, Any],
    obs: dict[str, Any],
    step_num: int,
) -> tuple[dict[str, Any], str | None]:
    """
    Ask the LLM for one action and parse its response.

    Returns (action_dict, error_str).
    error_str is None on clean parse; a short description on fallback.
    """
    user_msg = build_user_prompt(
        task_id=task_meta["id"],
        obs=obs,
        step_num=step_num,
        max_steps=task_meta["max_steps"],
    )

    try:
        response = await asyncio.wait_for(
            client.chat.completions.create(
                model=MODEL_NAME,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_msg},
                ],
                max_tokens=LLM_MAX_TOKENS,
                temperature=LLM_TEMPERATURE,
            ),
            timeout=LLM_TIMEOUT,
        )
        raw_text = response.choices[0].message.content or ""
    except asyncio.TimeoutError:
        return FALLBACK_ACTION.copy(), f"LLM timeout after {LLM_TIMEOUT}s"
    except Exception as e:
        return FALLBACK_ACTION.copy(), f"LLM API error: {type(e).__name__}: {e}"

    return parse_llm_action(raw_text)


# =============================================================================
# Grade a completed episode via /grader
# =============================================================================

async def grade_episode(
    session: aiohttp.ClientSession,
    episode_id: str,
    task_id: str,
    last_info: dict[str, Any],
    last_obs: dict[str, Any],
) -> float:
    """
    POST /grader with the trajectory summary from the final step's info dict.
    Returns the normalised score [0.0, 1.0].
    Falls back to 0.0 if the grader call fails.
    """
    body = {
        "episode_id": episode_id,
        "task_id": task_id,
        "termination_reason": last_info.get("termination_reason", "unknown"),
        "initial_distance": last_info.get("initial_distance", last_obs.get("target_distance", 0.0)),
        "min_distance_achieved": last_info.get("min_distance", last_obs.get("target_distance", 0.0)),
        "waypoints_reached": last_info.get("waypoints_hit", 0),
        "total_waypoints": last_info.get("total_waypoints", 1),
        "steps_taken": last_info.get("steps", 0),
        "max_steps": last_info.get("max_steps", 500),
        "battery_remaining": last_info.get("battery", last_obs.get("battery_level", 0.0)),
        "collision_count": last_info.get("collision_count", 0),
    }
    try:
        resp = await _http_post(session, "/grader", body)
        return float(resp.get("score", 0.0))
    except Exception as e:
        print(f"[WARN] /grader call failed: {e}", flush=True)
        return 0.0


# =============================================================================
# Single task episode runner
# =============================================================================

async def run_task(
    session: aiohttp.ClientSession,
    client: AsyncOpenAI,
    task_meta: dict[str, Any],
) -> float:
    """
    Run one complete episode for the given task.

    Flow
    ----
    1. POST /reset → episode_id, initial obs
    2. [START] log
    3. loop until done or truncated:
        a. call LLM for action
        b. POST /step → obs, reward, done, truncated, info
        c. [STEP] log
    4. POST /grader → score
    5. [END] log
    6. return score
    """
    task_id = task_meta["id"]
    max_steps = task_meta["max_steps"]

    # ── 1. Reset ─────────────────────────────────────────────────────────
    reset_resp = await _http_post(session, "/reset", {"task_id": task_id})
    episode_id = reset_resp["episode_id"]
    obs = reset_resp["obs"]

    # ── 2. START log ──────────────────────────────────────────────────────
    log_start(task_id)

    # Accumulators for [END] log
    rewards: list[float] = []
    step_num: int = 0
    last_info: dict = {}
    last_obs: dict = obs

    # ── 3. Step loop ──────────────────────────────────────────────────────
    while True:
        step_num += 1

        # a. Ask the LLM for an action
        action, parse_error = await llm_action(client, task_meta, obs, step_num)

        # b. Send action to the environment
        try:
            step_resp = await _http_post(
                session, "/step", action,
                episode_id=episode_id,
            )
            obs = step_resp["obs"]
            reward = step_resp["reward"]
            done = step_resp["done"]
            truncated = step_resp["truncated"]
            last_info = step_resp.get("info", {})
            last_obs = obs
            step_error = parse_error  # propagate LLM parse error if any

        except Exception as e:
            # Env step failed — log the error and terminate this episode.
            reward = 0.0
            done = True
            truncated = False
            step_error = f"step HTTP error: {type(e).__name__}: {e}"

        rewards.append(reward)

        # c. [STEP] log
        log_step(step_num, action, reward, done or truncated, step_error)

        if done or truncated:
            break

        # Hard budget guard — should never trigger (server enforces it) but
        # protects against an infinite loop if the server misbehaves.
        if step_num >= max_steps:
            break

    # ── 4. Grade ──────────────────────────────────────────────────────────
    score = await grade_episode(session, episode_id, task_id, last_info, last_obs)

    # ── 5. END log ────────────────────────────────────────────────────────
    success = score > 0.0
    log_end(success=success, steps=step_num, score=score, rewards=rewards)

    return score


# =============================================================================
# Main entry point
# =============================================================================

async def main() -> int:
    """
    Run all three tasks sequentially.
    Returns 0 if every task scored > 0.0, 1 if any task scored 0.0, and
    2 on a configuration error (HF_TOKEN missing or env server unreachable).
    """
    # ── Validate environment variables ────────────────────────────────────
    missing = [v for v in ("API_BASE_URL", "HF_TOKEN", "MODEL_NAME", "LOCAL_IMAGE_NAME")
               if not os.environ.get(v)]
    if missing:
        # Soft warning — we fall back to defaults for most vars, but print
        # the warning so operators know the environment is not fully configured.
        print(
            f"[WARN] The following environment variables are not set and "
            f"defaults will be used: {missing}",
            flush=True,
        )

    if not HF_TOKEN:
        print(
            "[ERROR] HF_TOKEN is required for LLM calls. "
            "Set it as an environment variable and re-run.",
            file=sys.stderr, flush=True,
        )
        return 2

    # ── Initialise clients ────────────────────────────────────────────────
    # AsyncOpenAI is the async variant of the standard openai client.
    # base_url points to any OpenAI-compatible endpoint (HF TGI, vLLM, etc.)
    llm_client = AsyncOpenAI(
        api_key=HF_TOKEN,
        base_url=API_BASE_URL,
    )

    # aiohttp session shared across all HTTP calls to the env server.
    # Connection pool is reused between tasks to avoid reconnect overhead.
    connector = aiohttp.TCPConnector(limit=4)
    async with aiohttp.ClientSession(connector=connector) as http_session:

        # ── Discover tasks from /tasks ────────────────────────────────────
        try:
            tasks_list = await _http_get(http_session, "/tasks")
        except Exception as e:
            print(f"[ERROR] Could not reach env server at {_BASE_URL}/tasks: {e}",
                  file=sys.stderr, flush=True)
            return 2

        # Index task metadata by id for O(1) lookup
        tasks_by_id: dict[str, dict] = {t["id"]: t for t in tasks_list}

        # ── Run each task ─────────────────────────────────────────────────
        scores: dict[str, float] = {}
        for task_id in TASKS:
            if task_id not in tasks_by_id:
                print(f"[WARN] task_id={task_id!r} not found in /tasks response — skipping.",
                      flush=True)
                scores[task_id] = 0.0
                continue

            task_meta = tasks_by_id[task_id]
            try:
                score = await run_task(http_session, llm_client, task_meta)
            except Exception as e:
                print(f"[ERROR] Unhandled exception in task={task_id}: {e}", flush=True)
                log_end(success=False, steps=0, score=0.0, rewards=[])
                score = 0.0
            scores[task_id] = score

            # Brief pause between tasks to let the server drain any in-flight
            # connections before we start the next episode.
            await asyncio.sleep(0.5)

    # ── Summary ───────────────────────────────────────────────────────────
    print("\n" + "=" * 60, flush=True)
    print("INFERENCE COMPLETE", flush=True)
    print("=" * 60, flush=True)
    for task_id, score in scores.items():
        status = "PASS" if score > 0.0 else "FAIL"
        print(f" [{status}] {task_id:6s} score={score:.4f}", flush=True)
    print("=" * 60, flush=True)

    all_passed = all(s > 0.0 for s in scores.values())
    return 0 if all_passed else 1


if __name__ == "__main__":
    sys.exit(asyncio.run(main()))