Luc

preserve medium adapter, evals, and rl train artifacts

9a8acce about 2 months ago

14.3 kB

	diff --git a/CURRENT-RL-SYSTEM.md b/CURRENT-RL-SYSTEM.md
	index 866d1ac..d090b20 100644
	--- a/CURRENT-RL-SYSTEM.md
	+++ b/CURRENT-RL-SYSTEM.md
	@@ -5,23 +5,27 @@ exists today.

	The short version:

	-- this repo does not currently have a learned PPO / GRPO / APPO-style RL
	- training loop,
	-- it does have:
	+- this repo now has a real APPO backend for low-level RL training,
	+- it also has:
	- supervised fine-tuning for a forward model,
	- gameplay rollout code for collecting training data,
	- a task-directed closed-loop control and evaluation harness,
	- - a counterfactual rollout generator for "what if I took action X?" data.
	+ - a counterfactual rollout generator for "what if I took action X?" data,
	+ - a custom NetHack skill environment wired into Sample Factory APPO.

	If you came in expecting a standard RL codebase with rollout workers,
	-advantages, value heads, clipped objectives, and policy updates: that is not
	- what is implemented here yet.
	+advantages, value heads, clipped objectives, and policy updates: that now
	+does exist in early form, but only for the new APPO path. The rest of the
	+repo still contains older non-RL and pre-RL systems.


	## 1. What Is Actually Trained Today

	-The only actual model training path in the repo right now is in
	-[train.py](/home/luc/rl-nethack/train.py).
	+There are now two distinct training paths in the repo.
	+
	+### Path A: forward-model supervised training
	+
	+This is in [train.py](/home/luc/rl-nethack/train.py).

	That script uses:

	@@ -51,6 +55,33 @@ The trainer is configured in [train.py](/home/luc/rl-nethack/train.py):
	There is no RL loss here. No returns, no GAE, no policy ratio, no clipped
	objective, no reward model training, no rollout buffer.

	+### Path B: APPO low-level RL training
	+
	+This is in:
	+
	+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
	+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
	+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
	+
	+This path uses:
	+
	+- Sample Factory
	+- APPO
	+- a custom Gymnasium env wrapping NLE
	+- hand-shaped task rewards projected into a skill-conditioned RL env
	+
	+The APPO path does have:
	+
	+- rollout workers
	+- rollout length
	+- recurrence
	+- actor-critic training
	+- value loss
	+- GAE
	+- PPO-style clipping
	+
	+So the repo now has real learned RL, but only in this new RL subtree.
	+

	## 2. What "Rollout" Means In This Repo

	@@ -65,7 +96,7 @@ In this repo today, a "rollout" usually means:
	- write training/eval examples,
	- or score task behavior.

	-There are three main rollout systems.
	+There are now four main rollout/training systems.


	## 3. System A: LLM / Policy Data Generation
	@@ -305,7 +336,89 @@ This controller does not do:
	It only does one-step counterfactual selection.


	-## 6. Task Rewards: What The Harness Optimizes
	+## 6. System D: Sample Factory APPO Backend
	+
	+This is in:
	+
	+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
	+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
	+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
	+- [rl/env_adapter.py](/home/luc/rl-nethack/rl/env_adapter.py)
	+- [rl/feature_encoder.py](/home/luc/rl-nethack/rl/feature_encoder.py)
	+
	+### What it does
	+
	+It trains a low-level policy with APPO against a custom NetHack skill env.
	+
	+The APPO env currently:
	+
	+- wraps NLE
	+- tracks memory via `MemoryTracker`
	+- tracks an active skill
	+- computes a compact vector observation
	+- uses the repo’s hand-shaped task rewards as the main training signal
	+
	+### Are these multi-turn rollouts?
	+
	+Yes.
	+
	+This is the first true RL rollout system in the repo.
	+
	+Rollouts are controlled by Sample Factory parameters such as:
	+
	+- `num_workers`
	+- `num_envs_per_worker`
	+- `rollout`
	+- `recurrence`
	+
	+These are now real RL rollout knobs, not just data-generation concurrency.
	+
	+### What is learned?
	+
	+Sample Factory builds and trains an actor-critic model.
	+
	+In the current smoke-tested setup it used:
	+
	+- observation space:
	+ - `Dict('obs': Box(-10.0, 10.0, (106,), float32))`
	+- action space:
	+ - `Discrete(13)`
	+- model:
	+ - MLP encoder
	+ - GRU core
	+ - linear policy head
	+ - linear value head
	+
	+### What is the policy optimizing?
	+
	+Right now:
	+
	+- intrinsic/task reward from the repo’s hand-shaped task reward functions
	+- optionally mixed with env reward through:
	+ - `intrinsic_reward_weight`
	+ - `extrinsic_reward_weight`
	+
	+So the APPO path is learned RL, but the reward source is still the old
	+hand-shaped task logic, not a learned reward model.
	+
	+### What is "group size" here?
	+
	+Still no GRPO-style group size.
	+
	+The important RL knobs are now:
	+
	+- `num_workers`
	+- `num_envs_per_worker`
	+- `rollout`
	+- `recurrence`
	+- `batch_size`
	+- `num_batches_per_epoch`
	+- `num_epochs`
	+
	+Those are the real APPO dataflow parameters.
	+
	+
	+## 7. Task Rewards: What The Harness And APPO Env Optimize

	Task rewards are defined in
	[src/task_rewards.py](/home/luc/rl-nethack/src/task_rewards.py).
	@@ -331,10 +444,11 @@ Examples:
	This is important:

	- the harness is already task-conditioned,
	-- but the reward is currently hand-shaped, not learned from feedback.
	+- the APPO env also currently consumes these rewards,
	+- but the reward is still hand-shaped, not learned from feedback.


	-## 7. Evaluation: What We Measure Today
	+## 8. Evaluation: What We Measure Today

	There are currently two very different evaluation paths.

	@@ -375,8 +489,23 @@ This measures trajectory behavior:

	This is the current behavior-level evaluation layer.

	+### APPO smoke validation
	+
	+The new APPO backend has been smoke-tested through a real run launched from
	+`cli.py rl-train-appo`.
	+
	+That run successfully:

	-## 8. What "Batch Size" Means In This Repo
	+- registered the custom env,
	+- initialized the actor-critic,
	+- collected experience,
	+- trained,
	+- and wrote checkpoints/config under `train_dir/rl/...`
	+
	+This is backend validation, not yet a meaningful benchmark of policy quality.
	+
	+
	+## 9. What "Batch Size" Means In This Repo

	There are several unrelated batch-like knobs.

	@@ -414,21 +543,37 @@ In the `vllm-batch` backend:

	This is inference throughput tuning, not RL grouping.

	+### APPO learner batch size
	+
	+In [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py) and
	+[rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py):
	+
	+- `--batch-size`
	+ - APPO minibatch size
	+- `--num-batches-per-epoch`
	+ - how many minibatches are collected before each training iteration
	+- `--ppo-epochs`
	+ - how many passes over that dataset
	+
	+This is now a real RL batch concept in the repo.
	+

	-## 9. What Is Missing If You Expect "Real RL"
	+## 10. What Is Missing If You Expect A Mature RL Stack

	-The repo currently does not have:
	+The repo now does have APPO.
	+
	+But it still does not yet have:

	-- PPO
	- GRPO
	-- APPO
	-- actor-critic training
	- replay buffer / rollout buffer
	-- advantage estimation
	-- value function training
	- reward model training
	-- option / skill policy training
	+- learned option / skill policy conditioning in the model itself
	- learned high-level policy over skills
	+- learned reward models from preferences
	+- robust action masking inside the APPO policy
	+- a richer observation encoder than the current compact 106-dim vector
	+- serious multi-GPU / high-throughput APPO configs for this box
	+- real benchmark results against `task_greedy`

	What it does have is:

	@@ -436,16 +581,18 @@ What it does have is:
	- one-step counterfactual branching
	- hand-shaped task reward functions
	- trajectory evaluation
	+- a real APPO actor-critic training path

	So the current stack is best described as:

	1. collect trajectories,
	2. train a forward model with SFT,
	-3. evaluate closed-loop task behavior with a hand-shaped controller,
	-4. prepare for future RL / planning work.
	+3. train a low-level APPO policy on hand-shaped task reward,
	+4. evaluate closed-loop task behavior with both a hand-shaped controller and a learned RL backend,
	+5. prepare for hierarchical / reward-model / scheduler work.


	-## 10. Practical Answers To Your Specific Questions
	+## 11. Practical Answers To Your Specific Questions

	### "How many rollouts are there?"

	@@ -459,12 +606,15 @@ Depends on which subsystem you mean.
	- `generate_counterfactual_data.py`
	- one outer rollout per game
	- plus up to 9 one-step branches per interesting state
	+- `rl-train-appo`
	+ - `num_workers * num_envs_per_worker` live RL rollouts in parallel

	### "Are these multi-turn rollouts?"

	- forward-data generation games: yes
	- task-harness episodes: yes
	- counterfactual branches: no, one-step only
	+- APPO env rollouts: yes
	- SFT training itself: not rollouts at all

	### "What is the group size?"
	@@ -479,25 +629,37 @@ Closest answers:
	- up to 9 actions
	- rollout concurrency for data generation:
	- `--workers`
	+- APPO parallel rollouts:
	+ - `num_workers * num_envs_per_worker`
	- SFT global batch:
	- `batch_size * grad_accum * world_size`
	+- APPO learner batch:
	+ - `batch_size`, `num_batches_per_epoch`, `num_epochs`

	### "Is there any learned policy right now?"

	-Not in the main path.
	+Yes.

	-The main trained model is a forward prediction LM.
	-The current task controller is algorithmic, not learned.
	+There are now two learned model families:

	+- the forward prediction LM from [train.py](/home/luc/rl-nethack/train.py)
	+- the APPO low-level policy/value model from the new `rl/` stack

	-## 11. Recommended Mental Model For This Repo
	+The current task controller in [src/task_harness.py](/home/luc/rl-nethack/src/task_harness.py)
	+is still algorithmic.
	+
	+
	+## 12. Recommended Mental Model For This Repo

	If you want the correct mental model, think of the repo as:

	- today
	- - forward-model training project with control/eval scaffolding
	+ - hybrid project with:
	+ - forward-model SFT,
	+ - a hand-shaped control harness,
	+ - and a first real APPO RL backend
	- not yet
	- - full RL agent training system
	+ - full hierarchical skill RL system with learned rewards and scheduler

	The most important files for understanding that are:

	@@ -507,17 +669,21 @@ The most important files for understanding that are:
	- [src/task_harness.py](/home/luc/rl-nethack/src/task_harness.py)
	- [src/task_rewards.py](/home/luc/rl-nethack/src/task_rewards.py)
	- [src/evaluator.py](/home/luc/rl-nethack/src/evaluator.py)
	+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
	+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
	+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)


	-## 12. If We Wanted To Make This A Real RL System Next
	+## 13. If We Wanted To Mature This RL System Next

	The current code naturally points to this progression:

	1. keep the current task rewards and task evaluation,
	-2. use counterfactual rollouts to label better/worse action outcomes,
	-3. train a reward or value model per task,
	-4. replace one-step brute-force action selection with a learned scorer,
	-5. then add real policy optimization if needed.
	+2. benchmark APPO against `task_greedy`,
	+3. improve observations / model architecture,
	+4. use counterfactual rollouts to label better/worse action outcomes,
	+5. train a reward or value model per task,
	+6. move from flat task-conditioned APPO toward full options / scheduler hierarchy

	That would be the first point where terms like:

	@@ -529,4 +695,3 @@ That would be the first point where terms like:
	- group size

	become central in the usual RL sense.
	-
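Reviewer note on the APPO dataflow knobs described in the document diff above: the arithmetic relating them can be sketched as below. This is a minimal illustration, not code from the repo. The `AppoDataflow` class and the numeric values are hypothetical, and the formulas assume the definitions the doc gives (`batch_size` is the minibatch size, `num_batches_per_epoch` minibatches are collected per training iteration, and `num_epochs` is the number of passes over that dataset).

```python
from dataclasses import dataclass


@dataclass
class AppoDataflow:
    """Hypothetical container for the knobs named in CURRENT-RL-SYSTEM.md."""

    num_workers: int
    num_envs_per_worker: int
    rollout: int
    batch_size: int
    num_batches_per_epoch: int
    num_epochs: int

    @property
    def parallel_envs(self) -> int:
        # Live RL rollouts running in parallel.
        return self.num_workers * self.num_envs_per_worker

    @property
    def steps_per_collection_round(self) -> int:
        # Env steps gathered when every env contributes one rollout segment.
        return self.parallel_envs * self.rollout

    @property
    def samples_per_iteration(self) -> int:
        # Dataset size consumed by one training iteration.
        return self.batch_size * self.num_batches_per_epoch

    @property
    def minibatch_updates_per_iteration(self) -> int:
        # Gradient steps taken per training iteration.
        return self.num_batches_per_epoch * self.num_epochs


# Illustrative values only; these are not the repo's defaults.
flow = AppoDataflow(
    num_workers=4,
    num_envs_per_worker=8,
    rollout=32,
    batch_size=1024,
    num_batches_per_epoch=2,
    num_epochs=1,
)
print(flow.parallel_envs, flow.steps_per_collection_round)
print(flow.samples_per_iteration, flow.minibatch_updates_per_iteration)
```

With these example values, one collection round yields 1024 env steps, so a 2048-sample training iteration needs two rounds. That ratio is the one to check when sizing `num_workers` against `batch_size`.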
	diff --git a/rl/sf_env.py b/rl/sf_env.py
	index d524a08..d550459 100644
	--- a/rl/sf_env.py
	+++ b/rl/sf_env.py
	@@ -108,7 +108,17 @@ def make_nethack_skill_env(full_env_name, cfg, env_config, render_mode=None, **k
	rl_config = RLConfig()
	rl_config.env.seed = getattr(cfg, "seed", rl_config.env.seed)
	rl_config.env.max_episode_steps = getattr(cfg, "env_max_episode_steps", rl_config.env.max_episode_steps)
	+ rl_config.env.active_skill_bootstrap = getattr(
	+ cfg, "active_skill_bootstrap", rl_config.env.active_skill_bootstrap
	+ )
	rl_config.reward.source = getattr(cfg, "reward_source", rl_config.reward.source)
	rl_config.reward.extrinsic_weight = getattr(cfg, "extrinsic_reward_weight", rl_config.reward.extrinsic_weight)
	rl_config.reward.intrinsic_weight = getattr(cfg, "intrinsic_reward_weight", rl_config.reward.intrinsic_weight)
	+ rl_config.options.scheduler = getattr(cfg, "skill_scheduler", rl_config.options.scheduler)
	+ enabled_skills = getattr(cfg, "enabled_skills", None)
	+ if enabled_skills:
	+ if isinstance(enabled_skills, str):
	+ rl_config.options.enabled_skills = [s.strip() for s in enabled_skills.split(",") if s.strip()]
	+ else:
	+ rl_config.options.enabled_skills = list(enabled_skills)
	return NethackSkillEnv(rl_config)
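Reviewer note on the `rl/sf_env.py` hunk above: the override pattern it adds (read an attribute from `cfg` if present, otherwise keep the `RLConfig` default, and accept `enabled_skills` as either a comma-separated string or a sequence) can be exercised in isolation with a sketch like the following. The helper name `parse_enabled_skills` and the skill names are hypothetical, and `SimpleNamespace` stands in for the Sample Factory config object.

```python
from types import SimpleNamespace


def parse_enabled_skills(raw, default):
    """Mirror the hunk's logic: falsy input keeps the default list,
    a string is split on commas, any other iterable is copied."""
    if not raw:
        return list(default)
    if isinstance(raw, str):
        return [s.strip() for s in raw.split(",") if s.strip()]
    return list(raw)


# Hypothetical defaults and CLI-style config objects.
defaults = ["explore", "survive"]
cfg_from_cli = SimpleNamespace(enabled_skills="explore, descend,")
cfg_without_flag = SimpleNamespace()

# getattr with a fallback matches how the hunk reads optional fields.
print(parse_enabled_skills(getattr(cfg_from_cli, "enabled_skills", None), defaults))
print(parse_enabled_skills(getattr(cfg_without_flag, "enabled_skills", None), defaults))
```

One edge case follows directly from the truthiness check: a string made up only of separators and whitespace, such as `" , "`, is truthy and parses to an empty list, which replaces the defaults instead of falling back to them. It may be worth confirming whether that is the intended behavior.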
	diff --git a/rl/trainer.py b/rl/trainer.py
	index 32223d0..c2f36dc 100644
	--- a/rl/trainer.py
	+++ b/rl/trainer.py
	@@ -85,6 +85,9 @@ class APPOTrainerScaffold:
	f"--reward_source={cfg.reward.source}",
	f"--intrinsic_reward_weight={cfg.reward.intrinsic_weight}",
	f"--extrinsic_reward_weight={cfg.reward.extrinsic_weight}",
	+ f"--skill_scheduler={cfg.options.scheduler}",
	+ f"--enabled_skills={','.join(cfg.options.enabled_skills)}",
	+ f"--active_skill_bootstrap={cfg.env.active_skill_bootstrap}",
	]

	def launch(self, dry_run: bool = True) -> dict:
	@@ -111,6 +114,9 @@ class APPOTrainerScaffold:
	parser.add_argument("--reward_source", type=str, default=self.config.reward.source)
	parser.add_argument("--intrinsic_reward_weight", type=float, default=self.config.reward.intrinsic_weight)
	parser.add_argument("--extrinsic_reward_weight", type=float, default=self.config.reward.extrinsic_weight)
	+ parser.add_argument("--skill_scheduler", type=str, default=self.config.options.scheduler)
	+ parser.add_argument("--enabled_skills", type=str, default=",".join(self.config.options.enabled_skills))
	+ parser.add_argument("--active_skill_bootstrap", type=str, default=self.config.env.active_skill_bootstrap)
	sf_cfg = parse_full_cfg(parser, argv)
	status = run_rl(sf_cfg)
	return {