Luc
preserve medium adapter, evals, and rl train artifacts
9a8acce
diff --git a/CURRENT-RL-SYSTEM.md b/CURRENT-RL-SYSTEM.md
index 866d1ac..d090b20 100644
--- a/CURRENT-RL-SYSTEM.md
+++ b/CURRENT-RL-SYSTEM.md
@@ -5,23 +5,27 @@ exists today.
The short version:
-- this repo does **not** currently have a learned PPO / GRPO / APPO-style RL
- training loop,
-- it **does** have:
+- this repo now has a **real APPO backend** for low-level RL training,
+- it also has:
- supervised fine-tuning for a forward model,
- gameplay rollout code for collecting training data,
- a task-directed closed-loop control and evaluation harness,
- - a counterfactual rollout generator for "what if I took action X?" data.
+ - a counterfactual rollout generator for "what if I took action X?" data,
+ - a custom NetHack skill environment wired into Sample Factory APPO.
If you came in expecting a standard RL codebase with rollout workers,
-advantages, value heads, clipped objectives, and policy updates: that is **not**
- what is implemented here yet.
+advantages, value heads, clipped objectives, and policy updates: that now
+**does exist in early form**, but only for the new APPO path. The rest of the
+repo still contains older non-RL and pre-RL systems.
## 1. What Is Actually Trained Today
-The only actual model training path in the repo right now is in
-[train.py](/home/luc/rl-nethack/train.py).
+There are now **two distinct training paths** in the repo.
+
+### Path A: forward-model supervised training
+
+This is in [train.py](/home/luc/rl-nethack/train.py).
That script uses:
@@ -51,6 +55,33 @@ The trainer is configured in [train.py](/home/luc/rl-nethack/train.py):
There is no RL loss here. No returns, no GAE, no policy ratio, no clipped
objective, no reward model training, no rollout buffer.
+### Path B: APPO low-level RL training
+
+This is in:
+
+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
+
+This path uses:
+
+- Sample Factory
+- APPO
+- a custom Gymnasium env wrapping NLE
+- hand-shaped task rewards projected into a skill-conditioned RL env
+
+The APPO path does have:
+
+- rollout workers
+- rollout length
+- recurrence
+- actor-critic training
+- value loss
+- GAE
+- PPO-style clipping
+
+So the repo now has **real learned RL**, but only in this new RL subtree.
+
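For readers who want a concrete picture of "GAE" and "PPO-style clipping" from the list above, here is a minimal pure-Python sketch. It is a generic textbook formulation, not the code Sample Factory runs. The function names, the default `gamma`, `lam`, and `clip` values, and the assumption of a single rollout with no episode boundary are all illustrative.

```python
import math


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    Assumes no episode boundary inside the rollout (illustrative only).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual for timestep t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.1):
    """PPO-style clipped objective for a single sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clipped term caps how much a single update can gain by moving the policy ratio away from 1, which is the property the "PPO-style clipping" bullet refers to.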
## 2. What "Rollout" Means In This Repo
@@ -65,7 +96,7 @@ In this repo today, a "rollout" usually means:
- write training/eval examples,
- or score task behavior.
-There are three main rollout systems.
+There are now four main rollout/training systems.
## 3. System A: LLM / Policy Data Generation
@@ -305,7 +336,89 @@ This controller does not do:
It only does **one-step counterfactual selection**.
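
One-step counterfactual selection can be sketched as a brute-force loop over candidate actions. This is an illustration of the idea only, not the harness code. `select_action`, `simulate_step`, and `task_reward` are hypothetical names, and the toy state below is a plain integer.

```python
def select_action(state, candidate_actions, simulate_step, task_reward):
    """Try each candidate action once and keep the best-scoring one.

    simulate_step(state, action) -> next_state    (hypothetical callback)
    task_reward(state, next_state) -> float       (hypothetical callback)
    Ties go to the earliest candidate.
    """
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        next_state = simulate_step(state, action)
        score = task_reward(state, next_state)
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score


# Toy usage: the state is a position on a line and reward is the distance gained.
action, score = select_action(
    5,
    [-1, 0, 1],
    simulate_step=lambda s, a: s + a,
    task_reward=lambda s, s_next: float(s_next - s),
)
```

There is no lookahead beyond the single simulated step, so this kind of selection cannot trade a short-term loss for a later gain.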
-## 6. Task Rewards: What The Harness Optimizes
+## 6. System D: Sample Factory APPO Backend
+
+This is in:
+
+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
+- [rl/env_adapter.py](/home/luc/rl-nethack/rl/env_adapter.py)
+- [rl/feature_encoder.py](/home/luc/rl-nethack/rl/feature_encoder.py)
+
+### What it does
+
+It trains a low-level policy with APPO against a custom NetHack skill env.
+
+The APPO env currently:
+
+- wraps NLE
+- tracks memory via `MemoryTracker`
+- tracks an active skill
+- computes a compact vector observation
+- uses the repo’s hand-shaped task rewards as the main training signal
+
+### Are these multi-turn rollouts?
+
+Yes.
+
+This is the first true RL rollout system in the repo.
+
+Rollouts are controlled by Sample Factory parameters such as:
+
+- `num_workers`
+- `num_envs_per_worker`
+- `rollout`
+- `recurrence`
+
+These are now real RL rollout knobs, not just data-generation concurrency.
+
+### What is learned?
+
+Sample Factory builds and trains an actor-critic model.
+
+In the current smoke-tested setup it used:
+
+- observation space:
+ - `Dict('obs': Box(-10.0, 10.0, (106,), float32))`
+- action space:
+ - `Discrete(13)`
+- model:
+ - MLP encoder
+ - GRU core
+ - linear policy head
+ - linear value head
+
+### What is the policy optimizing?
+
+Right now:
+
+- intrinsic/task reward from the repo’s hand-shaped task reward functions
+- optionally mixed with env reward through:
+ - `intrinsic_reward_weight`
+ - `extrinsic_reward_weight`
+
+So the APPO path is learned RL, but the reward source is still the old
+hand-shaped task logic, not a learned reward model.
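
A plausible reading of the two weights is a simple weighted sum of the task reward and the env reward. The sketch below assumes exactly that; the actual combination inside the env may differ. `mix_reward` and its default weights are hypothetical.

```python
def mix_reward(
    intrinsic: float,
    extrinsic: float,
    intrinsic_reward_weight: float = 1.0,  # assumed default, for illustration
    extrinsic_reward_weight: float = 0.0,  # assumed default, for illustration
) -> float:
    """Weighted sum of hand-shaped task reward and raw env reward (assumed form)."""
    return intrinsic_reward_weight * intrinsic + extrinsic_reward_weight * extrinsic
```

With the extrinsic weight at zero, the policy sees only the hand-shaped task reward, which matches the "main training signal" description above.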
+
+### What is "group size" here?
+
+Still no GRPO-style group size.
+
+The important RL knobs are now:
+
+- `num_workers`
+- `num_envs_per_worker`
+- `rollout`
+- `recurrence`
+- `batch_size`
+- `num_batches_per_epoch`
+- `num_epochs`
+
+Those are the real APPO dataflow parameters.
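
To see how these knobs interact, the arithmetic can be sketched as below. It assumes the usual Sample Factory reading, where `batch_size` counts environment steps per minibatch and `num_batches_per_epoch` minibatches form one training dataset. The `AppoDataflow` class and the example numbers are hypothetical and are not taken from any config in this repo.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AppoDataflow:
    num_workers: int
    num_envs_per_worker: int
    rollout: int
    batch_size: int
    num_batches_per_epoch: int
    num_epochs: int

    @property
    def parallel_envs(self) -> int:
        # Live environments stepping at the same time.
        return self.num_workers * self.num_envs_per_worker

    @property
    def samples_per_iteration(self) -> int:
        # Environment steps collected before the learner trains.
        return self.batch_size * self.num_batches_per_epoch

    @property
    def rollouts_per_iteration(self) -> int:
        # Fixed-length rollouts needed to fill one training dataset.
        return math.ceil(self.samples_per_iteration / self.rollout)

    @property
    def sgd_steps_per_iteration(self) -> int:
        # Minibatch updates performed on each collected dataset.
        return self.num_batches_per_epoch * self.num_epochs


# Hypothetical example values.
flow = AppoDataflow(
    num_workers=8,
    num_envs_per_worker=16,
    rollout=32,
    batch_size=1024,
    num_batches_per_epoch=2,
    num_epochs=1,
)
```

For these example values, 128 envs run in parallel and each training iteration consumes 2048 steps, or 64 rollouts of length 32.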
+
+
+## 7. Task Rewards: What The Harness And APPO Env Optimize
Task rewards are defined in
[src/task_rewards.py](/home/luc/rl-nethack/src/task_rewards.py).
@@ -331,10 +444,11 @@ Examples:
This is important:
- the harness is already task-conditioned,
-- but the reward is currently **hand-shaped**, not learned from feedback.
+- the APPO env also currently consumes these rewards,
+- but the reward is still **hand-shaped**, not learned from feedback.
-## 7. Evaluation: What We Measure Today
+## 8. Evaluation: What We Measure Today
There are currently two very different evaluation paths.
@@ -375,8 +489,23 @@ This measures trajectory behavior:
This is the current behavior-level evaluation layer.
+### APPO smoke validation
+
+The new APPO backend has been smoke-tested through a real run launched from
+`cli.py rl-train-appo`.
+
+That run successfully:
-## 8. What "Batch Size" Means In This Repo
+- registered the custom env,
+- initialized the actor-critic,
+- collected experience,
+- trained,
+- and wrote checkpoints/config under `train_dir/rl/...`.
+
+This is backend validation, not yet a meaningful benchmark of policy quality.
+
+
+## 9. What "Batch Size" Means In This Repo
There are several unrelated batch-like knobs.
@@ -414,21 +543,37 @@ In the `vllm-batch` backend:
This is inference throughput tuning, not RL grouping.
+### APPO learner batch size
+
+In [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py) and
+[rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py):
+
+- `--batch-size`
+ - APPO minibatch size
+- `--num-batches-per-epoch`
+ - how many minibatches are collected before each training iteration
+- `--ppo-epochs`
+ - how many passes over that dataset
+
+This is now a real RL batch concept in the repo.
+
-## 9. What Is Missing If You Expect "Real RL"
+## 10. What Is Missing If You Expect A Mature RL Stack
-The repo currently does **not** have:
+The repo now **does** have APPO.
+
+But it does **not yet** have:
-- PPO
- GRPO
-- APPO
-- actor-critic training
- replay buffer / rollout buffer
-- advantage estimation
-- value function training
- reward model training
-- option / skill policy training
+- learned option / skill policy conditioning in the model itself
- learned high-level policy over skills
+- learned reward models from preferences
+- robust action masking inside the APPO policy
+- a richer observation encoder than the current compact 106-dim vector
+- serious multi-GPU / high-throughput APPO configs for this box
+- real benchmark results against `task_greedy`
What it does have is:
@@ -436,16 +581,18 @@ What it does have is:
- one-step counterfactual branching
- hand-shaped task reward functions
- trajectory evaluation
+- a real APPO actor-critic training path
So the current stack is best described as:
1. collect trajectories,
2. train a forward model with SFT,
-3. evaluate closed-loop task behavior with a hand-shaped controller,
-4. prepare for future RL / planning work.
+3. train a low-level APPO policy on hand-shaped task reward,
+4. evaluate closed-loop task behavior with both a hand-shaped controller and a learned RL backend,
+5. prepare for hierarchical / reward-model / scheduler work.
-## 10. Practical Answers To Your Specific Questions
+## 11. Practical Answers To Your Specific Questions
### "How many rollouts are there?"
@@ -459,12 +606,15 @@ Depends on which subsystem you mean.
- `generate_counterfactual_data.py`
- one outer rollout per game
- plus up to 9 one-step branches per interesting state
+- `rl-train-appo`
+ - `num_workers * num_envs_per_worker` live RL rollouts in parallel
### "Are these multi-turn rollouts?"
- forward-data generation games: **yes**
- task-harness episodes: **yes**
- counterfactual branches: **no**, one-step only
+- APPO env rollouts: **yes**
- SFT training itself: **not rollouts at all**
### "What is the group size?"
@@ -479,25 +629,37 @@ Closest answers:
- up to **9 actions**
- rollout concurrency for data generation:
- `--workers`
+- APPO parallel rollouts:
+ - `num_workers * num_envs_per_worker`
- SFT global batch:
- `batch_size * grad_accum * world_size`
+- APPO learner batch:
+ - `batch_size`, `num_batches_per_epoch`, `num_epochs`
### "Is there any learned policy right now?"
-Not in the main path.
+Yes.
-The main trained model is a **forward prediction LM**.
-The current task controller is **algorithmic**, not learned.
+There are now two learned model families:
+- the forward prediction LM from [train.py](/home/luc/rl-nethack/train.py)
+- the APPO low-level policy/value model from the new `rl/` stack
-## 11. Recommended Mental Model For This Repo
+The current task controller in [src/task_harness.py](/home/luc/rl-nethack/src/task_harness.py)
+is still algorithmic.
+
+
+## 12. Recommended Mental Model For This Repo
If you want the correct mental model, think of the repo as:
- **today**
- - forward-model training project with control/eval scaffolding
+ - hybrid project with:
+ - forward-model SFT,
+ - a hand-shaped control harness,
+ - and a first real APPO RL backend
- **not yet**
- - full RL agent training system
+ - full hierarchical skill RL system with learned rewards and scheduler
The most important files for understanding that are:
@@ -507,17 +669,21 @@ The most important files for understanding that are:
- [src/task_harness.py](/home/luc/rl-nethack/src/task_harness.py)
- [src/task_rewards.py](/home/luc/rl-nethack/src/task_rewards.py)
- [src/evaluator.py](/home/luc/rl-nethack/src/evaluator.py)
+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
-## 12. If We Wanted To Make This A Real RL System Next
+## 13. If We Wanted To Mature This RL System Next
The current code naturally points to this progression:
1. keep the current task rewards and task evaluation,
-2. use counterfactual rollouts to label better/worse action outcomes,
-3. train a reward or value model per task,
-4. replace one-step brute-force action selection with a learned scorer,
-5. then add real policy optimization if needed.
+2. benchmark APPO against `task_greedy`,
+3. improve observations / model architecture,
+4. use counterfactual rollouts to label better/worse action outcomes,
+5. train a reward or value model per task,
+6. move from flat task-conditioned APPO toward full options / scheduler hierarchy.
That would be the first point where terms like:
@@ -529,4 +695,3 @@ That would be the first point where terms like:
- group size
become central in the usual RL sense.
-
diff --git a/rl/sf_env.py b/rl/sf_env.py
index d524a08..d550459 100644
--- a/rl/sf_env.py
+++ b/rl/sf_env.py
@@ -108,7 +108,17 @@ def make_nethack_skill_env(full_env_name, cfg, env_config, render_mode=None, **k
rl_config = RLConfig()
rl_config.env.seed = getattr(cfg, "seed", rl_config.env.seed)
rl_config.env.max_episode_steps = getattr(cfg, "env_max_episode_steps", rl_config.env.max_episode_steps)
+ rl_config.env.active_skill_bootstrap = getattr(
+ cfg, "active_skill_bootstrap", rl_config.env.active_skill_bootstrap
+ )
rl_config.reward.source = getattr(cfg, "reward_source", rl_config.reward.source)
rl_config.reward.extrinsic_weight = getattr(cfg, "extrinsic_reward_weight", rl_config.reward.extrinsic_weight)
rl_config.reward.intrinsic_weight = getattr(cfg, "intrinsic_reward_weight", rl_config.reward.intrinsic_weight)
+ rl_config.options.scheduler = getattr(cfg, "skill_scheduler", rl_config.options.scheduler)
+ enabled_skills = getattr(cfg, "enabled_skills", None)
+ if enabled_skills:
+ if isinstance(enabled_skills, str):
+ rl_config.options.enabled_skills = [s.strip() for s in enabled_skills.split(",") if s.strip()]
+ else:
+ rl_config.options.enabled_skills = list(enabled_skills)
return NethackSkillEnv(rl_config)
diff --git a/rl/trainer.py b/rl/trainer.py
index 32223d0..c2f36dc 100644
--- a/rl/trainer.py
+++ b/rl/trainer.py
@@ -85,6 +85,9 @@ class APPOTrainerScaffold:
f"--reward_source={cfg.reward.source}",
f"--intrinsic_reward_weight={cfg.reward.intrinsic_weight}",
f"--extrinsic_reward_weight={cfg.reward.extrinsic_weight}",
+ f"--skill_scheduler={cfg.options.scheduler}",
+ f"--enabled_skills={','.join(cfg.options.enabled_skills)}",
+ f"--active_skill_bootstrap={cfg.env.active_skill_bootstrap}",
]
def launch(self, dry_run: bool = True) -> dict:
@@ -111,6 +114,9 @@ class APPOTrainerScaffold:
parser.add_argument("--reward_source", type=str, default=self.config.reward.source)
parser.add_argument("--intrinsic_reward_weight", type=float, default=self.config.reward.intrinsic_weight)
parser.add_argument("--extrinsic_reward_weight", type=float, default=self.config.reward.extrinsic_weight)
+ parser.add_argument("--skill_scheduler", type=str, default=self.config.options.scheduler)
+ parser.add_argument("--enabled_skills", type=str, default=",".join(self.config.options.enabled_skills))
+ parser.add_argument("--active_skill_bootstrap", type=str, default=self.config.env.active_skill_bootstrap)
sf_cfg = parse_full_cfg(parser, argv)
status = run_rl(sf_cfg)
return {
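
Review note on the `enabled_skills` plumbing: `rl/trainer.py` joins the list with commas and `rl/sf_env.py` splits it again. The round trip can be sketched in isolation as below. `encode_skills` and `decode_skills` are hypothetical helper names that mirror the logic in the diff, and the skill names are made up for the example.

```python
def encode_skills(skills):
    """Mirror of the trainer side: a list of skill names becomes one CLI string."""
    return ",".join(skills)


def decode_skills(raw, default):
    """Mirror of the env side: accept a comma-separated string or a sequence.

    A falsy value (None, "", or an empty list) keeps the default, as in the diff.
    """
    if not raw:
        return list(default)
    if isinstance(raw, str):
        return [s.strip() for s in raw.split(",") if s.strip()]
    return list(raw)


# Hypothetical skill names, for illustration only.
round_trip = decode_skills(encode_skills(["explore", "descend"]), default=["explore"])
empty_case = decode_skills(encode_skills([]), default=["explore"])
blank_case = decode_skills(" , ", default=["explore"])
```

Two behaviors follow from this logic: an empty skill list encodes to `""` and silently falls back to the default on the env side, while a string containing only separators and whitespace passes the truthiness check and decodes to an empty list.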