diff --git a/CURRENT-RL-SYSTEM.md b/CURRENT-RL-SYSTEM.md
index 866d1ac..d090b20 100644
--- a/CURRENT-RL-SYSTEM.md
+++ b/CURRENT-RL-SYSTEM.md
@@ -5,23 +5,27 @@ exists today.
 
 The short version:
 
-- this repo does **not** currently have a learned PPO / GRPO / APPO-style RL
-  training loop,
-- it **does** have:
+- this repo now has a **real APPO backend** for low-level RL training,
+- it also has:
   - supervised fine-tuning for a forward model,
   - gameplay rollout code for collecting training data,
   - a task-directed closed-loop control and evaluation harness,
-  - a counterfactual rollout generator for "what if I took action X?" data.
+  - a counterfactual rollout generator for "what if I took action X?" data,
+  - a custom NetHack skill environment wired into Sample Factory APPO.
 
 If you came in expecting a standard RL codebase with rollout workers,
-advantages, value heads, clipped objectives, and policy updates: that is **not**
- what is implemented here yet.
+advantages, value heads, clipped objectives, and policy updates: that now
+**does exist in early form**, but only for the new APPO path. The rest of the
+repo still contains older non-RL and pre-RL systems.
 
 
 ## 1. What Is Actually Trained Today
 
-The only actual model training path in the repo right now is in
-[train.py](/home/luc/rl-nethack/train.py).
+There are now **two distinct training paths** in the repo.
+
+### Path A: forward-model supervised training
+
+This is in [train.py](/home/luc/rl-nethack/train.py).
 
 That script uses:
 
@@ -51,6 +55,33 @@ The trainer is configured in [train.py](/home/luc/rl-nethack/train.py):
 There is no RL loss here. No returns, no GAE, no policy ratio, no clipped
 objective, no reward model training, no rollout buffer.
 
+### Path B: APPO low-level RL training
+
+This is in:
+
+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
+
+This path uses:
+
+- Sample Factory
+- APPO
+- a custom Gymnasium env wrapping NLE
+- hand-shaped task rewards projected into a skill-conditioned RL env
+
+The APPO path does have:
+
+- rollout workers
+- rollout length
+- recurrence
+- actor-critic training
+- value loss
+- GAE
+- PPO-style clipping
+
+So the repo now has **real learned RL**, but only in this new RL subtree.
+
 
 ## 2. What "Rollout" Means In This Repo
 
@@ -65,7 +96,7 @@ In this repo today, a "rollout" usually means:
 - write training/eval examples,
 - or score task behavior.
 
-There are three main rollout systems.
+There are now four main rollout/training systems.
 
 
 ## 3. System A: LLM / Policy Data Generation
@@ -305,7 +336,89 @@ This controller does not do:
 It only does **one-step counterfactual selection**.
 
 
-## 6. Task Rewards: What The Harness Optimizes
+## 6. System D: Sample Factory APPO Backend
+
+This is in:
+
+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
+- [rl/env_adapter.py](/home/luc/rl-nethack/rl/env_adapter.py)
+- [rl/feature_encoder.py](/home/luc/rl-nethack/rl/feature_encoder.py)
+
+### What it does
+
+It trains a low-level policy with APPO against a custom NetHack skill env.
+
+The APPO env currently:
+
+- wraps NLE
+- tracks memory via `MemoryTracker`
+- tracks an active skill
+- computes a compact vector observation
+- uses the repo’s hand-shaped task rewards as the main training signal
+
+### Are these multi-turn rollouts?
+
+Yes.
+
+This is the first true RL rollout system in the repo.
+
+Rollouts are controlled by Sample Factory parameters such as:
+
+- `num_workers`
+- `num_envs_per_worker`
+- `rollout`
+- `recurrence`
+
+These are now real RL rollout knobs, not just data-generation concurrency.
+
+### What is learned?
+
+Sample Factory builds and trains an actor-critic model.
+
+In the current smoke-tested setup it used:
+
+- observation space:
+  - `Dict('obs': Box(-10.0, 10.0, (106,), float32))`
+- action space:
+  - `Discrete(13)`
+- model:
+  - MLP encoder
+  - GRU core
+  - linear policy head
+  - linear value head
+
+### What is the policy optimizing?
+
+Right now:
+
+- intrinsic/task reward from the repo’s hand-shaped task reward functions
+- optionally mixed with env reward through:
+  - `intrinsic_reward_weight`
+  - `extrinsic_reward_weight`
+
+So the APPO path is learned RL, but the reward source is still the old
+hand-shaped task logic, not a learned reward model.
+
+### What is "group size" here?
+
+Still no GRPO-style group size.
+
+The important RL knobs are now:
+
+- `num_workers`
+- `num_envs_per_worker`
+- `rollout`
+- `recurrence`
+- `batch_size`
+- `num_batches_per_epoch`
+- `num_epochs`
+
+Those are the real APPO dataflow parameters.
+
+
+## 7. Task Rewards: What The Harness And APPO Env Optimize
 
 Task rewards are defined in
 [src/task_rewards.py](/home/luc/rl-nethack/src/task_rewards.py).
@@ -331,10 +444,11 @@ Examples:
 This is important:
 
 - the harness is already task-conditioned,
-- but the reward is currently **hand-shaped**, not learned from feedback.
+- the APPO env also currently consumes these rewards,
+- but the reward is still **hand-shaped**, not learned from feedback.
 
 
-## 7. Evaluation: What We Measure Today
+## 8. Evaluation: What We Measure Today
 
 There are currently two very different evaluation paths.
 
@@ -375,8 +489,23 @@ This measures trajectory behavior:
 
 This is the current behavior-level evaluation layer.
 
+### APPO smoke validation
+
+The new APPO backend has been smoke-tested through a real run launched from
+`cli.py rl-train-appo`.
+
+That run successfully:
 
-## 8. What "Batch Size" Means In This Repo
+- registered the custom env,
+- initialized the actor-critic,
+- collected experience,
+- trained,
+- and wrote checkpoints/config under `train_dir/rl/...`
+
+This is backend validation, not yet a meaningful benchmark of policy quality.
+
+
+## 9. What "Batch Size" Means In This Repo
 
 There are several unrelated batch-like knobs.
 
@@ -414,21 +543,37 @@ In the `vllm-batch` backend:
 
 This is inference throughput tuning, not RL grouping.
 
+### APPO learner batch size
+
+In [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py) and
+[rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py):
+
+- `--batch-size`
+  - APPO minibatch size
+- `--num-batches-per-epoch`
+  - how many minibatches are collected before each training iteration
+- `--ppo-epochs`
+  - how many passes over that dataset
+
+This is now a real RL batch concept in the repo.
+
 
-## 9. What Is Missing If You Expect "Real RL"
+## 10. What Is Missing If You Expect A Mature RL Stack
 
-The repo currently does **not** have:
+The repo now **does** have APPO.
+
+But it still does **not yet** have:
 
-- PPO
 - GRPO
-- APPO
-- actor-critic training
 - replay buffer / rollout buffer
-- advantage estimation
-- value function training
 - reward model training
-- option / skill policy training
+- learned option / skill policy conditioning in the model itself
 - learned high-level policy over skills
+- learned reward models from preferences
+- robust action masking inside the APPO policy
+- a richer observation encoder than the current compact 106-dim vector
+- serious multi-GPU / high-throughput APPO configs for this box
+- real benchmark results against `task_greedy`
 
 What it does have is:
 
@@ -436,16 +581,18 @@ What it does have is:
 - one-step counterfactual branching
 - hand-shaped task reward functions
 - trajectory evaluation
+- a real APPO actor-critic training path
 
 So the current stack is best described as:
 
 1. collect trajectories,
 2. train a forward model with SFT,
-3. evaluate closed-loop task behavior with a hand-shaped controller,
-4. prepare for future RL / planning work.
+3. train a low-level APPO policy on hand-shaped task reward,
+4. evaluate closed-loop task behavior with both a hand-shaped controller and a learned RL backend,
+5. prepare for hierarchical / reward-model / scheduler work.
 
 
-## 10. Practical Answers To Your Specific Questions
+## 11. Practical Answers To Your Specific Questions
 
 ### "How many rollouts are there?"
 
@@ -459,12 +606,15 @@ Depends on which subsystem you mean.
 - `generate_counterfactual_data.py`
   - one outer rollout per game
   - plus up to 9 one-step branches per interesting state
+- `rl-train-appo`
+  - `num_workers * num_envs_per_worker` live RL rollouts in parallel
 
 ### "Are these multi-turn rollouts?"
 
 - forward-data generation games: **yes**
 - task-harness episodes: **yes**
 - counterfactual branches: **no**, one-step only
+- APPO env rollouts: **yes**
 - SFT training itself: **not rollouts at all**
 
 ### "What is the group size?"
@@ -479,25 +629,37 @@ Closest answers:
   - up to **9 actions**
 - rollout concurrency for data generation:
   - `--workers`
+- APPO parallel rollouts:
+  - `num_workers * num_envs_per_worker`
 - SFT global batch:
   - `batch_size * grad_accum * world_size`
+- APPO learner batch:
+  - `batch_size`, `num_batches_per_epoch`, `num_epochs`
 
 ### "Is there any learned policy right now?"
 
-Not in the main path.
+Yes.
 
-The main trained model is a **forward prediction LM**.
-The current task controller is **algorithmic**, not learned.
+There are now two learned model families:
 
+- the forward prediction LM from [train.py](/home/luc/rl-nethack/train.py)
+- the APPO low-level policy/value model from the new `rl/` stack
 
-## 11. Recommended Mental Model For This Repo
+The current task controller in [src/task_harness.py](/home/luc/rl-nethack/src/task_harness.py)
+is still algorithmic.
+
+
+## 12. Recommended Mental Model For This Repo
 
 If you want the correct mental model, think of the repo as:
 
 - **today**
-  - forward-model training project with control/eval scaffolding
+  - hybrid project with:
+    - forward-model SFT,
+    - a hand-shaped control harness,
+    - and a first real APPO RL backend
 - **not yet**
-  - full RL agent training system
+  - full hierarchical skill RL system with learned rewards and scheduler
 
 The most important files for understanding that are:
 
@@ -507,17 +669,21 @@ The most important files for understanding that are:
 - [src/task_harness.py](/home/luc/rl-nethack/src/task_harness.py)
 - [src/task_rewards.py](/home/luc/rl-nethack/src/task_rewards.py)
 - [src/evaluator.py](/home/luc/rl-nethack/src/evaluator.py)
+- [rl/train_appo.py](/home/luc/rl-nethack/rl/train_appo.py)
+- [rl/trainer.py](/home/luc/rl-nethack/rl/trainer.py)
+- [rl/sf_env.py](/home/luc/rl-nethack/rl/sf_env.py)
 
 
-## 12. If We Wanted To Make This A Real RL System Next
+## 13. If We Wanted To Mature This RL System Next
 
 The current code naturally points to this progression:
 
 1. keep the current task rewards and task evaluation,
-2. use counterfactual rollouts to label better/worse action outcomes,
-3. train a reward or value model per task,
-4. replace one-step brute-force action selection with a learned scorer,
-5. then add real policy optimization if needed.
+2. benchmark APPO against `task_greedy`,
+3. improve observations / model architecture,
+4. use counterfactual rollouts to label better/worse action outcomes,
+5. train a reward or value model per task,
+6. move from flat task-conditioned APPO toward full options / scheduler hierarchy
 
 That would be the first point where terms like:
 
@@ -529,4 +695,3 @@ That would be the first point where terms like:
 - group size
 
 become central in the usual RL sense.
-
diff --git a/rl/sf_env.py b/rl/sf_env.py
index d524a08..d550459 100644
--- a/rl/sf_env.py
+++ b/rl/sf_env.py
@@ -108,7 +108,17 @@ def make_nethack_skill_env(full_env_name, cfg, env_config, render_mode=None, **k
     rl_config = RLConfig()
     rl_config.env.seed = getattr(cfg, "seed", rl_config.env.seed)
     rl_config.env.max_episode_steps = getattr(cfg, "env_max_episode_steps", rl_config.env.max_episode_steps)
+    rl_config.env.active_skill_bootstrap = getattr(
+        cfg, "active_skill_bootstrap", rl_config.env.active_skill_bootstrap
+    )
     rl_config.reward.source = getattr(cfg, "reward_source", rl_config.reward.source)
     rl_config.reward.extrinsic_weight = getattr(cfg, "extrinsic_reward_weight", rl_config.reward.extrinsic_weight)
     rl_config.reward.intrinsic_weight = getattr(cfg, "intrinsic_reward_weight", rl_config.reward.intrinsic_weight)
+    rl_config.options.scheduler = getattr(cfg, "skill_scheduler", rl_config.options.scheduler)
+    enabled_skills = getattr(cfg, "enabled_skills", None)
+    if enabled_skills:
+        if isinstance(enabled_skills, str):
+            rl_config.options.enabled_skills = [s.strip() for s in enabled_skills.split(",") if s.strip()]
+        else:
+            rl_config.options.enabled_skills = list(enabled_skills)
     return NethackSkillEnv(rl_config)
diff --git a/rl/trainer.py b/rl/trainer.py
index 32223d0..c2f36dc 100644
--- a/rl/trainer.py
+++ b/rl/trainer.py
@@ -85,6 +85,9 @@ class APPOTrainerScaffold:
             f"--reward_source={cfg.reward.source}",
             f"--intrinsic_reward_weight={cfg.reward.intrinsic_weight}",
             f"--extrinsic_reward_weight={cfg.reward.extrinsic_weight}",
+            f"--skill_scheduler={cfg.options.scheduler}",
+            f"--enabled_skills={','.join(cfg.options.enabled_skills)}",
+            f"--active_skill_bootstrap={cfg.env.active_skill_bootstrap}",
         ]
 
     def launch(self, dry_run: bool = True) -> dict:
@@ -111,6 +114,9 @@ class APPOTrainerScaffold:
         parser.add_argument("--reward_source", type=str, default=self.config.reward.source)
         parser.add_argument("--intrinsic_reward_weight", type=float, default=self.config.reward.intrinsic_weight)
         parser.add_argument("--extrinsic_reward_weight", type=float, default=self.config.reward.extrinsic_weight)
+        parser.add_argument("--skill_scheduler", type=str, default=self.config.options.scheduler)
+        parser.add_argument("--enabled_skills", type=str, default=",".join(self.config.options.enabled_skills))
+        parser.add_argument("--active_skill_bootstrap", type=str, default=self.config.env.active_skill_bootstrap)
         sf_cfg = parse_full_cfg(parser, argv)
         status = run_rl(sf_cfg)
         return {