Attuned Resonance RL Router — PPO Agent (v3)

The GitHub repo slug is still "CEPM" during the gradual rename; the Hugging Face slugs were migrated to attuned-resonance-* on 2026-05-09 (HF preserves the old cepm-* URLs as redirects).

PPO agent (stable-baselines3) that selects an advisor for an incoming call, trained against the outcome predictor as a learned reward model.

Research/educational use only. See disclaimer below.

v3 update (2026-05): the v2 PPO run reported on this card was a baseline-equivalent policy: it routed validly but did not outperform the skill-weighted-random heuristic. v3 introduces Δ-reward shaping against that heuristic and widens top-k pre-filtering from 10 to 50 candidates; the resulting policy beats the heuristic by +1.27 reward units per step on average, climbing monotonically through ~225k of the 250k timesteps before a small late dip. Full v2 root-cause analysis is preserved at the bottom of this card under "v2 → v3 history."

Headline Result

	v2 (50k steps, top_k=10, abs reward)	v3 (250k steps, top_k=50, Δ-reward)
Reward sign	Absolute (positive only)	Δ vs heuristic (zero-centered)
Mean episode reward	flat ~171.5	+1.27 (climbing, see curve)
Std across checkpoints	0.05 (0.03% relative)	0.13 over second-half evals
Beats heuristic?	No (tied within noise)	Yes (+1.27 > 0)

A reward of +1.27 means: averaged over an episode, the policy's chosen advisor scored 1.27 composite-reward units better than the heuristic baseline's pick on the same call. Composite reward is a weighted sum over predicted FCR, handle time, and CSAT, plus workload-balance and burnout-protection terms (weights from generator/config.py:routing_reward_weights; the full formula is under "Architecture"). As a rough guide, a one-unit shift corresponds to about one full bucket of improvement on one of those axes.

Architecture

call_intake (action label, urgency, complexity, ...)  ┐
                                                       ├──▶  CallRoutingEnv  ──▶  PPO MlpPolicy  ──▶  advisor_id
top_k=50 candidate advisors (filtered by             ─┘                              (action)
shift availability + skill match)

Observation: per-step state from CallRoutingEnv — current call's intake features + 12-dim summary of each of the top-50 candidate advisors + queue features (~613-dim total)
Action: categorical, choose one of top_k=50 pre-filtered advisor candidates
Reward (v3): composite(predictor(chosen)) − composite(predictor(heuristic_pick))
- composite = weighted sum of five terms: (1 − handle_time_normalized), fcr_probability, csat/5, workload_balance, and burnout_protection
- heuristic_pick reuses the skill-weighted-random sampling from generator/routing_generator.py
Episode: 200 sequential calls
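
The ~613 figure can be sanity-checked with a little arithmetic. The sketch below assumes the observation is a flat concatenation of the advisor block and everything else; the card does not say how the remaining ~13 dimensions split between intake and queue features, so OTHER_DIM is a placeholder rather than a value taken from the env.

```python
# Rough size check for a flat observation vector (illustrative only).
TOP_K = 50        # candidates per call, from this card
ADVISOR_DIM = 12  # per-advisor summary size, from this card
OTHER_DIM = 13    # ASSUMPTION: intake + queue features combined (~613 - 600)


def observation_dim(top_k: int, advisor_dim: int, other_dim: int) -> int:
    """Length of a flat observation: advisor block plus all other features."""
    return top_k * advisor_dim + other_dim


advisor_block = TOP_K * ADVISOR_DIM
total = observation_dim(TOP_K, ADVISOR_DIM, OTHER_DIM)
print(advisor_block, total)
```

The advisor block dominates the observation, so changing top_k changes the input shape the policy network expects.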

The env uses the trained predictor as a simulator — rollouts are synthesized from predictor outputs rather than collected from real calls. Δ-reward calls the predictor twice per step (once for the policy's pick, once for the baseline's), which adds latency but is fast enough on CPU when env stepping is parallelized.
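
As a minimal sketch of the Δ-reward described above, the code below computes the composite for both arms and subtracts them. The weights are placeholders (the real ones live in generator/config.py:routing_reward_weights and are not shown on this card), and the predictor signature and toy_predictor are hypothetical stand-ins, not the repo's API.

```python
from typing import Callable, Dict

# ASSUMPTION: equal placeholder weights, not the project's actual values.
WEIGHTS: Dict[str, float] = {
    "handle_time": 1.0,
    "fcr": 1.0,
    "csat": 1.0,
    "workload_balance": 1.0,
    "burnout_protection": 1.0,
}


def composite(pred: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the five terms listed under "Reward (v3)"."""
    terms = {
        "handle_time": 1.0 - pred["handle_time_normalized"],
        "fcr": pred["fcr_probability"],
        "csat": pred["csat"] / 5.0,
        "workload_balance": pred["workload_balance"],
        "burnout_protection": pred["burnout_protection"],
    }
    return sum(weights[name] * value for name, value in terms.items())


def delta_reward(predictor: Callable, call, chosen, heuristic_pick) -> float:
    """Two predictor calls per step: one for the policy's pick, one for the baseline's."""
    return composite(predictor(call, chosen)) - composite(predictor(call, heuristic_pick))


def toy_predictor(call, advisor_id: int) -> Dict[str, float]:
    """Hypothetical predictor: advisor 1 is better than advisor 0 on every axis."""
    q = 0.8 if advisor_id == 1 else 0.5
    return {
        "handle_time_normalized": 1.0 - q,
        "fcr_probability": q,
        "csat": 5.0 * q,
        "workload_balance": q,
        "burnout_protection": q,
    }


same_pick = delta_reward(toy_predictor, call=None, chosen=0, heuristic_pick=0)
better_pick = delta_reward(toy_predictor, call=None, chosen=1, heuristic_pick=0)
print(same_pick, better_pick)
```

When the policy and the heuristic choose the same advisor the reward is exactly zero, which is what makes the signal zero-centered.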

Training (v3)

Algorithm: PPO with MlpPolicy (stable-baselines3 default)
Total timesteps: 250,000
Episode length: 200 calls
Optimizer: Adam (sb3 defaults), learning_rate=3e-4, n_steps=512, batch_size=64, n_epochs=10
Parallelism: SubprocVecEnv(n_envs=4) with torch.set_num_threads(1) per subprocess (without that thread limit, vec-env gives no speedup — see "Engineering notes")
Device: cpu (MlpPolicy is small enough that GPU is wasted; env stepping dominates wall-clock)
Predictor (reward model): post-B2 refresh, val_loss 0.6235, trained on 6.8M synthetic calls (see predictor card)
Hardware: 16-core CPU box, ~4h 17m wall clock for the full 250k run at nice -n 19
Tracked: MLflow experiment prod-router-v3
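Checkpoints: CheckpointCallback saving every 25k timesteps. Because save_freq counts callback calls rather than timesteps, with n_envs=4 this means save_freq=25000 // 4 = 6250 (see "Engineering notes"); the empirical peak was at step 225k (see "Checkpoint selection")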
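
The hyperparameters above imply a few derived quantities that are handy when reading the learning curve. The arithmetic below is a sketch; it assumes training keeps collecting whole rollouts until the timestep budget is reached, which is an assumption about the training loop rather than something stated on this card.

```python
import math

N_ENVS = 4
N_STEPS = 512
BATCH_SIZE = 64
N_EPOCHS = 10
TOTAL_TIMESTEPS = 250_000

# Transitions gathered across all envs before each policy update.
rollout_size = N_ENVS * N_STEPS
# Minibatches per pass over the rollout buffer.
minibatches_per_epoch = rollout_size // BATCH_SIZE
# Gradient steps taken per policy update.
grad_steps_per_update = minibatches_per_epoch * N_EPOCHS
# ASSUMPTION: the last rollout is collected in full even if it overshoots.
n_updates = math.ceil(TOTAL_TIMESTEPS / rollout_size)

# Throughput implied by the reported ~4h 17m wall clock.
wall_clock_s = 4 * 3600 + 17 * 60
steps_per_second = TOTAL_TIMESTEPS / wall_clock_s

print(rollout_size, minibatches_per_epoch, grad_steps_per_update, n_updates)
print(round(steps_per_second, 1))
```

At roughly 16 env steps per second, wall clock is dominated by the two predictor calls per step rather than by the policy update.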

Learning curve

step	mean_episode_reward
2,500	−0.0159
27,500	+0.1803
52,500	+0.5019
77,500	+0.6876
102,500	+0.8230
127,500	+0.9433
152,500	+1.0239
177,500	+1.1472
202,500	+1.2535
227,500	+1.3386
250,000	+1.2705

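100 evaluation points were logged, one every 2,500 timesteps (the step column counts timesteps, not callback calls); the table shows every 10th point plus the final evaluation at 250k. The curve is monotonic up to ~225k steps; over the last ~25k steps it drifts downward by roughly 0.06, which is small but real (see "Limitations").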
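
The shape of the curve can be checked directly from the rows shown above. This sketch only sees the subsampled points in the table, not all 100 evaluations, so it says nothing about wiggles between rows.

```python
# (step, mean_episode_reward) pairs copied from the table above.
CURVE = [
    (2_500, -0.0159), (27_500, 0.1803), (52_500, 0.5019), (77_500, 0.6876),
    (102_500, 0.8230), (127_500, 0.9433), (152_500, 1.0239), (177_500, 1.1472),
    (202_500, 1.2535), (227_500, 1.3386), (250_000, 1.2705),
]


def first_decrease(curve):
    """Return the first step whose reward is below the previous point, or None."""
    for (_, prev_reward), (step, reward) in zip(curve, curve[1:]):
        if reward < prev_reward:
            return step
    return None


peak_step, peak_reward = max(CURVE, key=lambda point: point[1])
final_step, final_reward = CURVE[-1]
drop = peak_reward - final_reward

print(first_decrease(CURVE), peak_step, round(drop, 4))
```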

Checkpoint selection

The shipped ppo_router.zip is the final model at step 250k (mean reward +1.2705). The peak checkpoint (ppo_router_peak_225000_steps.zip, also uploaded) is at step 225k with a window-mean reward of +1.3284. Either is suitable for inference; we ship the final by convention but document both for reproducibility.
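
One way to pick a peak checkpoint by window mean is sketched below. The card does not define the window used for the +1.3284 figure, so the trailing-window definition, the window size, and the toy numbers here are all assumptions for illustration.

```python
from statistics import fmean


def window_mean(rewards, end, window):
    """Mean of the `window` evaluations ending at index `end` (inclusive)."""
    start = max(0, end - window + 1)
    return fmean(rewards[start:end + 1])


def best_checkpoint(eval_steps, rewards, checkpoint_steps, window=5):
    """Return the checkpoint step whose trailing window mean is highest."""
    scored = []
    for ckpt in checkpoint_steps:
        # Index of the last evaluation at or before this checkpoint.
        idx = max(i for i, step in enumerate(eval_steps) if step <= ckpt)
        scored.append((window_mean(rewards, idx, window), ckpt))
    return max(scored)[1]


# Toy data (ASSUMPTION: not the real evaluation log).
toy_steps = [10, 20, 30, 40, 50, 60]
toy_rewards = [0.0, 0.4, 0.8, 1.0, 0.9, 0.7]
picked = best_checkpoint(toy_steps, toy_rewards, checkpoint_steps=[20, 40, 60], window=2)
print(picked)
```

Averaging over a window rather than taking the single best evaluation makes the choice less sensitive to one lucky evaluation episode.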

Intended Use

Research on RL agent training against learned reward models
Educational demonstrations of cascade composition (NLP intake → outcome predictor → RL router)
A reproducible positive-result baseline for Δ-reward shaping in routing-style action spaces — directly comparable to the v2 negative result on the same env

Out of Scope / Not Intended

Any production or commercial use. Not validated for operational deployment.
Real-world routing decisions. The reward model is itself a learned proxy; biases in the predictor propagate into the policy.
Direct comparison to "real call-center RL" benchmarks. The reward signal here is predictor − heuristic, both running on synthetic data — gains do not translate 1:1 to real handle time or CSAT.

Limitations

What v3 does and does not prove

Does prove: with Δ-reward + top_k=50, PPO can learn a policy that systematically beats a skill-weighted-random baseline on this synthetic env. The curve is clean and monotonic through ~225k steps, and the magnitude is large (+1.27 on a reward whose std at evaluation is ~0.13).
Does not prove: that this policy generalizes to a real call center, that the heuristic baseline is a meaningful operational comparator, or that a +1.27 Δ-reward translates to e.g. measurable CSAT gain in production. The predictor is a synthetic-data model; both the chosen and baseline arms in the Δ-reward run through it. The policy is learning the structure of the predictor as much as anything else.

Late-training regression

The peak window mean (+1.3284 at step 225k) is ~0.06 higher than the final model (+1.2705 at step 250k, as reported in the learning curve and under "Checkpoint selection"). This is small but consistent: the curve genuinely turns over in the last ~25k steps. Plausible causes:

PPO entropy decay reducing exploration too aggressively at the end of training
The fixed RNG seed in env construction makes calls reproducible across rollouts; the policy may begin to over-fit on the specific call distribution it sees
Reward variance is small enough at this point that ordinary noise can produce a small monotonic drop

For applications where the absolute peak matters, prefer the 225k checkpoint. For "final policy" comparability with the conventional training-end model, use the 250k.

Inherent caveats

Top-50 still pre-filters by shift availability + skill match. The policy never sees globally bad picks. Removing pre-filtering entirely could change the result — not yet tried.
Heuristic baseline is the comparator, by construction. We do not know how this policy performs against e.g. a "best-skill-match wins" deterministic baseline or a human dispatcher.
Synthetic-data lineage. The whole cascade (intake silver labels, predictor synthetic calls, env reward via predictor) is synthetic. The router policy's "rightness" is bounded by the synthetic environment's fidelity to a real call center.

Engineering notes (v3)

Two issues that ate session-days during v3 and are worth flagging:

torch.set_num_threads(1) inside CallRoutingEnv.__init__. Without it, each SubprocVecEnv subprocess spawns ~num_cores PyTorch threads via OpenMP/MKL, and n_envs=4 × ~16 threads each = 64 threads competing for 16 cores. Net effect: vec env produces zero speedup (or worse — context-switch overhead). With the fix, parallelism actually scales. Setting it at module level does not help — the call must run inside the subprocess.
CheckpointCallback.save_freq is in callback-call units, not timestep units. With n_envs=4, you must pass save_freq // n_envs to get checkpoints at the timesteps you intend.

Both issues now have notes in the project's auto-memory and are documented in source-code comments.
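
Both notes reduce to simple arithmetic, sketched below. The helper names are hypothetical; the converted value is what would be passed as save_freq to CheckpointCallback when training on a vectorized env.

```python
def to_callback_freq(timesteps: int, n_envs: int) -> int:
    """Convert a frequency in timesteps to callback calls for a vectorized env.

    Each callback call corresponds to one step of every sub-env, i.e. n_envs
    timesteps, so the timestep frequency is divided by n_envs.
    """
    return max(timesteps // n_envs, 1)


def oversubscription(n_envs: int, threads_per_process: int, n_cores: int) -> float:
    """Ratio of runnable threads to physical cores (above 1.0 means contention)."""
    return n_envs * threads_per_process / n_cores


save_freq = to_callback_freq(25_000, n_envs=4)
without_fix = oversubscription(n_envs=4, threads_per_process=16, n_cores=16)
with_fix = oversubscription(n_envs=4, threads_per_process=1, n_cores=16)
print(save_freq, without_fix, with_fix)
```

With one thread per subprocess the four envs use a quarter of the 16 cores; without the limit they ask for four times as many threads as there are cores.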

v2 → v3 history (preserved for context)

The v1/v2 model on this same env converged to a flat reward of ~171.5 with std ~0.05 across 50k timesteps — a non-learning policy. The root cause analysis at the time identified three issues:

(1) top_k=10 pre-filtering hides advisor differentiation: the policy was being asked to discriminate among ~10 already-well-matched candidates, where predicted outcomes are tight. v3 bumps this to 50.
(2) Absolute reward dwarfs reward spread: magnitude ~171.5, std ~0.05 (0.03% relative); PPO's clip-range mechanic produces near-zero advantages on such ratios. v3 switches to Δ-reward, which is zero-centered and signed.
(3) No baseline comparison in the reward function. The v2 reward was the absolute predicted composite; the policy had no signal that said "this pick is better than what you would have done by default." v3 fixes this directly: reward = composite(chosen) − composite(heuristic_pick).

v3's Tier 1 implements (1) and (2)+(3); the result is the curve above. Tier 2 (reward decomposition into separate FCR / handle-time / CSAT channels, multi-objective continuous action) and Tier 3 (curriculum learning, predictor uncertainty as auxiliary reward) remain future work and would address how the policy generalizes rather than whether it learns.
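
To make the scale problem concrete, the sketch below computes the v2 relative spread from the numbers on this card and then shows, on made-up rewards, how subtracting a per-step baseline removes the large constant offset. The toy reward lists are assumptions for illustration, not logged values.

```python
V2_MEAN, V2_STD = 171.5, 0.05  # from the v2 column of the headline table


def relative_spread_pct(mean: float, std: float) -> float:
    """Standard deviation expressed as a percentage of the mean's magnitude."""
    return 100.0 * std / abs(mean)


v2_spread = relative_spread_pct(V2_MEAN, V2_STD)

# Toy numbers (ASSUMPTION): absolute composites for the policy's pick and
# for the heuristic's pick on the same three calls.
chosen = [171.45, 171.50, 171.55]
baseline = [171.50, 171.50, 171.50]
deltas = [c - b for c, b in zip(chosen, baseline)]

print(round(v2_spread, 3))
print([round(d, 2) for d in deltas])
```

The absolute rewards differ only in the fourth significant digit, while the deltas are signed and centered on zero, so a better or worse pick is directly visible in the reward.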

How to Load

from stable_baselines3 import PPO

model = PPO.load("ppo_router.zip")

# `env` is a CallRoutingEnv instance from the repo; its construction is not shown here.
# Inference: action is the index of the chosen advisor in the top-k candidate list,
# not a global advisor_id. Pair with the env to map back to the advisor pool.
# IMPORTANT: must construct the env with top_k=50 to match the training-time
# observation/action shapes (a saved PPO model encodes those shapes).
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
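
Mapping the predicted index back to an advisor might look like the sketch below. How CallRoutingEnv exposes the current candidate list is not documented on this card, so candidate_ids and the id strings are hypothetical.

```python
from typing import Sequence


def action_to_advisor_id(action: int, candidate_ids: Sequence[str]) -> str:
    """Map a policy action (index into the top-k list) to a global advisor id."""
    if not 0 <= action < len(candidate_ids):
        raise IndexError(f"action {action} is outside the {len(candidate_ids)} candidates")
    return candidate_ids[action]


# Hypothetical candidate list for the current call (real lists have 50 entries).
candidate_ids = ["adv_017", "adv_203", "adv_042"]
# model.predict returns a NumPy value; int(action) converts it before indexing.
advisor_id = action_to_advisor_id(int(2), candidate_ids)
print(advisor_id)
```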

Full pipeline code (env, training loop, inference) at tedrubin80/CEPM.

License

CC-BY-NC-4.0. Non-commercial use only. Attribution required.

Citation

@software{attuned_resonance_router_2026,
  author = {Rubin, Ted},
  title = {Attuned Resonance RL Router: PPO Agent for Call-Center Routing (v3)},
  year = {2026},
  url = {https://github.com/tedrubin80/CEPM},
  note = {v3 reward shaping (Δ-reward + top_k=50) produces a policy that beats the skill-weighted-random heuristic by +1.27 reward units per step over 250k timesteps.}
}
