Attuned Resonance RL Router: PPO Agent (v3)
The GitHub repo slug is still "CEPM" during the gradual rename; the Hugging Face slugs were migrated to attuned-resonance-* on 2026-05-09 (HF preserves the old cepm-* URLs as redirects).
PPO agent (stable-baselines3) that selects an advisor for an incoming call, trained against the outcome predictor as a learned reward model.
Research/educational use only. See disclaimer below.
v3 update (2026-05): the v2 PPO run reported on this card was a baseline-equivalent policy: it routed validly but did not outperform the skill-weighted-random heuristic. v3 introduces Δ-reward shaping against that heuristic and widens top-k pre-filtering from 10 to 50 candidates; the resulting policy beats the heuristic by +1.27 reward units per step on average, climbing monotonically until ~225k of the 250k timesteps. Full v2 root-cause analysis is preserved at the bottom of this card under "v2 → v3 history."
Headline Result
| | v2 (50k steps, top_k=10, abs reward) | v3 (250k steps, top_k=50, Δ-reward) |
|---|---|---|
| Reward sign | Absolute (positive only) | Δ vs heuristic (zero-centered) |
| Mean episode reward | flat ~171.5 | +1.27 (climbing, see curve) |
| Std across checkpoints | 0.05 (0.03% relative) | 0.13 over second-half evals |
| Beats heuristic? | No (tied within noise) | Yes (+1.27 > 0) |
A reward of +1.27 means: averaged over an episode, the policy's chosen advisor scored 1.27 composite-reward units better than the heuristic baseline's pick on the same call. Composite reward is a weighted sum over predicted FCR / handle-time / CSAT (weights from generator/config.py:routing_reward_weights), so a one-unit shift roughly corresponds to one full bucket of improvement on one of those axes.
Architecture
call_intake (action label, urgency, complexity, ...) ──┐
                                                       ├──▶ CallRoutingEnv ──▶ PPO MlpPolicy ──▶ advisor_id (action)
top_k=50 candidate advisors (filtered by             ──┘
shift availability + skill match)
- Observation: per-step state from CallRoutingEnv: current call's intake features + 12-dim summary of each of the top-50 candidate advisors + queue features (~613-dim total)
- Action: categorical, choose one of top_k=50 pre-filtered advisor candidates
- Reward (v3): composite(predictor(chosen)) − composite(predictor(heuristic_pick))
  - composite = weighted sum of (1 − handle_time_normalized) + fcr_probability + csat/5 + workload_balance + burnout_protection
  - heuristic_pick reuses the skill-weighted-random sampling from generator/routing_generator.py
- Episode: 200 sequential calls

The env uses the trained predictor as a simulator: rollouts are synthesized from predictor outputs rather than collected from real calls. Δ-reward calls the predictor twice per step (once for the policy's pick, once for the baseline's), which adds latency but is fast enough on CPU when env stepping is parallelized.
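The reward definition above can be sketched in a few lines. This is a minimal illustration, not the project's code: the unit weights in `ASSUMED_WEIGHTS` are placeholders (the real values live in generator/config.py:routing_reward_weights and are not reproduced on this card), and the dictionary keys simply mirror the term names listed above.

```python
from typing import Mapping

# Hypothetical weights for illustration only; the real ones are defined in
# generator/config.py:routing_reward_weights and are not shown on this card.
ASSUMED_WEIGHTS = {
    "handle_time": 1.0,
    "fcr": 1.0,
    "csat": 1.0,
    "workload_balance": 1.0,
    "burnout_protection": 1.0,
}


def composite(pred: Mapping[str, float],
              weights: Mapping[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted sum of the five composite terms for one predicted outcome."""
    terms = {
        "handle_time": 1.0 - pred["handle_time_normalized"],
        "fcr": pred["fcr_probability"],
        "csat": pred["csat"] / 5.0,
        "workload_balance": pred["workload_balance"],
        "burnout_protection": pred["burnout_protection"],
    }
    return sum(weights[name] * value for name, value in terms.items())


def delta_reward(chosen_pred: Mapping[str, float],
                 heuristic_pred: Mapping[str, float],
                 weights: Mapping[str, float] = ASSUMED_WEIGHTS) -> float:
    """Zero-centered reward: policy pick minus heuristic pick on the same call."""
    return composite(chosen_pred, weights) - composite(heuristic_pred, weights)


# Made-up predictor outputs for one call (two predictor calls per step).
chosen = {"handle_time_normalized": 0.4, "fcr_probability": 0.8, "csat": 4.0,
          "workload_balance": 0.5, "burnout_protection": 0.5}
heuristic = {"handle_time_normalized": 0.5, "fcr_probability": 0.7, "csat": 3.5,
             "workload_balance": 0.5, "burnout_protection": 0.5}

reward = delta_reward(chosen, heuristic)  # positive: the policy's pick scored higher
```

When both arms pick the same advisor the reward is exactly zero, which is what makes the signal zero-centered and signed.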
Training (v3)
- Algorithm: PPO with MlpPolicy (stable-baselines3 default)
- Total timesteps: 250,000
- Episode length: 200 calls
- Optimizer: Adam (sb3 defaults), learning_rate=3e-4, n_steps=512, batch_size=64, n_epochs=10
- Parallelism: SubprocVecEnv(n_envs=4) with torch.set_num_threads(1) per subprocess (without that thread limit, vec-env gives no speedup; see "Engineering notes")
- Device: cpu (MlpPolicy is small enough that GPU is wasted; env stepping dominates wall-clock)
- Predictor (reward model): post-B2 refresh, val_loss 0.6235, trained on 6.8M synthetic calls (see predictor card)
- Hardware: 16-core CPU box, ~4h 17m wall clock for the full 250k run at nice -n 19
- Tracked: MLflow experiment prod-router-v3
- Checkpoints: CheckpointCallback(save_freq=25000): a checkpoint is saved every 25k timesteps; the empirical peak was at step 225k (see "Checkpoint selection")
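To make the hyperparameters above concrete, the short calculation below works out how much data each PPO update sees. It only uses the numbers listed above; the rollout count assumes stable-baselines3's usual behaviour of collecting whole rollouts until the timestep budget is reached, so treat it as approximate.

```python
# Values taken from the training list above.
total_timesteps = 250_000
n_steps = 512        # rollout length per environment
n_envs = 4           # SubprocVecEnv workers
batch_size = 64
n_epochs = 10
episode_length = 200

# Each PPO update is computed from one rollout across all environments.
rollout_size = n_steps * n_envs                                  # 2048 transitions

# The rollout is split into minibatches and replayed for n_epochs passes.
minibatches_per_epoch = rollout_size // batch_size               # 32
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs    # 320

# Assumption: whole rollouts are collected until the budget is met (ceiling division).
rollouts = -(-total_timesteps // rollout_size)                   # 123

# Each environment finishes about 2.5 episodes of 200 calls per rollout.
episodes_per_env_per_rollout = n_steps / episode_length          # 2.56
```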
Learning curve
| step | mean_episode_reward |
|---|---|
| 2,500 | −0.0159 |
| 27,500 | +0.1803 |
| 52,500 | +0.5019 |
| 77,500 | +0.6876 |
| 102,500 | +0.8230 |
| 127,500 | +0.9433 |
| 152,500 | +1.0239 |
| 177,500 | +1.1472 |
| 202,500 | +1.2535 |
| 227,500 | +1.3386 |
| 250,000 | +1.2705 |
100 evaluation points were logged at eval_freq=2500 (callback-call units); the table shows every 10th point plus the final evaluation at 250k. The curve is monotonic up to ~225k steps; over the last ~25k steps it drifts downward by ~0.06, which is small but real (see "Limitations").
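The shape of the curve can be checked directly from the table. The snippet below is a small sanity check over the eleven points shown above (not the full 100-point log): it confirms that the reward rises at every shown step until 227,500 and drops only at the final evaluation.

```python
# (step, mean_episode_reward) pairs copied from the learning-curve table.
curve = [
    (2_500, -0.0159),
    (27_500, 0.1803),
    (52_500, 0.5019),
    (77_500, 0.6876),
    (102_500, 0.8230),
    (127_500, 0.9433),
    (152_500, 1.0239),
    (177_500, 1.1472),
    (202_500, 1.2535),
    (227_500, 1.3386),
    (250_000, 1.2705),
]

rewards = [r for _, r in curve]
# Change in reward between consecutive rows of the table.
deltas = [later - earlier for earlier, later in zip(rewards, rewards[1:])]

# Highest single evaluation among the points shown.
peak_step, peak_reward = max(curve, key=lambda point: point[1])
final_step, final_reward = curve[-1]
late_drop = peak_reward - final_reward

print(f"peak at step {peak_step}: {peak_reward:+.4f}")
print(f"drop from peak to final: {late_drop:.4f}")
```

Note that this compares single evaluations; the window-mean figures quoted elsewhere on the card smooth over neighbouring evaluations and therefore differ slightly.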
Checkpoint selection
The shipped ppo_router.zip is the final model at step 250k (mean reward +1.2705). The peak checkpoint (ppo_router_peak_225000_steps.zip, also uploaded) is at step 225k with a window-mean reward of +1.3284. Either is suitable for inference; we ship the final by convention but document both for reproducibility.
Intended Use
- Research on RL agent training against learned reward models
- Educational demonstrations of cascade composition (NLP intake β outcome predictor β RL router)
- A reproducible positive-result baseline for Δ-reward shaping in routing-style action spaces, directly comparable to the v2 negative result on the same env
Out of Scope / Not Intended
- Any production or commercial use. Not validated for operational deployment.
- Real-world routing decisions. The reward model is itself a learned proxy; biases in the predictor propagate into the policy.
- Direct comparison to "real call-center RL" benchmarks. The reward signal here is predictor − heuristic, both running on synthetic data, so gains do not translate 1:1 to real handle time or CSAT.
Limitations
What v3 does and does not prove
- Does prove: with Δ-reward + top_k=50, PPO can learn a policy that systematically beats a skill-weighted-random baseline on this synthetic env. The curve is clean and monotonic up to ~225k steps, and the magnitude is large (+1.27 on a reward whose std at evaluation is ~0.13).
- Does not prove: that this policy generalizes to a real call center, that the heuristic baseline is a meaningful operational comparator, or that a +1.27 Δ-reward translates to, e.g., a measurable CSAT gain in production. The predictor is a synthetic-data model; both the chosen and baseline arms in the Δ-reward run through it. The policy is learning the structure of the predictor as much as anything else.
Late-training regression
The peak window mean (+1.3284 at step 225k) is ~0.06 higher than the final model (+1.2705 at step 250k). This is small but consistent: the curve genuinely turns over in the last ~25k steps. Plausible causes:
- PPO entropy decay reducing exploration too aggressively at the end of training
- The fixed RNG seed in env construction makes calls reproducible across rollouts; the policy may begin to over-fit on the specific call distribution it sees
- Reward variance is small enough at this point that ordinary noise can produce a small monotonic drop
For applications where the absolute peak matters, prefer the 225k checkpoint. For "final policy" comparability with the conventional training-end model, use the 250k.
Inherent caveats
- Top-50 still pre-filters by shift availability + skill match. The policy never sees globally bad picks. Removing pre-filtering entirely could change the result; this has not yet been tried.
- Heuristic baseline is the comparator, by construction. We do not know how this policy performs against e.g. a "best-skill-match wins" deterministic baseline or a human dispatcher.
- Synthetic-data lineage. The whole cascade (intake silver labels, predictor synthetic calls, env reward via predictor) is synthetic. The router policy's "rightness" is bounded by the synthetic environment's fidelity to a real call center.
Engineering notes (v3)
Two issues that ate session-days during v3 and are worth flagging:
1. torch.set_num_threads(1) inside CallRoutingEnv.__init__. Without it, each SubprocVecEnv subprocess spawns ~num_cores PyTorch threads via OpenMP/MKL, and n_envs=4 × ~16 threads each = 64 threads competing for 16 cores. Net effect: vec env produces zero speedup (or worse, because of context-switch overhead). With the fix, parallelism actually scales. Setting it at module level does not help; the call must run inside the subprocess.
2. CheckpointCallback.save_freq is in callback-call units, not timestep units. With n_envs=4, you must pass save_freq // n_envs to get checkpoints at the timesteps you intend.
Both issues now have notes in the project's auto-memory and are documented in source-code comments.
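Both notes come down to simple arithmetic, sketched below. The helper names are illustrative and not part of the project or of stable-baselines3; the `max(..., 1)` guard follows the common idiom of never passing a zero frequency to a callback.

```python
def oversubscription(n_envs: int, threads_per_process: int, n_cores: int) -> float:
    """Ratio of runnable threads to physical cores (values above 1.0 mean contention)."""
    return (n_envs * threads_per_process) / n_cores


def callback_freq(timestep_interval: int, n_envs: int) -> int:
    """Convert a desired interval in timesteps into callback-call units.

    With a vectorized env, every callback call advances n_envs timesteps,
    so the value passed as save_freq must be divided by n_envs.
    """
    return max(timestep_interval // n_envs, 1)


# Issue 1: 4 subprocesses on a 16-core box.
default_threads = oversubscription(n_envs=4, threads_per_process=16, n_cores=16)  # 4.0
single_thread = oversubscription(n_envs=4, threads_per_process=1, n_cores=16)     # 0.25

# Issue 2: checkpoints intended every 25k timesteps with n_envs=4.
save_freq = callback_freq(25_000, n_envs=4)   # 6250 callback calls
uncorrected_interval = 25_000 * 4             # 100000 timesteps if 25000 is passed as-is
```

In other words, passing the timestep interval unchanged with four environments would space checkpoints four times further apart than intended.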
v2 → v3 history (preserved for context)
The v1/v2 model on this same env converged to a flat reward of ~171.5 with std ~0.05 across 50k timesteps, i.e. a non-learning policy. The root cause analysis at the time identified three issues:
1. top_k=10 pre-filtering hides advisor differentiation: the policy was being asked to discriminate among ~10 already-well-matched candidates, where predicted outcomes are tight. v3 bumps this to 50.
2. Absolute reward dwarfs reward variance: magnitude ~171.5, std ~0.05 (0.03% relative); PPO's clip-range mechanic produces near-zero advantages on such ratios. v3 switches to Δ-reward, which is zero-centered and signed.
3. No baseline comparison in the reward function. The v2 reward was the absolute predicted composite; the policy had no signal that said "this pick is better than what you would have done by default." v3 fixes this directly: reward = composite(chosen) − composite(heuristic_pick).
v3's Tier 1 implements (1) and (2)+(3); the result is the curve above. Tier 2 (reward decomposition into separate FCR / handle-time / CSAT channels, multi-objective continuous action) and Tier 3 (curriculum learning, predictor uncertainty as auxiliary reward) remain future work and would address how the policy generalizes rather than whether it learns.
How to Load
from stable_baselines3 import PPO
model = PPO.load("ppo_router.zip")
# Inference: action is the index of the chosen advisor in the top-k candidate list,
# not a global advisor_id. Pair with the env to map back to the advisor pool.
# IMPORTANT: must construct the env with top_k=50 to match the training-time
# observation/action shapes (a saved PPO model encodes those shapes).
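# `env` is not defined in this snippet: construct a CallRoutingEnv with top_k=50 first (see the repo).
obs, _ = env.reset()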
action, _ = model.predict(obs, deterministic=True)
Full pipeline code (env, training loop, inference) at tedrubin80/CEPM.
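Because the action is an index into the per-call candidate list rather than a global advisor_id, the caller has to map it back. The helper below is a hypothetical sketch of that step for a single (non-vectorized) env: `candidate_ids` stands in for whatever ordered top-k list the env exposes for the current call, which this card does not specify.

```python
from typing import Sequence


def action_to_advisor_id(action, candidate_ids: Sequence[str]) -> str:
    """Map a policy action (index into the top-k list) back to an advisor id.

    `candidate_ids` is assumed to be the ordered candidate list for the
    current call, in the same order used to build the observation.
    """
    index = int(action)  # also accepts a 0-d numpy array from model.predict
    if not 0 <= index < len(candidate_ids):
        raise IndexError(
            f"action {index} is outside the candidate list (size {len(candidate_ids)})"
        )
    return candidate_ids[index]


# Made-up candidate list for one call (real lists hold top_k=50 entries).
candidates = ["adv_17", "adv_03", "adv_42"]
picked = action_to_advisor_id(2, candidates)  # "adv_42"
```

The bounds check matters because the same index means a different advisor on every call; an index that is valid for one call's list says nothing about the next.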
License
CC-BY-NC-4.0. Non-commercial use only. Attribution required.
Citation
@software{attuned_resonance_router_2026,
author = {Rubin, Ted},
title = {Attuned Resonance RL Router: PPO Agent for Call-Center Routing (v3)},
year = {2026},
url = {https://github.com/tedrubin80/CEPM},
note = {v3 reward shaping (Δ-reward + top_k=50) produces a policy that beats the skill-weighted-random heuristic by +1.27 reward units per step over 250k timesteps.}
}