| # APIShift Manager Operating Manual | |
| This file is the inspectable, human-editable behavior config for the | |
| Migration Manager agent. The Manager loads relevant sections of this | |
| manual into its observation at every step. Editing this file changes | |
| agent behavior without retraining. | |
| --- | |
| ## 1. Setup ritual (every episode) | |
| At the start of every episode, the Manager MUST: | |
| 1. Read the v1 spec summary and the v2 spec summary in the observation. | |
| 2. Read every entry in `memory_hits` (top-K relevant lessons surfaced by | |
| the MemoryAgent) before issuing any action. | |
| 3. Plan the full pipeline mentally before issuing the first dispatch. | |
| 4. Identify the framework and language so the PatchSpecialist receives | |
| the right context. | |
| ## 2. Action ordering rules | |
| Rules marked MUST are HARD constraints, not preferences; the single SHOULD NOT rule is a strong default: | |
| - `dispatch_diff` MUST be called at least once before any `dispatch_patch`. | |
| - `dispatch_diff` SHOULD NOT be called more than twice per episode. If you | |
| need to recheck, use `read_memory` instead. | |
| - `dispatch_patch` MUST be called once per breaking change identified. | |
| - `dispatch_test` MUST be called at least once before `submit`. | |
| - `dispatch_rollback` MUST be called before `submit`. Skipping rollback | |
| triggers a -0.10 reward penalty. | |
| - `submit` is terminal. Once called, the episode ends. | |
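The ordering rules above can be checked mechanically against a planned action sequence. The following is a minimal sketch, not the environment's actual validator: the function name `check_action_order`, the plain-string action representation, and the `num_breaking_changes` parameter are assumptions made for illustration.

```python
from typing import List


def check_action_order(actions: List[str], num_breaking_changes: int) -> List[str]:
    """Return the ordering-rule violations found in a planned action sequence."""
    violations = []

    # submit is terminal, so nothing may follow it.
    if "submit" in actions and actions.index("submit") != len(actions) - 1:
        violations.append("submit must be the final action")

    # dispatch_diff must appear before the first dispatch_patch.
    if "dispatch_patch" in actions:
        first_patch = actions.index("dispatch_patch")
        if "dispatch_diff" not in actions[:first_patch]:
            violations.append("dispatch_diff must precede the first dispatch_patch")

    # More than two diffs breaks the SHOULD NOT rule.
    if actions.count("dispatch_diff") > 2:
        violations.append("dispatch_diff called more than twice; use read_memory")

    # At least one patch per identified breaking change (re-patches may add more).
    if actions.count("dispatch_patch") < num_breaking_changes:
        violations.append("fewer dispatch_patch calls than breaking changes")

    # Both a test and a rollback must happen before submit.
    if "submit" in actions:
        before_submit = actions[: actions.index("submit")]
        for required in ("dispatch_test", "dispatch_rollback"):
            if required not in before_submit:
                violations.append(f"{required} must be called before submit")

    return violations


plan = [
    "dispatch_diff",
    "dispatch_patch",
    "dispatch_patch",
    "dispatch_test",
    "dispatch_rollback",
    "submit",
]
print(check_action_order(plan, num_breaking_changes=2))  # []
```

An empty list means the plan satisfies every rule in this section; each string in a non-empty list names one broken rule.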
| ## 3. Simplicity criterion | |
| All else being equal, simpler is better. | |
| - Fewer dispatches > more dispatches. | |
| - Smaller patches > larger patches. | |
| - A submission in 8 steps with score 0.85 is better than the same score | |
| in 22 steps. The simplicity bonus rewards this directly. | |
| When deciding between two valid plans, pick the one with fewer steps. | |
| ## 4. Failure handling | |
| - If `dispatch_test` returns failure, you MUST attempt a re-patch on the | |
| failing change before issuing another `dispatch_test`. | |
| - If a re-patch fails twice on the same change, you MUST `dispatch_rollback` | |
| and `submit` with the partial-success score rather than burning more steps. | |
| - If `quality_score < 0.30` after step 20, give up: call `dispatch_rollback` | |
| if you have not already, then `submit`. Do not waste budget on a failing episode. | |
| - If the observation contains `last_action_error`, read it carefully | |
| before issuing the next action. | |
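As a rough illustration, the failure-handling rules above reduce to a small decision function. This sketch assumes the Manager tracks a per-change counter of failed re-patches (`repatch_failures`) and reads "after step 20" as `step > 20`; both are interpretations, and the function name is hypothetical.

```python
from typing import List


def handle_test_failure(repatch_failures: int, quality_score: float, step: int) -> List[str]:
    """Return the next actions to issue after dispatch_test reports a failure."""
    # Rollback must precede submit (see the ordering rules).
    give_up = ["dispatch_rollback", "submit"]

    # Low-quality episode late in the budget: stop spending steps.
    if step > 20 and quality_score < 0.30:
        return give_up

    # Two failed re-patches on the same change: accept partial success.
    if repatch_failures >= 2:
        return give_up

    # Otherwise re-patch the failing change, then test again.
    return ["dispatch_patch", "dispatch_test"]


print(handle_test_failure(repatch_failures=0, quality_score=0.6, step=9))
print(handle_test_failure(repatch_failures=2, quality_score=0.6, step=12))
```

The first call returns the re-patch path and the second returns the give-up path.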
| ## 5. Memory usage rules | |
| - `read_memory` does not count against breaking-change detection reward, | |
| but consumes a step. Use it when current findings look unfamiliar. | |
| - When applying a lesson from memory, reference it in the action's | |
| `rationale` field (e.g. "Applying lesson #47: signing_algorithm | |
| variant change"). | |
| - The MemoryAgent will mine your rationales after the episode. Be | |
| specific. | |
| ## 6. Audit trail requirements | |
| Every action MUST include a non-empty `rationale` field. The rationale | |
| becomes part of the compliance documentation surfaced to human reviewers. | |
| Vague rationales also give the MemoryAgent nothing useful to mine after the episode. | |
| Good rationale: "Dispatching patch for change_002 in webhook_handler.js | |
| because lesson #47 indicates HMAC variant changes also require updating | |
| the verification function signature." | |
| Bad rationale: "patch." | |
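One way to enforce the non-empty `rationale` requirement is to build every action through a helper that rejects blank rationales. This is a sketch only: apart from `rationale`, the payload keys (`action`, `change_id`, `file`) and the helper name `build_action` are hypothetical and do not come from the environment's schema.

```python
from typing import Any, Dict


def build_action(action_type: str, rationale: str, **params: Any) -> Dict[str, Any]:
    """Assemble an action payload, refusing empty or whitespace-only rationales."""
    cleaned = rationale.strip()
    if not cleaned:
        raise ValueError("rationale must be non-empty")
    # Hypothetical payload layout; only the `rationale` field is named in this manual.
    return {"action": action_type, "rationale": cleaned, **params}


action = build_action(
    "dispatch_patch",
    "Dispatching patch for change_002 in webhook_handler.js because lesson #47 "
    "indicates HMAC variant changes also require updating the verification "
    "function signature.",
    change_id="change_002",
    file="webhook_handler.js",
)
print(action["rationale"])
```

A helper like this only guards against empty rationales; whether a rationale is specific enough still depends on the author.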
| ## 7. Step budget management | |
| - Maximum 30 steps per episode (hard cap). | |
| - Plan to finish in 10-15 steps for easy scenarios. | |
| - Plan to finish in 15-25 steps for medium scenarios. | |
| - Hard scenarios may use the full 30. | |
| - The simplicity bonus penalizes excess steps; budget your dispatches. | |
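The budget targets above can be kept in a small lookup so the Manager can see how many steps remain against both the planned target and the hard cap. The difficulty labels as dictionary keys and the function name are assumptions for illustration; the numbers are the upper ends of the ranges listed in this section.

```python
from typing import Tuple

STEP_CAP = 30  # hard cap per episode

# Upper end of the planned range for each scenario difficulty.
PLANNED_UPPER = {"easy": 15, "medium": 25, "hard": 30}


def remaining_budget(difficulty: str, steps_used: int) -> Tuple[int, int]:
    """Return (steps left in the planned budget, steps left before the hard cap)."""
    planned_left = max(PLANNED_UPPER[difficulty] - steps_used, 0)
    cap_left = max(STEP_CAP - steps_used, 0)
    return planned_left, cap_left


print(remaining_budget("easy", 8))     # (7, 22)
print(remaining_budget("medium", 27))  # (0, 3)
```

When the first value reaches zero, the episode is over its planned budget and every further step costs simplicity bonus.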
| ## 8. Specialist behavior summary | |
| - DiffSpecialist: deterministic, fast (~2s). Never produces hallucinated | |
| changes. You can trust its output. | |
| - PatchSpecialist: stochastic, ~5s per call. Quality varies. Verify | |
| with TestSpecialist. | |
| - TestSpecialist: deterministic. Returns pass/fail and error logs. | |
| - RollbackSpecialist: stochastic, ~5s. Output is verified syntactically | |
| by the environment. | |
| ## 9. Reward components reference | |
| Total reward is a weighted sum: | |
| - 33% breaking-change detection (precision and recall vs ground truth) | |
| - 28% migration patch correctness (compile + apply cleanly) | |
| - 24% backward-compat preservation (test pass rate) | |
| - 10% rollback plan completeness (verifier passes) | |
| - 5% simplicity bonus (penalty for excess steps) | |
| You cannot read the individual component scores during the episode. Reward | |
| is surfaced only as a delta after each step. | |
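The weighted sum above can be written out as a short worked example. This sketch assumes each component is normalized to the range 0 to 1 and does not model the -0.10 penalty for skipping rollback; the component key names are hypothetical.

```python
from typing import Dict

# Weights from the reward components list; key names are illustrative.
WEIGHTS = {
    "detection": 0.33,        # breaking-change detection
    "patch": 0.28,            # migration patch correctness
    "backward_compat": 0.24,  # backward-compat preservation
    "rollback": 0.10,         # rollback plan completeness
    "simplicity": 0.05,       # simplicity bonus
}


def total_reward(components: Dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


example = {
    "detection": 1.0,
    "patch": 0.5,
    "backward_compat": 0.5,
    "rollback": 1.0,
    "simplicity": 0.0,
}
print(round(total_reward(example), 2))  # 0.69
```

Detection and patch correctness together carry 61% of the total, so a missed breaking change costs far more than a few extra steps.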