# APIShift Architecture
This document is the design reference for APIShift. For the high-level
story, see the root README. For the operating manual that the trained
Manager reads at every step, see `apishift_program.md`.
## System Diagram
```
┌──────────────────────────────────────────────────────────────────────┐
│                           APISHIFT SYSTEM                            │
│                                                                      │
│  CurriculumAgent                                                     │
│    reads results.tsv (last 40 episodes)                              │
│    edits curriculum.md (sampling weights)                            │
│         │                                                            │
│         ▼                                                            │
│  DifficultySampler ──► Scenario library (Layer 1/2/3)                │
│         │                                                            │
│         ▼                                                            │
│  APIShiftEnvironment (OpenEnv server)                                │
│    reset(scenario_id) → Observation                                  │
│    step(action)       → Observation + reward                         │
│    state()            → episode metadata                             │
│         │                                                            │
│         ▼                                                            │
│  Migration Manager (Qwen2.5-7B + LoRA, GRPO-trained)                 │
│    reads at every step:                                              │
│      1. apishift_program.md (operating manual)                       │
│      2. v1/v2 spec summary + client code                             │
│      3. top-K lessons from lessons.md                                │
│      4. recent_episode_summary from results.tsv                      │
│         │                                                            │
│    ┌────┴──────────┬───────────────┬───────────────┐                 │
│    ▼               ▼               ▼               ▼                 │
│  Diff            Patch           Test            Rollback            │
│  Specialist      Specialist      Specialist      Specialist          │
│  (deterministic  (template-      (verifier-      (template +         │
│   walker)         driven)         based)          verifier)          │
│         │                                                            │
│         ▼                                                            │
│  Multi-component reward (5 verifiers)                                │
│         │                                                            │
│    ┌────┴──────────┐                                                 │
│    ▼               ▼                                                 │
│  TRL GRPO        MemoryAgent + EpisodeLogger                         │
│  updates LoRA    edits lessons.md, appends results.tsv               │
└──────────────────────────────────────────────────────────────────────┘
```
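The four inputs the Manager reads at every step can be pictured as one assembled context string. The sketch below is illustrative only: `Lesson`, `top_k_lessons`, and `build_context` are hypothetical names, and the word-overlap scoring is an assumption standing in for whatever retrieval the MemoryAgent actually uses.

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    """One entry from lessons.md (hypothetical structure)."""
    title: str
    body: str


def top_k_lessons(lessons, query, k=3):
    """Rank lessons by how many query words they share (toy scoring)."""
    query_words = set(query.lower().split())

    def overlap(lesson):
        words = set((lesson.title + " " + lesson.body).lower().split())
        return len(query_words & words)

    # sorted() is stable, so tied lessons keep their lessons.md order.
    ranked = sorted(lessons, key=overlap, reverse=True)
    # Drop lessons that share no words with the query at all.
    return [lesson for lesson in ranked[:k] if overlap(lesson) > 0]


def build_context(manual_section, spec_summary, client_code, lessons, recent_summary):
    """Join the four inputs into one prompt string, in diagram order."""
    lesson_lines = "\n".join(f"- {l.title}: {l.body}" for l in lessons)
    parts = [
        "## Operating manual\n" + manual_section,
        "## Spec summary and client code\n" + spec_summary + "\n" + client_code,
        "## Lessons\n" + (lesson_lines or "(none)"),
        "## Recent episodes\n" + recent_summary,
    ]
    return "\n\n".join(parts)
```

Keeping the context as plain text means the same string the Manager sees can be dumped to disk and read by a human reviewer.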
## Self-Improvement Mechanisms
APIShift improves through four independent loops:
1. **GRPO weight updates.** TRL's `GRPOTrainer` updates the Manager's
LoRA adapters every batch.
2. **Markdown memory.** The MemoryAgent appends to or updates
`lessons.md` after each successful episode. The Manager retrieves the
top-K lessons before acting on a new scenario.
3. **Adaptive curriculum.** The CurriculumAgent re-reads `results.tsv`
every 20 episodes and shifts the sampling weights in `curriculum.md` to
keep training near the Manager's frontier of mastery.
4. **Operating manual evolution.** Humans (or downstream tooling) can
edit `apishift_program.md` to change agent behavior without retraining.
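As a rough illustration of the adaptive curriculum loop, the sketch below computes per-layer success rates over a recent window and favors layers whose success rate sits near 50%. The function names, the `floor` parameter, and the "closest to 50%" rule are all assumptions made for this example; the actual CurriculumAgent policy may differ.

```python
def success_rates(rows, window=40):
    """Per-layer success rate over the most recent `window` episodes.

    `rows` is a list of (layer, solved) pairs, oldest first.
    """
    totals, wins = {}, {}
    for layer, solved in rows[-window:]:
        totals[layer] = totals.get(layer, 0) + 1
        wins[layer] = wins.get(layer, 0) + (1 if solved else 0)
    return {layer: wins[layer] / totals[layer] for layer in totals}


def reweight(rates, layers=(1, 2, 3), floor=0.05):
    """Favor layers whose success rate is closest to 50%.

    Layers that are always solved or never solved receive only the
    `floor` weight; layers with no data are treated as 50%.
    """
    raw = {}
    for layer in layers:
        rate = rates.get(layer, 0.5)
        raw[layer] = floor + (1.0 - abs(rate - 0.5) * 2.0)
    total = sum(raw.values())
    # Normalize so the weights form a sampling distribution.
    return {layer: raw[layer] / total for layer in layers}
```

A caller would run `reweight(success_rates(rows))` once every 20 episodes and write the result to `curriculum.md`. The floor keeps every layer sampled occasionally, so a layer that was once too hard can re-enter training as the Manager improves.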
## File Responsibilities
| File | Purpose |
|------|---------|
| `apishift_program.md` | Manager operating manual (sections injected into observations) |
| `lessons.md` | Cross-episode memory artifact, edited by MemoryAgent |
| `curriculum.md` | Difficulty sampling weights, edited by CurriculumAgent |
| `results.tsv` | Per-episode log, append-only |
| `models.py` | Pydantic Action/Observation/State |
| `server/environment.py` | OpenEnv Environment implementation |
| `server/reward.py` | 5-component reward function |
| `server/specialists/*.py` | Diff, Patch, Test, Rollback specialists |
| `server/memory/*.py` | MemoryAgent + retrieval |
| `server/curriculum/*.py` | CurriculumAgent + sampler |
| `server/program_loader.py` | Loads relevant section of apishift_program.md |
| `server/episode_logger.py` | Appends rows to results.tsv |
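
The append-only `results.tsv` log and the "last 40 episodes" read can be sketched with the standard `csv` module. The column names in `COLUMNS` and the helper names `append_episode` and `read_recent` are hypothetical; the real schema written by `server/episode_logger.py` is not specified in this document.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real results.tsv schema may differ.
COLUMNS = ["episode", "scenario_id", "layer", "reward", "solved"]


def append_episode(path, row):
    """Append one episode row, writing the header if the file is new or empty."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def read_recent(path, n=40):
    """Return the last `n` rows as dicts (all values come back as strings)."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    return rows[-n:]
```

Opening the file in append mode means earlier rows are never rewritten, which matches the append-only role the table assigns to `results.tsv`.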
## Why this design
Inspectability is non-negotiable. Banking compliance reviewers, OSS
users, and hackathon judges all need to read what the agent learned and
how its training distribution evolved. Markdown and TSV files satisfy
this need; vector stores and binary checkpoints do not.
The autoresearch-style "agent reads markdown that it (or a peer agent)
also writes" pattern keeps the system inspectable without sacrificing
self-improvement capability.