Data Flow — Agentic RAG Gym

Episode Lifecycle

Agent                    Environment                    Backend
  │                          │                            │
  │──── POST /reset ────────▶│                            │
  │                          │── Select Task              │
  │                          │── Initialize State         │
  │                          │── Clear Trajectory         │
  │◀──── Observation ────────│                            │
  │                          │                            │
  │──── POST /step ─────────▶│                            │
  │    {type: "retrieve"}    │── Dispatch to Handler      │
  │                          │── Multi-Agent Processing ─▶│── FAISS Search
  │                          │◀── Retrieval Results ──────│
  │                          │── Compute Step Reward      │
  │                          │── Record to Trajectory     │
  │◀──── (obs, reward) ──────│                            │
  │                          │                            │
  │──── POST /step ─────────▶│                            │
  │    {type: "reason"}      │── Dispatch to Reasoner     │
  │                          │── Agent Reasoning ────────▶│── LLM API Call
  │                          │◀── Reasoning Output ───────│
  │                          │── Compute Step Reward      │
  │◀──── (obs, reward) ──────│                            │
  │                          │                            │
  │──── POST /step ─────────▶│                            │
  │    {type: "answer"}      │── Dispatch to Answer       │
  │                          │── Check Termination        │
  │                          │── Compute Episode Reward   │
  │◀─── (obs, reward, done) ─│                            │
  │                          │                            │
  │──── POST /grade ────────▶│                            │
  │                          │── Domain Grader Evaluation │
  │◀─── Score [0.01, 0.99] ──│                            │
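
Seen from the agent's side, the lifecycle above can be sketched as a short client loop. This is a minimal sketch, not the project's own client. The endpoint paths and the `{"type": ...}` action payload come from the diagram. The base URL, the empty request bodies for `/reset` and `/grade`, and the response keys `reward`, `done`, and `score` are assumptions made for illustration.

```python
import json
from typing import Any, Callable, Dict, List, Tuple
from urllib import request

# A "post" function takes (path, payload) and returns the decoded JSON reply.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def make_http_post(base_url: str) -> Post:
    """Build a Post function that sends JSON over HTTP (base_url is hypothetical)."""

    def post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        req = request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    return post


def run_episode(post: Post, actions: List[Dict[str, Any]]) -> Tuple[List[float], Any]:
    """Reset, step through the actions until the episode is done, then grade."""
    post("/reset", {})
    rewards: List[float] = []
    for action in actions:
        result = post("/step", action)
        # "reward" and "done" are assumed response keys, not confirmed by the source.
        rewards.append(result.get("reward", 0.0))
        if result.get("done", False):
            break
    grade = post("/grade", {})
    return rewards, grade.get("score")
```

Passing the transport in as a function keeps the loop testable without a running server. Against a live environment it could be called as `run_episode(make_http_post("http://localhost:8000"), [{"type": "retrieve"}, {"type": "reason"}, {"type": "answer"}])`, where the host and port are placeholders.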

Multi-Agent Message Flow

┌──────────┐      query      ┌──────────┐   retrieval   ┌──────────┐
│ Planner  │────────────────▶│ Retriever│───results────▶│ Reasoner │
│  Agent   │                 │  Agent   │               │  Agent   │
└──────────┘                 └──────────┘               └────┬─────┘
                                  ▲                          │
                                  │                      reasoning
                             refinement                   output
                                query                        │
                                  │                     ┌────▼─────┐
                             ┌────┴─────┐               │  Critic  │
                             │  Critic  │◀──────────────│  Agent   │
                             │  (loop)  │               └──────────┘
                             └──────────┘
                                                        ┌──────────┐
                                                        │ Verifier │
                                                        │  Agent   │
                                                        └──────────┘
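
The planner, retriever, reasoner, and critic loop in the diagram can be sketched as plain function composition. This is an illustrative sketch under stated assumptions: the four callables, their signatures, the `max_rounds` cap, and the convention that the critic returns `None` to accept the output are all hypothetical. The Verifier Agent is left out because the diagram does not show how it is connected.

```python
from typing import Callable, List, Optional, Tuple


def run_pipeline(
    task: str,
    plan: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    reason: Callable[[str, List[str]], str],
    critique: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run plan -> retrieve -> reason -> critique, looping on refinement queries.

    Returns the last reasoning output and the number of rounds used.
    """
    query = plan(task)  # Planner turns the task into an initial query
    output = ""
    for round_no in range(1, max_rounds + 1):
        docs = retrieve(query)            # Retriever fetches documents for the query
        output = reason(task, docs)       # Reasoner produces an output from the documents
        refined = critique(output, docs)  # Critic accepts (None) or proposes a new query
        if refined is None:
            return output, round_no
        query = refined                   # refinement query goes back to the Retriever
    return output, max_rounds
```

Capping the number of rounds keeps a critic that never accepts from looping forever; the real system may bound the loop differently.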

Reward Computation Pipeline

Step Record
    │
    ├──▶ Retrieval Quality Signal ──────────┐
    │    (doc relevance scores)             │
    │                                       │
    ├──▶ Reasoning Quality Signal ──────────┤
    │    (trace analysis: evidence, logic)  │     ┌──────────────────┐
    │                                       ├────▶│ Weighted         │
    ├──▶ Answer Quality Signal ─────────────┤     │ Combination      │──▶ Step Reward
    │    (length, grounding, coverage)      │     └──────────────────┘
    │                                       │           │
    ├──▶ Efficiency Signal ─────────────────┤           │
    │    (step_ratio penalty)               │           ▼
    │                                       │     Anti-Hacking
    └──▶ Anti-Hacking Penalty ──────────────┘     Verification
         (repetition, degenerate output)                │
                                                        ▼
                                                Clamped [0.01, 0.99]
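
The pipeline above amounts to a weighted sum of quality signals, a penalty, and a final clamp. The sketch below shows that arithmetic under loud assumptions: the weight values are invented for illustration, every signal is assumed to lie in [0, 1], and the anti-hacking penalty and verification stages are collapsed into one subtraction. Only the five signal names and the [0.01, 0.99] clamp range come from the diagram.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StepSignals:
    retrieval: float        # doc relevance scores
    reasoning: float        # trace analysis: evidence, logic
    answer: float           # length, grounding, coverage
    efficiency: float       # higher means fewer wasted steps (assumed convention)
    hacking_penalty: float  # repetition, degenerate output


# Hypothetical weights; the real values are not given in the source.
WEIGHTS: Dict[str, float] = {
    "retrieval": 0.3,
    "reasoning": 0.3,
    "answer": 0.3,
    "efficiency": 0.1,
}


def clamp(value: float, low: float = 0.01, high: float = 0.99) -> float:
    """Keep the reward strictly inside the open interval (0, 1)."""
    return max(low, min(high, value))


def step_reward(signals: StepSignals, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted combination of the quality signals, minus the penalty, then clamped."""
    combined = (
        weights["retrieval"] * signals.retrieval
        + weights["reasoning"] * signals.reasoning
        + weights["answer"] * signals.answer
        + weights["efficiency"] * signals.efficiency
    )
    return clamp(combined - signals.hacking_penalty)
```

Because of the clamp, a perfect step still scores 0.99 and a heavily penalized one still scores 0.01, so the reward never reaches exactly 0 or 1.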

Database Persistence Flow

Episode Complete
    │
    ├──▶ Save EpisodeRecord (episode_id, task_id, score, answer)
    │
    └──▶ Save StepLogs (per-step action, reward, reasoning trace)
            │
            └──▶ Available for:
                 • Training data extraction
                 • Performance analytics
                 • Self-improvement curriculum
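
The persistence flow above can be sketched with two record types and a single transactional save. This is a minimal sketch using SQLite from the Python standard library; the source does not say which database the project uses. The table names, column names, the `step_index` field, and the `high_scoring_steps` helper are hypothetical. The `EpisodeRecord` fields and the per-step action, reward, and reasoning trace come from the diagram.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EpisodeRecord:
    episode_id: str
    task_id: str
    score: float
    answer: str


@dataclass
class StepLog:
    episode_id: str
    step_index: int  # hypothetical ordering field
    action: str
    reward: float
    reasoning_trace: str


def init_db(conn: sqlite3.Connection) -> None:
    """Create the two tables if they do not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "episode_id TEXT PRIMARY KEY, task_id TEXT, score REAL, answer TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS step_logs ("
        "episode_id TEXT, step_index INTEGER, action TEXT, reward REAL, "
        "reasoning_trace TEXT, PRIMARY KEY (episode_id, step_index))"
    )


def save_episode(conn: sqlite3.Connection, episode: EpisodeRecord, steps: List[StepLog]) -> None:
    """Write the episode and its step logs in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO episodes VALUES (?, ?, ?, ?)",
            (episode.episode_id, episode.task_id, episode.score, episode.answer),
        )
        conn.executemany(
            "INSERT INTO step_logs VALUES (?, ?, ?, ?, ?)",
            [(s.episode_id, s.step_index, s.action, s.reward, s.reasoning_trace) for s in steps],
        )


def high_scoring_steps(conn: sqlite3.Connection, min_score: float) -> List[Tuple[str, float, str]]:
    """Return (action, reward, reasoning_trace) rows from episodes scoring at least min_score."""
    cur = conn.execute(
        "SELECT s.action, s.reward, s.reasoning_trace "
        "FROM step_logs AS s JOIN episodes AS e ON e.episode_id = s.episode_id "
        "WHERE e.score >= ? ORDER BY s.episode_id, s.step_index",
        (min_score,),
    )
    return cur.fetchall()
```

A query like `high_scoring_steps` is one way the stored logs could feed training data extraction: filtering on the episode score keeps only steps from episodes the grader rated well.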