Spaces:

jdsb06
/

meta-r2

Sleeping

File size: 6,373 Bytes

ddbc1ba

# Episodic Memory

**Source file:** `agent/memory.py`

---

## Overview

`LifeStackMemory` gives the agent access to a history of past decisions and their outcomes. On each new conflict, it retrieves the most relevant past high-reward decisions and injects them as few-shot context into the prompt. This is not RAG over a knowledge base — it's a replay buffer that shapes the agent's prior before the model even generates a token.

The 116.81% reward improvement reported in `data/before_after_comparison.json` compares the memory-augmented agent against the same model without memory context. It is not a trained-vs-untrained comparison.

---

## Storage

Three ChromaDB collections, all in `./lifestack_memory/` by default:

| Collection | Content |
|------------|---------|
| `decisions` | Per-step action records: conflict title, action type, domain, reward, reasoning |
| `trajectories` | Per-episode summaries: task ID, route taken, total reward, milestone hits |
| `feedback` | Human outcome feedback: episode ID, effectiveness rating, improved/worsened domains |

```python
memory = LifeStackMemory(path="./lifestack_memory")
# Persistent ChromaDB; falls back to in-memory client on path failure
```

On first initialization, if the `decisions` collection is empty, `LifeStackMemory` auto-hydrates from `data/preseeded_memory*.json` (3 partitioned files). This ensures the agent has something useful to retrieve on a fresh deployment without requiring prior training runs.

---

## Embedding

`LifeStackMemory` uses `SentenceTransformer('all-MiniLM-L6-v2')` for 384-dim embeddings. If the model isn't available locally (`local_files_only=True`), it falls back to a deterministic hash-based embedding:

```python
# Hash fallback: adler32 hash of each token, bucketed into 384 dimensions
for token in text.lower().split():
    idx = zlib.adler32(token.encode()) % len(buckets)
    buckets[idx] += 1.0
```

The hash fallback preserves semantic retrieval quality well enough for the lexically consistent LifeStack vocabulary (conflict titles, domain names, action types appear repeatedly).

---

## Storing decisions

```python
memory.store_decision(
    conflict_title="Friday 6PM",
    action_type="communicate",
    target_domain="relationships",
    reward=0.72,
    metrics_snapshot={"relationships.romantic": 55, "mental_wellbeing.stress_level": 80},
    reasoning="A quick call prevents relationship erosion during high-stress periods."
)
```

The stored text is `f"{conflict_title} Action: {action_type} Domain: {target_domain} Reward: {reward:.2f} {reasoning[:100]}"`. This text is embedded and stored with the full metadata dict. Only the embedding is used at retrieval time.

---

## Retrieving similar decisions

```python
similar = memory.retrieve_similar(
    conflict_title="The Perfect Storm",
    current_metrics={"career.workload": 90, "mental_wellbeing.stress_level": 85},
    n=3
)
```

Query construction: embeds `f"{conflict_title} <top_3_most_stressed_metrics>"` — the most stressed metrics are the 3 with lowest values after sorting `current_metrics.items()`. This grounds the query in the agent's current situation rather than just the conflict name.

Results are filtered to `reward >= 0.05` before returning. The function retrieves `n*2` candidates and selects the top `n` by cosine similarity after filtering.

Return format:
```python
[{
    "action_type": "communicate",
    "target_domain": "relationships",
    "reward": 0.72,
    "reasoning": "...",
    "similarity_score": 0.87,
    ...
}]
```

---

## Few-shot prompt injection

```python
few_shot = memory.build_few_shot_prompt("Friday 6PM", current_metrics)
# Output:
# --- PAST EXPERIENCE & HUMAN VERIFICATION ---
# - Action Taken: [COMMUNICATE] on RELATIONSHIPS
#   Agent's Initial Reasoning: A quick call prevents relationship erosion...
#   HUMAN FEEDBACK: Rated 8/10. Notes: Partner appreciated the transparency.
```

`build_few_shot_prompt()` calls `retrieve_similar()`, then for each retrieved decision checks whether its `episode_id` has stored feedback in the `feedback` collection. If feedback exists, it appends the effectiveness rating and unexpected effects as additional context. This is the mechanism that brings human feedback into the agent's prompt without any fine-tuning.

---

## Trajectory storage and retrieval

```python
memory.store_trajectory(
    task_id="flight_crisis_task_main",
    route_taken="rebook_premium",
    total_reward=2.5,
    trajectory_summary={"milestones_hit": ["m1"], "steps": 8}
)

similar_trajectories = memory.retrieve_similar_trajectories(
    task_domain="flight_crisis",
    current_world={"lounge_access": True, "flight_rebooked": False},
    n=3
)
```

Trajectories are stored in the separate `traj_collection`. Query construction uses `f"TaskDomain: {task_domain} <top_3_most_stressed_world_values>"`. These are surfaced to the agent in `LifeStackAgent.plan()` but are not currently injected into the main GRPO training prompt (the training prompt uses `decisions` only).

---

## Human feedback storage

```python
from core.feedback import OutcomeFeedback
from datetime import datetime

feedback = OutcomeFeedback(
    episode_id="ep_12345",
    overall_effectiveness=8,
    domains_improved=["relationships", "mental_wellbeing"],
    domains_worsened=[],
    unexpected_effects="Partner called back and offered help with finances.",
    resolution_time_hours=2.5
)
memory.store_feedback(feedback)
```

Stored at doc ID `f"fb_{episode_id}"`. Retrieved by `reward_human_feedback_fn` during GRPO training via embedding similarity on the prompt text. This is what closes the loop between real-world outcomes and training signal — if a human reports that the agent's relationship actions worked well, future training batches for similar conflicts will reward those actions more.

---

## Memory stats

```python
stats = memory.get_stats()
# {"total_memories": 145, "average_reward": 0.623, "by_action_type": {"communicate": 38, ...}}
```

---

## Related files

- `agent/agent.py` — `LifeStackAgent` uses `LifeStackMemory.build_few_shot_prompt()`
- `core/feedback.py` — `OutcomeFeedback` dataclass, `compute_human_feedback_reward()`
- `scripts/train_trl.py` — `reward_human_feedback_fn` queries the feedback collection during training
- `data/preseeded_memory*.json` — initial hydration data (decisions collection)