# Analysis: Why Accuracy Drops and How to Fix
## Issue 1: Accuracy Drops at the End
### Root Causes Found:
1. **Evaluation uses NEW tasks each time** (lines 171-175 in compare_strategies.py)
- `general_accuracy = student.evaluate([generator.generate_task(...) for ...])`
- Creates new tasks every iteration → variance and inconsistency
- Should use a FIXED eval set
2. **Forgetting rate too aggressive for 500 iterations**
- Forgetting rate: 0.05
- After 500 iterations (500 time units): retention = exp(-0.05 * 500) ≈ 0.0
- **All skills forgotten by the end!**
- Retention already drops to near zero after ~50-100 time units (see the decay sketch after this list)
3. **Evaluation timing confusion**
- Currently: Learn → Evaluate → Advance time
- Should be clearer about when evaluation happens relative to forgetting
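
To make root cause 2 concrete, here is a minimal sketch of the exponential (Ebbinghaus-style) retention curve described above. The 0.05 rate and the exp(-rate * t) form come from this analysis; the function name is illustrative:

```python
import math

def retention(forgetting_rate: float, elapsed_time: float) -> float:
    """Ebbinghaus-style exponential retention: exp(-rate * t)."""
    return math.exp(-forgetting_rate * elapsed_time)

# With rate = 0.05, retention collapses well before 500 time units:
for t in (10, 50, 100, 500):
    print(f"t={t:>3}: retention = {retention(0.05, t):.4f}")
# t= 10: retention = 0.6065
# t= 50: retention = 0.0821
# t=100: retention = 0.0067
# t=500: retention = 0.0000
```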
## Issue 2: Accuracy Calculation Method
### Current Method:
- Uses `student.evaluate(eval_tasks)`, which:
- Calls `answer()` for each task (stochastic: each answer is randomly sampled)
- Accounts for forgetting via `_get_effective_skill()`
- Returns the fraction of correct answers (see the sketch after this list)
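
A hedged sketch of what this evaluation path looks like. The method names `evaluate()`, `answer()`, and `_get_effective_skill()` come from the analysis above; the bodies, attribute names, and constructor here are illustrative assumptions, not the actual implementation:

```python
import math
import random

class MockStudentSketch:
    """Illustrative shape of the evaluation path; attributes are assumed."""

    def __init__(self, skills: dict[str, float], forgetting_rate: float = 0.05):
        self.skills = skills                          # raw per-topic skill in [0, 1]
        self.forgetting_rate = forgetting_rate
        self.time = {topic: 0.0 for topic in skills}  # time since last practice

    def _get_effective_skill(self, topic: str) -> float:
        # Raw skill scaled by exponential (Ebbinghaus-style) forgetting.
        return self.skills[topic] * math.exp(-self.forgetting_rate * self.time[topic])

    def answer(self, task) -> bool:
        # Stochastic: a coin flip weighted by the effective skill.
        return random.random() < self._get_effective_skill(task.topic)

    def evaluate(self, tasks) -> float:
        # Fraction of sampled-correct answers -> a noisy estimate.
        return sum(self.answer(t) for t in tasks) / len(tasks)
```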
### Problems:
1. **Stochastic variance**: Random sampling introduces noise
2. **Eval tasks regenerated**: Different tasks each time = inconsistent
3. **Small eval set**: Only 10-15 tasks = high variance
### Better Methods:
1. **Use FIXED eval set** generated once at start
2. **Use expected accuracy** instead of sampled accuracy (less variance)
- Expected accuracy = mean(prob_correct) over all eval tasks (see the sketch after this list)
3. **Larger eval set** (50-100 tasks) for stability
4. **Separate eval timing**: Evaluate BEFORE time advance
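
A minimal sketch combining fixes 1-3: a fixed eval set generated once at startup, scored by expected accuracy rather than sampled answers. `generator.generate_task` is referenced earlier in this analysis; the topic names, difficulty levels, and the `prob_correct` accessor are assumptions:

```python
import itertools

# Fix 1 + 3: build ONE fixed eval set at startup, large enough to be stable.
TOPICS = ["arithmetic", "algebra", "logic"]   # placeholder topic names
DIFFICULTIES = [1, 2, 3]                      # placeholder difficulty levels
fixed_eval_set = [
    generator.generate_task(topic, diff)
    for topic, diff in itertools.product(TOPICS, DIFFICULTIES)
    for _ in range(10)                        # 90 tasks, vs. the current 10-15
]

# Fix 2: deterministic expected accuracy instead of sampled answers.
def expected_accuracy(student, tasks) -> float:
    # 'prob_correct' is a hypothetical accessor exposing P(correct) per task,
    # e.g. the effective-skill probability used inside answer().
    return sum(student.prob_correct(task) for task in tasks) / len(tasks)
```

Because this score is a mean of probabilities rather than of coin flips, repeated evaluations of the same student state return the same number, removing the sampling noise described above.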
## Issue 3: Mock vs Real Components
### Current Mock Components:
**Mock Student:**
- ✅ Captures learning and forgetting
- ✅ Per-topic skill tracking
- ✅ Realistic Ebbinghaus curve
- ❌ Simplified learning model (linear skill increase)
- ❌ Stochastic, but not as complex as real PPO
**Mock Task Generator:**
- ✅ Simple template-based tasks
- ✅ Multiple topics and difficulties
- ❌ Fixed templates (not procedural)
- ❌ Limited diversity
**Real Components (in MentorFlow):**
- Student: Full PPO agent with neural network
- Task Generator: Procedural generation with 15 task families
### Will Real Components Be Better?
**YES, likely:**
1. **Real PPO student** can learn more complex patterns
2. **Procedural task generator** provides more diverse tasks
3. **Better generalization** to unseen tasks
4. **More realistic learning curves**
**BUT:**
- Real components are slower to train
- Harder to debug and verify
- The teacher agent's algorithm (UCB) should still work (a minimal selection sketch follows)
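
For reference, a minimal UCB1-style arm selection rule of the kind the teacher could use. The source only states that the teacher uses UCB, so the exploration constant and bookkeeping here are assumptions:

```python
import math

def ucb_select(counts: list[int], mean_rewards: list[float], t: int,
               c: float = 2.0) -> int:
    """Pick the arm (topic / task family) with the highest UCB1 score."""
    # Try every arm once before applying the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        mean_rewards[arm] + math.sqrt(c * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)
```

Here the "reward" for an arm would be something like the student's measured learning progress on tasks from that family, which is independent of whether the student is the mock or the real PPO agent.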
## Recommended Fixes
1. **Fix evaluation to use FIXED eval sets**
2. **Reduce forgetting rate** or **reset time** periodically
3. **Use expected accuracy** for more stable measurements
4. **Add evaluation BEFORE time advance** as an option (see the loop sketch below)
5. **Document evaluation methodology** clearly
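
Putting the fixes together, the loop ordering from fix 4 might look like this. This is a sketch: `select_topic`, `learn`, and `advance_time` are assumed method names consistent with the Learn → Evaluate → Advance flow described in Issue 1, and `expected_accuracy` / `fixed_eval_set` are from the sketch above:

```python
for iteration in range(1, num_iterations + 1):
    topic = teacher.select_topic()                 # e.g. UCB pick
    student.learn(generator.generate_task(topic))  # 1. learn
    if iteration % eval_every == 0:
        # 2. evaluate on the FIXED set BEFORE forgetting is applied
        acc = expected_accuracy(student, fixed_eval_set)
        print(f"iter {iteration}: expected accuracy = {acc:.3f}")
    student.advance_time(1.0)                      # 3. only now apply forgetting
```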