# Analysis: Why Accuracy Drops and How to Fix It

## Issue 1: Accuracy Drops at End ❌

### Root Causes Found:

1. **Evaluation uses NEW tasks each time** (lines 171-175 in compare_strategies.py)
   - `general_accuracy = student.evaluate([generator.generate_task(...) for ...])`
   - Creates new tasks every iteration → variance and inconsistency
   - Should use FIXED eval set

2. **Forgetting rate too aggressive for 500 iterations**
   - Forgetting rate: 0.05
   - After 500 iterations (500 time units): retention = exp(-0.05 * 500) ≈ 0.0
   - **All skills forgotten by the end!**
   - Retention drops to near-zero after ~50-100 time units

3. **Evaluation timing confusion**
   - Currently: Learn β†’ Evaluate β†’ Advance time
   - Should be clearer about when evaluation happens relative to forgetting

## Issue 2: Accuracy Calculation Method

### Current Method:
- Uses `student.evaluate(eval_tasks)` which:
  - Calls `answer()` for each task (stochastic, uses randomness)
  - Accounts for forgetting via `_get_effective_skill()`
  - Returns fraction of correct answers
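Schematically, this amounts to one Bernoulli draw per task (an assumed shape, not the verified `student.evaluate()` implementation; `prob_correct` stands in for the forgetting-adjusted success probability from `_get_effective_skill()`):

```python
import random

def sampled_evaluate(student, tasks, rng=random):
    # One random draw per task: correct with probability prob_correct(task).
    # This is where the stochastic variance in the accuracy curves comes from.
    correct = sum(rng.random() < student.prob_correct(task) for task in tasks)
    return correct / len(tasks)
```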

### Problems:
1. **Stochastic variance**: Random sampling introduces noise
2. **Eval tasks regenerated**: Different tasks each time = inconsistent
3. **Small eval set**: Only 10-15 tasks = high variance

### Better Methods:
1. **Use FIXED eval set** generated once at start
2. **Use expected accuracy** instead of sampled (less variance)
   - Expected acc = mean(prob_correct) over all tasks
3. **Larger eval set** (50-100 tasks) for stability
4. **Separate eval timing**: Evaluate BEFORE time advance
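A sketch of methods 1 and 2 above (`generate_task` and `prob_correct` are assumed signatures, not the actual MockTaskGenerator/MockStudent API):

```python
import random

def make_fixed_eval_set(generator, n=100, seed=0):
    # Generate the eval set ONCE with a fixed seed, then reuse it every
    # iteration so accuracy measurements are comparable across the run.
    rng = random.Random(seed)
    return [generator.generate_task(rng) for _ in range(n)]

def expected_accuracy(student, eval_tasks):
    # Deterministic alternative to sampled evaluation: average the success
    # probability per task instead of flipping a coin per task.
    return sum(student.prob_correct(task) for task in eval_tasks) / len(eval_tasks)
```

At p = 0.5, sampled accuracy over 10 tasks has standard deviation sqrt(0.5 * 0.5 / 10) ≈ 0.16; the expected-accuracy variant removes that sampling noise entirely.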

## Issue 3: Mock vs Real Components

### Current Mock Components:

**Mock Student:**
- ✅ Captures learning and forgetting
- ✅ Per-topic skill tracking
- ✅ Realistic Ebbinghaus curve
- ❌ Simplified learning model (linear skill increase)
- ❌ Stochastic, but far simpler than a real PPO learner

**Mock Task Generator:**
- ✅ Simple template-based tasks
- ✅ Multiple topics and difficulties
- ❌ Fixed templates (not procedural)
- ❌ Limited diversity

**Real Components (in MentorFlow):**
- Student: Full PPO agent with neural network
- Task Generator: Procedural generation with 15 task families

### Will Real Components Be Better?

**YES, likely:**
1. **Real PPO student** can learn more complex patterns
2. **Procedural task generator** provides more diverse tasks
3. **Better generalization** to unseen tasks
4. **More realistic learning curves**

**BUT:**
- Real components are slower to train
- Harder to debug and verify
- Teacher agent algorithm (UCB) should still work

## Recommended Fixes

1. **Fix evaluation to use FIXED eval sets**
2. **Reduce forgetting rate** or **reset time** periodically
3. **Use expected accuracy** for more stable measurements
4. **Add evaluation BEFORE time advance** option
5. **Document evaluation methodology** clearly
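A self-contained toy run showing why fixes 2 and 4 matter (`TinyStudent` and `run` are illustrative stand-ins for the real MockStudent and training loop, sharing only the exponential decay shape):

```python
import math

class TinyStudent:
    # Minimal stand-in: skill grows linearly with practice and decays
    # exponentially with time elapsed since the last practice.
    def __init__(self, forgetting_rate):
        self.rate = forgetting_rate
        self.skill = 0.0
        self.time = 0.0
        self.last_practice = 0.0

    def learn(self):
        self.skill = min(1.0, self.skill + 0.02)
        self.last_practice = self.time

    def advance_time(self, dt=1.0):
        self.time += dt

    def effective_skill(self):
        # Decay applied at read time, as in _get_effective_skill()
        return self.skill * math.exp(-self.rate * (self.time - self.last_practice))

def run(rate, n_iters=500, practice_until=100):
    student = TinyStudent(rate)
    for i in range(n_iters):
        if i < practice_until:
            student.learn()
        # Evaluate BEFORE advancing time, so the upcoming step's decay
        # doesn't leak into this step's measurement.
        acc = student.effective_skill()
        student.advance_time()
    return acc  # effective skill at the final iteration

# A 10x smaller forgetting rate keeps skills alive over 500 iterations:
print(run(0.05))   # ~0 after 400 idle time units
print(run(0.005))  # noticeably above zero
```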