# Analysis: Why Accuracy Drops and How to Fix
## Issue 1: Accuracy Drops at the End
### Root Causes Found:
1. **Evaluation uses NEW tasks each time** (lines 171-175 in compare_strategies.py)
- `general_accuracy = student.evaluate([generator.generate_task(...) for ...])`
- Creates new tasks every iteration → variance and inconsistency
- Should use a FIXED eval set
2. **Forgetting rate too aggressive for 500 iterations**
- Forgetting rate: 0.05
- After 500 iterations (500 time units): retention = exp(-0.05 * 500) ≈ 0.0
- **All skills forgotten by the end!**
- Retention already drops to near zero after ~50-100 time units (see the decay sketch after this list)
3. **Evaluation timing confusion**
- Currently: Learn → Evaluate → Advance time
- Should be clearer about when evaluation happens relative to forgetting
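
To make root cause 2 concrete, here is a minimal sketch of the exponential (Ebbinghaus-style) retention curve described above. The 0.05 rate and the exp(-rate * t) form come from this analysis; the function name is illustrative:

```python
import math

def retention(forgetting_rate: float, elapsed_time: float) -> float:
    """Ebbinghaus-style exponential retention: exp(-rate * t)."""
    return math.exp(-forgetting_rate * elapsed_time)

# With rate = 0.05, retention collapses well before 500 time units:
for t in (10, 50, 100, 500):
    print(f"t={t:>3}: retention = {retention(0.05, t):.4f}")
# t= 10: retention = 0.6065
# t= 50: retention = 0.0821
# t=100: retention = 0.0067
# t=500: retention = 0.0000
```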
## Issue 2: Accuracy Calculation Method
### Current Method:
- Uses `student.evaluate(eval_tasks)`, which:
- Calls `answer()` for each task (stochastic: each answer is randomly sampled)
- Accounts for forgetting via `_get_effective_skill()`
- Returns the fraction of correct answers (see the sketch after this list)
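
A hedged sketch of what this evaluation path looks like. The method names `evaluate()`, `answer()`, and `_get_effective_skill()` come from the analysis above; the bodies, attribute names, and constructor here are illustrative assumptions, not the actual implementation:

```python
import math
import random

class MockStudentSketch:
    """Illustrative shape of the evaluation path; attributes are assumed."""

    def __init__(self, skills: dict[str, float], forgetting_rate: float = 0.05):
        self.skills = skills                          # raw per-topic skill in [0, 1]
        self.forgetting_rate = forgetting_rate
        self.time = {topic: 0.0 for topic in skills}  # time since last practice

    def _get_effective_skill(self, topic: str) -> float:
        # Raw skill scaled by exponential (Ebbinghaus-style) forgetting.
        return self.skills[topic] * math.exp(-self.forgetting_rate * self.time[topic])

    def answer(self, task) -> bool:
        # Stochastic: a coin flip weighted by the effective skill.
        return random.random() < self._get_effective_skill(task.topic)

    def evaluate(self, tasks) -> float:
        # Fraction of sampled-correct answers -> a noisy estimate.
        return sum(self.answer(t) for t in tasks) / len(tasks)
```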
### Problems:
1. **Stochastic variance**: Random sampling introduces noise
2. **Eval tasks regenerated**: Different tasks each time = inconsistent
3. **Small eval set**: Only 10-15 tasks = high variance
### Better Methods:
1. **Use FIXED eval set** generated once at start
2. **Use expected accuracy** instead of sampled accuracy (less variance)
- Expected accuracy = mean(prob_correct) over all eval tasks (see the sketch after this list)
3. **Larger eval set** (50-100 tasks) for stability
4. **Separate eval timing**: Evaluate BEFORE time advance
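
A minimal sketch combining fixes 1-3: a fixed eval set generated once at startup, scored by expected accuracy rather than sampled answers. `generator.generate_task` is referenced earlier in this analysis; the topic names, difficulty levels, and the `prob_correct` accessor are assumptions:

```python
import itertools

# Fix 1 + 3: build ONE fixed eval set at startup, large enough to be stable.
TOPICS = ["arithmetic", "algebra", "logic"]   # placeholder topic names
DIFFICULTIES = [1, 2, 3]                      # placeholder difficulty levels
fixed_eval_set = [
    generator.generate_task(topic, diff)
    for topic, diff in itertools.product(TOPICS, DIFFICULTIES)
    for _ in range(10)                        # 90 tasks, vs. the current 10-15
]

# Fix 2: deterministic expected accuracy instead of sampled answers.
def expected_accuracy(student, tasks) -> float:
    # 'prob_correct' is a hypothetical accessor exposing P(correct) per task,
    # e.g. the effective-skill probability used inside answer().
    return sum(student.prob_correct(task) for task in tasks) / len(tasks)
```

Because this score is a mean of probabilities rather than of coin flips, repeated evaluations of the same student state return the same number, removing the sampling noise described above.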
## Issue 3: Mock vs Real Components
### Current Mock Components:
**Mock Student:**
- ✅ Captures learning and forgetting
- ✅ Per-topic skill tracking
- ✅ Realistic Ebbinghaus curve
- ❌ Simplified learning model (linear skill increase)
- ❌ Stochastic, but not as complex as real PPO
**Mock Task Generator:**
- ✅ Simple template-based tasks
- ✅ Multiple topics and difficulties
- ❌ Fixed templates (not procedural)
- ❌ Limited diversity
**Real Components (in MentorFlow):**
- Student: Full PPO agent with neural network
- Task Generator: Procedural generation with 15 task families
### Will Real Components Be Better?
**YES, likely:**
1. **Real PPO student** can learn more complex patterns
2. **Procedural task generator** provides more diverse tasks
3. **Better generalization** to unseen tasks
4. **More realistic learning curves**
**BUT:**
- Real components are slower to train
- Harder to debug and verify
- The teacher agent's algorithm (UCB) should still work (a minimal selection sketch follows)
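
For reference, a minimal UCB1-style arm selection rule of the kind the teacher could use. The source only states that the teacher uses UCB, so the exploration constant and bookkeeping here are assumptions:

```python
import math

def ucb_select(counts: list[int], mean_rewards: list[float], t: int,
               c: float = 2.0) -> int:
    """Pick the arm (topic / task family) with the highest UCB1 score."""
    # Try every arm once before applying the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        mean_rewards[arm] + math.sqrt(c * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)
```

Here the "reward" for an arm would be something like the student's measured learning progress on tasks from that family, which is independent of whether the student is the mock or the real PPO agent.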
## Recommended Fixes
1. **Fix evaluation to use FIXED eval sets**
2. **Reduce forgetting rate** or **reset time** periodically
3. **Use expected accuracy** for more stable measurements
4. **Add evaluation BEFORE time advance** as an option (see the loop sketch below)
5. **Document evaluation methodology** clearly
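
Putting the fixes together, the loop ordering from fix 4 might look like this. This is a sketch: `select_topic`, `learn`, and `advance_time` are assumed method names consistent with the Learn → Evaluate → Advance flow described in Issue 1, and `expected_accuracy` / `fixed_eval_set` are from the sketch above:

```python
for iteration in range(1, num_iterations + 1):
    topic = teacher.select_topic()                 # e.g. UCB pick
    student.learn(generator.generate_task(topic))  # 1. learn
    if iteration % eval_every == 0:
        # 2. evaluate on the FIXED set BEFORE forgetting is applied
        acc = expected_accuracy(student, fixed_eval_set)
        print(f"iter {iteration}: expected accuracy = {acc:.3f}")
    student.advance_time(1.0)                      # 3. only now apply forgetting
```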