# Analysis: Why Accuracy Drops and How to Fix

## Issue 1: Accuracy Drops at the End

### Root Causes Found:

1. **Evaluation uses NEW tasks each time** (lines 171-175 in compare_strategies.py)
   - `general_accuracy = student.evaluate([generator.generate_task(...) for ...])`
   - Creates new tasks every iteration → variance and inconsistency
   - Should use a FIXED eval set (see the sketch after this list)
2. **Forgetting rate too aggressive for 500 iterations**
   - Forgetting rate: 0.05
   - After 500 iterations (500 time units): retention = exp(-0.05 * 500) ≈ 0
   - **All skills are forgotten by the end!**
   - Retention drops to near zero after ~50-100 time units
3. **Evaluation timing confusion**
   - Currently: Learn → Evaluate → Advance time
   - The code should be explicit about when evaluation happens relative to forgetting
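
A minimal sketch of fixes 1 and 2, assuming the `generate_task(topic, difficulty)` interface used in `compare_strategies.py`; the `topics` and `difficulties` arguments and the keyword names are illustrative assumptions:

```python
import math
import random

def build_fixed_eval_set(generator, topics, difficulties, n=100, seed=0):
    """Generate the eval set ONCE and reuse it at every checkpoint."""
    rng = random.Random(seed)  # seeded so reruns score against identical tasks
    return [generator.generate_task(topic=rng.choice(topics),
                                    difficulty=rng.choice(difficulties))
            for _ in range(n)]

# Sanity-check the forgetting horizon before picking a rate: retention = exp(-rate * t)
rate = 0.05
for t in (10, 50, 100, 500):
    print(f"t={t:4d}  retention={math.exp(-rate * t):.2e}")
# t=500 -> exp(-25) ≈ 1.4e-11: all skill has decayed, hence the end-of-run drop.
```

With rate 0.05, retention is already below 1% at t = 100, so any skill practiced early in a 500-iteration run scores as forgotten.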
## Issue 2: Accuracy Calculation Method

### Current Method:

Uses `student.evaluate(eval_tasks)`, which:

- Calls `answer()` for each task (stochastic: each answer is randomly sampled)
- Accounts for forgetting via `_get_effective_skill()`
- Returns the fraction of correct answers
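
For concreteness, a rough approximation of that behavior (not the actual `compare_strategies.py` code; `prob_correct()` is an assumed helper combining `_get_effective_skill()` with task difficulty):

```python
import random

def evaluate_sampled(student, tasks, rng=random):
    """Sampled accuracy: each answer is a Bernoulli draw, so the score is noisy."""
    correct = 0
    for task in tasks:
        p = student.prob_correct(task)  # assumed helper: effective skill vs. difficulty
        correct += rng.random() < p     # stochastic answer, mirroring answer()
    return correct / len(tasks)         # fraction correct
```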
### Problems:

1. **Stochastic variance**: random sampling introduces noise
2. **Eval tasks regenerated**: different tasks each time = inconsistent results
3. **Small eval set**: only 10-15 tasks = high variance
### Better Methods:

1. **Use a FIXED eval set** generated once at the start
2. **Use expected accuracy** instead of sampled accuracy for less variance (see the sketch after this list)
   - Expected accuracy = mean(prob_correct) over all eval tasks
3. **Larger eval set** (50-100 tasks) for stability
4. **Separate eval timing**: evaluate BEFORE the time advance
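
A sketch of method 2, reusing the assumed `prob_correct()` helper from above; given a fixed eval set, this estimator is deterministic:

```python
def evaluate_expected(student, tasks):
    """Expected accuracy: average the success probabilities instead of sampling them."""
    return sum(student.prob_correct(t) for t in tasks) / len(tasks)
```

Sampling n tasks adds binomial noise with standard deviation sqrt(p(1-p)/n), roughly 0.16 for p = 0.5 and n = 10; averaging the probabilities removes that term entirely.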
## Issue 3: Mock vs Real Components

### Current Mock Components:

**Mock Student:**

- ✅ Captures learning and forgetting
- ✅ Per-topic skill tracking
- ✅ Realistic Ebbinghaus forgetting curve
- ❌ Simplified learning model (linear skill increase)
- ❌ Stochastic, but not as complex as a real PPO agent
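
Putting those properties together, a minimal sketch of such a mock student (parameter values and the success-probability formula are illustrative, not necessarily those in `compare_strategies.py`):

```python
import math
from collections import defaultdict

class MockStudent:
    """Per-topic skill with linear learning and Ebbinghaus-style forgetting."""

    def __init__(self, learning_rate=0.1, forgetting_rate=0.05):
        self.skill = defaultdict(float)           # per-topic raw skill in [0, 1]
        self.last_practiced = defaultdict(float)  # time of last practice per topic
        self.time = 0.0
        self.learning_rate = learning_rate
        self.forgetting_rate = forgetting_rate

    def learn(self, topic):
        # Simplified model: linear skill increase, capped at 1.0.
        self.skill[topic] = min(1.0, self.skill[topic] + self.learning_rate)
        self.last_practiced[topic] = self.time

    def _get_effective_skill(self, topic):
        # Exponential decay since last practice (Ebbinghaus curve).
        elapsed = self.time - self.last_practiced[topic]
        return self.skill[topic] * math.exp(-self.forgetting_rate * elapsed)

    def prob_correct(self, task):
        # Illustrative formula: effective skill discounted by task difficulty.
        s = self._get_effective_skill(task["topic"])
        return max(0.0, min(1.0, s / task["difficulty"]))
```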
**Mock Task Generator:**

- ✅ Simple template-based tasks
- ✅ Multiple topics and difficulties
- ❌ Fixed templates (not procedural)
- ❌ Limited diversity
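
And a sketch of the template-based generator; the topics and templates here are illustrative placeholders:

```python
import random

# Fixed templates per topic: diversity is bounded by this list, which is
# exactly the limitation noted above.
TEMPLATES = {
    "arithmetic": ["What is {a} + {b}?", "What is {a} * {b}?"],
    "comparison": ["Which is larger, {a} or {b}?"],
}

def generate_task(topic, difficulty, rng=random):
    a = rng.randint(1, 10 * difficulty)
    b = rng.randint(1, 10 * difficulty)
    return {"topic": topic,
            "difficulty": difficulty,
            "prompt": rng.choice(TEMPLATES[topic]).format(a=a, b=b)}
```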
**Real Components (in MentorFlow):**

- Student: full PPO agent with a neural network policy
- Task Generator: procedural generation with 15 task families

### Will Real Components Be Better?

**Yes, likely:**

1. **Real PPO student** can learn more complex patterns
2. **Procedural task generator** provides more diverse tasks
3. **Better generalization** to unseen tasks
4. **More realistic learning curves**

**But:**

- Real components are slower to train
- They are harder to debug and verify
- The teacher agent's algorithm (UCB) should still work unchanged, since it only sees topic choices and rewards
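
To make the last point concrete, a sketch of a UCB1-style teacher. The selection rule is standard UCB1; the reward definition (e.g., per-topic accuracy gain) is an assumption:

```python
import math

class UCBTeacher:
    """Pick the topic with the best mean reward plus an exploration bonus."""

    def __init__(self, topics, c=1.4):
        self.topics = topics
        self.c = c                               # exploration constant
        self.counts = {t: 0 for t in topics}     # times each topic was chosen
        self.values = {t: 0.0 for t in topics}   # running mean reward per topic
        self.total = 0

    def select_topic(self):
        self.total += 1
        for t in self.topics:                    # try each topic once first
            if self.counts[t] == 0:
                return t
        return max(self.topics,
                   key=lambda t: self.values[t]
                   + self.c * math.sqrt(math.log(self.total) / self.counts[t]))

    def update(self, topic, reward):
        self.counts[topic] += 1
        n = self.counts[topic]
        self.values[topic] += (reward - self.values[topic]) / n  # incremental mean
```

Because it depends only on topic choices and scalar rewards, the same teacher plugs into the mock or the real student unchanged.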
## Recommended Fixes

1. **Fix evaluation to use FIXED eval sets**
2. **Reduce the forgetting rate**, or **reset time** periodically
3. **Use expected accuracy** for more stable measurements
4. **Add an option to evaluate BEFORE the time advance**
5. **Document the evaluation methodology** clearly
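
A sketch of a training loop applying fixes 1, 3, and 4 together, built from the pieces sketched above (`build_fixed_eval_set`, `evaluate_expected`, `UCBTeacher`, `MockStudent`):

```python
def run(student, teacher, eval_tasks, iterations=500, eval_every=10):
    """Learn -> evaluate (fixed set, expected accuracy, before decay) -> advance time."""
    history = []
    for i in range(iterations):
        topic = teacher.select_topic()
        student.learn(topic)                               # 1. learn
        if i % eval_every == 0:
            acc = evaluate_expected(student, eval_tasks)   # 2. evaluate BEFORE decay
            history.append((i, acc))
        student.time += 1.0                                # 3. advance time last
        # teacher.update(topic, reward) omitted: the reward definition is
        # strategy-specific (e.g., per-topic accuracy gain between checkpoints).
    return history
```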