# Planning Document: Tool Wear Prediction with Two-Stage Transformer
## Current Status
**Current Phase:** Phase 8 - Signal Preprocessing & Temporal Features
**Last Updated:** December 22, 2025
**Status:** 🟡 IN PROGRESS - Best result: 22.76±11.66 μm (gap to paper: +12.71 μm)
### Quick Summary
**Best Performance:**
- Mean RMSE: **22.76 ± 11.66 μm** (CV = 51.2%)
- Fold 1: 18.41 μm
- Fold 2: 36.84 μm (problematic - train3/train4 intensity mismatch)
- Fold 3: 13.02 μm (best - only 3 μm from paper!)
- **Gap to paper (10.05 μm): +12.71 μm**
- **Gap to interpolation baseline (11.10 μm): +11.66 μm** (simple baseline beats complex model!)
**Current Strategy:**
Focus on signal preprocessing and temporal features - likely the paper's undocumented steps:
1. Temporal context (time_elapsed) - CRITICAL, likely biggest impact
2. Temporal signal structure analysis (10s ramp, 35s steady, 5s taper)
3. Cut 2 quality investigation
4. Steady-state feature extraction (exclude noise periods)
5. Bandpass filtering
**Phase Progress:**
- ✅ **Phases 0-5:** Complete - See [completed_phases.md](completed_phases.md)
- ✅ **Phase 6:** Experimentation complete - See [phase6_experimentation.md](phase6_experimentation.md)
- ✅ **Phase 7:** Baselines complete - See [phase7_baselines.md](phase7_baselines.md)
- ✅ **Phase 8 Analysis:** Core analysis complete - See [phase8_findings.md](phase8_findings.md)
- 🟡 **Phase 8 Active:** Signal preprocessing tasks (8.0.14-8.0.19)
- ⬜ **Phase 9:** Final evaluation
---
## Phase Status Overview
| Phase | Tasks | Status | Key Result | Document |
|-------|-------|--------|------------|----------|
| 0-5: Core Pipeline | 36/36 | ✅ | 353 tests, full pipeline | [completed_phases.md](completed_phases.md) |
| 6: Experimentation | 6/7 | ✅ | 27.56±11.37 μm, 41% CV | [phase6_experimentation.md](phase6_experimentation.md) |
| 7: Baselines | 3/3 | ✅ | Interp: 11.10 μm, EN: 19.35 μm | [phase7_baselines.md](phase7_baselines.md) |
| 8: Analysis | 13/18 | ✅ | 22.76±11.66 μm (best), found issues | [phase8_findings.md](phase8_findings.md) |
| **8: Active (Signal)** | **0/5** | **🟡** | **Temporal + signal preprocessing** | **This doc** |
| 9: Final Eval | 0/4 | ⬜ | TBD | Pending |
---
## Critical Context
### Why Simple Baseline Beats Complex Model
**The Problem:**
- **Interpolation baseline**: 11.10 μm (just uses time/position)
- **Our transformer**: 22.76 μm (complex model with features)
- **Paper target**: 10.05 μm
**The Insight:**
Interpolation works because wear is fundamentally **time-dependent**. Our model doesn't have temporal context! It's like predicting cumulative distance without knowing how long you've been driving.
### Key Findings from Phase 8 Analysis
**What Works:**
- ✅ Model architecture is sound (Fold 3: 13.02 μm, close to paper!)
- ✅ Per-dataset normalization helps (35% improvement)
- ✅ Guardrails prevent outliers (31% improvement)
- ✅ Labels are high quality (verified in Task 8.0.5)
**What's Missing:**
- 🔴 **Temporal context** (time_elapsed) - model doesn't know cutting history
- 🔴 **Signal preprocessing** (10s ramp + 35s steady + 5s taper structure)
- 🔴 **Cut 2 handling** (systematically problematic across datasets)
- 🔴 **Steady-state features** (current features contaminated by noise)
**What Doesn't Work:**
- ✗ train6 reprocessing (15.46 vs 13.02 μm - made it worse)
- ✗ Cut-1 anchoring on validation (groups exclude cut 1 by design)
- ✗ Variance penalty (λ=0.05, r=0.127 correlation with error - ineffective)
- ⚠️ Fold 2 unstable (train3/train4 have 10.8% lower intensity)
---
## Phase 8: Active Tasks - Signal Preprocessing & Temporal Features
> **Purpose:** Implement signal preprocessing and temporal features that paper authors likely used but didn't document. Focus on quick wins with highest ROI before committing to lengthy hyperparameter optimization.
**Status:** 🔴 BLOCKED (critical data loading bug discovered)
**Current Best:** 22.76 ± 11.66 μm
**Target:** < 16 μm (30% improvement, competitive with paper)
**Gap to Close:** 12.71 μm
---
### **Task 8.0.19:** Fix Data Loading Bug 🔥 [CRITICAL - BLOCKING]
**Status:** ⏳ IN PROGRESS
**Priority:** CRITICAL (blocks all other work)
**Estimated Time:** 2-3 hours
**Branch:** `feature/temporal-effective-cutting-time`
**Problem Discovery:**
During Task 8.0.18 analysis, discovered that `src/data/raw_loader.py` incorrectly extracts cut data:
- **Current behavior:** Splits accelerometer files evenly (total_rows // n_cuts)
- **Actual problem:** Cuts have different durations, causing misalignment
- **Evidence:** train4 cut 2 loads 88s of data, but controller says cut was only 55.79s
- Visual inspection shows cut 2 (0-65s low intensity) + cut 3 data (65-88s high intensity)
- This explains why cut 2 appears to have cutting activity when it should be a dud
**Additional Discovery - Timestamp Misalignment:**
After implementing timestamp-based loading, discovered sensor/controller timestamp offsets:
- train1-5: Sensor starts 3-21s BEFORE controller (acceptable - trimmed at start)
- train6: Sensor starts 42.6s AFTER controller (problematic - loses beginning of cuts)
- This explains train6's poor performance - incomplete data!
**Updated Dataset Links (Dec 23, 2025):**
- Training: https://drive.google.com/file/d/17rmopMdagBrcNw4Ks-P4f3eGdkvghnum/view?usp=drive_link
- Evaluation: https://drive.google.com/file/d/1vGekm6VrIjUzUBkKOCnWPRR5djNy2Yn2/view?usp=drive_link
- **Action Required:** Download and verify if timestamp alignment issues are fixed
**Impact:**
- ALL feature extraction to date has been using misaligned/contaminated data
- Explains inconsistent model performance and high variance
- train3/train4 intensity mismatches may be due to this bug
- **Must fix before proceeding with any retraining**
**Root Cause:**
Controller data provides exact `start_cut` and `end_cut` timestamps, but raw_loader ignores them.
**Solution:**
Rewrite `load_sensor_data()` to:
1. Load controller Cut_XX.csv to get start_cut/end_cut timestamps
2. Load accelerometer data Date/Time column
3. Filter accelerometer rows where timestamp is between start_cut and end_cut
4. Return only the correct time range for each cut
**Workflow:**
- [x] Review: Identified bug in data loading (equal-split assumption wrong)
- [x] Design: Plan timestamp-based extraction approach using polars
- [x] Implement: Rewrite `_get_cut_data_with_timestamps()` and `load_sensor_data()`
- [x] Run: Validate on all train1-6, cuts 1-26
- [x] Assess:
- train4 cut 2: Fixed from 88s→53.5s, now correctly identified as dud
- All cut 2s now properly show as duds (max intensity <3.0)
- Exposed train6 timestamp issue: 42s offset, incomplete data
- Old loader was hiding real data quality problems
- [x] Commit: Critical bug fix + documentation
- [x] Download new datasets (Dec 23, 2025):
- Training: https://drive.google.com/file/d/17rmopMdagBrcNw4Ks-P4f3eGdkvghnum/view
- Evaluation: https://drive.google.com/file/d/1vGekm6VrIjUzUBkKOCnWPRR5djNy2Yn2/view
- **Note:** New datasets included evaluation data and labels (deleted after verification)
- [x] Verify: ✗ New datasets have SAME timestamp issues (train6 still 42.6s offset)
- [x] Decision: Continue with existing data + timestamp-based loader fix
- [x] Regenerate ALL features with fixed loader:
- ✓ All 6 training datasets: 138,109 windows (26 MB)
- ✓ train6 now shows 19,131 windows (reflects incomplete data correctly)
- ✓ Using 40s fallback effective time (phase detection integration pending)
- ✓ All data extracted with correct temporal boundaries
---
### **Task 8.0.18:** Add effective cutting time features ⭐ [PAUSED - WAITING ON 8.0.19]
**Status:** 🟡 PAUSED (Phase B complete, blocked by data loading bug)
**Priority:** CRITICAL (highest ROI, likely THE missing piece)
**Branch:** `feature/temporal-effective-cutting-time`
**Objective:** Calculate and add ACTUAL cutting time (excluding ramp-up/ramp-down) as temporal feature
**User Requirement (Updated):**
> "Time elapsed is how much time was the tool actually working. We can calculate this by removing the ramp up and rampdown for each cut, the intensities of which can be calculated from cut 1."
**Key Insight:**
- Not just `cut_number * 50s` - need EFFECTIVE cutting time
- Each 50s cut has: ramp-up (~10s) + steady-state (~35s) + ramp-down (~5s)
- Only steady-state time counts as "actual working"
- Use cut 1 as reference to determine phase boundaries based on signal intensity
- Accumulate effective cutting time across all cuts
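The accumulation step itself is simple once per-cut steady-state durations exist (a sketch; in the pipeline the per-cut durations would come from phase detection, not be hard-coded):

```python
def cumulative_effective_time(steady_durations: list[float]) -> list[float]:
    """Cumulative 'actual working' time up to and including each cut,
    counting only the steady-state seconds (ramp-up/ramp-down excluded)."""
    total, elapsed = 0.0, []
    for duration in steady_durations:
        total += duration
        elapsed.append(total)
    return elapsed

# Three cuts whose steady-state phases lasted 35, 34 and 36 seconds:
# cumulative_effective_time([35.0, 34.0, 36.0]) -> [35.0, 69.0, 105.0]
```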
**Why This Is More Complex:**
1. Must analyze signal intensity patterns in cut 1
2. Define intensity thresholds for ramp-up/steady/ramp-down phases
3. Apply phase detection to all cuts (may vary per cut)
4. Calculate cumulative effective time excluding ramp periods
5. Handle variations in phase durations across cuts
**Implementation Phases:**
**Phase A: Signal Phase Detection (2-3 hours)**
- [x] Task 8.0.18a: Analyze cut 1 signal intensity patterns ✅ COMPLETE
- Created `scripts/analyze_phase_boundaries.py` (358 lines)
- Analyzed cut 1 from all 6 training datasets using RMS intensity
- **Algorithm:** Quantile-based thresholds with hybrid approach
- Steady-state start: 75th percentile of RMS intensity
- Steady-state end: max(55th percentile, 50% of 75th percentile)
- Hybrid threshold prevents using noise floor for signals with long quiet periods
- **Final findings:**
- Steady-state durations: 38.1-40.7s (median=40.2s, CV=2.3%)
- Ramp-up times: 12.8-27.8s (median=16.0s)
- Extremely consistent across all 6 datasets
- Raw cut files are 68-112s (include buffer before/after actual cutting)
- Visualization: `experiments/plots/phase_boundaries/cut1_phase_detection.png`
- **Key insight:** Quantile-based detection is robust and consistent across different tool profiles
- [x] Task 8.0.18b: Implement phase detection algorithm ✅ COMPLETE
- Created `src/data/phase_detection.py` module (305 lines)
- Functions:
- `calculate_rms_intensity()`: Compute RMS with sliding window smoothing
- `detect_cutting_phases()`: Quantile-based phase detection with hybrid threshold
- `calculate_effective_time()`: Cumulative effective time for cut sequences
- Created `tests/data/test_phase_detection.py` (270 lines)
- 17 unit tests, all passing
- Validated on real data: matches analysis script results exactly
- **Status:** Module ready for integration into preprocessing pipeline
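A simplified illustration of the quantile-based idea (this is a sketch of the algorithm described above, not the actual `src/data/phase_detection.py` code):

```python
import numpy as np

def detect_steady_state(rms: np.ndarray, dt: float) -> tuple[float, float]:
    """Return (ramp-up time, steady-state duration) in seconds from a
    smoothed RMS intensity series sampled every `dt` seconds.

    Entry threshold: 75th percentile of RMS intensity.
    Exit threshold: max(55th percentile, 50% of the entry threshold),
    so long quiet tails can't drag the exit level down to the noise floor.
    """
    enter = np.quantile(rms, 0.75)
    exit_ = max(np.quantile(rms, 0.55), 0.5 * enter)

    start = int(np.argmax(rms >= enter))  # first sample at cutting intensity
    end = len(rms) - 1 - int(np.argmax(rms[::-1] >= exit_))  # last above exit
    return start * dt, (end - start) * dt
```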
**Phase B: Effective Time Calculation (2 hours)**
- [x] Task 8.0.18c & 8.0.18d: Add temporal feature to preprocessing ✅ COMPLETE
- Modified `src/data/preprocessing.py::preprocess_dataset()`
- Calculates effective time for all 26 cuts per dataset using phase detection
- Added two new columns to features DataFrame:
- `effective_time_elapsed`: Cumulative steady-state duration (seconds)
- `effective_time_norm`: Normalized to 0-1 range per dataset
- Fallback: Uses 40s/cut if phase detection fails
- Logs effective time range for verification
- **Status:** Ready to regenerate all feature files
**Phase C: Feature Integration & Testing (2-3 hours)**
- [ ] Task 8.0.18e: Regenerate all training features
- Run `scripts/preprocess_features.py --force --train-only`
- Verify new column exists in all train1-6 features
- Check effective time values are reasonable (< 26*50s, monotonic)
- Backup old features before overwriting
- [ ] Task 8.0.18f: Retrain and evaluate
- Retrain Fold 3 with new temporal feature
- Compare RMSE: old (13.02 μm) vs new
- Check if temporal feature has high importance
- If successful, retrain all 3 folds
**Expected Impact:**
- **Best case**: 30-40% improvement (22.76 → 14-16 μm)
- Effective cutting time is THE missing piece
- Model learns wear as function of actual working time
- **Likely case**: 20-30% improvement (22.76 → 16-18 μm)
- Better than simple cut_number, captures real physics
- **Worst case**: 10-15% improvement
- Still significant, better than raw cut number
**Codebase Changes:**
```
New files:
src/data/phase_detection.py # Phase detection algorithm
tests/data/test_phase_detection.py # Unit tests
scripts/analyze_phase_boundaries.py # Analysis tool
Modified files:
src/data/preprocessing.py # Add effective time calculation
data/preprocessed/*_features.parquet # Regenerate with new column
Dependencies:
- Task 8.0.14 analysis can inform phase detection
- May want to combine with steady-state feature extraction
```
**Success Criteria:**
- Phase detection works reliably across all datasets
- Effective time values are physically plausible
- Temporal feature has high model importance
- RMSE improvement >15% vs baseline without temporal feature
- Model finally beats interpolation baseline (11.10 μm)
**Estimated Time:** 6-8 hours (major feature, needs careful implementation)
**Git Workflow:**
- Branch: `feature/temporal-effective-cutting-time` (created)
- Commit after each phase (A, B, C)
- Merge to main after validation shows improvement
---
### **Task 8.0.14:** Analyze temporal signal structure
**Status:** ⬜ NOT STARTED
**Priority:** HIGH
**Objective:** Quantify temporal structure of cutting signals to identify noise periods
**User Domain Insights:**
> "Most cut vibrations start after 10 seconds, continues for ~35 sec, and then have a tapering five second ending. There's lot of static noise especially at the start and the end."
**Key Hypothesis:**
- Cutting signals have 3 phases: ramp-up (0-10s), steady-state (10-45s), tapering (45-50s)
- Start/end periods contaminate features with static noise
- Steady-state period (35s) contains true cutting information
**Implementation:**
1. Load raw signal data for representative cuts
2. Visualize temporal structure:
- Plot RMS energy over time
- Identify ramp-up, steady-state, tapering boundaries
- Quantify signal-to-noise ratio per phase
3. Compare feature statistics:
- Features from full signal (0-50s)
- Features from steady-state only (10-45s)
4. Test hypothesis:
- Do steady-state features have lower variance?
- Are they more consistent across datasets?
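Step 2's "RMS energy over time" plot reduces to a windowed RMS series (a sketch; the 1 s window length is a free parameter, not something the doc specifies):

```python
import numpy as np

def rms_over_time(signal: np.ndarray, fs: float, win_s: float = 1.0) -> np.ndarray:
    """RMS energy in consecutive non-overlapping windows -- the series to
    plot when locating ramp-up / steady-state / taper boundaries."""
    win = int(fs * win_s)
    n_windows = len(signal) // win
    chunks = signal[: n_windows * win].reshape(n_windows, win)
    return np.sqrt((chunks ** 2).mean(axis=1))
```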
**Expected Findings:**
- Ramp-up period (0-10s): high noise, low signal
- Steady-state (10-45s): clean cutting vibrations
- Tapering (45-50s): decreasing signal, increasing noise
- Current features may be diluted by 30% noise (15s/50s)
**Deliverables:**
- [ ] Script: `scripts/analyze_temporal_structure.py`
- [ ] Visualization: Signal phases across datasets
- [ ] Analysis: SNR comparison by temporal phase
**Success Criteria:**
- Quantified temporal boundaries (ramp, steady, taper)
- Clear difference in SNR between phases
- Recommendation: use steady-state vs full signal
**Estimated Time:** 2 hours
---
### **Task 8.0.15:** Investigate cut 2 data quality
**Status:** ⬜ NOT STARTED
**Priority:** HIGH
**Objective:** Analyze why cut 2 is systematically problematic
**User Observation:**
> "Also, cut2 data is almost always a dud cut"
**Questions to Answer:**
1. What makes cut 2 different from other cuts?
- Signal characteristics (amplitude, frequency, noise)
- Feature statistics (mean, std, outliers)
- Wear progression (is label at cut 6 affected?)
2. Is this consistent across all datasets?
- Check all 6 training sets
- Compare cut 2 vs cut 3-6
3. Should we exclude cut 2 from training?
- Impact on group structure (cuts 2-6 → 3-6?)
- Trade-off: less data vs better quality
**Implementation:**
1. Compare cut 2 vs other cuts in group 1:
- Feature distributions (box plots, histograms)
- Signal quality metrics (SNR, outliers)
- Prediction errors (if cut 2 always wrong)
2. Visualize cut 2 signals:
- Raw waveforms across datasets
- Identify common artifacts
3. Test exclusion:
- Retrain with groups [3-6, 7-11, ...] instead of [2-6, 7-11, ...]
- Compare validation RMSE
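The group re-split in step 3 is mechanical (a sketch; the doc only spells out the first two groups, so the 5-cut stride through cut 26 is an assumption):

```python
def validation_groups(exclude_cut2: bool = False) -> list[list[int]]:
    """Cut groups for grouped validation: [2-6, 7-11, ...] by default,
    [3-6, 7-11, ...] when cut 2 is excluded. Cut 1 is never grouped
    (reserved for anchoring)."""
    first = list(range(3 if exclude_cut2 else 2, 7))
    rest = [list(range(start, start + 5)) for start in range(7, 26, 5)]
    return [first] + rest
```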
**Expected Impact:**
- If cut 2 is truly bad: 5-10% RMSE improvement by excluding it
- If cut 2 is just harder: no improvement, keep for training
**Deliverables:**
- [ ] Script: `scripts/analyze_cut2_quality.py`
- [ ] Visualization: Cut 2 vs other cuts comparison
- [ ] Recommendation: exclude or keep cut 2
**Success Criteria:**
- Quantified quality difference (if any)
- Clear decision on cut 2 handling
**Estimated Time:** 1.5 hours
---
### **Task 8.0.16:** Extract steady-state features
**Status:** ⬜ NOT STARTED
**Priority:** HIGH
**Objective:** Reprocess features using only steady-state cutting period
**Dependencies:** Task 8.0.14 (temporal analysis)
**Implementation:**
1. Update feature extraction pipeline:
- Identify steady-state window per dataset (e.g., 10-45s)
- Extract features from steady-state only
- Save as `*_features_steadystate.parquet`
2. Compare feature quality:
- Variance across datasets (steady-state vs full signal)
- Correlation with wear (better for steady-state?)
3. Retrain model with steady-state features:
- Use same architecture (no changes)
- Train on steady-state features only
- Compare validation RMSE
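Step 1's temporal windowing is just a slice of the raw per-cut signal before any feature code runs (a sketch; 10-45 s is the working hypothesis, to be replaced by the boundaries measured in Task 8.0.14):

```python
import numpy as np

def steady_state_slice(
    signal: np.ndarray, fs: float, start_s: float = 10.0, end_s: float = 45.0
) -> np.ndarray:
    """Trim a per-cut signal to its assumed steady-state window so that
    ramp-up and taper noise never reach feature extraction."""
    return signal[int(start_s * fs): int(end_s * fs)]
```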
**Expected Impact:**
- Best case: 20-30% RMSE reduction (22.76 → 16-18 μm)
- Likely case: 10-15% improvement (22.76 → 19-20 μm)
- Steady-state features should have lower fold variance
**Deliverables:**
- [ ] Updated preprocessing script with temporal windowing
- [ ] New feature files: `*_features_steadystate.parquet`
- [ ] Retrained models with steady-state features
- [ ] RMSE comparison: full signal vs steady-state
**Success Criteria:**
- Steady-state features show lower dataset variance
- Validation RMSE improves
- Fold 2 (train3/train4) benefits most
**Estimated Time:** 3 hours
---
### **Task 8.0.17:** Apply bandpass filtering
**Status:** ⬜ NOT STARTED
**Priority:** MEDIUM
**Objective:** Remove static noise through frequency-domain filtering
**Dependencies:** Task 8.0.14 (identify noise characteristics)
**Implementation:**
1. Analyze frequency spectrum:
- FFT of signals across datasets
- Identify cutting frequency range
- Identify noise frequency range (static/low-freq)
2. Design bandpass filter:
- Lower cutoff: remove DC offset and static noise
- Upper cutoff: remove high-frequency sensor noise
- Typical cutting frequencies: 50-500 Hz (verify with data)
3. Apply filtering:
- Filter raw signals before feature extraction
- Recompute features
- Save as `*_features_filtered.parquet`
4. Test impact:
- Retrain with filtered features
- Compare RMSE
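Steps 2-3 amount to a standard zero-phase Butterworth band-pass (a sketch using scipy; the 50-500 Hz band is the unverified guess from above and must be confirmed against the FFT analysis first):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(
    signal: np.ndarray, fs: float,
    low_hz: float = 50.0, high_hz: float = 500.0, order: int = 4,
) -> np.ndarray:
    """Remove DC offset/static drift below low_hz and sensor noise above
    high_hz. filtfilt runs the filter forward and backward, so no phase
    shift is introduced into the cutting signal."""
    nyq = 0.5 * fs
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="band")
    return filtfilt(b, a, signal)
```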
**Expected Impact:**
- Best case: 10-15% improvement if high noise levels
- Likely case: 5-10% improvement
- May combine with steady-state extraction
**Deliverables:**
- [ ] Script: `scripts/apply_bandpass_filter.py`
- [ ] Frequency analysis plots
- [ ] Filtered feature files
- [ ] RMSE comparison
**Success Criteria:**
- Identified optimal filter parameters
- Improved signal-to-noise ratio
- RMSE improvement (if filtering helps)
**Estimated Time:** 2 hours
---
**Phase 8 Active Tasks Summary:**
- **Total tasks:** 5 (1 CRITICAL, 3 HIGH, 1 MEDIUM priority)
- **Estimated time:** ~10.5 hours
- **Expected impact:** 20-40% RMSE reduction (22.76 → 14-18 μm)
- **Goal:** Close gap to paper (currently +12.71 μm)
- **Strategy:** Quick wins (temporal context) before lengthy optimization
---
## Phase 9: Final Evaluation & Deployment (Planned)
**Status:** ⬜ NOT STARTED
**Prerequisites:** Phase 8 signal preprocessing complete, RMSE < 18 μm
**Planned Tasks:**
1. **Task 9.1:** Implement cut-1 anchoring for test inference
- Cannot test on validation (groups exclude cut 1)
- But valid strategy for final evaluation where cut 1 is provided
2. **Task 9.2:** Run final evaluation on all test sets
- Apply best preprocessing pipeline
- Use temporal features + steady-state extraction
- Apply cut-1 anchoring at inference time
3. **Task 9.3:** Compare to paper and baselines
- Paper target: 10.05 μm
- Interpolation baseline: 11.10 μm
- Document where we stand
4. **Task 9.4:** Package and document solution
- Clean inference pipeline
- Model cards
- Usage examples
**Estimated Time:** 8-10 hours
---
## Strategic Options Beyond Phase 8
If Phase 8 signal preprocessing doesn't close the gap sufficiently (<16 μm), consider:
**Option A: Hyperparameter Optimization** (12-14 hours)
- Systematic tuning: λ, learning rates, architecture size, dropout
- Grid/random search across parameter space
- Expected: 10-15% additional improvement
**Option B: Architectural Simplification** (8-10 hours)
- Test if simpler models work better (baseline dominance suggests this)
- Try: simple LSTM, 1D CNN, or even linear regression with good features
- Expected: Unknown, but could surprise us
**Option C: Accept Current & Productionize** (8-10 hours)
- Document findings and gaps
- Package best model for deployment
- Focus on practical usability
---
## Workflow Reference
For detailed workflow guidelines, see [task_development_workflow.md](task_development_workflow.md).
**Standard workflow for each task:**
1. **Review:** Understand requirements, check dependencies
2. **Design:** Plan implementation approach
3. **Implement:** Write code, tests if needed
4. **Run:** Execute and collect results
5. **Assess:** Analyze impact, document findings
6. **Commit:** Save progress with clear message
---
## Key Metrics & Targets
**Current Best Performance:**
- Mean RMSE: 22.76 ± 11.66 μm (CV = 51.2%)
- Best fold: 13.02 μm (Fold 3)
- Worst fold: 36.84 μm (Fold 2)
**Targets:**
- **Minimum viable**: < 18 μm (20% improvement)
- **Competitive**: < 16 μm (30% improvement)
- **Match paper**: < 12 μm (47% improvement)
- **Stability**: CV < 30% (currently 51.2%)
**Baselines:**
- Paper: 10.05 μm
- Interpolation: 11.10 ± 1.40 μm (CV = 12.6%)
- Elastic Net: 19.35 ± 5.11 μm (CV = 26.4%)
---
## Quick Links
**Documentation:**
- [Completed Phases 0-5](completed_phases.md) - Core pipeline development
- [Phase 6: Experimentation](phase6_experimentation.md) - Multi-fold validation, diagnostics
- [Phase 7: Baselines](phase7_baselines.md) - Interpolation, Elastic Net, normalization
- [Phase 8: Analysis Findings](phase8_findings.md) - Scale mismatch, outliers, embeddings
- [Task Development Workflow](task_development_workflow.md) - How to execute tasks
- [Challenge Description](challenge_description.md) - Original problem statement
- [Original Paper](original_paper.md) - Target to match
**Key Scripts:**
- `scripts/run_multifold_validation.py` - Train and validate models
- `scripts/analyze_predictions.py` - Detailed prediction analysis
- `scripts/baseline_*.py` - Baseline model implementations
- `scripts/launch_mlflow_ui.sh` - View experiment tracking
**Data:**
- `data/preprocessed/` - Feature and label files
- `data/raw/` - Original challenge data
**Models:**
- `models_normalized/` - Best models with per-dataset normalization
- `models_normalized_old/` - Backups from experiments
---
## Notes
**Critical Insights:**
1. **Temporal context is key**: Simple interpolation beats complex model → we need time
2. **Signal preprocessing matters**: 10s ramp + 35s steady + 5s taper structure
3. **Cut 2 is problematic**: Systematically bad across datasets
4. **Fold 3 proves it works**: 13.02 μm (only 3 μm from paper!)
5. **Fold 2 is the challenge**: train3/train4 have different intensity (10.8% lower)
**What We've Learned:**
- Model architecture is sound (when data is clean)
- Per-dataset normalization helps (35% improvement)
- Guardrails prevent catastrophic outliers (31% improvement)
- Cut-1 anchoring valid for test time (not validation)
- train6 reprocessing didn't help (old model better)
**Next Priority:**
Task 8.0.18 Phase C (temporal feature integration) - 2-3 hours remaining, expected 20-40% improvement, highest ROI