faizan committed
Commit 4508e42 · 1 Parent(s): fd17037

docs: add project documentation and reference files


- Add problem statement and sample planning documents
- Add MNIST reference notebook for data loading
- Add data/raw/README.md explaining dataset structure
- Keep binary data files untracked (git-ignored)

data/raw/README.md ADDED
@@ -0,0 +1,32 @@
+ # MNIST Dataset
+
+ This directory contains the MNIST handwritten digit dataset in IDX binary format.
+
+ ## Files
+
+ ### Training Data
+ - `train-images-idx3-ubyte.gz` / `train-images.idx3-ubyte.zip` - 60,000 training images (28x28 grayscale)
+ - `train-labels-idx1-ubyte.gz` / `train-labels.idx1-ubyte` - 60,000 training labels (0-9)
+
+ ### Test Data
+ - `t10k-images-idx3-ubyte.gz` / `t10k-images.idx3-ubyte.zip` - 10,000 test images (28x28 grayscale)
+ - `t10k-labels-idx1-ubyte.gz` / `t10k-labels.idx1-ubyte` - 10,000 test labels (0-9)
+
+ ### Reference
+ - `read-mnist-dataset.ipynb` - Sample notebook demonstrating how to read the IDX format
+
+ ## Data Source
+
+ MNIST dataset from: http://yann.lecun.com/exdb/mnist/
+
+ ## Format
+
+ IDX is a simple binary format used by MNIST:
+ - Magic numbers: 2051 (images), 2049 (labels)
+ - Images: 28x28 pixels, grayscale (0-255)
+ - Labels: Single digit (0-9)
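As a quick illustration of the header layout implied by those magic numbers, the sketch below parses an IDX header with `struct`. It is not part of this repository's code; the dimension handling is inferred from the format description above (big-endian 32-bit fields, three dimensions for images and one for labels).

```python
import struct

def parse_idx_header(data: bytes) -> dict:
    """Parse an IDX header: a 4-byte big-endian magic number followed by
    one 4-byte big-endian size per dimension (3 for images, 1 for labels)."""
    magic = struct.unpack(">I", data[:4])[0]
    if magic == 2051:  # image file: count, rows, cols
        count, rows, cols = struct.unpack(">III", data[4:16])
        return {"kind": "images", "count": count, "rows": rows, "cols": cols}
    if magic == 2049:  # label file: count only
        (count,) = struct.unpack(">I", data[4:8])
        return {"kind": "labels", "count": count}
    raise ValueError(f"unexpected magic number: {magic}")

# Synthetic header for 60,000 images of 28x28:
header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(parse_idx_header(header))
```

The same header read would normally be done straight from the opened `.idx` file's first bytes.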
+
+ ## Note
+
+ Binary data files (`*.gz`, `*.zip`, `*.idx*-ubyte`) are git-ignored to avoid repository bloat.
+ Download them from the official source if needed.
data/raw/read-mnist-dataset.ipynb ADDED
@@ -0,0 +1 @@
+ {"cells":[{"metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","trusted":true},"cell_type":"code","source":"#\n# This is a sample Notebook to demonstrate how to read \"MNIST Dataset\"\n#\nimport numpy as np # linear algebra\nimport struct\nfrom array import array\nfrom os.path import join\n\n#\n# MNIST Data Loader Class\n#\nclass MnistDataloader(object):\n def __init__(self, training_images_filepath,training_labels_filepath,\n test_images_filepath, test_labels_filepath):\n self.training_images_filepath = training_images_filepath\n self.training_labels_filepath = training_labels_filepath\n self.test_images_filepath = test_images_filepath\n self.test_labels_filepath = test_labels_filepath\n \n def read_images_labels(self, images_filepath, labels_filepath): \n labels = []\n with open(labels_filepath, 'rb') as file:\n magic, size = struct.unpack(\">II\", file.read(8))\n if magic != 2049:\n raise ValueError('Magic number mismatch, expected 2049, got {}'.format(magic))\n labels = array(\"B\", file.read()) \n \n with open(images_filepath, 'rb') as file:\n magic, size, rows, cols = struct.unpack(\">IIII\", file.read(16))\n if magic != 2051:\n raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))\n image_data = array(\"B\", file.read()) \n images = []\n for i in range(size):\n images.append([0] * rows * cols)\n for i in range(size):\n img = np.array(image_data[i * rows * cols:(i + 1) * rows * cols])\n img = img.reshape(28, 28)\n images[i][:] = img \n \n return images, labels\n \n def load_data(self):\n x_train, y_train = self.read_images_labels(self.training_images_filepath, self.training_labels_filepath)\n x_test, y_test = self.read_images_labels(self.test_images_filepath, self.test_labels_filepath)\n return (x_train, y_train),(x_test, y_test) 
\n\n","execution_count":null,"outputs":[]},{"metadata":{"trusted":true,"_uuid":"d89732350afbbd4f4cf1aeaf877a9b2c0c9e9f43"},"cell_type":"code","source":"#\n# Verify Reading Dataset via MnistDataloader class\n#\n%matplotlib inline\nimport random\nimport matplotlib.pyplot as plt\n\n#\n# Set file paths based on added MNIST Datasets\n#\ninput_path = '../input'\ntraining_images_filepath = join(input_path, 'train-images-idx3-ubyte/train-images-idx3-ubyte')\ntraining_labels_filepath = join(input_path, 'train-labels-idx1-ubyte/train-labels-idx1-ubyte')\ntest_images_filepath = join(input_path, 't10k-images-idx3-ubyte/t10k-images-idx3-ubyte')\ntest_labels_filepath = join(input_path, 't10k-labels-idx1-ubyte/t10k-labels-idx1-ubyte')\n\n#\n# Helper function to show a list of images with their relating titles\n#\ndef show_images(images, title_texts):\n cols = 5\n rows = int(len(images)/cols) + 1\n plt.figure(figsize=(30,20))\n index = 1 \n for x in zip(images, title_texts): \n image = x[0] \n title_text = x[1]\n plt.subplot(rows, cols, index) \n plt.imshow(image, cmap=plt.cm.gray)\n if (title_text != ''):\n plt.title(title_text, fontsize = 15); \n index += 1\n\n#\n# Load MINST dataset\n#\nmnist_dataloader = MnistDataloader(training_images_filepath, training_labels_filepath, test_images_filepath, test_labels_filepath)\n(x_train, y_train), (x_test, y_test) = mnist_dataloader.load_data()\n\n#\n# Show some random training and test images \n#\nimages_2_show = []\ntitles_2_show = []\nfor i in range(0, 10):\n r = random.randint(1, 60000)\n images_2_show.append(x_train[r])\n titles_2_show.append('training image [' + str(r) + '] = ' + str(y_train[r])) \n\nfor i in range(0, 5):\n r = random.randint(1, 10000)\n images_2_show.append(x_test[r]) \n titles_2_show.append('test image [' + str(r) + '] = ' + str(y_test[r])) \n\nshow_images(images_2_show, titles_2_show)","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 
3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":1}
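One caveat about the sampling cells in this notebook: `random.randint(a, b)` is inclusive of `b`, so `randint(1, 60000)` can return 60000, an out-of-range index for a 60,000-element list (and index 0 is never drawn). A minimal illustration of the in-bounds alternative, `random.randrange`:

```python
import random

x_train = list(range(60000))  # stand-in for the loaded training images

# random.randint(1, 60000) includes 60000, which is one past the last valid
# index (0..59999). randrange excludes its stop value and stays in bounds:
r = random.randrange(len(x_train))
assert 0 <= r < len(x_train)
```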
docs/problem_statement.md ADDED
@@ -0,0 +1,27 @@
+ # Data-Driven Machine Learning: Ensuring Quality in Model Development
+
+ ## Project Overview:
+
+ This project provides practical experience in machine learning, covering aspects from data quality and preprocessing to model development and deployment. Students will choose one of the following tasks, each involving a different data type. The tasks are designed to highlight the importance of data quality in AI and to engage students in the entire process of machine learning model development, including software engineering best practices and deployment on a modern platform like Hugging Face.
+
+
+ * Objective: Develop a model for image recognition.
+ * Key Considerations:
+     * Data Augmentation: Enhance the dataset's diversity and robustness through augmentation techniques.
+     * Model Architecture: Select or design a convolutional neural network (CNN) for image classification.
+     * Evaluation Metrics: Use appropriate metrics like accuracy, precision, and recall for image-related tasks.
+     * SE Best Practices: Follow SE best practices for code quality, including modularization and version control.
+ * Dataset: MNIST dataset (handwritten digits).
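To make the metrics bullet concrete, here is a minimal pure-Python sketch of accuracy plus macro-averaged precision and recall on made-up digit labels (a real pipeline would more likely call a library such as scikit-learn):

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision/recall over the classes present."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)   # correctly predicted c
        fp = sum(t != c and p == c for t, p in pairs)   # predicted c, was not c
        fn = sum(t == c and p != c for t, p in pairs)   # was c, predicted other
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)

acc, prec, rec = macro_metrics([3, 3, 7, 1, 7, 1], [3, 7, 7, 1, 7, 3])
```

Macro averaging weights every class equally; for imbalanced digit sets a support-weighted average may be more informative.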
+
+
+ ## General Instructions:
+
+ * Model Development: Develop and evaluate the model, focusing on accuracy, efficiency, and interpretability while respecting SE best practices.
+ * Deployment: Deploy the model on the Hugging Face platform and showcase its application.
+ * Data Quality Report: Include a detailed analysis of data quality, challenges faced, and measures taken to ensure data integrity.
+ * Report and Presentation: Include the workflow pipeline, model development, data quality analysis, SE best practices, and key findings.
+ * You can use the dataset of your choice.
+
+ ## Final Deliverables:
+ * A well-documented 20-30 page report about HOW you solved the problem, including all of its steps (data cleaning, outlier detection, etc.)
+ * A fully functioning ipynb notebook to validate your solutions.
docs/sample/DEVELOPMENT_WORKFLOW.md.sample ADDED
@@ -0,0 +1,35 @@
+ # 🛠️ AI-Assisted Development Workflow
+
+ ## 1. Context & Environment (Pre-Flight)
+ - **Target Env:** Always verify Conda environment `ai_engg` is active.
+ - **Context Injection:** Before starting, tell Copilot: *"Read planning.md and [SPEC_FILE].md to understand the current task and architecture."*
+ - **Clean Slate:** Ensure `git status` is clean to avoid mixing concerns.
+
+ ## 2. Specification Review (Spec-First)
+ **Before writing code, extract requirements from the source of truth (Papers, RFCs, or User Stories):**
+ - **Logic/Math:** Extract exact formulas. Use LaTeX for clarity: $Loss = \sum (y_i - \hat{y}_i)^2$.
+ - **Constraints:** Identify what the code **must not** do.
+ - **Data Flow:** Map inputs and outputs before defining function signatures.
+
+ ## 3. Codebase Survey (Anti-Redundancy)
+ **MANDATORY: Search before building.**
+ - Ask Copilot: *"Are there existing utilities in the codebase that handle [X]?"*
+ - Manually check `src/utils/` or `scripts/` for reusable logic.
+ - Document found components in the task's "Reused Code" section.
+
+ ## 4. Implementation Loop
+ - **Atomic Tasks:** One task = one logic block = one commit.
+ - **Consistency:** Follow existing naming conventions and directory structures.
+ - **Drafting:** Let AI generate the boilerplate, but manually review the logic "spine."
+
+ ## 5. Verification & Polish
+ - **Functional:** Do the tests pass?
+ - **Alignment:** Does the implementation match the Spec exactly?
+ - **Code Quality:**
+   - No unused imports.
+   - Proper docstrings.
+   - Run `ruff check . --fix` or the project-specific linter.
+
+ ## 6. Cleanup & Commit
+ - **Update Planning:** Mark the task ✅ in `planning.md`.
+ - **Commit:** Use `type: brief summary` (e.g., `feat: add grouped sampler`).
+ - **Archive:** Every [X] phases, move completed tasks to `docs/archive/`.
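The `type: brief summary` commit convention is easy to check mechanically; a small sketch (the allowed type list is an assumption - adjust it to the project's own set):

```python
import re

# Assumed type list; extend to match the project's conventions.
_COMMIT_RE = re.compile(r"^(feat|fix|docs|refactor|test|chore): .+")

def valid_commit_message(msg: str) -> bool:
    """True when msg follows the `type: brief summary` convention."""
    return bool(_COMMIT_RE.match(msg))

print(valid_commit_message("feat: add grouped sampler"))  # True
print(valid_commit_message("added some stuff"))           # False
```

A check like this could run as a `commit-msg` git hook so malformed messages are rejected before they land.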
docs/sample/planning.md.sample ADDED
@@ -0,0 +1,594 @@
+ # Planning Document: Tool Wear Prediction with Two-Stage Transformer
+
+ ## Current Status
+
+ **Current Phase:** Phase 8 - Signal Preprocessing & Temporal Features
+ **Last Updated:** December 22, 2025
+ **Status:** 🟡 IN PROGRESS - Best result: 22.76±11.66 μm (gap to paper: +12.71 μm)
+
+ ### Quick Summary
+
+ **Best Performance:**
+ - Mean RMSE: **22.76 ± 11.66 μm** (CV = 51.2%)
+ - Fold 1: 18.41 μm
+ - Fold 2: 36.84 μm (problematic - train3/train4 intensity mismatch)
+ - Fold 3: 13.02 μm (best - only 3 μm from paper!)
+ - **Gap to paper (10.05 μm): +12.71 μm**
+ - **Gap to interpolation baseline (11.10 μm): +11.66 μm** (simple baseline beats complex model!)
+
+ **Current Strategy:**
+ Focus on signal preprocessing and temporal features - likely the paper's undocumented steps:
+ 1. Temporal context (time_elapsed) - CRITICAL, likely biggest impact
+ 2. Temporal signal structure analysis (10s ramp, 35s steady, 5s taper)
+ 3. Cut 2 quality investigation
+ 4. Steady-state feature extraction (exclude noise periods)
+ 5. Bandpass filtering
+
+ **Phase Progress:**
+ - ✅ **Phases 0-5:** Complete - See [completed_phases.md](completed_phases.md)
+ - ✅ **Phase 6:** Experimentation complete - See [phase6_experimentation.md](phase6_experimentation.md)
+ - ✅ **Phase 7:** Baselines complete - See [phase7_baselines.md](phase7_baselines.md)
+ - ✅ **Phase 8 Analysis:** Core analysis complete - See [phase8_findings.md](phase8_findings.md)
+ - 🟡 **Phase 8 Active:** Signal preprocessing tasks (8.0.14-8.0.18)
+ - ⬜ **Phase 9:** Final evaluation
+
+ ---
+
+ ## Phase Status Overview
+
+ | Phase | Tasks | Status | Key Result | Document |
+ |-------|-------|--------|------------|----------|
+ | 0-5: Core Pipeline | 36/36 | ✅ | 353 tests, full pipeline | [completed_phases.md](completed_phases.md) |
+ | 6: Experimentation | 6/7 | ✅ | 27.56±11.37 μm, 41% CV | [phase6_experimentation.md](phase6_experimentation.md) |
+ | 7: Baselines | 3/3 | ✅ | Interp: 11.10 μm, EN: 19.35 μm | [phase7_baselines.md](phase7_baselines.md) |
+ | 8: Analysis | 13/18 | ✅ | 22.76±11.66 μm (best), found issues | [phase8_findings.md](phase8_findings.md) |
+ | **8: Active (Signal)** | **0/5** | **🟡** | **Temporal + signal preprocessing** | **This doc** |
+ | 9: Final Eval | 0/4 | ⬜ | TBD | Pending |
+
+ ---
+
+ ## Critical Context
+
+ ### Why Simple Baseline Beats Complex Model
+
+ **The Problem:**
+ - **Interpolation baseline**: 11.10 μm (just uses time/position)
+ - **Our transformer**: 22.76 μm (complex model with features)
+ - **Paper target**: 10.05 μm
+
+ **The Insight:**
+ Interpolation works because wear is fundamentally **time-dependent**. Our model doesn't have temporal context! It's like predicting cumulative distance without knowing how long you've been driving.
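That baseline can be as simple as linear interpolation of wear over the cut index; a sketch with made-up anchor values (μm):

```python
import numpy as np

# Wear measured at a few anchor cuts (values are made up), interpolated
# linearly over the cut index for every cut in between:
anchor_cuts = np.array([1.0, 13.0, 26.0])
anchor_wear = np.array([40.0, 95.0, 150.0])
predicted = np.interp(np.arange(1, 27), anchor_cuts, anchor_wear)
```

Because the prediction is a function of position in the cut sequence alone, it encodes exactly the temporal context the feature-based model is missing.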
+
+ ### Key Findings from Phase 8 Analysis
+
+ **What Works:**
+ - ✅ Model architecture is sound (Fold 3: 13.02 μm, close to paper!)
+ - ✅ Per-dataset normalization helps (35% improvement)
+ - ✅ Guardrails prevent outliers (31% improvement)
+ - ✅ Labels are high quality (verified in Task 8.0.5)
+
+ **What's Missing:**
+ - 🔴 **Temporal context** (time_elapsed) - model doesn't know cutting history
+ - 🔴 **Signal preprocessing** (10s ramp + 35s steady + 5s taper structure)
+ - 🔴 **Cut 2 handling** (systematically problematic across datasets)
+ - 🔴 **Steady-state features** (current features contaminated by noise)
+
+ **What Doesn't Work:**
+ - ✗ train6 reprocessing (15.46 vs 13.02 μm - made it worse)
+ - ✗ Cut-1 anchoring on validation (groups exclude cut 1 by design)
+ - ✗ Variance penalty (λ=0.05, r=0.127 correlation with error - ineffective)
+ - ⚠️ Fold 2 unstable (train3/train4 have 10.8% lower intensity)
+
+ ---
+
+ ## Phase 8: Active Tasks - Signal Preprocessing & Temporal Features
+
+ > **Purpose:** Implement signal preprocessing and temporal features that the paper's authors likely used but didn't document. Focus on quick wins with the highest ROI before committing to lengthy hyperparameter optimization.
+
+ **Status:** 🔴 BLOCKED (critical data loading bug discovered)
+ **Current Best:** 22.76 ± 11.66 μm
+ **Target:** < 16 μm (30% improvement, competitive with paper)
+ **Gap to Close:** 12.71 μm
+
+ ---
+
+ ### **Task 8.0.19:** Fix Data Loading Bug 🔥 [CRITICAL - BLOCKING]
+ **Status:** ⏳ IN PROGRESS
+ **Priority:** CRITICAL (blocks all other work)
+ **Estimated Time:** 2-3 hours
+ **Branch:** `feature/temporal-effective-cutting-time`
+
+ **Problem Discovery:**
+ During Task 8.0.18 analysis, discovered that `src/data/raw_loader.py` incorrectly extracts cut data:
+ - **Current behavior:** Splits accelerometer files evenly (total_rows // n_cuts)
+ - **Actual problem:** Cuts have different durations, causing misalignment
+ - **Evidence:** train4 cut 2 loads 88s of data, but the controller says the cut was only 55.79s
+   - Visual inspection shows cut 2 (0-65s low intensity) + cut 3 data (65-88s high intensity)
+   - This explains why cut 2 appears to have cutting activity when it should be a dud
+
+ **Additional Discovery - Timestamp Misalignment:**
+ After implementing timestamp-based loading, discovered sensor/controller timestamp offsets:
+ - train1-5: Sensor starts 3-21s BEFORE controller (acceptable - trimmed at start)
+ - train6: Sensor starts 42.6s AFTER controller (problematic - loses the beginning of cuts)
+ - This explains train6's poor performance - incomplete data!
+
+ **Updated Dataset Links (Dec 23, 2025):**
+ - Training: https://drive.google.com/file/d/17rmopMdagBrcNw4Ks-P4f3eGdkvghnum/view?usp=drive_link
+ - Evaluation: https://drive.google.com/file/d/1vGekm6VrIjUzUBkKOCnWPRR5djNy2Yn2/view?usp=drive_link
+ - **Action Required:** Download and verify whether the timestamp alignment issues are fixed
+
+ **Impact:**
+ - ALL feature extraction to date has been using misaligned/contaminated data
+ - Explains inconsistent model performance and high variance
+ - train3/train4 intensity mismatches may be due to this bug
+ - **Must fix before proceeding with any retraining**
+
+ **Root Cause:**
+ Controller data provides exact `start_cut` and `end_cut` timestamps, but raw_loader ignores them.
+
+ **Solution:**
+ Rewrite `load_sensor_data()` to:
+ 1. Load controller Cut_XX.csv to get start_cut/end_cut timestamps
+ 2. Load the accelerometer data Date/Time column
+ 3. Filter accelerometer rows where the timestamp is between start_cut and end_cut
+ 4. Return only the correct time range for each cut
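Step 3 is a plain timestamp-window filter. A minimal sketch with illustrative names (the real loader reportedly uses polars; this stand-in just shows the boundary logic):

```python
from datetime import datetime

def extract_cut_rows(rows, start_cut, end_cut):
    """Keep only sensor rows whose timestamp falls inside the controller's
    [start_cut, end_cut] window. `rows` is an iterable of (timestamp, sample)
    pairs; the names are illustrative, not the real loader's API."""
    return [(ts, sample) for ts, sample in rows if start_cut <= ts <= end_cut]

# One reading per second for 60 s; the controller says the cut ran 12:00:10-12:00:45.
rows = [(datetime(2025, 1, 1, 12, 0, s), s) for s in range(60)]
cut = extract_cut_rows(rows,
                       datetime(2025, 1, 1, 12, 0, 10),
                       datetime(2025, 1, 1, 12, 0, 45))
```

This is exactly what replaces the old equal-split assumption: the window length now comes from the controller, not from `total_rows // n_cuts`.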
+
+ **Workflow:**
+ - [x] Review: Identified bug in data loading (equal-split assumption wrong)
+ - [x] Design: Plan timestamp-based extraction approach using polars
+ - [x] Implement: Rewrite `_get_cut_data_with_timestamps()` and `load_sensor_data()`
+ - [x] Run: Validate on all train1-6, cuts 1-26
+ - [x] Assess:
+   - train4 cut 2: Fixed from 88s→53.5s, now correctly identified as a dud
+   - All cut 2s now properly show as duds (max intensity <3.0)
+   - Exposed train6 timestamp issue: 42s offset, incomplete data
+   - Old loader was hiding real data quality problems
+ - [x] Commit: Critical bug fix + documentation
+ - [x] Download new datasets (Dec 23, 2025):
+   - Training: https://drive.google.com/file/d/17rmopMdagBrcNw4Ks-P4f3eGdkvghnum/view
+   - Evaluation: https://drive.google.com/file/d/1vGekm6VrIjUzUBkKOCnWPRR5djNy2Yn2/view
+   - **Note:** New datasets included evaluation data and labels (deleted after verification)
+ - [x] Verify: ✗ New datasets have the SAME timestamp issues (train6 still 42.6s offset)
+ - [x] Decision: Continue with existing data + timestamp-based loader fix
+ - [x] Regenerate ALL features with the fixed loader:
+   - ✓ All 6 training datasets: 138,109 windows (26 MB)
+   - ✓ train6 now shows 19,131 windows (reflects incomplete data correctly)
+   - ✓ Using 40s fallback effective time (phase detection integration pending)
+   - ✓ All data extracted with correct temporal boundaries
+
+ ---
+
+ ### **Task 8.0.18:** Add effective cutting time features ⭐ [PAUSED - WAITING ON 8.0.19]
+ **Status:** 🟡 PAUSED (Phase B complete, blocked by data loading bug)
+ **Priority:** CRITICAL (highest ROI, likely THE missing piece)
+ **Branch:** `feature/temporal-effective-cutting-time`
+ **Objective:** Calculate and add ACTUAL cutting time (excluding ramp-up/ramp-down) as a temporal feature
+
+ **User Requirement (Updated):**
+ > "Time elapsed is how much time the tool was actually working. We can calculate this by removing the ramp-up and ramp-down for each cut, the intensities of which can be calculated from cut 1."
+
+ **Key Insight:**
+ - Not just `cut_number * 50s` - need EFFECTIVE cutting time
+ - Each 50s cut has: ramp-up (~10s) + steady-state (~35s) + ramp-down (~5s)
+ - Only steady-state time counts as "actual working"
+ - Use cut 1 as a reference to determine phase boundaries based on signal intensity
+ - Accumulate effective cutting time across all cuts
+
+ **Why This Is More Complex:**
+ 1. Must analyze signal intensity patterns in cut 1
+ 2. Define intensity thresholds for ramp-up/steady/ramp-down phases
+ 3. Apply phase detection to all cuts (may vary per cut)
+ 4. Calculate cumulative effective time excluding ramp periods
+ 5. Handle variations in phase durations across cuts
+
+ **Implementation Phases:**
+
+ **Phase A: Signal Phase Detection (2-3 hours)**
+ - [x] Task 8.0.18a: Analyze cut 1 signal intensity patterns ✅ COMPLETE
+   - Created `scripts/analyze_phase_boundaries.py` (358 lines)
+   - Analyzed cut 1 from all 6 training datasets using RMS intensity
+   - **Algorithm:** Quantile-based thresholds with hybrid approach
+     - Steady-state start: 75th percentile of RMS intensity
+     - Steady-state end: max(55th percentile, 50% of 75th percentile)
+     - Hybrid threshold prevents using the noise floor for signals with long quiet periods
+   - **Final findings:**
+     - Steady-state durations: 38.1-40.7s (median=40.2s, CV=2.3%)
+     - Ramp-up times: 12.8-27.8s (median=16.0s)
+     - Extremely consistent across all 6 datasets
+     - Raw cut files are 68-112s (include buffer before/after actual cutting)
+   - Visualization: `experiments/plots/phase_boundaries/cut1_phase_detection.png`
+   - **Key insight:** Quantile-based detection is robust and consistent across different tool profiles
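A minimal sketch of the quantile-based idea described above (window size and the synthetic signal are illustrative; the hybrid end-of-steady-state rule is omitted, so this is not the real `detect_cutting_phases()`):

```python
import numpy as np

def rms_intensity(signal, window=50):
    """Sliding-window RMS: a smoothed intensity envelope of the raw signal."""
    weights = np.ones(window) / window
    return np.sqrt(np.convolve(np.asarray(signal, dtype=float) ** 2, weights, mode="same"))

def steady_state_mask(rms, quantile=0.75):
    """Samples at or above the 75th-percentile intensity threshold."""
    return rms >= np.quantile(rms, quantile)

# Quiet / loud / quiet synthetic signal, loosely mimicking ramp / steady / taper:
sig = np.concatenate([np.zeros(400), np.full(400, 5.0), np.zeros(400)])
mask = steady_state_mask(rms_intensity(sig))
```

On the synthetic signal the mask picks out the loud middle segment; the boundaries of that mask are what would be turned into phase start/end times.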
+
+ - [x] Task 8.0.18b: Implement phase detection algorithm ✅ COMPLETE
+   - Created `src/data/phase_detection.py` module (305 lines)
+   - Functions:
+     - `calculate_rms_intensity()`: Compute RMS with sliding window smoothing
+     - `detect_cutting_phases()`: Quantile-based phase detection with hybrid threshold
+     - `calculate_effective_time()`: Cumulative effective time for cut sequences
+   - Created `tests/data/test_phase_detection.py` (270 lines)
+     - 17 unit tests, all passing
+     - Validated on real data: matches analysis script results exactly
+   - **Status:** Module ready for integration into preprocessing pipeline
+
+ **Phase B: Effective Time Calculation (2 hours)**
+ - [x] Task 8.0.18c & 8.0.18d: Add temporal feature to preprocessing ✅ COMPLETE
+   - Modified `src/data/preprocessing.py::preprocess_dataset()`
+   - Calculates effective time for all 26 cuts per dataset using phase detection
+   - Added two new columns to the features DataFrame:
+     - `effective_time_elapsed`: Cumulative steady-state duration (seconds)
+     - `effective_time_norm`: Normalized to 0-1 range per dataset
+   - Fallback: Uses 40s/cut if phase detection fails
+   - Logs effective time range for verification
+   - **Status:** Ready to regenerate all feature files
+
+ **Phase C: Feature Integration & Testing (2-3 hours)**
+ - [ ] Task 8.0.18e: Regenerate all training features
+   - Run `scripts/preprocess_features.py --force --train-only`
+   - Verify the new column exists in all train1-6 features
+   - Check effective time values are reasonable (< 26*50s, monotonic)
+   - Backup old features before overwriting
+
+ - [ ] Task 8.0.18f: Retrain and evaluate
+   - Retrain Fold 3 with the new temporal feature
+   - Compare RMSE: old (13.02 μm) vs new
+   - Check if the temporal feature has high importance
+   - If successful, retrain all 3 folds
+
+ **Expected Impact:**
+ - **Best case**: 30-40% improvement (22.76 → 14-16 μm)
+   - Effective cutting time is THE missing piece
+   - Model learns wear as a function of actual working time
+ - **Likely case**: 20-30% improvement (22.76 → 16-18 μm)
+   - Better than simple cut_number, captures real physics
+ - **Worst case**: 10-15% improvement
+   - Still significant, better than raw cut number
+
+ **Codebase Changes:**
+ ```
+ New files:
+   src/data/phase_detection.py          # Phase detection algorithm
+   tests/data/test_phase_detection.py   # Unit tests
+   scripts/analyze_phase_boundaries.py  # Analysis tool
+
+ Modified files:
+   src/data/preprocessing.py            # Add effective time calculation
+   data/preprocessed/*_features.parquet # Regenerate with new column
+
+ Dependencies:
+ - Task 8.0.14 analysis can inform phase detection
+ - May want to combine with steady-state feature extraction
+ ```
+
+ **Success Criteria:**
+ - Phase detection works reliably across all datasets
+ - Effective time values are physically plausible
+ - Temporal feature has high model importance
+ - RMSE improvement >15% vs baseline without temporal feature
+ - Model finally beats interpolation baseline (11.10 μm)
+
+ **Estimated Time:** 6-8 hours (major feature, needs careful implementation)
+
+ **Git Workflow:**
+ - Branch: `feature/temporal-effective-cutting-time` (created)
+ - Commit after each phase (A, B, C)
+ - Merge to main after validation shows improvement
+
+ ---
+
+ ### **Task 8.0.14:** Analyze temporal signal structure
+ **Status:** ⬜ NOT STARTED
+ **Priority:** HIGH
+ **Objective:** Quantify temporal structure of cutting signals to identify noise periods
+
+ **User Domain Insights:**
+ > "Most cut vibrations start after 10 seconds, continue for ~35 sec, and then have a tapering five-second ending. There's a lot of static noise, especially at the start and the end."
+
+ **Key Hypothesis:**
+ - Cutting signals have 3 phases: ramp-up (0-10s), steady-state (10-45s), tapering (45-50s)
+ - Start/end periods contaminate features with static noise
+ - Steady-state period (35s) contains the true cutting information
+
+ **Implementation:**
+ 1. Load raw signal data for representative cuts
+ 2. Visualize temporal structure:
+    - Plot RMS energy over time
+    - Identify ramp-up, steady-state, tapering boundaries
+    - Quantify signal-to-noise ratio per phase
+ 3. Compare feature statistics:
+    - Features from full signal (0-50s)
+    - Features from steady-state only (10-45s)
+ 4. Test hypothesis:
+    - Do steady-state features have lower variance?
+    - Are they more consistent across datasets?
+
+ **Expected Findings:**
+ - Ramp-up period (0-10s): high noise, low signal
+ - Steady-state (10-45s): clean cutting vibrations
+ - Tapering (45-50s): decreasing signal, increasing noise
+ - Current features may be diluted by 30% noise (15s/50s)
+
+ **Deliverables:**
+ - [ ] Script: `scripts/analyze_temporal_structure.py`
+ - [ ] Visualization: Signal phases across datasets
+ - [ ] Analysis: SNR comparison by temporal phase
+
+ **Success Criteria:**
+ - Quantified temporal boundaries (ramp, steady, taper)
+ - Clear difference in SNR between phases
+ - Recommendation: use steady-state vs full signal
+
+ **Estimated Time:** 2 hours
+
+ ---
+
+ ### **Task 8.0.15:** Investigate cut 2 data quality
+ **Status:** ⬜ NOT STARTED
+ **Priority:** HIGH
+ **Objective:** Analyze why cut 2 is systematically problematic
+
+ **User Observation:**
+ > "Also, cut2 data is almost always a dud cut"
+
+ **Questions to Answer:**
+ 1. What makes cut 2 different from other cuts?
+    - Signal characteristics (amplitude, frequency, noise)
+    - Feature statistics (mean, std, outliers)
+    - Wear progression (is the label at cut 6 affected?)
+ 2. Is this consistent across all datasets?
+    - Check all 6 training sets
+    - Compare cut 2 vs cuts 3-6
+ 3. Should we exclude cut 2 from training?
+    - Impact on group structure (cuts 2-6 → 3-6?)
+    - Trade-off: less data vs better quality
+
+ **Implementation:**
+ 1. Compare cut 2 vs other cuts in group 1:
+    - Feature distributions (box plots, histograms)
+    - Signal quality metrics (SNR, outliers)
+    - Prediction errors (if cut 2 is always wrong)
+ 2. Visualize cut 2 signals:
+    - Raw waveforms across datasets
+    - Identify common artifacts
+ 3. Test exclusion:
+    - Retrain with groups [3-6, 7-11, ...] instead of [2-6, 7-11, ...]
+    - Compare validation RMSE
+
+ **Expected Impact:**
+ - If cut 2 is truly bad: 5-10% RMSE improvement by excluding it
+ - If cut 2 is just harder: no improvement, keep it for training
+
+ **Deliverables:**
+ - [ ] Script: `scripts/analyze_cut2_quality.py`
+ - [ ] Visualization: Cut 2 vs other cuts comparison
+ - [ ] Recommendation: exclude or keep cut 2
+
+ **Success Criteria:**
+ - Quantified quality difference (if any)
+ - Clear decision on cut 2 handling
+
+ **Estimated Time:** 1.5 hours
+
+ ---
+
+ ### **Task 8.0.16:** Extract steady-state features
+ **Status:** ⬜ NOT STARTED
+ **Priority:** HIGH
+ **Objective:** Reprocess features using only the steady-state cutting period
+
+ **Dependencies:** Task 8.0.14 (temporal analysis)
+
+ **Implementation:**
+ 1. Update feature extraction pipeline:
+    - Identify the steady-state window per dataset (e.g., 10-45s)
+    - Extract features from steady-state only
+    - Save as `*_features_steadystate.parquet`
+ 2. Compare feature quality:
+    - Variance across datasets (steady-state vs full signal)
+    - Correlation with wear (better for steady-state?)
+ 3. Retrain model with steady-state features:
+    - Use same architecture (no changes)
+    - Train on steady-state features only
+    - Compare validation RMSE
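Step 1 can be sketched as a simple windowed feature extractor; the feature set and names below are illustrative stand-ins for the real pipeline:

```python
import numpy as np

def steadystate_features(signal, fs, start_s=10.0, end_s=45.0):
    """Features computed from the steady-state window only (10-45 s per the
    hypothesis in Task 8.0.14); the three features are made-up stand-ins."""
    segment = np.asarray(signal, dtype=float)[int(start_s * fs):int(end_s * fs)]
    return {
        "rms": float(np.sqrt(np.mean(segment ** 2))),
        "std": float(segment.std()),
        "peak": float(np.abs(segment).max()),
    }

# 50 s synthetic cut at 100 Hz: quiet ramp-up, constant steady-state, quiet taper.
fs = 100
sig = np.concatenate([np.zeros(10 * fs), np.full(35 * fs, 2.0), np.zeros(5 * fs)])
feats = steadystate_features(sig, fs)  # sees only the constant 2.0 section
```

Features from the full 0-50 s signal would be pulled toward the quiet ramp and taper; windowing is what removes that dilution.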
+
+ **Expected Impact:**
+ - Best case: 20-30% RMSE reduction (22.76 → 16-18 μm)
+ - Likely case: 10-15% improvement (22.76 → 19-20 μm)
+ - Steady-state features should have lower fold variance
+
+ **Deliverables:**
+ - [ ] Updated preprocessing script with temporal windowing
+ - [ ] New feature files: `*_features_steadystate.parquet`
+ - [ ] Retrained models with steady-state features
+ - [ ] RMSE comparison: full signal vs steady-state
+
+ **Success Criteria:**
+ - Steady-state features show lower dataset variance
+ - Validation RMSE improves
+ - Fold 2 (train3/train4) benefits most
+
+ **Estimated Time:** 3 hours
+
+ ---
+
+ ### **Task 8.0.17:** Apply bandpass filtering
+ **Status:** ⬜ NOT STARTED
+ **Priority:** MEDIUM
+ **Objective:** Remove static noise through frequency-domain filtering
+
+ **Dependencies:** Task 8.0.14 (identify noise characteristics)
+
+ **Implementation:**
+ 1. Analyze frequency spectrum:
+    - FFT of signals across datasets
+    - Identify cutting frequency range
+    - Identify noise frequency range (static/low-freq)
+ 2. Design bandpass filter:
+    - Lower cutoff: remove DC offset and static noise
+    - Upper cutoff: remove high-frequency sensor noise
+    - Typical cutting frequencies: 50-500 Hz (verify with data)
+ 3. Apply filtering:
+    - Filter raw signals before feature extraction
+    - Recompute features
+    - Save as `*_features_filtered.parquet`
+ 4. Test impact:
+    - Retrain with filtered features
+    - Compare RMSE
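Step 2 might look like the following SciPy sketch. The 50-500 Hz cutoffs are the document's own to-be-verified guess; the sampling rate and test tones are made up:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(signal, fs, low=50.0, high=500.0, order=4):
    """Zero-phase Butterworth bandpass; cutoffs default to the 50-500 Hz
    range quoted above, which still needs verifying against the data."""
    b, a = butter(order, [low, high], btype="band", fs=fs)
    return filtfilt(b, a, signal)

fs = 2000.0
t = np.arange(0, 1.0, 1 / fs)
raw = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 5 * t)  # cutting tone + slow drift
clean = bandpass(raw, fs)  # the 5 Hz drift is removed, the 200 Hz tone kept
```

`filtfilt` runs the filter forward and backward, so the filtered signal stays time-aligned with the raw one - useful when the phase boundaries from Task 8.0.14 are reused downstream.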
436
+
**Expected Impact:**
- Best case: 10-15% improvement if noise levels are high
- Likely case: 5-10% improvement
- May combine with steady-state extraction

**Deliverables:**
- [ ] Script: `scripts/apply_bandpass_filter.py`
- [ ] Frequency analysis plots
- [ ] Filtered feature files
- [ ] RMSE comparison

**Success Criteria:**
- Identified optimal filter parameters
- Improved signal-to-noise ratio
- RMSE improvement (if filtering helps)

**Estimated Time:** 2 hours

---

**Phase 8 Active Tasks Summary:**
- **Total tasks:** 5 (1 CRITICAL + 4 HIGH priority)
- **Estimated time:** ~10.5 hours
- **Expected impact:** 20-40% RMSE reduction (22.76 → 14-18 μm)
- **Goal:** Close the gap to the paper (currently +12.71 μm)
- **Strategy:** Quick wins (temporal context) before lengthy optimization

---

## Phase 9: Final Evaluation & Deployment (Planned)

**Status:** ⬜ NOT STARTED
**Prerequisites:** Phase 8 signal preprocessing complete, RMSE < 18 μm

**Planned Tasks:**
1. **Task 9.1:** Implement cut-1 anchoring for test inference
   - Cannot be tested on validation (validation groups exclude cut 1)
   - Valid strategy for final evaluation, where cut 1 is provided

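The anchoring scheme is still to be designed; one minimal possibility is an additive offset that shifts the predicted wear curve so cut 1 matches its provided ground-truth value (function name and wear values below are illustrative, not from the project):

```python
import numpy as np

def anchor_to_cut1(preds: np.ndarray, cut1_true: float) -> np.ndarray:
    """Shift a per-cut wear-prediction series so the first cut matches
    its known value. An additive offset is one of several possible
    anchoring schemes; shown here as an illustration only."""
    return preds + (cut1_true - preds[0])

preds = np.array([30.0, 45.0, 55.0, 70.0])  # hypothetical wear in μm
anchored = anchor_to_cut1(preds, cut1_true=25.0)
print(anchored)  # [25. 40. 50. 65.]
```

A multiplicative rescaling (`preds * cut1_true / preds[0]`) is an alternative if errors grow with wear level.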
2. **Task 9.2:** Run final evaluation on all test sets
   - Apply best preprocessing pipeline
   - Use temporal features + steady-state extraction
   - Apply cut-1 anchoring at inference time

3. **Task 9.3:** Compare to paper and baselines
   - Paper target: 10.05 μm
   - Interpolation baseline: 11.10 μm
   - Document where we stand

4. **Task 9.4:** Package and document solution
   - Clean inference pipeline
   - Model cards
   - Usage examples

**Estimated Time:** 8-10 hours

---

## Strategic Options Beyond Phase 8

If Phase 8 signal preprocessing doesn't close the gap sufficiently (RMSE still above 16 μm), consider:

**Option A: Hyperparameter Optimization** (12-14 hours)
- Systematic tuning: λ, learning rates, architecture size, dropout
- Grid/random search across the parameter space
- Expected: 10-15% additional improvement

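The random-search half of Option A can be sketched as below; the parameter names and grid values are placeholders, not tuned ranges from this project:

```python
import random

# Hypothetical search space for Option A (names and values are assumptions).
space = {
    "lambda": [1e-4, 1e-3, 1e-2, 1e-1],
    "lr": [1e-4, 3e-4, 1e-3],
    "hidden": [32, 64, 128],
    "dropout": [0.0, 0.1, 0.3],
}

def sample_configs(space: dict, n: int, seed: int = 0) -> list[dict]:
    """Draw n random configurations from the grid (with replacement)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]

for cfg in sample_configs(space, 3):
    print(cfg)  # each config would be trained and scored by fold RMSE
```

Random search covers a 4x3x3x3 = 108-point grid with far fewer runs than an exhaustive sweep, which matters at roughly an hour per multi-fold training run.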
**Option B: Architectural Simplification** (8-10 hours)
- Test if simpler models work better (baseline dominance suggests this)
- Try: simple LSTM, 1D CNN, or even linear regression with good features
- Expected: unknown, but could surprise us

**Option C: Accept Current & Productionize** (8-10 hours)
- Document findings and gaps
- Package best model for deployment
- Focus on practical usability

---

## Workflow Reference

For detailed workflow guidelines, see [task_development_workflow.md](task_development_workflow.md).

**Standard workflow for each task:**
1. **Review:** Understand requirements, check dependencies
2. **Design:** Plan implementation approach
3. **Implement:** Write code, tests if needed
4. **Run:** Execute and collect results
5. **Assess:** Analyze impact, document findings
6. **Commit:** Save progress with clear message

---

## Key Metrics & Targets

**Current Best Performance:**
- Mean RMSE: 22.76 ± 11.66 μm (CV = 51.2%)
- Best fold: 13.02 μm (Fold 3)
- Worst fold: 36.84 μm (Fold 2)

**Targets:**
- **Minimum viable:** < 18 μm (20% improvement)
- **Competitive:** < 16 μm (30% improvement)
- **Match paper:** < 12 μm (47% improvement)
- **Stability:** CV < 30% (currently 51.2%)

**Baselines:**
- Paper: 10.05 μm
- Interpolation: 11.10 ± 1.40 μm (CV = 12.6%)
- Elastic Net: 19.35 ± 5.11 μm (CV = 26.4%)

---

## Quick Links

**Documentation:**
- [Completed Phases 0-5](completed_phases.md) - Core pipeline development
- [Phase 6: Experimentation](phase6_experimentation.md) - Multi-fold validation, diagnostics
- [Phase 7: Baselines](phase7_baselines.md) - Interpolation, Elastic Net, normalization
- [Phase 8: Analysis Findings](phase8_findings.md) - Scale mismatch, outliers, embeddings
- [Task Development Workflow](task_development_workflow.md) - How to execute tasks
- [Challenge Description](challenge_description.md) - Original problem statement
- [Original Paper](original_paper.md) - Target to match

**Key Scripts:**
- `scripts/run_multifold_validation.py` - Train and validate models
- `scripts/analyze_predictions.py` - Detailed prediction analysis
- `scripts/baseline_*.py` - Baseline model implementations
- `scripts/launch_mlflow_ui.sh` - View experiment tracking

**Data:**
- `data/preprocessed/` - Feature and label files
- `data/raw/` - Original challenge data

**Models:**
- `models_normalized/` - Best models with per-dataset normalization
- `models_normalized_old/` - Backups from experiments

---

## Notes

**Critical Insights:**
1. **Temporal context is key**: Simple interpolation beats the complex model → wear history over time carries more signal than per-cut features
2. **Signal preprocessing matters**: 10 s ramp + 35 s steady + 5 s taper structure
3. **Cut 2 is problematic**: Systematically bad across datasets
4. **Fold 3 proves it works**: 13.02 μm (only ~3 μm from the paper!)
5. **Fold 2 is the challenge**: train3/train4 have different intensity (10.8% lower)

**What We've Learned:**
- The model architecture is sound (when data is clean)
- Per-dataset normalization helps (35% improvement)
- Guardrails prevent catastrophic outliers (31% improvement)
- Cut-1 anchoring is valid at test time (not during validation)
- train6 reprocessing didn't help (the old model was better)

**Next Priority:**
Task 8.0.18 (temporal context): 2 hours, expected 20-40% improvement, highest ROI