# data_preparation

Handles loading, splitting, scaling, and serving the collected dataset for training and evaluation.
## Links

- Participant consent form: [Consent document](https://drive.google.com/file/d/1g1Hc764ffljoKrjApD6nmWDCXJGYTR0j/view?usp=drive_link)
- Dataset (staff access): [Dataset folder](https://drive.google.com/drive/folders/1fwACM6i6uVGFkTlJKSlqVhizzgrHl_gY?usp=sharing)
## Data collection protocol

Nine team members each recorded 5-10 minute webcam sessions using a purpose-built tool (`models/collect_features.py`). During recording:

- Participants simulated **focused** behaviour (reading, typing) and **unfocused** behaviour (looking at phone, turning away)
- Binary labels were annotated in real time via key presses
- Sessions were recorded across different rooms, workspaces, and home offices using consumer webcams under varying lighting
- Real-time quality guidance warned if the class balance fell outside 30-70% or if fewer than 10 state transitions occurred
- An automated post-collection quality report validated minimum duration (120 s), sample count (3,000+ frames), class balance, and transition frequency

All participants provided informed consent for their facial landmark data to be used within this coursework project. Raw video frames are never stored; only the 17-dimensional feature vector and binary labels are saved.

The raw participant dataset is excluded from this repository (coursework policy and privacy constraints); it is shared separately via the dataset link above.
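The post-collection quality checks described above can be sketched as follows. This is an illustrative reimplementation, not the actual code in `models/collect_features.py`; the `fps` default and function name are assumptions, while the thresholds (120 s, 3,000 frames, 30-70% balance, 10 transitions) come from this document.

```python
import numpy as np

def quality_report(labels, fps=25, min_duration_s=120, min_frames=3000,
                   balance_range=(0.30, 0.70), min_transitions=10):
    """Sketch of the automated quality checks (assumed fps; real
    checks live in models/collect_features.py)."""
    labels = np.asarray(labels)
    n = len(labels)
    focused_frac = labels.mean()                          # fraction labelled 1
    transitions = int(np.sum(labels[1:] != labels[:-1]))  # focus state changes
    duration_s = n / fps
    return {
        "duration_ok": duration_s >= min_duration_s,
        "frames_ok": n >= min_frames,
        "balance_ok": balance_range[0] <= focused_frac <= balance_range[1],
        "transitions_ok": transitions >= min_transitions,
    }
```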
## Dataset summary

| Metric | Value |
|--------|-------|
| Participants | 9 |
| Total frames | 144,793 |
| Class balance | 61.5% focused / 38.5% unfocused |
| Features extracted | 17 per frame |
| Features selected | 10 (used by ML models) |
## Data format

Training data lives under `data/collected_<participant>/` as `.npz` files. Each file contains:

| Key | Shape | Description |
|-----|-------|-------------|
| `features` | (N, 17) | Float array of extracted features |
| `labels` | (N,) | Binary: 0 = unfocused, 1 = focused |
| `feature_names` | (17,) | String names matching `FEATURE_NAMES` in `collect_features.py` |

Data files are not included in this repository due to privacy considerations.
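A minimal round-trip in the documented format (the file name and placeholder feature names below are invented for illustration, since the real data files are not distributed):

```python
import numpy as np

# Write a tiny synthetic file matching the documented .npz layout,
# then read it back the way the pipeline would.
rng = np.random.default_rng(0)
feature_names = np.array([f"f{i}" for i in range(17)])  # placeholder names
np.savez("example_session.npz",
         features=rng.normal(size=(100, 17)).astype(np.float32),
         labels=rng.integers(0, 2, size=100),
         feature_names=feature_names)

data = np.load("example_session.npz", allow_pickle=True)
X, y, names = data["features"], data["labels"], data["feature_names"]
assert X.shape == (100, 17) and y.shape == (100,) and names.shape == (17,)
```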
## Files

| File | Purpose |
|------|---------|
| `prepare_dataset.py` | Core data pipeline: loads `.npz`, applies feature selection, stratified splits, `StandardScaler` fit on train only |
| `data_exploration.ipynb` | Exploratory analysis: feature distributions, class balance, per-person statistics, correlation heatmaps |
## Feature selection

`SELECTED_FEATURES["face_orientation"]` defines the 10 features used by all ML models:

- **Head pose (3):** `head_deviation`, `s_face`, `pitch`
- **Eye state (4):** `ear_left`, `ear_right`, `ear_avg`, `perclos`
- **Gaze (3):** `h_gaze`, `gaze_offset`, `s_eye`

Excluded: `v_gaze` (noisy), `mar` (1.7% trigger rate), `yaw`/`roll` (redundant with `head_deviation`/`s_face`), `blink_rate`/`closure_duration`/`yawn_duration` (temporal overlap with `perclos`).
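Selection then reduces the (N, 17) feature matrix to the (N, 10) matrix the models consume, by name. A sketch, assuming the canonical mapping lives in `SELECTED_FEATURES`; the ordering of `ALL_FEATURES` below is illustrative, and the real order is `FEATURE_NAMES` in `collect_features.py`:

```python
import numpy as np

# All 17 per-frame features named in this section (illustrative order).
ALL_FEATURES = [
    "yaw", "pitch", "roll", "head_deviation", "s_face",
    "ear_left", "ear_right", "ear_avg", "perclos", "blink_rate",
    "closure_duration", "mar", "yawn_duration",
    "h_gaze", "v_gaze", "gaze_offset", "s_eye",
]
SELECTED_FEATURES = {
    "face_orientation": [
        "head_deviation", "s_face", "pitch",            # head pose
        "ear_left", "ear_right", "ear_avg", "perclos",  # eye state
        "h_gaze", "gaze_offset", "s_eye",               # gaze
    ],
}

# Map the 10 selected names to column indices in the 17-wide matrix.
idx = [ALL_FEATURES.index(f) for f in SELECTED_FEATURES["face_orientation"]]
X17 = np.zeros((5, 17))  # placeholder feature matrix
X10 = X17[:, idx]
```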
Selection was validated by XGBoost gain importance and leave-one-person-out (LOPO) channel ablation:

| Channel subset | Mean LOPO F1 |
|----------------|--------------|
| All 10 features | 0.829 |
| Eye state only | 0.807 |
| Head pose only | 0.748 |
| Gaze only | 0.726 |
## Key functions

| Function | What it does |
|----------|--------------|
| `load_all_pooled(model_name)` | Concatenates all participant data into one array |
| `load_per_person(model_name)` | Returns a `{person: (X, y)}` dict for LOPO cross-validation |
| `get_numpy_splits(model_name)` | Returns scaled train/val/test NumPy arrays (70/15/15 split) |
| `get_dataloaders(model_name)` | Returns PyTorch DataLoaders for MLP training |
| `get_default_split_config()` | Returns split ratios and seed from `config/default.yaml` |
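The split-and-scale behaviour of `get_numpy_splits` can be sketched with scikit-learn. This is a simplified stand-in, not the repository's implementation: the function name, `seed` default, and two-stage `train_test_split` are assumptions; the 70/15/15 stratified split and fit-scaler-on-train-only behaviour come from this document.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def numpy_splits(X, y, seed=42):
    """Sketch of a 70/15/15 stratified split with train-only scaling
    (the real implementation is prepare_dataset.get_numpy_splits)."""
    # First carve off 30%, then halve it into val and test (15% each).
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    scaler = StandardScaler().fit(X_tr)  # fit on the training split only
    return (scaler.transform(X_tr), y_tr,
            scaler.transform(X_val), y_val,
            scaler.transform(X_te), y_te)
```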
## Data cleaning

Applied before splitting (in `ui/pipeline.py` at inference time, in `prepare_dataset.py` for training):

1. Angles clipped to physiological ranges (yaw ±45°, pitch/roll ±30°)
2. `head_deviation` recomputed from the clipped angles (not clipped after computation)
3. EAR clipped to [0, 0.85], MAR to [0, 1.0]
4. Physiological bounds applied to `gaze_offset`, PERCLOS, blink rate, and closure/yawn durations
5. `StandardScaler` fit on the training split only, then applied to val/test
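Steps 1-2 above can be sketched as follows. Note the `head_deviation` formula here is an assumption made for illustration (the document only says it is recomputed from the clipped angles, not how); the clipping ranges come from this section.

```python
import numpy as np

def clean_angles(yaw, pitch, roll):
    """Clip head-pose angles to physiological ranges, then recompute
    head_deviation from the clipped values rather than clipping it."""
    yaw = np.clip(yaw, -45.0, 45.0)
    pitch = np.clip(pitch, -30.0, 30.0)
    roll = np.clip(roll, -30.0, 30.0)
    # Assumed combination of yaw and pitch; the real formula lives in
    # the repository, not this sketch.
    head_deviation = np.sqrt(yaw**2 + pitch**2)
    return yaw, pitch, roll, head_deviation
```

Recomputing after clipping (rather than clipping the precomputed value) keeps `head_deviation` consistent with the angles the models actually see.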