Abdelrahman Almatrooshi
# data_preparation
Handles loading, splitting, scaling, and serving the collected dataset for training and evaluation.
## Links
- Participant consent form: [Consent document](https://drive.google.com/file/d/1g1Hc764ffljoKrjApD6nmWDCXJGYTR0j/view?usp=drive_link)
- Dataset (staff access): [Dataset folder](https://drive.google.com/drive/folders/1fwACM6i6uVGFkTlJKSlqVhizzgrHl_gY?usp=sharing)
## Data collection protocol
Nine team members each recorded 5-10 minute webcam sessions using a purpose-built tool (`models/collect_features.py`). During recording:
- Participants simulated **focused** behaviour (reading, typing) and **unfocused** behaviour (looking at phone, turning away)
- Binary labels were annotated in real-time via key presses
- Sessions were recorded across different rooms, workspaces, and home offices using consumer webcams under varying lighting
- Real-time quality guidance warned if class balance fell outside 30-70% or if fewer than 10 state transitions occurred
- An automated post-collection quality report validated minimum duration (120s), sample count (3,000+ frames), balance, and transition frequency
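The thresholds above can be sketched as a small validation function. This is an illustrative reconstruction, not the actual quality-report code in `collect_features.py`; the function name and return format are assumptions, while the thresholds (120 s, 3,000 frames, 30-70% balance, 10 transitions) come from the description above.

```python
import numpy as np

def quality_report(labels: np.ndarray, duration_s: float) -> dict:
    """Hypothetical sketch of the post-collection quality checks."""
    labels = np.asarray(labels)
    focused_ratio = float(labels.mean())                   # fraction of frames labelled focused
    transitions = int(np.count_nonzero(np.diff(labels)))   # number of label changes
    return {
        "duration_ok": duration_s >= 120.0,                # minimum session length
        "samples_ok": labels.size >= 3000,                 # minimum frame count
        "balance_ok": 0.30 <= focused_ratio <= 0.70,       # class balance window
        "transitions_ok": transitions >= 10,               # minimum state transitions
    }
```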
All participants provided informed consent for their facial landmark data to be used within this coursework project. Raw video frames are never stored; only the 17-dimensional feature vector and binary labels are saved.
The raw participant dataset is excluded from this repository (coursework policy and privacy constraints); it is shared separately via the dataset link above.
## Dataset summary
| Metric | Value |
|--------|-------|
| Participants | 9 |
| Total frames | 144,793 |
| Class balance | 61.5% focused / 38.5% unfocused |
| Features extracted | 17 per frame |
| Features selected | 10 (used by ML models) |
## Data format
Training data lives under `data/collected_<participant>/` as `.npz` files. Each file contains:
| Key | Shape | Description |
|-----|-------|-------------|
| `features` | (N, 17) | Float array of extracted features |
| `labels` | (N,) | Binary: 0 = unfocused, 1 = focused |
| `feature_names` | (17,) | String names matching `FEATURE_NAMES` in `collect_features.py` |
Data files are not included in this repository due to privacy considerations.
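A single session file in this format can be read with plain NumPy. This is a minimal sketch, assuming only the keys and shapes listed in the table; the function name and path are illustrative.

```python
import numpy as np

def load_session(path):
    """Load one session .npz in the format described above (sketch)."""
    with np.load(path) as f:
        X = f["features"]            # (N, 17) float feature matrix
        y = f["labels"]              # (N,) binary labels: 0 = unfocused, 1 = focused
        names = f["feature_names"]   # (17,) feature name strings
    assert X.shape[1] == len(names) == 17
    return X, y, names
```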
## Files
| File | Purpose |
|------|---------|
| `prepare_dataset.py` | Core data pipeline: loads `.npz`, applies feature selection, stratified splits, StandardScaler on train only |
| `data_exploration.ipynb` | Exploratory analysis: feature distributions, class balance, per-person statistics, correlation heatmaps |
## Feature selection
`SELECTED_FEATURES["face_orientation"]` defines the 10 features used by all ML models:
**Head pose (3):** `head_deviation`, `s_face`, `pitch`
**Eye state (4):** `ear_left`, `ear_right`, `ear_avg`, `perclos`
**Gaze (3):** `h_gaze`, `gaze_offset`, `s_eye`
Excluded: `v_gaze` (noisy), `mar` (1.7% trigger rate), `yaw`/`roll` (redundant with `head_deviation`/`s_face`), `blink_rate`/`closure_duration`/`yawn_duration` (temporal overlap with `perclos`).
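Name-based selection of these 10 columns can be sketched as follows. The list contents come from the groups above, but the exact structure of `SELECTED_FEATURES` in `prepare_dataset.py` may differ; `select_features` is an illustrative helper, not the repo's API.

```python
import numpy as np

# The 10 selected feature names, as listed above (order is an assumption).
SELECTED = [
    "head_deviation", "s_face", "pitch",                # head pose
    "ear_left", "ear_right", "ear_avg", "perclos",      # eye state
    "h_gaze", "gaze_offset", "s_eye",                   # gaze
]

def select_features(X, feature_names):
    """Reduce a (N, 17) matrix to the (N, 10) selected columns by name."""
    idx = [list(feature_names).index(n) for n in SELECTED]
    return X[:, idx]
```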
Selection was validated by XGBoost gain importance and LOPO channel ablation:
| Channel subset | Mean LOPO F1 |
|---------------|-------------|
| All 10 features | 0.829 |
| Eye state only | 0.807 |
| Head pose only | 0.748 |
| Gaze only | 0.726 |
## Key functions
| Function | What it does |
|----------|-------------|
| `load_all_pooled(model_name)` | Concatenates all participant data into one array |
| `load_per_person(model_name)` | Returns `{person: (X, y)}` dict for LOPO cross-validation |
| `get_numpy_splits(model_name)` | Returns scaled train/val/test numpy arrays (70/15/15 split) |
| `get_dataloaders(model_name)` | Returns PyTorch DataLoaders for MLP training |
| `get_default_split_config()` | Returns split ratios and seed from `config/default.yaml` |
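The splitting logic behind `get_numpy_splits` can be approximated with scikit-learn. This is a self-contained sketch, not the actual implementation; the 70/15/15 ratios and train-only scaling come from the table above, while the seed value and two-stage split are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def numpy_splits(X, y, seed=42):
    """Stratified 70/15/15 split with the scaler fit on train only (sketch)."""
    # First split off 30% for val+test, stratified on the labels.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    # Split the held-out 30% evenly into val and test.
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    scaler = StandardScaler().fit(X_tr)      # fit on the training split only
    return (scaler.transform(X_tr), y_tr,
            scaler.transform(X_val), y_val,
            scaler.transform(X_te), y_te)
```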
## Data cleaning
The same cleaning steps run in `prepare_dataset.py` (for training, before splitting) and in `ui/pipeline.py` (at inference):
1. Angles clipped to physiological ranges (yaw ±45, pitch/roll ±30)
2. `head_deviation` recomputed from clipped angles (not clipped after computation)
3. EAR clipped to [0, 0.85], MAR to [0, 1.0]
4. Physiological bounds on gaze_offset, PERCLOS, blink_rate, closure/yawn duration
5. StandardScaler fit on training split only, applied to val/test
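Steps 1-3 above can be sketched for a single frame's raw values. Variable names are illustrative, and the ranges come from the list; the actual `head_deviation` formula is not given here, so the `sqrt(yaw² + pitch²)` used below is a placeholder assumption.

```python
import numpy as np

def clean_angles(yaw, pitch, roll, ear, mar):
    """Sketch of cleaning steps 1-3 for one frame's raw values."""
    # Step 1: clip angles to physiological ranges.
    yaw = float(np.clip(yaw, -45, 45))
    pitch = float(np.clip(pitch, -30, 30))
    roll = float(np.clip(roll, -30, 30))
    # Step 2: recompute head_deviation from the *clipped* angles
    # (placeholder formula; the real one may differ).
    head_deviation = float(np.sqrt(yaw ** 2 + pitch ** 2))
    # Step 3: clip EAR and MAR to their bounds.
    ear = float(np.clip(ear, 0.0, 0.85))
    mar = float(np.clip(mar, 0.0, 1.0))
    return yaw, pitch, roll, head_deviation, ear, mar
```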