# UO Processed Dataset

Documentation for the **University of Ottawa (UO) bearing dataset** after running the updated preprocessing pipeline. Along with KAIST and PU, this is one of the three datasets refreshed most recently.

---
## Folder Layout

```
UO/
├── train.pt
├── val.pt
├── test.pt
├── args.json
├── additional_features.pt
└── before_sliding_window/
    ├── train.pt
    ├── val.pt
    └── test.pt
```
- `train.pt`, `val.pt`, `test.pt` – windowed tensors after sliding-window subsampling.
- `before_sliding_window/*.pt` – the same splits before windowing (full-length sequences).
- `args.json` – preprocessing arguments (window size, stride, split ratios, etc.).
- `additional_features.pt` – Torch-serialized metadata collected from each MAT file (including channel names and auxiliary sensor info).

---
## Saved Tensor Structure

Each windowed `.pt` file is a dictionary with:

- `samples`: `torch.Tensor` of shape `[num_windows, num_channels, window_length]`
- `labels`: `torch.Tensor` with class ids (`0: healthy`, `1: inner race fault`, `2: outer race fault`)
- `sequence_ids`: indices that point back to the original MAT files
- `sliding_window_sequence_ids`: mapping from each window to its source sequence
- `size`: split ratio used when generating the split (for traceability)

The `before_sliding_window` tensors use the same keys but keep the un-windowed signals (length ≈ 2,000,000 samples per channel).
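The structure above can be sketched with synthetic tensors; all shapes and values below are illustrative, not real UO data:

```python
import torch

# Toy split dictionary mirroring the saved keys (shapes are illustrative).
num_windows, num_channels, window_length = 8, 2, 4096
split = {
    "samples": torch.randn(num_windows, num_channels, window_length),
    "labels": torch.randint(0, 3, (num_windows,)),            # class ids 0-2
    "sequence_ids": torch.arange(4),                          # one id per MAT file
    "sliding_window_sequence_ids": torch.randint(0, 4, (num_windows,)),
    "size": 0.6,                                              # split ratio used
}
assert split["samples"].shape == (num_windows, num_channels, window_length)
```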
---
## Features

Each MAT file contains several arrays; the preprocessing script keeps only the first two vibration channels (`channel_1`, `channel_2`), which provide the full 2,000,000 samples used in prior work. Their names are captured in `additional_features.pt` under each MAT file:

```python
import torch

meta = torch.load("additional_features.pt")
print(meta["healthy"]["H-A-1.mat"]["name_features"])
```

Use this metadata to understand the physical meaning of each channel or to filter specific features.

---
## Usage Example

```python
import torch

train = torch.load("train.pt")
windows = train["samples"]  # [N, num_channels, window_length]
labels = train["labels"]    # bearing condition ids
window_to_sequence = train["sliding_window_sequence_ids"]  # maps each window to its source sequence
print(windows.shape)
print(labels.unique())
```

Access the un-windowed signals:

```python
raw = torch.load("before_sliding_window/train.pt")
full_sequences = raw["samples"]  # [num_sequences, num_channels, original_length]
```
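To relate a window back to its full-length source, `sliding_window_sequence_ids` acts as a lookup table. A hypothetical, self-contained illustration (the toy mapping below is made up):

```python
# Toy mapping: 6 windows cut from 3 source sequences (values are illustrative).
sliding_window_sequence_ids = [0, 0, 0, 1, 1, 2]

window_idx = 4
source_sequence = sliding_window_sequence_ids[window_idx]
print(source_sequence)  # -> 1, i.e. window 4 was cut from sequence 1
```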
---
## Processing Pipeline

1. **Raw input** – MAT files organised into three lists: healthy (`H-*`), inner faults (`I-*`), outer faults (`O-*`).
2. **Channel selection** – For every file, the script extracts `channel_1` and `channel_2`, each 2,000,000 samples long. Additional metadata (speed, load, etc.) is preserved in `additional_features.pt`.
3. **Class-wise sequence split** – Using `train_size`, sequences are randomly assigned to train vs. (val + test); the remaining sequences are divided into validation and test according to `val_size`/`test_size`.
4. **Save before-window tensors** – Full-length tensors are written to `before_sliding_window/{train,val,test}.pt` for troubleshooting.
5. **Sliding-window sampling** – Windows of length `window_size` are generated every `step = window_size * stride` samples from each sequence.
6. **Persist final datasets** – Windowed tensors and labels are stored in the root `.pt` files together with the mapping fields (`sequence_ids`, `sliding_window_sequence_ids`, `size`).
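Step 5 can be sketched with `torch.Tensor.unfold` on a synthetic sequence. The `window_size` and `stride` values below are examples, and `stride` is assumed to be a fraction of the window, as in `step = window_size * stride`:

```python
import torch

window_size = 1024
stride = 0.5                        # fraction of the window (assumed example)
step = int(window_size * stride)    # 512-sample hop
sequence = torch.randn(2, 100_000)  # [num_channels, length] (synthetic)

# unfold along the time axis: [num_channels, num_windows, window_size]
windows = sequence.unfold(dimension=1, size=window_size, step=step)
windows = windows.permute(1, 0, 2)  # -> [num_windows, num_channels, window_size]
print(windows.shape)                # torch.Size([194, 2, 1024])
```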
---
## Input / Output Cheat Sheet

| Stage | Shape | Description |
|-------|-------|-------------|
| Raw MAT arrays | `(2_000_000,)` per channel | Original vibration signals |
| After loading | `(1, 2, 2_000_000)` | Tensor for a single MAT file (two channels) |
| Before sliding window (train) | `[num_sequences, 2, original_length]` | Randomly selected sequences saved to `before_sliding_window/train.pt` |
| After sliding window (train) | `[num_windows, 2, window_size]` | Final training dataset in `train.pt` |
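As a back-of-envelope check of the table's last row, the window count per sequence follows `num_windows = (original_length - window_size) // step + 1` (the `window_size` and `stride` values are illustrative):

```python
original_length = 2_000_000
window_size = 1024
stride = 0.5                      # fraction of the window (assumed example)
step = int(window_size * stride)  # 512

num_windows = (original_length - window_size) // step + 1
print(num_windows)  # -> 3905 windows per sequence
```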
---
## Notes

- This README applies to the refreshed implementation; other datasets still rely on the legacy processing approach.
- The supplied splits rely on random shuffling with the configured ratios. Re-run the pipeline to regenerate different splits if required.