# UO Processed Dataset
Documentation for the University of Ottawa (UO) bearing dataset after running the updated preprocessing pipeline. Along with KAIST and PU, this is one of the three datasets that were refreshed most recently.
## Folder Layout

```
UO/
├── train.pt
├── val.pt
├── test.pt
├── args.json
├── additional_features.pt
└── before_sliding_window/
    ├── train.pt
    ├── val.pt
    └── test.pt
```
- `train.pt`, `val.pt`, `test.pt`: windowed tensors after sliding-window subsampling.
- `before_sliding_window/*.pt`: the same splits before windowing (full-length sequences).
- `args.json`: preprocessing arguments (window size, stride, split ratios, etc.).
- `additional_features.pt`: Torch-serialized metadata collected from each MAT file (including channel names and auxiliary sensor info).
## Saved Tensor Structure
Each windowed `.pt` file is a dictionary with the following keys:
- `samples`: `torch.Tensor` of shape `[num_windows, num_channels, window_length]`
- `labels`: `torch.Tensor` with class ids (`0`: healthy, `1`: inner race fault, `2`: outer race fault)
- `sequence_ids`: indices that point back to the original MAT files
- `sliding_window_sequence_ids`: mapping from windows to their source sequences
- `size`: split ratio used when generating the split (for traceability)
The `before_sliding_window` tensors use the same keys but keep the un-windowed signals (length ≈ 2,000,000 samples per channel).
## Features
Each MAT file contains several arrays; the preprocessing script keeps only the first two vibration channels (`channel_1`, `channel_2`), which provide the full 2,000,000 samples used in prior work. Their names are captured in `additional_features.pt` under each MAT file:
```python
import torch

meta = torch.load("additional_features.pt")
print(meta["healthy"]["H-A-1.mat"]["name_features"])
```
Use this metadata to understand the physical meaning of each channel or to filter specific features.
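For instance, the recorded names can be used to look up a channel's position in the tensor. The sketch below works on a toy dictionary that mirrors the layout shown above; with the real file, the dict would come from `torch.load("additional_features.pt")`, and the exact contents of `name_features` are an assumption here.

```python
# Toy metadata mirroring the layout shown above; the real dict comes
# from torch.load("additional_features.pt"). Channel names are
# illustrative, not guaranteed to match the actual files.
meta = {
    "healthy": {
        "H-A-1.mat": {"name_features": ["channel_1", "channel_2"]},
    },
}

# Look up the tensor index of a channel by its recorded name.
names = meta["healthy"]["H-A-1.mat"]["name_features"]
channel_idx = names.index("channel_2")
print(channel_idx)  # 1
```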
## Usage Example
```python
import torch

train = torch.load("train.pt")
windows = train["samples"]  # [N, num_channels, window_length]
labels = train["labels"]    # bearing condition ids
original_indices = train["sliding_window_sequence_ids"]

print(windows.shape)
print(labels.unique())
```
Access the un-windowed signals:
```python
raw = torch.load("before_sliding_window/train.pt")
full_sequences = raw["samples"]  # [num_sequences, num_channels, original_length]
```
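Assuming `sliding_window_sequence_ids[i]` holds the index of the sequence that window `i` was cut from (the field names come from the structure above), windows can be grouped by their source sequence with a boolean mask. A minimal sketch on synthetic tensors:

```python
import torch

# Synthetic stand-ins for train["samples"] and
# train["sliding_window_sequence_ids"]; with the real files these come
# from torch.load("train.pt") as shown above.
windows = torch.randn(6, 2, 4)  # [num_windows, num_channels, window_length]
window_to_seq = torch.tensor([0, 0, 1, 1, 2, 2])

# Select every window cut from sequence 1.
seq1_windows = windows[window_to_seq == 1]
print(seq1_windows.shape)  # torch.Size([2, 2, 4])
```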
## Processing Pipeline

1. **Raw input** → MAT files organised into three lists: healthy (`H-*`), inner faults (`I-*`), outer faults (`O-*`).
2. **Channel selection** → For every file, the script extracts `channel_1` and `channel_2`, each 2,000,000 samples long. Additional metadata (speed, load, etc.) is preserved in `additional_features.pt`.
3. **Class-wise sequence split** → Using `train_size`, sequences are randomly assigned to train vs. (val + test); the remaining sequences are divided into validation and test according to `val_size` / `test_size`.
4. **Save before-window tensors** → Full-length tensors are written to `before_sliding_window/{train,val,test}.pt` for troubleshooting.
5. **Sliding-window sampling** → Windows of length `window_size` are generated every `step = window_size * stride` samples from each sequence.
6. **Persist final datasets** → Windowed tensors and labels are stored in the root `.pt` files together with the mapping fields (`sequence_ids`, `sliding_window_sequence_ids`, `size`).
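The sliding-window step above can be sketched as follows. This is an illustrative reimplementation on a synthetic tensor, not the pipeline's actual code; it assumes `stride` is a multiplier applied to `window_size` (so `stride = 1.0` means non-overlapping windows), per the `step = window_size * stride` rule.

```python
import torch

def sliding_windows(sequences, window_size, stride):
    """Cut each [C, L] sequence into windows of length `window_size`,
    advancing by step = window_size * stride samples per window."""
    step = int(window_size * stride)
    windows, seq_ids = [], []
    for seq_id, seq in enumerate(sequences):  # seq: [C, L]
        length = seq.shape[-1]
        for start in range(0, length - window_size + 1, step):
            windows.append(seq[:, start:start + window_size])
            seq_ids.append(seq_id)
    return torch.stack(windows), torch.tensor(seq_ids)

# Synthetic stand-in: 3 sequences, 2 channels, 1,000 samples each.
sequences = torch.randn(3, 2, 1000)
windows, seq_ids = sliding_windows(sequences, window_size=200, stride=1.0)
print(windows.shape)  # torch.Size([15, 2, 200]): 5 windows per sequence
```

The returned `seq_ids` plays the role of `sliding_window_sequence_ids`: one entry per window, pointing at the source sequence.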
## Input / Output Cheat Sheet

| Stage | Shape | Description |
|---|---|---|
| Raw MAT arrays | `(2,000,000,)` per channel | Original vibration signals |
| After loading | `(1, 2, 2,000,000)` | Tensor for a single MAT file (two channels) |
| Before sliding window (train) | `[num_sequences, 2, original_length]` | Randomly selected sequences saved to `before_sliding_window/train.pt` |
| After sliding window (train) | `[num_windows, 2, window_size]` | Final training dataset in `train.pt` |
## Notes
- This README applies to the refreshed implementation; other datasets still rely on the legacy processing approach.
- The supplied splits rely on random shuffling with the configured ratios; re-run the pipeline to generate different splits if required.