
# UO Processed Dataset

Documentation for the University of Ottawa (UO) bearing dataset after running the updated preprocessing pipeline. Along with KAIST and PU, this is one of the three datasets that were refreshed most recently.


## Folder Layout

```
UO/
├── train.pt
├── val.pt
├── test.pt
├── args.json
├── additional_features.pt
└── before_sliding_window/
    ├── train.pt
    ├── val.pt
    └── test.pt
```
- `train.pt`, `val.pt`, `test.pt` – windowed tensors after sliding-window subsampling.
- `before_sliding_window/*.pt` – the same splits before windowing (full-length sequences).
- `args.json` – preprocessing arguments (window size, stride, split ratios, etc.).
- `additional_features.pt` – Torch-serialized metadata collected from each MAT file (including channel names and auxiliary sensor info).
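To check which preprocessing arguments produced a given set of tensors, `args.json` can be inspected directly. The sketch below writes and reads a sample file; the field names (`window_size`, `stride`, `train_size`, …) are assumptions based on the parameters mentioned in this README, not a confirmed schema:

```python
import json
import os
import tempfile

# Hypothetical args.json contents; the real field names and values may differ.
sample_args = {"window_size": 2048, "stride": 0.5,
               "train_size": 0.8, "val_size": 0.1, "test_size": 0.1}

path = os.path.join(tempfile.mkdtemp(), "args.json")
with open(path, "w") as f:
    json.dump(sample_args, f)

# With the real dataset you would open "UO/args.json" instead.
with open(path) as f:
    args = json.load(f)
print(args["window_size"], args["stride"])
```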

## Saved Tensor Structure

Each windowed .pt file is a dictionary with:

- `samples`: `torch.Tensor` of shape `[num_windows, num_channels, window_length]`
- `labels`: `torch.Tensor` with class ids (0: healthy, 1: inner race fault, 2: outer race fault)
- `sequence_ids`: indices that point back to the original MAT files
- `sliding_window_sequence_ids`: mapping from windows to their source sequences
- `size`: split ratio used when generating the split (for traceability)
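As an illustration of this layout, a minimal dictionary matching these keys can be built, saved, and reloaded. The data here is synthetic (random tensors with placeholder shapes), not the real UO signals:

```python
import torch

# Synthetic stand-in: 10 windows, 2 channels, window length 2048 (placeholders).
num_windows, num_channels, window_length = 10, 2, 2048
split = {
    "samples": torch.randn(num_windows, num_channels, window_length),
    "labels": torch.randint(0, 3, (num_windows,)),        # class ids 0/1/2
    "sequence_ids": torch.arange(4),                      # one id per source MAT file
    "sliding_window_sequence_ids": torch.randint(0, 4, (num_windows,)),
    "size": 0.8,                                          # split ratio used
}
torch.save(split, "demo_split.pt")
loaded = torch.load("demo_split.pt")
print(loaded["samples"].shape)  # torch.Size([10, 2, 2048])
```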

The `before_sliding_window` tensors use the same keys but keep the un-windowed signals (length ≈ 2,000,000 samples per channel).


## Features

Each MAT file contains several arrays; the preprocessing script keeps only the first two vibration channels (`channel_1`, `channel_2`), which provide the full 2,000,000 samples used in prior work. Their names are captured in `additional_features.pt` under each MAT file:

```python
import torch

meta = torch.load("additional_features.pt")
print(meta["healthy"]["H-A-1.mat"]["name_features"])
```

Use this metadata to understand the physical meaning of each channel or to filter specific features.


## Usage Example

```python
import torch

train = torch.load("train.pt")
windows = train["samples"]                   # [N, num_channels, window_length]
labels = train["labels"]                     # bearing condition ids
original_indices = train["sliding_window_sequence_ids"]

print(windows.shape)
print(labels.unique())
```

Access the un-windowed signals:

```python
raw = torch.load("before_sliding_window/train.pt")
full_sequences = raw["samples"]              # [num_sequences, num_channels, original_length]
```

## Processing Pipeline

1. Raw input – MAT files organised into three lists: healthy (`H-*`), inner faults (`I-*`), outer faults (`O-*`).
2. Channel selection – For every file, the script extracts `channel_1` and `channel_2`, each 2,000,000 samples long. Additional metadata (speed, load, etc.) is preserved in `additional_features.pt`.
3. Class-wise sequence split – Using `train_size`, sequences are randomly assigned to train vs. (val + test); the remaining sequences are divided into validation and test according to `val_size`/`test_size`.
4. Save before-window tensors – Full-length tensors are written to `before_sliding_window/{train,val,test}.pt` for troubleshooting.
5. Sliding-window sampling – Windows of length `window_size` are generated every `step = window_size * stride` samples from each sequence.
6. Persist final datasets – Windowed tensors and labels are stored in the root `.pt` files together with the mapping fields (`sequence_ids`, `sliding_window_sequence_ids`, `size`).
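Steps 3 and 5 can be sketched with synthetic signals. The split logic and the `torch.Tensor.unfold` call below illustrate the described behaviour but are not the pipeline's actual code, and all parameter values are placeholders:

```python
import torch

torch.manual_seed(0)
window_size, stride = 2048, 0.5          # placeholders; real values live in args.json
step = int(window_size * stride)         # step = window_size * stride

# Step 3: random split of synthetic sequences shaped [num_sequences, 2, length].
sequences = torch.randn(5, 2, 20_000)
perm = torch.randperm(len(sequences))
n_train = int(0.8 * len(sequences))      # train_size = 0.8 (placeholder)
train_seqs = sequences[perm[:n_train]]

# Step 5: sliding-window sampling along the time axis.
# unfold yields [n_seq, 2, num_windows, window_size]; flatten to [num_windows_total, 2, window_size].
windows = train_seqs.unfold(2, window_size, step)
windows = windows.permute(0, 2, 1, 3).reshape(-1, 2, window_size)
print(windows.shape)  # torch.Size([72, 2, 2048])
```

Each sequence of length 20,000 yields `(20000 - 2048) // 1024 + 1 = 18` windows, so 4 training sequences give 72 windows in total.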

## Input / Output Cheat Sheet

| Stage | Shape | Description |
|---|---|---|
| Raw MAT arrays | `(2,000,000,)` per channel | Original vibration signals |
| After loading | `(1, 2, 2,000,000)` | Tensor for a single MAT file (two channels) |
| Before sliding window (train) | `[num_sequences, 2, original_length]` | Randomly selected sequences saved to `before_sliding_window/train.pt` |
| After sliding window (train) | `[num_windows, 2, window_size]` | Final training dataset in `train.pt` |

## Notes

- This README applies to the refreshed implementation; other datasets still rely on the legacy processing approach.
- The supplied splits rely on random shuffling with the configured ratios. Re-run the pipeline to regenerate different splits if required.