notebooks/data_preprocessing.ipynb — documentation
This note describes what the notebook does, cell by cell.
Cell 0 — Load DeepLoc and inspect it
| Code | What it does |
|---|---|
| `import pandas as pd` | Uses pandas for tabular data. |
| `df = pd.read_csv("../data/raw/deeploc.csv")` | Reads the raw CSV from the project’s `data/raw` folder (path is relative to the notebook under `notebooks/`). |
| `print(df.shape)` | Prints row and column counts (example run: 28,303 rows × 16 columns). |
| `print(df.columns.tolist())` | Lists all column names: identifiers, metadata, label columns, and the sequence column. |
| `df.head()` | Shows the first five rows as a preview. |
Takeaway: The file includes columns such as ACC, Kingdom, Partition, eleven subcellular-location columns (numeric), Sequence, and an index-like Unnamed: 0 column.
Cell 1 — Shared constants for cleaning and modeling
| Code | What it does |
|---|---|
| `from pathlib import Path` | Used later for path handling when saving files (`save_processed_data`). |
| `VALID_AMINO_ACIDS` | Allowed one-letter codes: standard 20 amino acids plus common extensions (X unknown, B Asx, Z Glx, U selenocysteine, O pyrrolysine). Rows are kept only if every character in the sequence is in this set (after uppercasing). |
| `LABEL_COLUMNS` | The 11 DeepLoc subcellular compartments treated as binary multilabel targets. |
| `SEQUENCE_COLUMN = "Sequence"` | Name of the protein sequence column. |
| `ID_COLUMN = "ACC"` | Accession / protein ID used as the row identifier in the final table. |
Purpose: Single place to define valid amino acid alphabets and which columns are labels.
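A minimal sketch of how these constants might be declared. The notebook's actual definitions are not reproduced in this note, so the alphabet below is reconstructed from the description above, the helper names `STANDARD_AMINO_ACIDS` and `EXTENDED_CODES` are hypothetical, and `LABEL_COLUMNS` is omitted because its eleven names are not listed here.

```python
# Sketch only: the notebook's real definitions may differ.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical name: the 20 standard codes
EXTENDED_CODES = "XBZUO"  # hypothetical name: X unknown, B Asx, Z Glx, U Sec, O Pyl

# A set gives constant-time membership checks per character.
VALID_AMINO_ACIDS = set(STANDARD_AMINO_ACIDS + EXTENDED_CODES)

SEQUENCE_COLUMN = "Sequence"
ID_COLUMN = "ACC"

print(len(VALID_AMINO_ACIDS))
```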
Cell 2 — Cleaning pipeline and saving processed data
This cell defines helper functions and runs the full pipeline inside an `if __name__ == "__main__":` block. In Jupyter, `__name__` is usually `"__main__"`, so the block does run when you execute the cell.
is_valid_sequence(seq)
- Requires a string; strips whitespace and uppercases.
- Rejects empty sequences.
- Returns True only if every character is in `VALID_AMINO_ACIDS`.
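The three checks above could be written roughly as follows. This is a sketch based on the description rather than the notebook's exact code, and the alphabet is assumed to be the 20 standard codes plus X, B, Z, U, and O.

```python
# Assumed alphabet: 20 standard codes plus X, B, Z, U, O.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")


def is_valid_sequence(seq):
    """Return True if seq is a non-empty string made only of allowed codes."""
    if not isinstance(seq, str):
        return False
    seq = seq.strip().upper()
    if not seq:
        return False
    return all(ch in VALID_AMINO_ACIDS for ch in seq)


print(is_valid_sequence(" mkv "))  # surrounding whitespace and lowercase are tolerated
print(is_valid_sequence("MK1"))    # a digit is not an allowed code
```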
load_dataset(file_path)
- Reads the CSV with `pd.read_csv`.
- Prints confirmation, shape, and column list; returns the dataframe.
clean_sequences(df)
- Works on a copy of the dataframe.
- `dropna(subset=[SEQUENCE_COLUMN])`: removes rows with missing sequences.
- Converts `Sequence` to string, then applies strip + upper.
- Keeps rows where `is_valid_sequence` is true.
- `drop_duplicates(subset=[SEQUENCE_COLUMN])`: drops duplicate sequences (first row kept).
- Prints row counts before and after (example: 28,303 unchanged if nothing was dropped).
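A sketch of the cleaning steps for `clean_sequences`, run on a tiny made-up dataframe. The toy rows are invented for illustration and are not DeepLoc data, and the validity check is inlined with an assumed alphabet so the block is self-contained.

```python
import pandas as pd

SEQUENCE_COLUMN = "Sequence"
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")  # assumed alphabet


def clean_sequences(df):
    """Drop missing, invalid, and duplicate sequences; normalize the rest."""
    df = df.copy()
    before = len(df)
    df = df.dropna(subset=[SEQUENCE_COLUMN]).copy()
    df[SEQUENCE_COLUMN] = df[SEQUENCE_COLUMN].astype(str).str.strip().str.upper()
    is_valid = df[SEQUENCE_COLUMN].apply(
        lambda s: len(s) > 0 and all(ch in VALID_AMINO_ACIDS for ch in s)
    )
    df = df[is_valid]
    df = df.drop_duplicates(subset=[SEQUENCE_COLUMN])  # keeps the first occurrence
    print(f"Rows before: {before}, after: {len(df)}")
    return df


# Toy input: one duplicate after normalization, one missing, one invalid.
toy = pd.DataFrame({
    "ACC": ["P1", "P2", "P3", "P4"],
    "Sequence": [" mkv ", "MKV", None, "MK1"],
})
cleaned = clean_sequences(toy)
```

Because normalization happens before deduplication, `" mkv "` and `"MKV"` count as the same sequence and only the first row survives.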
clean_labels(df)
- For each label column: `to_numeric(..., errors="coerce")`, `fillna(0)`, `astype(int)`, then `clip(0, 1)` so labels are binary 0/1.
- Adds `label_count`: sum of the eleven labels per row (number of “positive” compartments).
- Prints the distribution of `label_count`.
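The label binarization can be sketched as below. The three column names are hypothetical stand-ins (the notebook's `LABEL_COLUMNS` has eleven entries that are not listed in this note), and the toy values are invented to exercise each step.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's eleven label columns.
LABEL_COLUMNS = ["label_a", "label_b", "label_c"]


def clean_labels(df):
    """Coerce labels to binary 0/1 integers and add a per-row positive count."""
    df = df.copy()
    for col in LABEL_COLUMNS:
        numeric = pd.to_numeric(df[col], errors="coerce")  # unparseable -> NaN
        df[col] = numeric.fillna(0).astype(int).clip(0, 1)
    df["label_count"] = df[LABEL_COLUMNS].sum(axis=1)
    print(df["label_count"].value_counts().sort_index())
    return df


toy = pd.DataFrame({
    "label_a": [1, "1", None],   # numeric string and missing value
    "label_b": [0, "n/a", 2],    # unparseable text and a value above 1
    "label_c": [1.0, 0, -3],     # float and a negative value
})
result = clean_labels(toy)
```

Note that `astype(int)` truncates before `clip` runs, so under this ordering a fractional value such as 0.7 would become 0 rather than 1.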
build_final_dataframe(df)
- Keeps `ACC`, `Sequence`, and the 11 label columns only (drops `Unnamed: 0`, `Kingdom`, `Partition`, etc.).
- Example shape: 28,303 × 13 (1 ID + 1 sequence + 11 labels).
- Prints shape and `head()` of the result.
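A sketch of the column selection in `build_final_dataframe`, using two hypothetical label names in place of the notebook's eleven. The toy row is invented for illustration.

```python
import pandas as pd

ID_COLUMN = "ACC"
SEQUENCE_COLUMN = "Sequence"
LABEL_COLUMNS = ["label_a", "label_b"]  # hypothetical; the notebook uses eleven


def build_final_dataframe(df):
    """Keep only the ID, the sequence, and the label columns, in that order."""
    final_df = df[[ID_COLUMN, SEQUENCE_COLUMN] + LABEL_COLUMNS].copy()
    print(final_df.shape)
    print(final_df.head())
    return final_df


toy = pd.DataFrame({
    "Unnamed: 0": [0], "ACC": ["P1"], "Kingdom": ["k"], "Partition": [0],
    "label_a": [1], "label_b": [0], "Sequence": ["MKV"], "label_count": [1],
})
final_df = build_final_dataframe(toy)
```

Selecting by an explicit column list also drops helper columns such as `label_count`, which is consistent with the 13-column example shape.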
save_processed_data(df, output_path)
- Uses `Path(output_path)`; creates parent directories if missing (`mkdir(parents=True, exist_ok=True)`).
- Writes CSV with `to_csv(..., index=False)`.
- Prints the output path.
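A sketch of the save step, written against a temporary directory so it can be run anywhere. The file name `example.csv` and the toy row are made up for the demonstration; the notebook writes to its own configured path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed_data(df, output_path):
    """Write df to CSV, creating missing parent directories first."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)  # index=False avoids an extra index column
    print(f"Saved to {output_path}")


with tempfile.TemporaryDirectory() as tmp:
    # The nested folders do not exist yet; mkdir(parents=True) creates them.
    target = Path(tmp) / "data" / "processed" / "example.csv"
    save_processed_data(pd.DataFrame({"ACC": ["P1"], "Sequence": ["MKV"]}), target)
    round_trip = pd.read_csv(target)
```

Writing with `index=False` is what keeps an `Unnamed: 0` column from reappearing when the processed file is read back.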
if __name__ == "__main__": block
- Sets `input_file` and `output_file` (raw `deeploc.csv` → processed `deeploc_multilabel.csv`). Uses raw strings `r"..."` so Windows backslashes are not interpreted as escape sequences.
- Pipeline order: load → clean sequences → clean labels → build final dataframe → save.
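To see why raw strings matter for Windows-style paths, compare the two literals below. The fragment `data\raw` is only an illustrative path piece, not the path configured in the notebook.

```python
plain = "data\raw"  # "\r" is interpreted as a carriage return
raw = r"data\raw"   # raw string: the backslash is kept literally

print(len(plain), len(raw))
print("\r" in plain)  # the plain literal contains a carriage return
print("\\" in raw)    # the raw literal contains a real backslash
```

Forward slashes or `pathlib.Path` avoid the problem as well, which may be preferable if the notebook is meant to run on more than one operating system.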
Net effect: A cleaned multilabel CSV with protein ID, normalized sequence, and eleven binary compartment columns, saved under data/processed/ (exact path as configured in the notebook).
One-line summary
The notebook loads deeploc.csv, sanitizes sequences and binarizes subcellular labels, drops non-essential columns, and writes a compact multilabel file deeploc_multilabel.csv, after an exploratory load in cell 0.