
notebooks/data_preprocessing.ipynb — documentation

This note describes what the notebook does, cell by cell.


Cell 0 — Load DeepLoc and inspect it

| Code | What it does |
| --- | --- |
| `import pandas as pd` | Uses pandas for tabular data. |
| `df = pd.read_csv("../data/raw/deeploc.csv")` | Reads the raw CSV from the project’s data/raw folder (path is relative to the notebook under notebooks/). |
| `print(df.shape)` | Prints row and column counts (example run: 28,303 rows × 16 columns). |
| `print(df.columns.tolist())` | Lists all column names: identifiers, metadata, label columns, and the sequence column. |
| `df.head()` | Shows the first five rows as a preview. |

Takeaway: The file includes columns such as ACC, Kingdom, Partition, eleven subcellular-location columns (numeric), Sequence, and an index-like Unnamed: 0 column.
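
The inspection steps in this cell can be tried without the real file. The sketch below is only an illustration: it reads a tiny in-memory CSV whose IDs, values, and reduced column set are invented stand-ins, whereas the notebook passes the path to deeploc.csv.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for ../data/raw/deeploc.csv. Only a few of
# the 16 columns are reproduced, and every value here is invented.
sample_csv = io.StringIO(
    "Unnamed: 0,ACC,Kingdom,Partition,Cytoplasm,Nucleus,Sequence\n"
    "0,ACC_A,toy,0,1.0,0.0,MKTAYIAKQR\n"
    "1,ACC_B,toy,1,0.0,1.0,MSLLTEVETY\n"
)

df = pd.read_csv(sample_csv)   # the notebook passes the file path instead
print(df.shape)                # (rows, columns)
print(df.columns.tolist())     # all column names
print(df.head())               # up to five rows (only two exist here)
```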


Cell 1 — Shared constants for cleaning and modeling

| Code | What it does |
| --- | --- |
| `from pathlib import Path` | Used later for path handling when saving files (save_processed_data). |
| `VALID_AMINO_ACIDS` | Allowed one-letter codes: standard 20 amino acids plus common extensions (X unknown, B Asx, Z Glx, U selenocysteine, O pyrrolysine). Rows are kept only if every character in the sequence is in this set (after uppercasing). |
| `LABEL_COLUMNS` | The 11 DeepLoc subcellular compartments treated as binary multilabel targets. |
| `SEQUENCE_COLUMN = "Sequence"` | Name of the protein sequence column. |
| `ID_COLUMN = "ACC"` | Accession / protein ID used as the row identifier in the final table. |

Purpose: A single place to define the valid amino acid alphabet, the label columns, and the names of the sequence and ID columns.
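
A minimal sketch of what such a constants cell might look like. The container type of VALID_AMINO_ACIDS (a set here) is an assumption, and the label names are placeholders, since this note does not spell out the eleven DeepLoc column names.

```python
# Standard 20 amino acids plus the extension codes described above.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EXTENDED_CODES = "XBZUO"  # X unknown, B Asx, Z Glx, U selenocysteine, O pyrrolysine
VALID_AMINO_ACIDS = set(STANDARD_AMINO_ACIDS + EXTENDED_CODES)  # set type is assumed

# Assumption: placeholder names only. The notebook lists the eleven real
# DeepLoc compartment columns here.
LABEL_COLUMNS = [f"label_{i}" for i in range(11)]

SEQUENCE_COLUMN = "Sequence"
ID_COLUMN = "ACC"
```

A set gives constant-time membership checks, which suits a per-character test over many sequences.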


Cell 2 — Cleaning pipeline and saving processed data

This cell defines helper functions and runs the full pipeline inside an if __name__ == "__main__": block. In a Jupyter kernel, __name__ is normally "__main__", so the block runs whenever the cell is executed.

is_valid_sequence(seq)

  • Requires a string; strips whitespace and uppercases.
  • Rejects empty sequences.
  • Returns True only if every character is in VALID_AMINO_ACIDS.
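
The three rules above could be sketched as follows. This is an illustration, not the notebook's exact code; in particular, returning False for non-string input (rather than raising) is an assumption.

```python
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")


def is_valid_sequence(seq):
    # Assumption: non-string input is rejected by returning False.
    if not isinstance(seq, str):
        return False
    seq = seq.strip().upper()
    if not seq:  # empty after stripping whitespace
        return False
    # Every character must be an allowed one-letter code.
    return all(ch in VALID_AMINO_ACIDS for ch in seq)


print(is_valid_sequence(" mktay "))  # True: normalized to "MKTAY"
print(is_valid_sequence("MK*TAY"))   # False: "*" is not an allowed code
print(is_valid_sequence("   "))      # False: empty after stripping
```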

load_dataset(file_path)

  • Reads the CSV with pd.read_csv.
  • Prints confirmation, shape, and column list; returns the dataframe.
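
A sketch of a loader with this behavior, exercised on a throwaway file. The message wording and the toy file contents are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(file_path):
    df = pd.read_csv(file_path)
    print(f"Loaded {file_path}")            # confirmation (wording assumed)
    print("Shape:", df.shape)
    print("Columns:", df.columns.tolist())
    return df


# Toy file with invented contents, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.csv"
    toy_path.write_text("ACC,Sequence\nACC_A,MKTAY\n")
    toy_df = load_dataset(toy_path)
```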

clean_sequences(df)

  • Works on a copy of the dataframe.
  • dropna(subset=[SEQUENCE_COLUMN]) — removes rows with missing sequences.
  • Converts Sequence to string, then strips whitespace and uppercases it.
  • Keeps rows where is_valid_sequence is true.
  • drop_duplicates(subset=[SEQUENCE_COLUMN]) — drops duplicate sequences (first row kept).
  • Prints row counts before and after (in the example run both are 28,303, meaning no rows were dropped).
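
The steps above, in order, might look like the sketch below. The toy rows are invented, and the compact is_valid_sequence is a stand-in for the helper described earlier.

```python
import pandas as pd

SEQUENCE_COLUMN = "Sequence"
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")


def is_valid_sequence(seq):
    seq = seq.strip().upper() if isinstance(seq, str) else ""
    return bool(seq) and all(ch in VALID_AMINO_ACIDS for ch in seq)


def clean_sequences(df):
    before = len(df)
    df = df.copy()                                  # leave the input untouched
    df = df.dropna(subset=[SEQUENCE_COLUMN])        # drop missing sequences
    normalized = df[SEQUENCE_COLUMN].astype(str).str.strip().str.upper()
    df = df.assign(**{SEQUENCE_COLUMN: normalized})
    df = df[df[SEQUENCE_COLUMN].apply(is_valid_sequence)]
    df = df.drop_duplicates(subset=[SEQUENCE_COLUMN])  # first occurrence kept
    print(f"Rows before: {before}, after: {len(df)}")
    return df


# Invented rows: a duplicate after normalization, a missing value, and an
# invalid character.
toy = pd.DataFrame({
    "ACC": ["A", "B", "C", "D"],
    "Sequence": ["mktay ", "MKTAY", None, "MK*AY"],
})
cleaned = clean_sequences(toy)
```

Because normalization happens before deduplication, "mktay " and "MKTAY" count as the same sequence and only row A survives.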

clean_labels(df)

  • For each label column: to_numeric(..., errors="coerce"), fillna(0), astype(int), then clip(0, 1) so labels are binary 0/1.
  • Adds label_count: sum of the eleven labels per row (number of “positive” compartments).
  • Prints the distribution of label_count.
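
A sketch of the label cleaning chain on invented data. The two column names are placeholders for the eleven DeepLoc label columns.

```python
import pandas as pd

LABEL_COLUMNS = ["loc_a", "loc_b"]  # placeholders for the eleven real columns


def clean_labels(df):
    df = df.copy()
    for col in LABEL_COLUMNS:
        df[col] = (
            pd.to_numeric(df[col], errors="coerce")  # unparseable values become NaN
            .fillna(0)                               # missing means "not located here"
            .astype(int)
            .clip(0, 1)                              # force binary 0/1
        )
    # Number of positive compartments per row.
    df["label_count"] = df[LABEL_COLUMNS].sum(axis=1)
    print(df["label_count"].value_counts().sort_index())
    return df


toy = pd.DataFrame({"loc_a": ["1.0", "n/a", 3], "loc_b": [1, 1, None]})
labeled = clean_labels(toy)
```

With this ordering, astype(int) truncates before clip runs, so a fractional value such as 0.5 would end up as 0.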

build_final_dataframe(df)

  • Keeps only ACC, Sequence, and the 11 label columns. Unnamed: 0, Kingdom, Partition, and the helper label_count column added by clean_labels are dropped.
  • Example shape: 28,303 × 13 (1 ID + 1 sequence + 11 labels).
  • Prints shape and head() of the result.
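
The column selection could be sketched as below, again with placeholder label names and an invented row.

```python
import pandas as pd

ID_COLUMN = "ACC"
SEQUENCE_COLUMN = "Sequence"
LABEL_COLUMNS = ["loc_a", "loc_b"]  # placeholders for the eleven real columns


def build_final_dataframe(df):
    # Select ID, sequence, and labels; every other column is left behind.
    final_df = df[[ID_COLUMN, SEQUENCE_COLUMN] + LABEL_COLUMNS].copy()
    print("Final shape:", final_df.shape)
    print(final_df.head())
    return final_df


toy = pd.DataFrame({
    "Unnamed: 0": [0],
    "ACC": ["ACC_A"],
    "Kingdom": ["toy"],
    "Sequence": ["MKTAY"],
    "loc_a": [1],
    "loc_b": [0],
    "label_count": [1],
})
final_df = build_final_dataframe(toy)
```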

save_processed_data(df, output_path)

  • Uses Path(output_path); creates parent directories if missing (mkdir(parents=True, exist_ok=True)).
  • Writes CSV with to_csv(..., index=False).
  • Prints the output path.
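
A sketch of the save step, written against a temporary directory so it can be run without touching the project tree. The message wording and the toy frame are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed_data(df, output_path):
    output_path = Path(output_path)
    # Create data/processed (or any missing parent) before writing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)  # index=False avoids an extra index column
    print(f"Saved processed data to {output_path}")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processed" / "toy.csv"  # parent does not exist yet
    toy = pd.DataFrame({"ACC": ["ACC_A"], "Sequence": ["MKTAY"]})
    save_processed_data(toy, target)
    saved_text = target.read_text()
```

Writing with index=False also keeps an Unnamed: 0 column, like the one in the raw file, from appearing when the CSV is read back.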

if __name__ == "__main__": block

  • Sets input_file and output_file (raw deeploc.csv → processed deeploc_multilabel.csv). Uses raw strings r"..." so Windows backslashes are not interpreted as escape sequences.
  • Pipeline order: load → clean sequences → clean labels → build final dataframe → save.
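
To illustrate why the raw-string prefix matters for Windows-style paths, the sketch below compares a plain and a raw literal. The path is hypothetical and chosen so that it contains backslash sequences Python would otherwise interpret.

```python
# Hypothetical path, picked because "\r" and "\n" are escape sequences.
plain = "data\raw\new.csv"   # contains a carriage return and a newline
raw = r"data\raw\new.csv"    # backslashes are kept literally

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))  # the plain literal is shorter
```

Forward slashes or pathlib.Path objects are common alternatives that sidestep the issue.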

Net effect: A cleaned multilabel CSV with protein ID, normalized sequence, and eleven binary compartment columns, saved under data/processed/ (exact path as configured in the notebook).


One-line summary

After an exploratory load in cell 0, the notebook loads deeploc.csv, sanitizes the sequences, binarizes the subcellular labels, drops non-essential columns, and writes a compact multilabel file, deeploc_multilabel.csv.