`notebooks/data_preprocessing.ipynb` — documentation

This note describes what the notebook does, cell by cell.

Cell 0 — Load DeepLoc and inspect it

| Code | What it does |
| --- | --- |
| `import pandas as pd` | Uses pandas for tabular data. |
| `df = pd.read_csv("../data/raw/deeploc.csv")` | Reads the raw CSV from the project’s `data/raw` folder (path is relative to the notebook under `notebooks/`). |
| `print(df.shape)` | Prints row and column counts (example run: 28,303 rows × 16 columns). |
| `print(df.columns.tolist())` | Lists all column names: identifiers, metadata, label columns, and the sequence column. |
| `df.head()` | Shows the first five rows as a preview. |

Takeaway: The file includes columns such as `ACC`, `Kingdom`, `Partition`, eleven numeric subcellular-location columns, `Sequence`, and an index-like `Unnamed: 0` column.

Cell 1 — Shared constants for cleaning and modeling

| Code | What it does |
| --- | --- |
| `from pathlib import Path` | Used later for path handling when saving files (`save_processed_data`). |
| `VALID_AMINO_ACIDS` | Allowed one-letter codes: standard 20 amino acids plus common extensions (X unknown, B Asx, Z Glx, U selenocysteine, O pyrrolysine). Rows are kept only if every character in the sequence is in this set (after uppercasing). |
| `LABEL_COLUMNS` | The 11 DeepLoc subcellular compartments treated as binary multilabel targets. |
| `SEQUENCE_COLUMN = "Sequence"` | Name of the protein sequence column. |
| `ID_COLUMN = "ACC"` | Accession / protein ID used as the row identifier in the final table. |

Purpose: Single place to define valid amino acid alphabets and which columns are labels.

Cell 2 — Cleaning pipeline and saving processed data

This cell defines helper functions and runs the full pipeline inside an `if __name__ == "__main__":` block. Jupyter executes cells in the `__main__` namespace, so `__name__` equals `"__main__"` and the block runs when you execute the cell.

`is_valid_sequence(seq)`

Requires a string; strips whitespace and uppercases.
Rejects empty sequences.
Returns True only if every character is in VALID_AMINO_ACIDS.
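The validation rule above can be sketched as follows. This is a minimal sketch based only on the description in this note, not a copy of the notebook's code; the exact construction of `VALID_AMINO_ACIDS` is an assumption.

```python
# Allowed one-letter codes: the 20 standard amino acids plus X, B, Z, U, O.
# (Assumed construction; the notebook may define the set differently.)
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY") | set("XBZUO")


def is_valid_sequence(seq):
    """Return True only for a non-empty string made entirely of allowed codes."""
    if not isinstance(seq, str):
        return False  # missing values and numbers are rejected outright
    seq = seq.strip().upper()
    if not seq:
        return False  # empty after stripping whitespace
    return all(ch in VALID_AMINO_ACIDS for ch in seq)


print(is_valid_sequence(" mkvl "))  # lowercase with padding is normalized first
print(is_valid_sequence("MK1L"))    # the digit is not an allowed code
```

Normalizing inside the function means callers can pass raw values without cleaning them first.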

`load_dataset(file_path)`

Reads the CSV with pd.read_csv.
Prints confirmation, shape, and column list; returns the dataframe.

`clean_sequences(df)`

Works on a copy of the dataframe.
`dropna(subset=[SEQUENCE_COLUMN])`: removes rows with missing sequences.
Converts `Sequence` to string, then applies `strip` and `upper`.
Keeps only rows for which `is_valid_sequence` returns True.
`drop_duplicates(subset=[SEQUENCE_COLUMN])`: drops duplicate sequences, keeping the first occurrence.
Prints row counts before and after (in the example run the count stays at 28,303, so no rows were dropped).
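The steps above can be tried on a toy table. The sketch below follows the order described in this note; the toy rows are invented for illustration, and the validity check is inlined as a set comparison rather than calling the notebook's `is_valid_sequence`.

```python
import pandas as pd

SEQUENCE_COLUMN = "Sequence"
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")


def clean_sequences(df):
    """Drop missing, invalid, and duplicate sequences (sketch of the described steps)."""
    cleaned = df.copy()
    before = len(cleaned)
    # Drop missing values first: astype(str) would turn them into the text "nan",
    # which uppercases to "NAN" and would pass the alphabet check.
    cleaned = cleaned.dropna(subset=[SEQUENCE_COLUMN])
    cleaned[SEQUENCE_COLUMN] = (
        cleaned[SEQUENCE_COLUMN].astype(str).str.strip().str.upper()
    )
    is_valid = cleaned[SEQUENCE_COLUMN].map(
        lambda s: bool(s) and set(s) <= VALID_AMINO_ACIDS
    )
    cleaned = cleaned[is_valid]
    cleaned = cleaned.drop_duplicates(subset=[SEQUENCE_COLUMN])  # keeps first row
    print(f"Rows before: {before}, after: {len(cleaned)}")
    return cleaned


# Invented toy data: one duplicate after normalization, one missing, one invalid.
toy = pd.DataFrame({
    "ACC": ["P1", "P2", "P3", "P4", "P5"],
    "Sequence": ["MKVL", " mkvl ", None, "MK1L", "ACDX"],
})
result = clean_sequences(toy)
```

Only `P1` and `P5` survive here: `P2` duplicates `P1` once normalized, `P3` has no sequence, and `P4` contains a digit.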

`clean_labels(df)`

For each label column: to_numeric(..., errors="coerce"), fillna(0), astype(int), then clip(0, 1) so labels are binary 0/1.
Adds label_count: sum of the eleven labels per row (number of “positive” compartments).
Prints the distribution of label_count.
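A small example makes the binarization chain concrete. This is a sketch under stated assumptions: the three label names below are hypothetical stand-ins for the notebook's eleven `LABEL_COLUMNS`, and the toy values are invented to exercise each step.

```python
import pandas as pd

# Hypothetical label names for illustration only; the notebook defines
# eleven DeepLoc compartment columns in LABEL_COLUMNS.
LABEL_COLUMNS = ["loc_a", "loc_b", "loc_c"]


def clean_labels(df):
    """Coerce each label column to binary 0/1 and add a per-row positive count."""
    cleaned = df.copy()
    for col in LABEL_COLUMNS:
        cleaned[col] = (
            pd.to_numeric(cleaned[col], errors="coerce")  # non-numeric -> NaN
            .fillna(0)                                    # NaN -> 0
            .astype(int)
            .clip(0, 1)                                   # force into {0, 1}
        )
    cleaned["label_count"] = cleaned[LABEL_COLUMNS].sum(axis=1)
    print(cleaned["label_count"].value_counts().sort_index())
    return cleaned


# Invented messy inputs: strings, a missing value, and out-of-range numbers.
toy = pd.DataFrame({
    "loc_a": [1, "1", "yes", 2.0],
    "loc_b": [0, None, 1, -3],
    "loc_c": ["0", 1.0, 0, 0],
})
labels = clean_labels(toy)
```

In this toy run, `"yes"` and the missing value become 0, `2.0` is clipped down to 1, and `-3` is clipped up to 0.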

`build_final_dataframe(df)`

Keeps ACC, Sequence, and the 11 label columns only (drops Unnamed: 0, Kingdom, Partition, etc.).
Example shape: 28,303 × 13 (1 ID + 1 sequence + 11 labels).
Prints shape and head() of the result.
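The column selection can be sketched as a plain list-based subset. The two label names and the toy rows below are hypothetical; with the notebook's eleven label columns the same selection yields the 13 columns mentioned above.

```python
import pandas as pd

ID_COLUMN = "ACC"
SEQUENCE_COLUMN = "Sequence"
LABEL_COLUMNS = ["loc_a", "loc_b"]  # hypothetical stand-ins for the eleven labels


def build_final_dataframe(df):
    """Keep only the ID, sequence, and label columns, in that order."""
    keep = [ID_COLUMN, SEQUENCE_COLUMN] + LABEL_COLUMNS
    final = df[keep].copy()
    print(final.shape)
    print(final.head())
    return final


# Invented toy table with the extra columns that should be dropped.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "ACC": ["P1", "P2"],
    "Kingdom": ["k1", "k2"],
    "Partition": [0, 1],
    "loc_a": [1, 0],
    "loc_b": [0, 1],
    "Sequence": ["MKVL", "ACDX"],
    "label_count": [1, 1],
})
final = build_final_dataframe(toy)
```

Selecting by an explicit list also fixes the column order, so the output layout does not depend on the order in the raw file.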

`save_processed_data(df, output_path)`

Uses Path(output_path); creates parent directories if missing (mkdir(parents=True, exist_ok=True)).
Writes CSV with to_csv(..., index=False).
Prints the output path.
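A minimal sketch of the save step, written against a temporary directory so it can run anywhere. Returning the path is an addition for the example's convenience, not something this note says the notebook does, and the file name `toy.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed_data(df, output_path):
    """Write df to CSV, creating missing parent folders first."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)  # no extra index column in the file
    print(f"Saved to: {output_path}")
    return output_path  # assumption: returned here only to simplify the demo


with tempfile.TemporaryDirectory() as tmp:
    # The nested folders do not exist yet; mkdir(parents=True) creates them.
    target = Path(tmp) / "data" / "processed" / "toy.csv"
    frame = pd.DataFrame({"ACC": ["P1"], "Sequence": ["MKVL"]})
    saved = save_processed_data(frame, target)
    round_trip = pd.read_csv(saved)
```

Writing with `index=False` avoids reintroducing an `Unnamed: 0` column when the file is read back.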

`if __name__ == "__main__":` block

Sets input_file and output_file (raw deeploc.csv → processed deeploc_multilabel.csv). Uses raw strings r"..." so Windows backslashes are not interpreted as escape sequences.
Pipeline order: load → clean sequences → clean labels → build final dataframe → save.
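The raw-string point is easy to see with a short comparison. The path below is hypothetical and chosen so that the backslashes form valid escape sequences; the notebook's actual paths are not reproduced here.

```python
# Hypothetical Windows-style path, for illustration only.
plain = "data\raw\new_file.csv"   # "\r" and "\n" are read as control characters
raw = r"data\raw\new_file.csv"    # raw string: backslashes are kept literally

print(repr(plain))
print(repr(raw))
```

The plain string silently loses both backslashes and gains a carriage return and a newline, which is the kind of corrupted path that raw strings avoid.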

Net effect: A cleaned multilabel CSV with protein ID, normalized sequence, and eleven binary compartment columns, saved under data/processed/ (exact path as configured in the notebook).

One-line summary

After an exploratory load in cell 0, the notebook loads deeploc.csv, sanitizes sequences, binarizes the subcellular labels, drops non-essential columns, and writes a compact multilabel file, deeploc_multilabel.csv.
