protloc-ai / docs /data_preprocessing_notebook.md

Tanoj22

Initial commit: ProtLoc-AI project setup and core app

cb6f1ba about 2 months ago

	# `notebooks/data_preprocessing.ipynb` — documentation

	This note describes what the notebook does, cell by cell.

	---

	## Cell 0 — Load DeepLoc and inspect it

	| Code | What it does |
	|------|--------------|
	| `import pandas as pd` | Uses pandas for tabular data. |
	| `df = pd.read_csv("../data/raw/deeploc.csv")` | Reads the raw CSV from the project’s `data/raw` folder (path is relative to the notebook under `notebooks/`). |
	| `print(df.shape)` | Prints row and column counts (example run: 28,303 rows × 16 columns). |
	| `print(df.columns.tolist())` | Lists all column names: identifiers, metadata, label columns, and the sequence column. |
	| `df.head()` | Shows the first five rows as a preview. |

	Takeaway: The file includes columns such as `ACC`, `Kingdom`, `Partition`, eleven subcellular-location columns (numeric), `Sequence`, and an index-like `Unnamed: 0` column.

	---

	## Cell 1 — Shared constants for cleaning and modeling

	| Code | What it does |
	|------|--------------|
	| `from pathlib import Path` | Used later for path handling when saving files (`save_processed_data`). |
	| `VALID_AMINO_ACIDS` | Allowed one-letter codes: standard 20 amino acids plus common extensions (X unknown, B Asx, Z Glx, U selenocysteine, O pyrrolysine). Rows are kept only if every character in the sequence is in this set (after uppercasing). |
	| `LABEL_COLUMNS` | The 11 DeepLoc subcellular compartments treated as binary multilabel targets. |
	| `SEQUENCE_COLUMN = "Sequence"` | Name of the protein sequence column. |
	| `ID_COLUMN = "ACC"` | Accession / protein ID used as the row identifier in the final table. |

	Purpose: A single place to define the valid amino-acid alphabet and to name the label, sequence, and ID columns.
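
The constants described above might look roughly like the sketch below. The helper names `STANDARD_AMINO_ACIDS` and `EXTENDED_CODES` are assumptions made for illustration, and `LABEL_COLUMNS` is left out because this note does not list the eleven column names.

```python
# Sketch of the shared constants; the exact literals in the notebook may differ.
# STANDARD_AMINO_ACIDS and EXTENDED_CODES are hypothetical helper names.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
EXTENDED_CODES = "XBZUO"  # X unknown, B Asx, Z Glx, U selenocysteine, O pyrrolysine

# A set gives constant-time membership checks for each sequence character.
VALID_AMINO_ACIDS = set(STANDARD_AMINO_ACIDS + EXTENDED_CODES)

SEQUENCE_COLUMN = "Sequence"  # protein sequence column
ID_COLUMN = "ACC"             # accession / protein ID column

print(len(VALID_AMINO_ACIDS))  # 25
```

Using a set rather than a string keeps the per-character lookup cheap when tens of thousands of sequences are validated.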

	---

	## Cell 2 — Cleaning pipeline and saving processed data

	This cell defines helper functions and runs the full pipeline inside `if __name__ == "__main__":`. In a Jupyter kernel, `__name__` is normally `"__main__"`, so the block runs when you execute the cell.

	### `is_valid_sequence(seq)`

	- Requires a string; strips whitespace and uppercases.
	- Rejects empty sequences.
	- Returns True only if every character is in `VALID_AMINO_ACIDS`.
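
A minimal sketch of a validator that follows the three rules above. The body is an assumption reconstructed from this description, not a copy of the notebook's code, and the alphabet literal is likewise assumed.

```python
# Assumed alphabet: 20 standard residues plus X, B, Z, U, O.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")


def is_valid_sequence(seq):
    """Return True if seq is a non-empty string of allowed residue codes."""
    if not isinstance(seq, str):
        return False  # NaN, None, and numbers are rejected
    seq = seq.strip().upper()
    if not seq:
        return False  # empty or whitespace-only
    return all(ch in VALID_AMINO_ACIDS for ch in seq)


print(is_valid_sequence(" mkvl "))  # True
print(is_valid_sequence("MKV*L"))   # False
print(is_valid_sequence(""))        # False
print(is_valid_sequence(None))      # False
```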

	### `load_dataset(file_path)`

	- Reads the CSV with `pd.read_csv`.
	- Prints confirmation, shape, and column list; returns the dataframe.
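
One way such a loader could be written is sketched below. The print wording and the tiny demo CSV are assumptions for illustration; the real notebook reads `deeploc.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(file_path):
    """Read a CSV, report its shape and columns, and return the dataframe."""
    df = pd.read_csv(file_path)
    print(f"Loaded {file_path}")
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    return df


# Demo on a throwaway two-row file (hypothetical data, not DeepLoc).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("ACC,Sequence\nP00001,MKVL\nP00002,ACDE\n")
    demo_df = load_dataset(demo_path)
```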

	### `clean_sequences(df)`

	- Works on a copy of the dataframe.
	- `dropna(subset=[SEQUENCE_COLUMN])` — removes rows with missing sequences.
	- Converts `Sequence` to string, then strips surrounding whitespace and uppercases it.
	- Keeps rows where `is_valid_sequence` is true.
	- `drop_duplicates(subset=[SEQUENCE_COLUMN])` — drops duplicate sequences (first row kept).
	- Prints row counts before and after cleaning (in the example run the count stays at 28,303, meaning no rows were dropped).
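
The steps above can be sketched as follows. This is an approximation under stated assumptions: the function body, the compact validator, and the five-row demo frame are all illustrative, not taken from the notebook.

```python
import pandas as pd

SEQUENCE_COLUMN = "Sequence"
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")  # assumed alphabet


def is_valid_sequence(seq):
    """Compact validator; expects an already stripped, uppercased string."""
    return isinstance(seq, str) and bool(seq) and all(c in VALID_AMINO_ACIDS for c in seq)


def clean_sequences(df):
    before = len(df)
    # Drop missing sequences; the explicit copy keeps the caller's frame untouched.
    df = df.dropna(subset=[SEQUENCE_COLUMN]).copy()
    # Normalize: string type, no surrounding whitespace, uppercase.
    df[SEQUENCE_COLUMN] = df[SEQUENCE_COLUMN].astype(str).str.strip().str.upper()
    # Keep only sequences made entirely of allowed codes.
    df = df[df[SEQUENCE_COLUMN].apply(is_valid_sequence)]
    # Remove repeated sequences, keeping the first occurrence.
    df = df.drop_duplicates(subset=[SEQUENCE_COLUMN])
    print(f"Rows before: {before}, after: {len(df)}")
    return df


# Hypothetical demo rows: one missing, one invalid, one duplicate after normalization.
demo = pd.DataFrame({
    "ACC": ["P1", "P2", "P3", "P4", "P5"],
    "Sequence": ["mkvl", None, "MK*L", " MKVL ", "ACDE"],
})
cleaned = clean_sequences(demo)
print(cleaned)
```

Normalizing before deduplicating matters here: `"mkvl"` and `" MKVL "` only collapse into one row because both have already been stripped and uppercased.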

	### `clean_labels(df)`

	- For each label column: `to_numeric(..., errors="coerce")`, `fillna(0)`, `astype(int)`, then `clip(0, 1)` so labels are binary 0/1.
	- Adds `label_count`: sum of the eleven labels per row (number of “positive” compartments).
	- Prints the distribution of `label_count`.
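
A sketch of the binarization chain described above. The three column names in `LABEL_COLUMNS` are hypothetical stand-ins, since this note does not list the eleven real names, and the demo values are invented to exercise each step.

```python
import pandas as pd

# Hypothetical stand-ins for the eleven DeepLoc label columns.
LABEL_COLUMNS = ["loc_a", "loc_b", "loc_c"]


def clean_labels(df):
    df = df.copy()
    for col in LABEL_COLUMNS:
        df[col] = (
            pd.to_numeric(df[col], errors="coerce")  # non-numeric values become NaN
            .fillna(0)                               # missing means "not in this compartment"
            .astype(int)
            .clip(0, 1)                              # force strictly binary 0/1
        )
    # Number of positive compartments per protein.
    df["label_count"] = df[LABEL_COLUMNS].sum(axis=1)
    print(df["label_count"].value_counts().sort_index())
    return df


demo = pd.DataFrame({
    "loc_a": [1, "1", None],
    "loc_b": [0, "yes", 2],
    "loc_c": [1.0, 0, -1],
})
labeled = clean_labels(demo)
print(labeled)
```

Note that `astype(int)` truncates toward zero, so a fractional value such as 0.7 would become 0 before clipping; whether that matters depends on how the raw labels are encoded.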

	### `build_final_dataframe(df)`

	- Keeps `ACC`, `Sequence`, and the 11 label columns only (drops `Unnamed: 0`, `Kingdom`, `Partition`, etc.).
	- Example shape: 28,303 × 13 (1 ID + 1 sequence + 11 labels).
	- Prints shape and `head()` of the result.
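
The column selection can be sketched like this. The two label names and the demo rows are hypothetical; with the real eleven label columns the result would have 13 columns, as noted above.

```python
import pandas as pd

ID_COLUMN = "ACC"
SEQUENCE_COLUMN = "Sequence"
LABEL_COLUMNS = ["loc_a", "loc_b"]  # hypothetical stand-ins for the 11 labels


def build_final_dataframe(df):
    """Keep only the ID, the sequence, and the label columns, in that order."""
    keep = [ID_COLUMN, SEQUENCE_COLUMN] + LABEL_COLUMNS
    final_df = df[keep].copy()
    print(final_df.shape)
    print(final_df.head())
    return final_df


# Invented rows that mimic the raw layout, including columns to be dropped.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "ACC": ["P1", "P2"],
    "Kingdom": ["Metazoa", "Fungi"],
    "Partition": [0, 1],
    "loc_a": [1, 0],
    "loc_b": [0, 1],
    "Sequence": ["MKVL", "ACDE"],
    "label_count": [1, 1],
})
final_df = build_final_dataframe(demo)
```

Selecting an explicit keep-list, rather than dropping named columns, means any unexpected extra column in the raw file is excluded automatically.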

	### `save_processed_data(df, output_path)`

	- Uses `Path(output_path)`; creates parent directories if missing (`mkdir(parents=True, exist_ok=True)`).
	- Writes CSV with `to_csv(..., index=False)`.
	- Prints the output path.
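
A minimal sketch of the save step, assuming the behaviour listed above. The message text and the temporary demo path are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed_data(df, output_path):
    output_path = Path(output_path)
    # Create missing parent folders; do nothing if they already exist.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(output_path, index=False)
    print(f"Saved processed data to {output_path}")


# Demo: write into a nested folder that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processed" / "nested" / "demo.csv"
    save_processed_data(pd.DataFrame({"ACC": ["P1"], "Sequence": ["MKVL"]}), target)
    written = target.read_text()

print(written)
```

Writing with `index=False` is also what keeps an `Unnamed: 0` column from reappearing the next time the processed file is read.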

	### `if __name__ == "__main__":` block

	- Sets `input_file` and `output_file` (raw `deeploc.csv` → processed `deeploc_multilabel.csv`). Uses raw strings `r"..."` so Windows backslashes are not interpreted as escape sequences.
	- Pipeline order: load → clean sequences → clean labels → build final dataframe → save.
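
To illustrate why raw strings matter for Windows paths, the sketch below compares a plain and a raw literal. The path `data\raw\new.csv` is a made-up example chosen because it contains `\r` and `\n`; it is not the notebook's configured path.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows-style path, used only to show the escape problem.
plain = "data\raw\new.csv"   # "\r" and "\n" are read as control characters
raw = r"data\raw\new.csv"    # backslashes are kept literally

print(len(plain), len(raw))          # 14 16
print("\n" in plain, "\n" in raw)    # True False
print(PureWindowsPath(raw).parts)    # ('data', 'raw', 'new.csv')
```

Forward slashes or `pathlib.Path` objects are an alternative that sidesteps the issue on every platform.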

	Net effect: A cleaned multilabel CSV with protein ID, normalized sequence, and eleven binary compartment columns, saved under `data/processed/` (exact path as configured in the notebook).

	---

	## One-line summary

	After an exploratory load in cell 0, the notebook loads `deeploc.csv`, validates and normalizes the sequences, binarizes the subcellular labels, drops non-essential columns, and writes a compact multilabel file, `deeploc_multilabel.csv`.
