# `notebooks/data_preprocessing.ipynb` — documentation
This note describes what the notebook does, cell by cell.
---
## Cell 0 — Load DeepLoc and inspect it
| Code | What it does |
|------|----------------|
| `import pandas as pd` | Uses pandas for tabular data. |
| `df = pd.read_csv("../data/raw/deeploc.csv")` | Reads the raw CSV from the project’s `data/raw` folder (path is relative to the notebook under `notebooks/`). |
| `print(df.shape)` | Prints row and column counts (example run: **28,303** rows × **16** columns). |
| `print(df.columns.tolist())` | Lists all column names: identifiers, metadata, label columns, and the sequence column. |
| `df.head()` | Shows the first five rows as a preview. |
**Takeaway:** The file includes columns such as `ACC`, `Kingdom`, `Partition`, eleven subcellular-location columns (numeric), `Sequence`, and an index-like `Unnamed: 0` column.
---
## Cell 1 — Shared constants for cleaning and modeling
| Code | What it does |
|------|----------------|
| `from pathlib import Path` | Used later for path handling when saving files (`save_processed_data`). |
| `VALID_AMINO_ACIDS` | Allowed one-letter codes: standard 20 amino acids plus common extensions (**X** unknown, **B** Asx, **Z** Glx, **U** selenocysteine, **O** pyrrolysine). Rows are kept only if every character in the sequence is in this set (after uppercasing). |
| `LABEL_COLUMNS` | The **11** DeepLoc subcellular compartments treated as **binary multilabel** targets. |
| `SEQUENCE_COLUMN = "Sequence"` | Name of the protein sequence column. |
| `ID_COLUMN = "ACC"` | Accession / protein ID used as the row identifier in the final table. |
**Purpose:** A single place to define the valid amino acid alphabet and the column names used for labels, sequences, and IDs.
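
As a rough illustration, the alphabet described above could be written as follows. This is a sketch, not the notebook's actual code: the helper names `STANDARD_AMINO_ACIDS` and `EXTENDED_CODES` are hypothetical, the use of a `set` is an assumption, and `LABEL_COLUMNS` is left out because this note does not list the eleven column names.

```python
# Sketch only: the alphabet is reconstructed from the description in this note.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (hypothetical name)
EXTENDED_CODES = "XBZUO"                       # unknown, Asx, Glx, selenocysteine, pyrrolysine

# A set gives fast membership checks when validating sequences (assumed container type).
VALID_AMINO_ACIDS = set(STANDARD_AMINO_ACIDS + EXTENDED_CODES)

SEQUENCE_COLUMN = "Sequence"
ID_COLUMN = "ACC"

print(sorted(VALID_AMINO_ACIDS))
```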
---
## Cell 2 — Cleaning pipeline and saving processed data
This cell defines helper functions and runs a full pipeline inside `if __name__ == "__main__":`. In Jupyter, `__name__` is usually `"__main__"`, so that block **does** run when you execute the cell.
### `is_valid_sequence(seq)`
- Requires a **string**; strips whitespace and uppercases.
- Rejects empty sequences.
- Returns **True** only if **every** character is in `VALID_AMINO_ACIDS`.
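
The three rules above can be sketched as a small function. This is a minimal version written from the description, assuming `VALID_AMINO_ACIDS` is a set of one-letter codes; the notebook's own implementation may differ in detail.

```python
# Assumed alphabet: 20 standard residues plus X, B, Z, U, O.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")


def is_valid_sequence(seq):
    """Return True only for non-empty strings made up entirely of allowed codes."""
    if not isinstance(seq, str):      # non-strings (None, NaN, numbers) are rejected
        return False
    seq = seq.strip().upper()         # normalize before checking
    if not seq:                       # empty after stripping
        return False
    return all(ch in VALID_AMINO_ACIDS for ch in seq)


for candidate in [" mktay ", "MK1", "", None]:
    print(repr(candidate), is_valid_sequence(candidate))
```

Because the check runs after `strip()` and `upper()`, lowercase or padded input is accepted as long as the letters themselves are allowed.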
### `load_dataset(file_path)`
- Reads the CSV with `pd.read_csv`.
- Prints confirmation, shape, and column list; returns the dataframe.
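
A loader along these lines might look like the sketch below. The printed messages are placeholders, and the in-memory CSV (`toy_csv`) is a made-up stand-in so the example runs without the real `deeploc.csv`.

```python
import io

import pandas as pd


def load_dataset(file_path):
    """Read a CSV, report its shape and columns, and return the dataframe."""
    df = pd.read_csv(file_path)
    print("Dataset loaded")                  # placeholder wording
    print("Shape:", df.shape)
    print("Columns:", df.columns.tolist())
    return df


# pd.read_csv also accepts file-like objects, so a tiny in-memory CSV is enough here.
toy_csv = io.StringIO("ACC,Sequence\nP1,MKTAY\nP2,ACDE\n")
df = load_dataset(toy_csv)
```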
### `clean_sequences(df)`
- Works on a **copy** of the dataframe.
- `dropna(subset=[SEQUENCE_COLUMN])` — removes rows with missing sequences.
- Converts `Sequence` to string, **strip + upper**.
- Keeps rows where `is_valid_sequence` is true.
- `drop_duplicates(subset=[SEQUENCE_COLUMN])` — drops duplicate sequences (first row kept).
- Prints row counts before and after (in the example run both are **28,303**, so no rows were dropped).
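
The steps listed above can be sketched on a toy dataframe. The rows and accession IDs below are invented for illustration, and the function body is a reconstruction from this note rather than the notebook's exact code.

```python
import pandas as pd

VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYXBZUO")  # assumed alphabet
SEQUENCE_COLUMN = "Sequence"


def is_valid_sequence(seq):
    if not isinstance(seq, str):
        return False
    seq = seq.strip().upper()
    return bool(seq) and all(ch in VALID_AMINO_ACIDS for ch in seq)


def clean_sequences(df):
    # dropna returns a new frame; .copy() makes sure the input is never modified
    cleaned = df.dropna(subset=[SEQUENCE_COLUMN]).copy()
    cleaned[SEQUENCE_COLUMN] = (
        cleaned[SEQUENCE_COLUMN].astype(str).str.strip().str.upper()
    )
    cleaned = cleaned[cleaned[SEQUENCE_COLUMN].apply(is_valid_sequence)]
    cleaned = cleaned.drop_duplicates(subset=[SEQUENCE_COLUMN])  # keeps the first row
    print(f"Rows before: {len(df)}, after: {len(cleaned)}")
    return cleaned


toy = pd.DataFrame({
    "ACC": ["P1", "P2", "P3", "P4", "P5"],                    # made-up IDs
    "Sequence": ["MKTAY", " mktay ", None, "MK1", "ACDE"],
})
result = clean_sequences(toy)
print(result)
```

In this toy run, `P3` is dropped for a missing sequence, `P4` for an invalid character, and `P2` because it duplicates `P1` once whitespace and case are normalized.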
### `clean_labels(df)`
- For each label column: `to_numeric(..., errors="coerce")`, `fillna(0)`, `astype(int)`, then **`clip(0, 1)`** so labels are binary **0/1**.
- Adds **`label_count`**: sum of the eleven labels per row (number of “positive” compartments).
- Prints the distribution of `label_count`.
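
A sketch of this label cleaning is shown below. The column names `loc_a`, `loc_b`, and `loc_c` are hypothetical stand-ins for the eleven real label columns, which this note does not list, and the toy values are invented to exercise each step.

```python
import pandas as pd

LABEL_COLUMNS = ["loc_a", "loc_b", "loc_c"]  # hypothetical stand-ins for the 11 real columns


def clean_labels(df):
    cleaned = df.copy()
    for col in LABEL_COLUMNS:
        cleaned[col] = (
            pd.to_numeric(cleaned[col], errors="coerce")  # non-numeric values become NaN
            .fillna(0)                                    # missing values count as 0
            .astype(int)
            .clip(0, 1)                                   # force every value into {0, 1}
        )
    cleaned["label_count"] = cleaned[LABEL_COLUMNS].sum(axis=1)
    print(cleaned["label_count"].value_counts().sort_index())
    return cleaned


toy = pd.DataFrame({
    "loc_a": [1, "1", None],
    "loc_b": ["yes", 0, 2],
    "loc_c": [0.0, 1.0, -3],
})
out = clean_labels(toy)
print(out)
```

Note that `clip(0, 1)` maps any value above 1 to 1 and any negative value to 0, so unexpected numbers are silently coerced rather than flagged.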
### `build_final_dataframe(df)`
- Keeps only **`ACC`**, **`Sequence`**, and the **11** label columns; everything else is dropped (`Unnamed: 0`, `Kingdom`, `Partition`, and the helper `label_count` added by `clean_labels`).
- Example shape: **28,303 × 13** (1 ID + 1 sequence + 11 labels).
- Prints shape and `head()` of the result.
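
The column selection can be sketched like this. The label names `loc_a` and `loc_b` and all toy values are hypothetical, and the output column order (ID, sequence, then labels) is an assumption.

```python
import pandas as pd

ID_COLUMN = "ACC"
SEQUENCE_COLUMN = "Sequence"
LABEL_COLUMNS = ["loc_a", "loc_b"]  # hypothetical stand-ins for the 11 real columns


def build_final_dataframe(df):
    # Selecting an explicit column list drops everything not named in it.
    final_df = df[[ID_COLUMN, SEQUENCE_COLUMN] + LABEL_COLUMNS].copy()
    print("Final shape:", final_df.shape)
    print(final_df.head())
    return final_df


toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "ACC": ["P1", "P2"],
    "Kingdom": ["Metazoa", "Fungi"],   # made-up values
    "Partition": [0, 1],
    "loc_a": [1, 0],
    "loc_b": [0, 1],
    "Sequence": ["MKTAY", "ACDE"],
    "label_count": [1, 1],
})
final_df = build_final_dataframe(toy)
```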
### `save_processed_data(df, output_path)`
- Uses `Path(output_path)`; creates parent directories if missing (`mkdir(parents=True, exist_ok=True)`).
- Writes CSV with `to_csv(..., index=False)`.
- Prints the output path.
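
A minimal sketch of the save step follows, written from the description above. It writes into a temporary directory so it can be run safely; the `toy.csv` file name and the printed message are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_processed_data(df, output_path):
    output_path = Path(output_path)
    # Create data/processed (or any missing parent folders) before writing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)   # index=False avoids an extra index column
    print(f"Saved to {output_path}")


toy = pd.DataFrame({"ACC": ["P1"], "Sequence": ["MKTAY"]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "processed" / "toy.csv"   # placeholder path
    save_processed_data(toy, target)
    reloaded = pd.read_csv(target)

print(reloaded)
```

Writing with `index=False` is also what keeps the processed file from gaining an index-like `Unnamed: 0` column when it is read back.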
### `if __name__ == "__main__":` block
- Sets **`input_file`** and **`output_file`** (raw `deeploc.csv` → processed `deeploc_multilabel.csv`). Uses **raw strings** `r"..."` so Windows backslashes are not interpreted as escape sequences.
- Pipeline order: **load → clean sequences → clean labels → build final dataframe → save**.
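
To see why raw strings matter for Windows-style paths, compare the two literals below. The path is a made-up example chosen because it contains `\r` and `\t`, which Python treats as escape sequences in ordinary string literals.

```python
plain = "data\raw\table.csv"   # "\r" and "\t" become a carriage return and a tab
raw = r"data\raw\table.csv"    # raw string: both backslashes are kept literally

print(repr(plain))
print(repr(raw))
print(plain == raw)            # the two literals are different strings
```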
**Net effect:** A cleaned **multilabel** CSV with protein ID, normalized sequence, and eleven binary compartment columns, saved under `data/processed/` (exact path as configured in the notebook).
---
## One-line summary
After an exploratory load in cell 0, the notebook **loads** `deeploc.csv`, **sanitizes** sequences, **binarizes** subcellular labels, **drops** non-essential columns, and **writes** a compact multilabel file, **`deeploc_multilabel.csv`**.