{ "cells": [ { "cell_type": "markdown", "id": "7343ea9d", "metadata": { "papermill": { "duration": 0.008159, "end_time": "2025-07-19T12:48:10.214040", "exception": false, "start_time": "2025-07-19T12:48:10.205881", "status": "completed" }, "tags": [] }, "source": [ "\n", "\n", "## Introduction\n", "\n", "In this notebook, we tackle a **DNA sequence classification** task: predicting the **type of gene** (e.g., *PSEUDO*, *BIOLOGICAL\\_REGION*) based on its **nucleotide sequence**.\n", "\n", "The dataset contains gene entries with their NCBI ID, symbol, description, gene type (our target label), and raw DNA sequences. Our main objective is to explore whether machine learning models can learn meaningful patterns in the sequences to classify gene types accurately.\n", "\n", "We’ll start by preprocessing the sequences using **k-mer encoding**, then train and evaluate classification models such as Random Forests. This project demonstrates a practical application of machine learning in bioinformatics — bridging the gap between raw DNA data and functional gene annotation.\n" ] }, { "cell_type": "markdown", "id": "46468e3b", "metadata": { "papermill": { "duration": 0.008144, "end_time": "2025-07-19T12:48:10.229310", "exception": false, "start_time": "2025-07-19T12:48:10.221166", "status": "completed" }, "tags": [] }, "source": [ "#### Tools" ] }, { "cell_type": "code", "execution_count": 1, "id": "b7a65916", "metadata": { "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", "execution": { "iopub.execute_input": "2025-07-19T12:48:10.244925Z", "iopub.status.busy": "2025-07-19T12:48:10.244605Z", "iopub.status.idle": "2025-07-19T12:48:20.686493Z", "shell.execute_reply": "2025-07-19T12:48:20.685488Z" }, "papermill": { "duration": 10.451606, "end_time": "2025-07-19T12:48:20.688143", "exception": false, "start_time": "2025-07-19T12:48:10.236537", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import pandas as 
pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from scipy.sparse import hstack\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", "\n", "from xgboost import XGBClassifier\n", "from lightgbm import LGBMClassifier\n" ] }, { "cell_type": "markdown", "id": "b8846ce4", "metadata": { "papermill": { "duration": 0.006625, "end_time": "2025-07-19T12:48:20.702836", "exception": false, "start_time": "2025-07-19T12:48:20.696211", "status": "completed" }, "tags": [] }, "source": [ "## EDA" ] }, { "cell_type": "code", "execution_count": 2, "id": "e8973917", "metadata": { "execution": { "iopub.execute_input": "2025-07-19T12:48:20.718190Z", "iopub.status.busy": "2025-07-19T12:48:20.717432Z", "iopub.status.idle": "2025-07-19T12:48:21.106112Z", "shell.execute_reply": "2025-07-19T12:48:21.105181Z" }, "papermill": { "duration": 0.398343, "end_time": "2025-07-19T12:48:21.107915", "exception": false, "start_time": "2025-07-19T12:48:20.709572", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df = pd.read_csv('/kaggle/input/dna-sequence-prediction/train.csv')" ] }, { "cell_type": "code", "execution_count": 3, "id": "8a466c50", "metadata": { "execution": { "iopub.execute_input": "2025-07-19T12:48:21.123374Z", "iopub.status.busy": "2025-07-19T12:48:21.123049Z", "iopub.status.idle": 
"2025-07-19T12:48:21.140162Z", "shell.execute_reply": "2025-07-19T12:48:21.139225Z" }, "papermill": { "duration": 0.026712, "end_time": "2025-07-19T12:48:21.141932", "exception": false, "start_time": "2025-07-19T12:48:21.115220", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df = df.drop(columns=['Unnamed: 0'], errors='ignore')" ] }, { "cell_type": "markdown", "id": "f21b55ce", "metadata": { "papermill": { "duration": 0.006915, "end_time": "2025-07-19T12:48:21.156247", "exception": false, "start_time": "2025-07-19T12:48:21.149332", "status": "completed" }, "tags": [] }, "source": [ "Filter out the unnecessary chars from NucleotideSequence." ] }, { "cell_type": "code", "execution_count": 4, "id": "835c4f22", "metadata": { "execution": { "iopub.execute_input": "2025-07-19T12:48:21.171994Z", "iopub.status.busy": "2025-07-19T12:48:21.171651Z", "iopub.status.idle": "2025-07-19T12:48:21.285344Z", "shell.execute_reply": "2025-07-19T12:48:21.284554Z" }, "papermill": { "duration": 0.123458, "end_time": "2025-07-19T12:48:21.286945", "exception": false, "start_time": "2025-07-19T12:48:21.163487", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df['NucleotideSequence'] = df['NucleotideSequence'].str.replace(r'[<>]', '', regex=True)" ] }, { "cell_type": "code", "execution_count": 5, "id": "5af9fd01", "metadata": { "execution": { "iopub.execute_input": "2025-07-19T12:48:21.302354Z", "iopub.status.busy": "2025-07-19T12:48:21.301996Z", "iopub.status.idle": "2025-07-19T12:48:21.324748Z", "shell.execute_reply": "2025-07-19T12:48:21.323837Z" }, "papermill": { "duration": 0.032286, "end_time": "2025-07-19T12:48:21.326338", "exception": false, "start_time": "2025-07-19T12:48:21.294052", "status": "completed" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
|  | NCBIGeneID | Symbol | Description | GeneType | GeneGroupMethod | NucleotideSequence |
|---|---|---|---|---|---|---|
| 0 | 106481178 | RNU4-21P | RNA, U4 small nuclear 21, pseudogene | PSEUDO | NCBI Ortholog | AGCTTAGCACAGTGGCAGTATCATAGGCAGTGAGGTTTATCCGAGG... |
| 1 | 123477792 | LOC123477792 | Sharpr-MPRA regulatory region 12926 | BIOLOGICAL_REGION | NCBI Ortholog | CTGGAGCGGCCACGATGTGAACTGTCACCGGCCACTGCTGCTCCGA... |
| 2 | 113174975 | LOC113174975 | Sharpr-MPRA regulatory region 7591 | BIOLOGICAL_REGION | NCBI Ortholog | TTCCCAATTTTTCCTCTGCTTTTTAATTTTCTAGTTTCCTTTTTCC... |
| 3 | 116216107 | LOC116216107 | CRISPRi-validated cis-regulatory element chr10... | BIOLOGICAL_REGION | NCBI Ortholog | CGCCCAGGCTGGAGTGCAGTGGCGCCATCTCGGCTCACTGCAGGCT... |
| 4 | 28502 | IGHD2-21 | immunoglobulin heavy diversity 2-21 | OTHER | NCBI Ortholog | AGCATATTGTGGTGGTGACTGCTATTCC |
| ... | ... | ... | ... | ... | ... | ... |
| 22588 | 124907055 | LOC124907055 | uncharacterized LOC124907055 | ncRNA | NCBI Ortholog | GGTGGGGTGGGGTGGGGTGGGGTGGGGTGCAGAGAAAACGATTGAT... |
| 22589 | 106480032 | RNU6-1060P | RNA, U6 small nuclear 1060, pseudogene | PSEUDO | NCBI Ortholog | GTGCTCACTTCAGCAGCACATATACTAAAATTGGAATGATACAGAG... |
| 22590 | 106481029 | RN7SL387P | RNA, 7SL, cytoplasmic 387, pseudogene | PSEUDO | NCBI Ortholog | GCTGGGCGTGGTGGTGGGTGCCTGTAATCCCAGCTACTAGGGAGGC... |
| 22591 | 100286918 | NDUFS5P2 | NADH:ubiquinone oxidoreductase subunit S5 pseu... | PSEUDO | NCBI Ortholog | TCGTCCTGAAGCAGCGGCCAGAGAAGAGACAAGGGCACGAGCATCA... |
| 22592 | 100189280 | TRA-TGC3-1 | tRNA-Ala (anticodon TGC) 3-1 | tRNA | NCBI Ortholog | GGGGATGTAGCTCAGTGGTAGAGCGCATGCTTTGCATGTATGAGGC... |
22593 rows × 6 columns
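
The introduction names k-mer encoding as the preprocessing step, but the cell that performs it is not shown in this section. As a minimal sketch, assuming overlapping k-mers and an illustrative window size (the helper names `clean_sequence`, `to_kmers`, and `kmer_sentence` are hypothetical and not taken from the notebook), a sequence could be cleaned and tokenized like this:

```python
import re
from collections import Counter


def clean_sequence(seq: str) -> str:
    """Strip the '<' and '>' characters, mirroring the notebook's regex replace."""
    return re.sub(r"[<>]", "", seq)


def to_kmers(seq: str, k: int = 6) -> list:
    """Return every overlapping substring of length k (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_sentence(seq: str, k: int = 6) -> str:
    """Join the k-mers with spaces so the sequence reads like a sentence of 'words'."""
    return " ".join(to_kmers(clean_sequence(seq), k))


# Toy input: a 12-base fragment wrapped in the markers that the notebook removes.
example = "<AGCATATTGTGG>"
tokens = to_kmers(clean_sequence(example), k=4)  # k=4 is an arbitrary choice for the demo
counts = Counter(tokens)                         # k-mer -> number of occurrences

print(tokens)
print(kmer_sentence(example, k=4))
```

The space-joined string returned by `kmer_sentence` is the kind of input a bag-of-words vectorizer such as the imported `CountVectorizer` accepts, so each distinct k-mer becomes one count feature.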
|  | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Gradient Boosting | 0.983625 | 0.960137 | 0.974966 | 0.966911 |
| 1 | Random Forest | 0.980084 | 0.963295 | 0.950453 | 0.955137 |
| 2 | Decision Tree | 0.978535 | 0.950967 | 0.955858 | 0.952580 |
| 3 | KNN | 0.840451 | 0.717834 | 0.674208 | 0.688334 |
| 4 | Logistic Regression | 0.924541 | 0.740218 | 0.665699 | 0.688134 |
| 5 | Naive Bayes | 0.717858 | 0.485762 | 0.683324 | 0.519784 |
| 6 | SVM | 0.655233 | 0.184063 | 0.208640 | 0.195433 |
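
For several models in the table (Logistic Regression, for example), precision, recall, and F1 sit well below accuracy. That gap is what class imbalance tends to produce when the per-class scores are macro-averaged. The averaging mode used to build the table is not shown in this section, so the following is only a sketch under that assumption; `macro_scores` is a hypothetical helper, and the labels are made-up toy data rather than results from the dataset.

```python
def macro_scores(y_true, y_pred):
    """Compute accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []

    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        # Fall back to 0.0 when a denominator is zero (class never predicted or never present).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,  # unweighted mean over classes
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy imbalanced example: 8 majority-class rows, 2 minority-class rows,
# with one minority row misclassified as the majority class.
y_true = ["PSEUDO"] * 8 + ["tRNA"] * 2
y_pred = ["PSEUDO"] * 8 + ["PSEUDO", "tRNA"]

scores = macro_scores(y_true, y_pred)
print(scores)
```

In this toy case a single error on the rare class leaves accuracy at 0.9 while macro recall drops to 0.75, because every class counts equally regardless of size. scikit-learn's `precision_score`, `recall_score`, and `f1_score` compute the same quantities when called with `average="macro"`.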
GradientBoostingClassifier()