{ "cells": [ { "cell_type": "markdown", "id": "4c99fd1a", "metadata": {}, "source": "# TargetRecon — Python API Demo\n\nThis notebook walks through the full TargetRecon Python API:\n- Fetching a target report\n- Exploring UniProt, PDB, AlphaFold, ChEMBL, and STRING-DB data\n- Filtering and ranking ligands\n- Exporting reports (HTML, JSON, SDF)\n- Batch processing multiple targets\n\n**Install:** `pip install targetrecon`" }, { "cell_type": "markdown", "id": "dc69c745", "metadata": {}, "source": [ "## 1. Installation check" ] }, { "cell_type": "code", "execution_count": 1, "id": "4cfbda57", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TargetRecon version: 0.1.0\n" ] } ], "source": [ "import targetrecon\n", "print(\"TargetRecon version:\", targetrecon.__version__)" ] }, { "cell_type": "markdown", "id": "5c8f76e8", "metadata": {}, "source": "## 2. Run a recon — single target\n\n`targetrecon.recon()` accepts a **gene name**, **UniProt accession**, or **ChEMBL target ID**.\n\nIt fetches data from 5 sources in parallel (UniProt, PDB, AlphaFold, ChEMBL, STRING-DB) and returns a `TargetReport` object." }, { "cell_type": "code", "execution_count": 2, "id": "4f7d6b31", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Resolving identifiers for 'EGFR'...\n",
       "
\n" ], "text/plain": [ "\u001b[36mResolving identifiers for \u001b[0m\u001b[36m'EGFR'\u001b[0m\u001b[36m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
UniProt: P00533  |  ChEMBL: CHEMBL203\n",
       "
\n" ], "text/plain": [ "\u001b[36mUniProt: P00533 | ChEMBL: CHEMBL203\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Fetching data from 5 sources in parallel...\n",
       "
\n" ], "text/plain": [ "\u001b[36mFetching data from \u001b[0m\u001b[1;36m5\u001b[0m\u001b[36m sources in parallel\u001b[0m\u001b[36m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Done!  375 structures · 10000 bioactivities · 750 unique ligands\n",
       "
\n" ], "text/plain": [ "\u001b[32mDone! \u001b[0m\u001b[1;32m375\u001b[0m\u001b[32m structures · \u001b[0m\u001b[1;32m10000\u001b[0m\u001b[32m bioactivities · \u001b[0m\u001b[1;32m750\u001b[0m\u001b[32m unique ligands\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Query resolved to: P00533 / EGFR\n", "Protein name : Epidermal growth factor receptor\n", "PDB structures : 375\n", "Bioactivities : 10000\n", "Unique ligands : 750\n" ] } ], "source": [ "report = targetrecon.recon(\"EGFR\")\n", "print(f\"Query resolved to: {report.uniprot.uniprot_id} / {report.uniprot.gene_name}\")\n", "print(f\"Protein name : {report.uniprot.protein_name}\")\n", "print(f\"PDB structures : {report.num_pdb_structures}\")\n", "print(f\"Bioactivities : {report.num_bioactivities}\")\n", "print(f\"Unique ligands : {report.num_unique_ligands}\")" ] }, { "cell_type": "markdown", "id": "2bbc3b2c", "metadata": {}, "source": [ "### Works with UniProt accessions and ChEMBL IDs too" ] }, { "cell_type": "code", "execution_count": null, "id": "b60806d4", "metadata": {}, "outputs": [], "source": [ "# By UniProt accession\n", "report_up = targetrecon.recon(\"P00533\")\n", "print(\"UniProt input →\", report_up.uniprot.gene_name)\n", "\n", "# By ChEMBL target ID\n", "report_ch = targetrecon.recon(\"CHEMBL203\")\n", "print(\"ChEMBL input →\", report_ch.uniprot.gene_name)" ] }, { "cell_type": "markdown", "id": "546e301b", "metadata": {}, "source": [ "## 3. 
Explore UniProt data" ] }, { "cell_type": "code", "execution_count": null, "id": "c913134a", "metadata": {}, "outputs": [], "source": "u = report.uniprot\n\nprint(\"Gene name      :\", u.gene_name)\nprint(\"Protein name   :\", u.protein_name)\nprint(\"Organism       :\", u.organism)\nprint(\"UniProt ID     :\", u.uniprot_id)\nprint(\"ChEMBL target  :\", u.chembl_id)\nprint(\"Sequence length:\", u.sequence_length)\nprint(\"\\nFunction (first 300 chars):\")\nprint(u.function_description[:300] if u.function_description else \"N/A\")" }, { "cell_type": "code", "execution_count": null, "id": "085f5591", "metadata": {}, "outputs": [], "source": "# Subcellular location\nprint(\"Subcellular locations:\", u.subcellular_locations)\n\n# Diseases\nprint(\"\\nAssociated diseases:\")\nfor d in u.disease_associations[:5]:\n    print(\" -\", d)" }, { "cell_type": "code", "execution_count": null, "id": "7507812f", "metadata": {}, "outputs": [], "source": [ "# GO terms grouped by category\n", "from collections import defaultdict\n", "\n", "go_by_cat = defaultdict(list)\n", "for go in u.go_terms:\n", "    go_by_cat[go.category].append(go.term)\n", "\n", "for cat, terms in go_by_cat.items():\n", "    print(f\"\\nGO — {cat} ({len(terms)} terms):\")\n", "    for t in terms[:5]:\n", "        print(\" \", t)" ] }, { "cell_type": "markdown", "id": "834245d7", "metadata": {}, "source": [ "## 4. 
Explore PDB structures" ] }, { "cell_type": "code", "execution_count": null, "id": "53cf0b82", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pdb_rows = []\n", "for s in report.pdb_structures:\n", "    pdb_rows.append({\n", "        \"PDB ID\": s.pdb_id,\n", "        \"Method\": s.method.value if s.method else \"\",\n", "        \"Resolution (Å)\": s.resolution,\n", "        \"Release Date\": s.release_date,\n", "        \"Ligands\": \", \".join(l.ligand_id for l in s.ligands) if s.ligands else \"\",\n", "        \"Title\": (s.title or \"\")[:60],\n", "    })\n", "\n", "pdb_df = pd.DataFrame(pdb_rows)\n", "print(f\"{len(pdb_df)} structures\")\n", "pdb_df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "id": "9a9e7efd", "metadata": {}, "outputs": [], "source": [ "# Method breakdown\n", "print(\"Method breakdown:\")\n", "print(pdb_df[\"Method\"].value_counts().to_string())\n", "\n", "# Resolution statistics\n", "res = pdb_df[\"Resolution (Å)\"].dropna()\n", "print(f\"\\nResolution stats (Å): min={res.min():.2f} median={res.median():.2f} max={res.max():.2f}\")" ] }, { "cell_type": "markdown", "id": "69d4ae45", "metadata": {}, "source": [ "## 5. AlphaFold predicted structure" ] }, { "cell_type": "code", "execution_count": null, "id": "042bf0e1", "metadata": {}, "outputs": [], "source": "af = report.alphafold\nif af:\n    print(\"AlphaFold UniProt ID:\", af.uniprot_id)\n    print(\"Model version       :\", af.version)\n    print(\"PDB download URL    :\", af.pdb_url)\n    print(\"Mean pLDDT          :\", af.mean_plddt)\nelse:\n    print(\"No AlphaFold entry found.\")" }, { "cell_type": "markdown", "id": "fee0c96c", "metadata": {}, "source": "## 6. 
Bioactivity data (ChEMBL)" }, { "cell_type": "code", "execution_count": null, "id": "6bd5de73", "metadata": {}, "outputs": [], "source": [ "bio_rows = []\n", "for b in report.bioactivities:\n", "    bio_rows.append({\n", "        \"Molecule\": b.molecule_chembl_id or b.name or \"\",\n", "        \"Activity Type\": b.activity_type,\n", "        \"Value (nM)\": b.value,\n", "        \"pChEMBL\": b.pchembl_value,\n", "        \"Source\": b.source,\n", "        \"SMILES\": (b.smiles or \"\")[:40],\n", "    })\n", "\n", "bio_df = pd.DataFrame(bio_rows)\n", "print(f\"{len(bio_df)} bioactivity records\")\n", "bio_df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "id": "da7208d4", "metadata": {}, "outputs": [], "source": [ "# Source breakdown\n", "print(\"Source breakdown:\")\n", "print(bio_df[\"Source\"].value_counts().to_string())\n", "\n", "# Activity type breakdown\n", "print(\"\\nActivity type breakdown:\")\n", "print(bio_df[\"Activity Type\"].value_counts().head(8).to_string())\n", "\n", "# pChEMBL distribution\n", "pc = bio_df[\"pChEMBL\"].dropna()\n", "print(f\"\\npChEMBL stats: min={pc.min():.2f} median={pc.median():.2f} max={pc.max():.2f}\")\n", "print(f\"High-potency (pChEMBL ≥ 9): {(pc >= 9).sum()} records\")" ] }, { "cell_type": "markdown", "id": "39ffcff1", "metadata": {}, "source": [ "## 7. 
Unique ligands ranked by potency" ] }, { "cell_type": "code", "execution_count": null, "id": "ec314056", "metadata": {}, "outputs": [], "source": [ "lig_rows = []\n", "for l in report.ligand_summary:\n", "    lig_rows.append({\n", "        \"Name\": l.name or \"\",\n", "        \"ChEMBL ID\": l.chembl_id or \"\",\n", "        \"Best pChEMBL\": l.best_pchembl,\n", "        \"Activity Type\": l.best_activity_type,\n", "        \"Activity (nM)\": l.best_activity_value_nM,\n", "        \"# Assays\": l.num_assays,\n", "        \"Sources\": \", \".join(l.sources),\n", "        \"SMILES\": (l.smiles or \"\")[:50],\n", "    })\n", "\n", "lig_df = pd.DataFrame(lig_rows)\n", "print(f\"{len(lig_df)} unique ligands (sorted by pChEMBL descending)\")\n", "lig_df.head(20)" ] }, { "cell_type": "code", "execution_count": null, "id": "862877ec", "metadata": {}, "outputs": [], "source": [ "# Best ligand shortcut\n", "best = report.best_ligand\n", "if best:\n", "    print(\"Best ligand:\")\n", "    print(f\"  Name     : {best.name}\")\n", "    print(f\"  ChEMBL ID: {best.chembl_id}\")\n", "    print(f\"  pChEMBL  : {best.best_pchembl}\")\n", "    print(f\"  SMILES   : {best.smiles}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "3be433a3", "metadata": {}, "outputs": [], "source": [ "# Filter ligands programmatically\n", "potent = [l for l in report.ligand_summary if l.best_pchembl and l.best_pchembl >= 9.0]\n", "print(f\"Ligands with pChEMBL ≥ 9.0: {len(potent)}\")\n", "\n", "ic50_only = [l for l in report.ligand_summary if l.best_activity_type == \"IC50\"]\n", "print(f\"Ligands measured by IC50: {len(ic50_only)}\")\n", "\n", "multi_source = [l for l in report.ligand_summary if len(l.sources) > 1]\n", "print(f\"Ligands confirmed in multiple databases: {len(multi_source)}\")" ] }, { "cell_type": "markdown", "id": "c55fad07", "metadata": {}, "source": [ "## 8. 
Protein–protein interactions (STRING-DB)" ] }, { "cell_type": "code", "execution_count": null, "id": "ad58e4b6", "metadata": {}, "outputs": [], "source": "if report.interactions:\n    int_rows = []\n    for i in report.interactions:\n        int_rows.append({\n            \"Partner\": i.gene_b,\n            \"Score\": i.score,\n        })\n    int_df = pd.DataFrame(int_rows).sort_values(\"Score\", ascending=False)\n    print(f\"{len(int_df)} protein interactions\")\n    print(int_df.to_string(index=False))\nelse:\n    print(\"No interaction data.\")" }, { "cell_type": "markdown", "id": "d54f1233", "metadata": {}, "source": [ "## 9. Export reports" ] }, { "cell_type": "code", "execution_count": null, "id": "df1423ac", "metadata": {}, "outputs": [], "source": [ "from targetrecon.core import save_html, save_json, save_sdf\n", "from pathlib import Path\n", "\n", "out = Path(\"outputs\")\n", "out.mkdir(exist_ok=True)\n", "\n", "# HTML — interactive self-contained report\n", "html_path = save_html(report, out / \"EGFR_report.html\")\n", "print(\"HTML report →\", html_path)\n", "\n", "# JSON — full machine-readable report\n", "json_path = save_json(report, out / \"EGFR_report.json\")\n", "print(\"JSON report →\", json_path)\n", "\n", "# SDF — top 20 ligands with 3D conformers (RDKit MMFF)\n", "sdf_path = save_sdf(report, out / \"EGFR_top20_ligands.sdf\", top_n=20)\n", "print(\"SDF ligands →\", sdf_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "ce15ba96", "metadata": {}, "outputs": [], "source": [ "# SDF with filters — only IC50, pChEMBL ≥ 8, top 50\n", "sdf_filtered = save_sdf(\n", "    report,\n", "    out / \"EGFR_IC50_filtered.sdf\",\n", "    top_n=50,\n", "    min_pchembl=8.0,\n", "    activity_type=\"IC50\",\n", ")\n", "print(\"Filtered SDF →\", sdf_filtered)" ] }, { "cell_type": "markdown", "id": "cc393be6", "metadata": {}, "source": [ "## 10. Batch processing — compare targets\n", "\n", "Fetch multiple targets concurrently using `asyncio.gather()` via `recon_async()`." 
] }, { "cell_type": "code", "execution_count": null, "id": "5bda203a", "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "\n", "panel = [\"EGFR\", \"BRAF\", \"CDK2\", \"ABL1\"]\n", "\n", "panel_reports = await asyncio.gather(*[\n", "    targetrecon.recon_async(t, verbose=False) for t in panel\n", "])\n", "\n", "summary = []\n", "for t, r in zip(panel, panel_reports):\n", "    if r.uniprot is None:\n", "        continue\n", "    summary.append({\n", "        \"Target\": t,\n", "        \"UniProt\": r.uniprot.uniprot_id,\n", "        \"PDB Structures\": r.num_pdb_structures,\n", "        \"Bioactivities\": r.num_bioactivities,\n", "        \"Unique Ligands\": r.num_unique_ligands,\n", "        \"Best pChEMBL\": r.best_ligand.best_pchembl if r.best_ligand else None,\n", "        \"Best Ligand\": (r.best_ligand.name or r.best_ligand.chembl_id) if r.best_ligand else None,\n", "    })\n", "\n", "pd.DataFrame(summary)" ] }, { "cell_type": "code", "execution_count": null, "id": "a46c7768", "metadata": {}, "outputs": [], "source": [ "# Export all panel reports as HTML\n", "for t, r in zip(panel, panel_reports):\n", "    if r.uniprot:\n", "        p = save_html(r, out / f\"{t}_report.html\")\n", "        print(f\" {t} → {p}\")" ] }, { "cell_type": "markdown", "id": "462ed9ec", "metadata": {}, "source": [ "## 11. Work with raw JSON\n", "\n", "The `TargetReport` is a Pydantic model — serialize/deserialize freely." 
] }, { "cell_type": "code", "execution_count": null, "id": "3345767b", "metadata": {}, "outputs": [], "source": [ "# Serialize to dict\n", "data = report.model_dump()\n", "print(\"Top-level keys:\", list(data.keys()))\n", "\n", "# Serialize to JSON string\n", "json_str = report.model_dump_json(indent=2)\n", "print(f\"\\nJSON size: {len(json_str):,} characters\")\n", "\n", "# Load back from JSON file\n", "from targetrecon.models import TargetReport\n", "with open(out / \"EGFR_report.json\") as f:\n", "    loaded = TargetReport.model_validate_json(f.read())\n", "print(f\"\\nRound-trip: {loaded.uniprot.gene_name} | {loaded.num_unique_ligands} ligands\")" ] }, { "cell_type": "markdown", "id": "79ec0d78", "metadata": {}, "source": [ "## 12. Quick visualization (optional — requires matplotlib)" ] }, { "cell_type": "code", "execution_count": null, "id": "103e40d7", "metadata": {}, "outputs": [], "source": [ "try:\n", "    import matplotlib.pyplot as plt\n", "    from collections import Counter\n", "\n", "    pchembl_vals = [b.pchembl_value for b in report.bioactivities if b.pchembl_value]\n", "\n", "    fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "    # pChEMBL histogram\n", "    axes[0].hist(pchembl_vals, bins=30, color=\"#58a6ff\", edgecolor=\"white\", linewidth=0.5)\n", "    axes[0].axvline(7, color=\"#f85149\", linestyle=\"--\", label=\"pChEMBL = 7 (100 nM)\")\n", "    axes[0].axvline(9, color=\"#3fb950\", linestyle=\"--\", label=\"pChEMBL = 9 (1 nM)\")\n", "    axes[0].set_xlabel(\"pChEMBL value\")\n", "    axes[0].set_ylabel(\"Count\")\n", "    axes[0].set_title(f\"EGFR — pChEMBL distribution (n={len(pchembl_vals)})\")\n", "    axes[0].legend()\n", "\n", "    # Activity type bar chart\n", "    atype_counts = Counter(b.activity_type for b in report.bioactivities if b.activity_type)\n", "    top_types = dict(atype_counts.most_common(6))\n", "    axes[1].bar(top_types.keys(), top_types.values(), color=\"#bc8cff\", edgecolor=\"white\")\n", "    axes[1].set_xlabel(\"Activity type\")\n", "    axes[1].set_ylabel(\"Count\")\n", "    axes[1].set_title(\"EGFR — Bioactivity type breakdown\")\n", "\n", "    plt.tight_layout()\n", "    plt.savefig(out / \"EGFR_charts.png\", dpi=150, bbox_inches=\"tight\")\n", "    plt.show()\n", "    print(\"Chart saved →\", out / \"EGFR_charts.png\")\n", "\n", "except ImportError:\n", "    print(\"matplotlib not installed — pip install matplotlib to enable visualizations\")" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }