{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4c99fd1a",
   "metadata": {},
   "source": "# TargetRecon — Python API Demo\n\nThis notebook walks through the full TargetRecon Python API:\n- Fetching a target report\n- Exploring UniProt, PDB, AlphaFold, ChEMBL, and STRING-DB data\n- Filtering and ranking ligands\n- Exporting reports (HTML, JSON, SDF)\n- Batch processing multiple targets\n\n**Install:** `pip install targetrecon`"
  },
  {
   "cell_type": "markdown",
   "id": "dc69c745",
   "metadata": {},
   "source": [
    "## 1. Installation check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4cfbda57",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TargetRecon version: 0.1.0\n"
     ]
    }
   ],
   "source": [
    "import targetrecon\n",
    "print(\"TargetRecon version:\", targetrecon.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c8f76e8",
   "metadata": {},
   "source": "## 2. Run a recon — single target\n\n`targetrecon.recon()` accepts a **gene name**, **UniProt accession**, or **ChEMBL target ID**.\n\nIt fetches data from 4 sources in parallel (UniProt, PDB, AlphaFold, ChEMBL, STRING-DB) and returns a `TargetReport` object."
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "4f7d6b31",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080\">Resolving identifiers for </span><span style=\"color: #008080; text-decoration-color: #008080\">'EGFR'</span><span style=\"color: #008080; text-decoration-color: #008080\">...</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[36mResolving identifiers for \u001b[0m\u001b[36m'EGFR'\u001b[0m\u001b[36m...\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080\">UniProt: P00533  |  ChEMBL: CHEMBL203</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[36mUniProt: P00533  |  ChEMBL: CHEMBL203\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008080; text-decoration-color: #008080\">Fetching data from </span><span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">5</span><span style=\"color: #008080; text-decoration-color: #008080\"> sources in parallel...</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[36mFetching data from \u001b[0m\u001b[1;36m5\u001b[0m\u001b[36m sources in parallel\u001b[0m\u001b[36m...\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"color: #008000; text-decoration-color: #008000\">Done!  </span><span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">375</span><span style=\"color: #008000; text-decoration-color: #008000\"> structures · </span><span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">10000</span><span style=\"color: #008000; text-decoration-color: #008000\"> bioactivities · </span><span style=\"color: #008000; text-decoration-color: #008000; font-weight: bold\">750</span><span style=\"color: #008000; text-decoration-color: #008000\"> unique ligands</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\u001b[32mDone!  \u001b[0m\u001b[1;32m375\u001b[0m\u001b[32m structures · \u001b[0m\u001b[1;32m10000\u001b[0m\u001b[32m bioactivities · \u001b[0m\u001b[1;32m750\u001b[0m\u001b[32m unique ligands\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query resolved to: P00533 / EGFR\n",
      "Protein name    : Epidermal growth factor receptor\n",
      "PDB structures  : 375\n",
      "Bioactivities   : 10000\n",
      "Unique ligands  : 750\n"
     ]
    }
   ],
   "source": [
    "report = targetrecon.recon(\"EGFR\")\n",
    "print(f\"Query resolved to: {report.uniprot.uniprot_id} / {report.uniprot.gene_name}\")\n",
    "print(f\"Protein name    : {report.uniprot.protein_name}\")\n",
    "print(f\"PDB structures  : {report.num_pdb_structures}\")\n",
    "print(f\"Bioactivities   : {report.num_bioactivities}\")\n",
    "print(f\"Unique ligands  : {report.num_unique_ligands}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bbc3b2c",
   "metadata": {},
   "source": [
    "### Works with UniProt accessions and ChEMBL IDs too"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b60806d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# By UniProt accession\n",
    "report_up = targetrecon.recon(\"P00533\")\n",
    "print(\"UniProt input →\", report_up.uniprot.gene_name)\n",
    "\n",
    "# By ChEMBL target ID\n",
    "report_ch = targetrecon.recon(\"CHEMBL203\")\n",
    "print(\"ChEMBL input  →\", report_ch.uniprot.gene_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "546e301b",
   "metadata": {},
   "source": [
    "## 3. Explore UniProt data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c913134a",
   "metadata": {},
   "outputs": [],
   "source": "u = report.uniprot\n\nprint(\"Gene name      :\", u.gene_name)\nprint(\"Protein name   :\", u.protein_name)\nprint(\"Organism       :\", u.organism)\nprint(\"UniProt ID     :\", u.uniprot_id)\nprint(\"ChEMBL target  :\", u.chembl_id)\nprint(\"Sequence length:\", u.sequence_length)\nprint(\"\\nFunction (first 300 chars):\")\nprint(u.function_description[:300] if u.function_description else \"N/A\")"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "085f5591",
   "metadata": {},
   "outputs": [],
   "source": "# Subcellular location\nprint(\"Subcellular locations:\", u.subcellular_locations)\n\n# Diseases\nprint(\"\\nAssociated diseases:\")\nfor d in u.disease_associations[:5]:\n    print(\" -\", d)"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7507812f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GO terms grouped by category\n",
    "from collections import defaultdict\n",
    "\n",
    "go_by_cat = defaultdict(list)\n",
    "for go in u.go_terms:\n",
    "    go_by_cat[go.category].append(go.term)\n",
    "\n",
    "for cat, terms in go_by_cat.items():\n",
    "    print(f\"\\nGO — {cat} ({len(terms)} terms):\")\n",
    "    for t in terms[:5]:\n",
    "        print(\"  \", t)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "834245d7",
   "metadata": {},
   "source": [
    "## 4. Explore PDB structures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53cf0b82",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pdb_rows = []\n",
    "for s in report.pdb_structures:\n",
    "    pdb_rows.append({\n",
    "        \"PDB ID\": s.pdb_id,\n",
    "        \"Method\": s.method.value if s.method else \"\",\n",
    "        \"Resolution (Å)\": s.resolution,\n",
    "        \"Deposit Date\": s.release_date,\n",
    "        \"Ligands\": \", \".join(l.ligand_id for l in s.ligands) if s.ligands else \"\",\n",
    "        \"Title\": (s.title or \"\")[:60],\n",
    "    })\n",
    "\n",
    "pdb_df = pd.DataFrame(pdb_rows)\n",
    "print(f\"{len(pdb_df)} structures\")\n",
    "pdb_df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a9e7efd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Method breakdown\n",
    "print(\"Method breakdown:\")\n",
    "print(pdb_df[\"Method\"].value_counts().to_string())\n",
    "\n",
    "# Resolution statistics\n",
    "res = pdb_df[\"Resolution (Å)\"].dropna()\n",
    "print(f\"\\nResolution stats (Å): min={res.min():.2f}  median={res.median():.2f}  max={res.max():.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69d4ae45",
   "metadata": {},
   "source": [
    "## 5. AlphaFold predicted structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "042bf0e1",
   "metadata": {},
   "outputs": [],
   "source": "af = report.alphafold\nif af:\n    print(\"AlphaFold UniProt ID:\", af.uniprot_id)\n    print(\"Model version       :\", af.version)\n    print(\"PDB download URL    :\", af.pdb_url)\n    print(\"Mean pLDDT          :\", af.mean_plddt)\nelse:\n    print(\"No AlphaFold entry found.\")"
  },
  {
   "cell_type": "markdown",
   "id": "fee0c96c",
   "metadata": {},
   "source": "## 6. Bioactivity data (ChEMBL)"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bd5de73",
   "metadata": {},
   "outputs": [],
   "source": [
    "bio_rows = []\n",
    "for b in report.bioactivities:\n",
    "    bio_rows.append({\n",
    "        \"Molecule\": b.molecule_chembl_id or b.name or \"\",\n",
    "        \"Activity Type\": b.activity_type,\n",
    "        \"Value (nM)\": b.value,\n",
    "        \"pChEMBL\": b.pchembl_value,\n",
    "        \"Source\": b.source,\n",
    "        \"SMILES\": (b.smiles or \"\")[:40],\n",
    "    })\n",
    "\n",
    "bio_df = pd.DataFrame(bio_rows)\n",
    "print(f\"{len(bio_df)} bioactivity records\")\n",
    "bio_df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da7208d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Source breakdown\n",
    "print(\"Source breakdown:\")\n",
    "print(bio_df[\"Source\"].value_counts().to_string())\n",
    "\n",
    "# Activity type breakdown\n",
    "print(\"\\nActivity type breakdown:\")\n",
    "print(bio_df[\"Activity Type\"].value_counts().head(8).to_string())\n",
    "\n",
    "# pChEMBL distribution\n",
    "pc = bio_df[\"pChEMBL\"].dropna()\n",
    "print(f\"\\npChEMBL stats: min={pc.min():.2f}  median={pc.median():.2f}  max={pc.max():.2f}\")\n",
    "print(f\"High-potency (pChEMBL ≥ 9): {(pc >= 9).sum()} records\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39ffcff1",
   "metadata": {},
   "source": [
    "## 7. Unique ligands ranked by potency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec314056",
   "metadata": {},
   "outputs": [],
   "source": [
    "lig_rows = []\n",
    "for l in report.ligand_summary:\n",
    "    lig_rows.append({\n",
    "        \"Name\": l.name or \"\",\n",
    "        \"ChEMBL ID\": l.chembl_id or \"\",\n",
    "        \"Best pChEMBL\": l.best_pchembl,\n",
    "        \"Activity Type\": l.best_activity_type,\n",
    "        \"Activity (nM)\": l.best_activity_value_nM,\n",
    "        \"# Assays\": l.num_assays,\n",
    "        \"Sources\": \", \".join(l.sources),\n",
    "        \"SMILES\": (l.smiles or \"\")[:50],\n",
    "    })\n",
    "\n",
    "lig_df = pd.DataFrame(lig_rows)\n",
    "print(f\"{len(lig_df)} unique ligands (sorted by pChEMBL descending)\")\n",
    "lig_df.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "862877ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Best ligand shortcut\n",
    "best = report.best_ligand\n",
    "if best:\n",
    "    print(\"Best ligand:\")\n",
    "    print(f\"  Name     : {best.name}\")\n",
    "    print(f\"  ChEMBL ID: {best.chembl_id}\")\n",
    "    print(f\"  pChEMBL  : {best.best_pchembl}\")\n",
    "    print(f\"  SMILES   : {best.smiles}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3be433a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter ligands programmatically\n",
    "potent = [l for l in report.ligand_summary if l.best_pchembl and l.best_pchembl >= 9.0]\n",
    "print(f\"Ligands with pChEMBL ≥ 9.0: {len(potent)}\")\n",
    "\n",
    "ic50_only = [l for l in report.ligand_summary if l.best_activity_type == \"IC50\"]\n",
    "print(f\"Ligands measured by IC50: {len(ic50_only)}\")\n",
    "\n",
    "multi_source = [l for l in report.ligand_summary if len(l.sources) > 1]\n",
    "print(f\"Ligands confirmed in multiple databases: {len(multi_source)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c55fad07",
   "metadata": {},
   "source": [
    "## 8. Protein–protein interactions (STRING DB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad58e4b6",
   "metadata": {},
   "outputs": [],
   "source": "if report.interactions:\n    int_rows = []\n    for i in report.interactions:\n        int_rows.append({\n            \"Partner\": i.gene_b,\n            \"Score\": i.score,\n        })\n    int_df = pd.DataFrame(int_rows).sort_values(\"Score\", ascending=False)\n    print(f\"{len(int_df)} protein interactions\")\n    print(int_df.to_string(index=False))\nelse:\n    print(\"No interaction data.\")"
  },
  {
   "cell_type": "markdown",
   "id": "d54f1233",
   "metadata": {},
   "source": [
    "## 9. Export reports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df1423ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "from targetrecon.core import save_html, save_json, save_sdf\n",
    "from pathlib import Path\n",
    "\n",
    "out = Path(\"outputs\")\n",
    "out.mkdir(exist_ok=True)\n",
    "\n",
    "# HTML — interactive self-contained report\n",
    "html_path = save_html(report, out / \"EGFR_report.html\")\n",
    "print(\"HTML report  →\", html_path)\n",
    "\n",
    "# JSON — full machine-readable report\n",
    "json_path = save_json(report, out / \"EGFR_report.json\")\n",
    "print(\"JSON report  →\", json_path)\n",
    "\n",
    "# SDF — top 20 ligands with 3D conformers (RDKit MMFF)\n",
    "sdf_path = save_sdf(report, out / \"EGFR_top20_ligands.sdf\", top_n=20)\n",
    "print(\"SDF ligands  →\", sdf_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce15ba96",
   "metadata": {},
   "outputs": [],
   "source": [
    "# SDF with filters — only IC50, pChEMBL ≥ 8, top 50\n",
    "sdf_filtered = save_sdf(\n",
    "    report,\n",
    "    out / \"EGFR_IC50_filtered.sdf\",\n",
    "    top_n=50,\n",
    "    min_pchembl=8.0,\n",
    "    activity_type=\"IC50\",\n",
    ")\n",
    "print(\"Filtered SDF →\", sdf_filtered)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc393be6",
   "metadata": {},
   "source": [
    "## 10. Batch processing — compare targets\n",
    "\n",
    "Fetch multiple targets concurrently using `asyncio.gather()` via `recon_async()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bda203a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
    "panel = [\"EGFR\", \"BRAF\", \"CDK2\", \"ABL1\"]\n",
    "\n",
    "panel_reports = await asyncio.gather(*[\n",
    "    targetrecon.recon_async(t, verbose=False) for t in panel\n",
    "])\n",
    "\n",
    "summary = []\n",
    "for t, r in zip(panel, panel_reports):\n",
    "    if r.uniprot is None:\n",
    "        continue\n",
    "    summary.append({\n",
    "        \"Target\": t,\n",
    "        \"UniProt\": r.uniprot.uniprot_id,\n",
    "        \"PDB Structures\": r.num_pdb_structures,\n",
    "        \"Bioactivities\": r.num_bioactivities,\n",
    "        \"Unique Ligands\": r.num_unique_ligands,\n",
    "        \"Best pChEMBL\": r.best_ligand.best_pchembl if r.best_ligand else None,\n",
    "        \"Best Ligand\": r.best_ligand.name or r.best_ligand.chembl_id if r.best_ligand else None,\n",
    "    })\n",
    "\n",
    "pd.DataFrame(summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a46c7768",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export all panel reports as HTML\n",
    "for t, r in zip(panel, panel_reports):\n",
    "    if r.uniprot:\n",
    "        p = save_html(r, out / f\"{t}_report.html\")\n",
    "        print(f\"  {t} → {p}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "462ed9ec",
   "metadata": {},
   "source": [
    "## 11. Work with raw JSON\n",
    "\n",
    "The `TargetReport` is a Pydantic model — serialize/deserialize freely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3345767b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Serialize to dict\n",
    "data = report.model_dump()\n",
    "print(\"Top-level keys:\", list(data.keys()))\n",
    "\n",
    "# Serialize to JSON string\n",
    "json_str = report.model_dump_json(indent=2)\n",
    "print(f\"\\nJSON size: {len(json_str):,} characters\")\n",
    "\n",
    "# Load back from JSON file\n",
    "from targetrecon.models import TargetReport\n",
    "with open(out / \"EGFR_report.json\") as f:\n",
    "    loaded = TargetReport.model_validate_json(f.read())\n",
    "print(f\"\\nRound-trip: {loaded.uniprot.gene_name} | {loaded.num_unique_ligands} ligands\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79ec0d78",
   "metadata": {},
   "source": [
    "## 12. Quick visualization (optional — requires matplotlib)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "103e40d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    import matplotlib.pyplot as plt\n",
    "    from collections import Counter\n",
    "\n",
    "    pchembl_vals = [b.pchembl_value for b in report.bioactivities if b.pchembl_value]\n",
    "\n",
    "    fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "    # pChEMBL histogram\n",
    "    axes[0].hist(pchembl_vals, bins=30, color=\"#58a6ff\", edgecolor=\"white\", linewidth=0.5)\n",
    "    axes[0].axvline(7, color=\"#f85149\", linestyle=\"--\", label=\"pChEMBL = 7 (100 nM)\")\n",
    "    axes[0].axvline(9, color=\"#3fb950\", linestyle=\"--\", label=\"pChEMBL = 9 (1 nM)\")\n",
    "    axes[0].set_xlabel(\"pChEMBL value\")\n",
    "    axes[0].set_ylabel(\"Count\")\n",
    "    axes[0].set_title(f\"EGFR — pChEMBL distribution (n={len(pchembl_vals)})\")\n",
    "    axes[0].legend()\n",
    "\n",
    "    # Activity type bar chart\n",
    "    atype_counts = Counter(b.activity_type for b in report.bioactivities if b.activity_type)\n",
    "    top_types = dict(atype_counts.most_common(6))\n",
    "    axes[1].bar(top_types.keys(), top_types.values(), color=\"#bc8cff\", edgecolor=\"white\")\n",
    "    axes[1].set_xlabel(\"Activity type\")\n",
    "    axes[1].set_ylabel(\"Count\")\n",
    "    axes[1].set_title(\"EGFR — Bioactivity type breakdown\")\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.savefig(out / \"EGFR_charts.png\", dpi=150, bbox_inches=\"tight\")\n",
    "    plt.show()\n",
    "    print(\"Chart saved →\", out / \"EGFR_charts.png\")\n",
    "\n",
    "except ImportError:\n",
    "    print(\"matplotlib not installed — pip install matplotlib to enable visualizations\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}