# Data Sources & API Endpoints
**K R&D Lab — Cancer Research Suite**
Author: Oksana Kolisnyk | kosatiks-group.pp.ua
Repo: github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
Generated: 2026-03-07
---
## Real Data APIs (Group A Tabs)
### 1. PubMed E-utilities (NCBI)
| Property | Value |
|----------|-------|
| **Base URL** | `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` |
| **Auth** | None required (free, no API key) |
| **Rate limit** | 3 requests/sec without an API key (10 requests/sec with one); enforced via `time.sleep(0.34)` |
| **Endpoints used** | `esearch.fcgi` — search & count; `esummary.fcgi` — fetch metadata |
| **Used in tabs** | A1 (paper counts per process), A4 (papers per year), A2 (gene paper counts) |
| **Docs** | https://www.ncbi.nlm.nih.gov/books/NBK25501/ |
| **Terms of use** | https://www.ncbi.nlm.nih.gov/home/about/policies/ |
**Example call (paper count):**
```
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=pubmed
&term="ferroptosis" AND "GBM"[tiab]
&rettype=count
&retmode=json
```
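
A minimal Python sketch of the count query above, using only the standard library. The helper names (`build_esearch_url`, `parse_count`, `fetch_count`) are hypothetical and are not taken from `app.py`; the 0.34 s pause mirrors the rate-limit row in the table. Unlike the hand-written example, the search term is URL-encoded before it is sent.

```python
import json
import time
import urllib.parse
import urllib.request

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term: str, db: str = "pubmed") -> str:
    """Return a URL-encoded esearch request that asks only for the hit count."""
    params = {"db": db, "term": term, "rettype": "count", "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def parse_count(payload: dict) -> int:
    """Extract the hit count (returned as a string) from an esearch JSON response."""
    return int(payload["esearchresult"]["count"])


def fetch_count(term: str, pause: float = 0.34) -> int:
    """Run the query, then sleep to stay under 3 requests/sec (hypothetical helper)."""
    with urllib.request.urlopen(build_esearch_url(term), timeout=30) as resp:
        payload = json.load(resp)
    time.sleep(pause)
    return parse_count(payload)
```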
---
### 2. ClinVar E-utilities (NCBI)
| Property | Value |
|----------|-------|
| **Base URL** | `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` |
| **Auth** | None required |
| **Rate limit** | Same as PubMed (3 req/sec) |
| **Endpoints used** | `esearch.fcgi?db=clinvar` — variant search; `esummary.fcgi?db=clinvar` — classification |
| **Used in tabs** | A3 (Real Variant Lookup) |
| **Docs** | https://www.ncbi.nlm.nih.gov/clinvar/docs/api_http/ |
| **Data policy** | All ClinVar data is public domain |
**Example call:**
```
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=clinvar
&term=NM_007294.4:c.5266dupC
&retmode=json
&retmax=5
```
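
The two endpoints listed in the table are typically chained: `esearch` returns ClinVar record IDs, and `esummary` returns the classification for those IDs. The sketch below shows only that hand-off. The function names are hypothetical, and the `esearchresult.idlist` field is the standard esearch JSON layout rather than something confirmed from the app's code.

```python
import urllib.parse

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def extract_ids(esearch_payload: dict) -> list:
    """Pull the list of record IDs out of an esearch JSON response."""
    return list(esearch_payload.get("esearchresult", {}).get("idlist", []))


def build_esummary_url(ids: list, db: str = "clinvar") -> str:
    """Build the follow-up esummary request for the IDs found by esearch."""
    params = {"db": db, "id": ",".join(ids), "retmode": "json"}
    return f"{EUTILS_BASE}/esummary.fcgi?{urllib.parse.urlencode(params)}"
```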
---
### 3. OpenTargets Platform GraphQL API
| Property | Value |
|----------|-------|
| **Base URL** | `https://api.platform.opentargets.org/api/v4/graphql` |
| **Auth** | None required (free, open access) |
| **Rate limit** | No hard limit; reasonable use expected |
| **Endpoints used** | GraphQL POST — disease associations, tractability, known drugs |
| **Used in tabs** | A1 (process associations), A2 (target gap index), A5 (druggable orphans) |
| **Docs** | https://platform-docs.opentargets.org/data-access/graphql-api |
| **Data release** | Updated quarterly; cite as "Open Targets Platform [release date]" |
| **License** | CC0 (public domain) |
**Example query (disease-associated targets):**
```graphql
query AssocTargets($efoId: String!, $size: Int!) {
disease(efoId: $efoId) {
associatedTargets(page: {index: 0, size: $size}) {
rows {
target { approvedSymbol approvedName }
score
}
}
}
}
```
**EFO IDs used:**
| Cancer | EFO ID |
|--------|--------|
| GBM | EFO_0000519 |
| PDAC | EFO_0002618 |
| SCLC | EFO_0000702 |
| UVM | EFO_0004339 |
| DIPG | EFO_0009708 |
| ACC | EFO_0003060 |
| MCC | EFO_0005558 |
| PCNSL | EFO_0005543 |
| Pediatric AML | EFO_0000222 |
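
As a rough illustration, the query and the EFO table above can be combined into a GraphQL POST body like this. `build_request_body` and `top_targets` are hypothetical names, the default page size of 25 is an arbitrary choice, and the response shape is inferred from the query itself rather than from the app's code.

```python
import json

OT_URL = "https://api.platform.opentargets.org/api/v4/graphql"

ASSOC_TARGETS_QUERY = """
query AssocTargets($efoId: String!, $size: Int!) {
  disease(efoId: $efoId) {
    associatedTargets(page: {index: 0, size: $size}) {
      rows {
        target { approvedSymbol approvedName }
        score
      }
    }
  }
}
"""

# EFO IDs copied from the table above.
EFO_IDS = {
    "GBM": "EFO_0000519",
    "PDAC": "EFO_0002618",
    "SCLC": "EFO_0000702",
    "UVM": "EFO_0004339",
    "DIPG": "EFO_0009708",
    "ACC": "EFO_0003060",
    "MCC": "EFO_0005558",
    "PCNSL": "EFO_0005543",
    "Pediatric AML": "EFO_0000222",
}


def build_request_body(cancer: str, size: int = 25) -> bytes:
    """Serialize the query and its variables as a JSON POST body."""
    variables = {"efoId": EFO_IDS[cancer], "size": size}
    return json.dumps({"query": ASSOC_TARGETS_QUERY, "variables": variables}).encode("utf-8")


def top_targets(response: dict) -> list:
    """Flatten the response rows into (symbol, score) pairs."""
    rows = response["data"]["disease"]["associatedTargets"]["rows"]
    return [(row["target"]["approvedSymbol"], row["score"]) for row in rows]
```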
---
### 4. gnomAD GraphQL API
| Property | Value |
|----------|-------|
| **Base URL** | `https://gnomad.broadinstitute.org/api` |
| **Auth** | None required |
| **Rate limit** | No hard limit; reasonable use expected |
| **Endpoints used** | GraphQL POST — `variantSearch` query |
| **Dataset** | `gnomad_r4` (v4, 807,162 individuals) |
| **Used in tabs** | A3 (Real Variant Lookup — allele frequency) |
| **Docs** | https://gnomad.broadinstitute.org/api |
| **License** | CC0 1.0 (public domain dedication) |
**Example query:**
```graphql
query VariantSearch($query: String!, $dataset: DatasetId!) {
variantSearch(query: $query, dataset: $dataset) {
variant_id
rsids
exome { af }
genome { af }
}
}
```
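
A hedged sketch of how the query above might be packaged and how an allele frequency could be read from the result. The query text is copied verbatim from this document and has not been checked against the live gnomAD schema. `build_variant_payload` and `best_af` are hypothetical helpers, and taking the larger of the exome and genome frequencies is an illustrative choice, not the app's documented behavior.

```python
import json

GNOMAD_URL = "https://gnomad.broadinstitute.org/api"

# Query text as given in this document (not verified against the live schema).
VARIANT_SEARCH_QUERY = """
query VariantSearch($query: String!, $dataset: DatasetId!) {
  variantSearch(query: $query, dataset: $dataset) {
    variant_id
    rsids
    exome { af }
    genome { af }
  }
}
"""


def build_variant_payload(variant_query: str, dataset: str = "gnomad_r4") -> bytes:
    """Serialize the GraphQL query and variables as a JSON POST body."""
    variables = {"query": variant_query, "dataset": dataset}
    return json.dumps({"query": VARIANT_SEARCH_QUERY, "variables": variables}).encode("utf-8")


def best_af(variant: dict):
    """Return the larger of the exome and genome allele frequencies, or None."""
    freqs = [(variant.get(source) or {}).get("af") for source in ("exome", "genome")]
    freqs = [af for af in freqs if af is not None]
    return max(freqs) if freqs else None
```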
---
### 5. ClinicalTrials.gov API v2
| Property | Value |
|----------|-------|
| **Base URL** | `https://clinicaltrials.gov/api/v2` |
| **Auth** | None required |
| **Rate limit** | No hard limit documented; polite use recommended |
| **Endpoints used** | `GET /studies` — trial search by gene + cancer type |
| **Used in tabs** | A2 (trial counts per gene), A5 (orphan target trial check) |
| **Docs** | https://clinicaltrials.gov/data-api/api |
| **Data policy** | Public domain (US government) |
**Example call:**
```
GET https://clinicaltrials.gov/api/v2/studies
?query.term=KRAS GBM
&pageSize=1
&format=json
```
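
One way the trial count could be obtained from this endpoint is sketched below. Note one assumption beyond the example call: the sketch adds `countTotal=true`, because the v2 API reports `totalCount` only when it is requested. The function names are hypothetical.

```python
import urllib.parse

CTGOV_BASE = "https://clinicaltrials.gov/api/v2"


def build_studies_url(gene: str, cancer: str) -> str:
    """Build a /studies search URL for a gene plus cancer type."""
    params = {
        "query.term": f"{gene} {cancer}",
        "pageSize": 1,          # one study is enough; only the total is needed
        "format": "json",
        "countTotal": "true",   # assumption: added so the response carries totalCount
    }
    return f"{CTGOV_BASE}/studies?{urllib.parse.urlencode(params)}"


def parse_total(payload: dict) -> int:
    """Read the total number of matching studies, defaulting to 0 if absent."""
    return int(payload.get("totalCount", 0))
```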
---
### 6. DepMap Public Data
| Property | Value |
|----------|-------|
| **Source** | Broad Institute DepMap Portal |
| **URL** | https://depmap.org/portal/download/all/ |
| **File** | `CRISPR_gene_effect.csv` (Chronos scores) |
| **Auth** | None required (public download) |
| **Used in tabs** | A2 (essentiality scores for gap index) |
| **Score convention** | **Negative = essential** (0 ≈ no effect; −1 = median effect of common essential genes); inverted in the app per the know-how guide |
| **License** | CC BY 4.0 |
| **Citation** | Broad Institute DepMap, [release]. DepMap Public [release]. figshare. |
> **Implementation note:** The app uses a curated reference gene set with representative scores as a lightweight proxy. For full analysis, download the complete CRISPR_gene_effect.csv (~500 MB) from depmap.org and replace `_load_depmap_sample()` in `app.py`.
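
A minimal sketch of what a full-file replacement for `_load_depmap_sample()` could look like, using only the standard library. It assumes the DepMap layout of one row per cell line and gene columns named like `KRAS (3845)`, and it assumes that "inverted" means a simple sign flip. Both assumptions should be checked against the downloaded release and the know-how guide. The function names are hypothetical.

```python
import csv
from statistics import mean


def load_mean_effects(handle, genes):
    """Average each requested gene's Chronos score across all cell lines."""
    reader = csv.DictReader(handle)
    # Map bare symbols to full column names such as "KRAS (3845)";
    # the first column holds the cell-line ID and is skipped.
    columns = {name.split(" (")[0]: name for name in reader.fieldnames[1:]}
    values = {gene: [] for gene in genes if gene in columns}
    for row in reader:
        for gene in values:
            cell = row[columns[gene]]
            if cell:  # skip missing scores
                values[gene].append(float(cell))
    return {gene: mean(scores) for gene, scores in values.items() if scores}


def essentiality(effect: float) -> float:
    """Flip the sign so that larger values mean more essential (assumed convention)."""
    return -effect
```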
---
## Simulated Data Sources (Group B Tabs)
All Group B tabs use **rule-based computational models** — no external APIs.
| Tab | Model Type | Basis |
|-----|-----------|-------|
| B1 — miRNA Explorer | Curated lookup table | Published miRNA-target databases (miRDB, TargetScan concepts) |
| B2 — siRNA Targets | Curated efficacy estimates | Published siRNA screen literature |
| B3 — LNP Corona | Langmuir adsorption model | Corona proteomics literature (Monopoli et al. 2012; Lundqvist et al. 2017) |
| B4 — Flow Corona | Competitive Langmuir kinetics | Vroman effect literature (Vroman 1962; Hirsh et al. 2013) |
| B5 — Variant Concepts | ACMG/AMP 2015 rule set | Richards et al. 2015 ACMG guidelines |
> ⚠️ All Group B outputs are labeled **SIMULATED** in the UI and must not be used for clinical or research decisions.
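
To make the Langmuir rows concrete, here is a small sketch of the equilibrium competitive Langmuir isotherm, where each protein's fractional surface coverage is `K_i * c_i / (1 + sum_j K_j * c_j)`. This covers only the equilibrium limit, not the time-dependent kinetics of tab B4, and the protein names and constants in any call are illustrative rather than values from the app.

```python
def competitive_langmuir(conc: dict, affinity: dict) -> dict:
    """Equilibrium fractional coverage per species.

    theta_i = K_i * c_i / (1 + sum_j K_j * c_j)
    conc maps species -> concentration; affinity maps species -> K (inverse units).
    """
    weighted = {name: affinity[name] * c for name, c in conc.items()}
    denominator = 1.0 + sum(weighted.values())
    return {name: w / denominator for name, w in weighted.items()}
```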
---
## RAG Chatbot (Tab A6)
| Property | Value |
|----------|-------|
| **Embedding model** | `all-MiniLM-L6-v2` (sentence-transformers) |
| **Model size** | ~80 MB, CPU-compatible |
| **Vector index** | FAISS `IndexFlatIP` (cosine similarity on L2-normalized vectors) |
| **Corpus** | 20 curated paper abstracts (see `chatbot.py` `PAPER_CORPUS`) |
| **Source** | PubMed abstracts (public domain) |
| **No external API** | Fully offline after model download |
**20 Indexed PMIDs** *(all verified against PubMed esummary + efetch, 2026-03-07):*
| PMID | First Author | Topic | Journal | Year |
|------|-------------|-------|---------|------|
| 34394960 | Hou X | LNP mRNA delivery review | Nat Rev Mater | 2021 |
| 32251383 | Cheng Q | SORT LNPs organ selectivity | Nat Nanotechnol | 2020 |
| 29653760 | Sabnis S | Novel amino lipid series for mRNA | Mol Ther | 2018 |
| 22782619 | Jayaraman M | Ionizable lipid siRNA LNP potency | Angew Chem Int Ed | 2012 |
| 33208369 | Rosenblum D | CRISPR-Cas9 LNP cancer therapy | Sci Adv | 2020 |
| 18809927 | Lundqvist M | Nanoparticle size/surface protein corona | PNAS | 2008 |
| 22086677 | Walkey CD | Nanomaterial-protein interactions | Chem Soc Rev | 2012 |
| 31565943 | Park M | Accessible surface area nanoparticle corona | Nano Lett | 2019 |
| 33754708 | Sebastiani F | ApoE binding drives LNP rearrangement | ACS Nano | 2021 |
| 20461061 | Akinc A | Endogenous ApoE-mediated LNP liver delivery | Mol Ther | 2010 |
| 30096302 | Bailey MH | Cancer driver genes TCGA pan-cancer | Cell | 2018 |
| 30311387 | Landrum MJ | ClinVar at five years | Hum Mutat | 2018 |
| 32461654 | Karczewski KJ | gnomAD mutational constraint 141,456 humans | Nature | 2020 |
| 27328919 | Bouaoun L | TP53 variations IARC database | Hum Mutat | 2016 |
| 31820981 | Lanman BA | KRAS G12C covalent inhibitor AMG 510 | J Med Chem | 2020 |
| 28678784 | Sahin U | Personalized RNA mutanome vaccines | Nature | 2017 |
| 31348638 | Kozma GT | Anti-PEG IgM complement activation LNP | ACS Nano | 2019 |
| 33016924 | Cafri G | mRNA neoantigen T cell immunity GI cancer | J Clin Invest | 2020 |
| 31142840 | Cristiano S | Genome-wide cfDNA fragmentation in cancer | Nature | 2019 |
| 33883548 | Larson MH | Cell-free transcriptome tissue biomarkers | Nat Commun | 2021 |
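
The retrieval step relies on the fact that the inner product of L2-normalized vectors equals their cosine similarity, which is why an inner-product index can serve cosine search. The dependency-free sketch below stands in for the FAISS index to show that computation. It is not the code in `chatbot.py`, the function names are hypothetical, and real embeddings from `all-MiniLM-L6-v2` would replace the toy vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, corpus, k=3):
    """Return (score, index) pairs for the k most similar corpus vectors."""
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(doc)), i) for i, doc in enumerate(corpus)]
    return sorted(scored, reverse=True)[:k]
```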
---
## Caching System
All real API calls are cached locally to reduce latency and respect rate limits.
| Property | Value |
|----------|-------|
| **Cache directory** | `./cache/` |
| **TTL** | 24 hours |
| **Key format** | `{endpoint}_{md5(query)}.json` |
| **Format** | JSON |
| **Invalidation** | Automatic on TTL expiry; manual by deleting `./cache/` |
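
The key format and TTL above could be implemented along these lines. This is a sketch under stated assumptions: freshness is judged by file modification time, and the helper names (`cache_path`, `read_cache`, `write_cache`) are hypothetical, so the app's actual cache code may differ.

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("./cache")
TTL_SECONDS = 24 * 60 * 60  # 24 hours


def cache_path(endpoint: str, query: str, cache_dir=CACHE_DIR) -> Path:
    """Build the {endpoint}_{md5(query)}.json path for a request."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{endpoint}_{digest}.json"


def read_cache(endpoint: str, query: str, cache_dir=CACHE_DIR, ttl=TTL_SECONDS):
    """Return cached JSON, or None if the entry is missing or older than the TTL."""
    path = cache_path(endpoint, query, cache_dir)
    if not path.exists() or time.time() - path.stat().st_mtime > ttl:
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def write_cache(endpoint: str, query: str, data, cache_dir=CACHE_DIR) -> None:
    """Store a JSON-serializable response, creating the cache directory if needed."""
    path = cache_path(endpoint, query, cache_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
```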
---
## Lab Journal
| Property | Value |
|----------|-------|
| **File** | `./lab_journal.csv` |
| **Format** | CSV (timestamp, tab, action, result_summary, note) |
| **Auto-logged** | Every tab run automatically logs an entry |
| **Manual notes** | Via sidebar note field |
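
A possible shape for the logging helper, given the column list above. The function name `log_entry`, the header row, and the ISO 8601 UTC timestamp format are assumptions made for illustration and are not confirmed by the app's code.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

JOURNAL_FIELDS = ["timestamp", "tab", "action", "result_summary", "note"]


def log_entry(tab, action, result_summary, note="", path="./lab_journal.csv"):
    """Append one journal row, writing the header first if the file is new or empty."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=JOURNAL_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tab": tab,
            "action": action,
            "result_summary": result_summary,
            "note": note,
        })
```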
---
*Data Sources documented by K R&D Lab Cancer Research Suite | 2026-03-07*