TEZv's picture
Upload 7 files
35e6a9d verified
# Data Sources & API Endpoints
**K R&D Lab — Cancer Research Suite**
Author: Oksana Kolisnyk | kosatiks-group.pp.ua
Repo: github.com/TEZv/K-RnD-Lab-PHYLO-03_2026
Generated: 2026-03-07
---
## Real Data APIs (Group A Tabs)
### 1. PubMed E-utilities (NCBI)
| Property | Value |
|----------|-------|
| **Base URL** | `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` |
| **Auth** | None required (free, no API key) |
| **Rate limit** | 3 requests/sec without key; enforced via `time.sleep(0.34)` |
| **Endpoints used** | `esearch.fcgi` — search & count; `esummary.fcgi` — fetch metadata |
| **Used in tabs** | A1 (paper counts per process), A4 (papers per year), A2 (gene paper counts) |
| **Docs** | https://www.ncbi.nlm.nih.gov/books/NBK25501/ |
| **Terms of use** | https://www.ncbi.nlm.nih.gov/home/about/policies/ |
**Example call (paper count):**
```
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=pubmed
&term="ferroptosis" AND "GBM"[tiab]
&rettype=count
&retmode=json
```
---
### 2. ClinVar E-utilities (NCBI)
| Property | Value |
|----------|-------|
| **Base URL** | `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` |
| **Auth** | None required |
| **Rate limit** | Same as PubMed (3 req/sec) |
| **Endpoints used** | `esearch.fcgi?db=clinvar` — variant search; `esummary.fcgi?db=clinvar` — classification |
| **Used in tabs** | A3 (Real Variant Lookup) |
| **Docs** | https://www.ncbi.nlm.nih.gov/clinvar/docs/api_http/ |
| **Data policy** | All ClinVar data is public domain |
**Example call:**
```
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=clinvar
&term=NM_007294.4:c.5266dupC
&retmode=json
&retmax=5
```
---
### 3. OpenTargets Platform GraphQL API
| Property | Value |
|----------|-------|
| **Base URL** | `https://api.platform.opentargets.org/api/v4/graphql` |
| **Auth** | None required (free, open access) |
| **Rate limit** | No hard limit; reasonable use expected |
| **Endpoints used** | GraphQL POST — disease associations, tractability, known drugs |
| **Used in tabs** | A1 (process associations), A2 (target gap index), A5 (druggable orphans) |
| **Docs** | https://platform-docs.opentargets.org/data-access/graphql-api |
| **Data release** | Updated quarterly; cite as "Open Targets Platform [release date]" |
| **License** | CC0 (public domain) |
**Example query (disease-associated targets):**
```graphql
query AssocTargets($efoId: String!, $size: Int!) {
disease(efoId: $efoId) {
associatedTargets(page: {index: 0, size: $size}) {
rows {
target { approvedSymbol approvedName }
score
}
}
}
}
```
**EFO IDs used:**
| Cancer | EFO ID |
|--------|--------|
| GBM | EFO_0000519 |
| PDAC | EFO_0002618 |
| SCLC | EFO_0000702 |
| UVM | EFO_0004339 |
| DIPG | EFO_0009708 |
| ACC | EFO_0003060 |
| MCC | EFO_0005558 |
| PCNSL | EFO_0005543 |
| Pediatric AML | EFO_0000222 |
---
### 4. gnomAD GraphQL API
| Property | Value |
|----------|-------|
| **Base URL** | `https://gnomad.broadinstitute.org/api` |
| **Auth** | None required |
| **Rate limit** | No hard limit; reasonable use expected |
| **Endpoints used** | GraphQL POST — `variantSearch` query |
| **Dataset** | `gnomad_r4` (v4, 807,162 individuals) |
| **Used in tabs** | A3 (Real Variant Lookup — allele frequency) |
| **Docs** | https://gnomad.broadinstitute.org/api |
| **License** | ODC Open Database License (ODbL) |
**Example query:**
```graphql
query VariantSearch($query: String!, $dataset: DatasetId!) {
variantSearch(query: $query, dataset: $dataset) {
variant_id
rsids
exome { af }
genome { af }
}
}
```
---
### 5. ClinicalTrials.gov API v2
| Property | Value |
|----------|-------|
| **Base URL** | `https://clinicaltrials.gov/api/v2` |
| **Auth** | None required |
| **Rate limit** | No hard limit documented; polite use recommended |
| **Endpoints used** | `GET /studies` — trial search by gene + cancer type |
| **Used in tabs** | A2 (trial counts per gene), A5 (orphan target trial check) |
| **Docs** | https://clinicaltrials.gov/data-api/api |
| **Data policy** | Public domain (US government) |
**Example call:**
```
GET https://clinicaltrials.gov/api/v2/studies
?query.term=KRAS GBM
&pageSize=1
&format=json
```
---
### 6. DepMap Public Data
| Property | Value |
|----------|-------|
| **Source** | Broad Institute DepMap Portal |
| **URL** | https://depmap.org/portal/download/all/ |
| **File** | `CRISPR_gene_effect.csv` (Chronos scores) |
| **Auth** | None required (public download) |
| **Used in tabs** | A2 (essentiality scores for gap index) |
| **Score convention** | **Negative = essential** (−1 = median essential gene effect); inverted in app per know-how guide |
| **License** | CC BY 4.0 |
| **Citation** | Broad Institute DepMap, [release]. DepMap Public [release]. figshare. |
> **Implementation note:** The app uses a curated reference gene set with representative scores as a lightweight proxy. For full analysis, download the complete CRISPR_gene_effect.csv (~500 MB) from depmap.org and replace `_load_depmap_sample()` in `app.py`.
---
## Simulated Data Sources (Group B Tabs)
All Group B tabs use **rule-based computational models** — no external APIs.
| Tab | Model Type | Basis |
|-----|-----------|-------|
| B1 — miRNA Explorer | Curated lookup table | Published miRNA-target databases (miRDB, TargetScan concepts) |
| B2 — siRNA Targets | Curated efficacy estimates | Published siRNA screen literature |
| B3 — LNP Corona | Langmuir adsorption model | Corona proteomics literature (Monopoli et al. 2012; Lundqvist et al. 2017) |
| B4 — Flow Corona | Competitive Langmuir kinetics | Vroman effect literature (Vroman 1962; Hirsh et al. 2013) |
| B5 — Variant Concepts | ACMG/AMP 2015 rule set | Richards et al. 2015 ACMG guidelines |
> ⚠️ All Group B outputs are labeled **SIMULATED** in the UI and must not be used for clinical or research decisions.
---
## RAG Chatbot (Tab A6)
| Property | Value |
|----------|-------|
| **Embedding model** | `all-MiniLM-L6-v2` (sentence-transformers) |
| **Model size** | ~80 MB, CPU-compatible |
| **Vector index** | FAISS `IndexFlatIP` (cosine similarity on L2-normalized vectors) |
| **Corpus** | 20 curated paper abstracts (see `chatbot.py` `PAPER_CORPUS`) |
| **Source** | PubMed abstracts (public domain) |
| **No external API** | Fully offline after model download |
**20 Indexed PMIDs** *(all verified against PubMed esummary + efetch, 2026-03-07):*
| PMID | First Author | Topic | Journal | Year |
|------|-------------|-------|---------|------|
| 34394960 | Hou X | LNP mRNA delivery review | Nat Rev Mater | 2021 |
| 32251383 | Cheng Q | SORT LNPs organ selectivity | Nat Nanotechnol | 2020 |
| 29653760 | Sabnis S | Novel amino lipid series for mRNA | Mol Ther | 2018 |
| 22782619 | Jayaraman M | Ionizable lipid siRNA LNP potency | Angew Chem Int Ed | 2012 |
| 33208369 | Rosenblum D | CRISPR-Cas9 LNP cancer therapy | Sci Adv | 2020 |
| 18809927 | Lundqvist M | Nanoparticle size/surface protein corona | PNAS | 2008 |
| 22086677 | Walkey CD | Nanomaterial-protein interactions | Chem Soc Rev | 2012 |
| 31565943 | Park M | Accessible surface area nanoparticle corona | Nano Lett | 2019 |
| 33754708 | Sebastiani F | ApoE binding drives LNP rearrangement | ACS Nano | 2021 |
| 20461061 | Akinc A | Endogenous ApoE-mediated LNP liver delivery | Mol Ther | 2010 |
| 30096302 | Bailey MH | Cancer driver genes TCGA pan-cancer | Cell | 2018 |
| 30311387 | Landrum MJ | ClinVar at five years | Hum Mutat | 2018 |
| 32461654 | Karczewski KJ | gnomAD mutational constraint 141,456 humans | Nature | 2020 |
| 27328919 | Bouaoun L | TP53 variations IARC database | Hum Mutat | 2016 |
| 31820981 | Lanman BA | KRAS G12C covalent inhibitor AMG 510 | J Med Chem | 2020 |
| 28678784 | Sahin U | Personalized RNA mutanome vaccines | Nature | 2017 |
| 31348638 | Kozma GT | Anti-PEG IgM complement activation LNP | ACS Nano | 2019 |
| 33016924 | Cafri G | mRNA neoantigen T cell immunity GI cancer | J Clin Invest | 2020 |
| 31142840 | Cristiano S | Genome-wide cfDNA fragmentation in cancer | Nature | 2019 |
| 33883548 | Larson MH | Cell-free transcriptome tissue biomarkers | Nat Commun | 2021 |
---
## Caching System
All real API calls are cached locally to reduce latency and respect rate limits.
| Property | Value |
|----------|-------|
| **Cache directory** | `./cache/` |
| **TTL** | 24 hours |
| **Key format** | `{endpoint}_{md5(query)}.json` |
| **Format** | JSON |
| **Invalidation** | Automatic on TTL expiry; manual by deleting `./cache/` |
---
## Lab Journal
| Property | Value |
|----------|-------|
| **File** | `./lab_journal.csv` |
| **Format** | CSV (timestamp, tab, action, result_summary, note) |
| **Auto-logged** | Every tab run automatically logs an entry |
| **Manual notes** | Via sidebar note field |
---
*Data Sources documented by K R&D Lab Cancer Research Suite | 2026-03-07*