# BioFlow Ingestion Guide (Phase 3)
This guide explains how to ingest data from **PubMed**, **UniProt**, and **ChEMBL** into Qdrant.
## 1) FastAPI Endpoints (Recommended)
### PubMed
`POST /api/ingest/pubmed`
```json
{
"query": "EGFR lung cancer",
"limit": 100,
"batch_size": 50,
"rate_limit": 0.4,
"collection": "bioflow_memory",
"email": "you@example.com",
"api_key": "NCBI_API_KEY",
"sync": false
}
```
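As a minimal sketch, the request above could be built from Python with only the standard library. The base URL `http://localhost:8000` and the helper name `build_ingest_request` are assumptions made for illustration, not part of BioFlow; the same helper should work for the other sources by swapping the path segment and payload.

```python
import json
import urllib.request

# Assumption: the FastAPI backend is reachable here; adjust for your deployment.
BASE_URL = "http://localhost:8000"


def build_ingest_request(source, payload, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/ingest/{source}."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/ingest/{source}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Payload copied from the PubMed example above.
pubmed_payload = {
    "query": "EGFR lung cancer",
    "limit": 100,
    "batch_size": 50,
    "rate_limit": 0.4,
    "collection": "bioflow_memory",
    "email": "you@example.com",
    "api_key": "NCBI_API_KEY",
    "sync": False,
}

request = build_ingest_request("pubmed", pubmed_payload)

# To actually send it (requires a running backend):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping the request construction separate from sending makes it easy to inspect the URL and body before anything hits the network.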
### UniProt
`POST /api/ingest/uniprot`
```json
{
"query": "EGFR AND organism_id:9606",
"limit": 50,
"batch_size": 50,
"rate_limit": 0.2,
"collection": "bioflow_memory",
"sync": false
}
```
### ChEMBL
`POST /api/ingest/chembl`
```json
{
"query": "EGFR",
"limit": 30,
"batch_size": 50,
"rate_limit": 0.3,
"collection": "bioflow_memory",
"search_mode": "target",
"sync": false
}
```
### All Sources
`POST /api/ingest/all`
```json
{
"query": "EGFR lung cancer",
"pubmed_limit": 100,
"uniprot_limit": 50,
"chembl_limit": 30,
"batch_size": 50,
"rate_limit": 0.3,
"collection": "bioflow_memory",
"sync": false
}
```
### Job Status
`GET /api/ingest/jobs/{job_id}`
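This guide does not document the job response schema, so the following polling loop is only a sketch. It assumes the response is a JSON object with a `status` field and that `"completed"` and `"failed"` are terminal values; the base URL and the names `fetch_job` and `wait_for_job` are likewise hypothetical. Check the actual response from your backend before relying on it.

```python
import json
import time
import urllib.request

# Assumption: the FastAPI backend is reachable here.
BASE_URL = "http://localhost:8000"


def fetch_job(job_id, base_url=BASE_URL):
    """GET /api/ingest/jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/ingest/jobs/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_job, interval=2.0, timeout=600.0,
                 terminal=("completed", "failed")):
    """Poll until the job reports a terminal status or the timeout expires.

    Assumption: the job payload carries a "status" key; the terminal
    values are guesses and may differ in the real API.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in terminal:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        time.sleep(interval)
```

The `fetch` parameter is injectable so the loop can be exercised with a stub instead of a live server.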
## 2) Next.js Proxy Routes (Optional)
To call the backend through Next.js instead of hitting FastAPI directly, use the proxy routes, which mirror the backend paths:
```
/api/ingest/pubmed
/api/ingest/uniprot
/api/ingest/chembl
/api/ingest/all
/api/ingest/jobs/{job_id}
```
## 3) CLI Ingestion
```bash
python -m bioflow.ingestion.ingest_all --query "EGFR lung cancer" --limit 100
```
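If the CLI needs to be driven from a Python script, one option is to assemble the same command as an argument list and hand it to `subprocess`. This sketch uses only the `--query` and `--limit` flags shown above; the helper name `ingest_all_argv` is hypothetical.

```python
import sys


def ingest_all_argv(query, limit):
    """Return the argv list equivalent to the CLI command shown above."""
    return [
        sys.executable,  # run with the current interpreter
        "-m", "bioflow.ingestion.ingest_all",
        "--query", query,
        "--limit", str(limit),
    ]


argv = ingest_all_argv("EGFR lung cancer", 100)

# To run it (requires the bioflow package to be importable):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list rather than a shell string avoids quoting problems with multi-word queries.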
## 4) Environment Variables
- `INGEST_BATCH_SIZE`
- `PUBMED_RATE_LIMIT`
- `UNIPROT_RATE_LIMIT`
- `CHEMBL_RATE_LIMIT`
- `NCBI_EMAIL`
- `NCBI_API_KEY`
- `CHEMBL_SEARCH_MODE`
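This guide does not state how BioFlow parses these variables or what their defaults are. As an illustration only, the sketch below reads them into a settings dictionary, using the values from the request examples above as stand-in fallbacks. The function name `load_ingest_settings`, the dictionary keys, and the fallback values are all assumptions.

```python
import os


def load_ingest_settings(env=os.environ):
    """Read the ingestion variables from a mapping (os.environ by default).

    Assumption: the fallbacks mirror the JSON examples in this guide;
    the real BioFlow defaults may differ.
    """
    return {
        "batch_size": int(env.get("INGEST_BATCH_SIZE", "50")),
        "pubmed_rate_limit": float(env.get("PUBMED_RATE_LIMIT", "0.4")),
        "uniprot_rate_limit": float(env.get("UNIPROT_RATE_LIMIT", "0.2")),
        "chembl_rate_limit": float(env.get("CHEMBL_RATE_LIMIT", "0.3")),
        "ncbi_email": env.get("NCBI_EMAIL"),      # None when unset
        "ncbi_api_key": env.get("NCBI_API_KEY"),  # None when unset
        "chembl_search_mode": env.get("CHEMBL_SEARCH_MODE", "target"),
    }


settings = load_ingest_settings({"INGEST_BATCH_SIZE": "25", "NCBI_EMAIL": "you@example.com"})
```

Accepting the mapping as a parameter keeps the function testable without touching the real process environment.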
## 5) Recommended Minimums
- PubMed: 100 records
- UniProt: 50 records
- ChEMBL: 30 records
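A small check like the following could flag an `/api/ingest/all` payload whose limits fall below these recommendations before it is submitted. The helper name `below_recommended` is hypothetical, and treating a missing limit as 0 is an assumption of this sketch.

```python
# Recommended minimums from the list above, keyed by the field names
# used in the /api/ingest/all payload.
RECOMMENDED_MINIMUMS = {
    "pubmed_limit": 100,
    "uniprot_limit": 50,
    "chembl_limit": 30,
}


def below_recommended(payload):
    """Return {field: (requested, recommended)} for limits under the minimum."""
    shortfalls = {}
    for field, minimum in RECOMMENDED_MINIMUMS.items():
        requested = payload.get(field, 0)  # assumption: missing means 0
        if requested < minimum:
            shortfalls[field] = (requested, minimum)
    return shortfalls


shortfalls = below_recommended(
    {"query": "EGFR lung cancer", "pubmed_limit": 100, "uniprot_limit": 20, "chembl_limit": 30}
)
```

An empty result means every limit meets or exceeds its recommended minimum.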