# BioFlow Ingestion Guide (Phase 3)
This guide explains how to ingest data from **PubMed**, **UniProt**, and **ChEMBL** into Qdrant.
## 1) FastAPI Endpoints (Recommended)
### PubMed
`POST /api/ingest/pubmed`
```json
{
"query": "EGFR lung cancer",
"limit": 100,
"batch_size": 50,
"rate_limit": 0.4,
"collection": "bioflow_memory",
"email": "you@example.com",
"api_key": "NCBI_API_KEY",
"sync": false
}
```
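As a minimal sketch, the request above could be built from Python using only the standard library. The base URL `http://localhost:8000` and the helper name `build_ingest_request` are assumptions for illustration, not part of BioFlow. The response format is not documented in this guide, so the sketch stops at building the request and leaves the actual send commented out.

```python
import json
import urllib.request

# Assumption: the FastAPI backend is reachable at this address.
BASE_URL = "http://localhost:8000"


def build_ingest_request(
    source: str, payload: dict, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a JSON POST request for /api/ingest/<source> (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/ingest/{source}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same fields as the JSON body shown above.
pubmed_payload = {
    "query": "EGFR lung cancer",
    "limit": 100,
    "batch_size": 50,
    "rate_limit": 0.4,
    "collection": "bioflow_memory",
    "email": "you@example.com",
    "api_key": "NCBI_API_KEY",
    "sync": False,
}

request = build_ingest_request("pubmed", pubmed_payload)

# To actually send it (requires a running backend):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The same helper should work for the `uniprot`, `chembl`, and `all` endpoints by changing `source` and the payload fields.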
### UniProt
`POST /api/ingest/uniprot`
```json
{
"query": "EGFR AND organism_id:9606",
"limit": 50,
"batch_size": 50,
"rate_limit": 0.2,
"collection": "bioflow_memory",
"sync": false
}
```
### ChEMBL
`POST /api/ingest/chembl`
```json
{
"query": "EGFR",
"limit": 30,
"batch_size": 50,
"rate_limit": 0.3,
"collection": "bioflow_memory",
"search_mode": "target",
"sync": false
}
```
### All Sources
`POST /api/ingest/all`
```json
{
"query": "EGFR lung cancer",
"pubmed_limit": 100,
"uniprot_limit": 50,
"chembl_limit": 30,
"batch_size": 50,
"rate_limit": 0.3,
"collection": "bioflow_memory",
"sync": false
}
```
### Job Status
`GET /api/ingest/jobs/{job_id}`
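A polling loop for this endpoint might look like the sketch below. The response schema is not documented in this guide, so the `status` field and the terminal values `"completed"` and `"failed"` are assumptions; `wait_for_job` and `fetch_status` are hypothetical names. The HTTP call is passed in as a callable so the loop stays independent of any particular client.

```python
import time
from typing import Callable

# Assumption: the job endpoint returns JSON with a "status" field, and these
# values mark a finished job. Check the real response before relying on this.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    poll_interval: float = 2.0,
    max_polls: int = 30,
) -> dict:
    """Call fetch_status until the job reports a terminal status or polls run out."""
    for attempt in range(max_polls):
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Sleep between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            time.sleep(poll_interval)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Here `fetch_status` would typically wrap a `GET /api/ingest/jobs/{job_id}` call and return the decoded JSON body.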
## 2) Next.js Proxy Routes (Optional)
If you want to call the backend through Next.js, use the proxy routes below, which mirror the FastAPI paths:
```
/api/ingest/pubmed
/api/ingest/uniprot
/api/ingest/chembl
/api/ingest/all
/api/ingest/jobs/{job_id}
```
## 3) CLI Ingestion
```bash
python -m bioflow.ingestion.ingest_all --query "EGFR lung cancer" --limit 100
```
## 4) Environment Variables
- `INGEST_BATCH_SIZE`
- `PUBMED_RATE_LIMIT`
- `UNIPROT_RATE_LIMIT`
- `CHEMBL_RATE_LIMIT`
- `NCBI_EMAIL`
- `NCBI_API_KEY`
- `CHEMBL_SEARCH_MODE`
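
To illustrate how these variables could be consumed, the sketch below reads them into a plain dictionary. The fallback values are copied from the request examples in section 1 and are assumptions, not BioFlow's documented defaults; `load_ingest_settings` is a hypothetical helper, not a function in the codebase.

```python
import os
from typing import Mapping, Optional


def load_ingest_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read ingestion settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Fallbacks below are assumptions taken from the example request bodies.
        "batch_size": int(env.get("INGEST_BATCH_SIZE", "50")),
        "pubmed_rate_limit": float(env.get("PUBMED_RATE_LIMIT", "0.4")),
        "uniprot_rate_limit": float(env.get("UNIPROT_RATE_LIMIT", "0.2")),
        "chembl_rate_limit": float(env.get("CHEMBL_RATE_LIMIT", "0.3")),
        # Credentials have no fallback; None means "not configured".
        "ncbi_email": env.get("NCBI_EMAIL"),
        "ncbi_api_key": env.get("NCBI_API_KEY"),
        "chembl_search_mode": env.get("CHEMBL_SEARCH_MODE", "target"),
    }
```

Passing a dictionary instead of relying on `os.environ` makes the function easy to exercise in tests, for example `load_ingest_settings({"INGEST_BATCH_SIZE": "25"})`.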
## 5) Recommended Minimums
- PubMed: 100 records
- UniProt: 50 records
- ChEMBL: 30 records
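
A client could check a payload for `POST /api/ingest/all` against these minimums before submitting it. This is only a sketch: `below_recommended` is a hypothetical helper, and nothing in this guide says the backend enforces these values.

```python
from typing import List

# Recommended minimums from the list above, keyed by the /api/ingest/all fields.
RECOMMENDED_MINIMUMS = {
    "pubmed_limit": 100,
    "uniprot_limit": 50,
    "chembl_limit": 30,
}


def below_recommended(payload: dict) -> List[str]:
    """Return the limit fields present in payload that fall below the minimums."""
    return [
        field
        for field, minimum in RECOMMENDED_MINIMUMS.items()
        # Fields missing from the payload are skipped rather than flagged.
        if field in payload and payload[field] < minimum
    ]
```

For example, a payload with `"uniprot_limit": 10` and the other two limits at their recommended values yields `["uniprot_limit"]`.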