
BioFlow Ingestion Guide (Phase 3)

This guide explains how to ingest data from PubMed, UniProt, and ChEMBL into Qdrant.

1) FastAPI Endpoints (Recommended)

PubMed

POST /api/ingest/pubmed

{
  "query": "EGFR lung cancer",
  "limit": 100,
  "batch_size": 50,
  "rate_limit": 0.4,
  "collection": "bioflow_memory",
  "email": "you@example.com",
  "api_key": "NCBI_API_KEY",
  "sync": false
}
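
As a minimal sketch, the request above could be sent from Python using only the standard library. The base URL `http://localhost:8000` is an assumption, since this guide does not state where the FastAPI backend listens. The helper names `build_ingest_request` and `send` are hypothetical. The response format is not documented here, so the sketch simply decodes whatever JSON comes back.

```python
import json
import urllib.request

# Assumption: the FastAPI backend listens here; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_ingest_request(
    source: str, payload: dict, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for /api/ingest/{source} with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/api/ingest/{source}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same fields as the PubMed example payload in this guide.
pubmed_payload = {
    "query": "EGFR lung cancer",
    "limit": 100,
    "batch_size": 50,
    "rate_limit": 0.4,
    "collection": "bioflow_memory",
    "email": "you@example.com",
    "api_key": "NCBI_API_KEY",  # placeholder, replace with a real key
    "sync": False,
}
```

Calling `send(build_ingest_request("pubmed", pubmed_payload))` performs the request. The same two helpers should work for the other sources by changing `source` and the payload.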

UniProt

POST /api/ingest/uniprot

{
  "query": "EGFR AND organism_id:9606",
  "limit": 50,
  "batch_size": 50,
  "rate_limit": 0.2,
  "collection": "bioflow_memory",
  "sync": false
}

ChEMBL

POST /api/ingest/chembl

{
  "query": "EGFR",
  "limit": 30,
  "batch_size": 50,
  "rate_limit": 0.3,
  "collection": "bioflow_memory",
  "search_mode": "target",
  "sync": false
}

All Sources

POST /api/ingest/all

{
  "query": "EGFR lung cancer",
  "pubmed_limit": 100,
  "uniprot_limit": 50,
  "chembl_limit": 30,
  "batch_size": 50,
  "rate_limit": 0.3,
  "collection": "bioflow_memory",
  "sync": false
}

Job Status

GET /api/ingest/jobs/{job_id}
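
A polling loop for this endpoint might look like the sketch below. It rests on several assumptions that this guide does not confirm: the backend address, a `status` field in the JSON response, and the terminal values `"completed"` and `"failed"`. Check the actual response of your backend before relying on any of them. The `fetch` parameter is injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # assumption: backend address

# Assumption: the job payload carries a "status" field and these values
# mark a finished job. Neither is documented in this guide.
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_job(job_id: str, base_url: str = BASE_URL) -> dict:
    """GET /api/ingest/jobs/{job_id} and decode the JSON body."""
    url = f"{base_url}/api/ingest/jobs/{job_id}"
    with urllib.request.urlopen(url, timeout=30.0) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval: float = 2.0,
    max_polls: int = 150,
) -> dict:
    """Poll the job until it reports a terminal status or polls run out."""
    for attempt in range(max_polls):
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Sleep between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```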

2) Next.js Proxy Routes (Optional)

To call the backend through the Next.js app instead, use these proxy routes, which mirror the FastAPI paths:

/api/ingest/pubmed
/api/ingest/uniprot
/api/ingest/chembl
/api/ingest/all
/api/ingest/jobs/{job_id}

3) CLI Ingestion

python -m bioflow.ingestion.ingest_all --query "EGFR lung cancer" --limit 100
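
To script several CLI runs, the command above could be wrapped from Python as in this sketch. It uses only the `--query` and `--limit` flags shown in this guide. The helper names are hypothetical, and running it assumes the `bioflow` package is importable by the current interpreter.

```python
import subprocess
import sys
from typing import List


def build_ingest_command(query: str, limit: int) -> List[str]:
    """Build the argv for the ingest_all module using the current interpreter."""
    return [
        sys.executable,
        "-m",
        "bioflow.ingestion.ingest_all",
        "--query",
        query,
        "--limit",
        str(limit),
    ]


def run_ingest(query: str, limit: int) -> subprocess.CompletedProcess:
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_ingest_command(query, limit), check=True)
```

Passing the arguments as a list avoids shell quoting problems with multi-word queries such as "EGFR lung cancer".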

4) Environment Variables

  • INGEST_BATCH_SIZE
  • PUBMED_RATE_LIMIT
  • UNIPROT_RATE_LIMIT
  • CHEMBL_RATE_LIMIT
  • NCBI_EMAIL
  • NCBI_API_KEY
  • CHEMBL_SEARCH_MODE
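
This guide does not state the types or defaults of these variables. The sketch below shows one way they might be read. It assumes the rate limits are floats and the batch size is an integer, and it borrows its fallback values from the example payloads in section 1. `IngestSettings` and `load_settings` are hypothetical names, not part of BioFlow.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class IngestSettings:  # hypothetical container, not part of BioFlow
    batch_size: int
    pubmed_rate_limit: float
    uniprot_rate_limit: float
    chembl_rate_limit: float
    ncbi_email: Optional[str]
    ncbi_api_key: Optional[str]
    chembl_search_mode: str


def load_settings(env: Mapping[str, str] = os.environ) -> IngestSettings:
    """Read ingestion settings from the environment.

    Fallbacks mirror the example payloads in this guide; they are
    assumptions, not confirmed backend defaults.
    """
    return IngestSettings(
        batch_size=int(env.get("INGEST_BATCH_SIZE", "50")),
        pubmed_rate_limit=float(env.get("PUBMED_RATE_LIMIT", "0.4")),
        uniprot_rate_limit=float(env.get("UNIPROT_RATE_LIMIT", "0.2")),
        chembl_rate_limit=float(env.get("CHEMBL_RATE_LIMIT", "0.3")),
        ncbi_email=env.get("NCBI_EMAIL"),
        ncbi_api_key=env.get("NCBI_API_KEY"),
        chembl_search_mode=env.get("CHEMBL_SEARCH_MODE", "target"),
    )
```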

5) Recommended Minimums

  • PubMed: 100 records
  • UniProt: 50 records
  • ChEMBL: 30 records
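
These minimums could be applied when building a payload for POST /api/ingest/all, as in the sketch below. Raising a lower limit to the minimum is a design choice of this example; the guide does not say the backend enforces it. `build_all_payload` is a hypothetical helper.

```python
from typing import Dict, Optional

# Recommended minimum record counts from this guide.
RECOMMENDED_MINIMUMS = {"pubmed": 100, "uniprot": 50, "chembl": 30}


def build_all_payload(
    query: str,
    limits: Optional[Dict[str, int]] = None,
    collection: str = "bioflow_memory",
) -> dict:
    """Build an /api/ingest/all payload with per-source limits.

    Any requested limit below the recommended minimum is raised to it;
    sources without a requested limit get the minimum.
    """
    limits = limits or {}
    payload = {"query": query, "collection": collection, "sync": False}
    for source, minimum in RECOMMENDED_MINIMUMS.items():
        payload[f"{source}_limit"] = max(limits.get(source, minimum), minimum)
    return payload
```

For example, `build_all_payload("EGFR lung cancer", {"pubmed": 10, "chembl": 200})` keeps the ChEMBL limit at 200 and raises the PubMed limit to 100.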