# 🗺️ BioFlow Orchestrator Development Roadmap

This roadmap outlines the collaborative development of a unified R&D platform for biological discovery using **fully open-source** tools and models.

---

## 🏗️ Phase 1: Infrastructure & Core Framework ✅ COMPLETE
**Goal:** Establish the "modality-agnostic" foundation so tools can be plugged in without rewriting core logic.

- [x] **Core Abstractions** (`bioflow/core/base.py`):
  - `BioEncoder`: Interface for vectorization (ESM-2, ChemBERTa, PubMedBERT, CLIP)
  - `BioPredictor`: Interface for predictions (DeepPurpose, ADMET)
  - `BioGenerator`: Interface for candidate generation
  - `BioRetriever`: Interface for vector DB operations
  - Data containers: `EmbeddingResult`, `PredictionResult`, `RetrievalResult`
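
To illustrate the interface-based design above, here is a minimal sketch of what an encoder abstraction and its result container might look like. The `encode` method name, the fields of `EmbeddingResult`, and the toy `HashEncoder` are assumptions made for illustration; the actual definitions live in `bioflow/core/base.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EmbeddingResult:
    """Container for one embedding (field names are assumed)."""
    vector: List[float]
    modality: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BioEncoder(ABC):
    """Interface that every encoder plugin would implement."""

    @abstractmethod
    def encode(self, content: str, modality: str) -> EmbeddingResult:
        ...


class HashEncoder(BioEncoder):
    """Hypothetical toy encoder: counts characters into a fixed-size vector."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def encode(self, content: str, modality: str) -> EmbeddingResult:
        vector = [0.0] * self.dim
        for ch in content:
            vector[ord(ch) % self.dim] += 1.0
        return EmbeddingResult(vector=vector, modality=modality)


result = HashEncoder(dim=4).encode("CCO", modality="molecule")
print(result.modality, len(result.vector))
```

Because callers depend only on `BioEncoder`, a real model wrapper can replace the toy encoder without changes to the code that consumes `EmbeddingResult`.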

- [x] **Tool Registry** (`bioflow/core/registry.py`):
  - Central hub to manage multiple tools
  - Register/unregister by name
  - Default tool fallbacks
  - Utility methods for listing and summary
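
The registry behaviour listed above could be sketched roughly as follows. The class name `ToolRegistry`, its method names, and the rule that the first registered tool becomes the default are assumptions, not the actual API in `bioflow/core/registry.py`.

```python
from typing import Any, Dict, List, Optional


class ToolRegistry:
    """Hypothetical name-based registry with a default-tool fallback."""

    def __init__(self) -> None:
        self._tools: Dict[str, Any] = {}
        self._default: Optional[str] = None

    def register(self, name: str, tool: Any, default: bool = False) -> None:
        self._tools[name] = tool
        # The first tool registered becomes the default unless overridden.
        if default or self._default is None:
            self._default = name

    def unregister(self, name: str) -> None:
        self._tools.pop(name, None)
        if self._default == name:
            # Promote any remaining tool, or clear the default.
            self._default = next(iter(self._tools), None)

    def get(self, name: Optional[str] = None) -> Any:
        # Unknown or missing names fall back to the default tool.
        key = name if name in self._tools else self._default
        if key is None:
            raise KeyError("no tools registered")
        return self._tools[key]

    def list_tools(self) -> List[str]:
        return sorted(self._tools)


registry = ToolRegistry()
registry.register("esm2", "protein-encoder-stub")
registry.register("chemberta", "molecule-encoder-stub")
print(registry.list_tools())
print(registry.get("not-registered"))  # falls back to the default tool
```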

- [x] **Configuration Schema** (`bioflow/core/config.py`):
  - `NodeConfig`: Single pipeline node definition
  - `WorkflowConfig`: Complete workflow definition
  - `BioFlowConfig`: Master system configuration
  - YAML-compatible dataclasses
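
As a rough sketch of what "YAML-compatible dataclasses" can mean in practice, the example below builds a workflow from a plain dictionary (the shape a YAML loader would return) and converts it back with `dataclasses.asdict`. The field names (`id`, `type`, `inputs`, `params`) and the `from_dict` helper are assumptions for illustration only.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class NodeConfig:
    """One pipeline node (field names are assumed)."""
    id: str
    type: str
    inputs: List[str] = field(default_factory=list)
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class WorkflowConfig:
    """A named list of nodes."""
    name: str
    nodes: List[NodeConfig] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "WorkflowConfig":
        nodes = [NodeConfig(**node) for node in data.get("nodes", [])]
        return cls(name=data["name"], nodes=nodes)


# The kind of dictionary a YAML loader would produce for a workflow file.
raw = {
    "name": "drug_discovery",
    "nodes": [
        {"id": "encode", "type": "encode", "params": {"model": "esm2"}},
        {"id": "retrieve", "type": "retrieve", "inputs": ["encode"]},
    ],
}

workflow = WorkflowConfig.from_dict(raw)
print(asdict(workflow))
```

Since `asdict` returns only dictionaries, lists, and scalars, its output can be handed to a YAML serializer such as PyYAML's `yaml.safe_dump` without custom representers.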

- [x] **Stateful Pipeline Engine** (`bioflow/core/orchestrator.py`):
  - `BioFlowOrchestrator`: DAG-based workflow execution
  - Topological sort for dependency resolution
  - `ExecutionContext` for state passing
  - Custom handler support
  - Error handling and traceability
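
The dependency-resolution idea can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9+). This is not the `BioFlowOrchestrator` implementation: the `run_workflow` function, the handler signature, and the plain dictionary used as execution context are assumptions chosen to keep the example small.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set


def run_workflow(
    dependencies: Dict[str, Set[str]],
    handlers: Dict[str, Callable[[Dict[str, Any]], Any]],
    initial: Dict[str, Any],
) -> Dict[str, Any]:
    """Run each node after all of its predecessors, sharing one context dict."""
    context = dict(initial)
    # static_order() yields every node only after the nodes it depends on.
    for node_id in TopologicalSorter(dependencies).static_order():
        context[node_id] = handlers[node_id](context)
    return context


# node -> set of nodes it depends on
dependencies = {
    "encode": set(),
    "retrieve": {"encode"},
    "predict": {"retrieve"},
    "filter": {"predict"},
}

# Toy handlers standing in for real tools.
handlers = {
    "encode": lambda ctx: len(ctx["query"]),
    "retrieve": lambda ctx: [ctx["encode"], ctx["encode"] + 1],
    "predict": lambda ctx: [x * 2 for x in ctx["retrieve"]],
    "filter": lambda ctx: [x for x in ctx["predict"] if x > 6],
}

context = run_workflow(dependencies, handlers, {"query": "CCO"})
print(context["filter"])
```

`TopologicalSorter` raises `graphlib.CycleError` when the dependencies contain a cycle, which gives a natural place to report an invalid workflow before any node runs.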

- [x] **Sample Workflows** (`bioflow/workflows/`):
  - `drug_discovery.yaml`: Encode → Retrieve → Predict → Filter
  - `literature_mining.yaml`: Cross-modal literature search

---

## 🧪 Phase 2: Parallel Tool Implementation ✅ COMPLETE
**Goal:** Team members build their respective modules in parallel against the core interfaces.

### **1. OBM Integration** ✅
- [x] `OBMEncoder` - Unified multimodal encoder (`bioflow/plugins/obm_encoder.py`)
- [x] `TextEncoder` - PubMedBERT/SciBERT (`bioflow/plugins/encoders/text_encoder.py`)
- [x] `MoleculeEncoder` - ChemBERTa/RDKit (`bioflow/plugins/encoders/molecule_encoder.py`)
- [x] `ProteinEncoder` - ESM-2/ProtBERT (`bioflow/plugins/encoders/protein_encoder.py`)
- [x] Lazy loading for efficient memory usage
- [x] Dimension projection for cross-modal compatibility

### **2. Qdrant Retriever** ✅
- [x] `QdrantRetriever` implements `BioRetriever` interface (`bioflow/plugins/qdrant_retriever.py`)
- [x] HNSW indexing with cosine/euclidean/dot distance
- [x] Payload filtering (species, experiment type, modality)
- [x] Batch ingestion support
- [x] In-memory, local, or remote Qdrant connections
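
To make the three distance options and payload filtering concrete, here is a brute-force stand-in written in plain Python. It deliberately does not use the Qdrant client or HNSW; the `score` and `search` functions and the point layout (`id`, `vector`, `payload`) are assumptions for illustration.

```python
import math
from typing import Any, Dict, List, Optional

Point = Dict[str, Any]


def score(a: List[float], b: List[float], metric: str) -> float:
    """Similarity score where a higher value always means a closer match."""
    dot = sum(x * y for x, y in zip(a, b))
    if metric == "dot":
        return dot
    if metric == "cosine":
        return dot / (math.hypot(*a) * math.hypot(*b))
    if metric == "euclidean":
        return -math.dist(a, b)  # negated so larger is better
    raise ValueError(f"unknown metric: {metric}")


def search(
    points: List[Point],
    query: List[float],
    metric: str = "cosine",
    must: Optional[Dict[str, Any]] = None,
    limit: int = 3,
) -> List[int]:
    """Exhaustive search restricted to points whose payload matches `must`."""
    must = must or {}
    hits = [
        p for p in points
        if all(p["payload"].get(k) == v for k, v in must.items())
    ]
    hits.sort(key=lambda p: score(query, p["vector"], metric), reverse=True)
    return [p["id"] for p in hits[:limit]]


points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"species": "human"}},
    {"id": 2, "vector": [0.0, 1.0], "payload": {"species": "human"}},
    {"id": 3, "vector": [1.0, 0.1], "payload": {"species": "mouse"}},
]

print(search(points, [1.0, 0.0]))
print(search(points, [1.0, 0.0], must={"species": "human"}))
```

The cosine branch assumes non-zero vectors. An HNSW index returns approximately the same ranking without scanning every point.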

### **3. DeepPurpose Predictor** ✅
- [x] `DeepPurposePredictor` implements `BioPredictor` (`bioflow/plugins/deeppurpose_predictor.py`)
- [x] DTI prediction with Transformer+CNN architecture
- [x] Graceful fallback when DeepPurpose unavailable
- [x] Batch prediction support
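
One common way to fall back gracefully when an optional dependency is missing is to probe for it with `importlib.util.find_spec` instead of importing it unconditionally. The sketch below shows that pattern only; the `FallbackAwarePredictor` class, its return shape, and the choice to return `None` as the score are assumptions, and the real model call is intentionally left out.

```python
import importlib.util
from typing import Any, Dict, Optional


def detect_backend() -> str:
    """Return 'deeppurpose' if the package is importable, else 'unavailable'."""
    found = importlib.util.find_spec("DeepPurpose") is not None
    return "deeppurpose" if found else "unavailable"


class FallbackAwarePredictor:
    """Hypothetical predictor that degrades instead of crashing."""

    def __init__(self, backend: Optional[str] = None) -> None:
        self.backend = backend or detect_backend()

    def predict(self, smiles: str, sequence: str) -> Dict[str, Any]:
        if self.backend == "unavailable":
            # No model available: report that clearly instead of guessing a score.
            return {"score": None, "backend": "unavailable"}
        # Assumption: the real DeepPurpose call would be wired in here.
        raise NotImplementedError("connect the DeepPurpose model here")


predictor = FallbackAwarePredictor(backend="unavailable")
print(predictor.predict("CCO", "MKTAYIAKQR"))
```

Returning an explicit "unavailable" marker lets downstream nodes skip or flag the candidate rather than treating a placeholder number as a real prediction.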

---

## 🔗 Phase 3: The Unified Workflow ✅ COMPLETE
**Goal:** Connect the tools into a coherent discovery loop.

- [x] **Typed Node System** (`bioflow/core/nodes.py`):
  - `EncodeNode`: Vectorize inputs via BioEncoder
  - `RetrieveNode`: Query vector DB for similar items
  - `PredictNode`: Run DTI predictions on candidates
  - `IngestNode`: Add new data to vector DB
  - `FilterNode`: Score-based filtering and ranking
  - `TraceabilityNode`: Link results to evidence sources
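
As an example of how small a typed node can be, here is a sketch of score-based filtering and ranking. The dataclass fields (`threshold`, `top_k`, `score_key`) and the `run` method are assumptions; the actual `FilterNode` in `bioflow/core/nodes.py` may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FilterNode:
    """Hypothetical node: keep candidates above a threshold, best first."""
    threshold: float = 0.5
    top_k: int = 10
    score_key: str = "score"

    def run(self, candidates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        kept = [c for c in candidates if c[self.score_key] >= self.threshold]
        kept.sort(key=lambda c: c[self.score_key], reverse=True)
        return kept[: self.top_k]


candidates = [
    {"id": "a", "score": 0.7},
    {"id": "b", "score": 0.2},
    {"id": "c", "score": 0.9},
]

ranked = FilterNode(threshold=0.5, top_k=2).run(candidates)
print([c["id"] for c in ranked])
```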

- [x] **Discovery Pipelines** (`bioflow/workflows/discovery.py`):
  - `DiscoveryPipeline`: Full drug discovery workflow (encode → retrieve → predict → filter → trace)
  - `LiteratureMiningPipeline`: Cross-modal literature search
  - `ProteinDesignPipeline`: Protein homolog discovery
  - Batch ingestion and simple search APIs

- [x] **Data Ingestion Utilities** (`bioflow/workflows/ingestion.py`):
  - JSON/CSV file loaders
  - SMILES/FASTA file parsers
  - Sample data generators for testing
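
For reference, minimal FASTA and SMILES parsers can look like the sketch below. The function names `parse_fasta` and `parse_smiles` are hypothetical and are not taken from `bioflow/workflows/ingestion.py`; the SMILES parser assumes the common `.smi` layout of one structure per line with an optional name after whitespace.

```python
from typing import Dict, Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first token after '>') to its joined sequence."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else f"record_{len(records) + 1}"
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def parse_smiles(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (smiles, name) pairs; the name is empty when absent."""
    pairs: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(maxsplit=1)
        pairs.append((parts[0], parts[1] if len(parts) > 1 else ""))
    return pairs


fasta_text = ">P1 demo protein\nMKTAY\nIAKQR\n>P2\nGSHM\n"
smiles_text = "# comment\nCCO ethanol\nc1ccccc1\n"

print(parse_fasta(fasta_text.splitlines()))
print(parse_smiles(smiles_text.splitlines()))
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.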

- [x] **Evidence Traceability**:
  - Automatic PubMed/UniProt/PubChem/DrugBank link generation
  - Metadata preservation through pipeline
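
Link generation of this kind typically reduces to formatting known identifiers into each database's public URL pattern. The sketch below assumes metadata keys named `pmid`, `uniprot_id`, `pubchem_cid`, and `drugbank_id`; those key names and the `evidence_links` function are illustrative, not the pipeline's actual schema.

```python
from typing import Any, Dict

# Public entry-page URL patterns; the metadata key names are assumptions.
URL_TEMPLATES = {
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/{}/",
    "uniprot_id": "https://www.uniprot.org/uniprotkb/{}",
    "pubchem_cid": "https://pubchem.ncbi.nlm.nih.gov/compound/{}",
    "drugbank_id": "https://go.drugbank.com/drugs/{}",
}


def evidence_links(metadata: Dict[str, Any]) -> Dict[str, str]:
    """Build a link for every known identifier present in the metadata."""
    return {
        key: template.format(metadata[key])
        for key, template in URL_TEMPLATES.items()
        if metadata.get(key)
    }


hit = {"name": "example compound", "pubchem_cid": 702, "pmid": "12345"}
for key, url in evidence_links(hit).items():
    print(key, url)
```

Fields without a matching template, such as `name` here, are ignored rather than modified, which keeps the original metadata intact as it moves through the pipeline.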

**Verification:** `python scripts/verify_phase3.py` - All 5 tests pass ✅

---

## 📊 Phase 4: UI/UX & Deployment ✅ COMPLETE
**Goal:** Build an intuitive, modern interface for the BioFlow platform.

- [x] **Next.js Frontend** (`ui/`):
  - Next.js 16 App Router + Tailwind CSS + shadcn/ui
  - Dashboard pages: Discovery, 3D Visualization, Workflow Builder
  - `/app/api/*` proxy routes to the FastAPI backend
  - Optional mock fallbacks for molecules/proteins list routes

**Launch:**
- Full stack (Windows): `launch_bioflow_full.bat`
- Manual:
  - Backend: `python -m uvicorn bioflow.api.server:app --host 0.0.0.0 --port 8000`
  - UI: `cd ui && pnpm dev`

---

## 🚀 Phase 5: Open-Source Alignment
- **Strict Open-Source Compliance**: remove proprietary integrations and keep only OSS models/tools.
- **Open Protein/Peptide Options**: integrate open models (e.g., ESM-2 / ProGen2) behind `BioGenerator`.
- **Open Retrieval + Evidence**: improve evidence traceability (PubMed/UniProt/ChEMBL) and evaluation.