Commit
adecc9b
·
1 Parent(s): 0f68a94

Phase 1: FastAPI integration with DeepPurpose DTI predictor

Browse files

- Added bioflow/api/ with FastAPI server (port 8000)
- Integrated DeepPurposePredictor from deeppurpose002.py logic
- Updated Next.js API routes to call FastAPI backend
- Created launch_bioflow_full.bat for dual-server startup
- Merged OpenBioMed core + lacoste001 Next.js UI

This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .gitattributes +0 -6
  2. .gitignore +159 -0
  3. .gitmodules +3 -0
  4. LICENSE +21 -0
  5. README-CN.md +265 -0
  6. README.md +270 -15
  7. USE_POLICY.md +19 -0
  8. bioflow/__init__.py +61 -0
  9. bioflow/api/__init__.py +7 -0
  10. bioflow/api/dti_predictor.py +346 -0
  11. bioflow/api/requirements.txt +29 -0
  12. bioflow/api/server.py +359 -0
  13. bioflow/app.py +570 -0
  14. bioflow/core/__init__.py +87 -0
  15. bioflow/core/base.py +247 -0
  16. bioflow/core/config.py +92 -0
  17. bioflow/core/nodes.py +465 -0
  18. bioflow/core/orchestrator.py +303 -0
  19. bioflow/core/registry.py +154 -0
  20. bioflow/demo.py +261 -0
  21. bioflow/obm_wrapper.py +355 -0
  22. bioflow/pipeline.py +370 -0
  23. bioflow/plugins/__init__.py +58 -0
  24. bioflow/plugins/deeppurpose_predictor.py +220 -0
  25. bioflow/plugins/encoders/__init__.py +17 -0
  26. bioflow/plugins/encoders/molecule_encoder.py +226 -0
  27. bioflow/plugins/encoders/protein_encoder.py +188 -0
  28. bioflow/plugins/encoders/text_encoder.py +177 -0
  29. bioflow/plugins/obm_encoder.py +294 -0
  30. bioflow/plugins/obm_plugin.py +40 -0
  31. bioflow/plugins/qdrant_retriever.py +312 -0
  32. bioflow/qdrant_manager.py +365 -0
  33. bioflow/ui/__init__.py +15 -0
  34. bioflow/ui/app.py +61 -0
  35. bioflow/ui/components.py +481 -0
  36. bioflow/ui/config.py +583 -0
  37. bioflow/ui/pages/__init__.py +5 -0
  38. bioflow/ui/pages/data.py +163 -0
  39. bioflow/ui/pages/discovery.py +165 -0
  40. bioflow/ui/pages/explorer.py +127 -0
  41. bioflow/ui/pages/home.py +213 -0
  42. bioflow/ui/pages/settings.py +192 -0
  43. bioflow/ui/requirements.txt +31 -0
  44. bioflow/visualizer.py +386 -0
  45. bioflow/workflows/__init__.py +49 -0
  46. bioflow/workflows/discovery.py +400 -0
  47. bioflow/workflows/drug_discovery.yaml +54 -0
  48. bioflow/workflows/ingestion.py +276 -0
  49. bioflow/workflows/literature_mining.yaml +41 -0
  50. checkpoints/.placeholder +0 -0
.gitattributes CHANGED
@@ -1,9 +1,3 @@
1
- *.pth filter=lfs diff=lfs merge=lfs -text
2
- *.ckpt filter=lfs diff=lfs merge=lfs -text
3
- *.pkl filter=lfs diff=lfs merge=lfs -text
4
- *.npy filter=lfs diff=lfs merge=lfs -text
5
- *.pt filter=lfs diff=lfs merge=lfs -text
6
- *.tfevents.* filter=lfs diff=lfs merge=lfs -text
7
  *.tab filter=lfs diff=lfs merge=lfs -text
8
  *.stl filter=lfs diff=lfs merge=lfs -text
9
  *.zip filter=lfs diff=lfs merge=lfs -text
 
 
 
 
 
 
 
1
  *.tab filter=lfs diff=lfs merge=lfs -text
2
  *.stl filter=lfs diff=lfs merge=lfs -text
3
  *.zip filter=lfs diff=lfs merge=lfs -text
.gitignore CHANGED
@@ -0,0 +1,159 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ share/python-wheels/
24
+ *.egg-info/
25
+ .installed.cfg
26
+ *.egg
27
+ MANIFEST
28
+
29
+ # PyInstaller
30
+ # Usually these files are written by a python script from a template
31
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
32
+ *.manifest
33
+ *.spec
34
+
35
+ # Installer logs
36
+ pip-log.txt
37
+ pip-delete-this-directory.txt
38
+
39
+ # Unit test / coverage reports
40
+ htmlcov/
41
+ .tox/
42
+ .nox/
43
+ .coverage
44
+ .coverage.*
45
+ .cache
46
+ nosetests.xml
47
+ coverage.xml
48
+ *.cover
49
+ *.py,cover
50
+ .hypothesis/
51
+ .pytest_cache/
52
+ cover/
53
+
54
+ # Translations
55
+ *.mo
56
+ *.pot
57
+
58
+ # Django stuff:
59
+ *.log
60
+ local_settings.py
61
+ db.sqlite3
62
+ db.sqlite3-journal
63
+
64
+ # Flask stuff:
65
+ instance/
66
+ .webassets-cache
67
+
68
+ # Scrapy stuff:
69
+ .scrapy
70
+
71
+ # Sphinx documentation
72
+ docs/_build/
73
+
74
+ # PyBuilder
75
+ .pybuilder/
76
+ target/
77
+
78
+ # Jupyter Notebook
79
+ .ipynb_checkpoints
80
+
81
+ # IPython
82
+ profile_default/
83
+ ipython_config.py
84
+
85
+ # pyenv
86
+ # For a library or package, you might want to ignore these files since the code is
87
+ # intended to run in multiple environments; otherwise, check them in:
88
+ # .python-version
89
+
90
+ # pipenv
91
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
93
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
94
+ # install all needed dependencies.
95
+ #Pipfile.lock
96
+
97
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
98
+ __pypackages__/
99
+
100
+ # Celery stuff
101
+ celerybeat-schedule
102
+ celerybeat.pid
103
+
104
+ # SageMath parsed files
105
+ *.sage.py
106
+
107
+ # Environments
108
+ .env
109
+ .venv
110
+ env/
111
+ venv/
112
+ ENV/
113
+ env.bak/
114
+ venv.bak/
115
+
116
+ # Spyder project settings
117
+ .spyderproject
118
+ .spyproject
119
+
120
+ # Rope project settings
121
+ .ropeproject
122
+
123
+ # mkdocs documentation
124
+ /site
125
+
126
+ # mypy
127
+ .mypy_cache/
128
+ .dmypy.json
129
+ dmypy.json
130
+
131
+ # Pyre type checker
132
+ .pyre/
133
+
134
+ # pytype static type analyzer
135
+ .pytype/
136
+
137
+ # Cython debug symbols
138
+ cython_debug/
139
+
140
+ /logs/*
141
+ /.vscode
142
+ /assets/*
143
+ /checkpoints/**/*
144
+ /datasets/**/*
145
+ /misc/*
146
+ /tmp/*
147
+ /third_party/*
148
+ !/third_party/.placeholder
149
+ !/third_party/p2rank_2.5
150
+ !/checkpoints/.placeholder
151
+ !/datasets/**/.placeholder
152
+ !/third_party/.placeholder
153
+
154
+ #files
155
+ *.csv
156
+ *.pt
157
+ *.txt
158
+ !requirements.txt
159
+ *.pth
.gitmodules ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ [submodule "third_party/p2rank_2.5"]
2
+ path = third_party/p2rank_2.5
3
+ url = https://github.com/rdk/p2rank.git
LICENSE ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Pharmolix
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README-CN.md ADDED
@@ -0,0 +1,265 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <div align="center"><h1>OpenBioMed</h1></div>
2
+ <h4 align="center">
3
+ <p>
4
+ <b>中文</b> |
5
+ <a href="./README.md">English</a>
6
+ <p>
7
+ </h4>
8
+
9
+ [![GitHub Repo stars](https://img.shields.io/github/stars/PharMolix/OpenBioMed?style=social)](https://github.com/PharMolix/OpenBioMed/stargazers)
10
+ [![GitHub last commit](https://img.shields.io/github/last-commit/PharMolix/OpenBioMed)](https://github.com/PharMolix/OpenBioMed/commits/main)
11
+ [![GitHub contributors](https://img.shields.io/github/contributors/PharMolix/OpenBioMed?color=orange)](https://github.com/PharMolix/OpenBioMed/graphs/contributors)
12
+ [![GitHub pull request](https://img.shields.io/badge/PRs-welcome-blue)](https://github.com/PharMolix/OpenBioMed/pulls)
13
+ [![Spaces](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue)](https://huggingface.co/PharMolix)
14
+ [![Docker Pulls](https://img.shields.io/docker/pulls/youngking0727/openbiomed_server)](https://hub.docker.com/repository/docker/youngking0727/openbiomed_server)
15
+
16
+ ![platform](images/platform.png)
17
+
18
+ 欢迎用户在[该网站](http://openbiomed.pharmolix.com)使用我们的生物医药智能体开发平台!
19
+
20
+ ## 更新信息 🎉
21
+
22
+ - [2025/05/26] 🔥 我们的框架进行了功能更新,包括新的工具、数据集和模型。我们实现了**LangCell** (📃[论文](https://arxiv.org/abs/2405.06708), 🤖[模型](https://drive.google.com/drive/folders/1cuhVG9v0YoAnjW-t_WMpQQguajumCBTp?usp=sharing), 📎[引用](#to-cite-langcell)) 和细胞数据处理接口(见[示例](./examples/cell_annotation.ipynb))。我们还推出了ADMET、QED、SA、LogP、Lipinski、相似性等分子性质预测工具。
23
+
24
+ - [2025/03/07] 🔥 发布**OpenBioMed生物医药智能体开发平台**,可通过[该链接](http://openbiomed.pharmolix.com)访问。该平台能帮助研发人员零门槛使用AI模型定制化自己的科学研究助手(**AutoPilot**)。平台的[使用文档](https://www.zybuluo.com/icycookies/note/2587490)已经同步发布。
25
+
26
+ - [2025/03/07] 🔥 发布**OpenBioMed v2**. 我们在这次更新中适配了更多的生物医药下游任务,开放了更加易用的数据接口,并集成了更前沿的AI模型。同时,我们发布了试用版**PharmolixFM**模型(📃[技术报告](https://arxiv.org/abs/2503.21788), 🤖[模型](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/), 📎[引用](#to-cite-pharmolixfm)),并完成了BioMedGPT-R1模型的推理支持。我们预计于本月内开放BioMedGPT-R1的微调代码。
27
+
28
+ > PharmolixFM是由水木分子与清华大学智能产业研究院联合研发的全原子基础大模型。该模型使用最先进的非自回归式多模态生成模型,在原子尺度上实现了对分子、抗体和蛋白质的联合建模。PharmolixFM能够适配多种下游任务,如分子对接、基于口袋的分子设计、抗体设计、分子构象生成等。在给定口袋位置的分子对接任务中,PharMolixFM的预测精度可与AlphaFold3媲美 (83.9 vs 90.2, RMSD < 2Å) 。
29
+
30
+ - [2025/02/20] 发布**BioMedGPT-R1** (🤗[Huggingface模型](https://huggingface.co/PharMolix/BioMedGPT-R1)).
31
+
32
+ > BioMedGPT-R1-17B是由水木分子与清华大学智能产业研究院(AIR)联合发布的生物医药多模态推理模型。其在上一版本的基础上,用DeepSeek-R1-Distill-Qwen-14B更新了原采用的文本基座模型,并通过跨模态对齐和多模态推理SFT实现模型微调,在生物医药问答任务上效果逼近闭源商用大模型和人类专家水平。
33
+
34
+ - [2024/05/16] 发布 **LangCell** (📃[论文](https://arxiv.org/abs/2405.06708), 💻[代码](https://github.com/PharMolix/LangCell), 🤖[模型](https://drive.google.com/drive/folders/1cuhVG9v0YoAnjW-t_WMpQQguajumCBTp?usp=sharing), 📎[引用](#to-cite-langcell)).
35
+
36
+ > LangCell是由水木分子与清华大学智能产业研究院联合研发的首个“自然语言-单细胞”多模态预训练模型。该模型通过学习富含细胞身份信息的知识性文本,有效提升了对单细胞转录组学的理解能力,并解决了数据匮乏场景下的细胞身份理解任务。LangCell是唯一能有效进行零样本细胞身份理解的单细胞模型,并且在少样本和微调场景下也取得SOTA。LangCell将很快被集成到OpenBioMed。
37
+
38
+ - [2023/08/14] 发布 **BioMedGPT-LM-7B** (🤗[HuggingFace模型](https://huggingface.co/PharMolix/BioMedGPT-LM-7B)) 、 **BioMedGPT-10B** (📃[论文](https://arxiv.org/abs/2308.09442v2), 🤖[模型](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg?pwd=7a6b#list/path=%2F), 📎[引用](#to-cite-biomedgpt)) 和 **DrugFM** (🤖[模型](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg?pwd=7a6b#list/path=%2F)).
39
+
40
+ > BioMedGPT-10B是由水木分子联合清华大学智能产业研究院联合发布的首个可商用的多模态生物医药大模型。该模型将以分子结构和蛋白质序列为代表的生命语言与人类的自然语言相结合,在生物医药专业问答能力比肩人类专家水平,在分子和蛋白质跨模态问答中表现出强大的性能。BioMedGPT-LM-7B是首个可商用、生物医药专用的Llama2大模型。
41
+
42
+ > DrugFM是由"清华AIR-智源联合研究中心"联合研发的多模态小分子基础模型。 该模型针对小分子药物的组织规律和表示学习进行了更细粒度的设计,形成了小分子药物预训练模型UniMap,并与多模态小分子基础模型MolFM有机结合。该模型在跨模态抽取任务中取得SOTA。
43
+
44
+ - [2023/06/12] 发布 **MolFM** (📃[论文](https://arxiv.org/abs/2307.09484), 🤖[模型](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg?pwd=7a6b#list/path=%2F), 📎[引用](#to-cite-molfm)) 和 **CellLM** (📃[论文](https://arxiv.org/abs/2306.04371), 🤖[模型](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg), 📎[引用](#to-cite-celllm)).
45
+
46
+ > MolFM是一个支持统一表示分子结构、生物医学文本和知识图谱的多模态小分子基础模型。在零样本和微调场景下,MolFM的跨模态检索能力分别比现有模型提升了12.03%和5.04%。在分子描述生成、基于文本的分子生成和分子性质预测中,MolFM也取得了显著的结果。
47
+
48
+ > CellLM是首个使用分支对比学习策略在正常细胞和癌症细胞数据上同时训练的大规模细胞表示学习模型。CellLM在细胞类型注释(71.8 vs 68.8)、少样本场景下的单细胞药物敏感性预测(88.9 vs 80.6)和单组学细胞系药物敏感性预测上均优于ScBERT(93.4 vs 87.2)。
49
+
50
+ - [2023/04/23] 发布 **BioMedGPT-1.6B** (🤖[模型](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg)) 和 **OpenBioMed**.
51
+
52
+ ## 目录
53
+
54
+ - [介绍](#介绍)
55
+ - [环境搭建](#环境搭建)
56
+ - [使用指南](#使用指南)
57
+ - [先前版本](#先前版本)
58
+ - [局限性](#局限性)
59
+ - [引用](#引用)
60
+
61
+ ## 介绍
62
+
63
+ OpenBioMed是一个面向生命科学研究和药物研发的Python深度学习工具包。OpenBioMed为小分子结构、蛋白质结构、单细胞转录组学数据、知识图谱和生物医学文本等多模态数据提供了**灵活的数据处理接口**。OpenBioMed构建了**20余个计算工具**,涵盖了大部分AI药物发现任务和最新的针对分子、蛋白质的多模态理解生成任务。此外,OpenBioMed为研究者提供了一套**易用的工作流构建界面**,支持以拖拽形式对接多个模型,并构建基于大语言模型的智能体以解决复杂的科研问题。
64
+
65
+ OpenBioMed为研究者提供了:
66
+
67
+ - **4种不同数据的处理接口**, 包括分子结构、蛋白结构、口袋结构和自然语言文本。我们将在未来加入DNA、RNA、单细胞转录组学数据和知识图谱的处理接口。
68
+ - **20余个工具**, 包括分子性质预测、蛋白折叠为代表的AI预测工具、分子结构的可视化工具和互联网信息、数据库查询工具。
69
+ - **超过20个深度学习模型**, 包括[PharmolixFM](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/), [BioMedGPT-R1](https://huggingface.co/PharMolix/BioMedGPT-R1), [BioMedGPT](https://ieeexplore.ieee.org/document/10767279/) and [MutaPLM](https://arxiv.org/abs/2410.22949)等自研模型。
70
+
71
+ OpenBioMed的核心特色如下:
72
+
73
+ - **统一的数据处理框架**,能轻松加载不同模态的数据,并将其转换为统一的格式。
74
+ - **现成的模型预测模块**。我们整理并公开了各类模型的参数,并提供了使用案例,能够简便的迁移到其他数据或任务中。
75
+ - **易用的工作流与智能体构建方案**,以帮助研究者针对复杂的科研问题构建多工具协同工作流,通过反复执行工作流以模拟科学试验中的试错过程,并通过大语言模型归纳得到潜在的科学发现。
76
+
77
+ 下表显示了OpenBioMed中支持的工具,它们在未来会被进一步扩展。
78
+
79
+ | 工具名称 | 适配模型 | 简介 |
80
+ | :-----------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
81
+ | 分子性质预测 | [GraphMVP](https://arxiv.org/abs/2110.07728) | 针对给定分子预测其性质,如血脑屏障穿透性和药物副作用 |
82
+ | 分子问答 | [BioT5](https://arxiv.org/abs/2310.07276) | 针对给定分子和某个提问进行解答,如介绍分子结构、询问分子官能团、氢键供体的数量等 |
83
+ | 分子结构可视化 | 无 | 分子结构可视化 |
84
+ | 分子名称/ID检索 | 无 | 基于分子名称或ID,从PubChem数据库中检索分子 |
85
+ | 分子相似结构检索 | 无 | 从PubChem数据库中检索结构相似的分子 |
86
+ | 蛋白质问答 | [BioT5](https://arxiv.org/abs/2310.07276) | 针对给定蛋白和某个提问进行解答,如询问motif、蛋白功能、在细胞中的分布和相关疾病等 |
87
+ | 蛋白质折叠 | [ESMFold](https://www.science.org/doi/10.1126/science.ade2574) | 基于氨基酸序列预测蛋白质的三维结构 |
88
+ | 蛋白结合位点预测 | [P2Rank](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0285-8) | 预测蛋白质中潜在的(与小分子的)结合位点 |
89
+ | 突变效应阐释 | [MutaPLM](https://arxiv.org/abs/2410.22949) | 给定氨基酸序列上的一个单点突变,使用自然语言描述可能的突变效应 |
90
+ | 突变设计 | [MutaPLM](https://arxiv.org/abs/2410.22949) | 基于初始蛋白质序列和自然语言描述的优化目标,生成符合优化目标的突变后蛋白质 |
91
+ | 蛋白质ID检索 | 无 | 基于ID,从UniProtKB数据库中检索蛋白质序列 |
92
+ | 蛋白质结构检索 | 无 | 基于ID,从PDB和AlphaFoldDB数据库中检索蛋白质结构 |
93
+ | 蛋白质结构可视化 | N/A | 蛋白质结构可视化 |
94
+ | 蛋白质-分子刚性对接 | [PharmolixFM](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/) | 给定蛋白口袋结构和分子,生成对接后的分子构象 |
95
+ | 基于口袋的分子设计 | [PharmolixFM](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/), [MolCRAFT](https://github.com/AlgoMole/MolCRAFT) | 给定蛋白口袋结构,生成能与该口袋对接的分子及其构象 |
96
+ | 复合物可视化 | N/A | 可视化蛋白质-小分子结合后的复合物结构 |
97
+ | 口袋可视化 | N/A | 可视化蛋白质的口袋结构 |
98
+ | 互联网搜索 | N/A | 在互联网中检索信息 |
99
+
100
+
101
+ ## 环境搭建
102
+
103
+ 为支持OpenBioMed的基本功能,请执行如下操作:
104
+
105
+ ```bash
106
+ conda create -n OpenBioMed python=3.9
107
+ conda activate OpenBioMed
108
+ pip install torch==1.13.1+{your_cuda_version} torchvision==0.14.1+{your_cuda_version} torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/{your_cuda_version}
109
+ pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.1+{your_cuda_version}.html
110
+ pip install pytorch_lightning==2.0.8 peft==0.9.0 accelerate==1.3.0 --no-deps -i https://pypi.tuna.tsinghua.edu.cn/simple
111
+ pip install -r requirements.txt
112
+ ```
113
+
114
+ 推荐使用11.7版本的cuda驱动来构建环境。开发者尚未测试使用其他版本的cuda驱动是否会产生问题。
115
+
116
+ 为支持可视化工具与vina分数计算工具,请按如下操作下载依赖包:
117
+
118
+ ```
119
+ # 可视化依赖
120
+ conda install -c conda-forge pymol-open-source
121
+ pip install imageio
122
+
123
+ # AutoDockVina依赖
124
+ pip install meeko==0.1.dev3 pdb2pqr vina==1.2.2
125
+ python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3
126
+
127
+ # PoseBusters依赖
128
+ pip install posebusters==0.3.1
129
+
130
+ # 部分评估指标依赖
131
+ pip install spacy rouge_score nltk
132
+ python
133
+ >>> import nltk
134
+ >>> nltk.download('wordnet')
135
+ >>> nltk.download('omw-1.4')
136
+
137
+ # LangCell依赖
138
+ pip install geneformer
139
+ ```
140
+
141
+ 下载依赖后,您可以运行以下命令安装OpenBioMed包,从而更方便地使用我们的接口:
142
+
143
+ ```bash
144
+ pip install -e .
145
+ # 使用OpenBioMed的接口
146
+ python
147
+ >>> from open_biomed.data import Molecule
148
+ >>> molecule = Molecule(smiles="CC(=O)OC1=CC=CC=C1C(=O)O")
149
+ >>> print(molecule.calc_logp())
150
+ ```
151
+
152
+ ### 构建docker
153
+
154
+ 直接运行 `./scripts/docker_run.sh`,就可以构建docker镜像并运行容器,并在端口8082和8083运行后端服务。
155
+ ```
156
+ sh ./scripts/docker_run.sh
157
+ ```
158
+ 与此同时,我们也提供了build好的[docker镜像](https://hub.docker.com/repository/docker/youngking0727/openbiomed_server),可以直接拉取使用。
159
+
160
+ ## 使用指南
161
+
162
+ 请移步我们的 [使用案例与教程](./examples) 。
163
+
164
+ | 教程名称 | 简介 |
165
+ | ------------------------------------------------------------ | ------------------------------------------------------------ |
166
+ | [BioMedGPT推理](./examples/biomedgpt_r1.ipynb) | 使用BioMedGPT-10B回答分子与蛋白质相关问题和使用BioMedGPT-R1进行推理的示例。 |
167
+ | [分子与蛋白质数据处理](./examples/manipulate_molecules.ipynb) | 使用OpenBioMed中的接口加载、处理、导出分子与蛋白质数据的示例。 |
168
+ | [深度学习工具的使用](./examples/explore_ai4s_tools.ipynb) | 使用深度学习模型进行预测的示例。 |
169
+ | [可视化](./examples/visualization.ipynb) | 使用OpenBioMed中的接口对小分子、蛋白质、口袋和复合物进行可视化的示例。 |
170
+ | [工作流](./examples/workflow.ipynb) | 构建多工具协同工作流和大模型智能体的示例。 |
171
+ | [模型开发](./examples/model_customization.ipynb) | 在OpenBioMed框架中使用个人数据或模型结构开发新模型的教程。 |
172
+
173
+ ## 先前版本
174
+
175
+ 如果你想使用OpenBioMed先前版本的部分功能,请切换至该仓库的v1.0分支:
176
+
177
+ ```bash
178
+ git checkout v1.0
179
+ ```
180
+
181
+ ## 局限性
182
+
183
+ 本项目包含BioMedGPT-LM-7B,BioMedGPT-10B和BioMedGPT-R1,这些模型应当被负责任地使用。BioMedGPT不应用于向公众提供服务。我们严禁使用BioMedGPT生成任何违反适用法律法规的内容,如煽动颠覆国家政权、危害国家安全和利益、传播恐怖主义、极端主义、种族仇恨和歧视、暴力、色情或虚假有害信息等。BioMedGPT不对用户提供或发布的任何内容、数据或信息产生的任何后果负责。
184
+
185
+ ## 协议
186
+
187
+ 本项目代码依照[MIT](./LICENSE)协议开源。使用BioMedGPT-LM-7B、BioMedGPT-10B和BioMedGPT-R1模型,需要遵循[使用协议](./USE_POLICY.md)。
188
+
189
+ ## 联系方式
190
+
191
+ 我们期待您的反馈以帮助我们改进这一框架。若您在使用过程中有任何技术问题或建议,请随时在GitHub issue中提出。若您有商业合作的意向,请联系[opensource@pharmolix.com](mailto:opensource@pharmolix.com)。
192
+
193
+
194
+ ## 引用
195
+
196
+ 如果您认为我们的开源代码和模型对您的研究有帮助,请考虑给我们的项目点上星标🌟并引用📎以下文章。感谢您的支持!
197
+
198
+ ##### 引用OpenBioMed:
199
+
200
+ ```
201
+ @misc{OpenBioMed_code,
202
+ author={Luo, Yizhen and Yang, Kai and Fan, Siqi and Hong, Massimo and Zhao, Suyuan and Chen, Xinrui and Nie, Zikun and Luo, Wen and Xie, Ailin and Liu, Xing Yi and Zhang, Jiahuan and Wu, Yushuai and Nie, Zaiqing},
203
+ title={Code of OpenBioMed},
204
+ year={2023},
205
+ howpublished={\url{https://github.com/Pharmolix/OpenBioMed.git}}
206
+ }
207
+ ```
208
+
209
+ ##### 引用BioMedGPT:
210
+
211
+ ```
212
+ @article{luo2024biomedgpt,
213
+ title={Biomedgpt: An open multimodal large language model for biomedicine},
214
+ author={Luo, Yizhen and Zhang, Jiahuan and Fan, Siqi and Yang, Kai and Hong, Massimo and Wu, Yushuai and Qiao, Mu and Nie, Zaiqing},
215
+ journal={IEEE Journal of Biomedical and Health Informatics},
216
+ year={2024},
217
+ publisher={IEEE}
218
+ }
219
+ ```
220
+
221
+ ##### 引用PharMolixFM:
222
+
223
+ ```
+ @article{luo2025pharmolixfm,
224
+ title={PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation},
225
+ author={Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing},
226
+ journal={arXiv preprint arXiv:2503.21788},
227
+ year={2025}
228
+ }
+ ```
229
+
230
+ ##### 引用MolFM:
231
+ ```
232
+ @misc{luo2023molfm,
233
+ title={MolFM: A Multimodal Molecular Foundation Model},
234
+ author={Yizhen Luo and Kai Yang and Massimo Hong and Xing Yi Liu and Zaiqing Nie},
235
+ year={2023},
236
+ eprint={2307.09484},
237
+ archivePrefix={arXiv},
238
+ primaryClass={q-bio.BM}
239
+ }
240
+ ```
241
+
242
+ ##### 引用LangCell:
243
+ ```
244
+ @misc{zhao2024langcell,
245
+ title={LangCell: Language-Cell Pre-training for Cell Identity Understanding},
246
+ author={Suyuan Zhao and Jiahuan Zhang and Yizhen Luo and Yushuai Wu and Zaiqing Nie},
247
+ year={2024},
248
+ eprint={2405.06708},
249
+ archivePrefix={arXiv},
250
+ primaryClass={q-bio.GN}
251
+ }
252
+ ```
253
+
254
+ ##### 引用MutaPLM
255
+
256
+ ```
257
+ @article{luo2025mutaplm,
258
+ title={MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering},
259
+ author={Luo, Yizhen and Nie, Zikun and Hong, Massimo and Zhao, Suyuan and Zhou, Hao and Nie, Zaiqing},
260
+ journal={Advances in Neural Information Processing Systems},
261
+ volume={37},
262
+ pages={79783--79818},
263
+ year={2025}
264
+ }
265
+ ```
README.md CHANGED
@@ -1,27 +1,282 @@
1
- DeepPurpose002 — Training & Prediction (DTI)
 
 
 
 
 
 
2
 
3
- Ce repo contient un pipeline DeepPurpose pour :
 
 
 
 
 
4
 
5
- entraîner un modèle Drug–Target Interaction (DTI) à partir de paires (SMILES, séquence protéique, label),
6
 
7
- évaluer le modèle (métriques + logs),
8
 
9
- prédire des interactions/affinités sur de nouvelles paires et exporter les résultats.
10
 
11
- Contenu
12
 
13
- deeppurpose002.py : chargement données preprocessing/encodage entraînement évaluation sauvegarde modèle + outputs
 
14
 
15
- prediction_test.py (ou équivalent) : chargement du modèle sauvegardé prédictions export CSV
16
 
17
- Utilisation
18
- python deeppurpose002.py
19
- python prediction_test.py
20
 
21
- Format attendu
22
 
23
- Train (supervisé) : drug_smiles, target_sequence, label
24
 
25
- Predict : drug_smiles, target_sequence
26
 
27
- Outputs : modèles dans models/, résultats/logs dans outputs/.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <div align="center"><h1>OpenBioMed</h1></div>
2
+ <h4 align="center">
3
+ <p>
4
+ <b>English</b> |
5
+ <a href="./README-CN.md">中文</a>
6
+ <p>
7
+ </h4>
8
 
9
+ [![GitHub Repo stars](https://img.shields.io/github/stars/PharMolix/OpenBioMed?style=social)](https://github.com/PharMolix/OpenBioMed/stargazers)
10
+ [![GitHub last commit](https://img.shields.io/github/last-commit/PharMolix/OpenBioMed)](https://github.com/PharMolix/OpenBioMed/commits/main)
11
+ [![GitHub contributors](https://img.shields.io/github/contributors/PharMolix/OpenBioMed?color=orange)](https://github.com/PharMolix/OpenBioMed/graphs/contributors)
12
+ [![GitHub pull request](https://img.shields.io/badge/PRs-welcome-blue)](https://github.com/PharMolix/OpenBioMed/pulls)
13
+ [![Spaces](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue)](https://huggingface.co/PharMolix)
14
+ [![Docker Pulls](https://img.shields.io/docker/pulls/youngking0727/openbiomed_server)](https://hub.docker.com/repository/docker/youngking0727/openbiomed_server)
15
 
16
+ ![platform](images/platform.png)
17
 
18
+ Feel free to use our **Agent Platform for Biomedicine and Life Science** at this [website](http://openbiomed.pharmolix.com)!
19
 
20
+ ## News 🎉
21
 
22
+ - [2025/05/26] 🔥 Our framework has been updated with several new features including new tools, datasets, and models. We implement **LangCell** (📃[Paper](https://arxiv.org/abs/2405.06708), 🤖[Model](https://drive.google.com/drive/folders/1cuhVG9v0YoAnjW-t_WMpQQguajumCBTp?usp=sharing), 📎[Citation](#to-cite-langcell)) and APIs to manipulate cells (See [the Example](./examples/cell_annotation.ipynb)). We also introduce a wider range of tools to calculate molecular properties (ADMET, QED, SA, LogP, Lipinski, Similarity, etc.).
23
 
24
+ - [2025/03/07] 🔥 We present **OpenBioMed Agent Platform** at this [website](http://openbiomed.pharmolix.com) to customize workflows and LLM agents (**AutoPilots**) in solving complicated scientific research tasks. **Tutorials** for using this platform are also [available](https://www.zybuluo.com/icycookies/note/2587490).
25
+ - [2025/03/07] 🔥 Released **OpenBioMed v2**. We present new features including additional downstream biomedical tasks, more flexible data APIs, and advanced models. We also release a preview version of **PharmolixFM** (📃[Paper](https://arxiv.org/abs/2503.21788), 🤖[Model](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/), 📎[Citation](#to-cite-pharmolixfm)). BioMedGPT-R1 inference is currently supported, and fine-tuning will be available in this month!
26
 
27
+ > PharmolixFM is an all-atom molecular foundation model jointly released by PharMolix Inc. and Institute of AI Industry Research (AIR), Tsinghua University. It unifies molecules, antibodies, and proteins by jointly modeling them at atom-level with cutting-edge non-autoregressive multi-modal generative models. PharmolixFM is capable of solving multiple downstream tasks such as docking, structure-based drug design, peptide design, and molecular conformation generation. PharmolixFM achieves competitive performance with AlphaFold3 (83.9 vs 90.2, RMSD < 2Å) on protein-molecule docking (given pocket).
28
 
 
 
 
29
 
30
+ - [2025/02/20] BioMedGPT-R1 (🤗[Huggingface Model](https://huggingface.co/PharMolix/BioMedGPT-R1)) has been released.
31
 
32
+ > BioMedGPT-R1-17B is a multimodal biomedical reasoning model jointly released by PharMolix and Institute of AI Industry Research (AIR). It updates the language model in the last version with DeepSeek-R1-Distill-Qwen-14B and adopts two-stage training for cross-modal alignment and multimodal reasoning SFT, performing on par with commercial models on biomedical QA benchmarks.
33
 
34
+ - [2024/05/16] Released implementation of **LangCell** (📃[Paper](https://arxiv.org/abs/2405.06708), 💻[Code](https://github.com/PharMolix/LangCell), 🤖[Model](https://drive.google.com/drive/folders/1cuhVG9v0YoAnjW-t_WMpQQguajumCBTp?usp=sharing), 📎[Citation](#to-cite-langcell)).
35
 
36
+ > LangCell is the first "language-cell" multimodal pre-trained model jointly developed by PharMolix and Institute for AI Industry Research (AIR). It effectively enhances the understanding of single-cell transcriptomics by learning knowledge-rich texts containing cell identity information, and addresses the task of cell identity understanding in data-scarce scenarios. LangCell is the only single-cell model capable of effective zero-shot cell identity understanding and has also achieved SOTA in few-shot and fine-tuning scenarios. LangCell will soon be integrated into OpenBioMed.
37
+
38
+
39
+ - [2023/08/14] Released implementation of **BioMedGPT-10B** (📃[Paper](https://arxiv.org/abs/2308.09442v2), 🤖[Model](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg?pwd=7a6b#list/path=%2F), 📎[Citation](#to-cite-biomedgpt)), **BioMedGPT-LM-7B** (🤗[HuggingFace Model](https://huggingface.co/PharMolix/BioMedGPT-LM-7B)) and **DrugFM** (🤖[Model](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg?pwd=7a6b#list/path=%2F)).
40
+
41
+ > BioMedGPT-10B is the first commercial-friendly multimodal biomedical foundation model jointly released by PharMolix and Institute of AI Industry Research (AIR). It aligns the language of life (molecular structures and protein sequences) with human natural language, performing on par with human experts on biomedical QA benchmarks, and demonstrating powerful performance in cross-modal molecule and protein question answering tasks. BioMedGPT-LM-7B is the first commercial-friendly generative foundation model tailored for biomedicine based on Llama-2.
42
+
43
+ > DrugFM is a multi-modal molecular foundation model jointly developed by Institute of AI Industry Research (AIR) and Beijing Academy of Artificial Intelligence, BAAI. It leverages UniMAP, a pre-trained molecular model that captures fine-grained properties and representations of molecules, and incorporates MolFM, our multimodal molecular foundation model. DrugFM achieves SOTA on cross-modal retrieval.
44
+
45
+
46
+ - [2023/06/12] Released implementation of **MolFM** (📃[Paper](https://arxiv.org/abs/2307.09484), 🤖[Model](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg?pwd=7a6b#list/path=%2F), 📎[Citation](#to-cite-molfm)) and **CellLM** (📃[Paper](https://arxiv.org/abs/2306.04371), 🤖[Model](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg), 📎[Citation](#to-cite-celllm)).
47
+
48
+ > MolFM is a multi-modal molecular foundation model that enables joint comprehension of molecular structures, biomedical documents and knowledge graphs. On cross-modal retrieval, MolFM outperforms existing models by 12.03% and 5.04% under zero-shot and fine-tuning settings. MolFM also excels in molecule captioning, text-to-molecule generation and molecule property prediction.
49
+
50
+ > CellLM is the first large-scale cell representation learning model trained on both normal cells and cancer cells with divide-and-conquer contrastive learning. CellLM beats ScBERT on cell type annotation (71.8 vs 68.8), few-shot single-cell drug sensitivity prediction (88.9 vs 80.6) and single-omics cell line drug sensitivity prediction (93.4 vs 87.2).
51
+
52
+
53
+ - [2023/04/23] Released implementation of **BioMedGPT-1.6B** (🤖[Model](https://pan.baidu.com/s/1iAMBkuoZnNAylhopP5OgEg)) and **OpenBioMed**.
54
+
55
+
56
+ ## Table of contents
57
+
58
+
59
+ - [Introduction](#introduction)
60
+ - [Installation](#installation)
61
+ - [Tutorials](#tutorials)
62
+ - [Previous version](#previous-version)
63
+ - [Limitations](#limitations)
64
+ - [Cite us](#cite-us)
65
+
66
+
67
+ ## Introduction
68
+
69
+
70
+ This repository holds OpenBioMed, a Python deep learning toolkit for AI-empowered biomedicine. OpenBioMed provides **flexible APIs to handle multi-modal biomedical data**, including molecules, proteins, single cells, natural language, and knowledge graphs. OpenBioMed builds **20+ tools that covers a wide range of downstream applications**, ranging from traditional AI drug discovery tasks to newly-emerged multi-modal challenges. Moreover, OpenBioMed provides **an easy-to-use interface for building workflows** that connect multiple tools and developing LLM-driven agents for solving complicated biomedical research tasks.
71
+
72
+
73
+ OpenBioMed provides researchers with access to:
74
+
75
+
76
+ - **4 types of data modalities**: OpenBioMed provides easy-to-use APIs for researchers to access and process different types of data including molecules, proteins, pockets, and texts. New data structures for DNAs, RNAs, single cells, and knowledge graphs will be available in future versions.
77
+ - **20+ tools**, ranging from ML-based prediction models for AIDD tasks including molecule property prediction and protein folding, to visualization tools and web-search APIs.
78
+ - **20+ deep learning models**, comprising exclusive models such as [PharmolixFM](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/), [BioMedGPT-R1](https://huggingface.co/PharMolix/BioMedGPT-R1), [BioMedGPT](https://ieeexplore.ieee.org/document/10767279/) and [MutaPLM](https://arxiv.org/abs/2410.22949).
79
+
80
+
81
+ Key features of OpenBioMed include:
82
+
83
+
84
+ - **Unified Data Processing Pipeline**: easily load and transform the heterogeneous data from different biomedical entities and modalities into a unified format.
85
+ - **Off-the-shelf Inference**: publicly available pre-trained models and inference demos, readily to be transferred to your own data or task.
86
+ - **Easy-to-use Interface for Building Workflows and LLM Agents**: flexibly build solutions for complicated research tasks with multi-tool collaborative workflows, and harvest LLMs for simulating trial-and-errors and gaining scientific insights.
87
+
88
+
89
+ Here is a list of currently available tools. This is a continuing effort and we are working on further growing the list.
90
+
91
+
92
+ | Tool | Supported Model | Description |
93
+ | :----------------------------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
94
+ | Molecular Property Prediction | [GraphMVP](https://arxiv.org/abs/2110.07728) | Predicting the properties of a given molecule (e.g. blood-brain barrier penetration and side effects) |
95
+ | Molecule Question Answering | [BioT5](https://arxiv.org/abs/2310.07276) | Answering textual queries of a given molecule (e.g. structural descriptions, functional groups, number of hydrogen bond donors) |
96
+ | Molecule Visualization | N/A | Visualize a molecule |
97
+ | Molecule Name/ID Request | N/A | Obtaining a molecule from PubChem using its name or PubChemID |
98
+ | Molecule Structure Request | N/A | Obtaining a molecule from PubChem based on similar structures |
99
+ | Protein Question Answering | [BioT5](https://arxiv.org/abs/2310.07276) | Answering textual queries of a given protein (e.g. motifs, functions, subcellular location, related diseases) |
100
+ | Protein Folding | [ESMFold](https://www.science.org/doi/10.1126/science.ade2574) | Predicting the 3D structure of a protein based on its amino acid sequence |
101
+ | Protein Pocket Prediction | [P2Rank](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0285-8) | Predicting potential binding sites within a protein |
102
+ | Mutation Explanation | [MutaPLM](https://arxiv.org/abs/2410.22949) | Providing textual explanations of a single-site substitution mutation on a protein sequence |
103
+ | Mutation Engineering | [MutaPLM](https://arxiv.org/abs/2410.22949) | Generating a mutated protein to fit the textual instructions on the wild-type protein sequence. |
104
+ | Protein UniProtID Request | N/A | Obtaining a protein sequence from UniProtKB based on UniProt accession ID |
105
+ | Protein PDB Request | N/A | Obtaining a protein structure from PDB/AlphaFoldDB based on PDB/AlphaFoldDB accession ID |
106
+ | Protein Visualization | N/A | Visualize a protein |
107
+ | Protein-molecule Rigid Docking | [PharmolixFM](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/) | Generate the binding pose of the molecule with a given pocket in a protein |
108
+ | Structure-based Drug Design | [PharmolixFM](https://cloud.tsinghua.edu.cn/f/8f337ed5b58f45138659/), [MolCRAFT](https://github.com/AlgoMole/MolCRAFT) | Generate a molecule that binds with a given pocket in a protein |
109
+ | Complex Visualization | N/A | Visualize a protein-molecule complex |
110
+ | Pocket Visualization | N/A | Visualize a pocket within a protein |
111
+ | Web Request | N/A | Obtaining information by web search |
112
+
113
+ ## Installation
114
+
115
+ To enable basic features of OpenBioMed, please execute the following:
116
+
117
+
118
+ ```bash
119
+ conda create -n OpenBioMed python=3.9
120
+ conda activate OpenBioMed
121
+ pip install torch==1.13.1+{your_cuda_version} torchvision==0.14.1+{your_cuda_version} torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/{your_cuda_version}
122
+ pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.1+{your_cuda_version}.html
123
+ pip install pytorch_lightning==2.0.8 peft==0.9.0 accelerate==1.3.0 --no-deps -i https://pypi.tuna.tsinghua.edu.cn/simple
124
+ pip install -r requirements.txt
125
+ ```
126
+
127
+
128
+ We recommend using cuda=11.7 to set up the environment. Other versions of cudatoolkits may lead to unexpected problems.
129
+
130
+
131
+ To enable visualization tools and vina score computation tools, you should install the following packages:
132
+
133
+ ```
134
+ # For visualization
135
+ conda install -c conda-forge pymol-open-source
136
+ pip install imageio
137
+
138
+ # For AutoDockVina
139
+ pip install meeko==0.1.dev3 pdb2pqr vina==1.2.2
140
+ python -m pip install git+https://github.com/Valdes-Tresanco-MS/AutoDockTools_py3
141
+
142
+ # For PoseBusters
143
+ pip install posebusters==0.3.1
144
+
145
+ # For overlap-based evaluation
146
+ pip install spacy rouge_score nltk
147
+ python
148
+ >>> import nltk
149
+ >>> nltk.download('wordnet')
150
+ >>> nltk.download('omw-1.4')
151
+
152
+ # For LangCell
153
+ pip install geneformer
154
+ ```
155
+
156
+ After downloading the dependencies, you can run the following command to install the package and use our APIs more conveniently:
157
+
158
+ ```bash
159
+ pip install -e .
160
+ # Try using OpenBioMed APIs
161
+ python
162
+ >>> from open_biomed.data import Molecule
163
+ >>> molecule = Molecule(smiles="CC(=O)OC1=CC=CC=C1C(=O)O")
164
+ >>> print(molecule.calc_logp())
165
+ ```
166
+
167
+ ### Build Docker
168
+
169
+ Executing ./scripts/docker_run.sh directly will build the Docker image and run the container, launching the backend services on ports 8082 and 8083.
170
+ ```
171
+ sh ./scripts/docker_run.sh
172
+ ```
173
+ At the same time, we also provide a pre-built [docker image](https://hub.docker.com/repository/docker/youngking0727/openbiomed_server), which can be pulled and used directly.
174
+
175
+ ## Tutorials
176
+
177
+ Check out our [Jupyter notebooks](./examples/) for a quick start!
178
+
179
+ | Name | Description |
180
+ | ------------------------------------------------------------ | ------------------------------------------------------------ |
181
+ | [BioMedGPT Inference](./examples/biomedgpt_r1.ipynb) | Examples of using BioMedGPT-10B to answer questions about molecules and proteins and BioMedGPT-R1 to perform reasoning. |
182
+ | [Molecule Processing](./examples/manipulate_molecules.ipynb) | Examples of using OpenBioMed APIs to load, process, and export molecules and proteins. |
183
+ | [ML Tool Usage](./examples/explore_ai4s_tools.ipynb) | Examples of using machine learning tools to perform inference. |
184
+ | [Visualization](./examples/visualization.ipynb) | Examples of using OpenBioMed APIs to visualize molecules, proteins, complexes, and pockets. |
185
+ | [Workflow Construction](./examples/workflow.ipynb) | Examples of building and executing workflows and developing LLM agents for complicated scientific tasks. |
186
+ | [Model Customization](./examples/model_customization.ipynb) | Tutorials on how to customize your own model and data using OpenBioMed training pipelines. |
187
+
188
+ ## Previous Version
189
+
190
+ If you hope to use the features of the previous version, please switch to the `v1.0` branch of this repository by running the following command:
191
+
192
+ ```bash
193
+ git checkout v1.0
194
+ ```
195
+
196
+ ## Limitations
197
+
198
+ This repository holds BioMedGPT-LM-7B, BioMedGPT-10B, and BioMedGPT-R1, and we emphasize the responsible and ethical use of these models. BioMedGPT should NOT be used to provide services to the general public. Generating any content that violates applicable laws and regulations, such as inciting subversion of state power, endangering national security and interests, propagating terrorism, extremism, ethnic hatred and discrimination, violence, pornography, or false and harmful information, etc. is strictly prohibited. BioMedGPT is not liable for any consequences arising from any content, data, or information provided or published by users.
199
+
200
+ ## License
201
+
202
+ This repository is licensed under the [MIT License](./LICENSE). The use of BioMedGPT-LM-7B and BioMedGPT-10B models is accompanied with [Acceptable Use Policy](./USE_POLICY.md).
203
+
204
+ ## Contact Us
205
+
206
+ We are looking forward to user feedback to help us improve our framework. If you have any technical questions or suggestions, please feel free to open an issue. For commercial support or collaboration, please contact [opensource@pharmolix.com](mailto:opensource@pharmolix.com).
207
+
208
+
209
+ ## Cite Us
210
+
211
+ If you find our open-sourced code and models helpful to your research, please consider giving this repository a 🌟star and 📎citing our research papers. Thank you for your support!
212
+
213
+ ##### To cite OpenBioMed:
214
+
215
+ ```
216
+ @misc{OpenBioMed_code,
217
+ author={Luo, Yizhen and Yang, Kai and Fan, Siqi and Hong, Massimo and Zhao, Suyuan and Chen, Xinrui and Nie, Zikun and Luo, Wen and Xie, Ailin and Liu, Xing Yi and Zhang, Jiahuan and Wu, Yushuai and Nie, Zaiqing},
218
+ title={Code of OpenBioMed},
219
+ year={2023},
220
+ howpublished={\url{https://github.com/Pharmolix/OpenBioMed.git}}
221
+ }
222
+ ```
223
+
224
+ ##### To cite BioMedGPT:
225
+
226
+ ```
227
+ @article{luo2024biomedgpt,
228
+ title={Biomedgpt: An open multimodal large language model for biomedicine},
229
+ author={Luo, Yizhen and Zhang, Jiahuan and Fan, Siqi and Yang, Kai and Hong, Massimo and Wu, Yushuai and Qiao, Mu and Nie, Zaiqing},
230
+ journal={IEEE Journal of Biomedical and Health Informatics},
231
+ year={2024},
232
+ publisher={IEEE}
233
+ }
234
+ ```
235
+
236
+ ##### To cite PharmolixFM:
237
+
238
+ @article{luo2025pharmolixfm,
239
+ title={PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation},
240
+ author={Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing},
241
+ journal={arXiv preprint arXiv:2503.21788},
242
+ year={2025}
243
+ }
244
+
245
+ ##### To cite MolFM:
246
+
247
+ ```
248
+ @misc{luo2023molfm,
249
+ title={MolFM: A Multimodal Molecular Foundation Model},
250
+ author={Yizhen Luo and Kai Yang and Massimo Hong and Xing Yi Liu and Zaiqing Nie},
251
+ year={2023},
252
+ eprint={2307.09484},
253
+ archivePrefix={arXiv},
254
+ primaryClass={q-bio.BM}
255
+ }
256
+ ```
257
+
258
+ ##### To cite LangCell:
259
+
260
+ ```
261
+ @misc{zhao2024langcell,
262
+ title={LangCell: Language-Cell Pre-training for Cell Identity Understanding},
263
+ author={Suyuan Zhao and Jiahuan Zhang and Yizhen Luo and Yushuai Wu and Zaiqing Nie},
264
+ year={2024},
265
+ eprint={2405.06708},
266
+ archivePrefix={arXiv},
267
+ primaryClass={q-bio.GN}
268
+ }
269
+ ```
270
+
271
+ ##### To cite MutaPLM:
272
+
273
+ ```
274
+ @article{luo2025mutaplm,
275
+ title={MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering},
276
+ author={Luo, Yizhen and Nie, Zikun and Hong, Massimo and Zhao, Suyuan and Zhou, Hao and Nie, Zaiqing},
277
+ journal={Advances in Neural Information Processing Systems},
278
+ volume={37},
279
+ pages={79783--79818},
280
+ year={2025}
281
+ }
282
+ ```
USE_POLICY.md ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## BioMedGPT Acceptable Use Policy
2
+
3
+ BioMedGPT is only for internal use by registered users. You agree and acknowledge that you will use BioMedGPT solely for internal use purposes and undertake not to use it, directly or indirectly, to provide services to the general public within the territory of the PRC. Otherwise, you will be liable for all the damages caused to BioMedGPT.
4
+
5
+ You have the right to use BioMedGPT pursuant to relevant agreements, but you cannot engage in any unlawful activities or disturb the orderly operation of BioMedGPT. You are not allowed to generate any content through BioMedGPT or induce it to output any speech containing the following contents, or we will block or delete the information in accordance with the applicable laws and regulations and report the matter to the relevant authorities:
6
+
7
+ 1. inciting to resist or undermine the implementation of the Constitution, laws and administrative regulations;
8
+ 2. inciting to subvert the state power and the overthrow of the political system;
9
+ 3. inciting to separate the state or undermine the unity of the country;
10
+ 4. inciting national enmity or discrimination, undermine the unity of nations;
11
+ 5. content involving discrimination on the basis of race, sex, religion, geographical content, etc.;
12
+ 6. fabricating or distorting facts, spreading disinformation, or disturbing the public order;
13
+ 7. propagating heretical teachings or feudal superstitions, disseminating obscenity, pornography, gambling, violence, homicide, terror or instigating others to commit crimes;
14
+ 8. publicly humiliating others, inventing stories to defame others, or committing other malicious attacks;
15
+ 9. harming the credibility of state organs;
16
+ 10. violating the public interest or public morality or not suitable for publication on BioMedGPT in accordance with the provisions of the relevant BioMedGPT agreements and rules;
17
+ 11. violating the Constitution, laws and administrative regulations.
18
+
19
+ You fully understand and acknowledge that you are responsible for all your activities and consequences that occur in using the BioMedGPT services, including any content, data or information you provide or publish. BioMedGPT will not be responsible for any losses thereof.
bioflow/__init__.py ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow - Multimodal Biological Intelligence Framework
3
+ ========================================================
4
+
5
+ A modular, open-source platform for biological discovery integrating:
6
+ - Multimodal encoders (text, molecules, proteins, images)
7
+ - Vector database memory (Qdrant)
8
+ - Prediction tools (DTI, ADMET)
9
+ - Workflow orchestration
10
+
11
+ Core Modules:
12
+ - core: Abstract interfaces, registry, and orchestrator
13
+ - plugins: Tool implementations (OBM, DeepPurpose, etc.)
14
+ - workflows: YAML-based pipeline definitions
15
+
16
+ Open-Source Models Supported:
17
+ - Text: PubMedBERT, SciBERT, Specter
18
+ - Molecules: ChemBERTa, RDKit FP
19
+ - Proteins: ESM-2, ProtBERT
20
+ - Images: CLIP, BioMedCLIP
21
+ """
22
+
23
+ __version__ = "0.2.0"
24
+ __author__ = "BioFlow Team"
25
+
26
+ # Core abstractions
27
+ from bioflow.core import (
28
+ Modality,
29
+ BioEncoder,
30
+ BioPredictor,
31
+ BioGenerator,
32
+ BioRetriever,
33
+ ToolRegistry,
34
+ BioFlowOrchestrator,
35
+ WorkflowConfig,
36
+ NodeConfig,
37
+ )
38
+
39
+ # Legacy imports (for backward compatibility)
40
+ try:
41
+ from bioflow.obm_wrapper import OBMWrapper
42
+ from bioflow.qdrant_manager import QdrantManager
43
+ except ImportError:
44
+ OBMWrapper = None
45
+ QdrantManager = None
46
+
47
+ __all__ = [
48
+ # Core
49
+ "Modality",
50
+ "BioEncoder",
51
+ "BioPredictor",
52
+ "BioGenerator",
53
+ "BioRetriever",
54
+ "ToolRegistry",
55
+ "BioFlowOrchestrator",
56
+ "WorkflowConfig",
57
+ "NodeConfig",
58
+ # Wrappers
59
+ "OBMWrapper",
60
+ "QdrantManager",
61
+ ]
bioflow/api/__init__.py ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow API
3
+ ============
4
+ FastAPI backend bridging the Next.js UI with OpenBioMed core.
5
+ """
6
+
7
+ __version__ = "2.0.0"
bioflow/api/dti_predictor.py ADDED
@@ -0,0 +1,346 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow DTI Predictor
3
+ ======================
4
+ Drug-Target Interaction prediction using DeepPurpose.
5
+ Integrated from lacoste001/deeppurpose002.py for the hackathon.
6
+ """
7
+
8
+ import os
9
+ import sys
10
+ import logging
11
+ import numpy as np
12
+ import pandas as pd
13
+ from typing import Any, Dict, List, Optional, Tuple
14
+ from dataclasses import dataclass, field
15
+ from datetime import datetime
16
+
17
+ logger = logging.getLogger(__name__)
18
+
19
+
20
+ # ============================================================================
21
+ # Data Classes
22
+ # ============================================================================
23
@dataclass
class DTIPrediction:
    """Outcome of a single drug-target interaction prediction."""
    drug_smiles: str
    target_sequence: str
    binding_affinity: float  # pKd or similar
    confidence: float
    model_name: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to a JSON-friendly dict, truncating long target sequences."""
        sequence = self.target_sequence
        if len(sequence) > 50:
            sequence = sequence[:50] + "..."
        return {
            "drug_smiles": self.drug_smiles,
            "target_sequence": sequence,
            "binding_affinity": self.binding_affinity,
            "confidence": self.confidence,
            "model_name": self.model_name,
            "metadata": self.metadata,
        }
42
+
43
+
44
@dataclass
class DTIMetrics:
    """Regression metrics produced by evaluating a DTI model."""
    mse: float
    rmse: float
    mae: float
    pearson: float
    spearman: float
    concordance_index: float

    def to_dict(self) -> Dict[str, float]:
        """Return every metric as a plain name -> value mapping."""
        metric_names = (
            "mse",
            "rmse",
            "mae",
            "pearson",
            "spearman",
            "concordance_index",
        )
        return {name: getattr(self, name) for name in metric_names}
63
+
64
+
65
+ # ============================================================================
66
+ # Metric Functions (from deeppurpose002.py)
67
+ # ============================================================================
68
def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Squared Error between two (flattened) value arrays."""
    diff = np.asarray(y_true, dtype=float).ravel() - np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(diff * diff))
73
+
74
+
75
def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error between two (flattened) value arrays."""
    err = np.asarray(y_true, dtype=float).ravel() - np.asarray(y_pred, dtype=float).ravel()
    return float(np.abs(err).mean())
80
+
81
+
82
def pearson(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Pearson correlation coefficient; NaN when undefined (n < 2 or zero variance)."""
    a = np.asarray(y_true, dtype=float).ravel()
    b = np.asarray(y_pred, dtype=float).ravel()
    degenerate = a.size < 2 or np.std(a) == 0 or np.std(b) == 0
    if degenerate:
        return float("nan")
    return float(np.corrcoef(a, b)[0, 1])
89
+
90
+
91
def spearman(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    def _ranks(values) -> np.ndarray:
        # Average rank handles ties the same way as the original implementation.
        flat = np.asarray(values, dtype=float).ravel()
        return pd.Series(flat).rank(method="average").to_numpy()

    return pearson(_ranks(y_true), _ranks(y_pred))
96
+
97
+
98
def concordance_index(y_true: np.ndarray, y_pred: np.ndarray, max_n: int = 2000, seed: int = 0) -> float:
    """
    Concordance Index (CI): the fraction of comparable pairs whose predicted
    ordering agrees with the true ordering; prediction ties earn 0.5 credit.

    For inputs larger than ``max_n`` a seeded random subsample is scored
    instead, bounding the quadratic pair count.

    Args:
        y_true: Ground-truth values (any shape; flattened).
        y_pred: Predicted values (any shape; flattened).
        max_n: Maximum number of samples scored exactly before subsampling.
        seed: RNG seed for the subsampling path (deterministic results).

    Returns:
        CI in [0, 1], or NaN when there are fewer than 2 samples or no
        comparable pairs (all true values equal).
    """
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    n = len(y_true)

    if n < 2:
        return float("nan")

    # Sample if too large, keeping the O(n^2) pair count bounded.
    if n > max_n:
        rng = np.random.default_rng(seed)
        idx = rng.choice(n, size=max_n, replace=False)
        y_true = y_true[idx]
        y_pred = y_pred[idx]
        n = max_n

    # Vectorized upper-triangle pairwise differences replace the original
    # pure-Python double loop: identical results at C speed.
    iu, ju = np.triu_indices(n, k=1)
    dt = y_true[iu] - y_true[ju]
    dp = y_pred[iu] - y_pred[ju]

    comparable = dt != 0  # pairs with equal true values are not counted
    total = int(comparable.sum())
    if total == 0:
        return float("nan")

    prod = dt[comparable] * dp[comparable]
    conc = float(np.count_nonzero(prod > 0)) + 0.5 * float(np.count_nonzero(prod == 0))
    return float(conc / total)
136
+
137
+
138
+ # ============================================================================
139
+ # DeepPurpose Predictor Class
140
+ # ============================================================================
141
class DeepPurposePredictor:
    """
    Drug-Target Interaction predictor backed by DeepPurpose.

    Supports multiple encoding strategies:
    - Drug: Morgan, CNN, Transformer, MPNN
    - Target: CNN, Transformer, AAC

    When DeepPurpose is not installed (or loading/prediction fails), the
    predictor transparently falls back to a deterministic mock so the
    surrounding API remains usable.

    Example:
        >>> predictor = DeepPurposePredictor()
        >>> result = predictor.predict("CCO", "MKTVRQERLKSIVRILERSKEPVSG")
        >>> print(result.binding_affinity)
    """

    def __init__(
        self,
        drug_encoding: str = "Morgan",
        target_encoding: str = "CNN",
        model_path: Optional[str] = None,
        device: str = "cpu",
    ):
        """
        Args:
            drug_encoding: Featurization used for drug SMILES strings.
            target_encoding: Featurization used for protein sequences.
            model_path: Optional path to a pre-trained DeepPurpose model.
            device: Torch device string ("cpu" or "cuda").
        """
        self.drug_encoding = drug_encoding
        self.target_encoding = target_encoding
        self.model_path = model_path
        self.device = device
        self.model = None
        self._loaded = False
        # Guards predict() against re-running the (potentially expensive)
        # import/load attempt on every call after it has already failed.
        self._load_attempted = False

    @staticmethod
    def _utc_timestamp() -> str:
        """Timezone-aware ISO-8601 UTC timestamp (datetime.utcnow() is deprecated)."""
        from datetime import datetime, timezone
        return datetime.now(timezone.utc).isoformat()

    def load_model(self) -> bool:
        """Load the DeepPurpose model.

        Returns:
            True if a model was loaded or initialized; False when DeepPurpose
            is unavailable or loading fails (predictions then fall back to
            the deterministic mock).
        """
        self._load_attempted = True
        try:
            from DeepPurpose import DTI as dp_models
            from DeepPurpose import utils

            if self.model_path and os.path.exists(self.model_path):
                # Load pre-trained model weights from disk.
                self.model = dp_models.model_pretrained(self.model_path)
                logger.info("Loaded model from %s", self.model_path)
            else:
                # Initialize new model (for inference with pre-trained weights)
                config = utils.generate_config(
                    drug_encoding=self.drug_encoding,
                    target_encoding=self.target_encoding,
                    cls_hidden_dims=[1024, 1024, 512],
                )
                self.model = dp_models.model_initialize(**config)
                logger.info(
                    "Initialized new model: %s-%s",
                    self.drug_encoding,
                    self.target_encoding,
                )

            self._loaded = True
            return True

        except ImportError:
            logger.warning("DeepPurpose not installed. Using mock predictions.")
            self._loaded = False
            return False
        except Exception as e:
            logger.error("Failed to load model: %s", e)
            self._loaded = False
            return False

    def predict(self, drug_smiles: str, target_sequence: str) -> DTIPrediction:
        """
        Predict binding affinity between drug and target.

        Args:
            drug_smiles: SMILES string of the drug molecule.
            target_sequence: Amino acid sequence of the target protein.

        Returns:
            DTIPrediction with binding affinity and confidence; a mock
            prediction when the real model is unavailable or errors out.
        """
        # Only try loading once; a failed attempt would otherwise be
        # repeated (import + init) on every single prediction call.
        if not self._loaded and not self._load_attempted:
            self.load_model()

        if self.model is not None:
            try:
                from DeepPurpose import utils

                # DeepPurpose expects parallel drug/target/label lists; the
                # label is a dummy value, unused at inference time.
                data = utils.data_process(
                    [drug_smiles], [target_sequence], [0],
                    drug_encoding=self.drug_encoding,
                    target_encoding=self.target_encoding,
                    split_method="no_split",
                )

                pred = self.model.predict(data)
                affinity = float(pred[0]) if len(pred) > 0 else 0.0

                return DTIPrediction(
                    drug_smiles=drug_smiles,
                    target_sequence=target_sequence,
                    binding_affinity=affinity,
                    confidence=0.85,  # TODO: Implement uncertainty estimation
                    model_name=f"DeepPurpose-{self.drug_encoding}-{self.target_encoding}",
                    metadata={
                        "timestamp": self._utc_timestamp(),
                        "device": self.device,
                    }
                )

            except Exception as e:
                logger.error("Prediction failed: %s", e)

        # Fallback: deterministic mock prediction.
        return self._mock_predict(drug_smiles, target_sequence)

    def _mock_predict(self, drug_smiles: str, target_sequence: str) -> DTIPrediction:
        """Generate a deterministic mock prediction when the model is unavailable."""
        import hashlib

        # Deterministic "prediction" based on input hash.
        hash_input = f"{drug_smiles}:{target_sequence}"
        hash_val = int(hashlib.md5(hash_input.encode()).hexdigest()[:8], 16)

        # Generate realistic-looking pKd value (typically 4-10).
        affinity = 4.0 + (hash_val % 6000) / 1000.0
        confidence = 0.7 + (hash_val % 300) / 1000.0

        return DTIPrediction(
            drug_smiles=drug_smiles,
            target_sequence=target_sequence,
            binding_affinity=round(affinity, 3),
            confidence=round(confidence, 3),
            model_name="Mock-Predictor",
            metadata={
                "timestamp": self._utc_timestamp(),
                "note": "Mock prediction - DeepPurpose not loaded",
            }
        )

    def batch_predict(
        self,
        drug_target_pairs: List[Tuple[str, str]],
    ) -> List[DTIPrediction]:
        """Predict for multiple (drug SMILES, target sequence) pairs."""
        return [self.predict(d, t) for d, t in drug_target_pairs]

    def evaluate(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
    ) -> DTIMetrics:
        """Evaluate predictions against ground truth using the module metrics."""
        import math

        m_mse = mse(y_true, y_pred)

        return DTIMetrics(
            mse=m_mse,
            rmse=math.sqrt(m_mse),
            mae=mae(y_true, y_pred),
            pearson=pearson(y_true, y_pred),
            spearman=spearman(y_true, y_pred),
            concordance_index=concordance_index(y_true, y_pred),
        )
302
+
303
+
304
+ # ============================================================================
305
+ # Factory function
306
+ # ============================================================================
307
def get_dti_predictor(
    drug_encoding: str = "Morgan",
    target_encoding: str = "CNN",
    model_path: Optional[str] = None,
) -> DeepPurposePredictor:
    """Factory returning a configured DTI predictor (model is loaded lazily)."""
    return DeepPurposePredictor(
        drug_encoding=drug_encoding,
        target_encoding=target_encoding,
        model_path=model_path,
    )
319
+
320
+
321
+ # ============================================================================
322
+ # CLI for standalone usage
323
+ # ============================================================================
324
if __name__ == "__main__":
    import argparse

    # Standalone CLI entry point for a one-off DTI prediction.
    cli = argparse.ArgumentParser(description="DTI Prediction")
    cli.add_argument("--drug", required=True, help="Drug SMILES")
    cli.add_argument("--target", required=True, help="Target protein sequence")
    cli.add_argument("--drug-enc", default="Morgan", help="Drug encoding method")
    cli.add_argument("--target-enc", default="CNN", help="Target encoding method")
    opts = cli.parse_args()

    predictor = get_dti_predictor(opts.drug_enc, opts.target_enc)
    result = predictor.predict(opts.drug, opts.target)

    banner = "=" * 60
    print(f"\n{banner}")
    print("DTI Prediction Result")
    print(banner)
    print(f"Drug: {result.drug_smiles}")
    print(f"Target: {result.target_sequence}")
    print(f"Binding Affinity (pKd): {result.binding_affinity:.3f}")
    print(f"Confidence: {result.confidence:.3f}")
    print(f"Model: {result.model_name}")
    print(f"{banner}\n")
bioflow/api/requirements.txt ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # BioFlow API Requirements
2
+ # ========================
3
+
4
+ # Core
5
+ fastapi>=0.109.0
6
+ uvicorn[standard]>=0.27.0
7
+ pydantic>=2.5.0
8
+
9
+ # Async
10
+ httpx>=0.26.0
11
+ aiofiles>=23.2.0
12
+
13
+ # Data
14
+ numpy>=1.24.0
15
+ pandas>=2.0.0
16
+
17
+ # DeepPurpose (DTI prediction)
18
+ DeepPurpose>=0.1.4
19
+ torch>=2.0.0
20
+
21
+ # TDC - Therapeutics Data Commons
22
+ PyTDC>=0.4.0
23
+
24
+ # Vector DB
25
+ qdrant-client>=1.7.0
26
+
27
+ # Utilities
28
+ python-multipart>=0.0.6
29
+ python-dotenv>=1.0.0
bioflow/api/server.py ADDED
@@ -0,0 +1,359 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow API - Main Server
3
+ ==========================
4
+ FastAPI application serving the Next.js frontend.
5
+ Endpoints for discovery, prediction, and data management.
6
+ """
7
+
8
+ import os
9
+ import sys
10
+ import uuid
11
+ import logging
12
+ from datetime import datetime
13
+ from typing import Any, Dict, List, Optional
14
+ from contextlib import asynccontextmanager
15
+
16
+ from fastapi import FastAPI, HTTPException, BackgroundTasks
17
+ from fastapi.middleware.cors import CORSMiddleware
18
+ from pydantic import BaseModel, Field
19
+
20
+ # Add project root to path
21
+ ROOT_DIR = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
22
+ sys.path.insert(0, ROOT_DIR)
23
+
24
+ logging.basicConfig(level=logging.INFO)
25
+ logger = logging.getLogger(__name__)
26
+
27
# ============================================================================
# In-Memory Job Store (replace with Redis/DB in production)
# ============================================================================
# Maps job_id -> mutable job-state dict (status, progress, current_step,
# result, error, created_at/updated_at). NOTE(review): process-local and
# unbounded — completed jobs are never evicted.
JOBS: Dict[str, Dict[str, Any]] = {}
31
+
32
+
33
+ # ============================================================================
34
+ # Pydantic Models
35
+ # ============================================================================
36
class DiscoveryRequest(BaseModel):
    """Request body for starting the async drug discovery pipeline."""
    # Free-form input: a SMILES string, FASTA sequence, or natural-language query.
    query: str = Field(..., description="SMILES, FASTA, or natural language query")
    search_type: str = Field(default="similarity", description="similarity | binding | properties")
    database: str = Field(default="all", description="Target database")
    # Result cap; pydantic enforces the 1..100 range at validation time.
    limit: int = Field(default=10, ge=1, le=100)
42
+
43
+
44
class PredictRequest(BaseModel):
    """Request body for a single drug-target interaction prediction."""
    drug_smiles: str = Field(..., description="SMILES string of drug")
    target_sequence: str = Field(..., description="Protein sequence (FASTA)")
48
+
49
+
50
class IngestRequest(BaseModel):
    """Request body for ingesting one item into the vector DB."""
    # Raw payload to be embedded (interpretation depends on `modality`).
    content: str
    modality: str = Field(default="smiles", description="smiles | protein | text")
    # Optional free-form payload stored alongside the vector.
    metadata: Optional[Dict[str, Any]] = None
55
+
56
+
57
class JobStatus(BaseModel):
    """Status snapshot of an async job (mirrors entries in JOBS)."""
    job_id: str
    status: str  # pending | running | completed | failed
    # Percentage 0-100.
    progress: int = 0
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    # ISO-8601 timestamps (UTC).
    created_at: str
    updated_at: str
66
+
67
+
68
class HealthResponse(BaseModel):
    """Health check response returned by `/` and `/health`."""
    status: str
    version: str
    # ISO-8601 UTC timestamp of when the check ran.
    timestamp: str
73
+
74
+
75
+ # ============================================================================
76
+ # Lifespan (startup/shutdown)
77
+ # ============================================================================
78
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Initialize resources on startup, cleanup on shutdown.

    Everything before ``yield`` runs once when the server starts;
    everything after it runs once at shutdown.
    """
    logger.info("🚀 BioFlow API starting up...")
    # TODO: Initialize Qdrant connection, load models
    yield
    logger.info("🛑 BioFlow API shutting down...")
85
+
86
+
87
+ # ============================================================================
88
+ # FastAPI App
89
+ # ============================================================================
90
# Application instance; `lifespan` hooks startup/shutdown above.
app = FastAPI(
    title="BioFlow API",
    description="AI-Powered Drug Discovery Platform API",
    version="2.0.0",
    lifespan=lifespan,
)

# CORS for Next.js frontend
# Only the local dev frontends (ports 3000/3001) may call this API with
# credentials; methods and headers are unrestricted for dev convenience.
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",
        "http://127.0.0.1:3000",
        "http://localhost:3001",
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
109
+
110
+
111
+ # ============================================================================
112
+ # Health & Info
113
+ # ============================================================================
114
@app.get("/", response_model=HealthResponse)
async def root():
    """Health check endpoint (service root)."""
    payload = {
        "status": "healthy",
        "version": "2.0.0",
        "timestamp": datetime.utcnow().isoformat(),
    }
    return HealthResponse(**payload)
122
+
123
+
124
@app.get("/health", response_model=HealthResponse)
async def health():
    """Health check endpoint (conventional /health path)."""
    payload = {
        "status": "healthy",
        "version": "2.0.0",
        "timestamp": datetime.utcnow().isoformat(),
    }
    return HealthResponse(**payload)
132
+
133
+
134
+ # ============================================================================
135
+ # Discovery Pipeline
136
+ # ============================================================================
137
def run_discovery_pipeline(job_id: str, request: DiscoveryRequest):
    """Background task driving the (mock) discovery pipeline.

    Advances the JOBS entry for `job_id` through encode -> search -> predict
    -> complete, sleeping between stages as a stand-in for real work. On any
    exception the job is marked failed and the error message recorded.

    Args:
        job_id: Key of an existing entry in the module-level JOBS store.
        request: The validated discovery request (query, search_type, ...).
    """
    import time

    job = JOBS[job_id]

    def _touch() -> None:
        # Refresh the job's last-modified timestamp after each mutation.
        job["updated_at"] = datetime.utcnow().isoformat()

    try:
        job["status"] = "running"
        _touch()

        # Mock pipeline stages: (progress %, step name).
        # TODO: replace the sleeps with actual encoding, vector search, and
        # DTI prediction respectively.
        for progress, step in ((25, "encode"), (50, "search"), (75, "predict")):
            job["progress"] = progress
            job["current_step"] = step
            _touch()
            time.sleep(1)

        job["progress"] = 100
        job["current_step"] = "complete"
        job["status"] = "completed"
        job["result"] = {
            "candidates": [
                {"name": "Candidate A", "smiles": "CCO", "score": 0.95, "mw": 342.4, "logp": 2.1},
                {"name": "Candidate B", "smiles": "CC(=O)O", "score": 0.89, "mw": 298.3, "logp": 1.8},
                {"name": "Candidate C", "smiles": "c1ccccc1", "score": 0.82, "mw": 415.5, "logp": 3.2},
            ],
            "query": request.query,
            "search_type": request.search_type,
        }
        _touch()

    except Exception as e:
        job["status"] = "failed"
        job["error"] = str(e)
        _touch()
        # logger.exception preserves the traceback (logger.error dropped it).
        logger.exception("Discovery pipeline failed: %s", e)
180
+
181
+
182
@app.post("/api/discovery")
async def start_discovery(request: DiscoveryRequest, background_tasks: BackgroundTasks):
    """Start a discovery pipeline job; returns immediately with a job_id."""
    job_id = f"disc_{uuid.uuid4().hex[:12]}"
    stamp = datetime.utcnow().isoformat()

    # Seed the job record; the background task mutates it as it progresses.
    JOBS[job_id] = dict(
        job_id=job_id,
        status="pending",
        progress=0,
        current_step="queued",
        result=None,
        error=None,
        created_at=stamp,
        updated_at=stamp,
        request=request.model_dump(),
    )

    background_tasks.add_task(run_discovery_pipeline, job_id, request)

    return {
        "success": True,
        "job_id": job_id,
        "status": "pending",
        "message": "Discovery pipeline started",
    }
208
+
209
+
210
@app.get("/api/discovery/{job_id}")
async def get_discovery_status(job_id: str):
    """Return the live status dict of a discovery job, or 404 if unknown."""
    job = JOBS.get(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="Job not found")
    return job
216
+
217
+
218
# ============================================================================
# DTI Prediction
# ============================================================================
# Import the predictor
from bioflow.api.dti_predictor import get_dti_predictor, DeepPurposePredictor

# Global predictor instance (lazy loaded)
_dti_predictor: Optional[DeepPurposePredictor] = None

def get_predictor() -> DeepPurposePredictor:
    """Get or create the module-wide DTI predictor instance.

    Lazily constructs a single DeepPurposePredictor on first use and reuses
    it for all subsequent requests. NOTE(review): not guarded by a lock —
    concurrent first requests could build the predictor twice; confirm this
    is acceptable for the deployment model.
    """
    global _dti_predictor
    if _dti_predictor is None:
        _dti_predictor = get_dti_predictor()
    return _dti_predictor
233
+
234
+
235
@app.post("/api/predict")
async def predict_dti(request: PredictRequest):
    """
    Predict drug-target interaction.
    Uses DeepPurpose under the hood.
    """
    try:
        outcome = get_predictor().predict(request.drug_smiles, request.target_sequence)

        prediction = {
            "drug_smiles": outcome.drug_smiles,
            "target_sequence": outcome.target_sequence,
            "binding_affinity": outcome.binding_affinity,
            "confidence": outcome.confidence,
            "interaction_probability": min(outcome.confidence + 0.05, 1.0),
        }
        metadata = {
            "model": outcome.model_name,
            "timestamp": datetime.utcnow().isoformat(),
        }
        metadata.update(outcome.metadata)
        return {"success": True, "prediction": prediction, "metadata": metadata}

    except Exception as e:
        logger.error(f"Prediction failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))
264
+
265
+
266
+ # ============================================================================
267
+ # Data Management
268
+ # ============================================================================
269
@app.post("/api/ingest")
async def ingest_data(request: IngestRequest):
    """Ingest one item into the vector database (currently a stub)."""
    try:
        # TODO: Integrate with Qdrant via bioflow.qdrant_manager
        new_id = f"doc_{uuid.uuid4().hex[:12]}"
        response = {
            "success": True,
            "id": new_id,
            "modality": request.modality,
            "message": "Data ingested successfully",
        }
    except Exception as e:
        logger.error(f"Ingest failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))
    return response
286
+
287
+
288
@app.get("/api/molecules")
async def list_molecules(limit: int = 20, offset: int = 0):
    """List molecules in the database.

    Args:
        limit: Maximum number of records to return.
        offset: Number of records to skip from the start.

    Returns:
        Dict with the requested page, the full count, and the paging echo.
    """
    # TODO: Query from Qdrant
    mock_molecules = [
        {"id": "mol_001", "smiles": "CCO", "name": "Ethanol", "mw": 46.07},
        {"id": "mol_002", "smiles": "CC(=O)O", "name": "Acetic Acid", "mw": 60.05},
        {"id": "mol_003", "smiles": "c1ccccc1", "name": "Benzene", "mw": 78.11},
    ]
    # Fix: the original ignored limit/offset entirely; apply them so the
    # endpoint paginates correctly once real data is plugged in. With the
    # defaults (limit=20, offset=0) the response is unchanged.
    page = mock_molecules[offset:offset + limit]
    return {
        "molecules": page,
        "total": len(mock_molecules),
        "limit": limit,
        "offset": offset,
    }
303
+
304
+
305
@app.get("/api/proteins")
async def list_proteins(limit: int = 20, offset: int = 0):
    """List proteins in the database.

    Args:
        limit: Maximum number of records to return.
        offset: Number of records to skip from the start.

    Returns:
        Dict with the requested page, the full count, and the paging echo.
    """
    # TODO: Query from Qdrant
    mock_proteins = [
        {"id": "prot_001", "uniprot_id": "P00533", "name": "EGFR", "length": 1210},
        {"id": "prot_002", "uniprot_id": "P04637", "name": "p53", "length": 393},
        {"id": "prot_003", "uniprot_id": "P38398", "name": "BRCA1", "length": 1863},
    ]
    # Fix: the original ignored limit/offset; apply them (same fix as
    # /api/molecules). Defaults leave the response unchanged.
    page = mock_proteins[offset:offset + limit]
    return {
        "proteins": page,
        "total": len(mock_proteins),
        "limit": limit,
        "offset": offset,
    }
320
+
321
+
322
+ # ============================================================================
323
+ # Explorer (Embeddings)
324
+ # ============================================================================
325
@app.get("/api/explorer/embeddings")
async def get_embeddings(dataset: str = "default", method: str = "umap"):
    """Get 2D projections of embeddings for visualization.

    Returns 100 mock points jittered around 4 cluster centres; output is
    deterministic (fixed seed) and identical on every call.
    """
    import random

    # TODO: Get actual embeddings from Qdrant and project
    # Fix: use a private Random instance instead of random.seed(42), which
    # re-seeded the process-wide RNG on every request and could perturb any
    # other code using the global `random` module. A seeded Random(42)
    # yields the same gauss() sequence, so the response is unchanged.
    rng = random.Random(42)

    centers = [(2, 3), (-2, -1), (4, -2), (-1, 4)]
    points = []
    for i in range(100):
        cluster = i % 4
        cx, cy = centers[cluster]
        points.append({
            "id": f"mol_{i:03d}",
            "x": cx + rng.gauss(0, 0.8),
            "y": cy + rng.gauss(0, 0.8),
            "cluster": cluster,
            "label": f"Molecule {i}",
        })

    return {
        "points": points,
        "method": method,
        "dataset": dataset,
        "n_clusters": 4,
    }
352
+
353
+
354
# ============================================================================
# Run with: uvicorn bioflow.api.server:app --reload --port 8000
# ============================================================================
if __name__ == "__main__":
    import uvicorn
    # Dev convenience entry point; for production prefer the uvicorn CLI
    # invocation shown in the comment above (with workers/reload as needed).
    uvicorn.run(app, host="0.0.0.0", port=8000)
bioflow/app.py ADDED
@@ -0,0 +1,570 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Explorer - Streamlit Interface
3
+ =======================================
4
+
5
+ Interactive web interface for testing and exploring the BioFlow
6
+ multimodal biological intelligence system.
7
+
8
+ Run with: streamlit run bioflow/app.py
9
+ """
10
+
11
+ import streamlit as st
12
+ import numpy as np
13
+ import pandas as pd
14
+ from typing import List, Dict, Any
15
+ import json
16
+ import os
17
+ import sys
18
+
19
+ # Add project root to path
20
+ ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
21
+ sys.path.insert(0, ROOT_DIR)
22
+
23
# Page config
# Must be the first Streamlit call in the script (Streamlit requirement).
st.set_page_config(
    page_title="BioFlow Explorer",
    page_icon="🧬",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS
# Injected once per rerun; defines the gradient header, card styles, and
# per-modality colors used by the page renderers below.
st.markdown("""
<style>
    .main-header {
        font-size: 2.5rem;
        font-weight: bold;
        background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
        -webkit-background-clip: text;
        -webkit-text-fill-color: transparent;
        margin-bottom: 1rem;
    }
    .metric-card {
        background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
        padding: 1rem;
        border-radius: 0.5rem;
        margin: 0.5rem 0;
    }
    .result-card {
        border: 1px solid #ddd;
        border-radius: 0.5rem;
        padding: 1rem;
        margin: 0.5rem 0;
        background: white;
    }
    .modality-text { color: #3b82f6; }
    .modality-molecule { color: #10b981; }
    .modality-protein { color: #f59e0b; }
</style>
""", unsafe_allow_html=True)
60
+
61
+
62
@st.cache_resource
def init_bioflow(use_mock: bool = True):
    """Build and cache the BioFlow component bundle (one per use_mock value)."""
    try:
        from bioflow.obm_wrapper import OBMWrapper
        from bioflow.qdrant_manager import QdrantManager
        from bioflow.pipeline import BioFlowPipeline, MinerAgent, ValidatorAgent

        encoder = OBMWrapper(use_mock=use_mock)
        store = QdrantManager(encoder, qdrant_path=None)  # In-memory
        store.create_collection("bioflow_demo", recreate=True)

        flow = BioFlowPipeline(encoder, store)
        for agent_cls in (MinerAgent, ValidatorAgent):
            flow.register_agent(agent_cls(encoder, store, "bioflow_demo"))

        return {
            "obm": encoder,
            "qdrant": store,
            "pipeline": flow,
            "ready": True,
        }
    except Exception as e:
        st.error(f"Failed to initialize: {e}")
        return {"ready": False, "error": str(e)}
+
88
+
89
def render_sidebar():
    """Render the sidebar with controls.

    Returns:
        (mode, use_mock): the selected page label and the mock-mode flag.
    """
    st.sidebar.markdown("## 🧬 BioFlow Explorer")
    st.sidebar.markdown("---")

    # Mode selection — the label strings are matched by substring in main().
    mode = st.sidebar.selectbox(
        "Mode",
        ["🔍 Search & Explore", "📥 Data Ingestion", "🧪 Cross-Modal Analysis",
         "📊 Visualization", "🔬 Pipeline Demo", "📚 Documentation"]
    )

    st.sidebar.markdown("---")

    # Settings
    with st.sidebar.expander("⚙️ Settings"):
        use_mock = st.checkbox("Use Mock Mode (no GPU needed)", value=True)
        # Display-only (disabled); value is not returned to the caller.
        vector_dim = st.number_input("Vector Dimension", value=768, disabled=True)

    st.sidebar.markdown("---")
    # Header only — the metrics themselves are appended by main().
    st.sidebar.markdown("### Quick Stats")

    return mode, use_mock
+
113
+
114
def render_search_page(components):
    """Render the search and explore page.

    Expects `components` to contain "obm" (encoder) and "qdrant" (store),
    as produced by init_bioflow().
    """
    st.markdown('<p class="main-header">🔍 Search & Explore</p>', unsafe_allow_html=True)

    col1, col2 = st.columns([2, 1])

    with col1:
        query = st.text_area(
            "Enter your query",
            placeholder="e.g., 'KRAS inhibitor for cancer treatment' or a SMILES string like 'CCO'",
            height=100
        )

        query_modality = st.selectbox(
            "Query Modality",
            ["text", "smiles", "protein"],
            help="Select the type of your input"
        )

    with col2:
        target_modality = st.selectbox(
            "Search for",
            ["All", "text", "smiles", "protein"],
            help="Filter results by modality"
        )

        top_k = st.slider("Number of results", 1, 20, 5)

    if st.button("🔍 Search", type="primary"):
        if not query:
            st.warning("Please enter a query")
            return

        with st.spinner("Encoding and searching..."):
            obm = components["obm"]
            qdrant = components["qdrant"]

            # Encode query
            embedding = obm.encode(query, query_modality)

            # Display query embedding info
            with st.expander("📊 Query Embedding Details"):
                st.json({
                    "modality": embedding.modality.value,
                    "dimension": embedding.dimension,
                    "content_hash": embedding.content_hash,
                    "vector_sample": embedding.vector[:5].tolist()
                })

            # Search ("All" disables the modality filter)
            filter_mod = None if target_modality == "All" else target_modality
            results = qdrant.search(
                query=query,
                query_modality=query_modality,
                limit=top_k,
                filter_modality=filter_mod
            )

            if results:
                st.markdown("### 📋 Search Results")
                for i, r in enumerate(results):
                    with st.container():
                        col1, col2, col3 = st.columns([1, 4, 1])
                        with col1:
                            st.metric("Rank", i + 1)
                        with col2:
                            # CSS class maps to the per-modality colors defined
                            # in the page-level stylesheet.
                            modality_class = f"modality-{r.modality}"
                            st.markdown(f"**<span class='{modality_class}'>[{r.modality.upper()}]</span>** {r.content[:100]}...", unsafe_allow_html=True)
                        with col3:
                            st.metric("Score", f"{r.score:.3f}")
                        st.divider()
            else:
                st.info("No results found. Try ingesting some data first!")
+
188
+
189
def render_ingestion_page(components):
    """Render the data ingestion page (single entry, batch upload, samples)."""
    st.markdown('<p class="main-header">📥 Data Ingestion</p>', unsafe_allow_html=True)

    tab1, tab2, tab3 = st.tabs(["📝 Single Entry", "📄 Batch Upload", "🧪 Sample Data"])

    with tab1:
        st.markdown("### Add Single Entry")

        col1, col2 = st.columns(2)
        with col1:
            content = st.text_area("Content", placeholder="Enter text, SMILES, or protein sequence")
            modality = st.selectbox("Type", ["text", "smiles", "protein"])

        with col2:
            source = st.text_input("Source", placeholder="e.g., PubMed:12345")
            tags = st.text_input("Tags (comma-separated)", placeholder="e.g., cancer, kinase")

        if st.button("➕ Add Entry"):
            if content:
                qdrant = components["qdrant"]
                item = {
                    "content": content,
                    "modality": modality,
                    "source": source,
                    # Split on commas, dropping empty fragments.
                    "tags": [t.strip() for t in tags.split(",") if t.strip()]
                }
                stats = qdrant.ingest([item])
                st.success(f"Added successfully! Stats: {stats}")
            else:
                st.warning("Please enter content")

    with tab2:
        st.markdown("### Batch Upload")

        uploaded_file = st.file_uploader("Upload JSON or CSV", type=["json", "csv"])

        if uploaded_file:
            try:
                # JSON is taken as-is; CSV rows become one dict per record.
                if uploaded_file.name.endswith('.json'):
                    data = json.load(uploaded_file)
                else:
                    df = pd.read_csv(uploaded_file)
                    data = df.to_dict('records')

                st.write(f"Found {len(data)} entries")
                st.dataframe(pd.DataFrame(data).head())

                if st.button("📤 Upload All"):
                    qdrant = components["qdrant"]
                    stats = qdrant.ingest(data)
                    st.success(f"Ingestion complete! {stats}")
            except Exception as e:
                st.error(f"Error parsing file: {e}")

    with tab3:
        st.markdown("### Load Sample Data")
        st.markdown("Load pre-defined sample data to test the system.")

        # Small demo corpus mixing the three modalities (NSAIDs + KRAS).
        sample_data = [
            {"content": "Aspirin is used to reduce fever and relieve mild to moderate pain", "modality": "text", "source": "sample", "tags": ["pain", "fever"]},
            {"content": "CC(=O)OC1=CC=CC=C1C(=O)O", "modality": "smiles", "source": "ChEMBL", "tags": ["aspirin", "nsaid"]},
            {"content": "Ibuprofen is a nonsteroidal anti-inflammatory drug used for treating pain", "modality": "text", "source": "sample", "tags": ["pain", "nsaid"]},
            {"content": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "modality": "smiles", "source": "ChEMBL", "tags": ["ibuprofen", "nsaid"]},
            {"content": "KRAS mutations are found in many cancers and are difficult to target", "modality": "text", "source": "PubMed", "tags": ["cancer", "KRAS"]},
            {"content": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSKDGKKKKKKSKTKCVIM", "modality": "protein", "source": "UniProt:P01116", "tags": ["KRAS", "GTPase"]},
            {"content": "Sotorasib is a first-in-class KRAS G12C inhibitor", "modality": "text", "source": "PubMed", "tags": ["KRAS", "inhibitor", "cancer"]},
            {"content": "C[C@@H]1CC(=O)N(C2=C1C=CC(=C2)NC(=O)C3=CC=C(C=C3)N4CCN(CC4)C)C5=NC=CC(=N5)C6CCCCC6", "modality": "smiles", "source": "ChEMBL", "tags": ["sotorasib", "KRAS", "inhibitor"]},
        ]

        if st.button("🧪 Load Sample Data"):
            qdrant = components["qdrant"]
            stats = qdrant.ingest(sample_data)
            st.success(f"Loaded {len(sample_data)} sample entries! {stats}")
            st.balloons()
+ st.balloons()
264
+
265
+
266
def render_crossmodal_page(components):
    """Render the cross-modal analysis page (query vs. list of targets)."""
    st.markdown('<p class="main-header">🧪 Cross-Modal Analysis</p>', unsafe_allow_html=True)

    st.markdown("""
    Explore how different modalities relate to each other in the shared embedding space.
    This is the core capability of BioFlow - connecting text, molecules, and proteins.
    """)

    col1, col2 = st.columns(2)

    with col1:
        st.markdown("### Query")
        query = st.text_area("Enter query", height=100)
        query_mod = st.selectbox("Query type", ["text", "smiles", "protein"], key="q_mod")

    with col2:
        st.markdown("### Targets")
        targets = st.text_area("Enter targets (one per line)", height=100)
        target_mod = st.selectbox("Target type", ["text", "smiles", "protein"], key="t_mod")

    if st.button("🔄 Compute Cross-Modal Similarity"):
        if query and targets:
            obm = components["obm"]
            # One target per non-empty line.
            target_list = [t.strip() for t in targets.strip().split("\n") if t.strip()]

            results = obm.cross_modal_similarity(
                query=query,
                query_modality=query_mod,
                targets=target_list,
                target_modality=target_mod
            )

            st.markdown("### Results (sorted by similarity)")

            # `results` is assumed to be (content, similarity) pairs —
            # TODO confirm against OBMWrapper.cross_modal_similarity.
            df = pd.DataFrame(results, columns=["Content", "Similarity"])
            df["Rank"] = range(1, len(df) + 1)
            df = df[["Rank", "Content", "Similarity"]]

            st.dataframe(df, use_container_width=True)

            # Visualize
            import plotly.express as px
            fig = px.bar(df, x="Content", y="Similarity", title="Cross-Modal Similarities")
            st.plotly_chart(fig, use_container_width=True)
+
312
+
313
def render_visualization_page(components):
    """Render visualization page (embedding scatter, similarity heatmap, 2D molecules)."""
    st.markdown('<p class="main-header">📊 Visualization</p>', unsafe_allow_html=True)

    tab1, tab2, tab3 = st.tabs(["🌐 Embedding Space", "📈 Similarity Matrix", "🧬 Molecules"])

    with tab1:
        st.markdown("### Embedding Space Visualization")

        # Get all points from collection
        qdrant = components["qdrant"]
        info = qdrant.get_collection_info()

        if info.get("points_count", 0) == 0:
            st.warning("No data in collection. Go to Data Ingestion to add some!")
            return

        st.metric("Points in collection", info.get("points_count", 0))

        if st.button("🎨 Generate Embedding Plot"):
            # This would require fetching all vectors - simplified for demo
            st.info("Embedding visualization requires fetching all vectors. In production, use sampling.")

            # Demo with random data — NOT real embeddings; placeholder scatter only.
            n_points = min(info.get("points_count", 20), 50)
            fake_embeddings = np.random.randn(n_points, 2)

            import plotly.express as px
            fig = px.scatter(
                x=fake_embeddings[:, 0],
                y=fake_embeddings[:, 1],
                title="Embedding Space (Demo - PCA projection)"
            )
            st.plotly_chart(fig, use_container_width=True)

    with tab2:
        st.markdown("### Compute Similarity Matrix")

        items = st.text_area("Enter items (one per line)", height=150)
        modality = st.selectbox("Modality", ["text", "smiles", "protein"], key="sim_mod")

        if st.button("🔢 Compute Matrix"):
            if items:
                obm = components["obm"]
                item_list = [i.strip() for i in items.strip().split("\n") if i.strip()]

                # Dispatch to the modality-specific encoder.
                if modality == "text":
                    embeddings = obm.encode_text(item_list)
                elif modality == "smiles":
                    embeddings = obm.encode_smiles(item_list)
                else:
                    embeddings = obm.encode_protein(item_list)

                vectors = np.array([e.vector for e in embeddings])

                # Compute similarity: cosine via L2 normalization (clip avoids
                # division by zero for all-zero vectors).
                norms = np.linalg.norm(vectors, axis=1, keepdims=True)
                normalized = vectors / np.clip(norms, 1e-9, None)
                similarity = np.dot(normalized, normalized.T)

                import plotly.figure_factory as ff
                # Truncate labels so the heatmap axes stay readable.
                labels = [i[:20] for i in item_list]
                fig = ff.create_annotated_heatmap(
                    similarity,
                    x=labels,
                    y=labels,
                    colorscale='RdBu'
                )
                st.plotly_chart(fig, use_container_width=True)

    with tab3:
        st.markdown("### Molecule Visualization")

        smiles = st.text_input("Enter SMILES", placeholder="CC(=O)OC1=CC=CC=C1C(=O)O")

        if smiles:
            try:
                # RDKit is optional; fall through to a hint if not installed.
                from rdkit import Chem
                from rdkit.Chem import Draw

                mol = Chem.MolFromSmiles(smiles)
                if mol:
                    img = Draw.MolToImage(mol, size=(400, 300))
                    st.image(img, caption=f"Molecule: {smiles}")
                else:
                    st.error("Invalid SMILES")
            except ImportError:
                st.warning("RDKit not installed. Install with: pip install rdkit")
+ st.warning("RDKit not installed. Install with: pip install rdkit")
401
+
402
+
403
def render_pipeline_page(components):
    """Render pipeline demo page (runs the full discovery workflow)."""
    st.markdown('<p class="main-header">🔬 Pipeline Demo</p>', unsafe_allow_html=True)

    st.markdown("""
    Run a complete discovery workflow that:
    1. Searches for related literature
    2. Finds similar molecules
    3. Validates candidates
    4. Analyzes result diversity
    """)

    query = st.text_input("Enter discovery query", placeholder="e.g., KRAS inhibitor for lung cancer")

    col1, col2 = st.columns(2)
    with col1:
        query_mod = st.selectbox("Query modality", ["text", "smiles", "protein"])
    with col2:
        target_mod = st.selectbox("Target modality", ["smiles", "text", "protein"])

    if st.button("🚀 Run Discovery Pipeline", type="primary"):
        if query:
            pipeline = components["pipeline"]

            with st.spinner("Running pipeline..."):
                results = pipeline.run_discovery_workflow(
                    query=query,
                    query_modality=query_mod,
                    target_modality=target_mod
                )

            st.markdown("## 📊 Pipeline Results")

            # Each stage below reads results["stages"][<stage>] defensively,
            # falling back to an empty value when the stage produced nothing.

            # Literature
            with st.expander("📚 Related Literature", expanded=True):
                lit = results.get("stages", {}).get("literature", [])
                if lit:
                    for item in lit:
                        st.markdown(f"- **Score: {item['score']:.3f}** - {item['content'][:100]}...")
                else:
                    st.info("No literature found")

            # Molecules
            with st.expander("🧪 Similar Molecules", expanded=True):
                mols = results.get("stages", {}).get("molecules", [])
                if mols:
                    df = pd.DataFrame(mols)
                    st.dataframe(df)
                else:
                    st.info("No molecules found")

            # Validation
            with st.expander("✅ Validation Results"):
                val = results.get("stages", {}).get("validation", [])
                if val:
                    st.json(val)
                else:
                    st.info("No validation performed")

            # Diversity
            with st.expander("📈 Diversity Analysis"):
                div = results.get("stages", {}).get("diversity", {})
                if div:
                    col1, col2, col3 = st.columns(3)
                    col1.metric("Mean Similarity", f"{div.get('mean_similarity', 0):.3f}")
                    col2.metric("Diversity Score", f"{div.get('diversity_score', 0):.3f}")
                    col3.metric("Modalities", len(div.get('modality_distribution', {})))
                    st.json(div)
+
472
+
473
def render_docs_page():
    """Render documentation page (static markdown only; takes no components)."""
    st.markdown('<p class="main-header">📚 Documentation</p>', unsafe_allow_html=True)

    st.markdown("""
    ## BioFlow + OpenBioMed Integration

    ### 🎯 Overview

    BioFlow is a multimodal biological intelligence framework that leverages OpenBioMed (OBM)
    for encoding biological data and Qdrant for vector storage and retrieval.

    ### 🧩 Components

    | Component | Description |
    |-----------|-------------|
    | **OBMWrapper** | Encodes text, molecules (SMILES), and proteins into a shared vector space |
    | **QdrantManager** | Manages vector storage, indexing, and similarity search |
    | **BioFlowPipeline** | Orchestrates agents in discovery workflows |
    | **Visualizer** | Creates plots for embeddings, similarities, and molecules |

    ### 🔌 API Examples

    ```python
    from bioflow import OBMWrapper, QdrantManager, BioFlowPipeline

    # Initialize
    obm = OBMWrapper(device="cuda")
    qdrant = QdrantManager(obm, qdrant_path="./data/qdrant")

    # Encode different modalities
    text_vec = obm.encode_text("KRAS inhibitor for cancer")
    mol_vec = obm.encode_smiles("CCO")
    prot_vec = obm.encode_protein("MTEYKLVVV...")

    # Cross-modal search
    results = qdrant.cross_modal_search(
        query="anti-inflammatory drug",
        query_modality="text",
        target_modality="smiles",
        limit=10
    )
    ```

    ### 🌟 Key Features

    1. **Unified Embedding Space**: All modalities map to the same vector dimension
    2. **Cross-Modal Search**: Find molecules from text queries and vice versa
    3. **Pipeline Orchestration**: Chain agents for complex discovery workflows
    4. **Mock Mode**: Test without GPU using deterministic random embeddings

    ### 📁 File Structure

    ```
    bioflow/
    ├── __init__.py          # Package exports
    ├── obm_wrapper.py       # OBM encoding interface
    ├── qdrant_manager.py    # Qdrant operations
    ├── pipeline.py          # Workflow orchestration
    ├── visualizer.py        # Visualization utilities
    └── app.py               # Streamlit interface
    ```
    """)
+
537
+
538
def main():
    """Main application entry point."""
    mode, use_mock = render_sidebar()

    # Initialize (cached) components; bail out early if setup failed.
    components = init_bioflow(use_mock=use_mock)
    if not components.get("ready"):
        st.error("System not ready. Check configuration.")
        return

    # Display collection stats in sidebar
    stats = components["qdrant"].get_collection_info()
    st.sidebar.metric("📊 Vectors", stats.get("points_count", 0))
    st.sidebar.metric("📐 Dimension", stats.get("vector_size", 768))

    # Route on the first keyword found in the selected mode label
    # (same precedence order as the original if/elif chain).
    routes = (
        ("Search", render_search_page),
        ("Ingestion", render_ingestion_page),
        ("Cross-Modal", render_crossmodal_page),
        ("Visualization", render_visualization_page),
        ("Pipeline", render_pipeline_page),
    )
    for keyword, page in routes:
        if keyword in mode:
            page(components)
            return
    if "Documentation" in mode:
        render_docs_page()
567
+
568
+
569
# Script entry point: `streamlit run bioflow/app.py` executes this module.
if __name__ == "__main__":
    main()
bioflow/core/__init__.py ADDED
@@ -0,0 +1,87 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Core
3
+ =============
4
+
5
+ Core abstractions and orchestration for the BioFlow platform.
6
+
7
+ Public API:
8
+ - Modality: Enum of supported data types
9
+ - BioEncoder, BioPredictor, BioGenerator, BioRetriever: Abstract interfaces
10
+ - EmbeddingResult, PredictionResult, RetrievalResult: Data containers
11
+ - ToolRegistry: Central tool management
12
+ - BioFlowOrchestrator: Pipeline execution engine
13
+ - WorkflowConfig, NodeConfig: Configuration classes
14
+ """
15
+
16
+ from bioflow.core.base import (
17
+ Modality,
18
+ BioEncoder,
19
+ BioPredictor,
20
+ BioGenerator,
21
+ BioRetriever,
22
+ BioTool,
23
+ EmbeddingResult,
24
+ PredictionResult,
25
+ RetrievalResult,
26
+ )
27
+
28
+ from bioflow.core.registry import ToolRegistry
29
+
30
+ from bioflow.core.orchestrator import (
31
+ BioFlowOrchestrator,
32
+ ExecutionContext,
33
+ PipelineResult,
34
+ )
35
+
36
+ from bioflow.core.config import (
37
+ NodeType,
38
+ NodeConfig,
39
+ WorkflowConfig,
40
+ EncoderConfig,
41
+ VectorDBConfig,
42
+ BioFlowConfig,
43
+ )
44
+
45
+ from bioflow.core.nodes import (
46
+ EncodeNode,
47
+ RetrieveNode,
48
+ PredictNode,
49
+ IngestNode,
50
+ FilterNode,
51
+ TraceabilityNode,
52
+ )
53
+
54
+ __all__ = [
55
+ # Enums
56
+ "Modality",
57
+ "NodeType",
58
+ # Abstract interfaces
59
+ "BioEncoder",
60
+ "BioPredictor",
61
+ "BioGenerator",
62
+ "BioRetriever",
63
+ "BioTool",
64
+ # Data containers
65
+ "EmbeddingResult",
66
+ "PredictionResult",
67
+ "RetrievalResult",
68
+ # Registry
69
+ "ToolRegistry",
70
+ # Orchestrator
71
+ "BioFlowOrchestrator",
72
+ "ExecutionContext",
73
+ "PipelineResult",
74
+ # Config
75
+ "NodeConfig",
76
+ "WorkflowConfig",
77
+ "EncoderConfig",
78
+ "VectorDBConfig",
79
+ "BioFlowConfig",
80
+ # Nodes
81
+ "EncodeNode",
82
+ "RetrieveNode",
83
+ "PredictNode",
84
+ "IngestNode",
85
+ "FilterNode",
86
+ "TraceabilityNode",
87
+ ]
bioflow/core/base.py ADDED
@@ -0,0 +1,247 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Core Abstractions
3
+ ==========================
4
+
5
+ Defines the fundamental interfaces for all tools in the BioFlow platform.
6
+ All encoders, predictors, generators, and retrievers must implement these.
7
+
8
+ Open-Source Models Supported:
9
+ - Text: PubMedBERT, SciBERT, Specter
10
+ - Molecules: ChemBERTa, RDKit FP
11
+ - Proteins: ESM-2, ProtBERT
12
+ - Images: CLIP, BioMedCLIP
13
+ """
14
+
15
+ from abc import ABC, abstractmethod
16
+ from typing import Any, Dict, List, Optional, Union
17
+ from dataclasses import dataclass, field
18
+ from enum import Enum
19
+
20
+
21
class Modality(Enum):
    """Supported data modalities in BioFlow.

    Member values are the lowercase strings used as canonical modality
    identifiers in configs, payloads, and node parameters.
    """
    TEXT = "text"            # free text (e.g. biomedical literature)
    SMILES = "smiles"        # small-molecule SMILES strings
    PROTEIN = "protein"      # amino-acid sequences
    IMAGE = "image"          # image inputs
    GENOMIC = "genomic"      # genomic data
    STRUCTURE = "structure"  # structural data
29
+
30
+
31
@dataclass
class EmbeddingResult:
    """Result of an encoding operation."""
    vector: List[float]   # the embedding values
    modality: Modality    # modality of the encoded input
    dimension: int        # declared vector dimension (not validated against len(vector))
    metadata: Dict[str, Any] = field(default_factory=dict)  # encoder-specific extras

    def __len__(self):
        # The result's length is the length of its vector.
        return len(self.vector)
41
+
42
+
43
@dataclass
class PredictionResult:
    """Result of a prediction operation (e.g. a drug-target interaction score)."""
    score: float                        # primary prediction score
    label: Optional[str] = None         # optional class label
    confidence: Optional[float] = None  # optional confidence value
    metadata: Dict[str, Any] = field(default_factory=dict)  # predictor-specific extras
50
+
51
+
52
@dataclass
class RetrievalResult:
    """Single hit from a retrieval/search operation."""
    id: str              # identifier of the stored item
    score: float         # similarity score reported by the backend
    content: Any         # raw stored content (text, SMILES, sequence, ...)
    modality: Modality   # modality of the stored content
    payload: Dict[str, Any] = field(default_factory=dict)  # metadata stored with the vector
60
+
61
+
62
class BioEncoder(ABC):
    """
    Contract for tools that turn biological data into fixed-size vectors.

    Known implementations:
    - OBMEncoder: Multimodal (text, SMILES, protein)
    - ESM2Encoder: Protein sequences
    - ChemBERTaEncoder: SMILES molecules
    - PubMedBERTEncoder: Biomedical text
    - CLIPEncoder: Images

    Example:
        >>> encoder = ESM2Encoder(device="cuda")
        >>> result = encoder.encode("MKTVRQERLKSIVRILERSKEPVSG", Modality.PROTEIN)
        >>> print(len(result.vector))  # 1280
    """

    @abstractmethod
    def encode(self, content: Any, modality: Modality) -> EmbeddingResult:
        """
        Encode a single piece of content into a vector representation.

        Args:
            content: Raw input (text, SMILES string, protein sequence, etc.)
            modality: Type of the input data

        Returns:
            EmbeddingResult carrying the vector plus metadata.
        """
        ...

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Dimensionality of the vectors this encoder produces."""
        ...

    @property
    def supported_modalities(self) -> List[Modality]:
        """Modalities this encoder accepts; subclasses should override."""
        return [Modality.TEXT]  # Override in subclasses

    def batch_encode(self, contents: List[Any], modality: Modality) -> List[EmbeddingResult]:
        """Encode many items; the default simply loops over encode().

        Subclasses may override for optimized batch processing.
        """
        return [self.encode(item, modality) for item in contents]
107
+
108
+
109
class BioPredictor(ABC):
    """
    Contract for models that score properties, affinities, or interactions.

    Known implementations:
    - DeepPurposePredictor: DTI prediction
    - ToxicityPredictor: ADMET properties
    - BindingAffinityPredictor: Kd/Ki estimation

    Example:
        >>> predictor = DeepPurposePredictor()
        >>> result = predictor.predict(drug="CCO", target="MKTVRQ...")
        >>> print(result.score)  # 0.85
    """

    @abstractmethod
    def predict(self, drug: str, target: str) -> PredictionResult:
        """
        Score the interaction/property for one drug-target pair.

        Args:
            drug: SMILES string of drug molecule
            target: Protein sequence or identifier

        Returns:
            PredictionResult carrying the score and metadata.
        """
        ...

    def batch_predict(self, pairs: List[tuple]) -> List[PredictionResult]:
        """Score many (drug, target) pairs; the default loops over predict()."""
        return [self.predict(drug, target) for drug, target in pairs]
141
+
142
+
143
class BioGenerator(ABC):
    """
    Interface for tools that generate new biological candidates.

    Implementations:
    - MoleculeGenerator: SMILES generation
    - ProteinGenerator: Sequence design
    - VariantGenerator: Mutation suggestions

    Example:
        >>> generator = MoleculeGenerator()
        >>> candidates = generator.generate(
        ...     seed="CCO",
        ...     constraints={"mw_max": 500, "logp_max": 5}
        ... )
    """

    @abstractmethod
    def generate(self, seed: Any, constraints: Dict[str, Any]) -> List[Any]:
        """
        Generate new candidates based on seed and constraints.

        Args:
            seed: Starting point (molecule, sequence, etc.)
            constraints: Dictionary of constraints (e.g., MW, toxicity);
                interpretation is implementation-specific.

        Returns:
            List of generated candidates
        """
        pass
173
+
174
+
175
class BioRetriever(ABC):
    """
    Interface for vector database retrieval operations.

    Implementations:
    - QdrantRetriever: Qdrant vector search
    - FAISSRetriever: FAISS similarity search

    Example:
        >>> retriever = QdrantRetriever(collection="molecules")
        >>> results = retriever.search(query_vector, limit=10)
    """

    @abstractmethod
    def search(
        self,
        query: Union[List[float], str],
        limit: int = 10,
        filters: Optional[Dict[str, Any]] = None
    ) -> List[RetrievalResult]:
        """
        Search for similar items in vector database.

        Args:
            query: Query vector, or raw content the implementation should
                encode before searching.
            limit: Maximum number of results
            filters: Metadata filters to apply (None = no filtering)

        Returns:
            List of RetrievalResult sorted by similarity
        """
        pass

    @abstractmethod
    def ingest(
        self,
        content: Any,
        modality: Modality,
        payload: Optional[Dict[str, Any]] = None
    ) -> str:
        """
        Ingest content into the vector database.

        Args:
            content: Raw content to encode and store
            modality: Type of content
            payload: Additional metadata to store alongside the vector

        Returns:
            ID of the inserted item
        """
        pass
227
+
228
+
229
class BioTool(ABC):
    """
    Generic wrapper for miscellaneous tools.

    Known implementations:
    - RDKitTool: Molecular operations
    - VisualizationTool: Plotting and visualization
    - FilterTool: Candidate filtering
    """

    @abstractmethod
    def execute(self, *args, **kwargs) -> Any:
        """Run the tool with the given arguments."""
        ...

    @property
    def name(self) -> str:
        """Tool name; defaults to the concrete class name."""
        return type(self).__name__
bioflow/core/config.py ADDED
@@ -0,0 +1,92 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Configuration Schema
3
+ =============================
4
+
5
+ Dataclasses and schemas for workflow configuration.
6
+ """
7
+
8
+ from dataclasses import dataclass, field
9
+ from typing import Dict, List, Any, Optional
10
+ from enum import Enum
11
+
12
+
13
class NodeType(Enum):
    """Types of nodes in a BioFlow pipeline.

    Values are the lowercase strings accepted in workflow configs
    (NodeConfig.__post_init__ coerces a plain string to this enum).
    """
    ENCODE = "encode"      # Vectorize input using encoder
    RETRIEVE = "retrieve"  # Search vector DB for neighbors
    PREDICT = "predict"    # Run prediction model
    GENERATE = "generate"  # Generate new candidates
    FILTER = "filter"      # Filter/rank candidates
    CUSTOM = "custom"      # User-defined function
21
+
22
+
23
@dataclass
class NodeConfig:
    """Configuration for a single pipeline node.

    ``type`` may be supplied as a NodeType or as its string value; the
    string form is coerced to the enum in __post_init__ so that configs
    loaded from YAML/JSON can use plain strings.
    """
    id: str           # unique node identifier within the workflow
    type: NodeType    # node kind (a str is coerced below)
    tool: str  # Name of registered tool
    inputs: List[str] = field(default_factory=list)  # Node IDs or "input"
    params: Dict[str, Any] = field(default_factory=dict)  # node-specific parameters

    def __post_init__(self):
        # Accept the string form of the node type (raises ValueError for
        # unknown values, surfacing config typos early).
        if isinstance(self.type, str):
            self.type = NodeType(self.type)
35
+
36
+
37
@dataclass
class WorkflowConfig:
    """Configuration for an entire workflow."""
    name: str
    description: str = ""
    nodes: List[NodeConfig] = field(default_factory=list)
    output_node: str = ""  # ID of final node
    metadata: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "WorkflowConfig":
        """Build a WorkflowConfig from a plain dictionary (e.g. loaded YAML).

        Dict entries under "nodes" are converted to NodeConfig; anything
        else (already-built NodeConfig objects) is kept as-is.
        """
        parsed_nodes = []
        for entry in data.get("nodes", []):
            parsed_nodes.append(NodeConfig(**entry) if isinstance(entry, dict) else entry)
        return cls(
            name=data.get("name", "unnamed"),
            description=data.get("description", ""),
            nodes=parsed_nodes,
            output_node=data.get("output_node", ""),
            metadata=data.get("metadata", {})
        )
60
+
61
+
62
@dataclass
class EncoderConfig:
    """Configuration for an encoder."""
    name: str                         # registry name for this encoder
    model_type: str                   # e.g. "esm2", "pubmedbert", "chemberta"
    model_path: Optional[str] = None  # checkpoint/path; None = implementation default
    device: str = "cpu"               # device string (e.g. "cpu", "cuda")
    dimension: int = 768              # embedding dimension the encoder emits
    modalities: List[str] = field(default_factory=list)  # supported modality names
    extra: Dict[str, Any] = field(default_factory=dict)  # encoder-specific options
72
+
73
+
74
@dataclass
class VectorDBConfig:
    """Configuration for vector database."""
    provider: str = "qdrant"  # qdrant, faiss, etc.
    url: Optional[str] = None   # remote server URL (presumably exclusive with `path` — confirm against manager)
    path: Optional[str] = None  # local storage path
    default_collection: str = "bioflow_memory"  # collection used when none is specified
    distance_metric: str = "cosine"  # similarity metric name
82
+
83
+
84
@dataclass
class BioFlowConfig:
    """Master configuration for entire BioFlow system."""
    project_name: str = "BioFlow"
    encoders: Dict[str, EncoderConfig] = field(default_factory=dict)    # encoder name -> config
    vector_db: VectorDBConfig = field(default_factory=VectorDBConfig)   # vector DB settings
    workflows: Dict[str, WorkflowConfig] = field(default_factory=dict)  # workflow name -> config
    default_encoder: str = "default"  # key into `encoders` used when unspecified
    log_level: str = "INFO"           # logging level name
bioflow/core/nodes.py ADDED
@@ -0,0 +1,465 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Workflow Nodes
3
+ =======================
4
+
5
+ Typed node implementations for the BioFlow orchestrator.
6
+ Each node wraps a specific operation in the discovery pipeline.
7
+ """
8
+
9
+ import logging
10
+ from typing import List, Dict, Any, Optional, Union
11
+ from dataclasses import dataclass, field
12
+ from datetime import datetime
13
+ from abc import ABC, abstractmethod
14
+
15
+ from bioflow.core import (
16
+ Modality,
17
+ BioEncoder,
18
+ BioPredictor,
19
+ BioRetriever,
20
+ EmbeddingResult,
21
+ PredictionResult,
22
+ RetrievalResult,
23
+ )
24
+
25
+ logger = logging.getLogger(__name__)
26
+
27
+
28
@dataclass
class NodeResult:
    """Container for the output of a single node execution."""
    node_id: str      # id of the node that produced this result
    node_type: str    # kind of node ("encode", "retrieve", ...)
    data: Any         # node output payload
    metadata: Dict[str, Any] = field(default_factory=dict)
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())

    def __repr__(self):
        # Sized outputs report their length; scalars count as one item.
        count = len(self.data) if hasattr(self.data, '__len__') else 1
        return f"NodeResult({self.node_type}: {count} items)"
39
+
40
+
41
class BaseNode(ABC):
    """Base class for all workflow nodes.

    A node is identified by ``node_id`` and implements ``execute`` to
    transform its input into a NodeResult; ``node_type`` labels the node
    kind in results.
    """

    def __init__(self, node_id: str):
        self.node_id = node_id  # recorded in every NodeResult for provenance

    @property
    @abstractmethod
    def node_type(self) -> str:
        """Short string naming this node kind (e.g. "encode")."""
        pass

    @abstractmethod
    def execute(self, input_data: Any, context: Dict[str, Any] = None) -> NodeResult:
        """Run the node on ``input_data`` with an optional per-run ``context``."""
        pass
55
+
56
+
57
class EncodeNode(BaseNode):
    """
    Turns raw content into vector embeddings.

    Input: raw content (text, SMILES, protein sequence) or a list of items
    Output: a single EmbeddingResult, or a list of them for batch input

    When ``auto_detect`` is set and the encoder exposes ``encode_auto``,
    the encoder picks the modality itself; otherwise the node's configured
    modality is used.
    """

    def __init__(
        self,
        node_id: str,
        encoder: BioEncoder,
        modality: Modality = Modality.TEXT,
        auto_detect: bool = False
    ):
        super().__init__(node_id)
        self.encoder = encoder
        self.modality = modality
        self.auto_detect = auto_detect

    @property
    def node_type(self) -> str:
        return "encode"

    def execute(self, input_data: Any, context: Dict[str, Any] = None) -> NodeResult:
        """Encode ``input_data`` (single item or batch) into embeddings."""
        context = context or {}

        use_auto = self.auto_detect and hasattr(self.encoder, 'encode_auto')

        if isinstance(input_data, list):
            # Batch path: auto-detection encodes item by item; otherwise use
            # the encoder's (possibly optimized) batch API.
            if use_auto:
                data = [self.encoder.encode_auto(item) for item in input_data]
            else:
                data = self.encoder.batch_encode(input_data, self.modality)
        else:
            data = (
                self.encoder.encode_auto(input_data)
                if use_auto
                else self.encoder.encode(input_data, self.modality)
            )

        return NodeResult(
            node_id=self.node_id,
            node_type=self.node_type,
            data=data,
            metadata={"modality": self.modality.value, "auto_detect": self.auto_detect}
        )
105
+
106
+
107
class RetrieveNode(BaseNode):
    """
    Retrieves similar items from vector database.

    Input: Query (string or embedding)
    Output: List of RetrievalResults

    NOTE(review): this node calls ``retriever.search`` with ``collection``
    and ``modality`` keyword arguments that the BioRetriever ABC does not
    declare — it assumes a concrete retriever with an extended search
    signature (e.g. the Qdrant manager). Confirm against the retriever used.
    """

    def __init__(
        self,
        node_id: str,
        retriever: BioRetriever,
        collection: str = None,
        limit: int = 10,
        modality: Modality = Modality.TEXT,
        filters: Dict[str, Any] = None
    ):
        super().__init__(node_id)
        self.retriever = retriever
        self.collection = collection  # target collection; None = retriever default
        self.limit = limit            # default max results (context may override)
        self.modality = modality      # modality hint forwarded to search()
        self.filters = filters or {}  # base metadata filters (merged with context filters)

    @property
    def node_type(self) -> str:
        return "retrieve"

    def execute(self, input_data: Any, context: Dict[str, Any] = None) -> NodeResult:
        """Retrieve similar items.

        Per-run ``context`` may override ``limit`` and contribute extra
        ``filters``; context filters win on key collisions.
        """
        context = context or {}

        # Override from context if provided
        limit = context.get("limit", self.limit)
        filters = {**self.filters, **context.get("filters", {})}

        # Handle EmbeddingResult input: unwrap to the raw vector.
        if isinstance(input_data, EmbeddingResult):
            query = input_data.vector
        else:
            query = input_data

        results = self.retriever.search(
            query=query,
            limit=limit,
            filters=filters if filters else None,
            collection=self.collection,
            modality=self.modality
        )

        return NodeResult(
            node_id=self.node_id,
            node_type=self.node_type,
            data=results,
            metadata={
                "count": len(results),
                "collection": self.collection,
                "filters": filters
            }
        )
167
+
168
+
169
class PredictNode(BaseNode):
    """
    Runs predictions on drug-target pairs.

    Input: List of candidates (from retrieval) or direct (drug, target) pairs
    Output: List of PredictionResults with scores

    The protein target comes from the per-run context ("target") or, failing
    that, from the node's configured ``target_sequence``. In batch mode,
    predictions below ``threshold`` are dropped and failures are skipped;
    results are returned sorted by score, descending.
    """

    def __init__(
        self,
        node_id: str,
        predictor: BioPredictor,
        target_sequence: str = None,
        drug_field: str = "content",
        threshold: float = 0.0
    ):
        super().__init__(node_id)
        self.predictor = predictor
        self.target_sequence = target_sequence  # default target; context["target"] overrides
        self.drug_field = drug_field            # dict key that holds the drug SMILES
        self.threshold = threshold              # minimum score kept in batch mode

    @property
    def node_type(self) -> str:
        return "predict"

    def execute(self, input_data: Any, context: Dict[str, Any] = None) -> NodeResult:
        """Run predictions.

        Raises:
            ValueError: If no target is available from context or config.
        """
        context = context or {}
        target = context.get("target", self.target_sequence)

        if not target:
            raise ValueError("Target sequence is required for prediction")

        predictions = []

        # Handle different input types
        if isinstance(input_data, list):
            for item in input_data:
                # Extract drug from RetrievalResult or dict
                if isinstance(item, RetrievalResult):
                    drug = item.content
                    source_id = item.id
                elif isinstance(item, dict):
                    drug = item.get(self.drug_field, item.get("smiles", ""))
                    source_id = item.get("id", "unknown")
                else:
                    drug = str(item)
                    source_id = "unknown"

                try:
                    result = self.predictor.predict(drug, target)
                    if result.score >= self.threshold:
                        predictions.append({
                            "drug": drug,
                            "source_id": source_id,
                            "prediction": result,
                            "score": result.score
                        })
                except Exception as e:
                    # Best-effort batch: a failed candidate is logged and
                    # skipped rather than aborting the whole batch.
                    logger.warning(f"Prediction failed for {drug[:20]}...: {e}")
        else:
            # Single-item input: no source_id, and the threshold is not applied.
            result = self.predictor.predict(str(input_data), target)
            predictions.append({
                "drug": str(input_data),
                "prediction": result,
                "score": result.score
            })

        # Sort by score
        predictions.sort(key=lambda x: x["score"], reverse=True)

        return NodeResult(
            node_id=self.node_id,
            node_type=self.node_type,
            data=predictions,
            metadata={
                "count": len(predictions),
                "threshold": self.threshold,
                "target_length": len(target) if target else 0
            }
        )
251
+
252
+
253
class IngestNode(BaseNode):
    """
    Writes items into the vector database.

    Input: a single item or a list of items (dicts or raw strings)
    Output: the IDs assigned to the ingested points

    For dict items the stored content is taken from ``content_field``
    (falling back to "smiles", then "sequence"); the remaining keys become
    the stored payload.
    """

    def __init__(
        self,
        node_id: str,
        retriever: BioRetriever,
        collection: str = None,
        modality: Modality = Modality.TEXT,
        content_field: str = "content"
    ):
        super().__init__(node_id)
        self.retriever = retriever
        self.collection = collection
        self.modality = modality
        self.content_field = content_field

    @property
    def node_type(self) -> str:
        return "ingest"

    def execute(self, input_data: Any, context: Dict[str, Any] = None) -> NodeResult:
        """Encode and store the input; returns the new point IDs."""
        context = context or {}
        inserted = []

        if isinstance(input_data, list):
            for entry in input_data:
                if isinstance(entry, dict):
                    text = entry.get(self.content_field, entry.get("smiles", entry.get("sequence", "")))
                    meta = {k: v for k, v in entry.items() if k != self.content_field}
                else:
                    text, meta = str(entry), {}
                inserted.append(self.retriever.ingest(
                    content=text,
                    modality=self.modality,
                    payload=meta,
                    collection=self.collection
                ))
        else:
            # Single item: any payload comes from the run context.
            inserted.append(self.retriever.ingest(
                content=str(input_data),
                modality=self.modality,
                payload=context.get("payload", {}),
                collection=self.collection
            ))

        return NodeResult(
            node_id=self.node_id,
            node_type=self.node_type,
            data=inserted,
            metadata={"count": len(inserted), "collection": self.collection}
        )
315
+
316
+
317
class FilterNode(BaseNode):
    """
    Filters and ranks a list of scored items.

    Input: list of items (dicts or objects carrying a score)
    Output: items with score >= threshold, sorted descending, optionally
    truncated to ``top_k``.
    """

    def __init__(
        self,
        node_id: str,
        score_field: str = "score",
        threshold: float = 0.5,
        top_k: int = None,
        diversity: float = 0.0  # For MMR-style diversification
    ):
        super().__init__(node_id)
        self.score_field = score_field
        self.threshold = threshold
        self.top_k = top_k
        # NOTE(review): `diversity` is stored but never applied in execute();
        # MMR-style diversification remains unimplemented.
        self.diversity = diversity

    @property
    def node_type(self) -> str:
        return "filter"

    def _get_score(self, item: Any) -> float:
        """Pull the ranking score out of a dict or attribute-bearing object."""
        if isinstance(item, dict):
            return item.get(self.score_field, 0)
        if hasattr(item, self.score_field):
            return getattr(item, self.score_field)
        if hasattr(item, 'score'):
            return item.score
        return 0

    def execute(self, input_data: Any, context: Dict[str, Any] = None) -> NodeResult:
        """Apply the threshold, sort by score (descending), truncate to top_k."""
        context = context or {}

        items = input_data if isinstance(input_data, list) else [input_data]

        # Keep only items meeting the threshold, best first.
        kept = sorted(
            (item for item in items if self._get_score(item) >= self.threshold),
            key=self._get_score,
            reverse=True,
        )

        if self.top_k:
            kept = kept[:self.top_k]

        return NodeResult(
            node_id=self.node_id,
            node_type=self.node_type,
            data=kept,
            metadata={
                "input_count": len(items),
                "output_count": len(kept),
                "threshold": self.threshold
            }
        )
380
+
381
+
382
class TraceabilityNode(BaseNode):
    """
    Adds evidence linking and provenance to results.

    Input: Results with source IDs
    Output: Results enriched with evidence links

    Each dict/RetrievalResult gains an ``evidence_links`` mapping
    (source name -> URL) and a ``has_evidence`` flag. Links are derived
    both from known lowercase keys in the item's payload and from the
    prefix of its source ID. Items of other types pass through unchanged.
    """

    def __init__(
        self,
        node_id: str,
        source_mapping: Dict[str, str] = None  # Maps ID prefixes to URL templates with an {id} placeholder
    ):
        super().__init__(node_id)
        # Defaults cover common public biomedical databases.
        self.source_mapping = source_mapping or {
            "PMID": "https://pubmed.ncbi.nlm.nih.gov/{id}",
            "UniProt": "https://www.uniprot.org/uniprot/{id}",
            "ChEMBL": "https://www.ebi.ac.uk/chembl/compound_report_card/{id}",
            "PubChem": "https://pubchem.ncbi.nlm.nih.gov/compound/{id}",
        }

    @property
    def node_type(self) -> str:
        return "trace"

    def _generate_evidence_link(self, source_id: str, payload: Dict[str, Any]) -> Dict[str, str]:
        """Generate evidence links from source ID and payload.

        Looks for lowercase payload keys ("pmid", "pmid_id", ...), then
        falls back to matching the source_id's prefix.
        """
        links = {}

        # Check for known ID types in payload
        for key, url_template in self.source_mapping.items():
            if key.lower() in payload:
                links[key] = url_template.format(id=payload[key.lower()])
            elif f"{key.lower()}_id" in payload:
                links[key] = url_template.format(id=payload[f"{key.lower()}_id"])

        # Check source_id prefix.
        # Fix: strip only the *leading* prefix via slicing. The previous
        # str.replace(prefix, "") removed every occurrence of the prefix
        # anywhere in the ID, corrupting IDs that contain it again later.
        for prefix, url_template in self.source_mapping.items():
            if source_id.startswith(prefix):
                id_part = source_id[len(prefix):].lstrip("_:-")
                links[prefix] = url_template.format(id=id_part)

        return links

    def execute(self, input_data: Any, context: Dict[str, Any] = None) -> NodeResult:
        """Add evidence links to results.

        Dicts are shallow-copied and enriched; RetrievalResults are
        converted to enriched dicts; other items pass through untouched.
        """
        context = context or {}

        if not isinstance(input_data, list):
            input_data = [input_data]

        enriched = []
        for item in input_data:
            if isinstance(item, dict):
                source_id = item.get("source_id", item.get("id", ""))
                payload = item.get("payload", item)
                evidence = self._generate_evidence_link(source_id, payload)

                enriched_item = {
                    **item,
                    "evidence_links": evidence,
                    "has_evidence": len(evidence) > 0
                }
                enriched.append(enriched_item)
            elif isinstance(item, RetrievalResult):
                evidence = self._generate_evidence_link(item.id, item.payload)
                enriched.append({
                    "id": item.id,
                    "content": item.content,
                    "score": item.score,
                    "modality": item.modality.value,
                    "payload": item.payload,
                    "evidence_links": evidence,
                    "has_evidence": len(evidence) > 0
                })
            else:
                enriched.append(item)

        return NodeResult(
            node_id=self.node_id,
            node_type=self.node_type,
            data=enriched,
            metadata={"with_evidence": sum(1 for e in enriched if isinstance(e, dict) and e.get("has_evidence", False))}
        )
bioflow/core/orchestrator.py ADDED
@@ -0,0 +1,303 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Orchestrator
3
+ =====================
4
+
5
+ Stateful pipeline engine that manages the flow of data through
6
+ registered tools, forming a Directed Acyclic Graph (DAG) of operations.
7
+ """
8
+
9
+ import logging
10
+ from typing import Dict, List, Any, Optional, Callable
11
+ from dataclasses import dataclass, field
12
+ from datetime import datetime
13
+ from collections import defaultdict
14
+
15
+ from typing import Optional as OptionalType
16
+ from bioflow.core.base import BioEncoder, BioPredictor, BioGenerator, Modality
17
+ from bioflow.core.config import NodeConfig, WorkflowConfig, NodeType
18
+ from bioflow.core.registry import ToolRegistry
19
+
20
+ # Re-import Optional with a different name to avoid conflicts
21
+ from typing import Optional
22
+
23
+ logging.basicConfig(level=logging.INFO)
24
+ logger = logging.getLogger(__name__)
25
+
26
+
27
@dataclass
class ExecutionContext:
    """Mutable state threaded through a single pipeline run."""
    workflow_id: str                                        # which workflow is running
    start_time: datetime = field(default_factory=datetime.now)
    node_outputs: Dict[str, Any] = field(default_factory=dict)  # node_id -> output
    metadata: Dict[str, Any] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)         # accumulated error messages

    def get_input(self, node_id: str) -> Any:
        """Return the stored output of ``node_id`` (None if absent)."""
        return self.node_outputs.get(node_id)

    def set_output(self, node_id: str, value: Any):
        """Record ``value`` as the output of ``node_id``."""
        self.node_outputs[node_id] = value
43
+
44
+
45
@dataclass
class PipelineResult:
    """Final result of workflow execution."""
    success: bool              # True when the run completed without fatal error
    output: Any                # pipeline output (presumably the final node's output — confirm in the run method)
    context: ExecutionContext  # full execution context (per-node outputs, errors)
    duration_ms: float         # wall-clock duration in milliseconds

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to a plain dict; node outputs are stringified and truncated to 100 chars."""
        return {
            "success": self.success,
            "output": self.output,
            "duration_ms": self.duration_ms,
            "errors": self.context.errors,
            "node_outputs": {k: str(v)[:100] for k, v in self.context.node_outputs.items()}
        }
61
+
62
+
63
+ class BioFlowOrchestrator:
64
+ """
65
+ Main orchestration engine for BioFlow pipelines.
66
+
67
+ Responsibilities:
68
+ - Parse workflow configurations
69
+ - Build execution DAG from node dependencies
70
+ - Execute nodes in topological order
71
+ - Manage state between nodes
72
+ - Handle errors and retries
73
+
74
+ Example:
75
+ >>> orchestrator = BioFlowOrchestrator()
76
+ >>> orchestrator.register_workflow(workflow_config)
77
+ >>> result = orchestrator.run("my_workflow", input_data)
78
+ """
79
+
80
    def __init__(self, registry: Optional[ToolRegistry] = None):
        """
        Initialize orchestrator.

        Args:
            registry: Tool registry instance. Uses global if None.

        NOTE(review): when ``registry`` is None this stores the ToolRegistry
        *class* itself, not an instance — subsequent calls such as
        ``self.registry.get_encoder(...)`` therefore assume those lookups
        work at class level; confirm against bioflow.core.registry.
        """
        self.registry = registry if registry is not None else ToolRegistry
        self.workflows: Dict[str, WorkflowConfig] = {}
        self.custom_handlers: Dict[str, Callable] = {}
        self._retriever = None  # Qdrant manager reference (set via set_retriever)
91
+
92
    def set_retriever(self, retriever):
        """Set the vector DB retriever (QdrantManager) used by RETRIEVE nodes."""
        self._retriever = retriever
95
+
96
    def register_workflow(self, config: WorkflowConfig) -> None:
        """Register a workflow configuration, keyed by its name (overwrites any existing entry)."""
        self.workflows[config.name] = config
        logger.info(f"Registered workflow: {config.name} ({len(config.nodes)} nodes)")
100
+
101
    def register_custom_handler(self, name: str, handler: Callable) -> None:
        """Register a custom node handler function under ``name`` (overwrites any existing handler)."""
        self.custom_handlers[name] = handler
        logger.info(f"Registered custom handler: {name}")
105
+
106
+ def _build_execution_order(self, config: WorkflowConfig) -> List[NodeConfig]:
107
+ """
108
+ Build topological execution order from node dependencies.
109
+
110
+ Returns nodes sorted so dependencies are executed first.
111
+ """
112
+ # Build adjacency list
113
+ in_degree = defaultdict(int)
114
+ dependents = defaultdict(list)
115
+ node_map = {node.id: node for node in config.nodes}
116
+
117
+ for node in config.nodes:
118
+ for dep in node.inputs:
119
+ if dep != "input" and dep in node_map:
120
+ dependents[dep].append(node.id)
121
+ in_degree[node.id] += 1
122
+
123
+ # Kahn's algorithm for topological sort
124
+ queue = [n.id for n in config.nodes if in_degree[n.id] == 0]
125
+ order = []
126
+
127
+ while queue:
128
+ node_id = queue.pop(0)
129
+ order.append(node_map[node_id])
130
+ for dependent in dependents[node_id]:
131
+ in_degree[dependent] -= 1
132
+ if in_degree[dependent] == 0:
133
+ queue.append(dependent)
134
+
135
+ if len(order) != len(config.nodes):
136
+ raise ValueError("Cycle detected in workflow DAG")
137
+
138
+ return order
139
+
140
+ def _execute_node(
141
+ self,
142
+ node: NodeConfig,
143
+ context: ExecutionContext,
144
+ initial_input: Any
145
+ ) -> Any:
146
+ """Execute a single node and return its output."""
147
+
148
+ # Gather inputs
149
+ inputs = []
150
+ for inp in node.inputs:
151
+ if inp == "input":
152
+ inputs.append(initial_input)
153
+ else:
154
+ inputs.append(context.get_input(inp))
155
+
156
+ # Single input case
157
+ node_input = inputs[0] if len(inputs) == 1 else inputs
158
+
159
+ logger.debug(f"Executing node: {node.id} (type={node.type.value})")
160
+
161
+ try:
162
+ if node.type == NodeType.ENCODE:
163
+ encoder = self.registry.get_encoder(node.tool)
164
+ modality = Modality(node.params.get("modality", "text"))
165
+ return encoder.encode(node_input, modality)
166
+
167
+ elif node.type == NodeType.PREDICT:
168
+ predictor = self.registry.get_predictor(node.tool)
169
+ drug: str = str(node.params.get("drug") or node_input)
170
+ target: str = str(node.params.get("target") or node.params.get("target_input") or "")
171
+ return predictor.predict(drug, target)
172
+
173
+ elif node.type == NodeType.RETRIEVE:
174
+ if self._retriever is None:
175
+ raise ValueError("No retriever configured. Call set_retriever() first.")
176
+ limit = node.params.get("limit", 5)
177
+ modality = node.params.get("modality", "text")
178
+ return self._retriever.search(
179
+ query=node_input,
180
+ query_modality=modality,
181
+ limit=limit
182
+ )
183
+
184
+ elif node.type == NodeType.GENERATE:
185
+ generator = self.registry.get_generator(node.tool)
186
+ constraints = node.params.get("constraints", {})
187
+ return generator.generate(node_input, constraints)
188
+
189
+ elif node.type == NodeType.FILTER:
190
+ # Built-in filter: expects list, applies threshold
191
+ threshold = node.params.get("threshold", 0.5)
192
+ key = node.params.get("key", "score")
193
+ if isinstance(node_input, list):
194
+ return [x for x in node_input if getattr(x, key, x.get(key, 0)) >= threshold]
195
+ return node_input
196
+
197
+ elif node.type == NodeType.CUSTOM:
198
+ if node.tool not in self.custom_handlers:
199
+ raise ValueError(f"Custom handler '{node.tool}' not registered")
200
+ handler = self.custom_handlers[node.tool]
201
+ return handler(node_input, **node.params)
202
+
203
+ else:
204
+ raise ValueError(f"Unknown node type: {node.type}")
205
+
206
+ except Exception as e:
207
+ context.errors.append(f"Node {node.id}: {str(e)}")
208
+ logger.error(f"Error in node {node.id}: {e}")
209
+ raise
210
+
211
+ def run(
212
+ self,
213
+ workflow_name: str,
214
+ input_data: Any,
215
+ metadata: Optional[Dict[str, Any]] = None
216
+ ) -> PipelineResult:
217
+ """
218
+ Execute a registered workflow.
219
+
220
+ Args:
221
+ workflow_name: Name of registered workflow
222
+ input_data: Initial input to the pipeline
223
+ metadata: Optional metadata to include in context
224
+
225
+ Returns:
226
+ PipelineResult with output and execution details
227
+ """
228
+ if workflow_name not in self.workflows:
229
+ raise ValueError(f"Workflow '{workflow_name}' not found")
230
+
231
+ config = self.workflows[workflow_name]
232
+ context = ExecutionContext(
233
+ workflow_id=workflow_name,
234
+ metadata=metadata or {}
235
+ )
236
+
237
+ start = datetime.now()
238
+
239
+ try:
240
+ # Get execution order
241
+ execution_order = self._build_execution_order(config)
242
+
243
+ # Execute each node
244
+ for node in execution_order:
245
+ output = self._execute_node(node, context, input_data)
246
+ context.set_output(node.id, output)
247
+
248
+ # Get final output
249
+ final_output = context.get_input(config.output_node) if config.output_node else output
250
+
251
+ duration = (datetime.now() - start).total_seconds() * 1000
252
+
253
+ return PipelineResult(
254
+ success=True,
255
+ output=final_output,
256
+ context=context,
257
+ duration_ms=duration
258
+ )
259
+
260
+ except Exception as e:
261
+ duration = (datetime.now() - start).total_seconds() * 1000
262
+ logger.error(f"Workflow {workflow_name} failed: {e}")
263
+
264
+ return PipelineResult(
265
+ success=False,
266
+ output=None,
267
+ context=context,
268
+ duration_ms=duration
269
+ )
270
+
271
+ def run_from_dict(
272
+ self,
273
+ workflow_dict: Dict[str, Any],
274
+ input_data: Any
275
+ ) -> PipelineResult:
276
+ """
277
+ Execute a workflow from a dictionary (e.g., loaded YAML).
278
+
279
+ Useful for ad-hoc workflows without pre-registration.
280
+ """
281
+ config = WorkflowConfig.from_dict(workflow_dict)
282
+ self.register_workflow(config)
283
+ return self.run(config.name, input_data)
284
+
285
+ def list_workflows(self) -> List[str]:
286
+ """List all registered workflows."""
287
+ return list(self.workflows.keys())
288
+
289
+ def describe_workflow(self, name: str) -> Dict[str, Any]:
290
+ """Get details about a workflow."""
291
+ if name not in self.workflows:
292
+ raise ValueError(f"Workflow '{name}' not found")
293
+
294
+ config = self.workflows[name]
295
+ return {
296
+ "name": config.name,
297
+ "description": config.description,
298
+ "nodes": [
299
+ {"id": n.id, "type": n.type.value, "tool": n.tool}
300
+ for n in config.nodes
301
+ ],
302
+ "output_node": config.output_node
303
+ }
bioflow/core/registry.py ADDED
@@ -0,0 +1,154 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Tool Registry
3
+ ======================
4
+
5
+ Central registry for all biological tools in the BioFlow platform.
6
+ Supports encoders, predictors, generators, and misc tools.
7
+ """
8
+
9
+ from typing import Dict, Type, Any, Optional, List
10
+ import logging
11
+
12
+ from bioflow.core.base import BioEncoder, BioPredictor, BioGenerator, BioTool
13
+
14
+ logger = logging.getLogger(__name__)
15
+
16
+
17
class ToolRegistry:
    """
    Central registry for all biological tools in the BioFlow platform.

    All state lives on the class itself (classmethods + class attributes),
    so the registry acts as a process-wide singleton.

    Features:
    - Register/unregister tools by name
    - Get tools with fallback to default
    - List all registered tools
    - Auto-discovery of tools from plugins directory

    Usage:
        >>> ToolRegistry.register_encoder("esm2", ESM2Encoder())
        >>> encoder = ToolRegistry.get_encoder("esm2")
    """

    _encoders: Dict[str, BioEncoder] = {}
    _predictors: Dict[str, BioPredictor] = {}
    _generators: Dict[str, BioGenerator] = {}
    _misc_tools: Dict[str, BioTool] = {}
    _default_encoder: Optional[str] = None
    _default_predictor: Optional[str] = None

    # ==================== ENCODERS ====================

    @classmethod
    def register_encoder(cls, name: str, encoder: BioEncoder, set_default: bool = False):
        """Register an encoder; the first one registered becomes the default."""
        cls._encoders[name] = encoder
        if cls._default_encoder is None or set_default:
            cls._default_encoder = name
        logger.info(f"Registered encoder: {name} (dim={encoder.dimension})")

    @classmethod
    def unregister_encoder(cls, name: str):
        """Remove an encoder; if it was the default, promote any remaining one."""
        cls._encoders.pop(name, None)
        if cls._default_encoder == name:
            cls._default_encoder = next(iter(cls._encoders), None)

    @classmethod
    def get_encoder(cls, name: str = None) -> BioEncoder:
        """Return the encoder registered as *name*, or the default when falsy."""
        chosen = name or cls._default_encoder
        if chosen in cls._encoders:
            return cls._encoders[chosen]
        available = list(cls._encoders.keys())
        raise ValueError(f"Encoder '{chosen}' not found. Available: {available}")

    # ==================== PREDICTORS ====================

    @classmethod
    def register_predictor(cls, name: str, predictor: BioPredictor, set_default: bool = False):
        """Register a predictor; the first one registered becomes the default."""
        cls._predictors[name] = predictor
        if cls._default_predictor is None or set_default:
            cls._default_predictor = name
        logger.info(f"Registered predictor: {name}")

    @classmethod
    def unregister_predictor(cls, name: str):
        """Remove a predictor; if it was the default, promote any remaining one."""
        cls._predictors.pop(name, None)
        if cls._default_predictor == name:
            cls._default_predictor = next(iter(cls._predictors), None)

    @classmethod
    def get_predictor(cls, name: str = None) -> BioPredictor:
        """Return the predictor registered as *name*, or the default when falsy."""
        chosen = name or cls._default_predictor
        if chosen in cls._predictors:
            return cls._predictors[chosen]
        available = list(cls._predictors.keys())
        raise ValueError(f"Predictor '{chosen}' not found. Available: {available}")

    # ==================== GENERATORS ====================

    @classmethod
    def register_generator(cls, name: str, generator: BioGenerator):
        """Register a generator (no default-generator concept)."""
        cls._generators[name] = generator
        logger.info(f"Registered generator: {name}")

    @classmethod
    def get_generator(cls, name: str) -> BioGenerator:
        """Return the generator registered as *name*."""
        if name in cls._generators:
            return cls._generators[name]
        available = list(cls._generators.keys())
        raise ValueError(f"Generator '{name}' not found. Available: {available}")

    # ==================== MISC TOOLS ====================

    @classmethod
    def register_tool(cls, name: str, tool: BioTool):
        """Register a miscellaneous tool."""
        cls._misc_tools[name] = tool
        logger.info(f"Registered tool: {name}")

    @classmethod
    def get_tool(cls, name: str) -> BioTool:
        """Return the miscellaneous tool registered as *name*."""
        if name in cls._misc_tools:
            return cls._misc_tools[name]
        available = list(cls._misc_tools.keys())
        raise ValueError(f"Tool '{name}' not found. Available: {available}")

    # ==================== UTILITIES ====================

    @classmethod
    def list_tools(cls) -> Dict[str, List[str]]:
        """Map each tool category to the names registered under it."""
        return {
            "encoders": [*cls._encoders],
            "predictors": [*cls._predictors],
            "generators": [*cls._generators],
            "tools": [*cls._misc_tools],
        }

    @classmethod
    def clear(cls):
        """Drop every registered tool and reset defaults (useful for testing)."""
        for bucket in (cls._encoders, cls._predictors, cls._generators, cls._misc_tools):
            bucket.clear()
        cls._default_encoder = None
        cls._default_predictor = None

    @classmethod
    def summary(cls) -> str:
        """Human-readable, one-line-per-category summary of the registry."""
        lines = ["BioFlow Tool Registry:"]
        for category, names in cls.list_tools().items():
            joined = ', '.join(names) if names else '(none)'
            lines.append(f" {category}: {joined}")
        return "\n".join(lines)
bioflow/demo.py ADDED
@@ -0,0 +1,261 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Demo Script - Test all capabilities
3
+ =============================================
4
+
5
+ This script demonstrates all major features of the BioFlow system.
6
+ Run this to verify your installation and see the system in action.
7
+
8
+ Usage:
9
+ python bioflow/demo.py
10
+ """
11
+
12
+ import os
13
+ import sys
14
+ import numpy as np
15
+ from pprint import pprint
16
+
17
+ # Add project root to path
18
+ ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
19
+ sys.path.insert(0, ROOT_DIR)
20
+
21
def print_header(title: str):
    """Print *title* between two 60-character '=' rules, padded by blank lines."""
    bar = "=" * 60
    print(f"\n{bar}\n {title}\n{bar}\n")
26
+
27
+
28
def demo_obm_encoding():
    """
    Demonstrate OBM encoding capabilities.

    Encodes text, SMILES and a protein fragment in mock mode (deterministic
    random vectors, no GPU), then shows cross-modal similarity.

    Returns:
        The mock-mode OBMWrapper instance, reused by the later demos.
    """
    print_header("🧬 OBM Multimodal Encoding")

    from bioflow.obm_wrapper import OBMWrapper, ModalityType

    # Initialize with mock mode (no GPU needed)
    obm = OBMWrapper(use_mock=True)
    print(f"✅ OBM initialized in Mock mode")
    print(f" Vector dimension: {obm.vector_dim}")
    print(f" Device: {obm.device}")

    # Encode text
    print("\n📝 Encoding Text:")
    texts = [
        "KRAS is a protein involved in cell signaling",
        "Aspirin is used to reduce inflammation"
    ]
    text_embeddings = obm.encode_text(texts)
    for emb in text_embeddings:
        print(f" [{emb.modality.value}] dim={emb.dimension}, hash={emb.content_hash}")
        print(f" Content: {emb.content[:50]}...")

    # Encode SMILES
    print("\n🧪 Encoding SMILES:")
    smiles_list = [
        "CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
        "CCO",  # Ethanol
        "c1ccccc1"  # Benzene
    ]
    mol_embeddings = obm.encode_smiles(smiles_list)
    for emb in mol_embeddings:
        print(f" [{emb.modality.value}] {emb.content} → dim={emb.dimension}")

    # Encode proteins
    print("\n🔬 Encoding Proteins:")
    proteins = [
        "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGET",  # KRAS fragment
    ]
    prot_embeddings = obm.encode_protein(proteins)
    for emb in prot_embeddings:
        print(f" [{emb.modality.value}] {emb.content[:30]}... → dim={emb.dimension}")

    # Cross-modal similarity: rank the SMILES list against a text query.
    print("\n🔄 Cross-Modal Similarity (Text → Molecules):")
    similarities = obm.cross_modal_similarity(
        query="anti-inflammatory drug",
        query_modality="text",
        targets=smiles_list,
        target_modality="smiles"
    )
    for content, score in similarities:
        print(f" {score:.4f} | {content}")

    return obm
83
+
84
+
85
def demo_qdrant_manager(obm):
    """
    Demonstrate Qdrant vector storage.

    Creates an in-memory collection, ingests a small multimodal sample set
    (text / SMILES / protein), then runs plain, cross-modal and diversity
    queries against it.

    Args:
        obm: An initialized OBMWrapper used for embedding the data.

    Returns:
        The populated QdrantManager, reused by the pipeline demo.
    """
    print_header("📦 Qdrant Vector Storage")

    from bioflow.qdrant_manager import QdrantManager

    # Initialize with in-memory storage
    qdrant = QdrantManager(obm, default_collection="demo_collection")
    print(f"✅ Qdrant Manager initialized (in-memory)")

    # Create collection (recreate=True drops any previous contents)
    qdrant.create_collection(recreate=True)
    print(f" Collection created: demo_collection")

    # Ingest sample data
    print("\n📥 Ingesting Sample Data:")
    sample_data = [
        {"content": "Aspirin is used to reduce fever and relieve mild to moderate pain", "modality": "text", "source": "PubMed:001", "tags": ["pain", "fever"]},
        {"content": "CC(=O)OC1=CC=CC=C1C(=O)O", "modality": "smiles", "source": "ChEMBL", "tags": ["aspirin", "nsaid"]},
        {"content": "Ibuprofen is a nonsteroidal anti-inflammatory drug", "modality": "text", "source": "PubMed:002", "tags": ["nsaid"]},
        {"content": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "modality": "smiles", "source": "ChEMBL", "tags": ["ibuprofen"]},
        {"content": "KRAS mutations are found in many cancers", "modality": "text", "source": "PubMed:003", "tags": ["cancer", "KRAS"]},
        {"content": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGET", "modality": "protein", "source": "UniProt:P01116", "tags": ["KRAS"]},
    ]

    stats = qdrant.ingest(sample_data)
    print(f" Ingestion stats: {stats}")

    # Collection info
    info = qdrant.get_collection_info()
    print(f"\n📊 Collection Info:")
    for k, v in info.items():
        print(f" {k}: {v}")

    # Search (same-modality text query)
    print("\n🔍 Searching for 'anti-inflammatory':")
    results = qdrant.search(
        query="anti-inflammatory medicine",
        query_modality="text",
        limit=3
    )
    for r in results:
        print(f" {r.score:.4f} | [{r.modality}] {r.content[:50]}...")

    # Cross-modal search: text query restricted to SMILES results
    print("\n🔄 Cross-Modal Search (Text → Molecules):")
    results = qdrant.cross_modal_search(
        query="pain relief medication",
        query_modality="text",
        target_modality="smiles",
        limit=3
    )
    for r in results:
        print(f" {r.score:.4f} | {r.content}")

    # Diversity analysis over the query's k nearest neighbors
    print("\n📈 Neighbors Diversity Analysis:")
    diversity = qdrant.get_neighbors_diversity(
        query="cancer treatment",
        query_modality="text",
        k=5
    )
    for k, v in diversity.items():
        if isinstance(v, float):
            print(f" {k}: {v:.4f}")
        else:
            print(f" {k}: {v}")

    return qdrant
154
+
155
+
156
def demo_pipeline(obm, qdrant):
    """
    Demonstrate the pipeline system.

    Registers a miner and a validator agent on a BioFlowPipeline and runs a
    text→SMILES discovery workflow against the demo collection, printing the
    literature, molecule and diversity stages of the result.

    Args:
        obm: Initialized OBMWrapper (encoder).
        qdrant: QdrantManager populated by demo_qdrant_manager().
    """
    print_header("🔬 BioFlow Pipeline")

    from bioflow.pipeline import BioFlowPipeline, MinerAgent, ValidatorAgent

    # Initialize pipeline
    pipeline = BioFlowPipeline(obm, qdrant)
    print("✅ Pipeline initialized")

    # Register agents (both operate on the demo collection)
    miner = MinerAgent(obm, qdrant, "demo_collection")
    validator = ValidatorAgent(obm, qdrant, "demo_collection")

    pipeline.register_agent(miner)
    pipeline.register_agent(validator)
    print(f" Registered agents: {list(pipeline.agents.keys())}")

    # Run discovery workflow
    print("\n🚀 Running Discovery Workflow:")
    print(" Query: 'anti-inflammatory drug for pain'")

    results = pipeline.run_discovery_workflow(
        query="anti-inflammatory drug for pain",
        query_modality="text",
        target_modality="smiles"
    )

    # Only the top 3 entries of each stage are printed below.
    print("\n 📚 Literature Results:")
    for item in results["stages"].get("literature", [])[:3]:
        print(f" {item['score']:.4f} | {item['content'][:40]}...")

    print("\n 🧪 Molecule Results:")
    for item in results["stages"].get("molecules", [])[:3]:
        print(f" {item['score']:.4f} | {item['content']}")

    print("\n 📊 Diversity:")
    div = results["stages"].get("diversity", {})
    print(f" Mean similarity: {div.get('mean_similarity', 0):.4f}")
    print(f" Diversity score: {div.get('diversity_score', 0):.4f}")
196
+
197
+
198
def demo_visualization():
    """
    Demonstrate visualization capabilities.

    Runs a PCA reduction on random sample embeddings and lists the available
    plotting helpers. Degrades gracefully (prints install hint) when optional
    plotting dependencies are missing.
    """
    print_header("📊 Visualization Capabilities")

    try:
        from bioflow.visualizer import EmbeddingVisualizer, ResultsVisualizer
        print("✅ Visualization module loaded")

        # Generate sample embeddings (random; just to exercise the reducer)
        n_samples = 20
        embeddings = np.random.randn(n_samples, 768)
        labels = [f"Sample {i}" for i in range(n_samples)]
        colors = ["text"] * 7 + ["smiles"] * 7 + ["protein"] * 6

        # Dimensionality reduction
        print("\n🔻 Dimensionality Reduction:")
        reduced = EmbeddingVisualizer.reduce_dimensions(embeddings, method="pca", n_components=2)
        print(f" Original shape: {embeddings.shape}")
        print(f" Reduced shape: {reduced.shape}")

        # Note about plots
        print("\n📈 Plotting Functions Available:")
        print(" - plot_embeddings_2d(embeddings, labels, colors)")
        print(" - plot_embeddings_3d(embeddings, labels)")
        print(" - plot_similarity_matrix(embeddings, labels)")
        print(" - create_dashboard(results, embeddings)")
        print("\n Run the Streamlit app to see interactive visualizations!")

    except ImportError as e:
        # Optional deps (plotly, scikit-learn) may be absent; demo continues.
        print(f"⚠️ Some visualization dependencies missing: {e}")
        print(" Install with: pip install plotly scikit-learn")
229
+
230
+
231
def main():
    """
    Run all demos in order (encoding → storage → pipeline → visualization)
    and print follow-up instructions. Runs entirely in mock mode.
    """
    print("\n" + "🧬" * 20)
    print(" BIOFLOW + OBM DEMO")
    print("🧬" * 20)

    print("\nThis demo runs in MOCK mode (no GPU/model required).")
    print("Embeddings are deterministic random vectors for testing.\n")

    # Run demos; each stage feeds its result into the next.
    obm = demo_obm_encoding()
    qdrant = demo_qdrant_manager(obm)
    demo_pipeline(obm, qdrant)
    demo_visualization()

    print_header("✅ Demo Complete!")
    print("Next steps:")
    print(" 1. Run the Streamlit interface:")
    print(" streamlit run bioflow/app.py")
    print("")
    print(" 2. For real embeddings, set use_mock=False and ensure:")
    print(" - BioMedGPT checkpoints are downloaded")
    print(" - GPU is available")
    print("")
    print(" 3. Read the documentation:")
    print(" docs/BIOFLOW_OBM_REPORT.md")
    print("")
258
+
259
+
260
# Script entry point: run every demo in sequence.
if __name__ == "__main__":
    main()
bioflow/obm_wrapper.py ADDED
@@ -0,0 +1,355 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ OBM Wrapper - Unified Multimodal Encoding Interface
3
+ =====================================================
4
+
5
+ This module provides a clean, high-level API for encoding biological data
6
+ (text, molecules, proteins) into a unified vector space using open-source models.
7
+ """
8
+
9
+ import os
10
+ import sys
11
+ import torch
12
+ import numpy as np
13
+ import logging
14
+ from typing import List, Union, Dict, Any, Optional, Tuple
15
+ from dataclasses import dataclass
16
+ from enum import Enum
17
+ import hashlib
18
+
19
+ # Add project root to path
20
+ ROOT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
21
+ sys.path.insert(0, ROOT_DIR)
22
+
23
+ logging.basicConfig(level=logging.INFO)
24
+ logger = logging.getLogger(__name__)
25
+
26
+
27
class ModalityType(Enum):
    """Supported data modalities."""
    TEXT = "text"          # free text (abstracts, descriptions, notes)
    MOLECULE = "molecule"  # molecule embeddings (produced from SMILES input)
    SMILES = "smiles"      # raw SMILES strings; routed to the same encoder as MOLECULE
    PROTEIN = "protein"    # amino-acid sequences
    CELL = "cell"          # declared but not handled by OBMWrapper.encode()
34
+
35
+
36
@dataclass
class EmbeddingResult:
    """An embedding vector plus the metadata needed to store and trace it."""
    vector: np.ndarray      # the embedding itself
    modality: ModalityType  # which encoder produced it
    content: str            # (possibly truncated) source content
    content_hash: str       # short hash of the full content, for dedup
    dimension: int          # length of `vector`

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to plain Python types (vector → list) for JSON/storage."""
        payload = {
            "vector": self.vector.tolist(),
            "modality": self.modality.value,
            "content": self.content,
            "content_hash": self.content_hash,
            "dimension": self.dimension,
        }
        return payload
53
+
54
+
55
class OBMWrapper:
    """
    Unified wrapper for OpenBioMed multimodal encoding.

    This class provides a clean API for encoding biological data into
    a shared embedding space, enabling cross-modal similarity search.

    NOTE(review): ``_init_model`` is currently a placeholder that leaves
    ``_model`` as None, so the non-mock code paths below will fail with an
    AttributeError until a real model is wired in; mock mode is the only
    fully functional mode in this revision.

    Attributes:
        device: Computing device ('cuda' or 'cpu')
        model: Underlying open-source model
        vector_dim: Dimension of output embeddings
    """

    def __init__(
        self,
        device: Optional[str] = None,
        config_path: Optional[str] = None,
        checkpoint_path: Optional[str] = None,
        use_mock: bool = False
    ):
        """
        Initialize the OBM wrapper.

        Args:
            device: 'cuda' or 'cpu'. Auto-detects if None.
            config_path: Path to open-source model config YAML.
            checkpoint_path: Path to model weights.
            use_mock: If True, uses mock embeddings (for testing without GPU).
        """
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.use_mock = use_mock
        self._model = None
        self._vector_dim = 768  # Default, updated after model load

        if config_path is None:
            config_path = os.path.join(ROOT_DIR, "configs/model/opensource_model.yaml")

        self.config_path = config_path
        self.checkpoint_path = checkpoint_path

        if not use_mock:
            self._init_model()
        else:
            logger.info("Using MOCK mode - embeddings are random vectors for testing")
            self._vector_dim = 768

    def _init_model(self):
        """
        Initialize the open-source model.

        Currently a stub: the model is never actually loaded and ``_model``
        stays None. On any failure the wrapper falls back to mock mode.
        """
        try:
            # Placeholder for initializing open-source model
            pass

            self._model = None

            self._vector_dim = 768

            logger.info(f"OBM initialized. Device: {self.device}, Vector dim: {self._vector_dim}")

        except Exception as e:
            logger.error(f"Failed to load model: {e}")
            logger.warning("Falling back to MOCK mode")
            self.use_mock = True
            self._vector_dim = 768

    @property
    def vector_dim(self) -> int:
        """Return the embedding dimension."""
        return self._vector_dim

    @property
    def is_ready(self) -> bool:
        """Check if model is loaded and ready (mock mode is always ready)."""
        return self._model is not None or self.use_mock

    def _compute_hash(self, content: str) -> str:
        """Compute content hash for deduplication (md5, first 16 hex chars; non-cryptographic use)."""
        return hashlib.md5(content.encode()).hexdigest()[:16]

    def _mock_embed(self, content: str, modality: ModalityType) -> np.ndarray:
        """
        Generate deterministic mock embedding based on content hash.

        The same content always maps to the same unit vector; ``modality``
        is accepted for interface symmetry but does not affect the output.
        """
        seed = int(self._compute_hash(content), 16) % (2**32)
        rng = np.random.RandomState(seed)
        vec = rng.randn(self._vector_dim).astype(np.float32)
        # Normalize to unit length so cosine similarity equals dot product.
        vec = vec / np.linalg.norm(vec)
        return vec

    @torch.no_grad()
    def encode_text(self, text: Union[str, List[str]]) -> List[EmbeddingResult]:
        """
        Encode text (abstracts, descriptions, notes) into embeddings.

        Args:
            text: Single string or list of strings.

        Returns:
            List of EmbeddingResult objects (content truncated to 200 chars).
        """
        if isinstance(text, str):
            text = [text]

        results = []

        if self.use_mock:
            for t in text:
                vec = self._mock_embed(t, ModalityType.TEXT)
                results.append(EmbeddingResult(
                    vector=vec,
                    modality=ModalityType.TEXT,
                    content=t[:200],  # Truncate for storage
                    content_hash=self._compute_hash(t),
                    dimension=self._vector_dim
                ))
        else:
            # NOTE(review): requires a loaded model with llm_tokenizer/llm;
            # _model is None in this revision, so this path currently fails.
            tokenizer = self._model.llm_tokenizer
            inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            outputs = self._model.llm(**inputs, output_hidden_states=True)
            hidden = outputs.hidden_states[-1]

            # Attention-masked mean pooling over token positions.
            mask = inputs['attention_mask'].unsqueeze(-1).float()
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
            vectors = pooled.cpu().numpy()

            for i, t in enumerate(text):
                results.append(EmbeddingResult(
                    vector=vectors[i],
                    modality=ModalityType.TEXT,
                    content=t[:200],
                    content_hash=self._compute_hash(t),
                    dimension=self._vector_dim
                ))

        return results

    @torch.no_grad()
    def encode_smiles(self, smiles: Union[str, List[str]]) -> List[EmbeddingResult]:
        """
        Encode SMILES molecular representations into embeddings.

        Args:
            smiles: Single SMILES string or list of SMILES.

        Returns:
            List of EmbeddingResult objects (modality is MOLECULE).
        """
        if isinstance(smiles, str):
            smiles = [smiles]

        results = []

        if self.use_mock:
            for s in smiles:
                vec = self._mock_embed(s, ModalityType.MOLECULE)
                results.append(EmbeddingResult(
                    vector=vec,
                    modality=ModalityType.MOLECULE,
                    content=s,
                    content_hash=self._compute_hash(s),
                    dimension=self._vector_dim
                ))
        else:
            # NOTE(review): depends on open_biomed + torch_scatter and a
            # loaded model; unreachable while _model is None.
            from open_biomed.data import Molecule
            from torch_scatter import scatter_mean

            molecules = [Molecule.from_smiles(s) for s in smiles]
            mol_feats = [self._model.featurizer.molecule_featurizer(m) for m in molecules]
            collated = self._model.collator.molecule_collator(mol_feats).to(self.device)

            # Mean-pool per-node features into one vector per molecule.
            node_feats = self._model.mol_structure_encoder(collated)
            proj_feats = self._model.proj_mol(node_feats)
            vectors = scatter_mean(proj_feats, collated.batch, dim=0).cpu().numpy()

            for i, s in enumerate(smiles):
                results.append(EmbeddingResult(
                    vector=vectors[i],
                    modality=ModalityType.MOLECULE,
                    content=s,
                    content_hash=self._compute_hash(s),
                    dimension=self._vector_dim
                ))

        return results

    @torch.no_grad()
    def encode_protein(self, sequences: Union[str, List[str]]) -> List[EmbeddingResult]:
        """
        Encode protein sequences (FASTA format) into embeddings.

        Args:
            sequences: Single sequence or list of sequences.

        Returns:
            List of EmbeddingResult objects (content truncated past 100 chars).
        """
        if isinstance(sequences, str):
            sequences = [sequences]

        results = []

        if self.use_mock:
            for seq in sequences:
                vec = self._mock_embed(seq, ModalityType.PROTEIN)
                results.append(EmbeddingResult(
                    vector=vec,
                    modality=ModalityType.PROTEIN,
                    content=seq[:100] + "..." if len(seq) > 100 else seq,
                    content_hash=self._compute_hash(seq),
                    dimension=self._vector_dim
                ))
        else:
            # NOTE(review): requires prot_tokenizer / prot_structure_encoder
            # on a loaded model; unreachable while _model is None.
            tokenizer = self._model.prot_tokenizer
            inputs = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=1024)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}

            outputs = self._model.prot_structure_encoder(**inputs)
            hidden = outputs.last_hidden_state
            proj = self._model.proj_prot(hidden)

            # Attention-masked mean pooling over residues.
            mask = inputs['attention_mask'].unsqueeze(-1).float()
            pooled = (proj * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
            vectors = pooled.cpu().numpy()

            for i, seq in enumerate(sequences):
                results.append(EmbeddingResult(
                    vector=vectors[i],
                    modality=ModalityType.PROTEIN,
                    content=seq[:100] + "..." if len(seq) > 100 else seq,
                    content_hash=self._compute_hash(seq),
                    dimension=self._vector_dim
                ))

        return results

    def encode(self, content: str, modality: Union[str, ModalityType]) -> EmbeddingResult:
        """
        Universal encoding function.

        Args:
            content: The content to encode.
            modality: Type of content ('text', 'smiles', 'molecule', 'protein').

        Returns:
            Single EmbeddingResult.

        Raises:
            ValueError: For an unknown modality string or for CELL, which has
                no encoder here.
        """
        if isinstance(modality, str):
            modality = ModalityType(modality.lower())

        if modality in [ModalityType.TEXT]:
            return self.encode_text(content)[0]
        elif modality in [ModalityType.MOLECULE, ModalityType.SMILES]:
            return self.encode_smiles(content)[0]
        elif modality == ModalityType.PROTEIN:
            return self.encode_protein(content)[0]
        else:
            raise ValueError(f"Unsupported modality: {modality}")

    def compute_similarity(self, embedding1: np.ndarray, embedding2: np.ndarray) -> float:
        """Compute cosine similarity between two embeddings (0.0 for zero vectors)."""
        norm1 = np.linalg.norm(embedding1)
        norm2 = np.linalg.norm(embedding2)
        if norm1 == 0 or norm2 == 0:
            return 0.0
        return float(np.dot(embedding1, embedding2) / (norm1 * norm2))

    def cross_modal_similarity(
        self,
        query: str,
        query_modality: str,
        targets: List[str],
        target_modality: str
    ) -> List[Tuple[str, float]]:
        """
        Compute cross-modal similarities.

        Args:
            query: Query content.
            query_modality: Modality of query.
            targets: List of target contents.
            target_modality: Modality of targets. An unrecognized value
                yields an empty result (no error raised).

        Returns:
            List of (target_content, similarity_score) tuples, sorted by
            similarity, highest first. Note: the returned content may be the
            truncated form stored on the embedding, not the full target.
        """
        query_emb = self.encode(query, query_modality)
        target_embs = []

        if target_modality.lower() in ['text']:
            target_embs = self.encode_text(targets)
        elif target_modality.lower() in ['smiles', 'molecule']:
            target_embs = self.encode_smiles(targets)
        elif target_modality.lower() == 'protein':
            target_embs = self.encode_protein(targets)

        results = []
        for emb in target_embs:
            sim = self.compute_similarity(query_emb.vector, emb.vector)
            results.append((emb.content, sim))

        return sorted(results, key=lambda x: x[1], reverse=True)
bioflow/pipeline.py ADDED
@@ -0,0 +1,370 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Pipeline - Workflow Orchestration
3
+ ==========================================
4
+
5
+ This module provides the pipeline orchestration for BioFlow,
6
+ connecting agents, memory (Qdrant), and OBM encoders.
7
+ """
8
+
9
+ import logging
10
+ from typing import List, Dict, Any, Optional, Callable
11
+ from dataclasses import dataclass, field
12
+ from enum import Enum
13
+ from datetime import datetime
14
+ import json
15
+
16
+ from bioflow.obm_wrapper import OBMWrapper
17
+ from bioflow.qdrant_manager import QdrantManager, SearchResult
18
+
19
+ logging.basicConfig(level=logging.INFO)
20
+ logger = logging.getLogger(__name__)
21
+
22
+
23
class AgentType(Enum):
    """Roles an agent can play in the BioFlow system."""

    GENERATOR = "generator"   # Generates new molecules/variants
    VALIDATOR = "validator"   # Validates properties (toxicity, etc.)
    MINER = "miner"           # Mines literature for evidence
    RANKER = "ranker"         # Ranks candidates
    CUSTOM = "custom"         # User-defined agent type
30
+
31
+
32
@dataclass
class AgentMessage:
    """Message passed between agents.

    ``timestamp`` defaults to the creation time in ISO-8601 format.
    """
    # Name of the agent that produced this message.
    sender: str
    # Arbitrary payload (search hits, validation dict, ranked list, ...).
    content: Any
    # Agent-specific extras; merged into the pipeline context downstream.
    metadata: Dict[str, Any] = field(default_factory=dict)
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
39
+
40
+
41
@dataclass
class PipelineResult:
    """Result of a pipeline execution."""
    # False when an agent raised and the run was aborted early.
    success: bool
    # One entry per executed agent, in workflow order.
    outputs: List[Any]
    # The AgentMessage emitted by each executed agent.
    messages: List[AgentMessage]
    # Run statistics; on failure contains 'failed_at' and 'error'.
    stats: Dict[str, Any]
48
+
49
+
50
class BaseAgent:
    """Base class for all BioFlow agents.

    Stores the shared OBM encoder and Qdrant manager handles; subclasses
    override :meth:`process`.
    """

    def __init__(
        self,
        name: str,
        agent_type: AgentType,
        obm: OBMWrapper,
        qdrant: QdrantManager
    ):
        # 'name' doubles as the registry key in BioFlowPipeline.register_agent.
        self.name = name
        self.agent_type = agent_type
        self.obm = obm
        self.qdrant = qdrant

    def process(self, input_data: Any, context: Dict[str, Any] = None) -> AgentMessage:
        """Process input and return output message.

        Args:
            input_data: Agent-specific input (query string, SMILES, list, ...).
            context: Optional per-call options; semantics are subclass-defined.

        Raises:
            NotImplementedError: Always, in this base class.
        """
        raise NotImplementedError
68
+
69
+
70
class MinerAgent(BaseAgent):
    """
    Literature mining agent.

    Retrieves relevant scientific articles/abstracts based on query.
    """

    def __init__(self, obm: OBMWrapper, qdrant: QdrantManager, collection: str = None):
        super().__init__("LiteratureMiner", AgentType.MINER, obm, qdrant)
        self.collection = collection

    def process(
        self,
        input_data: str,
        context: Dict[str, Any] = None
    ) -> AgentMessage:
        """
        Search the vector store for text documents relevant to the query.

        Args:
            input_data: Query text, SMILES, or protein sequence.
            context: Optional dict with 'modality' (default "text") and
                'limit' (default 5).
        """
        opts = context or {}
        query_modality = opts.get("modality", "text")
        max_hits = opts.get("limit", 5)

        hits = self.qdrant.search(
            query=input_data,
            query_modality=query_modality,
            collection=self.collection,
            limit=max_hits,
            filter_modality="text"
        )

        best_score = hits[0].score if hits else 0
        return AgentMessage(
            sender=self.name,
            content=[hit.payload for hit in hits],
            metadata={
                "query": input_data,
                "modality": query_modality,
                "result_count": len(hits),
                "top_score": best_score
            }
        )
115
+
116
+
117
class ValidatorAgent(BaseAgent):
    """
    Validation agent.

    Checks molecules against known toxicity, drug-likeness, etc.
    """

    def __init__(self, obm: OBMWrapper, qdrant: QdrantManager, collection: str = None):
        super().__init__("Validator", AgentType.VALIDATOR, obm, qdrant)
        self.collection = collection

    def process(
        self,
        input_data: str,
        context: Dict[str, Any] = None
    ) -> AgentMessage:
        """
        Validate a molecule by comparing it against known molecules.

        Args:
            input_data: SMILES string to validate.
            context: Optional context (currently unused).
        """
        context = context or {}

        # Nearest known molecules in embedding space.
        neighbors = self.qdrant.search(
            query=input_data,
            query_modality="smiles",
            collection=self.collection,
            limit=10,
            filter_modality="smiles"
        )

        report = {
            "has_similar_known": len(neighbors) > 0,
            "max_similarity": neighbors[0].score if neighbors else 0,
            "similar_molecules": [
                {
                    "smiles": hit.content,
                    "score": hit.score,
                    "tags": hit.payload.get("tags", [])
                }
                for hit in neighbors[:3]
            ]
        }

        # A candidate inherits risk flags from neighbors whose tags mention
        # known liabilities.
        risk_tags = ["toxic", "mutagenic", "carcinogenic"]
        flagged = [
            {"molecule": hit.content, "tag": tag}
            for hit in neighbors
            for tag in hit.payload.get("tags", [])
            if any(risk in tag.lower() for risk in risk_tags)
        ]

        report["flagged_risks"] = flagged
        report["passed"] = len(flagged) == 0

        return AgentMessage(
            sender=self.name,
            content=report,
            metadata={"input_smiles": input_data}
        )
182
+
183
+
184
class RankerAgent(BaseAgent):
    """
    Ranking agent.

    Orders candidates by the sum of their per-criterion scores.
    """

    def __init__(self, obm: OBMWrapper, qdrant: QdrantManager):
        super().__init__("Ranker", AgentType.RANKER, obm, qdrant)

    def process(
        self,
        input_data: List[Dict[str, Any]],
        context: Dict[str, Any] = None
    ) -> AgentMessage:
        """
        Rank a list of candidates.

        Args:
            input_data: Candidate dicts; each may carry a 'scores' mapping
                whose values are summed to produce the ranking key.
        """
        def total_score(candidate):
            # Candidates without a 'scores' mapping rank as 0.
            return sum(candidate.get("scores", {}).values())

        ordered = sorted(input_data, key=total_score, reverse=True)

        return AgentMessage(
            sender=self.name,
            content=ordered,
            metadata={"original_count": len(input_data)}
        )
217
+
218
+
219
class BioFlowPipeline:
    """
    Main pipeline orchestrator for BioFlow.

    Agents are registered by name, ordered via :meth:`set_workflow`, and
    executed sequentially by :meth:`run`: each agent's output becomes the
    next agent's input, and its message metadata is merged into the running
    context.
    """

    def __init__(
        self,
        obm: OBMWrapper,
        qdrant: QdrantManager
    ):
        self.obm = obm
        self.qdrant = qdrant
        self.agents: Dict[str, BaseAgent] = {}
        self.workflow: List[str] = []
        self.messages: List[AgentMessage] = []

    def register_agent(self, agent: BaseAgent) -> None:
        """Register an agent with the pipeline, keyed by its name."""
        self.agents[agent.name] = agent
        logger.info(f"Registered agent: {agent.name} ({agent.agent_type.value})")

    def set_workflow(self, agent_names: List[str]) -> None:
        """
        Set the workflow order.

        Args:
            agent_names: List of agent names in execution order.

        Raises:
            ValueError: If any name has not been registered.
        """
        for name in agent_names:
            if name not in self.agents:
                raise ValueError(f"Unknown agent: {name}")
        self.workflow = agent_names

    def run(
        self,
        initial_input: Any,
        initial_context: Dict[str, Any] = None
    ) -> PipelineResult:
        """
        Execute the pipeline.

        Args:
            initial_input: Starting input data for the first agent.
            initial_context: Initial context for the first agent. It is
                copied internally, so the caller's dict is never mutated.

        Returns:
            PipelineResult with all outputs and messages. If an agent raises,
            execution stops and success is False with the failing agent
            recorded in stats.
        """
        self.messages = []
        current_input = initial_input
        # Copy the caller's dict: previously the metadata merge below mutated
        # initial_context in place, leaking agent state back to the caller.
        current_context = dict(initial_context) if initial_context else {}
        outputs = []

        for agent_name in self.workflow:
            agent = self.agents[agent_name]
            logger.info(f"Executing agent: {agent_name}")

            try:
                message = agent.process(current_input, current_context)
                self.messages.append(message)
                outputs.append(message.content)

                # Chain: this agent's output feeds the next agent, and its
                # metadata enriches the shared context.
                current_input = message.content
                current_context.update(message.metadata)

            except Exception as e:
                logger.error(f"Agent {agent_name} failed: {e}")
                return PipelineResult(
                    success=False,
                    outputs=outputs,
                    messages=self.messages,
                    stats={"failed_at": agent_name, "error": str(e)}
                )

        return PipelineResult(
            success=True,
            outputs=outputs,
            messages=self.messages,
            stats={
                "agents_executed": len(self.workflow),
                "total_messages": len(self.messages)
            }
        )

    def run_discovery_workflow(
        self,
        query: str,
        query_modality: str = "text",
        target_modality: str = "smiles"
    ) -> Dict[str, Any]:
        """
        Run a complete discovery workflow.

        1. Search for related literature
        2. Find similar molecules (cross-modal)
        3. Validate top candidates (if a Validator agent is registered)
        4. Analyze result diversity

        Returns:
            Dict with the query parameters and a 'stages' dict keyed by
            'literature', 'molecules', optionally 'validation', 'diversity'.
        """
        results = {
            "query": query,
            "query_modality": query_modality,
            "target_modality": target_modality,
            "stages": {}
        }

        # Stage 1: Literature search
        literature = self.qdrant.search(
            query=query,
            query_modality=query_modality,
            limit=5,
            filter_modality="text"
        )
        results["stages"]["literature"] = [
            {"content": r.content, "score": r.score}
            for r in literature
        ]

        # Stage 2: Cross-modal molecule search
        molecules = self.qdrant.cross_modal_search(
            query=query,
            query_modality=query_modality,
            target_modality=target_modality,
            limit=10
        )
        results["stages"]["molecules"] = [
            {"content": r.content, "score": r.score, "payload": r.payload}
            for r in molecules
        ]

        # Stage 3: Validate top candidates (only when a Validator is registered)
        if "Validator" in self.agents and molecules:
            validated = []
            for mol in molecules[:3]:
                val_msg = self.agents["Validator"].process(mol.content)
                validated.append({
                    "smiles": mol.content,
                    "validation": val_msg.content
                })
            results["stages"]["validation"] = validated

        # Stage 4: Diversity analysis
        diversity = self.qdrant.get_neighbors_diversity(
            query=query,
            query_modality=query_modality,
            k=10
        )
        results["stages"]["diversity"] = diversity

        return results
bioflow/plugins/__init__.py ADDED
@@ -0,0 +1,58 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Plugins
3
+ ================
4
+
5
+ Tool implementations for the BioFlow platform.
6
+
7
+ Encoders:
8
+ - OBMEncoder: Unified multimodal encoder (text, molecules, proteins)
9
+ - TextEncoder: PubMedBERT / SciBERT for biomedical text
10
+ - MoleculeEncoder: ChemBERTa for SMILES
11
+ - ProteinEncoder: ESM-2 for protein sequences
12
+
13
+ Retrievers:
14
+ - QdrantRetriever: Vector database search with Qdrant
15
+
16
+ Predictors:
17
+ - DeepPurposePredictor: Drug-Target Interaction prediction
18
+ """
19
+
20
+ # Encoders
21
+ from bioflow.plugins.obm_encoder import OBMEncoder
22
+ from bioflow.plugins.encoders import TextEncoder, MoleculeEncoder, ProteinEncoder
23
+
24
+ # Retriever
25
+ from bioflow.plugins.qdrant_retriever import QdrantRetriever
26
+
27
+ # Predictor
28
+ from bioflow.plugins.deeppurpose_predictor import DeepPurposePredictor
29
+
30
+ __all__ = [
31
+ # Encoders
32
+ "OBMEncoder",
33
+ "TextEncoder",
34
+ "MoleculeEncoder",
35
+ "ProteinEncoder",
36
+ # Retriever
37
+ "QdrantRetriever",
38
+ # Predictor
39
+ "DeepPurposePredictor",
40
+ ]
41
+
42
+
43
def register_all(registry=None):
    """
    Register all plugins with the tool registry.

    Args:
        registry: ToolRegistry instance (uses global if None)
    """
    from bioflow.core import ToolRegistry

    if not registry:
        registry = ToolRegistry

    # Encoders are lazy-loaded, so nothing is instantiated here;
    # they register themselves on first use.
    print("Plugins available for registration:")
    print(" Encoders: OBMEncoder, TextEncoder, MoleculeEncoder, ProteinEncoder")
    print(" Retrievers: QdrantRetriever")
    print(" Predictors: DeepPurposePredictor")
bioflow/plugins/deeppurpose_predictor.py ADDED
@@ -0,0 +1,220 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ DeepPurpose Predictor - DTI Prediction
3
+ ========================================
4
+
5
+ Implements BioPredictor interface for drug-target interaction prediction.
6
+
7
+ Note: DeepPurpose is an open-source toolkit for DTI/DDI prediction.
8
+ If DeepPurpose is not available, falls back to a simple baseline.
9
+ """
10
+
11
+ import logging
12
+ from typing import List, Dict, Any, Optional, Tuple
13
+ import warnings
14
+
15
+ from bioflow.core import BioPredictor, PredictionResult
16
+
17
+ logger = logging.getLogger(__name__)
18
+
19
+ # Lazy import
20
+ _deeppurpose = None
21
+ _deeppurpose_available = None
22
+
23
+
24
def _check_deeppurpose():
    """Probe for DeepPurpose once; cache availability and modules in globals."""
    global _deeppurpose, _deeppurpose_available

    # Only attempt the import the first time; afterwards the cached flag wins.
    if _deeppurpose_available is None:
        try:
            from DeepPurpose import DTI as DeepPurposeDTI
            from DeepPurpose import utils as DeepPurposeUtils
        except ImportError:
            _deeppurpose_available = False
            logger.warning(
                "DeepPurpose not available. Using fallback predictor. "
                "Install with: pip install DeepPurpose"
            )
        else:
            _deeppurpose = {
                "DTI": DeepPurposeDTI,
                "utils": DeepPurposeUtils
            }
            _deeppurpose_available = True
            logger.info("DeepPurpose is available")

    return _deeppurpose_available, _deeppurpose
45
+
46
+
47
class DeepPurposePredictor(BioPredictor):
    """
    Drug-Target Interaction predictor using DeepPurpose.

    Predicts binding affinity between a drug (SMILES) and target (protein
    sequence). Falls back to a cheap length-based heuristic when DeepPurpose
    is not installed.

    Example:
        >>> predictor = DeepPurposePredictor()
        >>> result = predictor.predict(
        ...     drug="CC(=O)Oc1ccccc1C(=O)O",  # Aspirin
        ...     target="MKTVRQERLKSIVRILERSKEPVSG..."  # Target protein
        ... )
        >>> print(result.score)  # Predicted binding affinity

    Models (when DeepPurpose is available):
        - Transformer + CNN (default)
        - MPNN + CNN
        - Morgan + AAC (baseline)
    """

    AVAILABLE_MODELS = [
        "Transformer_CNN",
        "MPNN_CNN",
        "Morgan_CNN",
        "Morgan_AAC",
    ]

    def __init__(
        self,
        model_type: str = "Transformer_CNN",
        pretrained: str = None,
        device: str = "cpu"
    ):
        """
        Initialize DeepPurposePredictor.

        Args:
            model_type: Model architecture as "<drug-enc>_<target-enc>"
                (e.g. "Transformer_CNN"); see AVAILABLE_MODELS.
            pretrained: Path to a pretrained DeepPurpose model (optional).
            device: torch device string.
        """
        self.model_type = model_type
        self.pretrained = pretrained
        self.device = device

        available, dp = _check_deeppurpose()
        self._use_deeppurpose = available
        self._model = None

        if available and pretrained:
            self._load_pretrained(pretrained)

    def _load_pretrained(self, path: str):
        """Load a pretrained DeepPurpose model; log (don't raise) on failure."""
        available, dp = _check_deeppurpose()
        if not available:
            return

        try:
            self._model = dp["DTI"].load_pretrained_model(path)
            logger.info(f"Loaded pretrained model from {path}")
        except Exception as e:
            logger.error(f"Failed to load pretrained model: {e}")

    def _fallback_predict(self, drug: str, target: str) -> Tuple[float, float]:
        """
        Fallback prediction when DeepPurpose is not available.

        Uses simple heuristics based on molecular properties.
        This is NOT accurate - just a placeholder.

        Returns:
            (score, confidence) with score in [0, 1] and a fixed low confidence.
        """
        # Simple heuristics based on sequence/molecule properties
        drug_score = min(len(drug) / 50.0, 1.0)  # Longer SMILES = higher complexity
        target_score = min(len(target) / 500.0, 1.0)  # Longer protein = more binding sites

        # Deterministic per-pair noise. Use a local Random instance instead of
        # random.seed() so we never clobber the process-wide global RNG state;
        # the generated value is identical to the previous seeded-global code.
        import random
        rng = random.Random(hash(drug + target) % 2**32)
        base_score = (drug_score + target_score) / 2
        noise = rng.uniform(-0.1, 0.1)

        score = max(0, min(1, base_score + noise))
        confidence = 0.3  # Low confidence for fallback

        return score, confidence

    def predict(self, drug: str, target: str) -> PredictionResult:
        """
        Predict drug-target interaction.

        Args:
            drug: SMILES string of drug molecule
            target: Protein sequence

        Returns:
            PredictionResult with binding affinity score
        """
        if self._use_deeppurpose:
            return self._predict_deeppurpose(drug, target)

        score, confidence = self._fallback_predict(drug, target)
        return PredictionResult(
            score=score,
            confidence=confidence,
            label="binding" if score > 0.5 else "non-binding",
            metadata={
                "method": "fallback_heuristic",
                "warning": "DeepPurpose not available, using simple heuristics"
            }
        )

    def _predict_deeppurpose(self, drug: str, target: str) -> PredictionResult:
        """Predict using DeepPurpose; degrade to the heuristic on any error."""
        available, dp = _check_deeppurpose()

        try:
            # Encode drug and target.
            # NOTE(review): assumes DeepPurpose's utils module exposes
            # drug_encoding/target_encoding helpers - confirm against the
            # installed DeepPurpose version; failures land in the except below.
            drug_encoding = dp["utils"].drug_encoding(drug, self.model_type.split("_")[0])
            target_encoding = dp["utils"].target_encoding(target, self.model_type.split("_")[1])

            # Predict
            if self._model:
                y_pred = self._model.predict(drug_encoding, target_encoding)
            else:
                warnings.warn("No pretrained model loaded, predictions may be unreliable")
                y_pred = [0.5]  # Default

            score = float(y_pred[0]) if hasattr(y_pred, '__iter__') else float(y_pred)

            return PredictionResult(
                score=score,
                confidence=0.8,
                label="binding" if score > 0.5 else "non-binding",
                metadata={
                    "method": "deeppurpose",
                    "model_type": self.model_type,
                    "drug_smiles": drug[:50],
                    "target_length": len(target)
                }
            )

        except Exception as e:
            logger.error(f"DeepPurpose prediction failed: {e}")
            # Fallback
            score, confidence = self._fallback_predict(drug, target)
            return PredictionResult(
                score=score,
                confidence=confidence,
                label="binding" if score > 0.5 else "non-binding",
                metadata={"method": "fallback", "error": str(e)}
            )

    def batch_predict(self, pairs: List[Tuple[str, str]]) -> List[PredictionResult]:
        """
        Batch predict drug-target interactions.

        Args:
            pairs: List of (drug_smiles, target_sequence) tuples

        Returns:
            List of PredictionResults, one per pair, in order.
        """
        return [self.predict(drug, target) for drug, target in pairs]

    def get_model_info(self) -> Dict[str, Any]:
        """Return a dict describing the configured model and backend state."""
        return {
            "model_type": self.model_type,
            "use_deeppurpose": self._use_deeppurpose,
            "pretrained": self.pretrained,
            "device": self.device,
            "available_models": self.AVAILABLE_MODELS
        }
bioflow/plugins/encoders/__init__.py ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Encoders
3
+ =================
4
+
5
+ Open-source encoder implementations for different modalities.
6
+
7
+ Available Encoders:
8
+ - TextEncoder: PubMedBERT / SciBERT for biomedical text
9
+ - MoleculeEncoder: ChemBERTa for SMILES molecules
10
+ - ProteinEncoder: ESM-2 for protein sequences
11
+ """
12
+
13
+ from bioflow.plugins.encoders.text_encoder import TextEncoder
14
+ from bioflow.plugins.encoders.molecule_encoder import MoleculeEncoder
15
+ from bioflow.plugins.encoders.protein_encoder import ProteinEncoder
16
+
17
+ __all__ = ["TextEncoder", "MoleculeEncoder", "ProteinEncoder"]
bioflow/plugins/encoders/molecule_encoder.py ADDED
@@ -0,0 +1,226 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Molecule Encoder - ChemBERTa / RDKit
3
+ =====================================
4
+
5
+ Encodes SMILES molecules into vectors.
6
+
7
+ Models:
8
+ - seyonec/ChemBERTa-zinc-base-v1 (default)
9
+ - DeepChem/ChemBERTa-77M-MTR
10
+ - RDKit fingerprints (fallback, no GPU needed)
11
+ """
12
+
13
+ import logging
14
+ from typing import List, Optional
15
+ from enum import Enum
16
+
17
+ from bioflow.core import BioEncoder, Modality, EmbeddingResult
18
+
19
+ logger = logging.getLogger(__name__)
20
+
21
+ # Lazy imports
22
+ _transformers = None
23
+ _torch = None
24
+ _rdkit = None
25
+
26
+
27
def _load_transformers():
    """Lazily import transformers/torch, caching the modules in globals."""
    global _transformers, _torch
    if _transformers is None:
        try:
            import torch
            import transformers
        except ImportError:
            raise ImportError(
                "transformers and torch are required. "
                "Install with: pip install transformers torch"
            )
        _transformers = transformers
        _torch = torch
    return _transformers, _torch
41
+
42
+
43
def _load_rdkit():
    """Lazily import RDKit, caching (Chem, AllChem) in a module global."""
    global _rdkit
    if _rdkit is None:
        try:
            from rdkit import Chem
            from rdkit.Chem import AllChem
        except ImportError:
            raise ImportError(
                "RDKit is required for fingerprint encoding. "
                "Install with: pip install rdkit"
            )
        _rdkit = (Chem, AllChem)
    return _rdkit
56
+
57
+
58
class MoleculeEncoderBackend(Enum):
    """Available molecule-encoding backends."""

    CHEMBERTA = "chemberta"        # transformer embeddings
    RDKIT_MORGAN = "rdkit_morgan"  # Morgan bit-vector fingerprints
    RDKIT_MACCS = "rdkit_maccs"    # 167-bit MACCS keys
62
+
63
+
64
class MoleculeEncoder(BioEncoder):
    """
    Encoder for SMILES molecules using ChemBERTa or RDKit fingerprints.

    The backend is chosen at construction time: "chemberta" uses a
    HuggingFace transformer (loaded lazily, GPU if available), while the
    RDKit backends compute classic fingerprints on CPU with no model load.

    Example:
        >>> encoder = MoleculeEncoder(backend="chemberta")
        >>> result = encoder.encode("CCO", Modality.SMILES)  # Ethanol
        >>> print(len(result.vector))  # 768

        >>> encoder = MoleculeEncoder(backend="rdkit_morgan")
        >>> result = encoder.encode("CCO", Modality.SMILES)
        >>> print(len(result.vector))  # 2048
    """

    SUPPORTED_MODELS = {
        "chemberta": "seyonec/ChemBERTa-zinc-base-v1",
        "chemberta-77m": "DeepChem/ChemBERTa-77M-MTR",
    }

    def __init__(
        self,
        backend: str = "chemberta",
        model_name: str = None,
        device: str = None,
        fp_size: int = 2048,  # For RDKit fingerprints
        fp_radius: int = 2,  # For Morgan fingerprints
    ):
        """
        Initialize MoleculeEncoder.

        Args:
            backend: "chemberta", "rdkit_morgan", or "rdkit_maccs"
            model_name: HuggingFace model path (for chemberta)
            device: torch device; auto-detects CUDA when None
            fp_size: Fingerprint size (for rdkit_morgan)
            fp_radius: Morgan fingerprint radius

        Raises:
            ValueError: If `backend` is not a MoleculeEncoderBackend value.
            ImportError: If the chosen backend's dependencies are missing.
        """
        # Raises ValueError for unknown backend strings.
        self.backend = MoleculeEncoderBackend(backend.lower())
        self.fp_size = fp_size
        self.fp_radius = fp_radius

        if self.backend == MoleculeEncoderBackend.CHEMBERTA:
            transformers, torch = _load_transformers()

            self.model_path = model_name or self.SUPPORTED_MODELS["chemberta"]

            if device is None:
                device = "cuda" if torch.cuda.is_available() else "cpu"
            self.device = device

            logger.info(f"Loading MoleculeEncoder: {self.model_path} on {self.device}")
            self.tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_path)
            self.model = transformers.AutoModel.from_pretrained(self.model_path)
            self.model.to(self.device)
            self.model.eval()

            # Output dimension follows the transformer's hidden size.
            self._dimension = self.model.config.hidden_size
        else:
            # RDKit fingerprints: no neural model, CPU only.
            _load_rdkit()
            self.device = "cpu"
            self.model = None
            self.tokenizer = None

            if self.backend == MoleculeEncoderBackend.RDKIT_MORGAN:
                self._dimension = fp_size
            else:  # MACCS keys are a fixed 167-bit vector
                self._dimension = 167

        logger.info(f"MoleculeEncoder ready (backend={backend}, dim={self._dimension})")

    @property
    def dimension(self) -> int:
        # Embedding dimensionality for the configured backend.
        return self._dimension

    @property
    def supported_modalities(self) -> List[Modality]:
        return [Modality.SMILES]

    def _encode_chemberta(self, smiles: str) -> List[float]:
        """Encode one SMILES with ChemBERTa (attention-masked mean pooling)."""
        transformers, torch = _load_transformers()

        inputs = self.tokenizer(
            smiles,
            return_tensors="pt",
            max_length=512,
            truncation=True,
            padding=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            # Mean pooling over non-padding tokens.
            attention_mask = inputs["attention_mask"].unsqueeze(-1)
            hidden_states = outputs.last_hidden_state
            embedding = (hidden_states * attention_mask).sum(1) / attention_mask.sum(1)

        return embedding.squeeze().cpu().numpy().tolist()

    def _encode_rdkit(self, smiles: str) -> List[float]:
        """Encode one SMILES as a Morgan or MACCS bit-vector fingerprint.

        Raises:
            ValueError: If RDKit cannot parse the SMILES string.
        """
        Chem, AllChem = _load_rdkit()

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Invalid SMILES: {smiles}")

        if self.backend == MoleculeEncoderBackend.RDKIT_MORGAN:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, self.fp_radius, nBits=self.fp_size)
        else:  # MACCS
            from rdkit.Chem import MACCSkeys
            fp = MACCSkeys.GenMACCSKeys(mol)

        # Bit vector -> list of 0/1 ints (valid floats for the interface).
        return list(fp)

    def encode(self, content: str, modality: Modality = Modality.SMILES) -> EmbeddingResult:
        """Encode SMILES into a vector.

        Raises:
            ValueError: If `modality` is not SMILES, or the SMILES is invalid.
        """
        if modality != Modality.SMILES:
            raise ValueError(f"MoleculeEncoder only supports SMILES modality, got {modality}")

        if self.backend == MoleculeEncoderBackend.CHEMBERTA:
            vector = self._encode_chemberta(content)
        else:
            vector = self._encode_rdkit(content)

        return EmbeddingResult(
            vector=vector,
            modality=modality,
            dimension=self._dimension,
            metadata={"backend": self.backend.value, "smiles": content}
        )

    def batch_encode(self, contents: List[str], modality: Modality = Modality.SMILES) -> List[EmbeddingResult]:
        """Batch encode SMILES.

        ChemBERTa runs a single padded forward pass for the whole batch;
        the RDKit backends simply encode items one by one.
        """
        if self.backend == MoleculeEncoderBackend.CHEMBERTA:
            transformers, torch = _load_transformers()

            inputs = self.tokenizer(
                contents,
                return_tensors="pt",
                max_length=512,
                truncation=True,
                padding=True
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)
                attention_mask = inputs["attention_mask"].unsqueeze(-1)
                hidden_states = outputs.last_hidden_state
                embeddings = (hidden_states * attention_mask).sum(1) / attention_mask.sum(1)

            results = []
            for i, emb in enumerate(embeddings):
                results.append(EmbeddingResult(
                    vector=emb.cpu().numpy().tolist(),
                    modality=modality,
                    dimension=self._dimension,
                    metadata={"backend": self.backend.value, "smiles": contents[i]}
                ))
            return results
        else:
            return [self.encode(s, modality) for s in contents]
bioflow/plugins/encoders/protein_encoder.py ADDED
@@ -0,0 +1,188 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Protein Encoder - ESM-2 / ProtBERT
3
+ ===================================
4
+
5
+ Encodes protein sequences into vectors.
6
+
7
+ Models:
8
+ - facebook/esm2_t33_650M_UR50D (default, 1280-dim)
9
+ - facebook/esm2_t12_35M_UR50D (smaller, 480-dim)
10
+ - Rostlab/prot_bert (1024-dim)
11
+ """
12
+
13
+ import logging
14
+ from typing import List, Optional
15
+
16
+ from bioflow.core import BioEncoder, Modality, EmbeddingResult
17
+
18
+ logger = logging.getLogger(__name__)
19
+
20
+ # Lazy imports
21
+ _transformers = None
22
+ _torch = None
23
+
24
+
25
def _load_transformers():
    """Lazily import transformers/torch, caching the modules in globals."""
    global _transformers, _torch
    if _transformers is None:
        try:
            import torch
            import transformers
        except ImportError:
            raise ImportError(
                "transformers and torch are required. "
                "Install with: pip install transformers torch"
            )
        _transformers = transformers
        _torch = torch
    return _transformers, _torch
39
+
40
+
41
class ProteinEncoder(BioEncoder):
    """
    Encoder for protein sequences using ESM-2 or ProtBERT.

    The HuggingFace model is resolved from a short key (see SUPPORTED_MODELS)
    or used verbatim as a model path; loaded once at construction.

    Example:
        >>> encoder = ProteinEncoder(model_name="esm2_t12")
        >>> result = encoder.encode("MKTVRQERLKSIVRILERSKEPVSG", Modality.PROTEIN)
        >>> print(len(result.vector))  # 480
    """

    SUPPORTED_MODELS = {
        "esm2_t33": "facebook/esm2_t33_650M_UR50D",  # 1280-dim, 650M params
        "esm2_t30": "facebook/esm2_t30_150M_UR50D",  # 640-dim, 150M params
        "esm2_t12": "facebook/esm2_t12_35M_UR50D",  # 480-dim, 35M params (fast)
        "esm2_t6": "facebook/esm2_t6_8M_UR50D",  # 320-dim, 8M params (fastest)
        "protbert": "Rostlab/prot_bert",  # 1024-dim
        "protbert_bfd": "Rostlab/prot_bert_bfd",  # 1024-dim, larger
    }

    def __init__(
        self,
        model_name: str = "esm2_t12",
        device: str = None,
        max_length: int = 1024,
        pooling: str = "mean"
    ):
        """
        Initialize ProteinEncoder.

        Args:
            model_name: Model key from SUPPORTED_MODELS or a HuggingFace path
            device: torch device; auto-detects CUDA when None
            max_length: Max tokenized sequence length (longer is truncated)
            pooling: Pooling strategy ("mean" or "cls"; anything else
                falls through to mean)
        """
        transformers, torch = _load_transformers()

        # Resolve short key -> HF path; unknown keys are used as-is.
        self.model_path = self.SUPPORTED_MODELS.get(model_name.lower(), model_name)
        self.max_length = max_length
        self.pooling = pooling

        # Set device
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = device

        # Load model
        logger.info(f"Loading ProteinEncoder: {self.model_path} on {self.device}")
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_path)
        self.model = transformers.AutoModel.from_pretrained(self.model_path)
        self.model.to(self.device)
        self.model.eval()

        # Output dimension follows the model's hidden size.
        self._dimension = self.model.config.hidden_size
        logger.info(f"ProteinEncoder ready (dim={self._dimension})")

    @property
    def dimension(self) -> int:
        # Embedding dimensionality of the loaded model.
        return self._dimension

    @property
    def supported_modalities(self) -> List[Modality]:
        return [Modality.PROTEIN]

    def _preprocess_sequence(self, sequence: str) -> str:
        """Normalize a raw sequence for the tokenizer."""
        # Remove whitespace; amino-acid codes are conventionally upper-case.
        sequence = sequence.strip().upper()

        # ProtBERT tokenizers expect space-separated residues
        # ("M K T ..."); ESM-2 takes the raw string.
        if "prot_bert" in self.model_path.lower():
            sequence = " ".join(list(sequence))

        return sequence

    def encode(self, content: str, modality: Modality = Modality.PROTEIN) -> EmbeddingResult:
        """Encode a protein sequence into a vector.

        Raises:
            ValueError: If `modality` is not PROTEIN.
        """
        if modality != Modality.PROTEIN:
            raise ValueError(f"ProteinEncoder only supports PROTEIN modality, got {modality}")

        transformers, torch = _load_transformers()

        # Preprocess
        sequence = self._preprocess_sequence(content)

        # Tokenize
        inputs = self.tokenizer(
            sequence,
            return_tensors="pt",
            max_length=self.max_length,
            truncation=True,
            padding=True
        ).to(self.device)

        # Encode
        with torch.no_grad():
            outputs = self.model(**inputs)
            hidden_states = outputs.last_hidden_state

            if self.pooling == "cls":
                # First token's hidden state represents the sequence.
                embedding = hidden_states[:, 0, :]
            else:  # mean over non-padding tokens
                attention_mask = inputs["attention_mask"].unsqueeze(-1)
                embedding = (hidden_states * attention_mask).sum(1) / attention_mask.sum(1)

        vector = embedding.squeeze().cpu().numpy().tolist()

        return EmbeddingResult(
            vector=vector,
            modality=modality,
            dimension=self._dimension,
            metadata={
                "model": self.model_path,
                "sequence_length": len(content)
            }
        )

    def batch_encode(self, contents: List[str], modality: Modality = Modality.PROTEIN) -> List[EmbeddingResult]:
        """Batch encode protein sequences in one padded forward pass.

        Note: always uses attention-masked mean pooling, regardless of
        the `pooling` setting used by encode().
        """
        transformers, torch = _load_transformers()

        sequences = [self._preprocess_sequence(s) for s in contents]

        inputs = self.tokenizer(
            sequences,
            return_tensors="pt",
            max_length=self.max_length,
            truncation=True,
            padding=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            hidden_states = outputs.last_hidden_state
            attention_mask = inputs["attention_mask"].unsqueeze(-1)
            embeddings = (hidden_states * attention_mask).sum(1) / attention_mask.sum(1)

        results = []
        for i, emb in enumerate(embeddings):
            results.append(EmbeddingResult(
                vector=emb.cpu().numpy().tolist(),
                modality=modality,
                dimension=self._dimension,
                metadata={"model": self.model_path, "sequence_length": len(contents[i])}
            ))

        return results
bioflow/plugins/encoders/text_encoder.py ADDED
@@ -0,0 +1,177 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Text Encoder - PubMedBERT / SciBERT
3
+ ====================================
4
+
5
+ Encodes biomedical text (abstracts, clinical notes) into vectors.
6
+
7
+ Models:
8
+ - microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext (default)
9
+ - allenai/scibert_scivocab_uncased
10
+ - allenai/specter
11
+ """
12
+
13
+ import logging
14
+ from typing import List, Optional
15
+ import numpy as np
16
+
17
+ from bioflow.core import BioEncoder, Modality, EmbeddingResult
18
+
19
+ logger = logging.getLogger(__name__)
20
+
21
+ # Lazy imports for optional dependencies
22
+ _transformers = None
23
+ _torch = None
24
+
25
+
26
def _load_transformers():
    """Import transformers/torch on first use and cache the modules.

    Returns:
        Tuple of (transformers, torch) modules.

    Raises:
        ImportError: If either optional dependency is missing.
    """
    global _transformers, _torch
    if _transformers is not None:
        return _transformers, _torch
    try:
        import transformers
        import torch
    except ImportError:
        raise ImportError(
            "transformers and torch are required for TextEncoder. "
            "Install with: pip install transformers torch"
        )
    _transformers = transformers
    _torch = torch
    return _transformers, _torch
40
+
41
+
42
class TextEncoder(BioEncoder):
    """
    Encoder for biomedical text using PubMedBERT or similar models.

    Example:
        >>> encoder = TextEncoder(model_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
        >>> result = encoder.encode("EGFR mutations in lung cancer", Modality.TEXT)
        >>> print(len(result.vector))  # 768
    """

    # Short aliases mapping to full HuggingFace checkpoint paths.
    SUPPORTED_MODELS = {
        "pubmedbert": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
        "scibert": "allenai/scibert_scivocab_uncased",
        "specter": "allenai/specter",
        "biobert": "dmis-lab/biobert-base-cased-v1.2",
    }

    def __init__(
        self,
        model_name: str = "pubmedbert",
        device: Optional[str] = None,
        max_length: int = 512,
        pooling: str = "mean"  # mean, cls, max
    ):
        """
        Initialize TextEncoder.

        Args:
            model_name: Model key (see SUPPORTED_MODELS) or HuggingFace model path
            device: torch device (auto-detected if None)
            max_length: Maximum token length for truncation
            pooling: Pooling strategy for embeddings ("mean", "cls", or "max")
        """
        transformers, torch = _load_transformers()

        # Resolve alias to full HuggingFace path; unknown keys pass through
        # unchanged so arbitrary model paths still work.
        self.model_path = self.SUPPORTED_MODELS.get(model_name.lower(), model_name)
        self.max_length = max_length
        self.pooling = pooling

        # Auto-detect device when not specified.
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = device

        # Load model and tokenizer; eval() disables dropout for deterministic output.
        logger.info(f"Loading TextEncoder: {self.model_path} on {self.device}")
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_path)
        self.model = transformers.AutoModel.from_pretrained(self.model_path)
        self.model.to(self.device)
        self.model.eval()

        self._dimension = self.model.config.hidden_size
        logger.info(f"TextEncoder ready (dim={self._dimension})")

    @property
    def dimension(self) -> int:
        """Embedding dimension (hidden size of the underlying model)."""
        return self._dimension

    @property
    def supported_modalities(self) -> List[Modality]:
        """This encoder handles text only."""
        return [Modality.TEXT]

    def _pool(self, hidden_states, attention_mask):
        """Reduce per-token hidden states to one vector per input.

        Args:
            hidden_states: Model output of shape (batch, tokens, hidden).
            attention_mask: Mask of shape (batch, tokens); 1 for real tokens.

        Raises:
            ValueError: If self.pooling is not "cls", "mean", or "max".
        """
        if self.pooling == "cls":
            return hidden_states[:, 0, :]
        elif self.pooling == "mean":
            # Mask out padding so it doesn't dilute the average.
            mask = attention_mask.unsqueeze(-1)
            return (hidden_states * mask).sum(1) / mask.sum(1)
        elif self.pooling == "max":
            return hidden_states.max(dim=1).values
        else:
            raise ValueError(f"Unknown pooling: {self.pooling}")

    def encode(self, content: str, modality: Modality = Modality.TEXT) -> EmbeddingResult:
        """Encode a single text into a vector."""
        if modality != Modality.TEXT:
            raise ValueError(f"TextEncoder only supports TEXT modality, got {modality}")

        transformers, torch = _load_transformers()

        # Tokenize
        inputs = self.tokenizer(
            content,
            return_tensors="pt",
            max_length=self.max_length,
            truncation=True,
            padding=True
        ).to(self.device)

        # Encode without gradient tracking (inference only)
        with torch.no_grad():
            outputs = self.model(**inputs)
            embedding = self._pool(outputs.last_hidden_state, inputs["attention_mask"])

        vector = embedding.squeeze().cpu().numpy().tolist()

        return EmbeddingResult(
            vector=vector,
            modality=modality,
            dimension=self._dimension,
            metadata={"model": self.model_path, "pooling": self.pooling}
        )

    def batch_encode(self, contents: List[str], modality: Modality = Modality.TEXT) -> List[EmbeddingResult]:
        """Batch encode multiple texts.

        Honors the configured pooling strategy consistently with encode().
        Previously any non-"mean" setting (including "max") silently fell
        back to CLS pooling here, and unknown strategies never raised.
        """
        transformers, torch = _load_transformers()

        inputs = self.tokenizer(
            contents,
            return_tensors="pt",
            max_length=self.max_length,
            truncation=True,
            padding=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            embeddings = self._pool(outputs.last_hidden_state, inputs["attention_mask"])

        results = []
        for i, emb in enumerate(embeddings):
            results.append(EmbeddingResult(
                vector=emb.cpu().numpy().tolist(),
                modality=modality,
                dimension=self._dimension,
                metadata={"model": self.model_path, "index": i}
            ))

        return results
bioflow/plugins/obm_encoder.py ADDED
@@ -0,0 +1,294 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ OBM Encoder - Unified Multimodal Encoder
3
+ ==========================================
4
+
5
+ The OBM (Open BioMed) Encoder is the central multimodal embedding engine
6
+ that unifies text, molecules, and proteins into a common vector space.
7
+
8
+ This is the "heart" of the BioFlow platform - it enables cross-modal
9
+ similarity search (e.g., find proteins similar to a text description).
10
+
11
+ Architecture:
12
+ ┌─────────────────────────────────────────────┐
13
+ │ OBMEncoder │
14
+ │ ┌─────────┐ ┌──────────┐ ┌─────────────┐ │
15
+ │ │ Text │ │ Molecule │ │ Protein │ │
16
+ │ │ Encoder │ │ Encoder │ │ Encoder │ │
17
+ │ │(PubMed) │ │(ChemBERTa│ │ (ESM-2) │ │
18
+ │ └────┬────┘ └────┬─────┘ └──────┬──────┘ │
19
+ │ │ │ │ │
20
+ │ └───────────┼──────────────┘ │
21
+ │ ▼ │
22
+ │ Unified Embedding │
23
+ │ (768-dim) │
24
+ └─────────────────────────────────────────────┘
25
+ """
26
+
27
+ import logging
28
+ from typing import List, Dict, Any, Optional, Union
29
+ import numpy as np
30
+
31
+ from bioflow.core import BioEncoder, Modality, EmbeddingResult
32
+
33
+ logger = logging.getLogger(__name__)
34
+
35
+
36
class OBMEncoder(BioEncoder):
    """
    Unified Multimodal Encoder for BioFlow.

    Combines specialized encoders for each modality and optionally
    projects them into a shared embedding space.

    Example:
        >>> obm = OBMEncoder()
        >>>
        >>> # Encode different modalities
        >>> text_emb = obm.encode("EGFR inhibitor for lung cancer", Modality.TEXT)
        >>> mol_emb = obm.encode("CC(=O)Oc1ccccc1C(=O)O", Modality.SMILES)  # Aspirin
        >>> prot_emb = obm.encode("MKTVRQERLKSIVRILERSKEPVSG", Modality.PROTEIN)
        >>>
        >>> # All embeddings have the same dimension
        >>> assert len(text_emb.vector) == len(mol_emb.vector) == len(prot_emb.vector)

    Attributes:
        text_encoder: Encoder for biomedical text
        molecule_encoder: Encoder for SMILES molecules
        protein_encoder: Encoder for protein sequences
        output_dim: Dimension of output embeddings (after projection)
    """

    def __init__(
        self,
        text_model: str = "pubmedbert",
        molecule_model: str = "chemberta",
        protein_model: str = "esm2_t12",
        device: Optional[str] = None,
        output_dim: int = 768,
        lazy_load: bool = True
    ):
        """
        Initialize OBMEncoder.

        Args:
            text_model: Model for text encoding
            molecule_model: Model for molecule encoding
            protein_model: Model for protein encoding
            device: torch device (auto-detected if None)
            output_dim: Target dimension for all embeddings
            lazy_load: If True, load encoders on first use
        """
        self.text_model = text_model
        self.molecule_model = molecule_model
        self.protein_model = protein_model
        self.device = device
        self._output_dim = output_dim
        self.lazy_load = lazy_load

        # Encoders (lazy loaded on first use unless lazy_load=False)
        self._text_encoder = None
        self._molecule_encoder = None
        self._protein_encoder = None

        # Projection matrices (for dimension alignment)
        self._projections: Dict[Modality, Any] = {}

        if not lazy_load:
            self._load_all_encoders()

        logger.info(f"OBMEncoder initialized (lazy_load={lazy_load}, output_dim={output_dim})")

    def _load_all_encoders(self):
        """Eagerly load all three modality encoders."""
        self._get_text_encoder()
        self._get_molecule_encoder()
        self._get_protein_encoder()

    def _get_text_encoder(self):
        """Get or create text encoder (imported lazily to keep startup light)."""
        if self._text_encoder is None:
            from bioflow.plugins.encoders.text_encoder import TextEncoder
            self._text_encoder = TextEncoder(
                model_name=self.text_model,
                device=self.device
            )
        return self._text_encoder

    def _get_molecule_encoder(self):
        """Get or create molecule encoder.

        An "rdkit*" molecule_model selects the RDKit backend; anything else
        is treated as a ChemBERTa-style HuggingFace model name.
        """
        if self._molecule_encoder is None:
            from bioflow.plugins.encoders.molecule_encoder import MoleculeEncoder
            self._molecule_encoder = MoleculeEncoder(
                backend=self.molecule_model if self.molecule_model.startswith("rdkit") else "chemberta",
                model_name=None if self.molecule_model.startswith("rdkit") else self.molecule_model,
                device=self.device
            )
        return self._molecule_encoder

    def _get_protein_encoder(self):
        """Get or create protein encoder."""
        if self._protein_encoder is None:
            from bioflow.plugins.encoders.protein_encoder import ProteinEncoder
            self._protein_encoder = ProteinEncoder(
                model_name=self.protein_model,
                device=self.device
            )
        return self._protein_encoder

    def _get_encoder_for_modality(self, modality: Modality) -> BioEncoder:
        """Get the appropriate encoder for a modality.

        Raises:
            ValueError: For modalities other than TEXT, SMILES, PROTEIN.
        """
        if modality == Modality.TEXT:
            return self._get_text_encoder()
        elif modality == Modality.SMILES:
            return self._get_molecule_encoder()
        elif modality == Modality.PROTEIN:
            return self._get_protein_encoder()
        else:
            raise ValueError(f"Unsupported modality: {modality}")

    def _project_embedding(self, vector: List[float], source_dim: int) -> List[float]:
        """
        Project embedding to output dimension.

        For simplicity, uses truncation/padding. In production,
        you would train a projection layer.
        """
        if source_dim == self._output_dim:
            return vector
        elif source_dim > self._output_dim:
            # Truncate (or use PCA in production)
            return vector[:self._output_dim]
        else:
            # Pad with zeros (or use learned projection)
            return vector + [0.0] * (self._output_dim - source_dim)

    @property
    def dimension(self) -> int:
        """Unified output dimension for all modalities."""
        return self._output_dim

    @property
    def supported_modalities(self) -> List[Modality]:
        return [Modality.TEXT, Modality.SMILES, Modality.PROTEIN]

    def encode(self, content: Any, modality: Modality) -> EmbeddingResult:
        """
        Encode content from any supported modality.

        Args:
            content: Raw input (text, SMILES, or protein sequence)
            modality: Type of the input

        Returns:
            EmbeddingResult with unified dimension
        """
        # Get appropriate encoder
        encoder = self._get_encoder_for_modality(modality)

        # Encode
        result = encoder.encode(content, modality)

        # Project to unified dimension
        projected_vector = self._project_embedding(result.vector, encoder.dimension)

        return EmbeddingResult(
            vector=projected_vector,
            modality=modality,
            dimension=self._output_dim,
            metadata={
                **result.metadata,
                "source_encoder": encoder.__class__.__name__,
                "source_dim": encoder.dimension,
                "projected": encoder.dimension != self._output_dim
            }
        )

    def encode_auto(self, content: str) -> EmbeddingResult:
        """
        Auto-detect modality and encode.

        Uses heuristics to determine if input is:
        - Protein: Contains only amino acid letters (ACDEFGHIKLMNPQRSTVWY)
        - SMILES: Contains typical SMILES characters (=, #, @, etc.)
        - Text: Everything else
        """
        content = content.strip()

        # Guard: an empty string would crash the SMILES heuristic below
        # (content[0]); encode it as plain text instead.
        if not content:
            return self.encode(content, Modality.TEXT)

        # Check for protein (only amino acid letters, reasonably long)
        amino_acids = set("ACDEFGHIKLMNPQRSTVWY")
        if content.isupper() and set(content).issubset(amino_acids) and len(content) > 10:
            return self.encode(content, Modality.PROTEIN)

        # Check for SMILES (structural characters, or short unspaced token)
        smiles_chars = set("=#@[]()+-.")
        if any(c in content for c in smiles_chars) or (
            len(content) < 100 and " " not in content and content[0].isupper()
        ):
            try:
                # Validate as SMILES
                return self.encode(content, Modality.SMILES)
            except Exception:
                # Fall through to text if SMILES encoding fails. A bare
                # `except:` here would also swallow KeyboardInterrupt/SystemExit.
                pass

        # Default to text
        return self.encode(content, Modality.TEXT)

    def batch_encode(
        self,
        contents: List[Any],
        modality: Modality
    ) -> List[EmbeddingResult]:
        """Batch encode multiple items of the same modality."""
        encoder = self._get_encoder_for_modality(modality)
        results = encoder.batch_encode(contents, modality)

        # Project all to unified dimension
        projected_results = []
        for result in results:
            projected_vector = self._project_embedding(result.vector, encoder.dimension)
            projected_results.append(EmbeddingResult(
                vector=projected_vector,
                modality=modality,
                dimension=self._output_dim,
                metadata={**result.metadata, "source_dim": encoder.dimension}
            ))

        return projected_results

    def similarity(self, emb1: EmbeddingResult, emb2: EmbeddingResult) -> float:
        """
        Compute cosine similarity between two embeddings.

        Useful for cross-modal similarity (e.g., text-molecule).
        Returns 0.0 when either vector has zero norm (previously this
        produced NaN from a 0/0 division).
        """
        v1 = np.array(emb1.vector)
        v2 = np.array(emb2.vector)

        denom = np.linalg.norm(v1) * np.linalg.norm(v2)
        if denom == 0.0:
            return 0.0
        return float(np.dot(v1, v2) / denom)

    def get_encoder_info(self) -> Dict[str, Any]:
        """Get information about currently loaded encoders (lazy ones not yet
        used are omitted)."""
        info = {
            "output_dim": self._output_dim,
            "device": self.device,
            "encoders": {}
        }

        if self._text_encoder:
            info["encoders"]["text"] = {
                "model": self._text_encoder.model_path,
                "dim": self._text_encoder.dimension
            }

        if self._molecule_encoder:
            info["encoders"]["molecule"] = {
                "backend": self._molecule_encoder.backend.value,
                "dim": self._molecule_encoder.dimension
            }

        if self._protein_encoder:
            info["encoders"]["protein"] = {
                "model": self._protein_encoder.model_path,
                "dim": self._protein_encoder.dimension
            }

        return info
bioflow/plugins/obm_plugin.py ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ OBM Plugin - Deprecated
3
+ ========================
4
+
5
+ This module is deprecated. Use OBMEncoder from bioflow.plugins.obm_encoder instead.
6
+ """
7
+
8
+ # Redirect to new implementation
9
+ from bioflow.plugins.obm_encoder import OBMEncoder
10
+
11
+ # Alias for backward compatibility
12
+ OBMPlugin = OBMEncoder
13
+
14
+ __all__ = ["OBMEncoder", "OBMPlugin"]
15
+ if modality == Modality.TEXT:
16
+ return self._encode_text(content)
17
+ elif modality == Modality.SMILES:
18
+ return self._encode_smiles(content)
19
+ elif modality == Modality.PROTEIN:
20
+ return self._encode_protein(content)
21
+ return []
22
+
23
+ @property
24
+ def dimension(self) -> int:
25
+ return 4096 # Placeholder for model dimension
26
+
27
+ def _encode_text(self, text: str):
28
+ # Placeholder for text encoding using open-source model
29
+ pass
30
+
31
+ def _encode_smiles(self, smiles: str):
32
+ # Placeholder for SMILES encoding using open-source model
33
+ pass
34
+
35
+ def _encode_protein(self, protein: str):
36
+ # Placeholder for protein encoding using open-source model
37
+ pass
38
+
39
+ # Auto-register the tool so the orchestrator can find it
40
+ ToolRegistry.register_encoder("obm", OBMPlugin())
bioflow/plugins/qdrant_retriever.py ADDED
@@ -0,0 +1,312 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Qdrant Retriever - Vector Database Integration
3
+ ================================================
4
+
5
+ Implements BioRetriever interface for Qdrant vector database.
6
+ Provides semantic search and ingestion for the BioFlow platform.
7
+ """
8
+
9
+ import logging
10
+ from typing import List, Dict, Any, Optional, Union
11
+ import uuid
12
+
13
+ from bioflow.core import BioRetriever, BioEncoder, Modality, RetrievalResult
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+ # Lazy import
18
+ _qdrant_client = None
19
+
20
+
21
def _load_qdrant():
    """Import qdrant-client lazily and cache the symbols we need.

    Returns:
        Dict mapping symbol names to qdrant-client classes.

    Raises:
        ImportError: If qdrant-client is not installed.
    """
    global _qdrant_client
    if _qdrant_client is not None:
        return _qdrant_client
    try:
        from qdrant_client import QdrantClient
        from qdrant_client.models import (
            PointStruct,
            VectorParams,
            Distance,
            Filter,
            FieldCondition,
            MatchValue,
        )
    except ImportError:
        raise ImportError(
            "qdrant-client is required. Install with: pip install qdrant-client"
        )
    _qdrant_client = {
        "QdrantClient": QdrantClient,
        "PointStruct": PointStruct,
        "VectorParams": VectorParams,
        "Distance": Distance,
        "Filter": Filter,
        "FieldCondition": FieldCondition,
        "MatchValue": MatchValue,
    }
    return _qdrant_client
48
+
49
+
50
class QdrantRetriever(BioRetriever):
    """
    Vector database retriever using Qdrant.

    Supports:
    - Semantic search with embedding vectors
    - Payload filtering (by modality, species, etc.)
    - Batch ingestion of data

    Example:
        >>> from bioflow.plugins.obm_encoder import OBMEncoder
        >>>
        >>> encoder = OBMEncoder()
        >>> retriever = QdrantRetriever(encoder=encoder, collection="molecules")
        >>>
        >>> # Ingest data
        >>> retriever.ingest("CCO", Modality.SMILES, {"name": "Ethanol"})
        >>>
        >>> # Search
        >>> results = retriever.search("alcohol compounds", limit=5)
    """

    def __init__(
        self,
        encoder: BioEncoder,
        collection: str = "bioflow_memory",
        url: str = None,
        path: str = None,
        distance: str = "cosine"
    ):
        """
        Initialize QdrantRetriever.

        Args:
            encoder: BioEncoder instance for vectorization
            collection: Default collection name
            url: Qdrant server URL (for remote)
            path: Local path for persistent storage
            distance: Distance metric (cosine, euclid, dot)
        """
        qdrant = _load_qdrant()

        self.encoder = encoder
        self.collection = collection
        self.distance = distance

        # Initialize client; preference: remote > local persistent > in-memory.
        if url:
            self.client = qdrant["QdrantClient"](url=url)
            logger.info(f"Connected to Qdrant server at {url}")
        elif path:
            self.client = qdrant["QdrantClient"](path=path)
            logger.info(f"Using local Qdrant at {path}")
        else:
            self.client = qdrant["QdrantClient"](":memory:")
            logger.info("Using in-memory Qdrant (data will be lost on exit)")

        # Create collection if not exists
        self._ensure_collection()

    def _ensure_collection(self, name: str = None):
        """Create the collection if it does not already exist, sized to the
        encoder's embedding dimension."""
        qdrant = _load_qdrant()
        name = name or self.collection

        collections = [c.name for c in self.client.get_collections().collections]

        if name not in collections:
            distance_map = {
                "cosine": qdrant["Distance"].COSINE,
                "euclid": qdrant["Distance"].EUCLID,
                "dot": qdrant["Distance"].DOT,
            }

            self.client.create_collection(
                collection_name=name,
                vectors_config=qdrant["VectorParams"](
                    size=self.encoder.dimension,
                    # Unknown distance strings silently fall back to cosine.
                    distance=distance_map.get(self.distance, qdrant["Distance"].COSINE)
                )
            )
            logger.info(f"Created collection: {name} (dim={self.encoder.dimension})")

    def search(
        self,
        query: Union[List[float], str],
        limit: int = 10,
        filters: Optional[Dict[str, Any]] = None,
        collection: str = None,
        modality: Modality = Modality.TEXT,
        **kwargs
    ) -> List[RetrievalResult]:
        """
        Search for similar items.

        Args:
            query: Query vector or raw content to encode
            limit: Maximum results
            filters: Payload filters (e.g., {"species": "human"})
            collection: Collection to search (uses default if None)
            modality: Modality of query (if string)

        Returns:
            List of RetrievalResult sorted by similarity
        """
        qdrant = _load_qdrant()
        collection = collection or self.collection

        # Encode query if string
        if isinstance(query, str):
            result = self.encoder.encode(query, modality)
            query_vector = result.vector
        else:
            query_vector = query

        # Build filter (all conditions must match)
        qdrant_filter = None
        if filters:
            conditions = []
            for key, value in filters.items():
                conditions.append(
                    qdrant["FieldCondition"](
                        key=key,
                        match=qdrant["MatchValue"](value=value)
                    )
                )
            qdrant_filter = qdrant["Filter"](must=conditions)

        # Search (use query method for newer qdrant-client versions)
        try:
            # New API (qdrant-client >= 1.6)
            results = self.client.query_points(
                collection_name=collection,
                query=query_vector,
                limit=limit,
                query_filter=qdrant_filter
            ).points
        except AttributeError:
            # Fallback to old API
            results = self.client.search(
                collection_name=collection,
                query_vector=query_vector,
                limit=limit,
                query_filter=qdrant_filter
            )

        # Convert to RetrievalResult. Qdrant may return points with a None
        # payload (e.g. with_payload=False upstream), which previously
        # crashed on r.payload.get(...).
        hits = []
        for r in results:
            payload = r.payload or {}
            hits.append(RetrievalResult(
                id=str(r.id),
                score=r.score,
                content=payload.get("content", ""),
                modality=Modality(payload.get("modality", "text")),
                payload=payload
            ))
        return hits

    def ingest(
        self,
        content: Any,
        modality: Modality,
        payload: Optional[Dict[str, Any]] = None,
        collection: str = None,
        id: str = None
    ) -> str:
        """
        Ingest content into the vector database.

        Args:
            content: Raw content to encode
            modality: Type of content
            payload: Additional metadata
            collection: Target collection
            id: Custom ID (auto-generated if None).  NOTE: shadows the `id`
                builtin; kept for backward compatibility with callers using
                the keyword.

        Returns:
            ID of inserted item
        """
        qdrant = _load_qdrant()
        collection = collection or self.collection

        # Encode content
        result = self.encoder.encode(content, modality)

        # Generate ID
        point_id = id or str(uuid.uuid4())

        # Build payload; user metadata can override content/modality keys.
        full_payload = {
            "content": content,
            "modality": modality.value,
            **(payload or {})
        }

        # Insert (upsert overwrites an existing point with the same ID)
        self.client.upsert(
            collection_name=collection,
            points=[
                qdrant["PointStruct"](
                    id=point_id,
                    vector=result.vector,
                    payload=full_payload
                )
            ]
        )

        logger.debug(f"Ingested {modality.value}: {point_id}")
        return point_id

    def batch_ingest(
        self,
        items: List[Dict[str, Any]],
        collection: str = None
    ) -> List[str]:
        """
        Batch ingest multiple items.

        Args:
            items: List of {"content": ..., "modality": ..., "payload": ...}
            collection: Target collection

        Returns:
            List of inserted IDs
        """
        qdrant = _load_qdrant()
        collection = collection or self.collection

        points = []
        ids = []

        for item in items:
            content = item["content"]
            modality = Modality(item.get("modality", "text"))
            payload = item.get("payload", {})

            result = self.encoder.encode(content, modality)
            point_id = str(uuid.uuid4())

            points.append(
                qdrant["PointStruct"](
                    id=point_id,
                    vector=result.vector,
                    payload={"content": content, "modality": modality.value, **payload}
                )
            )
            ids.append(point_id)

        self.client.upsert(collection_name=collection, points=points)
        logger.info(f"Batch ingested {len(ids)} items to {collection}")

        return ids

    def count(self, collection: str = None) -> int:
        """Get count of items in collection."""
        collection = collection or self.collection
        return self.client.count(collection_name=collection).count

    def delete_collection(self, collection: str = None):
        """Delete a collection."""
        collection = collection or self.collection
        self.client.delete_collection(collection_name=collection)
        logger.info(f"Deleted collection: {collection}")
bioflow/qdrant_manager.py ADDED
@@ -0,0 +1,365 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Qdrant Manager - Vector Database Integration
3
+ ==============================================
4
+
5
+ This module provides high-level management for Qdrant collections,
6
+ including ingestion, search, and retrieval operations for BioFlow.
7
+ """
8
+
9
+ import logging
10
+ from typing import List, Dict, Any, Optional, Union
11
+ from dataclasses import dataclass, field
12
+ from enum import Enum
13
+ import uuid
14
+
15
+ try:
16
+ from qdrant_client import QdrantClient
17
+ from qdrant_client.models import (
18
+ PointStruct,
19
+ VectorParams,
20
+ Distance,
21
+ Filter,
22
+ FieldCondition,
23
+ MatchValue,
24
+ SearchRequest,
25
+ UpdateStatus
26
+ )
27
+ QDRANT_AVAILABLE = True
28
+ except ImportError:
29
+ QDRANT_AVAILABLE = False
30
+
31
+ from bioflow.obm_wrapper import OBMWrapper, EmbeddingResult, ModalityType
32
+
33
+ logging.basicConfig(level=logging.INFO)
34
+ logger = logging.getLogger(__name__)
35
+
36
+
37
@dataclass
class SearchResult:
    """Container for search results."""
    # Point ID as stored in Qdrant.
    id: str
    # Similarity score returned by the search backend.
    score: float
    # Raw stored content (text, SMILES string, or sequence — as ingested).
    content: str
    # Modality label of the content (string, e.g. "text"; see QdrantManager.ingest).
    modality: str
    # Full stored payload including any user-supplied metadata.
    payload: Dict[str, Any] = field(default_factory=dict)
45
+
46
+
47
+ class QdrantManager:
48
+ """
49
+ High-level manager for Qdrant vector database operations.
50
+
51
+ Provides methods for:
52
+ - Collection management (create, delete, info)
53
+ - Data ingestion with automatic embedding
54
+ - Cross-modal similarity search
55
+ - Filtered retrieval
56
+ """
57
+
58
+ def __init__(
59
+ self,
60
+ obm: OBMWrapper,
61
+ qdrant_url: str = None,
62
+ qdrant_path: str = None,
63
+ default_collection: str = "bioflow_memory"
64
+ ):
65
+ """
66
+ Initialize QdrantManager.
67
+
68
+ Args:
69
+ obm: Initialized OBMWrapper instance.
70
+ qdrant_url: URL for remote Qdrant server.
71
+ qdrant_path: Path for local persistent storage.
72
+ default_collection: Default collection name.
73
+ """
74
+ if not QDRANT_AVAILABLE:
75
+ raise ImportError("qdrant-client is required. Install with: pip install qdrant-client")
76
+
77
+ self.obm = obm
78
+ self.default_collection = default_collection
79
+
80
+ if qdrant_url:
81
+ self.client = QdrantClient(url=qdrant_url)
82
+ logger.info(f"Connected to Qdrant server at {qdrant_url}")
83
+ elif qdrant_path:
84
+ self.client = QdrantClient(path=qdrant_path)
85
+ logger.info(f"Using local Qdrant at {qdrant_path}")
86
+ else:
87
+ self.client = QdrantClient(":memory:")
88
+ logger.info("Using in-memory Qdrant (data will be lost on exit)")
89
+
90
    def create_collection(
        self,
        name: str = None,
        recreate: bool = False
    ) -> bool:
        """
        Create a new collection.

        Args:
            name: Collection name (uses default if None).
            recreate: If True, deletes existing collection first.

        Returns:
            True if created successfully; False if creation failed
            (most commonly because the collection already exists).
        """
        name = name or self.default_collection

        if recreate:
            try:
                self.client.delete_collection(name)
            except Exception:
                # Best-effort delete: ignore failures (e.g. collection absent).
                pass

        try:
            self.client.create_collection(
                collection_name=name,
                vectors_config=VectorParams(
                    # Vector size comes from the OBM wrapper's embedding dim.
                    size=self.obm.vector_dim,
                    distance=Distance.COSINE
                )
            )
            logger.info(f"Created collection '{name}' with dim={self.obm.vector_dim}")
            return True
        except Exception as e:
            # Creation failed — likely a pre-existing collection; report
            # via return value instead of raising.
            logger.warning(f"Collection might exist: {e}")
            return False
126
+
127
+ def collection_exists(self, name: str = None) -> bool:
128
+ """Check if collection exists."""
129
+ name = name or self.default_collection
130
+ try:
131
+ collections = self.client.get_collections().collections
132
+ return any(c.name == name for c in collections)
133
+ except Exception:
134
+ return False
135
+
136
    def get_collection_info(self, name: str = None) -> Dict[str, Any]:
        """Get collection statistics.

        Returns a dict with keys "name", "points_count", "status",
        "vector_size", or {"error": ...} if the lookup fails. The attribute
        probing below tolerates several qdrant-client versions whose info
        objects differ in shape.
        """
        name = name or self.default_collection
        try:
            info = self.client.get_collection(name)
            # Handle different qdrant-client versions: newer expose
            # points_count, older vectors_count.
            points_count = getattr(info, 'points_count', None) or getattr(info, 'vectors_count', 0)
            status = getattr(info.status, 'value', 'unknown') if hasattr(info, 'status') and info.status else 'unknown'

            # Try to get vector size from config; fall back to the OBM
            # wrapper's dimension if the config shape is unrecognized.
            vector_size = self.obm.vector_dim
            if hasattr(info, 'config') and info.config:
                if hasattr(info.config, 'params') and hasattr(info.config.params, 'vectors'):
                    vectors_config = info.config.params.vectors
                    if hasattr(vectors_config, 'size'):
                        # Single unnamed vector config.
                        vector_size = vectors_config.size
                    elif isinstance(vectors_config, dict) and '' in vectors_config:
                        # Named-vectors dict form with the default ('') entry.
                        vector_size = vectors_config[''].size

            return {
                "name": name,
                "points_count": points_count,
                "status": status,
                "vector_size": vector_size
            }
        except Exception as e:
            # Surface the failure in-band rather than raising.
            return {"error": str(e)}
163
+
164
def ingest(
    self,
    items: List[Dict[str, Any]],
    collection: str = None,
    batch_size: int = 100
) -> Dict[str, int]:
    """
    Ingest multiple items with automatic embedding.

    Args:
        items: List of dicts with 'content', 'modality', and optional metadata.
        collection: Target collection name (uses default if None).
        batch_size: Number of points per upsert batch.

    Returns:
        Statistics dict with 'success', 'failed' and 'skipped' counts.
    """
    collection = collection or self.default_collection

    if not self.collection_exists(collection):
        self.create_collection(collection)

    stats = {"success": 0, "failed": 0, "skipped": 0}
    points = []

    for item in items:
        content = item.get("content")
        modality = item.get("modality", item.get("type", "text"))

        # Items without content cannot be embedded.
        if not content:
            stats["skipped"] += 1
            continue

        try:
            embedding = self.obm.encode(content, modality)

            # Keep all metadata, plus the content itself, in the payload.
            payload = {k: v for k, v in item.items() if k != "content"}
            payload["content"] = content
            payload["modality"] = modality
            payload["content_hash"] = embedding.content_hash

            point_id = item.get("id", str(uuid.uuid4()))

            # Qdrant point IDs must be integers (or UUIDs). Derive a
            # *stable* integer from string IDs via uuid5 rather than the
            # builtin hash(): string hashing is salted per process, so
            # hash() would give the same item a different ID on every run
            # and re-ingestion would duplicate points instead of upserting.
            if isinstance(point_id, int):
                stable_id = point_id
            else:
                stable_id = uuid.uuid5(uuid.NAMESPACE_OID, str(point_id)).int % (10**8)

            points.append(PointStruct(
                id=stable_id,
                vector=embedding.vector.tolist(),
                payload=payload
            ))
            stats["success"] += 1

            # Flush a full batch to keep memory bounded on large ingests.
            if len(points) >= batch_size:
                self.client.upsert(collection_name=collection, points=points)
                points = []

        except Exception as e:
            logger.error(f"Failed to embed: {e}")
            stats["failed"] += 1

    # Upload any remaining partial batch.
    if points:
        self.client.upsert(collection_name=collection, points=points)

    logger.info(f"Ingestion complete: {stats}")
    return stats
229
+
230
def search(
    self,
    query: str,
    query_modality: str = "text",
    collection: str = None,
    limit: int = 10,
    filter_modality: str = None,
    filters: Dict[str, Any] = None
) -> List[SearchResult]:
    """
    Search for items similar to the query.

    Args:
        query: Query content (text, SMILES, or protein sequence).
        query_modality: Modality of the query.
        collection: Collection to search (default collection if None).
        limit: Maximum results to return.
        filter_modality: Only return results of this modality.
        filters: Additional exact-match payload filters.

    Returns:
        List of SearchResult objects ordered by similarity.
    """
    collection = collection or self.default_collection

    # Embed the query into the shared OBM vector space.
    embedding = self.obm.encode(query, query_modality)

    # Assemble exact-match payload constraints, if any were requested.
    constraints = []
    if filter_modality:
        constraints.append(
            FieldCondition(key="modality", match=MatchValue(value=filter_modality))
        )
    for key, value in (filters or {}).items():
        constraints.append(
            FieldCondition(key=key, match=MatchValue(value=value))
        )
    qdrant_filter = Filter(must=constraints) if constraints else None

    # Search using query_points (new API)
    response = self.client.query_points(
        collection_name=collection,
        query=embedding.vector.tolist(),
        limit=limit,
        query_filter=qdrant_filter
    )

    hits = []
    for point in response.points:
        payload = point.payload or {}
        hits.append(SearchResult(
            id=str(point.id),
            score=point.score,
            content=payload.get("content", ""),
            modality=payload.get("modality", "unknown"),
            payload=payload
        ))
    return hits
294
+
295
def cross_modal_search(
    self,
    query: str,
    query_modality: str,
    target_modality: str,
    collection: str = None,
    limit: int = 10
) -> List[SearchResult]:
    """
    Search across modalities, e.g. a text query returning molecules.

    This is a convenience wrapper: a cross-modal lookup is just a plain
    search whose results are restricted to the target modality.

    Args:
        query: Query content.
        query_modality: Modality of the query ('text', 'smiles', 'protein').
        target_modality: Modality of the desired results.
        collection: Collection to search (default collection if None).
        limit: Maximum results.

    Returns:
        SearchResult objects drawn only from the target modality.
    """
    return self.search(
        query=query,
        query_modality=query_modality,
        collection=collection,
        limit=limit,
        filter_modality=target_modality
    )
323
+
324
def get_neighbors_diversity(
    self,
    query: str,
    query_modality: str,
    collection: str = None,
    k: int = 10
) -> Dict[str, Any]:
    """
    Analyze the diversity of the query's top-k neighborhood.

    Returns summary statistics of the embedding neighborhood: mean/std/
    min/max of similarity scores, a per-modality count, and a diversity
    score (1 - std of scores; 0.0 when only one neighbor is found).
    """
    import numpy as np

    neighbors = self.search(query, query_modality, collection, limit=k)
    if not neighbors:
        return {"error": "No results found"}

    scores_arr = np.array([hit.score for hit in neighbors])

    # Tally how many neighbors fall into each modality.
    modality_counts = {}
    for hit in neighbors:
        modality_counts[hit.modality] = modality_counts.get(hit.modality, 0) + 1

    # Tight score clusters (low std) count as high diversity here.
    diversity = 1.0 - np.std(scores_arr) if len(scores_arr) > 1 else 0.0

    return {
        "k": k,
        "mean_similarity": float(np.mean(scores_arr)),
        "std_similarity": float(np.std(scores_arr)),
        "min_similarity": float(np.min(scores_arr)),
        "max_similarity": float(np.max(scores_arr)),
        "modality_distribution": modality_counts,
        "diversity_score": float(diversity)
    }
bioflow/ui/__init__.py ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow UI Package
3
+ ===================
4
+
5
+ Modern Streamlit-based interface for the BioFlow platform.
6
+
7
+ Pages:
8
+ - Home: Dashboard with key metrics and quick actions
9
+ - Discovery: Drug discovery pipeline interface
10
+ - Explorer: Vector space visualization
11
+ - Data: Data ingestion and management
12
+ - Settings: Configuration and preferences
13
+ """
14
+
15
+ __version__ = "2.0.0"
bioflow/ui/app.py ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow - AI-Powered Drug Discovery Platform
3
+ ==============================================
4
+ Main application entry point.
5
+ """
6
+
7
+ import streamlit as st
8
+ import sys
9
+ import os
10
+
11
+ # Setup path for imports
12
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
13
+
14
+ from bioflow.ui.config import get_css
15
+ from bioflow.ui.components import side_nav
16
+ from bioflow.ui.pages import home, discovery, explorer, data, settings
17
+
18
+
19
def main():
    """Application entry point: configure the page and route to the active view."""

    # Page config must be the first Streamlit call of the script run.
    st.set_page_config(
        page_title="BioFlow",
        page_icon="🧬",
        layout="wide",
        initial_sidebar_state="collapsed",
    )

    # Inject the shared stylesheet.
    st.markdown(get_css(), unsafe_allow_html=True)

    # Default landing page on first load.
    if "current_page" not in st.session_state:
        st.session_state.current_page = "home"

    # Two-column shell: navigation rail on the left, page content on the right.
    nav_col, content_col = st.columns([1, 3.6], gap="large")

    with nav_col:
        choice = side_nav(active_page=st.session_state.current_page)

    # A navigation change triggers an immediate rerun with the new page.
    if choice != st.session_state.current_page:
        st.session_state.current_page = choice
        st.rerun()

    with content_col:
        renderers = {
            "home": home.render,
            "discovery": discovery.render,
            "explorer": explorer.render,
            "data": data.render,
            "settings": settings.render,
        }
        renderers.get(st.session_state.current_page, home.render)()


if __name__ == "__main__":
    main()
bioflow/ui/components.py ADDED
@@ -0,0 +1,481 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow UI - Components Library
3
+ ================================
4
+ Reusable, modern UI components for Streamlit.
5
+ """
6
+
7
+ import streamlit as st
8
+ from typing import List, Dict, Any, Optional, Callable
9
+ import plotly.express as px
10
+ import plotly.graph_objects as go
11
+
12
+ # Import colors
13
+ import sys
14
+ import os
15
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
16
+ from bioflow.ui.config import COLORS
17
+
18
+
19
+ # === Navigation ===
20
+
21
def side_nav(active_page: str = "home") -> str:
    """Left vertical navigation list. Returns the selected page key.

    The brand header is rendered as static HTML; the actual selection is a
    st.radio with its label hidden, so Streamlit owns the widget state
    (key "nav_radio").

    Args:
        active_page: Page key to show as the current selection.

    Returns:
        The key ('home', 'discovery', ...) of the page the user picked.
    """

    # (key, icon, visible label) in display order; keys must match the
    # router map in app.py.
    nav_items = [
        ("home", "🏠", "Home"),
        ("discovery", "🔬", "Discovery"),
        ("explorer", "🧬", "Explorer"),
        ("data", "📊", "Data"),
        ("settings", "⚙️", "Settings"),
    ]

    st.markdown(
        f"""
        <div class="nav-rail">
            <div class="nav-brand">
                <div class="nav-logo">🧬</div>
                <div class="nav-title">Bio<span>Flow</span></div>
            </div>
            <div class="nav-section">Navigation</div>
        </div>
        """,
        unsafe_allow_html=True,
    )

    # Caption shown for each radio option: "icon label".
    label_map = {key: f"{icon} {label}" for key, icon, label in nav_items}
    options = [item[0] for item in nav_items]

    selected = st.radio(
        "Navigation",
        options=options,
        index=options.index(active_page),
        format_func=lambda x: label_map.get(x, x),
        key="nav_radio",
        label_visibility="collapsed",
    )

    return selected
58
+
59
+
60
+ # === Page Structure ===
61
+
62
def page_header(title: str, subtitle: str = "", icon: str = ""):
    """Render a page header with a title and optional icon/subtitle.

    Args:
        title: Main heading text.
        subtitle: Muted one-line description shown under the title.
        icon: Emoji/glyph rendered before the title.
    """
    # Icon and subtitle spans are emitted only when provided (nested
    # conditional f-strings collapse to '' otherwise).
    header_html = f"""
    <div style="margin-bottom: 2rem;">
        <h1 style="display: flex; align-items: center; gap: 0.75rem; margin: 0;">
            {f'<span style="font-size: 2rem;">{icon}</span>' if icon else ''}
            {title}
        </h1>
        {f'<p style="margin-top: 0.5rem; font-size: 1rem; color: {COLORS.text_muted};">{subtitle}</p>' if subtitle else ''}
    </div>
    """
    st.markdown(header_html, unsafe_allow_html=True)
74
+
75
+
76
def section_header(title: str, icon: str = "", link_text: str = "", link_action: Optional[Callable] = None):
    """Render a section header with an optional right-aligned action button.

    Args:
        title: Section title (also used in the button's widget key, so
            titles should be unique per page to avoid key collisions).
        icon: Optional glyph shown before the title.
        link_text: If set, a button with this label is rendered.
        link_action: Zero-arg callback invoked when the button is clicked.
    """
    col1, col2 = st.columns([4, 1])

    with col1:
        st.markdown(f"""
        <div class="section-title">
            {f'<span>{icon}</span>' if icon else ''}
            {title}
        </div>
        """, unsafe_allow_html=True)

    with col2:
        if link_text:
            if st.button(link_text, key=f"section_{title}", use_container_width=True):
                if link_action:
                    link_action()
94
+
95
def divider():
    """Render a horizontal visual divider (styled via the .divider CSS class)."""
    st.markdown('<div class="divider"></div>', unsafe_allow_html=True)
98
+
99
+
100
def spacer(height: str = "1rem"):
    """Insert vertical whitespace of the given CSS height (e.g. '0.75rem')."""
    st.markdown(f'<div style="height: {height};"></div>', unsafe_allow_html=True)
103
+
104
+
105
+ # === Metrics ===
106
+
107
def metric_card(
    value: str,
    label: str,
    icon: str = "📊",
    change: Optional[str] = None,
    change_type: str = "up",
    color: str = COLORS.primary
):
    """Render a single metric card with icon and optional trend indicator.

    Args:
        value: Headline value (already formatted).
        label: Caption under the value.
        icon: Glyph inside the tinted icon tile.
        change: Optional trend text (e.g. "12%"); omitted when None.
        change_type: "up" (green) or "down" (red) trend styling.
        color: CSS color for the icon tile; rgb(...) strings are rewritten
            to a 15%-alpha rgba background, hex colors get a "22" alpha
            suffix appended.
    """
    bg_color = color.replace(")", ", 0.15)").replace("rgb", "rgba") if "rgb" in color else f"{color}22"
    change_html = ""
    if change:
        arrow = "↑" if change_type == "up" else "↓"
        change_html = f'<div class="metric-change {change_type}">{arrow} {change}</div>'

    st.markdown(f"""
    <div class="metric">
        <div class="metric-icon" style="background: {bg_color}; color: {color};">
            {icon}
        </div>
        <div class="metric-value">{value}</div>
        <div class="metric-label">{label}</div>
        {change_html}
    </div>
    """, unsafe_allow_html=True)
132
+
133
+
134
def metric_row(metrics: List[Dict[str, Any]]):
    """Render a horizontal row of metric cards, one column per metric.

    Args:
        metrics: One kwargs-dict per card, forwarded verbatim to metric_card.
    """
    columns = st.columns(len(metrics))
    for idx, spec in enumerate(metrics):
        with columns[idx]:
            metric_card(**spec)
140
+
141
+
142
+ # === Quick Actions ===
143
+
144
def quick_action(icon: str, title: str, description: str, key: str) -> bool:
    """Render one quick-action button.

    Args:
        icon: Glyph prepended to the label.
        title: Button label text.
        description: Tooltip shown on hover.
        key: Unique widget key.

    Returns:
        True if the button was clicked during this script run.
    """
    return st.button(
        f"{icon} {title}",
        key=key,
        use_container_width=True,
        help=description
    )
153
+
154
+
155
def quick_actions_grid(actions: List[Dict[str, Any]], columns: int = 4) -> Optional[str]:
    """Render a grid of quick-action cards with a "Select" button each.

    Args:
        actions: Dicts with 'icon', 'title', 'key' and optional 'description'.
        columns: Number of grid columns; actions wrap row by row.

    Returns:
        The 'key' of the clicked action, or None if nothing was clicked
        this run. (At most one button can register a click per rerun.)
    """
    cols = st.columns(columns)
    clicked_key = None

    for i, action in enumerate(actions):
        with cols[i % columns]:
            st.markdown(f"""
            <div class="quick-action">
                <span class="quick-action-icon">{action['icon']}</span>
                <div class="quick-action-title">{action['title']}</div>
                <div class="quick-action-desc">{action.get('description', '')}</div>
            </div>
            """, unsafe_allow_html=True)

            # The HTML card above is display-only; this button carries the
            # actual click behavior, keyed by the action's own key.
            if st.button("Select", key=action['key'], use_container_width=True):
                clicked_key = action['key']

    return clicked_key
174
+
175
+
176
+ # === Pipeline Progress ===
177
+
178
def pipeline_progress(steps: List[Dict[str, Any]]):
    """Render a horizontal pipeline of steps with progress styling.

    Each step dict may carry:
        status: 'done' | 'active' | 'pending' (defaults to 'pending')
        icon:   glyph shown while the step is active
        name:   caption shown under the dot

    Done steps show a check mark, the active step shows its icon, and
    pending steps show their 1-based position number.
    """
    html = '<div class="pipeline">'

    for i, step in enumerate(steps):
        status = step.get('status', 'pending')
        icon = step.get('icon', str(i + 1))
        name = step.get('name', f'Step {i + 1}')

        # Pick the glyph for the dot based on step status.
        if status == 'done':
            display = '✓'
        elif status == 'active':
            display = icon
        else:
            display = str(i + 1)

        html += f'''
        <div class="step">
            <div class="step-dot {status}">{display}</div>
            <span class="step-name">{name}</span>
        </div>
        '''

        # Add connecting line (except after last step); the connector is
        # highlighted only when the step *before* it is done.
        if i < len(steps) - 1:
            line_status = 'done' if status == 'done' else ''
            html += f'<div class="step-line {line_status}"></div>'

    html += '</div>'
    st.markdown(html, unsafe_allow_html=True)
209
+
210
+
211
+ # === Results ===
212
+
213
def result_card(
    title: str,
    score: float,
    properties: Dict[str, str] = None,
    badges: List[str] = None,
    key: str = ""
) -> bool:
    """Render a result card with a color-coded score, properties and badges.

    Args:
        title: Card headline.
        score: Similarity/confidence in [0, 1]; >=0.8 renders green,
            >=0.5 amber, otherwise red.
        properties: Optional label -> value pairs shown under the title.
        badges: Optional list of badge labels.
        key: Widget key for the "View Details" button; when empty, no
            button is rendered and the function returns False.

    Returns:
        True if the "View Details" button was clicked this run.
    """

    # Map the numeric score onto a traffic-light CSS class.
    if score >= 0.8:
        score_class = "score-high"
    elif score >= 0.5:
        score_class = "score-med"
    else:
        score_class = "score-low"

    # Properties HTML (only rendered when properties were supplied).
    props_html = ""
    if properties:
        props_html = '<div style="display: flex; gap: 1rem; margin-top: 0.75rem; flex-wrap: wrap;">'
        for k, v in properties.items():
            props_html += f'''
            <div style="font-size: 0.8125rem;">
                <span style="color: {COLORS.text_muted};">{k}:</span>
                <span style="color: {COLORS.text_secondary}; margin-left: 0.25rem;">{v}</span>
            </div>
            '''
        props_html += '</div>'

    # Badges HTML (only rendered when badges were supplied).
    badges_html = ""
    if badges:
        badges_html = '<div style="display: flex; gap: 0.5rem; margin-top: 0.75rem;">'
        for b in badges:
            badges_html += f'<span class="badge badge-primary">{b}</span>'
        badges_html += '</div>'

    st.markdown(f"""
    <div class="result">
        <div style="display: flex; justify-content: space-between; align-items: flex-start;">
            <div style="font-weight: 600; color: {COLORS.text_primary};">{title}</div>
            <div class="{score_class}" style="font-size: 1.25rem; font-weight: 700;">{score:.1%}</div>
        </div>
        {props_html}
        {badges_html}
    </div>
    """, unsafe_allow_html=True)

    return st.button("View Details", key=key, use_container_width=True) if key else False
263
+
264
+
265
def results_list(results: List[Dict[str, Any]], empty_message: str = "No results found"):
    """Render a vertical list of result cards, or an empty-state placeholder.

    Args:
        results: Dicts with optional 'title', 'score', 'properties', 'badges'.
        empty_message: Description shown when the list is empty.
    """
    if not results:
        empty_state(icon="🔍", title="No Results", description=empty_message)
        return

    for idx, item in enumerate(results):
        result_card(
            title=item.get('title', f'Result {idx + 1}'),
            score=item.get('score', 0),
            properties=item.get('properties'),
            badges=item.get('badges'),
            key=f"result_{idx}"
        )
        spacer("0.75rem")
280
+
281
+
282
+ # === Charts ===
283
+
284
def bar_chart(data: Dict[str, float], title: str = "", height: int = 300):
    """Render a theme-styled Plotly bar chart.

    Args:
        data: Category label -> bar value mapping (insertion order kept).
        title: Optional chart title.
        height: Chart height in pixels.
    """
    fig = go.Figure(data=[
        go.Bar(
            x=list(data.keys()),
            y=list(data.values()),
            marker_color=COLORS.primary,
            marker_line_width=0,
        )
    ])

    # Transparent backgrounds so the chart blends with the dark app theme.
    fig.update_layout(
        title=title,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        font=dict(family="Inter", color=COLORS.text_secondary),
        height=height,
        margin=dict(l=40, r=20, t=40, b=40),
        xaxis=dict(
            showgrid=False,
            showline=True,
            linecolor=COLORS.border,
        ),
        yaxis=dict(
            showgrid=True,
            gridcolor=COLORS.border,
            showline=False,
        ),
    )

    st.plotly_chart(fig, use_container_width=True)
315
+
316
+
317
def scatter_chart(x: List, y: List, labels: List = None, title: str = "", height: int = 400):
    """Render a theme-styled Plotly scatter plot.

    Args:
        x: X coordinates.
        y: Y coordinates (same length as x).
        labels: Optional per-point hover labels; enables a custom hover
            template when provided.
        title: Optional chart title.
        height: Chart height in pixels.
    """
    fig = go.Figure(data=[
        go.Scatter(
            x=x,
            y=y,
            mode='markers',
            marker=dict(
                size=10,
                color=COLORS.primary,
                opacity=0.7,
            ),
            text=labels,
            hovertemplate='<b>%{text}</b><br>X: %{x}<br>Y: %{y}<extra></extra>' if labels else None,
        )
    ])

    # Transparent backgrounds so the chart blends with the dark app theme.
    fig.update_layout(
        title=title,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        font=dict(family="Inter", color=COLORS.text_secondary),
        height=height,
        margin=dict(l=40, r=20, t=40, b=40),
        xaxis=dict(
            showgrid=True,
            gridcolor=COLORS.border,
            showline=True,
            linecolor=COLORS.border,
        ),
        yaxis=dict(
            showgrid=True,
            gridcolor=COLORS.border,
            showline=True,
            linecolor=COLORS.border,
        ),
    )

    st.plotly_chart(fig, use_container_width=True)
356
+
357
+
358
def heatmap(data: List[List[float]], x_labels: List[str], y_labels: List[str], title: str = "", height: int = 400):
    """Render a theme-styled Plotly heatmap.

    Args:
        data: 2D matrix of cell values (rows correspond to y_labels).
        x_labels: Column labels.
        y_labels: Row labels.
        title: Optional chart title.
        height: Chart height in pixels.
    """
    fig = go.Figure(data=[
        go.Heatmap(
            z=data,
            x=x_labels,
            y=y_labels,
            # Three-stop colorscale: muted background -> primary -> cyan.
            colorscale=[
                [0, COLORS.bg_hover],
                [0.5, COLORS.primary],
                [1, COLORS.cyan],
            ],
        )
    ])

    fig.update_layout(
        title=title,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        font=dict(family="Inter", color=COLORS.text_secondary),
        height=height,
        margin=dict(l=80, r=20, t=40, b=60),
    )

    st.plotly_chart(fig, use_container_width=True)
383
+
384
+
385
+ # === Data Display ===
386
+
387
def data_table(data: List[Dict], columns: List[str] = None):
    """Show records in an index-free Streamlit dataframe.

    Args:
        data: List of row dicts.
        columns: Optional subset/ordering of columns to display.
    """
    import pandas as pd

    frame = pd.DataFrame(data)
    if columns:
        # Restrict and reorder to the requested columns.
        frame = frame[columns]
    st.dataframe(frame, use_container_width=True, hide_index=True)
394
+
395
+
396
+ # === States ===
397
+
398
def empty_state(icon: str = "📭", title: str = "No Data", description: str = ""):
    """Render an empty-state placeholder (icon, title, description)."""
    st.markdown(f"""
    <div class="empty">
        <div class="empty-icon">{icon}</div>
        <div class="empty-title">{title}</div>
        <div class="empty-desc">{description}</div>
    </div>
    """, unsafe_allow_html=True)
407
+
408
+
409
def loading_state(message: str = "Loading..."):
    """Render a loading placeholder with a CSS spinner and message."""
    st.markdown(f"""
    <div class="loading">
        <div class="spinner"></div>
        <div class="loading-text">{message}</div>
    </div>
    """, unsafe_allow_html=True)
417
+
418
+
419
+ # === Molecule Display ===
420
+
421
def molecule_2d(smiles: str, size: int = 200):
    """Display a 2D molecule depiction rendered from a SMILES string.

    Uses RDKit to draw the molecule and embeds the PNG inline as a
    base64 data URI. If RDKit is not installed, falls back to showing
    the raw SMILES string; an unparsable SMILES shows a warning.

    Args:
        smiles: SMILES string to depict.
        size: Image width/height in pixels (square).
    """
    try:
        # RDKit is optional; imported lazily so the UI loads without it.
        from rdkit import Chem
        from rdkit.Chem import Draw
        import base64
        from io import BytesIO

        mol = Chem.MolFromSmiles(smiles)
        if mol:
            img = Draw.MolToImage(mol, size=(size, size))
            buffered = BytesIO()
            img.save(buffered, format="PNG")
            img_str = base64.b64encode(buffered.getvalue()).decode()

            st.markdown(f"""
            <div class="mol-container">
                <img src="data:image/png;base64,{img_str}" alt="Molecule" style="max-width: 100%; height: auto;">
            </div>
            """, unsafe_allow_html=True)
        else:
            st.warning("Invalid SMILES")
    except ImportError:
        st.info(f"SMILES: `{smiles}`")
445
+
446
+
447
+ # === Evidence & Links ===
448
+
449
def evidence_row(items: List[Dict[str, str]]):
    """Render a row of evidence/source links opening in new tabs.

    Args:
        items: Dicts with optional 'icon', 'label' and 'url' keys.
    """
    html = '<div style="display: flex; gap: 0.5rem; flex-wrap: wrap; margin-top: 0.75rem;">'
    for item in items:
        icon = item.get('icon', '📄')
        label = item.get('label', 'Source')
        url = item.get('url', '#')
        html += f'''
        <a href="{url}" target="_blank" class="evidence">
            <span>{icon}</span>
            <span>{label}</span>
        </a>
        '''
    html += '</div>'
    st.markdown(html, unsafe_allow_html=True)
464
+
465
+
466
+ # === Badges ===
467
+
468
def badge(text: str, variant: str = "primary"):
    """Render an inline badge; variant maps to a .badge-<variant> CSS class."""
    st.markdown(f'<span class="badge badge-{variant}">{text}</span>', unsafe_allow_html=True)
471
+
472
+
473
def badge_row(badges: List[Dict[str, str]]):
    """Render a wrapping row of badges.

    Args:
        badges: Dicts with 'text' and optional 'variant' (default 'primary').
    """
    html = '<div style="display: flex; gap: 0.5rem; flex-wrap: wrap;">'
    for b in badges:
        text = b.get('text', '')
        variant = b.get('variant', 'primary')
        html += f'<span class="badge badge-{variant}">{text}</span>'
    html += '</div>'
    st.markdown(html, unsafe_allow_html=True)
bioflow/ui/config.py ADDED
@@ -0,0 +1,583 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow UI - Modern Design System
3
+ ==================================
4
+ Clean, minimal, and highly usable interface.
5
+ """
6
+
7
+ from dataclasses import dataclass
8
+
9
+
10
@dataclass
class Colors:
    """Color palette - Modern dark theme.

    Single source of truth for every color used by the UI. Values are CSS
    color strings (hex or rgba) so they can be interpolated directly into
    inline styles and the generated stylesheet.
    """
    # Primary brand violet, its hover shade, and a translucent variant
    primary: str = "#8B5CF6"
    primary_hover: str = "#A78BFA"
    primary_muted: str = "rgba(139, 92, 246, 0.15)"

    # Accents
    cyan: str = "#22D3EE"
    emerald: str = "#34D399"
    amber: str = "#FBBF24"
    rose: str = "#FB7185"

    # Backgrounds, darkest app shell to progressively lighter surfaces
    bg_app: str = "#0C0E14"
    bg_surface: str = "#14161E"
    bg_elevated: str = "#1C1F2B"
    bg_hover: str = "#252836"

    # Text
    text_primary: str = "#F8FAFC"
    text_secondary: str = "#A1A7BB"
    text_muted: str = "#6B7280"

    # Borders
    border: str = "#2A2D3A"
    border_hover: str = "#3F4354"

    # Status
    success: str = "#10B981"
    warning: str = "#F59E0B"
    error: str = "#EF4444"
    info: str = "#3B82F6"


# Shared palette instance imported throughout the UI package.
COLORS = Colors()
47
+
48
+
49
+ def get_css() -> str:
50
+ """Minimalist, professional CSS using string concatenation to avoid f-string issues."""
51
+
52
+ css = """
53
+ <style>
54
+ @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap');
55
+
56
+ :root {
57
+ --primary: """ + COLORS.primary + """;
58
+ --bg-app: """ + COLORS.bg_app + """;
59
+ --bg-surface: """ + COLORS.bg_surface + """;
60
+ --text: """ + COLORS.text_primary + """;
61
+ --text-muted: """ + COLORS.text_muted + """;
62
+ --border: """ + COLORS.border + """;
63
+ --radius: 12px;
64
+ --transition: 150ms ease;
65
+ }
66
+
67
+ .stApp {
68
+ background: """ + COLORS.bg_app + """;
69
+ font-family: 'Inter', sans-serif;
70
+ }
71
+
72
+ #MainMenu, footer, header { visibility: hidden; }
73
+ .stDeployButton { display: none; }
74
+
75
+ ::-webkit-scrollbar { width: 6px; height: 6px; }
76
+ ::-webkit-scrollbar-track { background: transparent; }
77
+ ::-webkit-scrollbar-thumb { background: """ + COLORS.border + """; border-radius: 3px; }
78
+ ::-webkit-scrollbar-thumb:hover { background: """ + COLORS.border_hover + """; }
79
+
80
+ section[data-testid="stSidebar"] { display: none !important; }
81
+
82
+ h1, h2, h3 {
83
+ font-weight: 600;
84
+ color: """ + COLORS.text_primary + """;
85
+ letter-spacing: -0.025em;
86
+ }
87
+
88
+ h1 { font-size: 1.875rem; margin-bottom: 0.5rem; }
89
+ h2 { font-size: 1.5rem; }
90
+ h3 { font-size: 1.125rem; }
91
+
92
+ p { color: """ + COLORS.text_secondary + """; line-height: 1.6; }
93
+
94
+ .card {
95
+ background: """ + COLORS.bg_surface + """;
96
+ border: 1px solid """ + COLORS.border + """;
97
+ border-radius: var(--radius);
98
+ padding: 1.25rem;
99
+ }
100
+
101
+ .metric {
102
+ background: """ + COLORS.bg_surface + """;
103
+ border: 1px solid """ + COLORS.border + """;
104
+ border-radius: var(--radius);
105
+ padding: 1.25rem;
106
+ transition: border-color var(--transition);
107
+ }
108
+
109
+ .metric:hover { border-color: """ + COLORS.primary + """; }
110
+
111
+ .metric-icon {
112
+ width: 44px;
113
+ height: 44px;
114
+ border-radius: 10px;
115
+ display: flex;
116
+ align-items: center;
117
+ justify-content: center;
118
+ font-size: 1.375rem;
119
+ margin-bottom: 1rem;
120
+ }
121
+
122
+ .metric-value {
123
+ font-size: 2rem;
124
+ font-weight: 700;
125
+ color: """ + COLORS.text_primary + """;
126
+ line-height: 1;
127
+ }
128
+
129
+ .metric-label {
130
+ font-size: 0.875rem;
131
+ color: """ + COLORS.text_muted + """;
132
+ margin-top: 0.375rem;
133
+ }
134
+
135
+ .metric-change {
136
+ display: inline-flex;
137
+ align-items: center;
138
+ font-size: 0.75rem;
139
+ font-weight: 500;
140
+ padding: 0.25rem 0.5rem;
141
+ border-radius: 6px;
142
+ margin-top: 0.5rem;
143
+ }
144
+
145
+ .metric-change.up { background: rgba(16, 185, 129, 0.15); color: """ + COLORS.success + """; }
146
+ .metric-change.down { background: rgba(239, 68, 68, 0.15); color: """ + COLORS.error + """; }
147
+
148
+ .stButton > button {
149
+ font-family: 'Inter', sans-serif;
150
+ font-weight: 500;
151
+ font-size: 0.875rem;
152
+ border-radius: 8px;
153
+ padding: 0.625rem 1.25rem;
154
+ transition: all var(--transition);
155
+ border: none;
156
+ }
157
+
158
+ .stTextInput input,
159
+ .stTextArea textarea,
160
+ .stSelectbox > div > div {
161
+ background: """ + COLORS.bg_app + """ !important;
162
+ border: 1px solid """ + COLORS.border + """ !important;
163
+ border-radius: 10px !important;
164
+ color: """ + COLORS.text_primary + """ !important;
165
+ font-family: 'Inter', sans-serif !important;
166
+ }
167
+
168
+ .stTextInput input:focus,
169
+ .stTextArea textarea:focus {
170
+ border-color: """ + COLORS.primary + """ !important;
171
+ box-shadow: 0 0 0 3px """ + COLORS.primary_muted + """ !important;
172
+ }
173
+
174
+ .stTabs [data-baseweb="tab-list"] {
175
+ gap: 0;
176
+ background: """ + COLORS.bg_surface + """;
177
+ border-radius: 10px;
178
+ padding: 4px;
179
+ border: 1px solid """ + COLORS.border + """;
180
+ }
181
+
182
+ .stTabs [data-baseweb="tab"] {
183
+ height: auto;
184
+ padding: 0.625rem 1.25rem;
185
+ border-radius: 8px;
186
+ font-weight: 500;
187
+ font-size: 0.875rem;
188
+ color: """ + COLORS.text_muted + """;
189
+ background: transparent;
190
+ }
191
+
192
+ .stTabs [aria-selected="true"] {
193
+ background: """ + COLORS.primary + """ !important;
194
+ color: white !important;
195
+ }
196
+
197
+ .stTabs [data-baseweb="tab-highlight"],
198
+ .stTabs [data-baseweb="tab-border"] { display: none; }
199
+
200
+ .pipeline {
201
+ display: flex;
202
+ align-items: center;
203
+ background: """ + COLORS.bg_surface + """;
204
+ border: 1px solid """ + COLORS.border + """;
205
+ border-radius: var(--radius);
206
+ padding: 1.5rem;
207
+ gap: 0;
208
+ }
209
+
210
+ .step {
211
+ display: flex;
212
+ flex-direction: column;
213
+ align-items: center;
214
+ gap: 0.5rem;
215
+ flex: 1;
216
+ }
217
+
218
+ .step-dot {
219
+ width: 44px;
220
+ height: 44px;
221
+ border-radius: 50%;
222
+ display: flex;
223
+ align-items: center;
224
+ justify-content: center;
225
+ font-size: 1.125rem;
226
+ font-weight: 600;
227
+ transition: all var(--transition);
228
+ }
229
+
230
+ .step-dot.pending {
231
+ background: """ + COLORS.bg_hover + """;
232
+ color: """ + COLORS.text_muted + """;
233
+ border: 2px dashed """ + COLORS.border_hover + """;
234
+ }
235
+
236
+ .step-dot.active {
237
+ background: """ + COLORS.primary + """;
238
+ color: white;
239
+ box-shadow: 0 0 24px rgba(139, 92, 246, 0.5);
240
+ }
241
+
242
+ .step-dot.done {
243
+ background: """ + COLORS.emerald + """;
244
+ color: white;
245
+ }
246
+
247
+ .step-name {
248
+ font-size: 0.75rem;
249
+ font-weight: 500;
250
+ color: """ + COLORS.text_muted + """;
251
+ }
252
+
253
+ .step-line {
254
+ flex: 0.6;
255
+ height: 2px;
256
+ background: """ + COLORS.border + """;
257
+ }
258
+
259
+ .step-line.done { background: """ + COLORS.emerald + """; }
260
+
261
+ .result {
262
+ background: """ + COLORS.bg_surface + """;
263
+ border: 1px solid """ + COLORS.border + """;
264
+ border-radius: var(--radius);
265
+ padding: 1.25rem;
266
+ transition: all var(--transition);
267
+ cursor: pointer;
268
+ }
269
+
270
+ .result:hover {
271
+ border-color: """ + COLORS.primary + """;
272
+ transform: translateY(-2px);
273
+ box-shadow: 0 8px 24px rgba(0, 0, 0, 0.2);
274
+ }
275
+
276
+ .score-high { color: """ + COLORS.emerald + """; }
277
+ .score-med { color: """ + COLORS.amber + """; }
278
+ .score-low { color: """ + COLORS.rose + """; }
279
+
280
+ .badge {
281
+ display: inline-flex;
282
+ align-items: center;
283
+ padding: 0.25rem 0.625rem;
284
+ border-radius: 6px;
285
+ font-size: 0.6875rem;
286
+ font-weight: 600;
287
+ text-transform: uppercase;
288
+ }
289
+
290
+ .badge-primary { background: """ + COLORS.primary_muted + """; color: """ + COLORS.primary + """; }
291
+ .badge-success { background: rgba(16, 185, 129, 0.15); color: """ + COLORS.success + """; }
292
+ .badge-warning { background: rgba(245, 158, 11, 0.15); color: """ + COLORS.warning + """; }
293
+ .badge-error { background: rgba(239, 68, 68, 0.15); color: """ + COLORS.error + """; }
294
+
295
+ .quick-action {
296
+ background: """ + COLORS.bg_surface + """;
297
+ border: 1px solid """ + COLORS.border + """;
298
+ border-radius: var(--radius);
299
+ padding: 1.5rem;
300
+ text-align: center;
301
+ cursor: pointer;
302
+ transition: all var(--transition);
303
+ }
304
+
305
+ .quick-action:hover {
306
+ border-color: """ + COLORS.primary + """;
307
+ transform: translateY(-4px);
308
+ box-shadow: 0 12px 32px rgba(0, 0, 0, 0.25);
309
+ }
310
+
311
+ .quick-action-icon {
312
+ font-size: 2.5rem;
313
+ margin-bottom: 0.75rem;
314
+ display: block;
315
+ }
316
+
317
+ .quick-action-title {
318
+ font-size: 0.9375rem;
319
+ font-weight: 600;
320
+ color: """ + COLORS.text_primary + """;
321
+ }
322
+
323
+ .quick-action-desc {
324
+ font-size: 0.8125rem;
325
+ color: """ + COLORS.text_muted + """;
326
+ margin-top: 0.25rem;
327
+ }
328
+
329
+ .section-header {
330
+ display: flex;
331
+ align-items: center;
332
+ justify-content: space-between;
333
+ margin-bottom: 1rem;
334
+ }
335
+
336
+ .section-title {
337
+ font-size: 1rem;
338
+ font-weight: 600;
339
+ color: """ + COLORS.text_primary + """;
340
+ display: flex;
341
+ align-items: center;
342
+ gap: 0.5rem;
343
+ }
344
+
345
+ .section-link {
346
+ font-size: 0.8125rem;
347
+ color: """ + COLORS.primary + """;
348
+ cursor: pointer;
349
+ }
350
+
351
+ .section-link:hover { text-decoration: underline; }
352
+
353
+ .empty {
354
+ display: flex;
355
+ flex-direction: column;
356
+ align-items: center;
357
+ justify-content: center;
358
+ padding: 4rem 2rem;
359
+ text-align: center;
360
+ }
361
+
362
+ .empty-icon { font-size: 3.5rem; margin-bottom: 1rem; opacity: 0.4; }
363
+ .empty-title { font-size: 1.125rem; font-weight: 600; color: """ + COLORS.text_primary + """; }
364
+ .empty-desc { font-size: 0.9375rem; color: """ + COLORS.text_muted + """; max-width: 320px; margin-top: 0.5rem; }
365
+
366
+ .loading {
367
+ display: flex;
368
+ flex-direction: column;
369
+ align-items: center;
370
+ padding: 3rem;
371
+ }
372
+
373
+ .spinner {
374
+ width: 40px;
375
+ height: 40px;
376
+ border: 3px solid """ + COLORS.border + """;
377
+ border-top-color: """ + COLORS.primary + """;
378
+ border-radius: 50%;
379
+ animation: spin 0.8s linear infinite;
380
+ }
381
+
382
+ @keyframes spin { to { transform: rotate(360deg); } }
383
+
384
+ .loading-text {
385
+ margin-top: 1rem;
386
+ color: """ + COLORS.text_muted + """;
387
+ font-size: 0.875rem;
388
+ }
389
+
390
+ .stProgress > div > div > div {
391
+ background: linear-gradient(90deg, """ + COLORS.primary + """ 0%, """ + COLORS.cyan + """ 100%);
392
+ border-radius: 4px;
393
+ }
394
+
395
+ .stProgress > div > div {
396
+ background: """ + COLORS.bg_hover + """;
397
+ border-radius: 4px;
398
+ }
399
+
400
+ .divider {
401
+ height: 1px;
402
+ background: """ + COLORS.border + """;
403
+ margin: 1.5rem 0;
404
+ }
405
+
406
+ .mol-container {
407
+ background: white;
408
+ border-radius: 10px;
409
+ padding: 0.75rem;
410
+ display: flex;
411
+ align-items: center;
412
+ justify-content: center;
413
+ }
414
+
415
+ .evidence {
416
+ display: inline-flex;
417
+ align-items: center;
418
+ gap: 0.375rem;
419
+ padding: 0.5rem 0.75rem;
420
+ background: """ + COLORS.bg_app + """;
421
+ border: 1px solid """ + COLORS.border + """;
422
+ border-radius: 8px;
423
+ font-size: 0.8125rem;
424
+ color: """ + COLORS.text_secondary + """;
425
+ transition: all var(--transition);
426
+ text-decoration: none;
427
+ }
428
+
429
+ .evidence:hover {
430
+ border-color: """ + COLORS.primary + """;
431
+ color: """ + COLORS.primary + """;
432
+ }
433
+
434
+ .stAlert { border-radius: 10px; border: none; }
435
+
436
+ .stDataFrame {
437
+ border-radius: var(--radius);
438
+ overflow: hidden;
439
+ border: 1px solid """ + COLORS.border + """;
440
+ }
441
+
442
+ .block-container {
443
+ padding-top: 1.25rem;
444
+ }
445
+
446
+ .nav-rail {
447
+ position: sticky;
448
+ top: 1rem;
449
+ display: flex;
450
+ flex-direction: column;
451
+ gap: 0.75rem;
452
+ padding: 1rem;
453
+ background: """ + COLORS.bg_surface + """;
454
+ border: 1px solid """ + COLORS.border + """;
455
+ border-radius: 16px;
456
+ margin-bottom: 1rem;
457
+ }
458
+
459
+ .nav-brand {
460
+ display: flex;
461
+ align-items: center;
462
+ gap: 0.75rem;
463
+ padding-bottom: 0.5rem;
464
+ border-bottom: 1px solid """ + COLORS.border + """;
465
+ }
466
+
467
+ .nav-logo { font-size: 1.5rem; }
468
+
469
+ .nav-title {
470
+ font-size: 1.1rem;
471
+ font-weight: 700;
472
+ color: """ + COLORS.text_primary + """;
473
+ }
474
+
475
+ .nav-title span {
476
+ background: linear-gradient(135deg, """ + COLORS.primary + """ 0%, """ + COLORS.cyan + """ 100%);
477
+ -webkit-background-clip: text;
478
+ -webkit-text-fill-color: transparent;
479
+ }
480
+
481
+ .nav-section {
482
+ font-size: 0.75rem;
483
+ text-transform: uppercase;
484
+ letter-spacing: 0.08em;
485
+ color: """ + COLORS.text_muted + """;
486
+ }
487
+
488
+ div[data-testid="stRadio"] {
489
+ background: """ + COLORS.bg_surface + """;
490
+ border: 1px solid """ + COLORS.border + """;
491
+ border-radius: 16px;
492
+ padding: 0.75rem;
493
+ }
494
+
495
+ div[data-testid="stRadio"] div[role="radiogroup"] {
496
+ display: flex;
497
+ flex-direction: column;
498
+ gap: 0.5rem;
499
+ margin-top: 0.25rem;
500
+ }
501
+
502
+ div[data-testid="stRadio"] input {
503
+ display: none !important;
504
+ }
505
+
506
+ div[data-testid="stRadio"] label {
507
+ background: """ + COLORS.bg_app + """;
508
+ border: 1px solid """ + COLORS.border + """;
509
+ border-radius: 12px;
510
+ padding: 0.65rem 0.9rem;
511
+ font-weight: 500;
512
+ color: """ + COLORS.text_secondary + """;
513
+ transition: all var(--transition);
514
+ margin: 0 !important;
515
+ }
516
+
517
+ div[data-testid="stRadio"] label:hover {
518
+ border-color: """ + COLORS.primary + """;
519
+ color: """ + COLORS.text_primary + """;
520
+ }
521
+
522
+ div[data-testid="stRadio"] label:has(input:checked) {
523
+ background: """ + COLORS.primary + """;
524
+ border-color: """ + COLORS.primary + """;
525
+ color: white;
526
+ box-shadow: 0 8px 20px rgba(139, 92, 246, 0.25);
527
+ }
528
+
529
+ .hero {
530
+ position: relative;
531
+ background: linear-gradient(135deg, rgba(139, 92, 246, 0.12) 0%, rgba(34, 211, 238, 0.08) 100%);
532
+ border: 1px solid """ + COLORS.border + """;
533
+ border-radius: 20px;
534
+ padding: 2.75rem;
535
+ overflow: hidden;
536
+ }
537
+
538
+ .hero-badge {
539
+ display: inline-flex;
540
+ align-items: center;
541
+ gap: 0.5rem;
542
+ padding: 0.35rem 0.75rem;
543
+ border-radius: 999px;
544
+ background: """ + COLORS.primary_muted + """;
545
+ color: """ + COLORS.primary + """;
546
+ font-size: 0.75rem;
547
+ font-weight: 600;
548
+ text-transform: uppercase;
549
+ letter-spacing: 0.08em;
550
+ }
551
+
552
+ .hero-title {
553
+ font-size: 2.25rem;
554
+ font-weight: 700;
555
+ color: """ + COLORS.text_primary + """;
556
+ margin-top: 1rem;
557
+ line-height: 1.1;
558
+ }
559
+
560
+ .hero-subtitle {
561
+ font-size: 1rem;
562
+ color: """ + COLORS.text_muted + """;
563
+ margin-top: 0.75rem;
564
+ max-width: 560px;
565
+ }
566
+
567
+ .hero-actions {
568
+ display: flex;
569
+ gap: 0.75rem;
570
+ margin-top: 1.5rem;
571
+ flex-wrap: wrap;
572
+ }
573
+
574
+ .hero-card {
575
+ background: """ + COLORS.bg_surface + """;
576
+ border: 1px solid """ + COLORS.border + """;
577
+ border-radius: 16px;
578
+ padding: 1.5rem;
579
+ }
580
+ </style>
581
+ """
582
+
583
+ return css
bioflow/ui/pages/__init__.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
"""Page exports."""

# Eagerly import every page module so the router can do
# `from bioflow.ui.pages import home` (etc.) and dispatch by name.
from bioflow.ui.pages import home, discovery, explorer, data, settings

# Explicit public API of the pages package.
__all__ = ["home", "discovery", "explorer", "data", "settings"]
bioflow/ui/pages/data.py ADDED
@@ -0,0 +1,163 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
"""
BioFlow - Data Page
===================
Data management and upload.
"""

import streamlit as st
import sys
import os
import pandas as pd
import numpy as np

# Make the repository root importable when this page module is loaded directly.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))

from bioflow.ui.components import (
    page_header, section_header, divider, spacer,
    metric_card, data_table, empty_state
)
from bioflow.ui.config import COLORS


def render():
    """Render the data-management page.

    Three tabs: a dataset listing, an upload form, and a list of
    processing operations. The metrics and dataset entries shown here
    are hard-coded demo values — presumably to be replaced by a real
    dataset registry lookup (TODO confirm with the data backend).
    """

    page_header("Data Management", "Upload, manage, and organize your datasets", "📊")

    # Stats row — static placeholder metrics.
    cols = st.columns(4)

    with cols[0]:
        metric_card("5", "Datasets", "📁", color=COLORS.primary)
    with cols[1]:
        metric_card("24.5K", "Molecules", "🧪", color=COLORS.cyan)
    with cols[2]:
        metric_card("1.2K", "Proteins", "🧬", color=COLORS.emerald)
    with cols[3]:
        metric_card("156 MB", "Storage Used", "💾", color=COLORS.amber)

    spacer("2rem")

    # Tabs
    tabs = st.tabs(["📁 Datasets", "📤 Upload", "🔧 Processing"])

    with tabs[0]:
        section_header("Your Datasets", "📁")

        # Dataset list (sample data).
        datasets = [
            {"name": "DrugBank Compounds", "type": "Molecules", "count": "12,450", "size": "45.2 MB", "updated": "2024-01-15"},
            {"name": "ChEMBL Kinase Inhibitors", "type": "Molecules", "count": "8,234", "size": "32.1 MB", "updated": "2024-01-10"},
            {"name": "Custom Protein Targets", "type": "Proteins", "count": "1,245", "size": "78.5 MB", "updated": "2024-01-08"},
        ]

        for ds in datasets:
            # One styled card per dataset; HTML is rendered raw, so only
            # trusted (hard-coded) values may be interpolated here.
            st.markdown(f"""
            <div class="card" style="margin-bottom: 0.75rem;">
                <div style="display: flex; justify-content: space-between; align-items: center;">
                    <div>
                        <div style="font-weight: 600; color: {COLORS.text_primary};">{ds["name"]}</div>
                        <div style="display: flex; gap: 1.5rem; margin-top: 0.5rem;">
                            <span style="font-size: 0.8125rem; color: {COLORS.text_muted};">
                                <span style="color: {COLORS.primary};">●</span> {ds["type"]}
                            </span>
                            <span style="font-size: 0.8125rem; color: {COLORS.text_muted};">{ds["count"]} items</span>
                            <span style="font-size: 0.8125rem; color: {COLORS.text_muted};">{ds["size"]}</span>
                            <span style="font-size: 0.8125rem; color: {COLORS.text_muted};">Updated: {ds["updated"]}</span>
                        </div>
                    </div>
                    <div style="display: flex; gap: 0.5rem;">
                        <span class="badge badge-primary">{ds["type"]}</span>
                    </div>
                </div>
            </div>
            """, unsafe_allow_html=True)

            # Action buttons — keys are derived from the dataset name, which
            # must stay unique within this list for Streamlit widget identity.
            btn_cols = st.columns([1, 1, 1, 4])
            with btn_cols[0]:
                st.button("View", key=f"view_{ds['name']}", use_container_width=True)
            with btn_cols[1]:
                st.button("Export", key=f"export_{ds['name']}", use_container_width=True)
            with btn_cols[2]:
                st.button("Delete", key=f"delete_{ds['name']}", use_container_width=True)

            spacer("0.5rem")

    with tabs[1]:
        section_header("Upload New Data", "📤")

        # Decorative drop-zone; the actual upload is handled by the
        # st.file_uploader widget below it.
        st.markdown(f"""
        <div style="
            border: 2px dashed {COLORS.border};
            border-radius: 16px;
            padding: 3rem;
            text-align: center;
            background: {COLORS.bg_surface};
        ">
            <div style="font-size: 3rem; margin-bottom: 1rem;">📁</div>
            <div style="font-size: 1.125rem; font-weight: 600; color: {COLORS.text_primary};">
                Drag & drop files here
            </div>
            <div style="font-size: 0.875rem; color: {COLORS.text_muted}; margin-top: 0.5rem;">
                or click to browse
            </div>
            <div style="font-size: 0.75rem; color: {COLORS.text_muted}; margin-top: 1rem;">
                Supports: CSV, SDF, FASTA, PDB, JSON
            </div>
        </div>
        """, unsafe_allow_html=True)

        uploaded_file = st.file_uploader(
            "Upload file",
            type=["csv", "sdf", "fasta", "pdb", "json"],
            label_visibility="collapsed"
        )

        if uploaded_file:
            st.success(f"✓ File uploaded: {uploaded_file.name}")

            col1, col2 = st.columns(2)
            with col1:
                # splitext strips only the final extension, so multi-dot
                # names survive intact ("my.data.csv" -> "my.data");
                # a bare split('.')[0] would truncate them to "my".
                dataset_name = st.text_input("Dataset Name", value=os.path.splitext(uploaded_file.name)[0])
            with col2:
                data_type = st.selectbox("Data Type", ["Molecules", "Proteins", "Text"])

            if st.button("Process & Import", type="primary", use_container_width=True):
                with st.spinner("Processing..."):
                    # Simulated processing delay; real import logic TBD.
                    import time
                    time.sleep(2)
                st.success("✓ Dataset imported successfully!")

    with tabs[2]:
        section_header("Data Processing", "🔧")

        st.markdown(f"""
        <div class="card">
            <div style="font-weight: 600; color: {COLORS.text_primary}; margin-bottom: 0.75rem;">
                Available Operations
            </div>
        </div>
        """, unsafe_allow_html=True)

        operations = [
            {"icon": "🧹", "name": "Clean & Validate", "desc": "Remove duplicates, fix invalid structures"},
            {"icon": "🔢", "name": "Compute Descriptors", "desc": "Calculate molecular properties and fingerprints"},
            {"icon": "🧠", "name": "Generate Embeddings", "desc": "Create vector representations using AI models"},
            {"icon": "🔗", "name": "Merge Datasets", "desc": "Combine multiple datasets with deduplication"},
        ]

        for op in operations:
            st.markdown(f"""
            <div class="quick-action" style="margin-bottom: 0.75rem; text-align: left;">
                <div style="display: flex; align-items: center; gap: 1rem;">
                    <span style="font-size: 1.5rem;">{op["icon"]}</span>
                    <div>
                        <div style="font-weight: 600; color: {COLORS.text_primary};">{op["name"]}</div>
                        <div style="font-size: 0.8125rem; color: {COLORS.text_muted};">{op["desc"]}</div>
                    </div>
                </div>
            </div>
            """, unsafe_allow_html=True)
            st.button(f"Run {op['name']}", key=f"op_{op['name']}", use_container_width=True)
bioflow/ui/pages/discovery.py ADDED
@@ -0,0 +1,165 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
"""
BioFlow - Discovery Page
========================
Drug discovery pipeline interface.
"""

import streamlit as st
import sys
import os

# Make the repository root importable when this page module is loaded directly.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))

from bioflow.ui.components import (
    page_header, section_header, divider, spacer,
    pipeline_progress, bar_chart, empty_state, loading_state
)
from bioflow.ui.config import COLORS


def render():
    """Render the discovery page.

    Layout: a query input row, a five-step pipeline status indicator
    driven by ``st.session_state.discovery_step`` (0..4), and a results
    area that shows demo candidates once a search has been run.
    """

    page_header("Drug Discovery", "Search for drug candidates with AI-powered analysis", "🔬")

    # Query Input Section — card is purely decorative; widgets follow.
    st.markdown(f"""
    <div class="card" style="margin-bottom: 1.5rem;">
        <div style="font-size: 0.875rem; font-weight: 600; color: {COLORS.text_primary}; margin-bottom: 0.75rem;">
            Search Query
        </div>
    </div>
    """, unsafe_allow_html=True)

    col1, col2 = st.columns([3, 1])

    with col1:
        query = st.text_area(
            "Query",
            placeholder="Enter a natural language query, SMILES string, or FASTA sequence...",
            height=100,
            label_visibility="collapsed"
        )

    with col2:
        # NOTE(review): these two selections are not read anywhere below —
        # presumably intended to parameterize the search; confirm.
        st.selectbox("Search Type", ["Similarity", "Binding Affinity", "Properties"], label_visibility="collapsed")
        st.selectbox("Database", ["All", "DrugBank", "ChEMBL", "ZINC"], label_visibility="collapsed")
        search_clicked = st.button("🔍 Search", type="primary", use_container_width=True)

    spacer("1.5rem")

    # Pipeline Progress
    section_header("Pipeline Status", "🔄")

    # discovery_step persists across reruns; 0 = idle, 4 = results ready.
    if "discovery_step" not in st.session_state:
        st.session_state.discovery_step = 0

    # Each step's status is derived from discovery_step:
    # earlier than current -> done, equal -> active, later -> pending.
    steps = [
        {"name": "Input", "status": "done" if st.session_state.discovery_step > 0 else "active"},
        {"name": "Encode", "status": "done" if st.session_state.discovery_step > 1 else ("active" if st.session_state.discovery_step == 1 else "pending")},
        {"name": "Search", "status": "done" if st.session_state.discovery_step > 2 else ("active" if st.session_state.discovery_step == 2 else "pending")},
        {"name": "Predict", "status": "done" if st.session_state.discovery_step > 3 else ("active" if st.session_state.discovery_step == 3 else "pending")},
        {"name": "Results", "status": "active" if st.session_state.discovery_step == 4 else "pending"},
    ]

    pipeline_progress(steps)

    spacer("2rem")
    divider()
    spacer("2rem")

    # Results Section
    section_header("Results", "🎯")

    # A click with a non-empty query jumps straight to the final step
    # (no intermediate animation in this demo flow).
    if search_clicked and query:
        st.session_state.discovery_step = 4
        st.session_state.discovery_query = query

    if st.session_state.discovery_step >= 4:
        # Show results
        tabs = st.tabs(["Top Candidates", "Property Analysis", "Evidence"])

        with tabs[0]:
            # Results list (hard-coded demo candidates).
            results = [
                {"name": "Candidate A", "score": 0.95, "mw": "342.4", "logp": "2.1", "hbd": "2"},
                {"name": "Candidate B", "score": 0.89, "mw": "298.3", "logp": "1.8", "hbd": "3"},
                {"name": "Candidate C", "score": 0.82, "mw": "415.5", "logp": "3.2", "hbd": "1"},
                {"name": "Candidate D", "score": 0.76, "mw": "267.3", "logp": "1.5", "hbd": "2"},
                {"name": "Candidate E", "score": 0.71, "mw": "389.4", "logp": "2.8", "hbd": "2"},
            ]

            for r in results:
                # Traffic-light coloring: >=0.8 green, >=0.5 amber, else rose.
                score_color = COLORS.emerald if r["score"] >= 0.8 else (COLORS.amber if r["score"] >= 0.5 else COLORS.rose)
                st.markdown(f"""
                <div class="result">
                    <div style="display: flex; justify-content: space-between; align-items: flex-start;">
                        <div>
                            <div style="font-weight: 600; color: {COLORS.text_primary};">{r["name"]}</div>
                            <div style="display: flex; gap: 1rem; margin-top: 0.5rem;">
                                <span style="font-size: 0.8125rem; color: {COLORS.text_muted};">MW: {r["mw"]}</span>
                                <span style="font-size: 0.8125rem; color: {COLORS.text_muted};">LogP: {r["logp"]}</span>
                                <span style="font-size: 0.8125rem; color: {COLORS.text_muted};">HBD: {r["hbd"]}</span>
                            </div>
                        </div>
                        <div style="font-size: 1.5rem; font-weight: 700; color: {score_color};">{r["score"]:.0%}</div>
                    </div>
                </div>
                """, unsafe_allow_html=True)
                spacer("0.75rem")

        with tabs[1]:
            # Property distribution (demo histograms).
            col1, col2 = st.columns(2)

            with col1:
                bar_chart(
                    {"<200": 5, "200-300": 12, "300-400": 8, "400-500": 3, ">500": 2},
                    title="Molecular Weight Distribution",
                    height=250
                )

            with col2:
                bar_chart(
                    {"<1": 4, "1-2": 10, "2-3": 8, "3-4": 5, ">4": 3},
                    title="LogP Distribution",
                    height=250
                )

        with tabs[2]:
            # Evidence — literature cards (demo entries).
            st.markdown(f"""
            <div class="card">
                <div style="font-weight: 600; color: {COLORS.text_primary}; margin-bottom: 1rem;">
                    Related Literature
                </div>
            </div>
            """, unsafe_allow_html=True)

            papers = [
                {"title": "Novel therapeutic targets for cancer treatment", "year": "2024", "journal": "Nature Medicine"},
                {"title": "Molecular docking studies of kinase inhibitors", "year": "2023", "journal": "J. Med. Chem."},
                {"title": "AI-driven drug discovery approaches", "year": "2024", "journal": "Drug Discovery Today"},
            ]

            for p in papers:
                st.markdown(f"""
                <div style="
                    padding: 1rem;
                    border: 1px solid {COLORS.border};
                    border-radius: 8px;
                    margin-bottom: 0.75rem;
                ">
                    <div style="font-weight: 500; color: {COLORS.text_primary};">{p["title"]}</div>
                    <div style="font-size: 0.8125rem; color: {COLORS.text_muted}; margin-top: 0.25rem;">
                        {p["journal"]} • {p["year"]}
                    </div>
                </div>
                """, unsafe_allow_html=True)

    else:
        empty_state(
            "🔍",
            "No Results Yet",
            "Enter a query and click Search to find drug candidates"
        )
bioflow/ui/pages/explorer.py ADDED
@@ -0,0 +1,127 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
"""
BioFlow - Explorer Page
=======================
Data exploration and visualization.
"""

import streamlit as st
import sys
import os
import numpy as np

# Make the repository root importable when this page module is loaded directly.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))

from bioflow.ui.components import (
    page_header, section_header, divider, spacer,
    scatter_chart, heatmap, metric_card, empty_state
)
from bioflow.ui.config import COLORS


def render():
    """Render the explorer page.

    Shows a synthetic 2-D embedding scatter (four Gaussian clusters),
    summary metric cards, a symmetric demo similarity heatmap, and
    export buttons. All data is generated locally with fixed RNG seeds
    so the view is reproducible across reruns.
    """

    page_header("Data Explorer", "Visualize molecular embeddings and relationships", "🧬")

    # Controls
    col1, col2, col3, col4 = st.columns(4)

    with col1:
        dataset = st.selectbox("Dataset", ["DrugBank", "ChEMBL", "ZINC", "Custom"])
    with col2:
        viz_type = st.selectbox("Visualization", ["UMAP", "t-SNE", "PCA"])
    with col3:
        # NOTE(review): color_by is not yet wired into the scatter below.
        color_by = st.selectbox("Color by", ["Activity", "MW", "LogP", "Cluster"])
    with col4:
        st.write("")  # Spacing to vertically align the button with the selectboxes
        st.write("")
        if st.button("🔄 Refresh", use_container_width=True):
            st.rerun()

    spacer("1.5rem")

    # Main visualization area
    col_viz, col_details = st.columns([2, 1])

    with col_viz:
        section_header("Embedding Space", "🗺️")

        # Generate sample data — fixed seed keeps the layout stable.
        np.random.seed(42)
        n_points = 200

        # Create four clusters of n_points // 4 each (totals n_points).
        cluster1_x = np.random.normal(2, 0.8, n_points // 4)
        cluster1_y = np.random.normal(3, 0.8, n_points // 4)

        cluster2_x = np.random.normal(-2, 1, n_points // 4)
        cluster2_y = np.random.normal(-1, 1, n_points // 4)

        cluster3_x = np.random.normal(4, 0.6, n_points // 4)
        cluster3_y = np.random.normal(-2, 0.6, n_points // 4)

        cluster4_x = np.random.normal(-1, 0.9, n_points // 4)
        cluster4_y = np.random.normal(4, 0.9, n_points // 4)

        x = list(cluster1_x) + list(cluster2_x) + list(cluster3_x) + list(cluster4_x)
        y = list(cluster1_y) + list(cluster2_y) + list(cluster3_y) + list(cluster4_y)
        labels = [f"Mol_{i}" for i in range(n_points)]

        scatter_chart(x, y, labels, title=f"{viz_type} Projection - {dataset}", height=450)

    with col_details:
        section_header("Statistics", "📊")

        # Static demo metrics.
        metric_card("12,450", "Total Molecules", "🧪", color=COLORS.primary)
        spacer("0.75rem")
        metric_card("4", "Clusters Found", "🎯", color=COLORS.cyan)
        spacer("0.75rem")
        metric_card("0.89", "Silhouette Score", "📈", color=COLORS.emerald)
        spacer("0.75rem")
        metric_card("85%", "Coverage", "✓", color=COLORS.amber)

    spacer("2rem")
    divider()
    spacer("2rem")

    # Similarity Heatmap
    section_header("Similarity Matrix", "🔥")

    # Sample similarity matrix — symmetrized random values with a unit
    # diagonal (every cluster is maximally similar to itself).
    np.random.seed(123)
    labels_short = ["Cluster A", "Cluster B", "Cluster C", "Cluster D", "Cluster E"]
    similarity = np.random.uniform(0.3, 1.0, (5, 5))
    similarity = (similarity + similarity.T) / 2  # Make symmetric
    np.fill_diagonal(similarity, 1.0)

    heatmap(
        similarity.tolist(),
        labels_short,
        labels_short,
        title="Inter-cluster Similarity",
        height=350
    )

    spacer("2rem")

    # Export options — buttons are currently placeholders with no handlers.
    st.markdown(f"""
    <div class="card">
        <div style="display: flex; justify-content: space-between; align-items: center;">
            <div>
                <div style="font-weight: 600; color: {COLORS.text_primary};">Export Data</div>
                <div style="font-size: 0.8125rem; color: {COLORS.text_muted}; margin-top: 0.25rem;">
                    Download embeddings, clusters, or full dataset
                </div>
            </div>
        </div>
    </div>
    """, unsafe_allow_html=True)

    exp_cols = st.columns(3)
    with exp_cols[0]:
        st.button("📥 Embeddings (CSV)", use_container_width=True)
    with exp_cols[1]:
        st.button("📥 Clusters (JSON)", use_container_width=True)
    with exp_cols[2]:
        st.button("📥 Full Dataset", use_container_width=True)
bioflow/ui/pages/home.py ADDED
@@ -0,0 +1,213 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
"""
BioFlow - Home Page
====================
Clean dashboard with key metrics and quick actions.
"""

import streamlit as st
import sys
import os

# Make the repository root importable when this page module is loaded directly.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))

from bioflow.ui.components import (
    section_header, divider, spacer,
    metric_card, pipeline_progress, bar_chart
)
from bioflow.ui.config import COLORS


def render():
    """Render the home dashboard.

    Sections: hero banner with navigation buttons, four metric cards,
    quick-action tiles, recent discoveries, and a pipeline-activity
    column. Navigation works by writing the target page name into
    ``st.session_state.current_page`` and triggering ``st.rerun()``,
    which the top-level router picks up. All displayed numbers are
    static demo values.
    """

    # Hero Section (Tailark-inspired)
    hero_col, hero_side = st.columns([3, 1.4])

    with hero_col:
        st.markdown(f"""
        <div class="hero">
            <div class="hero-badge">New • BioFlow 2.0</div>
            <div class="hero-title">AI-Powered Drug Discovery</div>
            <div class="hero-subtitle">
                Run discovery pipelines, predict binding, and surface evidence in one streamlined workspace.
            </div>
            <div class="hero-actions">
                <span class="badge badge-primary">Model-aware search</span>
                <span class="badge badge-success">Evidence-linked</span>
                <span class="badge badge-warning">Fast iteration</span>
            </div>
        </div>
        """, unsafe_allow_html=True)

        spacer("0.75rem")
        btn1, btn2 = st.columns(2)
        with btn1:
            if st.button("Start Discovery", type="primary", use_container_width=True):
                st.session_state.current_page = "discovery"
                st.rerun()
        with btn2:
            if st.button("Explore Data", use_container_width=True):
                st.session_state.current_page = "explorer"
                st.rerun()

    with hero_side:
        # Side card with today's headline figure and category badges.
        st.markdown(f"""
        <div class="hero-card">
            <div style="font-size: 0.75rem; text-transform: uppercase; letter-spacing: 0.08em; color: {COLORS.text_muted};">
                Today
            </div>
            <div style="font-size: 1.75rem; font-weight: 700; color: {COLORS.text_primary}; margin-top: 0.5rem;">
                156 Discoveries
            </div>
            <div style="font-size: 0.875rem; color: {COLORS.text_muted}; margin-top: 0.5rem;">
                +12% vs last week
            </div>
            <div class="divider" style="margin: 1rem 0;"></div>
            <div style="display: flex; flex-direction: column; gap: 0.5rem;">
                <span class="badge badge-primary">Discovery</span>
                <span class="badge badge-success">Prediction</span>
                <span class="badge badge-warning">Evidence</span>
            </div>
        </div>
        """, unsafe_allow_html=True)

    # Metrics Row — static demo values.
    cols = st.columns(4)

    with cols[0]:
        metric_card("12.5M", "Molecules", "🧪", "+2.3%", "up", COLORS.primary)
    with cols[1]:
        metric_card("847K", "Proteins", "🧬", "+1.8%", "up", COLORS.cyan)
    with cols[2]:
        metric_card("1.2M", "Papers", "📚", "+5.2%", "up", COLORS.emerald)
    with cols[3]:
        metric_card("156", "Discoveries", "✨", "+12%", "up", COLORS.amber)

    spacer("2rem")

    # Quick Actions — each tile is decorative HTML paired with a real
    # Streamlit button that performs the navigation.
    section_header("Quick Actions", "⚡")

    action_cols = st.columns(4)

    with action_cols[0]:
        st.markdown(f"""
        <div class="quick-action">
            <span class="quick-action-icon">🔍</span>
            <div class="quick-action-title">New Discovery</div>
            <div class="quick-action-desc">Start a pipeline</div>
        </div>
        """, unsafe_allow_html=True)
        if st.button("Start", key="qa_discovery", use_container_width=True):
            st.session_state.current_page = "discovery"
            st.rerun()

    with action_cols[1]:
        st.markdown(f"""
        <div class="quick-action">
            <span class="quick-action-icon">📊</span>
            <div class="quick-action-title">Explore Data</div>
            <div class="quick-action-desc">Visualize embeddings</div>
        </div>
        """, unsafe_allow_html=True)
        if st.button("Explore", key="qa_explorer", use_container_width=True):
            st.session_state.current_page = "explorer"
            st.rerun()

    with action_cols[2]:
        st.markdown(f"""
        <div class="quick-action">
            <span class="quick-action-icon">📁</span>
            <div class="quick-action-title">Upload Data</div>
            <div class="quick-action-desc">Add molecules</div>
        </div>
        """, unsafe_allow_html=True)
        if st.button("Upload", key="qa_data", use_container_width=True):
            st.session_state.current_page = "data"
            st.rerun()

    with action_cols[3]:
        st.markdown(f"""
        <div class="quick-action">
            <span class="quick-action-icon">⚙️</span>
            <div class="quick-action-title">Settings</div>
            <div class="quick-action-desc">Configure models</div>
        </div>
        """, unsafe_allow_html=True)
        if st.button("Configure", key="qa_settings", use_container_width=True):
            st.session_state.current_page = "settings"
            st.rerun()

    spacer("2rem")
    divider()
    spacer("2rem")

    # Two Column Layout
    col1, col2 = st.columns([3, 2])

    with col1:
        section_header("Recent Discoveries", "🎯")

        # Sample results
        results = [
            {"name": "Aspirin analog", "score": 0.94, "mw": "180.16"},
            {"name": "Novel kinase inhibitor", "score": 0.87, "mw": "331.39"},
            {"name": "EGFR binder candidate", "score": 0.72, "mw": "311.38"},
        ]

        for r in results:
            # Traffic-light coloring: >=0.8 green, >=0.5 amber, else rose.
            score_color = COLORS.emerald if r["score"] >= 0.8 else (COLORS.amber if r["score"] >= 0.5 else COLORS.rose)
            st.markdown(f"""
            <div class="result">
                <div style="display: flex; justify-content: space-between; align-items: center;">
                    <div style="font-weight: 600; color: {COLORS.text_primary};">{r["name"]}</div>
                    <div style="font-size: 1.25rem; font-weight: 700; color: {score_color};">{r["score"]:.0%}</div>
                </div>
                <div style="font-size: 0.8125rem; color: {COLORS.text_muted}; margin-top: 0.5rem;">
                    MW: {r["mw"]}
                </div>
            </div>
            """, unsafe_allow_html=True)
            spacer("0.75rem")

    with col2:
        section_header("Pipeline Activity", "📈")

        bar_chart(
            {"Mon": 23, "Tue": 31, "Wed": 28, "Thu": 45, "Fri": 38, "Sat": 12, "Sun": 8},
            title="",
            height=250
        )

        spacer("1rem")
        section_header("Active Pipeline", "🔄")

        pipeline_progress([
            {"name": "Encode", "status": "done"},
            {"name": "Search", "status": "active"},
            {"name": "Predict", "status": "pending"},
            {"name": "Verify", "status": "pending"},
        ])

    spacer("2rem")

    # Tip
    st.markdown(f"""
    <div style="
        background: {COLORS.bg_surface};
        border: 1px solid {COLORS.border};
        border-radius: 12px;
        padding: 1.25rem;
        display: flex;
        align-items: center;
        gap: 1rem;
    ">
        <span style="font-size: 1.5rem;">💡</span>
        <div>
            <div style="font-size: 0.9375rem; color: {COLORS.text_primary}; font-weight: 500;">Pro Tip</div>
            <div style="font-size: 0.8125rem; color: {COLORS.text_muted};">
                Use natural language like "Find molecules similar to aspirin that can cross the blood-brain barrier"
            </div>
        </div>
    </div>
    """, unsafe_allow_html=True)
bioflow/ui/pages/settings.py ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow - Settings Page
3
+ ========================
4
+ Configuration and preferences.
5
+ """
6
+
7
+ import streamlit as st
8
+ import sys
9
+ import os
10
+
11
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))
12
+
13
+ from bioflow.ui.components import page_header, section_header, divider, spacer
14
+ from bioflow.ui.config import COLORS
15
+
16
+
17
def render():
    """Render the settings page: model, database, API-key and appearance tabs."""

    page_header("Settings", "Configure models, databases, and preferences", "⚙️")

    # One named tab handle per configuration area.
    models_tab, database_tab, keys_tab, appearance_tab = st.tabs(
        ["🧠 Models", "🗄️ Database", "🔌 API Keys", "🎨 Appearance"]
    )

    with models_tab:
        section_header("Model Configuration", "🧠")

        st.markdown(f"""
        <div class="card" style="margin-bottom: 1rem;">
            <div style="font-weight: 600; color: {COLORS.text_primary}; margin-bottom: 0.5rem;">
                Embedding Models
            </div>
            <div style="font-size: 0.8125rem; color: {COLORS.text_muted};">
                Configure models used for molecular and protein embeddings
            </div>
        </div>
        """, unsafe_allow_html=True)

        left, right = st.columns(2)

        with left:
            st.selectbox(
                "Molecule Encoder",
                ["MolCLR (Recommended)", "ChemBERTa", "GraphMVP", "MolBERT"],
                help="Model for generating molecular embeddings"
            )

            st.selectbox(
                "Protein Encoder",
                ["ESM-2 (Recommended)", "ProtTrans", "UniRep", "SeqVec"],
                help="Model for generating protein embeddings"
            )

        with right:
            st.selectbox(
                "Binding Predictor",
                ["DrugBAN (Recommended)", "DeepDTA", "GraphDTA", "Custom"],
                help="Model for predicting drug-target binding"
            )

            st.selectbox(
                "Property Predictor",
                ["ADMET-AI (Recommended)", "ChemProp", "Custom"],
                help="Model for ADMET property prediction"
            )

        spacer("1rem")

        st.markdown(f"""
        <div class="card" style="margin-bottom: 1rem;">
            <div style="font-weight: 600; color: {COLORS.text_primary}; margin-bottom: 0.5rem;">
                LLM Settings
            </div>
            <div style="font-size: 0.8125rem; color: {COLORS.text_muted};">
                Configure language models for evidence retrieval and reasoning
            </div>
        </div>
        """, unsafe_allow_html=True)

        left, right = st.columns(2)

        with left:
            st.selectbox(
                "LLM Provider",
                ["OpenAI", "Anthropic", "Local (Ollama)", "Azure OpenAI"]
            )

        with right:
            st.selectbox(
                "Model",
                ["GPT-4o", "GPT-4-turbo", "Claude 3.5 Sonnet", "Llama 3.1 70B"]
            )

        st.slider("Temperature", 0.0, 1.0, 0.7, 0.1)
        st.number_input("Max Tokens", 100, 4096, 2048, 100)

    with database_tab:
        section_header("Database Configuration", "🗄️")

        st.markdown(f"""
        <div class="card" style="margin-bottom: 1rem;">
            <div style="font-weight: 600; color: {COLORS.text_primary}; margin-bottom: 0.5rem;">
                Vector Database
            </div>
            <div style="font-size: 0.8125rem; color: {COLORS.text_muted};">
                Configure the vector store for similarity search
            </div>
        </div>
        """, unsafe_allow_html=True)

        left, right = st.columns(2)

        with left:
            st.selectbox("Vector Store", ["Qdrant (Recommended)", "Milvus", "Pinecone", "Weaviate", "ChromaDB"])
            st.text_input("Host", value="localhost")

        with right:
            st.number_input("Port", 1, 65535, 6333)
            st.text_input("Collection Name", value="bioflow_embeddings")

        spacer("1rem")

        st.markdown(f"""
        <div class="card" style="margin-bottom: 1rem;">
            <div style="font-weight: 600; color: {COLORS.text_primary}; margin-bottom: 0.5rem;">
                Knowledge Sources
            </div>
            <div style="font-size: 0.8125rem; color: {COLORS.text_muted};">
                External databases for evidence retrieval
            </div>
        </div>
        """, unsafe_allow_html=True)

        left, right = st.columns(2)

        with left:
            st.checkbox("PubMed", value=True)
            st.checkbox("DrugBank", value=True)
            st.checkbox("ChEMBL", value=True)

        with right:
            st.checkbox("UniProt", value=True)
            st.checkbox("KEGG", value=False)
            st.checkbox("Reactome", value=False)

    with keys_tab:
        section_header("API Keys", "🔌")

        st.warning("⚠️ API keys are stored locally and never sent to external servers.")

        st.text_input("OpenAI API Key", type="password", placeholder="sk-...")
        st.text_input("Anthropic API Key", type="password", placeholder="sk-ant-...")
        st.text_input("PubMed API Key", type="password", placeholder="Optional - for higher rate limits")
        st.text_input("ChEMBL API Key", type="password", placeholder="Optional")

        spacer("1rem")

        if st.button("💾 Save API Keys", type="primary"):
            st.success("✓ API keys saved securely")

    with appearance_tab:
        section_header("Appearance", "🎨")

        st.selectbox("Theme", ["Dark (Default)", "Light", "System"])
        st.selectbox("Accent Color", ["Purple", "Blue", "Green", "Cyan", "Pink"])
        st.checkbox("Enable animations", value=True)
        st.checkbox("Compact mode", value=False)
        st.slider("Font size", 12, 18, 14)

    spacer("2rem")
    divider()
    spacer("1rem")

    # Save / reset action row (third column is just spacing).
    save_col, reset_col, _spacer_col = st.columns([1, 1, 2])

    with save_col:
        if st.button("💾 Save Settings", type="primary", use_container_width=True):
            st.success("✓ Settings saved successfully!")

    with reset_col:
        if st.button("🔄 Reset to Defaults", use_container_width=True):
            st.info("Settings reset to defaults")

    spacer("2rem")

    # Version footer.
    st.markdown(f"""
    <div style="text-align: center; padding: 1rem; color: {COLORS.text_muted}; font-size: 0.75rem;">
        BioFlow v0.1.0 • Built with OpenBioMed
    </div>
    """, unsafe_allow_html=True)
bioflow/ui/requirements.txt ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # BioFlow UI Dependencies
2
+ # =======================
3
+
4
+ # Core Streamlit
5
+ streamlit>=1.29.0
6
+
7
+ # Visualization
8
+ plotly>=5.18.0
9
+ altair>=5.2.0
10
+
11
+ # Data handling
12
+ pandas>=2.0.0
13
+ numpy>=1.24.0
14
+
15
+ # Molecular visualization (optional)
16
+ rdkit>=2023.9.1
17
+ py3Dmol>=2.0.0
18
+
19
+ # Machine Learning (optional, for real encoders)
20
+ # torch>=2.0.0
21
+ # transformers>=4.35.0
22
+
23
+ # Vector database
24
+ qdrant-client>=1.7.0
25
+
26
+ # Image processing
27
+ Pillow>=10.0.0
28
+
29
+ # Utilities
30
+ python-dotenv>=1.0.0
31
+ pyyaml>=6.0.1
bioflow/visualizer.py ADDED
@@ -0,0 +1,386 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Visualizer - Embedding and Structure Visualization
3
+ ===========================================================
4
+
5
+ This module provides visualization utilities for embeddings,
6
+ molecular structures, and search results.
7
+ """
8
+
9
+ import numpy as np
10
+ from typing import List, Dict, Any, Optional, Tuple
11
+ import logging
12
+
13
+ logger = logging.getLogger(__name__)
14
+
15
+ # Optional imports for visualization
16
+ try:
17
+ import plotly.express as px
18
+ import plotly.graph_objects as go
19
+ from plotly.subplots import make_subplots
20
+ PLOTLY_AVAILABLE = True
21
+ except ImportError:
22
+ PLOTLY_AVAILABLE = False
23
+
24
+ try:
25
+ from sklearn.decomposition import PCA
26
+ from sklearn.manifold import TSNE
27
+ SKLEARN_AVAILABLE = True
28
+ except ImportError:
29
+ SKLEARN_AVAILABLE = False
30
+
31
+
32
class EmbeddingVisualizer:
    """Visualize high-dimensional embeddings in 2D/3D.

    All methods are static. Plotting requires plotly and dimensionality
    reduction requires scikit-learn; both are optional dependencies guarded
    by the module-level PLOTLY_AVAILABLE / SKLEARN_AVAILABLE flags.
    """

    @staticmethod
    def reduce_dimensions(
        embeddings: np.ndarray,
        method: str = "pca",
        n_components: int = 2,
        **kwargs
    ) -> np.ndarray:
        """
        Reduce embedding dimensions for visualization.

        Args:
            embeddings: Array of shape (n_samples, n_features).
            method: 'pca' or 'tsne'.
            n_components: Target dimensions (2 or 3).
            **kwargs: Forwarded to the sklearn reducer constructor.

        Returns:
            Reduced embeddings of shape (n_samples, n_components).

        Raises:
            ImportError: If scikit-learn is not installed.
            ValueError: If method is not 'pca' or 'tsne'.
        """
        if not SKLEARN_AVAILABLE:
            raise ImportError("sklearn required for dimensionality reduction")

        if method == "pca":
            reducer = PCA(n_components=n_components, **kwargs)
        elif method == "tsne":
            reducer = TSNE(n_components=n_components, **kwargs)
        else:
            raise ValueError(f"Unknown method: {method}")

        return reducer.fit_transform(embeddings)

    @staticmethod
    def plot_embeddings_2d(
        embeddings: np.ndarray,
        labels: List[str] = None,
        colors: List[str] = None,
        title: str = "Embedding Space",
        hover_data: List[Dict] = None
    ):
        """
        Create a 2D scatter plot of embeddings.

        Embeddings with more than 2 dimensions are first projected with PCA.
        If `colors` is given, points are grouped into one trace per distinct
        color value (shown as separate legend entries).

        Returns:
            Plotly figure object.

        Raises:
            ImportError: If plotly is not installed.
        """
        if not PLOTLY_AVAILABLE:
            raise ImportError("plotly required for visualization")

        if embeddings.shape[1] > 2:
            coords = EmbeddingVisualizer.reduce_dimensions(embeddings, "pca", 2)
        else:
            coords = embeddings

        fig = go.Figure()

        if colors:
            # One trace per distinct color so each group gets a legend entry.
            unique_colors = list(set(colors))
            for color in unique_colors:
                mask = [c == color for c in colors]
                x = [coords[i, 0] for i in range(len(coords)) if mask[i]]
                y = [coords[i, 1] for i in range(len(coords)) if mask[i]]
                text = [labels[i] if labels else f"Point {i}" for i in range(len(coords)) if mask[i]]

                fig.add_trace(go.Scatter(
                    x=x, y=y,
                    mode='markers',
                    name=color,
                    text=text,
                    hoverinfo='text'
                ))
        else:
            fig.add_trace(go.Scatter(
                x=coords[:, 0],
                y=coords[:, 1],
                mode='markers',
                text=labels or [f"Point {i}" for i in range(len(coords))],
                hoverinfo='text'
            ))

        fig.update_layout(
            title=title,
            xaxis_title="Dimension 1",
            yaxis_title="Dimension 2",
            hovermode='closest'
        )

        return fig

    @staticmethod
    def plot_embeddings_3d(
        embeddings: np.ndarray,
        labels: List[str] = None,
        colors: List[str] = None,
        title: str = "3D Embedding Space"
    ):
        """
        Create a 3D scatter plot of embeddings.

        Embeddings with more than 3 dimensions are first projected with PCA.

        Args:
            embeddings: Array of shape (n_samples, n_features).
            labels: Optional hover labels, one per point.
            colors: Optional per-point marker colors.
            title: Figure title.

        Returns:
            Plotly figure object.

        Raises:
            ImportError: If plotly is not installed.
        """
        if not PLOTLY_AVAILABLE:
            raise ImportError("plotly required for visualization")

        if embeddings.shape[1] > 3:
            coords = EmbeddingVisualizer.reduce_dimensions(embeddings, "pca", 3)
        else:
            coords = embeddings

        fig = go.Figure(data=[go.Scatter3d(
            x=coords[:, 0],
            y=coords[:, 1],
            z=coords[:, 2],
            mode='markers',
            text=labels or [f"Point {i}" for i in range(len(coords))],
            marker=dict(
                size=5,
                # Bug fix: `colors` was previously accepted but discarded —
                # marker color was set to None whenever colors were supplied.
                # Now an explicit colors list is honored; otherwise fall back
                # to an index gradient over the Viridis colorscale.
                color=colors if colors else list(range(len(coords))),
                colorscale='Viridis',
                opacity=0.8
            )
        )])

        fig.update_layout(
            title=title,
            scene=dict(
                xaxis_title='Dim 1',
                yaxis_title='Dim 2',
                zaxis_title='Dim 3'
            )
        )

        return fig

    @staticmethod
    def plot_similarity_matrix(
        embeddings: np.ndarray,
        labels: List[str] = None,
        title: str = "Similarity Matrix"
    ):
        """
        Plot the pairwise cosine-similarity matrix as a heatmap.

        Rows are L2-normalized (with a 1e-9 floor to avoid division by
        zero) before the dot product, so values lie in [-1, 1].

        Returns:
            Plotly figure object.

        Raises:
            ImportError: If plotly is not installed.
        """
        if not PLOTLY_AVAILABLE:
            raise ImportError("plotly required for visualization")

        # Cosine similarity: normalize rows, then dot with the transpose.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        normalized = embeddings / np.clip(norms, 1e-9, None)
        similarity = np.dot(normalized, normalized.T)

        labels = labels or [f"Item {i}" for i in range(len(embeddings))]

        fig = go.Figure(data=go.Heatmap(
            z=similarity,
            x=labels,
            y=labels,
            colorscale='RdBu',
            zmid=0
        ))

        fig.update_layout(
            title=title,
            xaxis_title="Items",
            yaxis_title="Items"
        )

        return fig
196
+
197
+
198
class MoleculeVisualizer:
    """Visualize molecular structures (RDKit-backed, optional dependency)."""

    @staticmethod
    def smiles_to_svg(smiles: str, size: Tuple[int, int] = (300, 200)) -> str:
        """
        Render a SMILES string as an SVG image.

        Args:
            smiles: SMILES string.
            size: (width, height) tuple.

        Returns:
            SVG markup; a placeholder SVG when RDKit is missing or the
            SMILES cannot be parsed.
        """
        try:
            from rdkit import Chem
            from rdkit.Chem import Draw
        except ImportError:
            return "<svg><text>RDKit not available</text></svg>"

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return "<svg><text>Invalid SMILES</text></svg>"

        width, height = size
        drawer = Draw.MolDraw2DSVG(width, height)
        drawer.DrawMolecule(mol)
        drawer.FinishDrawing()
        return drawer.GetDrawingText()

    @staticmethod
    def plot_molecule_grid(
        smiles_list: List[str],
        labels: List[str] = None,
        mols_per_row: int = 4,
        size: Tuple[int, int] = (200, 200)
    ):
        """
        Create a grid of molecule images.

        Args:
            smiles_list: SMILES strings to draw.
            labels: Per-molecule legends; defaults to the SMILES themselves.
            mols_per_row: Grid width in molecules.
            size: (width, height) of each sub-image.

        Returns:
            PIL Image, or None when RDKit is unavailable.
        """
        try:
            from rdkit import Chem
            from rdkit.Chem import Draw
        except ImportError:
            logger.warning("RDKit not available for molecule visualization")
            return None

        parsed = [Chem.MolFromSmiles(s) for s in smiles_list]
        return Draw.MolsToGridImage(
            parsed,
            molsPerRow=mols_per_row,
            subImgSize=size,
            legends=labels or smiles_list,
        )
258
+
259
+
260
class ResultsVisualizer:
    """Visualize search and pipeline results (plotting requires plotly)."""

    @staticmethod
    def _modality_counts(modalities: List[str]) -> Dict[str, int]:
        """Count modality occurrences, preserving first-seen order.

        Replaces the previous list(set(...)) + list.count pattern, which
        was O(n^2) and produced a nondeterministic label order in charts.
        """
        counts: Dict[str, int] = {}
        for m in modalities:
            counts[m] = counts.get(m, 0) + 1
        return counts

    @staticmethod
    def plot_search_scores(
        results: List[Dict[str, Any]],
        title: str = "Search Results Scores"
    ):
        """
        Plot a bar chart of search-result scores.

        Bars are colored by modality (blue=text, green=molecule, red=other)
        and annotated with "<modality>: <score>".

        Raises:
            ImportError: If plotly is not installed.
        """
        if not PLOTLY_AVAILABLE:
            raise ImportError("plotly required")

        labels = [r.get("content", "")[:30] for r in results]
        scores = [r.get("score", 0) for r in results]
        modalities = [r.get("modality", "unknown") for r in results]

        fig = go.Figure(data=[go.Bar(
            x=labels,
            y=scores,
            marker_color=[
                'blue' if m == 'text' else 'green' if m in ['smiles', 'molecule'] else 'red'
                for m in modalities
            ],
            text=[f"{m}: {s:.3f}" for m, s in zip(modalities, scores)],
            textposition='outside'
        )])

        fig.update_layout(
            title=title,
            xaxis_title="Content",
            yaxis_title="Similarity Score",
            xaxis_tickangle=-45
        )

        return fig

    @staticmethod
    def plot_modality_distribution(
        items: List[Dict[str, Any]],
        title: str = "Modality Distribution"
    ):
        """
        Plot a donut chart of the modality distribution.

        Raises:
            ImportError: If plotly is not installed.
        """
        if not PLOTLY_AVAILABLE:
            raise ImportError("plotly required")

        # Deterministic, O(n) counting (was list(set) + count, O(n^2)).
        counts = ResultsVisualizer._modality_counts(
            [item.get("modality", "unknown") for item in items]
        )

        fig = go.Figure(data=[go.Pie(
            labels=list(counts),
            values=list(counts.values()),
            hole=0.3
        )])

        fig.update_layout(title=title)
        return fig

    @staticmethod
    def create_dashboard(
        search_results: List[Dict],
        embeddings: np.ndarray = None,
        labels: List[str] = None
    ):
        """
        Create a 2x2 dashboard: scores, modality pie, embedding scatter,
        similarity heatmap. The embedding panels are included only when
        `embeddings` is provided (heatmap needs at least two rows).

        Raises:
            ImportError: If plotly is not installed.
        """
        if not PLOTLY_AVAILABLE:
            raise ImportError("plotly required")

        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=(
                "Search Scores",
                "Modality Distribution",
                "Embedding Space",
                "Similarity Heatmap"
            ),
            specs=[
                [{"type": "bar"}, {"type": "pie"}],
                [{"type": "scatter"}, {"type": "heatmap"}]
            ]
        )

        # Panel 1: search scores
        scores = [r.get("score", 0) for r in search_results]
        fig.add_trace(
            go.Bar(y=scores, name="Scores"),
            row=1, col=1
        )

        # Panel 2: modality distribution (deterministic order, O(n))
        counts = ResultsVisualizer._modality_counts(
            [r.get("modality", "unknown") for r in search_results]
        )
        fig.add_trace(
            go.Pie(labels=list(counts), values=list(counts.values()), name="Modalities"),
            row=1, col=2
        )

        # Panel 3: embedding scatter (if provided)
        if embeddings is not None and len(embeddings) > 0:
            if embeddings.shape[1] > 2:
                coords = EmbeddingVisualizer.reduce_dimensions(embeddings, "pca", 2)
            else:
                coords = embeddings
            fig.add_trace(
                go.Scatter(
                    x=coords[:, 0],
                    y=coords[:, 1],
                    mode='markers',
                    text=labels,
                    name="Embeddings"
                ),
                row=2, col=1
            )

        # Panel 4: cosine-similarity heatmap (needs >= 2 embeddings)
        if embeddings is not None and len(embeddings) > 1:
            norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
            normalized = embeddings / np.clip(norms, 1e-9, None)
            similarity = np.dot(normalized, normalized.T)
            fig.add_trace(
                go.Heatmap(z=similarity, colorscale='RdBu', name="Similarity"),
                row=2, col=2
            )

        fig.update_layout(height=800, title_text="BioFlow Dashboard")
        return fig
bioflow/workflows/__init__.py ADDED
@@ -0,0 +1,49 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Workflows
3
+ ==================
4
+
5
+ Pre-built pipelines for common discovery tasks.
6
+
7
+ Pipelines:
8
+ - DiscoveryPipeline: Drug discovery with DTI prediction
9
+ - LiteratureMiningPipeline: Scientific literature search
10
+ - ProteinDesignPipeline: Protein homolog search
11
+
12
+ Ingestion Utilities:
13
+ - load_json_data, load_csv_data
14
+ - parse_smiles_file, parse_fasta_file
15
+ - generate_sample_* for testing
16
+ """
17
+
18
+ from bioflow.workflows.discovery import (
19
+ DiscoveryPipeline,
20
+ DiscoveryResult,
21
+ LiteratureMiningPipeline,
22
+ ProteinDesignPipeline,
23
+ )
24
+
25
+ from bioflow.workflows.ingestion import (
26
+ load_json_data,
27
+ load_csv_data,
28
+ parse_smiles_file,
29
+ parse_fasta_file,
30
+ generate_sample_molecules,
31
+ generate_sample_proteins,
32
+ generate_sample_abstracts,
33
+ )
34
+
35
+ __all__ = [
36
+ # Pipelines
37
+ "DiscoveryPipeline",
38
+ "DiscoveryResult",
39
+ "LiteratureMiningPipeline",
40
+ "ProteinDesignPipeline",
41
+ # Ingestion
42
+ "load_json_data",
43
+ "load_csv_data",
44
+ "parse_smiles_file",
45
+ "parse_fasta_file",
46
+ "generate_sample_molecules",
47
+ "generate_sample_proteins",
48
+ "generate_sample_abstracts",
49
+ ]
bioflow/workflows/discovery.py ADDED
@@ -0,0 +1,400 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ BioFlow Discovery Pipeline
3
+ ===========================
4
+
5
+ High-level API for common discovery workflows.
6
+ Connects encoders, retrievers, and predictors into seamless pipelines.
7
+ """
8
+
9
+ import logging
10
+ from typing import List, Dict, Any, Optional, Union
11
+ from dataclasses import dataclass, field
12
+ from datetime import datetime
13
+
14
+ from bioflow.core import (
15
+ Modality,
16
+ ToolRegistry,
17
+ BioFlowOrchestrator,
18
+ WorkflowConfig,
19
+ NodeConfig,
20
+ NodeType,
21
+ RetrievalResult,
22
+ )
23
+ from bioflow.core.nodes import (
24
+ NodeResult,
25
+ EncodeNode,
26
+ RetrieveNode,
27
+ PredictNode,
28
+ IngestNode,
29
+ FilterNode,
30
+ TraceabilityNode,
31
+ )
32
+
33
+ logger = logging.getLogger(__name__)
34
+
35
+
36
+ @dataclass
37
+ class DiscoveryResult:
38
+ """Complete result from a discovery pipeline."""
39
+ query: str
40
+ query_modality: str
41
+ candidates: List[Dict[str, Any]]
42
+ predictions: List[Dict[str, Any]]
43
+ top_hits: List[Dict[str, Any]]
44
+ execution_time_ms: float
45
+ metadata: Dict[str, Any] = field(default_factory=dict)
46
+
47
+ def to_dict(self) -> Dict[str, Any]:
48
+ return {
49
+ "query": self.query,
50
+ "query_modality": self.query_modality,
51
+ "num_candidates": len(self.candidates),
52
+ "num_predictions": len(self.predictions),
53
+ "top_hits": self.top_hits[:5],
54
+ "execution_time_ms": self.execution_time_ms,
55
+ }
56
+
57
+ def __repr__(self):
58
+ return f"DiscoveryResult(hits={len(self.top_hits)}, time={self.execution_time_ms:.0f}ms)"
59
+
60
+
61
class DiscoveryPipeline:
    """
    High-level discovery pipeline for drug-target interactions.

    Workflow:
        1. Encode query (text/molecule/protein) → vector
        2. Search vector DB for similar compounds/sequences
        3. Predict binding affinity for each candidate
        4. Filter and rank by predicted score
        5. Add evidence links for traceability

    Example:
        >>> from bioflow.plugins import OBMEncoder, QdrantRetriever, DeepPurposePredictor
        >>>
        >>> pipeline = DiscoveryPipeline(
        ...     encoder=OBMEncoder(),
        ...     retriever=QdrantRetriever(...),
        ...     predictor=DeepPurposePredictor()
        ... )
        >>>
        >>> results = pipeline.discover(
        ...     query="EGFR inhibitor with low toxicity",
        ...     target_sequence="MRKH...",
        ...     limit=20
        ... )
    """

    def __init__(
        self,
        encoder,
        retriever,
        predictor,
        collection: str = "molecules"
    ):
        """
        Initialize the discovery pipeline.

        Args:
            encoder: BioEncoder instance (e.g., OBMEncoder)
            retriever: BioRetriever instance (e.g., QdrantRetriever)
            predictor: BioPredictor instance (e.g., DeepPurposePredictor)
            collection: Default collection for retrieval
        """
        self.encoder = encoder
        self.retriever = retriever
        self.predictor = predictor
        self.collection = collection

        # Pre-build one node per pipeline stage; per-call parameters are
        # assigned onto the nodes right before execution.
        self._encode_node = EncodeNode("encode", encoder, auto_detect=True)
        self._retrieve_node = RetrieveNode("retrieve", retriever, collection=collection)
        self._predict_node = PredictNode("predict", predictor)
        self._filter_node = FilterNode("filter", threshold=0.3, top_k=10)
        self._trace_node = TraceabilityNode("trace")

        logger.info("DiscoveryPipeline initialized")

    def discover(
        self,
        query: str,
        target_sequence: str,
        modality: Modality = None,
        limit: int = 20,
        filters: Dict[str, Any] = None,
        threshold: float = 0.3,
        top_k: int = 10
    ) -> DiscoveryResult:
        """
        Run the full discovery pipeline.

        Args:
            query: Search query (text, SMILES, or protein)
            target_sequence: Target protein sequence for DTI prediction
            modality: Input modality (defaults to TEXT; the encode node
                may still auto-detect the actual type)
            limit: Number of candidates to retrieve
            filters: Metadata filters for retrieval
            threshold: Minimum prediction score
            top_k: Number of top hits to return

        Returns:
            DiscoveryResult with ranked candidates
        """
        t0 = datetime.now()

        # Default modality; the encode node was built with auto_detect=True
        # so it may refine this when encoding.
        if modality is None:
            modality = Modality.TEXT

        # --- Stage 1: encode the query into a vector ---
        logger.info(f"Encoding query: {query[:50]}...")
        self._encode_node.modality = modality
        encoded = self._encode_node.execute(query)

        # --- Stage 2: similarity search for candidates ---
        logger.info(f"Retrieving up to {limit} candidates...")
        self._retrieve_node.limit = limit
        self._retrieve_node.filters = filters or {}
        retrieved = self._retrieve_node.execute(
            encoded.data,
            context={"modality": modality}
        )
        candidates = retrieved.data

        if not candidates:
            # Nothing to predict on — return an empty result immediately.
            logger.warning("No candidates found")
            return DiscoveryResult(
                query=query,
                query_modality=modality.value,
                candidates=[],
                predictions=[],
                top_hits=[],
                execution_time_ms=(datetime.now() - t0).total_seconds() * 1000
            )

        # --- Stage 3: DTI prediction against the target ---
        logger.info(f"Predicting binding for {len(candidates)} candidates...")
        self._predict_node.target_sequence = target_sequence
        self._predict_node.threshold = threshold
        predictions = self._predict_node.execute(candidates).data

        # --- Stage 4: threshold filter + top-k ranking ---
        logger.info("Filtering and ranking...")
        self._filter_node.threshold = threshold
        self._filter_node.top_k = top_k
        top_hits = self._filter_node.execute(predictions).data

        # --- Stage 5: attach evidence links ---
        logger.info("Adding evidence links...")
        enriched_hits = self._trace_node.execute(top_hits).data

        elapsed_ms = (datetime.now() - t0).total_seconds() * 1000

        return DiscoveryResult(
            query=query,
            query_modality=modality.value,
            candidates=[{"id": c.id, "content": c.content, "score": c.score} for c in candidates],
            predictions=predictions,
            top_hits=enriched_hits,
            execution_time_ms=elapsed_ms,
            metadata={
                "collection": self.collection,
                "limit": limit,
                "threshold": threshold,
                "target_length": len(target_sequence)
            }
        )

    def ingest(
        self,
        data: List[Dict[str, Any]],
        modality: Modality = Modality.SMILES,
        content_field: str = "smiles"
    ) -> List[str]:
        """
        Ingest data into the vector database.

        Args:
            data: List of items with content and metadata
            modality: Type of content
            content_field: Field name containing the content

        Returns:
            List of ingested IDs
        """
        node = IngestNode(
            "ingest",
            self.retriever,
            collection=self.collection,
            modality=modality,
            content_field=content_field
        )

        result = node.execute(data)
        logger.info(f"Ingested {len(result.data)} items into {self.collection}")

        return result.data

    def search(
        self,
        query: str,
        modality: Modality = None,
        limit: int = 10,
        filters: Dict[str, Any] = None
    ) -> List[RetrievalResult]:
        """
        Simple similarity search without binding prediction.

        Args:
            query: Search query
            modality: Input modality (auto-detected when None)
            limit: Maximum results
            filters: Metadata filters

        Returns:
            List of similar items
        """
        # Encode; enable auto-detection only when the caller left the
        # modality unspecified.
        self._encode_node.modality = modality or Modality.TEXT
        self._encode_node.auto_detect = modality is None
        encoded = self._encode_node.execute(query)

        # Retrieve matching items.
        self._retrieve_node.limit = limit
        self._retrieve_node.filters = filters or {}
        return self._retrieve_node.execute(encoded.data).data
276
+
277
+
278
class LiteratureMiningPipeline:
    """
    Pipeline for searching and analyzing scientific literature.

    Workflow:
        1. Encode query (text/molecule/protein)
        2. Search literature database
        3. Extract relevant evidence
        4. Rank by relevance and diversity
    """

    def __init__(
        self,
        encoder,
        retriever,
        collection: str = "pubmed_abstracts"
    ):
        """Store the encoder/retriever and build the three pipeline nodes."""
        self.encoder = encoder
        self.retriever = retriever
        self.collection = collection

        # Stage nodes: encode → retrieve → trace (evidence links).
        self._encode_node = EncodeNode("encode", encoder, auto_detect=True)
        self._retrieve_node = RetrieveNode("retrieve", retriever, collection=collection)
        self._trace_node = TraceabilityNode("trace")

    def search(
        self,
        query: str,
        modality: Modality = Modality.TEXT,
        limit: int = 20,
        filters: Dict[str, Any] = None
    ) -> List[Dict[str, Any]]:
        """
        Search literature for relevant papers.

        Args:
            query: Search query
            modality: Query type
            limit: Maximum results
            filters: Filters (e.g., year, species)

        Returns:
            List of papers with evidence links
        """
        # Encode the query into a vector.
        self._encode_node.modality = modality
        encoded = self._encode_node.execute(query)

        # Similarity search over the literature collection.
        self._retrieve_node.limit = limit
        self._retrieve_node.filters = filters or {}
        retrieved = self._retrieve_node.execute(encoded.data)

        # Attach evidence links to each result before returning.
        return self._trace_node.execute(retrieved.data).data
335
+
336
+
337
class ProteinDesignPipeline:
    """
    Pipeline for protein/antibody design workflows.

    Workflow:
        1. Encode seed protein
        2. Find similar sequences in database
        3. Analyze conservation and mutations
        4. Suggest design candidates
    """

    def __init__(
        self,
        encoder,
        retriever,
        collection: str = "proteins"
    ):
        """Store the encoder, retriever, and target collection name."""
        self.encoder = encoder
        self.retriever = retriever
        self.collection = collection

    def find_homologs(
        self,
        sequence: str,
        limit: int = 50,
        species_filter: str = None
    ) -> List[Dict[str, Any]]:
        """
        Find homologous proteins by embedding similarity.

        Args:
            sequence: Query protein sequence
            limit: Maximum results
            species_filter: Filter by species

        Returns:
            List of homologous proteins with metadata
        """
        # Embed the seed sequence in protein space.
        embedding = self.encoder.encode(sequence, Modality.PROTEIN)

        # Optional species restriction on the vector search.
        filters = {"species": species_filter} if species_filter else {}

        hits = self.retriever.search(
            query=embedding.vector,
            limit=limit,
            filters=filters if filters else None,
            collection=self.collection,
            modality=Modality.PROTEIN
        )

        # Flatten each hit into a plain dict, merging its payload fields.
        homologs = []
        for hit in hits:
            record = {
                "id": hit.id,
                "sequence": hit.content,
                "score": hit.score,
            }
            record.update(hit.payload)
            homologs.append(record)
        return homologs
bioflow/workflows/drug_discovery.yaml ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
# =============================================================================
# BioFlow Workflow Example: Drug Discovery Pipeline
# =============================================================================
# This YAML defines a simple drug discovery workflow that:
# 1. Encodes a query molecule
# 2. Retrieves similar compounds from memory
# 3. Predicts binding affinity for top candidates
# =============================================================================

name: drug_discovery_basic
description: >
  Basic drug discovery pipeline: encode query -> retrieve analogs -> predict DTI

# Define the nodes in execution order.
# Each node consumes the outputs of the nodes named in `inputs`; the special
# input name `input` refers to the pipeline's own input.
nodes:
  - id: encode_query
    type: encode
    tool: obm                # Uses OBM multimodal encoder
    inputs: [input]          # Takes pipeline input
    params:
      modality: smiles

  - id: find_analogs
    type: retrieve
    tool: qdrant             # Uses Qdrant vector search
    inputs: [encode_query]   # Takes encoded vector
    params:
      limit: 10
      modality: smiles
      collection: molecules

  - id: predict_binding
    type: predict
    tool: deeppurpose        # Uses DeepPurpose DTI predictor
    inputs: [find_analogs]   # Takes retrieved molecules
    params:
      # Binding target given as a raw amino-acid sequence (sample kinase fragment).
      target: "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

  - id: filter_hits
    type: filter
    tool: builtin            # Built-in filter
    inputs: [predict_binding]
    params:
      threshold: 0.7         # Keep only predictions whose `score` passes 0.7
      key: score

# Final output comes from this node
output_node: filter_hits

# Optional metadata
metadata:
  version: "1.0"
  author: "BioFlow Team"
  tags: [drug-discovery, dti, molecules]
bioflow/workflows/ingestion.py ADDED
@@ -0,0 +1,276 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Data Ingestion Utilities
3
+ =========================
4
+
5
+ Helpers for ingesting data from common biological sources:
6
+ - PubMed abstracts
7
+ - UniProt proteins
8
+ - ChEMBL molecules
9
+ - Custom CSV/JSON files
10
+ """
11
+
12
+ import logging
13
+ from typing import List, Dict, Any, Optional, Generator
14
+ from pathlib import Path
15
+ import json
16
+
17
+ from bioflow.core import Modality
18
+
19
+ logger = logging.getLogger(__name__)
20
+
21
+
22
def load_json_data(
    path: str,
    content_field: str = "content",
    modality_field: Optional[str] = None,
    limit: Optional[int] = None
) -> List[Dict[str, Any]]:
    """
    Load data from JSON file.

    Accepts either a top-level list, or a dict wrapping the list under a
    "data" or "items" key; a bare dict is treated as a single item.

    Args:
        path: Path to JSON file
        content_field: Field containing main content (kept for API
            compatibility; not used to transform items here)
        modality_field: Field indicating modality (optional; currently unused)
        limit: Maximum items to load (0 yields an empty list)

    Returns:
        List of data items
    """
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # Unwrap common container shapes: {"data": [...]} or {"items": [...]}.
    if isinstance(data, dict):
        data = data.get("data", data.get("items", [data]))

    # `is not None` (rather than truthiness) so an explicit limit=0 is honored.
    if limit is not None:
        data = data[:limit]

    logger.info(f"Loaded {len(data)} items from {path}")
    return data
51
+
52
+
53
def load_csv_data(
    path: str,
    content_field: str = "content",
    delimiter: str = ",",
    limit: Optional[int] = None
) -> List[Dict[str, Any]]:
    """
    Load data from CSV file.

    Args:
        path: Path to CSV file
        content_field: Column containing main content (kept for API
            compatibility; rows are returned whole, this column is not used)
        delimiter: CSV delimiter
        limit: Maximum items to load (0 yields an empty list)

    Returns:
        List of data items as dictionaries (one per row, keyed by header)
    """
    import csv

    data = []
    # newline='' is required by the csv module for correct newline handling.
    with open(path, 'r', encoding='utf-8', newline='') as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        for i, row in enumerate(reader):
            # `is not None` (rather than truthiness) so limit=0 is honored.
            if limit is not None and i >= limit:
                break
            data.append(dict(row))

    logger.info(f"Loaded {len(data)} items from {path}")
    return data
83
+
84
+
85
def parse_smiles_file(
    path: str,
    name_field: int = 1,
    smiles_field: int = 0,
    has_header: bool = True,
    limit: Optional[int] = None
) -> List[Dict[str, Any]]:
    """
    Parse SMILES file (commonly .smi format).

    Lines are whitespace-split; blank lines are skipped. A one-token line is
    treated as a bare SMILES string with an auto-generated name.

    Args:
        path: Path to SMILES file
        name_field: Column index for molecule name
        smiles_field: Column index for SMILES string
        has_header: Whether file has header row
        limit: Maximum items to load (0 yields an empty list)

    Returns:
        List of molecule dictionaries with "smiles", "name", "modality" keys
    """
    data = []
    with open(path, 'r') as f:
        for i, line in enumerate(f):
            if has_header and i == 0:
                continue
            # `is not None` so an explicit limit=0 loads nothing.
            if limit is not None and len(data) >= limit:
                break

            parts = line.strip().split()
            # Guard the SMILES index so a short row with a custom
            # smiles_field can't raise IndexError.
            if len(parts) >= 2 and len(parts) > smiles_field:
                data.append({
                    "smiles": parts[smiles_field],
                    # Synthesize a name when the name column is absent.
                    "name": parts[name_field] if len(parts) > name_field else f"mol_{i}",
                    "modality": "smiles"
                })
            elif len(parts) == 1:
                # Single-column file: the lone token is the SMILES.
                data.append({
                    "smiles": parts[0],
                    "name": f"mol_{i}",
                    "modality": "smiles"
                })

    logger.info(f"Loaded {len(data)} molecules from {path}")
    return data
129
+
130
+
131
def parse_fasta_file(
    path: str,
    limit: Optional[int] = None
) -> List[Dict[str, Any]]:
    """
    Parse FASTA file for protein sequences.

    Args:
        path: Path to FASTA file
        limit: Maximum sequences to load (0 yields an empty list)

    Returns:
        List of protein dictionaries with "sequence", "header",
        "uniprot_id" and "modality" keys
    """
    data = []
    current_header = None
    current_sequence = []

    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                # A new record begins: flush the previous one first.
                if current_header and current_sequence:
                    # Check the limit BEFORE appending so limit=0 loads
                    # nothing (for limit>0 the result is identical to
                    # append-then-break: exactly `limit` records).
                    if limit is not None and len(data) >= limit:
                        break
                    seq = ''.join(current_sequence)
                    data.append({
                        "sequence": seq,
                        "header": current_header,
                        "uniprot_id": _extract_uniprot_id(current_header),
                        "modality": "protein"
                    })

                current_header = line[1:]  # Remove leading '>'
                current_sequence = []
            else:
                current_sequence.append(line)

    # Don't forget the last sequence in the file.
    if current_header and current_sequence and (limit is None or len(data) < limit):
        seq = ''.join(current_sequence)
        data.append({
            "sequence": seq,
            "header": current_header,
            "uniprot_id": _extract_uniprot_id(current_header),
            "modality": "protein"
        })

    logger.info(f"Loaded {len(data)} proteins from {path}")
    return data
183
+
184
+
185
+ def _extract_uniprot_id(header: str) -> str:
186
+ """Extract UniProt ID from FASTA header."""
187
+ # Common formats: sp|P12345|NAME or tr|P12345|NAME
188
+ if '|' in header:
189
+ parts = header.split('|')
190
+ if len(parts) >= 2:
191
+ return parts[1]
192
+ # Just take first word
193
+ return header.split()[0]
194
+
195
+
196
def generate_sample_molecules() -> List[Dict[str, Any]]:
    """
    Generate sample molecule data for testing.

    Returns:
        List of common drug molecules
    """
    # (SMILES, name, identifier key, identifier value) for well-known compounds.
    entries = [
        ("CC(=O)Oc1ccccc1C(=O)O", "Aspirin", "drugbank_id", "DB00945"),
        ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "Caffeine", "pubchem_id", "2519"),
        ("CC(C)Cc1ccc(cc1)C(C)C(=O)O", "Ibuprofen", "drugbank_id", "DB01050"),
        ("CC(=O)Nc1ccc(cc1)O", "Acetaminophen", "drugbank_id", "DB00316"),
        ("CCO", "Ethanol", "pubchem_id", "702"),
        ("c1ccccc1", "Benzene", "pubchem_id", "241"),
        ("CC(C)NCC(O)c1ccc(O)c(O)c1", "Isoprenaline", "drugbank_id", "DB01064"),
        ("Clc1ccc2c(c1)C(=NCC2)c3ccccc3", "Diazepam", "drugbank_id", "DB00829"),
    ]
    return [
        {"smiles": smi, "name": name, id_key: id_val, "modality": "smiles"}
        for smi, name, id_key, id_val in entries
    ]
213
+
214
+
215
def generate_sample_proteins() -> List[Dict[str, Any]]:
    """
    Generate sample protein data for testing.

    Returns:
        List of common proteins
    """
    # (sequence, display name, UniProt accession); all entries are human.
    samples = [
        ("MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
         "Sample kinase fragment", "P00533"),
        ("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
         "Hemoglobin alpha", "P69905"),
        ("MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL",
         "EGFR fragment", "P00533"),
    ]
    return [
        {
            "sequence": seq,
            "name": name,
            "uniprot_id": accession,
            "species": "human",
            "modality": "protein",
        }
        for seq, name, accession in samples
    ]
245
+
246
+
247
def generate_sample_abstracts() -> List[Dict[str, Any]]:
    """
    Generate sample PubMed-style abstracts for testing.

    Returns:
        List of sample abstracts
    """
    # (pmid, title, year, abstract text) for a few illustrative papers.
    records = [
        ("12345678", "EGFR inhibitors in lung cancer", 2023,
         "EGFR mutations are common in non-small cell lung cancer and predict response to tyrosine kinase inhibitors. Gefitinib and erlotinib have shown significant efficacy in patients with EGFR-mutant tumors."),
        ("23456789", "Deep learning for DTI prediction", 2024,
         "Drug-target interaction prediction using deep learning has emerged as a powerful approach for drug discovery. Neural networks can learn complex patterns from molecular structures and protein sequences."),
        ("34567890", "Mechanism of aspirin", 2022,
         "Aspirin inhibits cyclooxygenase enzymes (COX-1 and COX-2), reducing prostaglandin synthesis. This mechanism underlies its anti-inflammatory and analgesic effects."),
    ]
    return [
        {
            "content": text,
            "pmid": pmid,
            "title": title,
            "year": year,
            "modality": "text",
        }
        for pmid, title, year, text in records
    ]
bioflow/workflows/literature_mining.yaml ADDED
@@ -0,0 +1,41 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
# =============================================================================
# BioFlow Workflow Example: Literature Mining Pipeline
# =============================================================================
# This workflow searches scientific literature for relevant papers
# based on a text, molecule, or protein query.
# =============================================================================

name: literature_mining
description: >
  Cross-modal literature search: encode any modality -> retrieve papers -> rank by relevance

# Nodes run in order; each consumes the outputs of the nodes named in
# `inputs`, and the special name `input` is the pipeline's own input.
nodes:
  - id: encode_query
    type: encode
    tool: obm
    inputs: [input]
    params:
      modality: text           # Can be: text, smiles, protein

  - id: search_papers
    type: retrieve
    tool: qdrant
    inputs: [encode_query]
    params:
      limit: 20
      collection: pubmed_abstracts
      modality: text

  - id: rank_results
    type: custom
    tool: mmr_rerank           # Maximum Marginal Relevance re-ranking
    inputs: [search_papers]
    params:
      diversity: 0.3           # Relevance/diversity trade-off for MMR
      top_k: 10

output_node: rank_results

metadata:
  version: "1.0"
  tags: [literature, search, text-mining]
checkpoints/.placeholder ADDED
File without changes