---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- scientific-language-model
- protein
- molecule
- drug-discovery
- materials-science
- retrosynthesis
- antibody
- autoregressive
- generative
- one-model-fits-all
size_categories:
- 1B-10B
---

# LOGOS: Language of Generative Objects in Science

<p align="center">
<img src="pics/logos.png" alt="LOGOS" width="180">
</p>

<p align="center">
<a href="https://arxiv.org/pdf/2606.16905" target="_blank"><img src="https://img.shields.io/badge/Technical%20Report-b5212f.svg?logo=arxiv" height="21px"></a>
<a href="https://github.com/LOGOS-Hub/LOGOS"><img src="https://img.shields.io/badge/GitHub-LOGOS-181717?logo=github&logoColor=white" height="21px"></a>
</p>

<p align="center">
<a href="https://huggingface.co/LOGOS-Hub/LOGOS-8B"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Model-LOGOS--8B-yellow" height="21px"></a>
<a href="https://huggingface.co/LOGOS-Hub/LOGOS-pretrain-1B"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Model-LOGOS--pretrain--1B-yellow" height="21px"></a>
<a href="https://huggingface.co/LOGOS-Hub/LOGOS-pretrain-3B"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Model-LOGOS--pretrain--3B-yellow" height="21px"></a>
<a href="https://huggingface.co/LOGOS-Hub/LOGOS-pretrain-8B"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Model-LOGOS--pretrain--8B-yellow" height="21px"></a>
</p>

## Overview

**LOGOS** (**L**anguage **O**f **G**enerative **O**bjects in **S**cience) is the first multi-domain generative framework built on a unified *scientific grammar*. It encodes diverse scientific objects (proteins, antibodies, small molecules, chemical reactions, materials, and their spatial interactions) as token sequences over a shared vocabulary, so a single autoregressive model can perform generation, prediction, and design across the natural sciences.

Unlike approaches that use natural language as an intermediary or require explicit 3D geometric networks, LOGOS operates directly on domain-native representations. Key spatial relationships (e.g., protein pocket–ligand contacts) are discretized and tokenized into the shared grammar, so the model learns complex structural interactions purely from sequences.

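To make the idea of discretizing spatial relationships concrete, here is a minimal sketch of how pocket-ligand distances could be binned into discrete tokens. The bin edges, the token names (`<res…>`, `<atom…>`, `<d…>`), and the helper functions are all assumptions made for illustration; they are not the actual LOGOS grammar or tokenizer.

```python
from bisect import bisect_right

# Hypothetical distance bin edges in angstroms (illustrative only;
# the binning actually used by LOGOS is not described in this card).
BIN_EDGES = (4.0, 6.0, 8.0)


def distance_to_token(distance: float) -> str:
    """Map a non-negative distance to one of the bin tokens <d0> ... <d3>."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right returns the index of the first edge strictly greater
    # than the distance, which serves as the bin index.
    return f"<d{bisect_right(BIN_EDGES, distance)}>"


def encode_contacts(contacts):
    """Flatten (residue_index, atom_index, distance) triples into a token list."""
    tokens = []
    for residue_index, atom_index, distance in contacts:
        tokens += [
            f"<res{residue_index}>",    # hypothetical residue token
            f"<atom{atom_index}>",      # hypothetical ligand-atom token
            distance_to_token(distance),
        ]
    return tokens
```

For example, `encode_contacts([(12, 3, 3.5), (45, 7, 7.2)])` yields `['<res12>', '<atom3>', '<d0>', '<res45>', '<atom7>', '<d2>']`: continuous geometry becomes a flat sequence that an autoregressive model can consume without explicit coordinates.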
<p align="center">
<img src="pics/LOGOS-mainfigure.png" alt="LOGOS Framework Overview" width="100%">
</p>

### Key Features

* **Unified Scientific Grammar**: A shared representational interface that encodes heterogeneous scientific objects and cross-object relationships into a common discrete token space.
* **One Model Fits All**: A single autoregressive model handles tasks across proteins, small molecules, materials, reactions, antibodies, and their interactions.
* **No Explicit 3D Geometry Required**: Spatial contact and constraint patterns are captured through tokenized representations, without relying on geometric neural networks or explicit coordinates.
* **Pre-training & Downstream Alignment**: Because pre-training data and downstream tasks are expressed in the same grammar, continued pre-training objectives and downstream task goals remain formally consistent.

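As a rough illustration of what a shared discrete token space means, the sketch below builds one vocabulary from several domains and wraps each sequence in domain tags. The tag names, the character-level tokenization, and both helper functions are hypothetical simplifications (character-level splitting would, for instance, break multi-character SMILES atoms such as `Cl`); the real LOGOS vocabulary is defined by the released tokenizer.

```python
def build_shared_vocab(corpora: dict[str, list[str]]) -> dict[str, int]:
    """Assign integer ids to domain tags and characters across all domains."""
    vocab: dict[str, int] = {}
    for domain in sorted(corpora):
        # Hypothetical opening/closing tags that mark the object type.
        for tag in (f"<{domain}>", f"</{domain}>"):
            vocab.setdefault(tag, len(vocab))
        # Character-level tokens; symbols seen in several domains share one id.
        for sequence in corpora[domain]:
            for symbol in sequence:
                vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(domain: str, sequence: str, vocab: dict[str, int]) -> list[int]:
    """Wrap a sequence in its domain tags and map every token to its id."""
    tokens = [f"<{domain}>", *sequence, f"</{domain}>"]
    return [vocab[token] for token in tokens]


corpora = {"protein": ["MKT"], "smiles": ["CCO"]}
vocab = build_shared_vocab(corpora)
print(encode("protein", "MKT", vocab))  # [0, 2, 3, 4, 1]
print(encode("smiles", "CCO", vocab))   # [5, 7, 7, 8, 6]
```

Because both objects are mapped into the same id space, one model with one embedding table can be trained on either, which is the property the first two features describe.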
<p align="center">
<img src="pics/logos-data-process.png" alt="Data Construction in LOGOS" width="100%">
</p>

## Supported Tasks

LOGOS achieves competitive or state-of-the-art performance on six representative downstream tasks:

| Task | Domain | Description |
| ---- | ------ | ----------- |
| Interaction-Aware Ligand Design for Binding Pockets | Drug Discovery | Generate ligands that bind specifically to a given protein binding pocket |
| Protein Ligand-Binding Site Identification | Structural Biology | Identify binding pockets from protein sequences |
| Retrosynthesis Prediction | Chemistry | Predict reactants given a target product |
| Unconditional Material Generation | Materials Science | Generate novel and valid materials |
| Protein Editing | Protein Engineering | Edit protein sequences for improved functional properties |
| Antibody CDR Design | Immunology | Design complementarity-determining regions for antibody engineering |

<p align="center">
<img src="pics/bench_comparison.png" alt="Benchmark Comparison" width="100%">
</p>

## Model Architecture

LOGOS is an autoregressive Transformer with continued multi-domain pre-training on the unified scientific grammar. Models range from **1B to 8B** parameters, with stable scaling behavior observed across this range.

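When choosing among the released sizes, a small helper like the one below may be convenient. It is only a sketch: the repository IDs are the pre-trained checkpoints linked at the top of this card, while `pick_checkpoint` and `approx_weight_memory_gb` are hypothetical helpers, and the nominal sizes (1B, 3B, 8B) are treated as approximate parameter counts.

```python
# Pre-trained checkpoints linked from this card, keyed by nominal size
# in billions of parameters (exact parameter counts may differ).
PRETRAIN_CHECKPOINTS = {
    1: "LOGOS-Hub/LOGOS-pretrain-1B",
    3: "LOGOS-Hub/LOGOS-pretrain-3B",
    8: "LOGOS-Hub/LOGOS-pretrain-8B",
}


def pick_checkpoint(max_params_billion: float) -> str:
    """Return the largest checkpoint whose nominal size fits the budget."""
    fitting = [size for size in PRETRAIN_CHECKPOINTS if size <= max_params_billion]
    if not fitting:
        raise ValueError("no checkpoint fits the given parameter budget")
    return PRETRAIN_CHECKPOINTS[max(fitting)]


def approx_weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough memory for the weights alone (2 bytes per parameter for fp16/bf16)."""
    return params_billion * bytes_per_param


print(pick_checkpoint(4))          # LOGOS-Hub/LOGOS-pretrain-3B
print(approx_weight_memory_gb(8))  # 16
```

The memory estimate covers weights only; activations and the key-value cache during generation add to it.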
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 8B model and its tokenizer from the Hugging Face Hub.
model_id = "LOGOS-Hub/LOGOS-8B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder: replace with an input written in the LOGOS scientific grammar.
input_text = "<your_scientific_grammar_input>"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate up to 512 new tokens, then decode the full sequence (prompt included).
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

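Design tasks often call for several candidates rather than a single greedy output. The sketch below wraps the standard `generate` sampling arguments from `transformers` (`do_sample`, `top_p`, `temperature`, `num_return_sequences`) in a helper; the function name `sample_candidates` and the default sampling values are assumptions, not settings recommended by the LOGOS authors.

```python
def sample_candidates(model, tokenizer, prompt, num_candidates=4, max_new_tokens=256):
    """Sample several completions for one prompt and drop exact duplicates."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                       # stochastic sampling instead of greedy decoding
        top_p=0.95,                           # nucleus sampling threshold (illustrative value)
        temperature=1.0,                      # illustrative value
        num_return_sequences=num_candidates,  # candidates drawn for the same prompt
        max_new_tokens=max_new_tokens,
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # dict.fromkeys keeps first occurrences in order, removing repeated samples.
    return list(dict.fromkeys(decoded))
```

With the `model` and `tokenizer` loaded as above, `sample_candidates(model, tokenizer, input_text)` returns a list of distinct decoded sequences that can then be filtered with domain-specific validity checks.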
## Citation

If you find this work useful in your research or applications, please cite our technical report.

```bibtex
@misc{li2026speakinglanguagesciencegeneralpurpose,
  title={Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences},
  author={Mingyang Li and Yurou Liu and Jieping Ye and Bing Su and Ji-Rong Wen and Zheng Wang},
  year={2026},
  eprint={2606.16905},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.16905},
}
```

## License

This project is released under **[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode)**.

We welcome collaboration, feedback, and community contributions to advance unified generative modeling for the natural sciences.