---
language: []
tags:
- genomics
- dna
- tropical-fruits
- transformer
- masked-language-modeling
- foundation-model
license: apache-2.0
model-index:
- name: TropicBERT
  results: []
---
# TropicBERT: Tropical Fruit Genomics Foundation Models
---
## 🧬 1. Model Overview
**TropicBERT** is the first series of genomic foundation models designed specifically for **tropical fruit crop genome sequences**. We pre-trained BERT models with the **masked language modeling (MLM)** objective on **13 different species combinations** and released **13 pre-trained model variants**, each covering 1, 5, or 10 species.
The models were built with **TropicBERT-LLMs_One_stop_tutorial**, a reproducible pipeline that transfers the "pre-training and fine-tuning" paradigm from NLP to genomics and lowers the barrier to developing large language models for plant genomes.
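
To make the MLM objective concrete, here is a minimal sketch of the token-corruption step from the original BERT recipe: select about 15% of positions, then replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged. The single-nucleotide vocabulary, the function name `mask_tokens`, and the default rates are assumptions for illustration only; the tokenizer and masking settings actually used for TropicBERT are documented in the tutorial repository, not here.

```python
import random

MASK_TOKEN = "[MASK]"
VOCAB = ["A", "C", "G", "T"]  # hypothetical single-nucleotide vocabulary


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) following the BERT masking recipe.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = list("ATGCGTACGTTAGCCTA")
corrupted, labels = mask_tokens(tokens, seed=0)
print(corrupted)
print(labels)
```

During pre-training, the model sees `corrupted` as input and the loss is computed only at positions where `labels` is not `None`.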
🔗 **Related Resources**:
* **GitHub Full Tutorial**: [TropicBERT-LLMs_One_stop_tutorial](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial)
  * Includes data preprocessing, pre-training, and fine-tuning scripts
  * Provides sample datasets in FASTA/CSV format
* **Published Paper**: [ Paper ]
  * More detailed model description and training details
---
## 📖 2. Model Variants
We provide 13 model versions trained on different species combinations to meet various research needs:
| Model Name | Coverage | Species Count | Included Species (Scientific Name) |
| :--- | :--- | :---: | :--- |
| **base** | Comprehensive | 10 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava, Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035-039** | Group A | 5 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava* |
| **tp040-044** | Group B | 5 | *Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035** | Single Species | 1 | *Melastoma candidum* |
| **tp036** | Single Species | 1 | *Punica granatum* |
| **tp037** | Single Species | 1 | *Syzygium samarangense* |
| **tp038** | Single Species | 1 | *Plinia cauliflora* |
| **tp039** | Single Species | 1 | *Psidium guajava* |
| **tp040** | Single Species | 1 | *Eugenia stipitata* |
| **tp041** | Single Species | 1 | *Eugenia brasiliensis* |
| **tp042** | Single Species | 1 | *Eugenia candolleana* |
| **tp043** | Single Species | 1 | *Eugenia Observa* |
| **tp044** | Single Species | 1 | *Eugenia aggregata* |
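
When choosing a variant programmatically, the table above can be encoded as a small lookup. The sketch below copies the variant names and species from the table; the helper `variants_for` is a hypothetical convenience, and how each variant name maps to files or revisions in the model repository is not specified here.

```python
# Species per variant, copied from the table above.
GROUP_A = [
    "Melastoma candidum", "Punica granatum", "Syzygium samarangense",
    "Plinia cauliflora", "Psidium guajava",
]
GROUP_B = [
    "Eugenia stipitata", "Eugenia brasiliensis", "Eugenia candolleana",
    "Eugenia Observa", "Eugenia aggregata",
]

VARIANTS = {
    "base": GROUP_A + GROUP_B,
    "tp035-039": GROUP_A,
    "tp040-044": GROUP_B,
}
# Single-species variants tp035..tp044 follow the table order.
for offset, species in enumerate(GROUP_A + GROUP_B):
    VARIANTS[f"tp{35 + offset:03d}"] = [species]


def variants_for(species):
    """Return the variants trained on `species`, most specific first."""
    hits = [name for name, members in VARIANTS.items() if species in members]
    return sorted(hits, key=lambda name: len(VARIANTS[name]))


print(variants_for("Psidium guajava"))  # ['tp039', 'tp035-039', 'base']
```

Sorting by training-set size puts the single-species model first, followed by the 5-species group and then the 10-species `base` model.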
---
## 🚀 3. Quick Usage
The following code demonstrates how to load the model and extract DNA sequence embeddings.
```python
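from transformers import AutoTokenizer, AutoModel
import torch

model_name = "yang0104/TropicBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout so embeddings are deterministic

# Placeholder sequence; replace it with your own DNA sequence
seq = "ATGCGTACGTTAGCCTA..."
inputs = tokenizer(seq, return_tensors="pt")

# Gradients are not needed for feature extraction
with torch.no_grad():
    outputs = model(**inputs)

# Per-token embeddings with shape (batch, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)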
```
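
The snippet above yields one vector per token. If a single vector per sequence is needed, one common option is to average the token embeddings while ignoring padding positions. The sketch below is an assumption about downstream use, not a method prescribed by this model card; `mean_pool` is a hypothetical helper, and random stand-in tensors are used so it runs without downloading a model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 0/1 values
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in tensors: 2 sequences, 6 tokens each, hidden size 8.
hidden = torch.randn(2, 6, 8)
mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

With the variables from the Quick Usage snippet, the call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.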
## 📞 4. Contact & Citation
If you encounter any issues or wish to discuss collaboration, please contact us via:
- 🛠 **Submit Issue**: [GitHub Issues](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial/issues)
- 💬 **Community Discussion**: [HuggingFace Discussions](https://huggingface.co/yang0104/TropicBERT/discussions)
- 📧 **Email**: 1264894293yl@gmail.com
**Citation:**
If you use TropicBERT in your research, please cite the following paper:
> paper