---
language: []
tags:
- genomics
- dna
- tropical-fruits
- transformer
- masked-language-modeling
- foundation-model
license: apache-2.0
model-index:
- name: TropicBERT
  results: []
---
# TropicBERT: Tropical Fruit Genomics Foundation Models
---
## 🧬 1. Model Overview
**TropicBERT** is the first genomic foundation model series specifically designed for **tropical fruit crop genome sequences**. We pre-trained BERT models with the **MLM (Masked Language Modeling)** objective on **13 different species combinations** and release one model per combination, for a total of **13 pre-trained model variants**: ten single-species models, two 5-species models, and one 10-species model.
The models were built with **TropicBERT-LLMs_One_stop_tutorial**, a reproducible pipeline that carries the "pre-training and fine-tuning" paradigm from NLP over to genomics and lowers the barrier to developing large language models for plant genomes.
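To make the MLM objective concrete, here is a minimal sketch of how input tokens can be corrupted before the model is asked to recover them. It assumes the masking recipe from the original BERT paper (select 15% of positions; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged) and a toy single-nucleotide vocabulary. The tokenizer and masking rate actually used for TropicBERT are not stated in this card, and `mask_tokens` is a hypothetical helper, not part of the tutorial code.

```python
import random

MASK = "[MASK]"
VOCAB = ["A", "C", "G", "T"]  # toy single-nucleotide vocabulary (assumption)


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) using BERT-style masking.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss would only be computed where labels[i] is set.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


tokens = list("ATGCGTACGTTAGCCTA")
corrupted, labels = mask_tokens(tokens, seed=0)
print(corrupted)
print(labels)
```

During pre-training the model sees `corrupted` and is scored only on the positions where `labels` is not `None`.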
🔗 **Related Resources**:
* **GitHub Full Tutorial**: [TropicBERT-LLMs_One_stop_tutorial](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial)
    * Includes data preprocessing, pre-training, and fine-tuning scripts
    * Provides sample datasets in FASTA/CSV format
* **Published Paper**: [ Paper ]
    * More detailed model description and training details
---
## 📖 2. Model Variants
We provide 13 model versions trained on different species combinations to meet various research needs:
| Model Name | Coverage | Species Count | Included Species (Scientific Name) |
| :--- | :--- | :---: | :--- |
| **base** | Comprehensive | 10 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava, Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035-039** | Group A | 5 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava* |
| **tp040-044** | Group B | 5 | *Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035** | Single Species | 1 | *Melastoma candidum* |
| **tp036** | Single Species | 1 | *Punica granatum* |
| **tp037** | Single Species | 1 | *Syzygium samarangense* |
| **tp038** | Single Species | 1 | *Plinia cauliflora* |
| **tp039** | Single Species | 1 | *Psidium guajava* |
| **tp040** | Single Species | 1 | *Eugenia stipitata* |
| **tp041** | Single Species | 1 | *Eugenia brasiliensis* |
| **tp042** | Single Species | 1 | *Eugenia candolleana* |
| **tp043** | Single Species | 1 | *Eugenia Observa* |
| **tp044** | Single Species | 1 | *Eugenia aggregata* |
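When choosing a variant for a given species, the table above can be encoded as a small lookup. The sketch below copies the variant names and species lists from the table; `variants_covering` is a hypothetical helper introduced here for illustration. How each variant is stored in the repository (for example, as a subfolder or a branch) is not stated in this card, so the helper only returns names.

```python
# Species lists copied from the variant table above.
GROUP_A = [
    "Melastoma candidum", "Punica granatum", "Syzygium samarangense",
    "Plinia cauliflora", "Psidium guajava",
]
GROUP_B = [
    "Eugenia stipitata", "Eugenia brasiliensis", "Eugenia candolleana",
    "Eugenia Observa", "Eugenia aggregata",
]

VARIANTS = {
    "base": GROUP_A + GROUP_B,
    "tp035-039": GROUP_A,
    "tp040-044": GROUP_B,
}
# Single-species variants tp035 ... tp044 follow the table order.
for offset, species in enumerate(GROUP_A + GROUP_B):
    VARIANTS[f"tp{35 + offset:03d}"] = [species]


def variants_covering(species):
    """Return the variants trained on `species`, most specific first."""
    hits = [name for name, members in VARIANTS.items() if species in members]
    return sorted(hits, key=lambda name: len(VARIANTS[name]))


print(variants_covering("Psidium guajava"))
# ['tp039', 'tp035-039', 'base']
```

Ordering from most to least specific is only one possible convention; which variant works best for a given task is an empirical question this card does not answer.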
---
## 🚀 3. Quick Usage
The following code demonstrates how to load the model and extract DNA sequence embeddings.
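```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and encoder from the Hugging Face Hub.
model_name = "yang0104/TropicBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Placeholder DNA sequence; replace it with your own.
seq = "ATGCGTACGTTAGCCTA..."
inputs = tokenizer(seq, return_tensors="pt")

# Gradients are not needed for embedding extraction.
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: (batch_size, sequence_length, hidden_size).
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```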
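The snippet above yields one vector per token. If a single vector per sequence is needed, a common option is to average the token vectors while skipping padding. The sketch below spells out that arithmetic with plain Python lists; `mean_pool` is a hypothetical helper, and mean pooling is one common choice rather than a method this card prescribes.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: 3 tokens with 2-dimensional vectors; the last token is padding.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
pooled = mean_pool(toy_vectors, toy_mask)
print(pooled)  # [2.0, 3.0]
```

With the variables from the snippet above, and assuming the tokenizer returns an `attention_mask` as BERT tokenizers normally do, the call would be `mean_pool(embeddings[0].tolist(), inputs["attention_mask"][0].tolist())`. Any special tokens the tokenizer adds are included in the average.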
## 📞 4. Contact & Citation
If you encounter any issues or wish to discuss collaboration, please contact us via:
- 🛠 **Submit Issue**: [GitHub Issues](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial/issues)
- 💬 **Community Discussion**: [HuggingFace Discussions](https://huggingface.co/yang0104/TropicBERT/discussions)
- 📧 **Email**: 1264894293yl@gmail.com
**Citation:**
If you use TropicBERT in your research, please cite the following paper:
> paper