---
language: []
tags:
- genomics
- dna
- tropical-fruits
- transformer
- masked-language-modeling
- foundation-model
license: apache-2.0
model-index:
- name: TropicBERT
  results: []
---
# TropicBERT: Tropical Fruit Genomics Foundation Models

---


## 🧬 1. Model Overview

**TropicBERT** is the first genomic foundation model series specifically designed for **tropical fruit crop genome sequences**. We pre-trained BERT models with the **masked language modeling (MLM)** objective on **13 different species combinations** and release **13 pre-trained model variants**: one covering all 10 species, two covering 5 species each, and ten single-species models.
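
To make the MLM objective concrete, the sketch below applies the standard BERT masking recipe to a tokenized DNA sequence: roughly 15% of tokens are selected, and of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The overlapping 3-mer tokenization, the masking ratios, and the function names `kmer_tokenize` and `mask_tokens` are illustrative assumptions, not the confirmed TropicBERT settings; the tutorial repository is the place to check the actual configuration.

```python
import random

MASK_TOKEN = "[MASK]"

def kmer_tokenize(seq, k=3):
    """Split a DNA string into overlapping k-mers (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) following the standard BERT recipe.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on selected positions.
    """
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

tokens = kmer_tokenize("ATGCGTACGTTAGCCTAGGCTA")
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab)
print(masked)
print(labels)
```

During pre-training, the model sees `masked` as input and is trained to recover the original tokens at the positions where `labels` is not `None`.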

The models are built with **TropicBERT-LLMs_One_stop_tutorial**, a reproducible pipeline that transfers the "pre-training and fine-tuning" paradigm from NLP to genomics and lowers the barrier to developing large language models for plant genomes.

🔗 **Related Resources**:
*   **GitHub Full Tutorial**: [TropicBERT-LLMs_One_stop_tutorial](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial)
    *   Includes data preprocessing, pre-training, and fine-tuning scripts
    *   Provides sample datasets in FASTA/CSV format
*   **Published Paper**: [ Paper ] 
    *   More detailed model description and training details

---

## 📖 2. Model Variants

We provide 13 model variants, each trained on a different species combination, to meet various research needs:


| Model Name | Coverage | Species Count | Included Species (Scientific Name) |
| :--- | :--- | :---: | :--- |
| **base** | Comprehensive | 10 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava, Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035-039** | Group A | 5 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava* |
| **tp040-044** | Group B | 5 | *Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035** | Single Species | 1 | *Melastoma candidum* |
| **tp036** | Single Species | 1 | *Punica granatum* |
| **tp037** | Single Species | 1 | *Syzygium samarangense* |
| **tp038** | Single Species | 1 | *Plinia cauliflora* |
| **tp039** | Single Species | 1 | *Psidium guajava* |
| **tp040** | Single Species | 1 | *Eugenia stipitata* |
| **tp041** | Single Species | 1 | *Eugenia brasiliensis* |
| **tp042** | Single Species | 1 | *Eugenia candolleana* |
| **tp043** | Single Species | 1 | *Eugenia Observa* |
| **tp044** | Single Species | 1 | *Eugenia aggregata* |
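
When choosing a checkpoint, it can help to look up which variants were trained on a given species. The sketch below encodes the table above as a plain dictionary. The `VARIANTS` mapping and the `variants_covering` helper are hypothetical names used for illustration and are not part of the released code.

```python
GROUP_A = ["Melastoma candidum", "Punica granatum", "Syzygium samarangense",
           "Plinia cauliflora", "Psidium guajava"]
GROUP_B = ["Eugenia stipitata", "Eugenia brasiliensis", "Eugenia candolleana",
           "Eugenia Observa", "Eugenia aggregata"]

# Multi-species variants, as listed in the table.
VARIANTS = {
    "base": GROUP_A + GROUP_B,
    "tp035-039": GROUP_A,
    "tp040-044": GROUP_B,
}
# Single-species variants tp035..tp044 follow the table order.
for offset, species in enumerate(GROUP_A + GROUP_B):
    VARIANTS[f"tp{35 + offset:03d}"] = [species]

def variants_covering(species):
    """Return variant names trained on `species`, most specific first."""
    hits = [name for name, members in VARIANTS.items() if species in members]
    return sorted(hits, key=lambda name: len(VARIANTS[name]))

print(variants_covering("Psidium guajava"))
# → ['tp039', 'tp035-039', 'base']
```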

---

## 🚀 3. Quick Usage

The following code demonstrates how to load the model and extract DNA sequence embeddings.

```python
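from transformers import AutoTokenizer, AutoModel
import torch

model_name = "yang0104/TropicBERT"

# Load the tokenizer and the pre-trained encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout so embeddings are deterministic

# Replace with your own DNA sequence (the trailing "..." is a placeholder)
seq = "ATGCGTACGTTAGCCTA..."

inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Token-level embeddings with shape (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)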
```
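
`last_hidden_state` holds one vector per token, whereas many downstream tasks need a single vector per sequence. One common option is to average the token embeddings while ignoring padding positions via the attention mask. This is an assumption for illustration, not necessarily the pooling used by the TropicBERT authors. The `mean_pool` helper below is a hypothetical name, and the example uses dummy tensors so it runs without downloading the model.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 1s (tokens) and 0s (padding)
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Dummy batch: 2 sequences, 4 token positions, hidden size 3.
hidden = torch.ones(2, 4, 3)
hidden[1, 2:] = 100.0  # values at padded positions, which must be ignored
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # one vector per sequence: (2, 3)
```

With the quick-usage code above, the call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.
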

---

## 📞 4. Contact & Citation

If you encounter any issues or wish to discuss collaboration, please contact us via:

-   🛠 **Submit Issue**: [GitHub Issues](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial/issues)
-   💬 **Community Discussion**: [HuggingFace Discussions](https://huggingface.co/yang0104/TropicBERT/discussions)
-   📧 **Email**: 1264894293yl@gmail.com

**Citation:**
If you use TropicBERT in your research, please cite the following paper:
> paper