---
language: []
tags:
- genomics
- dna
- tropical-fruits
- transformer
- masked-language-modeling
- foundation-model
license: apache-2.0
model-index:
- name: TropicBERT
  results: []
---
# TropicBERT: Tropical Fruit Genomics Foundation Models

---

## 1. Model Overview

**TropicBERT** is the first genomic foundation model series designed specifically for **tropical fruit crop genome sequences**. We pre-trained BERT models with the **masked language modeling (MLM)** objective on **13 different species combinations**, yielding **13 pre-trained model variants** that each cover 1, 5, or 10 species.

The models were built with **TropicBERT-LLMs_One_stop_tutorial**, a reproducible pipeline that transfers the "pre-train, then fine-tune" paradigm from NLP to genomics and lowers the barrier to developing large language models for plant genomes.
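
To make the MLM objective mentioned above concrete, here is a minimal sketch of BERT-style token masking on a DNA sequence. It is an illustration under stated assumptions, not the TropicBERT training code: the single-nucleotide tokenization, the `VOCAB` list, the 15% masking rate, and the 80/10/10 replacement split follow the original BERT recipe and are not confirmed for these models. The function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"
VOCAB = ["A", "C", "G", "T"]  # assumption: toy single-nucleotide vocabulary


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) using BERT-style MLM corruption.

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


tokens = list("ATGCGTACGTTAGCCTA")
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(masked)
print(labels)
```

During pre-training, the model sees `masked` as input and is trained to recover the original token wherever `labels` is not `None`.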

**Related Resources**:
* **GitHub Full Tutorial**: [TropicBERT-LLMs_One_stop_tutorial](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial)
  * Includes data preprocessing, pre-training, and fine-tuning scripts
  * Provides sample datasets in FASTA/CSV format
* **Published Paper**: [ Paper ]
  * Provides a more detailed model description and training details

---

## 2. Model Variants

We provide 13 model versions trained on different species combinations to suit different research needs:

| Model Name | Coverage | Species Count | Included Species (Scientific Name) |
| :--- | :--- | :---: | :--- |
| **base** | Comprehensive | 10 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava, Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035-039** | Group A | 5 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava* |
| **tp040-044** | Group B | 5 | *Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035** | Single Species | 1 | *Melastoma candidum* |
| **tp036** | Single Species | 1 | *Punica granatum* |
| **tp037** | Single Species | 1 | *Syzygium samarangense* |
| **tp038** | Single Species | 1 | *Plinia cauliflora* |
| **tp039** | Single Species | 1 | *Psidium guajava* |
| **tp040** | Single Species | 1 | *Eugenia stipitata* |
| **tp041** | Single Species | 1 | *Eugenia brasiliensis* |
| **tp042** | Single Species | 1 | *Eugenia candolleana* |
| **tp043** | Single Species | 1 | *Eugenia Observa* |
| **tp044** | Single Species | 1 | *Eugenia aggregata* |
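
When choosing among the variants, it can help to look up which ones were trained on a given species. The sketch below transcribes the table above into a dictionary and returns the matching variant names, most specific first. The helper `variants_for` is hypothetical, and the sketch deliberately says nothing about how each variant is stored or loaded from the Hub, since that is not specified here.

```python
# Species lists transcribed from the variants table (names kept verbatim)
GROUP_A = [
    "Melastoma candidum", "Punica granatum", "Syzygium samarangense",
    "Plinia cauliflora", "Psidium guajava",
]
GROUP_B = [
    "Eugenia stipitata", "Eugenia brasiliensis", "Eugenia candolleana",
    "Eugenia Observa", "Eugenia aggregata",
]

VARIANTS = {
    "base": GROUP_A + GROUP_B,
    "tp035-039": GROUP_A,
    "tp040-044": GROUP_B,
}
# Single-species variants tp035..tp044 follow the table order
for offset, species in enumerate(GROUP_A + GROUP_B):
    VARIANTS[f"tp{35 + offset:03d}"] = [species]


def variants_for(species):
    """Return variant names trained on `species`, smallest training set first."""
    hits = [name for name, members in VARIANTS.items() if species in members]
    return sorted(hits, key=lambda name: len(VARIANTS[name]))


print(variants_for("Psidium guajava"))  # ['tp039', 'tp035-039', 'base']
```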

---

## 3. Quick Usage

The following code shows how to load the model and extract DNA sequence embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "yang0104/TropicBERT"

# Load the tokenizer and the pre-trained encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout so the embeddings are deterministic

# Placeholder: replace with your own DNA sequence
seq = "ATGCGTACGTTAGCCTA..."

inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Token-level embeddings with shape (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```
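
The snippet above yields one vector per token. If a single vector per sequence is needed, one common option is to mean-pool the token embeddings while ignoring padding. The sketch below illustrates that option; it is not a pooling strategy prescribed for TropicBERT. The helper name `mean_pool` is hypothetical, and the dummy tensors stand in for `outputs.last_hidden_state` and `inputs["attention_mask"]`.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Dummy tensors: two real tokens followed by one padding position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)  # tensor([[2., 3.]])
```

With the real model, the equivalent call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, which returns a tensor of shape `(batch_size, hidden_size)`.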

## 4. Contact & Citation

If you run into any issues or would like to discuss a collaboration, please contact us via:

- **Submit Issue**: [GitHub Issues](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial/issues)
- **Community Discussion**: [HuggingFace Discussions](https://huggingface.co/yang0104/TropicBERT/discussions)
- **Email**: 1264894293yl@gmail.com

**Citation:**
If you use TropicBERT in your research, please cite the following paper:
> paper