---
language: []
tags:
- genomics
- dna
- tropical-fruits
- transformer
- masked-language-modeling
- foundation-model
license: apache-2.0
model-index:
- name: TropicBERT
  results: []
---

# TropicBERT: Tropical Fruit Genomics Foundation Models

---

## 🧬 1. Model Overview

**TropicBERT** is the first genomic foundation model series specifically designed for **tropical fruit crop genome sequences**. We pre-trained BERT models with the **MLM (Masked Language Modeling)** objective on **13 different species combinations**, releasing **13 pre-trained model variants** that cover 1, 5, or 10 species each.

The models are built with **TropicBERT-LLMs_One_stop_tutorial**, a reproducible pipeline that transfers the "pre-training and fine-tuning" paradigm from NLP to genomics and lowers the barrier to developing large language models for plant genomes.

🔗 **Related Resources**:

* **GitHub Full Tutorial**: [TropicBERT-LLMs_One_stop_tutorial](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial)
  * Includes data preprocessing, pre-training, and fine-tuning scripts
  * Provides sample datasets in FASTA/CSV format
* **Published Paper**: [ Paper ]
  * More detailed model description and training details

---

## 📖 2. Model Variants

We provide 13 model versions trained on different species combinations to meet various research needs:

| Model Name | Coverage | Species Count | Included Species (Scientific Name) |
| :--- | :--- | :---: | :--- |
| **base** | Comprehensive | 10 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava, Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035-039** | Group A | 5 | *Melastoma candidum, Punica granatum, Syzygium samarangense, Plinia cauliflora, Psidium guajava* |
| **tp040-044** | Group B | 5 | *Eugenia stipitata, Eugenia brasiliensis, Eugenia candolleana, Eugenia Observa, Eugenia aggregata* |
| **tp035** | Single Species | 1 | *Melastoma candidum* |
| **tp036** | Single Species | 1 | *Punica granatum* |
| **tp037** | Single Species | 1 | *Syzygium samarangense* |
| **tp038** | Single Species | 1 | *Plinia cauliflora* |
| **tp039** | Single Species | 1 | *Psidium guajava* |
| **tp040** | Single Species | 1 | *Eugenia stipitata* |
| **tp041** | Single Species | 1 | *Eugenia brasiliensis* |
| **tp042** | Single Species | 1 | *Eugenia candolleana* |
| **tp043** | Single Species | 1 | *Eugenia Observa* |
| **tp044** | Single Species | 1 | *Eugenia aggregata* |

---

## 🚀 3. Quick Usage

The following code demonstrates how to load the model and extract per-token embeddings for a DNA sequence.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and encoder from the Hugging Face Hub
model_name = "yang0104/TropicBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Replace with your own DNA sequence
seq = "ATGCGTACGTTAGCCTA..."
inputs = tokenizer(seq, return_tensors="pt")

# No gradients are needed when only extracting embeddings
with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (batch_size, num_tokens, hidden_size)
```

## 📞 4. Contact & Citation

If you encounter any issues or would like to discuss a collaboration, please contact us via:

- 🛠 **Submit Issue**: [GitHub Issues](https://github.com/yanglin789/TropicBERT-LLMs_One_stop_tutorial/issues)
- 💬 **Community Discussion**: [HuggingFace Discussions](https://huggingface.co/yang0104/TropicBERT/discussions)
- 📧 **Email**: 1264894293yl@gmail.com

**Citation:**

If you use TropicBERT in your research, please cite the following paper:

> paper
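
---

A practical note on the Quick Usage example in section 3: it passes one short sequence to the tokenizer, whereas genomic regions are often much longer than a BERT-style encoder accepts in a single pass. One common workaround is to split the sequence into overlapping windows and embed each window separately. The sketch below shows that idea in plain Python. The helper name `sliding_windows` and the `window`/`stride` values are hypothetical; this card does not state the tokenizer scheme or the maximum input length, so check the configuration of the checkpoint you load before choosing them.

```python
from typing import Iterator, Tuple


def sliding_windows(
    seq: str, window: int = 512, stride: int = 256
) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) pairs that cover `seq` with overlapping windows.

    `window` and `stride` are counted in nucleotides. The defaults are
    placeholders, not settings taken from the TropicBERT training setup.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not seq:
        return
    if len(seq) <= window:
        # A short sequence is returned whole as a single window.
        yield 0, seq
        return
    start = 0
    while start + window < len(seq):
        yield start, seq[start:start + window]
        start += stride
    # Anchor the last window to the end so the tail is always covered.
    yield len(seq) - window, seq[len(seq) - window:]


if __name__ == "__main__":
    demo = "ATGCGTACGTTAGCCTA" * 10  # toy 170-nucleotide sequence
    for start, chunk in sliding_windows(demo, window=64, stride=32):
        print(start, len(chunk))
```

Each yielded chunk can then be passed to `tokenizer(...)` and `model(...)` in the same way as `seq` in the Quick Usage example, with the start offset kept alongside the embeddings so results can be mapped back to genome coordinates.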