---
license: other
license_name: livingmodels-research
license_link: LICENSE
---

# **BOTANIC-0**

## *Biological Omics Transformer for Agricultural and Nutritional Trait Inference in Crops*

> **BOTANIC-0** is a family of plant genome foundation models from [**Living Models**](https://www.dotomics.bio/): encoder-only transformers pre-trained with **masked language modeling (MLM)** on nuclear DNA sequences from plant genome assemblies.

All details can be found in [our technical report](https://www.biorxiv.org/content/10.64898/2026.02.23.706817v1).

**If you use BOTANIC-0 in your research, please cite:**

```bibtex
@article{botanic0,
  author = {Ogier du Terrail, Jean and Marchand, Tanguy and Cabeli, Vincent and Khadir, Zhor and V{\'e}ran, Cyril and Strouk, L{\'e}onard},
  title = {BOTANIC-0: a series of foundation models for plant genomic data},
  elocation-id = {2026.02.23.706817},
  year = {2026},
  doi = {10.64898/2026.02.23.706817},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Genomic language models (gLMs) have emerged as a powerful paradigm for learning regulatory biology directly from DNA sequence. Here, we introduce Botanic0, a family of plant genomic foundation models spanning 100M to 1B parameters and pretrained on 43 phylogenetically diverse plant genomes. The Botanic0-S, Botanic0-M, and Botanic0-L models form the first generation of a long-term research initiative, dedicated to advancing crop improvement research, genotype-to-phenotype modeling, and sequence-based genome editing. The architecture, pre-training pipeline and pre-training dataset of Botanic0 follow the seminal work of [1]. Across a broad suite of genomic and genetic prediction tasks, including regulatory element annotation, gene expression inference, and variant effect prediction, Botanic0 models achieve performance competitive with state-of-the-art foundation models, both in zero-shot settings and after fine-tuning. Scaling analyses reveal consistent improvements in predictive power with increased model capacity, highlighting the benefits of large-model pretraining for plant genomics. This work establishes our ability to train foundation models at scale, and lays the foundation for the next generations of models to come. To support reproducible research and community benchmarking, we release all Botanic0 models at https://huggingface.co/living-models/models. Competing Interest Statement: The authors declare the existence of a financial competing interest. All authors are or were employed by Living Models during their time on the project.},
  URL = {https://www.biorxiv.org/content/early/2026/02/24/2026.02.23.706817},
  eprint = {https://www.biorxiv.org/content/early/2026/02/24/2026.02.23.706817.full.pdf},
  journal = {bioRxiv}
}
```

---

## ⚠️ Research use only

**These models are intended for research use only.** They must not be used in production, clinical, or diagnostic contexts, or for any purpose other than non-commercial research and experimentation. Living Models and the model providers disclaim any liability for use outside this scope.

---

## Model variants

**BOTANIC-0** is available in three sizes:

| Component | **Botanic0-S** | **Botanic0-M** | **Botanic0-L** |
|-----------|------------|------------|------------|
| Hidden size | 1500 | 1500 | 1500 |
| Num. layers | 4 | 10 | 40 |
| Attention heads | 20 | 20 | 20 |
| Intermediate (FFN) | 5120 | 5120 | 5120 |
| Max sequence length | 1026 | 1026 | 1026 |
| Vocabulary size | 4105 | 4105 | 4105 |

- **Tokenizer**: DNA tokenizer, 6-mer vocabulary (vocab size 4105). Special tokens: `<pad>`, `<mask>`, `<unk>`, `<cls>`. No BOS/EOS by default.
- **Pre-training**: MLM (15% mask probability). Trained on plant nuclear genome assemblies.

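
As a quick illustration of the 6-mer tokenization described above, the sketch below encodes a short sequence and prints the resulting tokens. It assumes the tokenizer follows the standard `transformers` API (`__call__`, `convert_ids_to_tokens`); the exact splitting scheme and special-token handling shown in the comment are assumptions, not guarantees.

```python
from transformers import AutoTokenizer

# Minimal sketch (assumes the standard transformers tokenizer API).
tokenizer = AutoTokenizer.from_pretrained("living-models/Botanic0-S", trust_remote_code=True)

sequence = "ACGTACGTACGTACGTAC"  # 18 bp
encoded = tokenizer(sequence)

print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Assumed output shape: a <cls> token followed by non-overlapping 6-mer tokens,
# e.g. ['<cls>', 'ACGTAC', 'GTACGT', 'ACGTAC'], with no BOS/EOS appended.
```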

---

## Intended use and limitations

- **Intended**: Non-commercial research on plant genomics with **BOTANIC-0** (e.g. representation learning, embeddings, variant effect scoring, downstream fine-tuning for classification).
- **Not intended**: Production systems, clinical or diagnostic use, or any use outside research. Not validated on non-plant or non-DNA data.

---

## Examples of use

⚠️⚠️⚠️⚠️ The "mps" backend of PyTorch has a bug that produces NaNs when padding tokens are used; we therefore do not support it at the moment. ⚠️⚠️⚠️⚠️

### 1. Generating embeddings

Use **BOTANIC-0** to get sequence-level or token-level embeddings (e.g. for similarity, retrieval, or feeding into a downstream classifier):

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "living-models/Botanic0-S"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# DNA sequence
sequence = "ACGTACGTNNACGT"
inputs = tokenizer(
    sequence,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
)

# Last-layer hidden state: (batch, seq_len, hidden_size)
with torch.inference_mode():
    outputs = model(
        **inputs,
        output_hidden_states=True,
        return_dict=True,
    )

# Token-level embeddings (last layer)
token_embeddings = outputs.hidden_states[-1]  # (1, L, 1500)

# Sequence-level embedding: mean over sequence length (excluding padding)
attention_mask = inputs["attention_mask"]
mask = attention_mask.unsqueeze(-1).float()
sequence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
# shape: (1, 1500)
```
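
As a small follow-up, mean-pooled embeddings like `sequence_embedding` above can be compared directly, e.g. with cosine similarity. The sketch below wraps the same pooling logic in a helper function `embed` (ours, not part of the model API) and compares two arbitrary sequences.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "living-models/Botanic0-S"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


def embed(seq: str) -> torch.Tensor:
    """Mean-pooled last-layer embedding for one sequence (illustrative helper)."""
    inputs = tokenizer(seq, return_tensors="pt", truncation=True, max_length=1024)
    with torch.inference_mode():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    hidden = out.hidden_states[-1]                        # (1, L, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)


# Cosine similarity between two (here arbitrary) sequence embeddings
sim = F.cosine_similarity(embed("ACGTACGTACGTACGTACGTACGT"), embed("ACGTACGTACGTACGTACGTACGA"))
print(sim.item())
```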

### 2. Log-likelihood ratio (LLR) for variant effect

Use **BOTANIC-0** to score a single-nucleotide variant with a **log-likelihood ratio (LLR)**: LLR = log P(alt) − log P(ref), computed from the model logits at the variant position. Positive LLR favours the alternate allele; negative LLR favours the reference.
For best results, the variant should be placed in the middle of the input sequence with sufficient flanking context on both sides, so the model can leverage surrounding genomic information when scoring the variant.

**K-mer masking** (recommended): mask the whole k-mer containing the variant and compare ref vs alt k-mer logits.

```python
from transformers import AutoModel, AutoTokenizer

model_name = "living-models/Botanic0-S"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# PHYB gene (AT2G18790) in Arabidopsis thaliana
# 500 bp around the ATG start codon: 5'UTR (pos 0-249) | CDS (pos 250+)
sequence = (
    "TTTTTTTTTGTTATCTCTCTCTATCTGAGAGGCACACATTTTGCTTCGTCTTCTTCAATTTATTTTATTGGTTTCTC"
    "CACTTATCTCCGATCTCAATTCTCCCCATTTTCTTCTTCCTCAAGTTCAAAATTCTTGAGAATTTAGCTCTACCAGA"
    "ATTCGTCTCCGATAACTAGTGGATGATGATTCACCCTAAATCCTTCCTTGTCTCAAGGTAATTCTGAGAAATTTCTC"
    "AAATTCAAAATCAAACGGCATGGTTTCCGGAGTCGGGGGTAGTGGCGGTGGCCGTGGCGGTGGCCGTGGCGGAGAA"
    "GAAGAACCGTCGTCAAGTCACACTCCTAATAACCGAAGAGGAGGAGAACAAGCTCAATCGTCGGGAACGAAATCTC"
    "TCAGACCAAGAAGCAACACTGAATCAATGAGCAAAGCAATTCAACAGTACACCGTCGACGCAAGACTCCACGCCGT"
    "TTTCGAACAATCCGGCGAATCAGGGAAATCATTCGACTACT"
)

# --- Deleterious variant: mutating the T of the ATG start codon ---
llr_atg = model.score_variant_zero_shot(
    sequence=sequence,
    tokenizer=tokenizer,
    variant_pos=251,   # 0-based position of 'T' in ATG
    variant_char="C",  # alternate allele
    masking="kmer",    # or "single_nt"
)
print(f"ATG start codon T->C: LLR = {llr_atg:.4f}")
# LLR < 0 (strongly negative → model prefers ref)

# --- Neutral variant: position in the 5'UTR ---
llr_utr = model.score_variant_zero_shot(
    sequence=sequence,
    tokenizer=tokenizer,
    variant_pos=155,   # position in 5'UTR
    variant_char="A",  # T -> A
    masking="kmer",
)
print(f"5'UTR T->A: LLR = {llr_utr:.4f}")
# LLR ≈ 0 (near zero → likely a neutral position)
```
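
For readers who want to see what the k-mer-masking LLR looks like in terms of raw MLM logits, here is a rough sketch of one way it could be computed by hand. It rests on several assumptions not stated above: that an MLM head can be loaded via `AutoModelForMaskedLM`, that tokens are non-overlapping uppercase 6-mers with a single `<cls>` prepended, and that the helper `manual_kmer_llr` is ours for illustration only. `score_variant_zero_shot` remains the supported interface and may differ in its details.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative sketch only. Assumptions: non-overlapping 6-mer tokens, uppercase
# 6-mer strings in the vocabulary, a single <cls> prepended, MLM head available.
model_name = "living-models/Botanic0-S"
mlm = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


def manual_kmer_llr(sequence: str, variant_pos: int, variant_char: str) -> float:
    kmer_idx = variant_pos // 6                       # which 6-mer contains the variant (assumption)
    start = kmer_idx * 6
    ref_kmer = sequence[start:start + 6]
    alt_kmer = ref_kmer[:variant_pos - start] + variant_char + ref_kmer[variant_pos - start + 1:]

    inputs = tokenizer(sequence, return_tensors="pt")
    token_idx = kmer_idx + 1                          # +1 for the prepended <cls> (assumption)
    inputs["input_ids"][0, token_idx] = tokenizer.mask_token_id

    with torch.inference_mode():
        logits = mlm(**inputs).logits                 # (1, L, vocab_size)
    log_probs = torch.log_softmax(logits[0, token_idx], dim=-1)

    # K-mers containing "N" may map to <unk>, in which case the score is meaningless.
    ref_id = tokenizer.convert_tokens_to_ids(ref_kmer)
    alt_id = tokenizer.convert_tokens_to_ids(alt_kmer)
    # LLR = log P(alt k-mer) - log P(ref k-mer) at the masked position
    return (log_probs[alt_id] - log_probs[ref_id]).item()
```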

### 3. Using Botanic0 embeddings to train XGBoost on a binary classification task

This code reproduces the XGBoost probing experiment from our paper. It downloads the dataset introduced in "PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms" (Zhai et al., 2025) for the binary classification task "PlantCAD-TIS", computes embeddings with Botanic0, and trains an XGBoost classifier on top of them.

```python
import numpy as np
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer
import xgboost as xgb

MODEL_NAME = "living-models/Botanic0-S"
USE_STRAND_INFORMATION = True
NUMBER_OF_SAMPLES = 1000  # put -1 to use all samples
BATCH_SIZE = 32  # can be lower depending on GPU memory size

PLANTCAD_REPO_ID = "kuleshov-group/cross-species-single-nucleotide-annotation"
SEQ_LENGTH_TIS = 512

# We use the embedding of the nucleotide located in the middle
# of the sequence to train the XGBoost model
MIDDLE_TOKEN_IDX = int(np.ceil(SEQ_LENGTH_TIS / 2 / 6))

# Because some samples have "N" in their sequence, we add a 20% margin
# in max_length. With such a value, almost all sequences fit within max_length.
MAX_LENGTH = int(np.ceil(SEQ_LENGTH_TIS / 6 * 1.2))

print(f"MIDDLE_TOKEN_IDX: {MIDDLE_TOKEN_IDX}")


def reverse_complement(sequence: str) -> str:
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[b] for b in sequence[::-1])


dataset = load_dataset(
    PLANTCAD_REPO_ID,
    data_files={
        "train": "TIS/train.tsv",
        "test": "TIS/valid.tsv",
    },
)
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()

if USE_STRAND_INFORMATION:
    for df in (train_df, test_df):
        neg = df["strand"] == "-"
        df.loc[neg, "sequences"] = df.loc[neg, "sequences"].apply(reverse_complement)

if NUMBER_OF_SAMPLES != -1:
    n_train = min(NUMBER_OF_SAMPLES, len(train_df))
    n_test = min(NUMBER_OF_SAMPLES, len(test_df))
    train_df, _ = train_test_split(
        train_df, train_size=n_train, stratify=train_df["label"], random_state=0
    )
    test_df, _ = train_test_split(
        test_df, train_size=n_test, stratify=test_df["label"], random_state=0
    )

print(f"Train size: {len(train_df)}, Test size: {len(test_df)}")

y_train = train_df["label"].values
y_test = test_df["label"].values
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = model.to(device)
model.eval()

train_ds = Dataset.from_pandas(train_df[["sequences", "label"]])
test_ds = Dataset.from_pandas(test_df[["sequences", "label"]])


def tokenize_fn(examples):
    return tokenizer(
        examples["sequences"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors=None,
    )


train_ds = train_ds.map(tokenize_fn, batched=True, desc="Tokenizing train")
test_ds = test_ds.map(tokenize_fn, batched=True, desc="Tokenizing test")


def get_middle_token_embeddings(tokenized_ds):
    all_embeddings = []
    n = len(tokenized_ds)
    for start in tqdm(range(0, n, BATCH_SIZE), desc="Forward pass", unit="batch"):
        end = min(start + BATCH_SIZE, n)
        batch = tokenized_ds[start:end]
        inputs = {
            k: torch.tensor(v).to(device)
            for k, v in batch.items()
            if k in ("input_ids", "attention_mask")
        }
        with torch.inference_mode():
            out = model(**inputs, output_hidden_states=True, return_dict=True)
        emb = out.hidden_states[-1][:, MIDDLE_TOKEN_IDX, :].cpu().numpy()
        all_embeddings.append(emb)
    return np.concatenate(all_embeddings, axis=0)


print("Computing embeddings...")
X_train = get_middle_token_embeddings(train_ds)
X_test = get_middle_token_embeddings(test_ds)

print("Training XGBoost (can take several minutes when using full dataset and n_estimators=1000)...")
clf = xgb.XGBClassifier(
    n_estimators=1000,
    max_depth=6,
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric="aucpr",
)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
aucpr = average_precision_score(y_test, proba)
auc = roc_auc_score(y_test, proba)
print(f"Test AUC-PR: {aucpr:.4f}")
# Test AUC-PR: 0.4886 with Botanic0-S, NUMBER_OF_SAMPLES = 1000 and USE_STRAND_INFORMATION = True
# Test AUC-PR: 0.7388 with Botanic0-S, NUMBER_OF_SAMPLES = -1 and USE_STRAND_INFORMATION = True
print(f"Test AUC: {auc:.4f}")
# Test AUC: 0.7915 with Botanic0-S, NUMBER_OF_SAMPLES = 1000 and USE_STRAND_INFORMATION = True
# Test AUC: 0.9170 with Botanic0-S, NUMBER_OF_SAMPLES = -1 and USE_STRAND_INFORMATION = True
```

---

## License

- **Model**: See the [LICENSE](LICENSE) for terms of use and citation.

---

*Model card maintained by Living Models. For questions or issues, please contact Living Models.*