---
license: other
license_name: livingmodels-research
license_link: LICENSE
---

# **BOTANIC-0**

## *Biological Omics Transformer for Agricultural and Nutritional Trait Inference in Crops*

> **BOTANIC-0** is a family of plant genome foundation models from [**Living Models**](https://www.dotomics.bio/): encoder-only transformers pre-trained with **masked language modeling (MLM)** on nuclear DNA sequences from plant genome assemblies.

All details can be found in [our technical report](https://www.biorxiv.org/content/10.64898/2026.02.23.706817v1).

**If you use BOTANIC-0 in your research, please cite:**

```bibtex
@article{botanic0,
  author = {Ogier du Terrail, Jean and Marchand, Tanguy and Cabeli, Vincent and Khadir, Zhor and V{\'e}ran, Cyril and Strouk, L{\'e}onard},
  title = {BOTANIC-0: a series of foundation models for plant genomic data},
  elocation-id = {2026.02.23.706817},
  year = {2026},
  doi = {10.64898/2026.02.23.706817},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Genomic language models (gLMs) have emerged as a powerful paradigm for learning regulatory biology directly from DNA sequence. Here, we introduce Botanic0, a family of plant genomic foundation models spanning 100M to 1B parameters and pretrained on 43 phylogenetically diverse plant genomes. The Botanic0-S, Botanic0-M, and Botanic0-L models form the first generation of a long-term research initiative, dedicated to advancing crop improvement research, genotype-to-phenotype modeling, and sequence-based genome editing. The architecture, pre-training pipeline and pre-training dataset of Botanic0 follow the seminal work of [1]. Across a broad suite of genomic and genetic prediction tasks, including regulatory element annotation, gene expression inference, and variant effect prediction, Botanic0 models achieve performance competitive with state-of-the-art foundation models, both in zero-shot settings and after fine-tuning. Scaling analyses reveal consistent improvements in predictive power with increased model capacity, highlighting the benefits of large-model pretraining for plant genomics. This work establishes our ability to train foundation models at scale, and lays the foundation for the next generations of models to come. To support reproducible research and community benchmarking, we release all Botanic0 models at https://huggingface.co/living-models/models. Competing Interest Statement: The authors declare the existence of a financial competing interest. All authors are or were employed by Living Models during their time on the project.},
  URL = {https://www.biorxiv.org/content/early/2026/02/24/2026.02.23.706817},
  eprint = {https://www.biorxiv.org/content/early/2026/02/24/2026.02.23.706817.full.pdf},
  journal = {bioRxiv}
}
```

---

## ⚠️ Research use only

**These models are intended for research use only.** They must not be used in production, clinical, or diagnostic contexts, or for any purpose other than non-commercial research and experimentation. Living Models and the model providers disclaim any liability for use outside this scope.

---

## Model variants

**BOTANIC-0** is available in three sizes:

| Component | **Botanic0-S** | **Botanic0-M** | **Botanic0-L** |
|-----------|------------|------------|------------|
| Hidden size | 1500 | 1500 | 1500 |
| Num. layers | 4 | 10 | 40 |
| Attention heads | 20 | 20 | 20 |
| Intermediate (FFN) | 5120 | 5120 | 5120 |
| Max sequence length | 1026 | 1026 | 1026 |
| Vocabulary size | 4105 | 4105 | 4105 |

- **Tokenizer**: DNA tokenizer, 6-mer vocabulary (vocab size 4105). Special tokens: `<pad>`, `<mask>`, `<unk>`, `<cls>`. No BOS/EOS by default.
- **Pre-training**: MLM (15% mask probability). Trained on plant nuclear genome assemblies.

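
As a quick illustration of the 6-mer tokenization described above, the sketch below encodes a short sequence and prints the resulting tokens. It assumes the tokenizer follows the standard `transformers` API (`__call__`, `convert_ids_to_tokens`); the exact splitting scheme and special-token handling shown in the comment are assumptions, not guarantees.

```python
from transformers import AutoTokenizer

# Minimal sketch (assumes the standard transformers tokenizer API).
tokenizer = AutoTokenizer.from_pretrained("living-models/Botanic0-S", trust_remote_code=True)

sequence = "ACGTACGTACGTACGTAC"  # 18 bp
encoded = tokenizer(sequence)

print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Assumed output shape: a <cls> token followed by non-overlapping 6-mer tokens,
# e.g. ['<cls>', 'ACGTAC', 'GTACGT', 'ACGTAC'], with no BOS/EOS appended.
```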

---

## Intended use and limitations

- **Intended**: Non-commercial research on plant genomics with **BOTANIC-0** (e.g. representation learning, embeddings, variant effect scoring, downstream fine-tuning for classification).
- **Not intended**: Production systems, clinical or diagnostic use, or any use outside research. Not validated on non-plant or non-DNA data.

---

## Examples of use

⚠️⚠️⚠️⚠️ The "mps" backend of PyTorch has a bug that produces NaNs when padding tokens are used; we therefore do not support it at the moment. ⚠️⚠️⚠️⚠️

### 1. Generating embeddings

Use **BOTANIC-0** to get sequence-level or token-level embeddings (e.g. for similarity, retrieval, or feeding into a downstream classifier):

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "living-models/Botanic0-S"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# DNA sequence
sequence = "ACGTACGTNNACGT"
inputs = tokenizer(
    sequence,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=1024,
)

# Last-layer hidden state: (batch, seq_len, hidden_size)
with torch.inference_mode():
    outputs = model(
        **inputs,
        output_hidden_states=True,
        return_dict=True,
    )

# Token-level embeddings (last layer)
token_embeddings = outputs.hidden_states[-1]  # (1, L, 1500)

# Sequence-level embedding: mean over sequence length (excluding padding)
attention_mask = inputs["attention_mask"]
mask = attention_mask.unsqueeze(-1).float()
sequence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
# shape: (1, 1500)
```
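
As a small follow-up, mean-pooled embeddings like `sequence_embedding` above can be compared directly, e.g. with cosine similarity. The sketch below wraps the same pooling logic in a helper function `embed` (ours, not part of the model API) and compares two arbitrary sequences.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "living-models/Botanic0-S"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


def embed(seq: str) -> torch.Tensor:
    """Mean-pooled last-layer embedding for one sequence (illustrative helper)."""
    inputs = tokenizer(seq, return_tensors="pt", truncation=True, max_length=1024)
    with torch.inference_mode():
        out = model(**inputs, output_hidden_states=True, return_dict=True)
    hidden = out.hidden_states[-1]                        # (1, L, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)


# Cosine similarity between two (here arbitrary) sequence embeddings
sim = F.cosine_similarity(embed("ACGTACGTACGTACGTACGTACGT"), embed("ACGTACGTACGTACGTACGTACGA"))
print(sim.item())
```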

### 2. Log-likelihood ratio (LLR) for variant effect

Use **BOTANIC-0** to score a single-nucleotide variant with a **log-likelihood ratio (LLR)**: LLR = log P(alt) − log P(ref), computed from the model logits at the variant position. Positive LLR favours the alternate allele; negative LLR favours the reference.
For best results, the variant should be placed in the middle of the input sequence with sufficient flanking context on both sides, so the model can leverage surrounding genomic information when scoring the variant.

**K-mer masking** (recommended): mask the whole k-mer containing the variant and compare ref vs alt k-mer logits.

```python
from transformers import AutoModel, AutoTokenizer

model_name = "living-models/Botanic0-S"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# PHYB gene (AT2G18790) in Arabidopsis thaliana
# 500 bp around the ATG start codon: 5'UTR (pos 0-249) | CDS (pos 250+)
sequence = (
    "TTTTTTTTTGTTATCTCTCTCTATCTGAGAGGCACACATTTTGCTTCGTCTTCTTCAATTTATTTTATTGGTTTCTC"
    "CACTTATCTCCGATCTCAATTCTCCCCATTTTCTTCTTCCTCAAGTTCAAAATTCTTGAGAATTTAGCTCTACCAGA"
    "ATTCGTCTCCGATAACTAGTGGATGATGATTCACCCTAAATCCTTCCTTGTCTCAAGGTAATTCTGAGAAATTTCTC"
    "AAATTCAAAATCAAACGGCATGGTTTCCGGAGTCGGGGGTAGTGGCGGTGGCCGTGGCGGTGGCCGTGGCGGAGAA"
    "GAAGAACCGTCGTCAAGTCACACTCCTAATAACCGAAGAGGAGGAGAACAAGCTCAATCGTCGGGAACGAAATCTC"
    "TCAGACCAAGAAGCAACACTGAATCAATGAGCAAAGCAATTCAACAGTACACCGTCGACGCAAGACTCCACGCCGT"
    "TTTCGAACAATCCGGCGAATCAGGGAAATCATTCGACTACT"
)

# --- Deleterious variant: mutating the T of the ATG start codon ---
llr_atg = model.score_variant_zero_shot(
    sequence=sequence,
    tokenizer=tokenizer,
    variant_pos=251,   # 0-based position of 'T' in ATG
    variant_char="C",  # alternate allele
    masking="kmer",    # or "single_nt"
)
print(f"ATG start codon T->C: LLR = {llr_atg:.4f}")
# LLR < 0 (strongly negative → model prefers ref)

# --- Neutral variant: position in the 5'UTR ---
llr_utr = model.score_variant_zero_shot(
    sequence=sequence,
    tokenizer=tokenizer,
    variant_pos=155,   # position in 5'UTR
    variant_char="A",  # T -> A
    masking="kmer",
)
print(f"5'UTR T->A: LLR = {llr_utr:.4f}")
# LLR ≈ 0 (near zero → likely a neutral position)
```
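
For readers who want to see what the k-mer-masking LLR looks like in terms of raw MLM logits, here is a rough sketch of one way it could be computed by hand. It rests on several assumptions not stated above: that an MLM head can be loaded via `AutoModelForMaskedLM`, that tokens are non-overlapping uppercase 6-mers with a single `<cls>` prepended, and that the helper `manual_kmer_llr` is ours for illustration only. `score_variant_zero_shot` remains the supported interface and may differ in its details.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative sketch only. Assumptions: non-overlapping 6-mer tokens, uppercase
# 6-mer strings in the vocabulary, a single <cls> prepended, MLM head available.
model_name = "living-models/Botanic0-S"
mlm = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


def manual_kmer_llr(sequence: str, variant_pos: int, variant_char: str) -> float:
    kmer_idx = variant_pos // 6                       # which 6-mer contains the variant (assumption)
    start = kmer_idx * 6
    ref_kmer = sequence[start:start + 6]
    alt_kmer = ref_kmer[:variant_pos - start] + variant_char + ref_kmer[variant_pos - start + 1:]

    inputs = tokenizer(sequence, return_tensors="pt")
    token_idx = kmer_idx + 1                          # +1 for the prepended <cls> (assumption)
    inputs["input_ids"][0, token_idx] = tokenizer.mask_token_id

    with torch.inference_mode():
        logits = mlm(**inputs).logits                 # (1, L, vocab_size)
    log_probs = torch.log_softmax(logits[0, token_idx], dim=-1)

    # K-mers containing "N" may map to <unk>, in which case the score is meaningless.
    ref_id = tokenizer.convert_tokens_to_ids(ref_kmer)
    alt_id = tokenizer.convert_tokens_to_ids(alt_kmer)
    # LLR = log P(alt k-mer) - log P(ref k-mer) at the masked position
    return (log_probs[alt_id] - log_probs[ref_id]).item()
```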

### 3. Using Botanic0 embeddings to train XGBoost on a binary classification task

This code reproduces the XGBoost probing experiment from our paper. It downloads the dataset introduced in "PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms" (Zhai et al., 2025) for the binary classification task "PlantCAD-TIS", computes embeddings with Botanic0, and trains an XGBoost classifier on top of them.

```python
import numpy as np
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer
import xgboost as xgb

MODEL_NAME = "living-models/Botanic0-S"
USE_STRAND_INFORMATION = True
NUMBER_OF_SAMPLES = 1000  # put -1 to use all samples
BATCH_SIZE = 32  # can be lower depending on GPU memory size

PLANTCAD_REPO_ID = "kuleshov-group/cross-species-single-nucleotide-annotation"
SEQ_LENGTH_TIS = 512

# We use the embedding of the nucleotide located in the middle
# of the sequence to train the XGBoost model
MIDDLE_TOKEN_IDX = int(np.ceil(SEQ_LENGTH_TIS / 2 / 6))

# Because some samples have "N" in their sequence, we add a 20% margin
# in max_length. With such a value, almost all sequences fit within max_length.
MAX_LENGTH = int(np.ceil(SEQ_LENGTH_TIS / 6 * 1.2))

print(f"MIDDLE_TOKEN_IDX: {MIDDLE_TOKEN_IDX}")


def reverse_complement(sequence: str) -> str:
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[b] for b in sequence[::-1])


dataset = load_dataset(
    PLANTCAD_REPO_ID,
    data_files={
        "train": "TIS/train.tsv",
        "test": "TIS/valid.tsv",
    },
)
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()

if USE_STRAND_INFORMATION:
    for df in (train_df, test_df):
        neg = df["strand"] == "-"
        df.loc[neg, "sequences"] = df.loc[neg, "sequences"].apply(reverse_complement)

if NUMBER_OF_SAMPLES != -1:
    n_train = min(NUMBER_OF_SAMPLES, len(train_df))
    n_test = min(NUMBER_OF_SAMPLES, len(test_df))
    train_df, _ = train_test_split(
        train_df, train_size=n_train, stratify=train_df["label"], random_state=0
    )
    test_df, _ = train_test_split(
        test_df, train_size=n_test, stratify=test_df["label"], random_state=0
    )

print(f"Train size: {len(train_df)}, Test size: {len(test_df)}")

y_train = train_df["label"].values
y_test = test_df["label"].values
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = model.to(device)
model.eval()

train_ds = Dataset.from_pandas(train_df[["sequences", "label"]])
test_ds = Dataset.from_pandas(test_df[["sequences", "label"]])


def tokenize_fn(examples):
    return tokenizer(
        examples["sequences"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors=None,
    )


train_ds = train_ds.map(tokenize_fn, batched=True, desc="Tokenizing train")
test_ds = test_ds.map(tokenize_fn, batched=True, desc="Tokenizing test")


def get_middle_token_embeddings(tokenized_ds):
    all_embeddings = []
    n = len(tokenized_ds)
    for start in tqdm(range(0, n, BATCH_SIZE), desc="Forward pass", unit="batch"):
        end = min(start + BATCH_SIZE, n)
        batch = tokenized_ds[start:end]
        inputs = {
            k: torch.tensor(v).to(device)
            for k, v in batch.items()
            if k in ("input_ids", "attention_mask")
        }
        with torch.inference_mode():
            out = model(**inputs, output_hidden_states=True, return_dict=True)
        emb = out.hidden_states[-1][:, MIDDLE_TOKEN_IDX, :].cpu().numpy()
        all_embeddings.append(emb)
    return np.concatenate(all_embeddings, axis=0)


print("Computing embeddings...")
X_train = get_middle_token_embeddings(train_ds)
X_test = get_middle_token_embeddings(test_ds)

print("Training XGBoost (can take several minutes when using full dataset and n_estimators=1000)...")
clf = xgb.XGBClassifier(
    n_estimators=1000,
    max_depth=6,
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric="aucpr",
)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
aucpr = average_precision_score(y_test, proba)
auc = roc_auc_score(y_test, proba)
print(f"Test AUC-PR: {aucpr:.4f}")
# Test AUC-PR: 0.4886 with Botanic0-S, NUMBER_OF_SAMPLES = 1000 and USE_STRAND_INFORMATION = True
# Test AUC-PR: 0.7388 with Botanic0-S, NUMBER_OF_SAMPLES = -1 and USE_STRAND_INFORMATION = True
print(f"Test AUC: {auc:.4f}")
# Test AUC: 0.7915 with Botanic0-S, NUMBER_OF_SAMPLES = 1000 and USE_STRAND_INFORMATION = True
# Test AUC: 0.9170 with Botanic0-S, NUMBER_OF_SAMPLES = -1 and USE_STRAND_INFORMATION = True
```

---

## License

- **Model**: See the [LICENSE](LICENSE) for terms of use and citation.

---

*Model card maintained by Living Models. For questions or issues, please contact Living Models.*