Evo2-40B-8K

A clean, minimal HuggingFace port of Evo 2 40B base, the largest 8K-context StripedHyena2 DNA foundation model. Provides native support for layer-by-layer hidden state extraction, attention-weight extraction, and a runtime-switchable attention backend.

NVIDIA Transformer Engine required. This variant uses FP8 input projections (use_fp8_input_projections=True) which require TransformerEngine and a Hopper-class GPU (H100 / H200). Install with:

pip install "transformer-engine[pytorch]>=2.3.0"

Multi-GPU loading requires accelerate. This variant's bf16 weights (~76 GB) plus activations exceed a single 80 GB H100. Use device_map="auto" to shard across 2 or more H100s; install accelerate first:

pip install accelerate
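The sizing claim above is simple arithmetic. A minimal sketch, assuming 2 bytes per bf16 parameter and a hypothetical 20% per-GPU reserve for activations and buffers (the reserve fraction is an illustrative assumption, not a measured value):

```python
import math

def weight_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def min_gpus(weights_gb: float, gpu_gb: float = 80.0, reserve: float = 0.2) -> int:
    """Smallest GPU count whose usable memory covers the weights.

    `reserve` is the ASSUMED fraction of each GPU kept free for activations.
    """
    usable_gb = gpu_gb * (1.0 - reserve)
    return math.ceil(weights_gb / usable_gb)

evo2_40b_gb = weight_gb(38.0e9)   # ~38.0B parameters stored in bf16
print(evo2_40b_gb)                # 76.0
print(min_gpus(evo2_40b_gb))      # 2
```

Under these assumptions the weights alone need two 80 GB cards; longer inputs or larger batches may need more.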

Why this port?

arcinstitute/evo2_40b_base ships a .pt checkpoint that requires the evo2 and vortex Python packages just to instantiate the model. Even with both installed, common pain points remain:

  1. Not a HuggingFace model. No from_pretrained, no AutoModel, no AutoModelForCausalLM - the original ships a thin Python wrapper around a custom nn.Module.
  2. No way to extract attention weights. The reference uses flash-attn unconditionally and discards the (B, H, T, T) attention matrix; there is no official path to read it back.
  3. evo2 + vortex packages mandatory even for inference.

This repo fixes all three. The math is bit-exact with the vortex reference (max_abs_diff = 0.000e+00 at every layer; see Parity Verification). Loads with from_pretrained and trust_remote_code=True - no evo2 / vortex install needed.

Architecture

| Parameter | Value |
|---|---|
| Total parameters | ~38.0B |
| Architecture | StripedHyena 2 (interleaved Hyena cascade + MHA blocks) |
| Layers | 50 |
| Attention heads | 64 |
| Embedding dimension | 8192 |
| Inner MLP size | 21 888 |
| Vocabulary size | 512 (UTF-8 byte-level) |
| Attention block indices | 3, 10, 17, 24, 31, 35, 42, 49 (8 blocks total) |
| Hyena block indices | all others (42 blocks: hcs / hcm / hcl pattern) |
| Positional encoding | RoPE (base = 1 000 000) |
| Max sequence length | 8 192 |
| Training dtype | bfloat16 (Hyena modal-form log_poles / residues and rotary inv_freq kept in fp32) |
| FP8 input projections | yes (TransformerEngine required) |
| Weight format | model.safetensors (38.0B params, 16 files) |

Pretraining

  • Objective: causal byte-level next-token prediction.
  • Data: OpenGenome2, 8.8 trillion tokens spanning all domains of life.
  • Source checkpoint: arcinstitute/evo2_40b_base (evo2_40b_base.pt).

Parity Verification

Hidden-state representations verified bit-exact (max_abs_diff = 0.000e+00) to the vortex reference at every block output, using attn_implementation="sdpa" in bf16 (the same backend vortex's SelfAttention calls when use_flash_attn=False). Logits from Evo2ForCausalLM were also verified bit-exact (top-1 agreement: 128/128 positions on a 128-byte ACGT input). Verified on H100 with PyTorch 2.7 / CUDA 12.

Two non-obvious correctness fixes were required versus a naive port (see Implementation Notes for details):

  1. inv_freq recomputation. from_pretrained(dtype=bf16) casts buffers to bf16, which leaves the rotary inv_freq = 1 / base^(2i/dim) with only 7 of fp32's 23 mantissa bits. Our to_bfloat16_except_poles_residues() recomputes inv_freq in fp32 from self.base to match vortex's to_bfloat16_except_pr_lc(to_float32=True).
  2. SDPA backend used for parity. Vortex's reference SelfAttention (use_flash_attn=False) calls F.scaled_dot_product_attention, not a textbook softmax loop. Parity is measured with attn_implementation="sdpa" on our side. Using "eager" (textbook einsum + softmax) is mathematically equivalent but not bit-exact in bf16; using "flash_attention_2" (the recommended runtime backend) is also not bit-exact but agrees within bf16 noise.
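To see why rounding inv_freq to bf16 matters, the sketch below emulates bf16 rounding with the standard library and measures the resulting rotary angle error. It is an illustration only, not the parity measurement reported above: the rotary dimension of 128 is an assumption (8192 / 64 heads), the reference values are Python float64 rather than fp32, and the helper names are hypothetical.

```python
import struct

def round_to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # fp32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                    # rounding bias, ties to even
    bits &= 0xFFFF0000                                     # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def inv_freq(dim: int, base: float) -> list[float]:
    """Rotary inverse frequencies 1 / base^(2i/dim) for i in [0, dim/2)."""
    return [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]

def max_angle_error(dim: int, base: float, position: int) -> float:
    """Largest |position * (bf16 inv_freq - reference inv_freq)| in radians."""
    return max(abs(position * (round_to_bf16(f) - f)) for f in inv_freq(dim, base))

# ASSUMED rotary dim 128; base and last position taken from the architecture table.
print(max_angle_error(dim=128, base=1_000_000.0, position=8191))
```

The per-frequency rounding error is multiplied by the position index, so the angle error grows linearly with sequence position; recomputing inv_freq in fp32 avoids it entirely.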

Related Models

See the full Evo 2 collection on the Arc Institute HF org for the original weights, or the Taykhoom/Evo2-* collection for our minimal HF ports.

| Model | Size | Context | Notes |
|---|---|---|---|
| Taykhoom/Evo2-1B-8K | 1B | 8 192 | |
| Taykhoom/Evo2-7B-8K | 7B | 8 192 | |
| Taykhoom/Evo2-7B-262K | 7B | 262 144 | |
| Taykhoom/Evo2-7B-1M | 7B | 1 048 576 | |
| Taykhoom/Evo2-20B-1M | 20B | 1 048 576 | |
| Taykhoom/Evo2-40B-8K | 40B | 8 192 | <- this model |
| Taykhoom/Evo2-40B-1M | 40B | 1 048 576 | |

Usage

Note on dtype. Evo 2 was trained in bfloat16, with the Hyena log_poles / residues (modal-form filter parameters) and the rotary inv_freq kept in fp32 for numerical stability. Passing dtype=... to from_pretrained only affects the initial load precision - Evo2Model.__init__ and Evo2ForCausalLM.__init__ always call to_bfloat16_except_poles_residues(), so the model runs in bf16 with these fp32 invariants regardless. This is intentional: bf16 is the trained precision, and the fp32 islands are required for stability.

Note on attention backend. By HuggingFace convention this model defaults to attn_implementation="sdpa" (F.scaled_dot_product_attention) since SDPA needs only torch and runs on any GPU. The original Arc Institute Evo 2 inference path uses flash_attention_2, which is faster on long sequences but requires a separate flash-attn install. All usage examples below opt in to flash_attention_2 explicitly because most real users will want it. Drop the kwarg (or pass "sdpa" / "eager") if you don't have flash-attn installed.

Embedding generation (no LM head)

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-40B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo2-40B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # or "sdpa" (default) or "eager"
    device_map="auto",  # shard across visible GPUs (needs accelerate); 40B does not fit one 80 GB H100
).eval()

seqs = ["ACGTACGTACGT", "GGGTTTAAACCC"]
inputs = tokenizer(seqs, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last_hidden  = out.last_hidden_state   # (B, T, 8192)
all_layers   = out.hidden_states       # tuple of (B, T, 8192), len = 52
middle_layer = all_layers[25]          # input to block 25 (= output of block 24)

Recommended embedding: pre-norm of the middle block

The Evo 2 paper reports that intermediate representations work better than the final layer for downstream tasks - specifically the pre-norm of a middle block. For this variant the middle block is blocks[25], so the recommended embedding is blocks[25].pre_norm(hidden_states[25]):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-40B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo2-40B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",  # shard across visible GPUs (needs accelerate); 40B does not fit one 80 GB H100
).eval()

inputs = tokenizer(["ACGTACGTACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    pre_norm_middle = model.backbone.blocks[25].pre_norm(
        out.hidden_states[25]
    )                                      # (B, T, 8192)

HF has no built-in API for sub-block intermediates like pre-norm outputs (only block outputs via output_hidden_states). The pattern above applies the block's pre_norm submodule directly to the corresponding hidden_states entry; this gives a bit-identical result to registering a forward hook on backbone.blocks[i].pre_norm and is simpler than using PyTorch hooks. Note that it does require running the full forward pass and then re-applying pre_norm, so a forward hook is more efficient if you only need this single intermediate.
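For the hook-based alternative, a minimal sketch follows. The helper name capture_output is hypothetical, and it assumes the hooked submodule returns a single tensor (which is what the pre_norm call above implies); register_forward_hook itself is standard PyTorch.

```python
import torch
from torch import nn

def capture_output(module: nn.Module, run_forward):
    """Run `run_forward()` and return the output `module` produced during it."""
    captured = {}

    def hook(_module, _inputs, output):
        # Assumes the submodule returns a single tensor.
        captured["output"] = output.detach()

    handle = module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            run_forward()
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
    return captured["output"]

# Usage with `model` and `inputs` from the example above:
# pre_norm_middle = capture_output(
#     model.backbone.blocks[25].pre_norm,
#     lambda: model(**inputs),
# )
```

This avoids output_hidden_states=True, so the per-block hidden states are not all retained, and pre_norm runs only once.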

LM logits

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-40B-8K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo2-40B-8K", trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",  # shard across visible GPUs (needs accelerate); 40B does not fit one 80 GB H100
).eval()

inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits   # (1, T, 512)

Generation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-40B-8K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo2-40B-8K", trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",  # shard across visible GPUs (needs accelerate); 40B does not fit one 80 GB H100
).eval()

inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=4, temperature=1.0)
print(tokenizer.decode(out[0]))

generation_config.json ships with eos_token_id = 0 (the EOD byte) and pad_token_id = 1 so model.generate() stops naturally at the trained end-of-document token.
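The token ids involved can be pictured with a plain-Python sketch. It assumes token ids equal raw UTF-8 byte values, which is consistent with EOS = byte \x00 = id 0 and pad = byte \x01 = id 1 but is not verified here against the shipped tokenizer; the function names are hypothetical.

```python
EOS_ID = 0  # byte \x00, eos_token_id in generation_config.json
PAD_ID = 1  # byte \x01, pad_token_id

def encode(seq: str) -> list[int]:
    """Byte-level encoding: one token per UTF-8 byte, no EOS appended."""
    return list(seq.encode("utf-8"))

def decode_until_eos(token_ids: list[int]) -> str:
    """Decode ids, dropping padding and everything from the first EOS onward."""
    if EOS_ID in token_ids:
        token_ids = token_ids[: token_ids.index(EOS_ID)]
    # Ids >= 256 have no byte value under the assumed mapping, so skip them.
    kept = bytes(t for t in token_ids if t != PAD_ID and t < 256)
    return kept.decode("utf-8", errors="replace")

print(encode("ACGT"))                                # [65, 67, 71, 84]
print(decode_until_eos([65, 67, 71, 84, 0, 1, 1]))   # ACGT
```

In practice tokenizer.decode(out[0], skip_special_tokens=True) is the usual way to drop special tokens, provided the tokenizer registers them as such.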

Attention weights

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-40B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo2-40B-8K",
    trust_remote_code=True,
    attn_implementation="eager",  # required for output_attentions to populate
    device_map="auto",  # shard across visible GPUs (needs accelerate); 40B does not fit one 80 GB H100
).eval()

inputs = tokenizer(["ACGTACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of length 50. Entries at indices not in
# [3, 10, 17, 24, 31, 35, 42, 49] are None (Hyena blocks have no attention matrix).
# The 8 attention blocks at those indices each return a (B, num_heads, T, T) tensor.
attn_block_3 = out.attentions[3]
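Two small helpers make the returned tuple easier to work with: one drops the None entries, the other estimates how much memory a materialized attention matrix needs. This is a sketch with hypothetical helper names, and it assumes the weights are returned in bf16 (2 bytes per element).

```python
def attention_layers(attentions):
    """Map block index -> attention tensor, skipping the None entries of Hyena blocks."""
    return {i: a for i, a in enumerate(attentions) if a is not None}

def attention_bytes(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> int:
    """Bytes needed for one (B, H, T, T) attention matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_el

# One sequence at the full 8 192 context with 64 heads, ASSUMING bf16 output:
per_block = attention_bytes(batch=1, heads=64, seq_len=8192)
print(per_block / 2**30)       # 8.0  (GiB per attention block)
print(8 * per_block / 2**30)   # 64.0 (GiB for all 8 attention blocks)

# Usage with `out` from the example above:
# attn = attention_layers(out.attentions)   # keys: 3, 10, 17, 24, 31, 35, 42, 49
```

Because the cost grows with T squared, extracting attention weights is most practical on inputs much shorter than the full context.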

Multi-GPU loading

This variant must be sharded across two or more 80 GB GPUs (sharding is optional for the smaller variants). Install accelerate and pass device_map="auto":

from transformers import AutoModelForCausalLM
# pip install accelerate
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo2-40B-8K", trust_remote_code=True,
    device_map="auto",  # accelerate will shard across all visible GPUs
)
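If automatic placement leaves too little headroom on a GPU, from_pretrained also accepts a max_memory mapping alongside device_map="auto". The sketch below builds such a mapping; the helper name is hypothetical and the 60 GiB cap is an illustrative value, not a tested recommendation.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int, cpu_gib: int = 0) -> dict:
    """Build a `max_memory` mapping: GPU index -> cap, plus an optional CPU cap."""
    caps = {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}
    if cpu_gib > 0:
        caps["cpu"] = f"{cpu_gib}GiB"
    return caps

print(build_max_memory(2, 60))   # {0: '60GiB', 1: '60GiB'}

# Hypothetical usage:
# model = AutoModelForCausalLM.from_pretrained(
#     "Taykhoom/Evo2-40B-8K", trust_remote_code=True,
#     device_map="auto", max_memory=build_max_memory(2, 60),
# )
# print(model.hf_device_map)     # which device each module was placed on
```

Capping each GPU below its physical capacity reserves room for activations at inference time.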

Fine-tuning

This HuggingFace port has not been tested for fine-tuning; it is verified only for inference parity. For fine-tuning, follow the original Arc Institute guidance and use either Savanna (the framework Evo 2 was pretrained in) or NVIDIA BioNeMo, which provides an official Evo 2 fine-tuning recipe.

Implementation Notes

  • inv_freq kept in fp32 (critical for parity). HF's from_pretrained(dtype=bf16) casts all buffers, including the rotary inv_freq, to bf16. In bf16 the geometric series inv_freq[i] = 1 / base^(2i/dim) keeps only 7 of fp32's 23 mantissa bits, which shifts the cos/sin tables by ~5e-2 per cell at higher positions and contributes ~3e-2 of Q/K noise per attention layer. Our to_bfloat16_except_poles_residues() (called in each __init__) recomputes inv_freq in fp32 from the stored self.base and invalidates the cos/sin cache, mirroring vortex's to_bfloat16_except_pr_lc(to_float32=True).
  • log_poles / residues kept in fp32 (critical for stability). The Hyena cascade long (hcl) blocks parameterize an IIR filter via log_poles and residues; bf16 quantisation makes the recurrence numerically unstable. Both are stored as fp32 in the safetensors and restored to fp32 by force_dtype() after load.
  • attn_implementation switching (attention.py). Three backends, selected via the standard HF attn_implementation kwarg to from_pretrained (default chosen by HF auto-detection - typically "sdpa"):
    • "sdpa": calls F.scaled_dot_product_attention. Bit-exact with vortex's reference path (when vortex uses use_flash_attn=False).
    • "flash_attention_2": calls flash_attn.flash_attn_qkvpacked_func. Matches the original Arc Institute inference path; faster on long sequences; requires flash-attn installed.
    • "eager": textbook einsum + softmax(QK^T) + einsum. Slowest, used internally when output_attentions=True so the attention matrix is materialized.
  • Block dispatch (hyena.py). StripedHyena 2 has 4 block types, dispatched by layer_idx membership in four config lists: attn_layer_idxs (MHA + RoPE), hcl_layer_idxs (modal-form IIR via FFT), hcm_layer_idxs (medium FIR cascade, inner length 128), hcs_layer_idxs (short FIR cascade, inner length 7). The disjoint union must equal range(num_layers).
  • TELinear with pure-PyTorch fallback (layers.py). Hyena cascade blocks use a TransformerEngine-backed input projection (3x hidden_size output) that supports FP8 quantisation. When TE is not installed, a TELinear fallback class with the same state_dict layout (weight, bias) is used - checkpoints are cross-loadable.
  • Custom cache (cache.py). Evo2Cache wraps four block-type-specific dataclasses: InferenceParams for MHA KV cache, HyenaCascadeIIRInferenceParams for hcl, and two HyenaCascadeFIRInferenceParams for hcm / hcs. Passed through model.generate() as past_key_values (we set _supports_cache_class = False so HF treats it as an opaque dict rather than wrapping it in a DynamicCache).
  • Tokenizer (tokenization_evo2.py). Byte-level UTF-8, vocab_size = 512. Pad token = byte \x01. EOS = byte \x00 (set as eos_token_id in generation_config.json). Tokenizer does not add EOS at encoding time - matches the original Evo 2 inference pipeline.
  • Dependencies. torch, transformers, numpy, safetensors, huggingface_hub. transformer-engine[pytorch] is required for this variant's FP8 input projections. accelerate is required if you load with device_map="auto" (the model is too large to fit on a single 80 GB H100 with activations). flash_attn is optional (only needed if you pass attn_implementation="flash_attention_2").
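The block-dispatch invariant described above (the four index lists are disjoint and together cover range(num_layers)) can be checked with a few lines of Python. This is a sketch, not the code in hyena.py; the function name is hypothetical and the 8-layer layout is a toy example, not the 40B config.

```python
def check_block_partition(num_layers, attn, hcl, hcm, hcs):
    """Return the block type per layer, or raise ValueError if the four
    index lists do not partition range(num_layers)."""
    groups = {"attn": attn, "hcl": hcl, "hcm": hcm, "hcs": hcs}
    owner = {}
    for name, idxs in groups.items():
        for i in idxs:
            if i in owner:
                raise ValueError(f"layer {i} is in both {owner[i]} and {name}")
            owner[i] = name
    missing = sorted(set(range(num_layers)) - set(owner))
    extra = sorted(set(owner) - set(range(num_layers)))
    if missing or extra:
        raise ValueError(f"unassigned layers: {missing}; out-of-range layers: {extra}")
    return [owner[i] for i in range(num_layers)]

# Toy 8-layer layout (hypothetical indices):
print(check_block_partition(8, attn=[3, 7], hcl=[2, 6], hcm=[1, 5], hcs=[0, 4]))
# ['hcs', 'hcm', 'hcl', 'attn', 'hcs', 'hcm', 'hcl', 'attn']
```

Running a check like this once at config load time turns a silent mis-dispatch into an immediate error.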

Citation

@article{brixi2026_evo2,
  title   = {Genome modelling and design across all domains of life with {Evo} 2},
  author  = {Brixi, Garyk and Durrant, Matthew G. and Ku, Jerome and Naghipourfar, Mohsen and Poli, Michael and Sun, Gwanggyu and Brockman, Greg and Chang, Daniel and Fanton, Alison and Gonzalez, Gabriel A. and King, Samuel H. and Li, David B. and Merchant, Aditi T. and Nguyen, Eric and Ricci-Tam, Chiara and Romero, David W. and Schmok, Jonathan C. and Taghibakhshi, Ali and Vorontsov, Anton and Yang, Brandon and Deng, Myra and Gorton, Liv and Nguyen, Nam and Wang, Nicholas K. and Pearce, Michael T. and Simon, Elana and Adams, Etowah and Amador, Zachary J. and Ashley, Euan A. and Baccus, Stephen A. and Dai, Haoyu and Dillmann, Steven and Ermon, Stefano and Guo, Daniel and Herschl, Michael H. and Ilango, Rajesh and Janik, Ken and Lu, Amy X. and Mehta, Reshma and Mofrad, Mohammad R. K. and Ng, Madelena Y. and Pannu, Jaspreet and {R{\'e}}, Christopher and St. John, John and Sullivan, Jeremy and Tey, Joseph and Viggiano, Ben and Zhu, Kevin and Zynda, Greg and Balsam, Daniel and Collison, Patrick and Costa, Anthony B. and Hernandez-Boussard, Tina and Ho, Eric and Liu, Ming-Yu and McGrath, Thomas and Powell, Kimberly and Pinglay, Sudarshan and Burke, Dave P. and Goodarzi, Hani and Hsu, Patrick D. and Hie, Brian L.},
  journal = {Nature},
  year    = {2026},
  doi     = {10.1038/s41586-026-10176-5}
}

Credits

Original Evo 2 model and code by Brixi et al. (arcinstitute/evo2, Zymrael/vortex). Source checkpoint: arcinstitute/evo2_40b_base.

The HuggingFace conversion code in this repo was authored primarily by Claude and reviewed manually by Taykhoom Dalal.

License

Apache 2.0, following the original Evo 2 release.
