Evo2-7B-1M
A clean, minimal HuggingFace port of Evo 2 7B (1M), the 1M-context StripedHyena2 variant. Provides native support for layer-by-layer hidden state extraction, attention-weight extraction, and a runtime-switchable attention backend.
No Transformer Engine required. This variant runs in pure bf16 on any CUDA-capable GPU (no FP8 / TE dependency).
Why this port?
arcinstitute/evo2_7b ships a .pt checkpoint that requires the evo2 and vortex Python packages just to instantiate the model. Even with both installed, common pain points remain:
- Not a HuggingFace model. No from_pretrained, no AutoModel, no AutoModelForCausalLM - the original ships a thin Python wrapper around a custom nn.Module.
- No way to extract attention weights. The reference uses flash-attn unconditionally and discards the (B, H, T, T) attention matrix; there is no official path to read it back.
- evo2 + vortex packages mandatory even for inference.
This repo fixes all three. The math is bit-exact with the vortex reference (max_abs_diff = 0.000e+00 at every layer; see Parity Verification). Loads with from_pretrained and trust_remote_code=True - no evo2 / vortex install needed.
Architecture
| Parameter | Value |
|---|---|
| Total parameters | ~6.6B |
| Architecture | StripedHyena 2 (interleaved Hyena cascade + MHA blocks) |
| Layers | 32 |
| Attention heads | 32 |
| Embedding dimension | 4096 |
| Inner MLP size | 11 264 |
| Vocabulary size | 512 (UTF-8 byte-level) |
| Attention block indices | 3, 10, 17, 24, 31 (5 blocks total) |
| Hyena block indices | all others (27 blocks: hcs / hcm / hcl pattern) |
| Positional encoding | RoPE (base = 10 000), linearly scaled by 128x |
| Max sequence length | 1 048 576 |
| Training dtype | bfloat16 (Hyena modal-form log_poles / residues and rotary inv_freq kept in fp32) |
| FP8 input projections | no |
| Weight format | model.safetensors (6.6B params, 3 files) |
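The positional-encoding row can be read as simple arithmetic. The sketch below is illustrative only: it assumes that "linearly scaled" means position interpolation (positions divided by the scale factor before the rotary angles are computed) and that the rotary dimension equals the full head dimension (4096 / 32 = 128). Neither assumption is confirmed by this card, and the function names are hypothetical.

```python
# Values taken from the architecture table.
HIDDEN_SIZE = 4096
NUM_HEADS = 32
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 128 (assumed to be the rotary dim)
ROPE_BASE = 10_000.0
ROPE_SCALE = 128.0
MAX_SEQ_LEN = 1_048_576


def inv_freq(dim: int = HEAD_DIM, base: float = ROPE_BASE) -> list[float]:
    """inv_freq[i] = 1 / base^(2i/dim) for i in [0, dim/2)."""
    return [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]


def rope_angles(position: int, scale: float = 1.0) -> list[float]:
    """Rotation angles at `position`.

    Assumption: linear scaling divides the position by `scale`.
    """
    return [(position / scale) * f for f in inv_freq()]


# Under that assumption, the 1M window maps onto an unscaled window of 8192.
unscaled_window = int(MAX_SEQ_LEN / ROPE_SCALE)
angles_scaled = rope_angles(MAX_SEQ_LEN, ROPE_SCALE)
angles_unscaled = rope_angles(unscaled_window)
```

If the assumption holds, a position at the end of the 1M window sees the same rotation angles that position 8192 would see without scaling.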
Pretraining
- Objective: causal byte-level next-token prediction.
- Data: OpenGenome2, 8.8 trillion tokens spanning all domains of life.
- Source checkpoint: arcinstitute/evo2_7b (evo2_7b.pt).
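To make "byte-level" concrete, here is a minimal stand-in for the tokenizer, not the shipped tokenization_evo2.py. It assumes that token ids equal raw UTF-8 byte values (consistent with pad = byte \x01 and EOS = byte \x00 described elsewhere in this card) and that padding goes on the right; the padding side is a guess. The function names are hypothetical.

```python
PAD_ID = 1  # byte \x01
EOS_ID = 0  # byte \x00 (end-of-document)


def encode(seq: str) -> list[int]:
    """Map a sequence to its UTF-8 byte values; no EOS is appended."""
    return list(seq.encode("utf-8"))


def pad_batch(batch: list[list[int]]) -> list[list[int]]:
    """Right-pad every row to the longest row (padding side is an assumption)."""
    width = max(len(ids) for ids in batch)
    return [ids + [PAD_ID] * (width - len(ids)) for ids in batch]


def decode(ids: list[int]) -> str:
    """Stop at the first EOS byte and skip padding."""
    kept = []
    for token in ids:
        if token == EOS_ID:
            break
        if token != PAD_ID:
            kept.append(token)
    return bytes(kept).decode("utf-8", errors="replace")


batch = pad_batch([encode("ACGT"), encode("GG")])
```

Each nucleotide is a single token, so sequence length in tokens equals sequence length in bases.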
Parity Verification
Hidden-state representations verified bit-exact (max_abs_diff = 0.000e+00) to the vortex reference at every block output, using attn_implementation="sdpa" in bf16 (the same backend vortex's SelfAttention calls when use_flash_attn=False). Logits from Evo2ForCausalLM were also verified bit-exact (top-1 agreement: 128/128 positions on a 128-byte ACGT input). Verified on H100 with PyTorch 2.7 / CUDA 12.
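The two metrics quoted above can be sketched in plain Python. This is an illustration of what max_abs_diff and top-1 agreement mean, operating on nested lists; it is not the verification script used for this port, and the function names and toy values are hypothetical.

```python
def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest element-wise absolute difference between two flat tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    return max(abs(x - y) for x, y in zip(a, b))


def argmax(row: list[float]) -> int:
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(logits_a, logits_b) -> tuple[int, int]:
    """Return (matching positions, total positions) for per-position argmax."""
    if len(logits_a) != len(logits_b):
        raise ValueError("logit tensors must cover the same positions")
    matches = sum(argmax(ra) == argmax(rb) for ra, rb in zip(logits_a, logits_b))
    return matches, len(logits_a)


# Toy data: two positions, three-token vocabulary.
reference = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]
candidate = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]

flat_ref = [x for row in reference for x in row]
flat_cand = [x for row in candidate for x in row]
diff = max_abs_diff(flat_ref, flat_cand)
agreement = top1_agreement(reference, candidate)
```

A bit-exact port gives a difference of exactly zero, which is a stronger statement than full top-1 agreement: argmax can match even when the logits differ slightly.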
Two non-obvious correctness fixes were required versus a naive port (see Implementation Notes for details):
- inv_freq recomputation. from_pretrained(dtype=bf16) casts buffers to bf16, which leaves only 7 explicit mantissa bits (versus 23 in fp32) for the rotary inv_freq = 1 / base^(2i/dim). Our to_bfloat16_except_poles_residues() recomputes inv_freq in fp32 from self.base to match vortex's to_bfloat16_except_pr_lc(to_float32=True).
- SDPA backend used for parity. Vortex's reference SelfAttention (use_flash_attn=False) calls F.scaled_dot_product_attention, not a textbook softmax loop. Parity is measured with attn_implementation="sdpa" on our side. Using "eager" (textbook einsum + softmax) is mathematically equivalent but not bit-exact in bf16; using "flash_attention_2" (the recommended runtime backend) is also not bit-exact but agrees within bf16 noise.
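The precision issue with inv_freq can be reproduced without a GPU. The sketch below emulates bf16 with the standard library by rounding an fp32 bit pattern to its top 16 bits (round-to-nearest-even). It assumes a rotary dimension of 128 and base 10 000; the helper names are hypothetical and this is not the port's own code.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate an fp32 -> bf16 cast for positive finite values."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that bf16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


DIM = 128        # assumed rotary dimension
BASE = 10_000.0

fp32_inv_freq = [to_fp32(1.0 / BASE ** (2 * i / DIM)) for i in range(DIM // 2)]
bf16_inv_freq = [to_bf16(f) for f in fp32_inv_freq]

# Relative error introduced by the bf16 cast, per frequency.
rel_err = [abs(b - f) / f for f, b in zip(fp32_inv_freq, bf16_inv_freq)]
worst_rel_err = max(rel_err)
```

The relative error is bounded by 2^-8, but it is multiplied by the position index when the rotation angle is formed, so it grows with sequence position. That is why recomputing inv_freq in fp32 matters more as the context gets longer.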
Related Models
See the full Evo 2 collection on the Arc Institute HF org for the original weights, or the Taykhoom/Evo2-* collection for our minimal HF ports.
| Model | Size | Context | Notes |
|---|---|---|---|
| Taykhoom/Evo2-1B-8K | 1B | 8 192 | |
| Taykhoom/Evo2-7B-8K | 7B | 8 192 | |
| Taykhoom/Evo2-7B-262K | 7B | 262 144 | |
| Taykhoom/Evo2-7B-1M | 7B | 1 048 576 | <- this model |
| Taykhoom/Evo2-20B-1M | 20B | 1 048 576 | |
| Taykhoom/Evo2-40B-8K | 40B | 8 192 | |
| Taykhoom/Evo2-40B-1M | 40B | 1 048 576 | |
Usage
Note on dtype. Evo 2 was trained in bfloat16, with the Hyena log_poles / residues (modal-form filter parameters) and the rotary inv_freq kept in fp32 for numerical stability. Passing dtype=... to from_pretrained only affects the initial load precision - Evo2Model.__init__ and Evo2ForCausalLM.__init__ always call to_bfloat16_except_poles_residues(), so the model runs in bf16 with these fp32 invariants regardless. This is intentional: bf16 is the trained precision, and the fp32 islands are required for stability.
Note on attention backend. By HuggingFace convention this model defaults to attn_implementation="sdpa" (F.scaled_dot_product_attention), since SDPA needs only torch and runs on any GPU. The original Arc Institute Evo 2 inference path uses flash_attention_2, which is faster on long sequences but requires a separate flash-attn install. All usage examples below opt in to flash_attention_2 explicitly because most real users will want it. Drop the kwarg (or pass "sdpa" / "eager") if you don't have flash-attn installed.
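One way to avoid hard-coding the backend is to choose it at runtime. The helper below is a hypothetical convenience, not part of this repo: it only checks whether the flash_attn package is importable, which does not guarantee that the installed build supports your GPU.

```python
import importlib.util


def pick_attn_implementation(need_attention_weights: bool = False) -> str:
    """Choose a value for the attn_implementation kwarg.

    - "eager" when the attention matrix must be returned,
    - "flash_attention_2" when the flash_attn package is importable,
    - "sdpa" otherwise (needs only torch).
    """
    if need_attention_weights:
        return "eager"
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


backend = pick_attn_implementation()
```

The result can be passed straight through, for example from_pretrained(..., attn_implementation=backend).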
Embedding generation (no LM head)
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-7B-1M", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo2-7B-1M",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # or "sdpa" (default) or "eager"
).cuda().eval()
seqs = ["ACGTACGTACGT", "GGGTTTAAACCC"]
inputs = tokenizer(seqs, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
last_hidden = out.last_hidden_state # (B, T, 4096)
all_layers = out.hidden_states # tuple of (B, T, 4096), len = 34
middle_layer = all_layers[16] # input to block 16 (= output of block 15)
Recommended embedding: pre-norm of the middle block
The Evo 2 paper reports that intermediate representations work better than the final layer for downstream tasks - specifically the pre-norm of a middle block. For this variant the middle block is blocks[16], so the recommended embedding is blocks[16].pre_norm(hidden_states[16]):
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-7B-1M", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo2-7B-1M",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).cuda().eval()
inputs = tokenizer(["ACGTACGTACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    pre_norm_middle = model.backbone.blocks[16].pre_norm(
        out.hidden_states[16]
    )  # (B, T, 4096)
HF has no built-in API for sub-block intermediates like pre-norm outputs (only block outputs via output_hidden_states). The pattern above applies the block's pre_norm submodule directly to the corresponding hidden_states entry; this gives a bit-identical result to registering a forward hook on backbone.blocks[i].pre_norm and is simpler than using PyTorch hooks. Note that it does require running the full forward pass and then re-applying pre_norm, so a forward hook is more efficient if you only need this single intermediate.
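The equivalence between the two approaches can be checked on a toy model. The sketch below does not load Evo 2: ToyBlock and ToyBackbone are hypothetical stand-ins that only mimic the blocks[i].pre_norm layout, with nn.LayerNorm used in place of whatever normalization the real blocks use. It compares a forward hook on pre_norm against re-applying pre_norm to the matching hidden_states entry.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Stand-in block: pre-norm followed by a residual mixer."""

    def __init__(self, dim: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim)
        self.mixer = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.mixer(self.pre_norm(x))


class ToyBackbone(nn.Module):
    def __init__(self, dim: int = 8, num_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([ToyBlock(dim) for _ in range(num_blocks)])

    def forward(self, x):
        hidden_states = [x]  # hidden_states[i] is the input to block i
        for block in self.blocks:
            x = block(x)
            hidden_states.append(x)
        return x, tuple(hidden_states)


torch.manual_seed(0)
backbone = ToyBackbone().eval()
captured = {}


def save_output(module, inputs, output):
    captured["pre_norm"] = output.detach()


layer = 2
handle = backbone.blocks[layer].pre_norm.register_forward_hook(save_output)
with torch.no_grad():
    _, hidden_states = backbone(torch.randn(1, 5, 8))
handle.remove()  # remove before re-applying, or the hook would fire again

with torch.no_grad():
    reapplied = backbone.blocks[layer].pre_norm(hidden_states[layer])

hooks_match = torch.equal(captured["pre_norm"], reapplied)
```

With a hook, the forward pass can also be stopped early (for example by raising from the hook and catching the exception), which is where the efficiency gain for a single intermediate would come from.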
LM logits
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-7B-1M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo2-7B-1M", trust_remote_code=True,
    attn_implementation="flash_attention_2",
).cuda().eval()
inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits  # (1, T, 512)
Generation
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-7B-1M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo2-7B-1M", trust_remote_code=True,
    attn_implementation="flash_attention_2",
).cuda().eval()
inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=4, temperature=1.0)
print(tokenizer.decode(out[0]))
generation_config.json ships with eos_token_id = 0 (the EOD byte) and pad_token_id = 1 so model.generate() stops naturally at the trained end-of-document token.
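For readers unfamiliar with the sampling kwargs, the sketch below shows the textbook meaning of top_k and temperature on a single logit vector. It is not the code path model.generate() runs internally. The toy logits, which put most of the mass on the bytes for A, C, G and T, are an illustrative assumption rather than model output.

```python
import math
import random


def top_k_probs(logits: list[float], k: int, temperature: float = 1.0) -> dict[int, float]:
    """Scale by temperature, keep the k largest logits, renormalize with softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    keep = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:k]
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_token(logits, k, temperature=1.0, rng=random):
    probs = top_k_probs(logits, k, temperature)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


# Toy logits over the 512-entry vocabulary (illustrative values only).
logits = [-10.0] * 512
for byte_value, score in zip(b"ACGT", (2.0, 1.5, 1.0, 0.5)):
    logits[byte_value] = score

probs = top_k_probs(logits, k=4, temperature=1.0)
next_token = sample_token(logits, k=4, rng=random.Random(0))
```

With top_k=4, at most four distinct byte values can be sampled at each step, whatever the rest of the distribution looks like.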
Attention weights
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo2-7B-1M", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo2-7B-1M",
    trust_remote_code=True,
    attn_implementation="eager",  # required for output_attentions to populate
).cuda().eval()
inputs = tokenizer(["ACGTACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
# out.attentions is a tuple of length 32. Entries at indices not in
# [3, 10, 17, 24, 31] are None (Hyena blocks have no attention matrix).
# The 5 attention block(s) at those indices return a (B, num_heads, T, T) tensor.
attn_block_3 = out.attentions[3]
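Because most entries of out.attentions are None, it helps to filter them before further processing. The sketch below uses a stand-in tuple rather than real model output, and adds a size estimate for one materialized attention matrix, assuming 2 bytes per element (bf16). The helper names are hypothetical.

```python
ATTN_LAYER_IDXS = (3, 10, 17, 24, 31)
NUM_LAYERS = 32
NUM_HEADS = 32


def attention_by_layer(attentions) -> dict:
    """Keep only the layers that actually returned an attention matrix."""
    return {i: a for i, a in enumerate(attentions) if a is not None}


def attn_matrix_bytes(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> int:
    """Memory for one (B, H, T, T) attention matrix (bf16 assumed)."""
    return batch * heads * seq_len * seq_len * bytes_per_el


# Stand-in for out.attentions: placeholder strings instead of tensors.
fake_attentions = tuple(
    f"attn[{i}]" if i in ATTN_LAYER_IDXS else None for i in range(NUM_LAYERS)
)
present = attention_by_layer(fake_attentions)

# One attention layer at T = 8192, batch size 1.
gib_at_8k = attn_matrix_bytes(1, NUM_HEADS, 8192) / 2**30
```

The quadratic growth in T means that materializing attention weights is practical only for short inputs, far below the 1M-token context limit.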
Multi-GPU loading (optional)
For sharding across multiple GPUs (required for 40B, optional for smaller variants), install accelerate and pass device_map="auto":
from transformers import AutoModelForCausalLM
# pip install accelerate
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo2-7B-1M", trust_remote_code=True,
    device_map="auto",  # accelerate will shard across all visible GPUs
)
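A rough estimate shows why sharding is optional at this size. The sketch below counts bf16 weight bytes only and ignores activations, caches and the fp32 exceptions. The 80% usable-memory headroom, the 80 GiB GPU size and the nominal 40B parameter count are illustrative assumptions, not measurements.

```python
import math


def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes needed for the weights alone (bf16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param


def min_gpus_for_weights(num_params: int, gpu_mem_gib: float, headroom: float = 0.8) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    `headroom` is the assumed fraction of each GPU available for weights.
    """
    usable = gpu_mem_gib * 2**30 * headroom
    return math.ceil(weight_bytes(num_params) / usable)


PARAMS_7B = 6_600_000_000    # "~6.6B" from the architecture table
PARAMS_40B = 40_000_000_000  # nominal size of the 40B variants

weights_gib_7b = weight_bytes(PARAMS_7B) / 2**30
gpus_7b = min_gpus_for_weights(PARAMS_7B, gpu_mem_gib=80)
gpus_40b = min_gpus_for_weights(PARAMS_40B, gpu_mem_gib=80)
```

Long-context inference adds activation and cache memory on top of this, so the real requirement depends on sequence length as well as parameter count.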
Fine-tuning
This HuggingFace port has not been tested for fine-tuning - it's verified only for inference parity. For fine-tuning, follow the original Arc Institute guidance and use either Savanna (the framework Evo 2 was pretrained in) or Nvidia BioNeMo, which provides an official Evo 2 fine-tuning recipe.
Implementation Notes
- inv_freq kept in fp32 (critical for parity). HF's from_pretrained(dtype=bf16) casts all buffers, including the rotary inv_freq, to bf16. The geometric series inv_freq[i] = 1 / base^(2i/dim) retains only 7 explicit mantissa bits in bf16 (versus 23 in fp32), which shifts the cos/sin tables by ~5e-2 per cell at higher positions and contributes ~3e-2 of Q/K noise per attention layer. Our to_bfloat16_except_poles_residues() (called in each __init__) recomputes inv_freq in fp32 from the stored self.base and invalidates the cos/sin cache, mirroring vortex's to_bfloat16_except_pr_lc(to_float32=True).
- log_poles / residues kept in fp32 (critical for stability). The Hyena cascade long (hcl) blocks parameterize an IIR filter via log_poles and residues; bf16 quantisation makes the recurrence numerically unstable. Both are stored as fp32 in the safetensors and restored to fp32 by force_dtype() after load.
- attn_implementation switching (attention.py). Three backends, selected via the standard HF attn_implementation kwarg to from_pretrained (default chosen by HF auto-detection - typically "sdpa"):
  - "sdpa": calls F.scaled_dot_product_attention. Bit-exact with vortex's reference path (when vortex uses use_flash_attn=False).
  - "flash_attention_2": calls flash_attn.flash_attn_qkvpacked_func. Matches the original Arc Institute inference path; faster on long sequences; requires flash-attn installed.
  - "eager": textbook einsum + softmax(QK^T) + einsum. Slowest, used internally when output_attentions=True so the attention matrix is materialized.
- Block dispatch (hyena.py). StripedHyena 2 has 4 block types, dispatched by layer_idx membership in four config lists: attn_layer_idxs (MHA + RoPE), hcl_layer_idxs (modal-form IIR via FFT), hcm_layer_idxs (medium FIR cascade, inner length 128), hcs_layer_idxs (short FIR cascade, inner length 7). The disjoint union must equal range(num_layers).
- TELinear with pure-PyTorch fallback (layers.py). Hyena cascade blocks use a TransformerEngine-backed input projection (3x hidden_size output) that supports FP8 quantisation. When TE is not installed, a TELinear fallback class with the same state_dict layout (weight, bias) is used - checkpoints are cross-loadable.
- Custom cache (cache.py). Evo2Cache wraps four block-type-specific dataclasses: InferenceParams for MHA KV cache, HyenaCascadeIIRInferenceParams for hcl, and two HyenaCascadeFIRInferenceParams for hcm / hcs. Passed through model.generate() as past_key_values (we set _supports_cache_class = False so HF treats it as an opaque dict rather than wrapping it in a DynamicCache).
- Tokenizer (tokenization_evo2.py). Byte-level UTF-8, vocab_size = 512. Pad token = byte \x01. EOS = byte \x00 (set as eos_token_id in generation_config.json). Tokenizer does not add EOS at encoding time - matches the original Evo 2 inference pipeline.
- Dependencies. torch, transformers, numpy, safetensors, huggingface_hub. accelerate is optional but recommended if you want to load with device_map="auto" for multi-GPU sharding. flash_attn is optional (only needed if you pass attn_implementation="flash_attention_2").
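The block-dispatch invariant (the four index lists form a disjoint union equal to range(num_layers)) is easy to check mechanically. The sketch below is a hypothetical validator, not code from hyena.py. The attention indices come from the architecture table, but the hcs / hcm / hcl split is a placeholder generated by cycling through the remaining layers and has not been read from the model's config.

```python
from itertools import cycle


def validate_layer_partition(num_layers: int, **idx_lists) -> list[str]:
    """Return the owning list name per layer, or raise if the partition is invalid."""
    owner: dict[int, str] = {}
    for name, idxs in idx_lists.items():
        for i in idxs:
            if i in owner:
                raise ValueError(f"layer {i} is in both {owner[i]} and {name}")
            owner[i] = name
    expected = set(range(num_layers))
    missing = sorted(expected - set(owner))
    extra = sorted(set(owner) - expected)
    if missing or extra:
        raise ValueError(f"missing layers {missing}, out-of-range layers {extra}")
    return [owner[i] for i in range(num_layers)]


NUM_LAYERS = 32
attn_layer_idxs = [3, 10, 17, 24, 31]  # from the architecture table

# Placeholder split of the 27 Hyena layers (assumption, not the real config).
hyena = {"hcs_layer_idxs": [], "hcm_layer_idxs": [], "hcl_layer_idxs": []}
remaining = [i for i in range(NUM_LAYERS) if i not in attn_layer_idxs]
for i, name in zip(remaining, cycle(hyena)):
    hyena[name].append(i)

layer_kinds = validate_layer_partition(
    NUM_LAYERS, attn_layer_idxs=attn_layer_idxs, **hyena
)
```

Running a check like this on the loaded config before building the model turns a silently misassigned layer into an immediate error.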
Citation
@article{brixi2026_evo2,
title = {Genome modelling and design across all domains of life with {Evo} 2},
author = {Brixi, Garyk and Durrant, Matthew G. and Ku, Jerome and Naghipourfar, Mohsen and Poli, Michael and Sun, Gwanggyu and Brockman, Greg and Chang, Daniel and Fanton, Alison and Gonzalez, Gabriel A. and King, Samuel H. and Li, David B. and Merchant, Aditi T. and Nguyen, Eric and Ricci-Tam, Chiara and Romero, David W. and Schmok, Jonathan C. and Taghibakhshi, Ali and Vorontsov, Anton and Yang, Brandon and Deng, Myra and Gorton, Liv and Nguyen, Nam and Wang, Nicholas K. and Pearce, Michael T. and Simon, Elana and Adams, Etowah and Amador, Zachary J. and Ashley, Euan A. and Baccus, Stephen A. and Dai, Haoyu and Dillmann, Steven and Ermon, Stefano and Guo, Daniel and Herschl, Michael H. and Ilango, Rajesh and Janik, Ken and Lu, Amy X. and Mehta, Reshma and Mofrad, Mohammad R. K. and Ng, Madelena Y. and Pannu, Jaspreet and {R{\'e}}, Christopher and St. John, John and Sullivan, Jeremy and Tey, Joseph and Viggiano, Ben and Zhu, Kevin and Zynda, Greg and Balsam, Daniel and Collison, Patrick and Costa, Anthony B. and Hernandez-Boussard, Tina and Ho, Eric and Liu, Ming-Yu and McGrath, Thomas and Powell, Kimberly and Pinglay, Sudarshan and Burke, Dave P. and Goodarzi, Hani and Hsu, Patrick D. and Hie, Brian L.},
journal = {Nature},
year = {2026},
doi = {10.1038/s41586-026-10176-5}
}
Credits
Original Evo 2 model and code by Brixi et al. (arcinstitute/evo2, Zymrael/vortex). Source checkpoint: arcinstitute/evo2_7b.
The HuggingFace conversion code in this repo was authored primarily by Claude and reviewed manually by Taykhoom Dalal.
License
Apache 2.0, following the original Evo 2 release.