Evo1-1-7B-8K

A clean, minimal HuggingFace port of Evo 1 (8k), the original ~7B-parameter StripedHyena DNA foundation model. It natively supports layer-by-layer hidden-state extraction, attention-weight extraction, and a runtime-switchable attention backend.

Why this port?

togethercomputer/evo-1-8k-base ships a trust_remote_code HF implementation but it has four gaps that force every downstream user to monkey-patch the model:

  1. output_hidden_states=True has no effect: the hidden_states output is hardcoded to None, so extracting intermediate embeddings requires forward hooks.
  2. output_attentions=True is unsupported (flash-attn discards the (B, H, T, T) matrix; users must patch the attention module).
  3. attn_implementation cannot be switched at load time - flash_attn is mandatory at every attention layer.
  4. The bare backbone is not exposed via AutoModel.from_pretrained; only the LM-head wrapper exists.

This repo fixes all four. The math is bit-exact with the togethercomputer reference (max_abs_diff = 0.000e+00 at every layer; see Parity Verification).

Architecture

| Parameter | Value |
|---|---|
| Total parameters | ~7B |
| Layers | 32 |
| Attention heads | 32 |
| Embedding dimension | 4096 |
| Inner MLP size | 10928 |
| Vocabulary size | 512 (UTF-8 byte-level) |
| Attention layer indices | [8, 16, 24] |
| Hyena layer indices | all others |
| Hyena state size | 8 |
| Positional encoding | RoPE (base = 10000) |
| Architecture | StripedHyena (Hyena blocks interleaved with MHA blocks) |
| Max sequence length | 8,192 |
| Training dtype | bfloat16 (Hyena modal-form poles / residues kept in fp32) |
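
The layer layout implied by the table can be sketched in a few lines. This is a plain-Python illustration, not code from this repo: the names NUM_LAYERS, ATTN_LAYER_IDXS, and layer_types are hypothetical, and the attention indices are assumed to be 0-based.

```python
NUM_LAYERS = 32                  # "Layers" row of the table
ATTN_LAYER_IDXS = {8, 16, 24}    # "Attention layer indices" row (assumed 0-based)

def layer_types(num_layers=NUM_LAYERS, attn_idxs=ATTN_LAYER_IDXS):
    """Return the block type at each depth: 'attention' or 'hyena'."""
    return ["attention" if i in attn_idxs else "hyena" for i in range(num_layers)]

layout = layer_types()
print(layout.count("hyena"), "Hyena blocks,", layout.count("attention"), "attention blocks")
# 29 Hyena blocks, 3 attention blocks
```

Under these assumptions, only 3 of the 32 blocks contain an attention matrix; the rest are Hyena blocks.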

Pretraining

  • Objective: causal byte-level next-token prediction.
  • Data: OpenGenome, ~300B tokens of prokaryotic whole-genome DNA.
  • Source checkpoint: togethercomputer/evo-1-8k-base@1.1_fix.

Parity Verification

Hidden-state representations verified bit-exact (max_abs_diff = 0.000e+00) to the togethercomputer reference at all 34 representation levels (token embedding + each of the 32 blocks + final RMSNorm), using attn_implementation="flash_attention_2" in bf16 (matches the reference's backend choice and the trained dtype). Logits from Evo1ForCausalLM were also verified bit-exact (top-1 agreement: 128/128 positions). Verified on H100 with PyTorch 2.7.1 / CUDA 12.9.
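
The two metrics quoted above can be restated in plain Python. This is a minimal sketch, not the verification script used for this repo: the helper names max_abs_diff and top1_agreement are hypothetical, the inputs are toy lists rather than model outputs, and a real check would apply the equivalent tensor operations to each hidden-state level.

```python
def max_abs_diff(a, b):
    """Largest elementwise |a_i - b_i| over two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

def top1_agreement(logits_a, logits_b):
    """Count positions whose argmax over the vocabulary matches.

    Each argument is a list of per-position logit rows.
    Returns (number of matching positions, total positions).
    """
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    matches = sum(argmax(ra) == argmax(rb) for ra, rb in zip(logits_a, logits_b))
    return matches, len(logits_a)

# Toy values for illustration only; these are not model outputs.
ref = [0.5, -1.25, 2.0]
port = [0.5, -1.25, 2.0]
print(max_abs_diff(ref, port))                     # 0.0 (no elementwise difference)
print(top1_agreement([[0.1, 0.9], [0.7, 0.3]],
                     [[0.2, 0.8], [0.6, 0.4]]))    # (2, 2)
```

Top-1 agreement is the weaker test of the two: the second toy pair has different logits but still agrees at every position, whereas max_abs_diff of 0.0 requires identical values.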

Numerical equivalence across attention backends

flash_attention_2 is bit-exact with the original togethercomputer / evo-design implementations (same CUDA kernel). The sdpa and eager backends use different kernels (PyTorch's bundled flash kernel and pure-PyTorch matmul, respectively); these compute mathematically equivalent attention but accumulate floating-point operations in slightly different orders, producing per-block diffs at the bf16 noise floor (relative error roughly 1e-4 to 1e-2).
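
The effect of accumulation order can be reproduced without any model. The sketch below uses Python's float64 purely as an illustration; bf16 carries far fewer mantissa bits, so the same effect is much larger there. The values are arbitrary.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, same mathematical sum, two association orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)                              # False
print(abs(left_to_right - right_to_left))                          # tiny, but not zero
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-12))   # True
```

Kernels that reduce the same attention scores in a different order are in the same position: equal up to rounding, but not bit-identical.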

Unlike a standard transformer, where attention is softmax-bounded and per-block diffs stay small through the stack, StripedHyena's Hyena layers use an unbounded-gain IIR filter (no softmax) - so any small per-attention-block diff gets amplified by Hyena's filter gain. Across 32 layers this compounds to ~1% relative error in the intermediate residual stream, though the final post-RMSNorm output is bounded. Use flash_attention_2 if you need to match the reference's activations bit-for-bit.
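
A toy recurrence shows why per-layer gain matters for the compounding argument above. It is a deliberately simplified sketch: propagate_error is a hypothetical helper, it treats every layer as multiplying the relative error by one constant, and both gains are arbitrary illustrative numbers, not values measured from Evo 1.

```python
def propagate_error(initial_rel_err, per_layer_gain, num_layers):
    """Relative error after num_layers if each layer scales it by per_layer_gain."""
    err = initial_rel_err
    for _ in range(num_layers):
        err *= per_layer_gain
    return err

# Both gains below are arbitrary illustrative numbers, not measured from Evo 1.
bounded = propagate_error(1e-4, per_layer_gain=1.0, num_layers=32)
amplified = propagate_error(1e-4, per_layer_gain=1.15, num_layers=32)
print(f"gain 1.00: {bounded:.1e}")
print(f"gain 1.15: {amplified:.1e}")
```

With a gain of exactly 1 the error stays at its initial size, while a gain only modestly above 1 grows it by well over an order of magnitude across 32 layers.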

Related Models

See the full Evo1 collection on the Hub.

| Model | Context | Notes |
|---|---|---|
| Taykhoom/Evo1-1-7B-8K | 8,192 | Original Evo 1 base model (8k context). |
| Taykhoom/Evo1-1-7B-131K | 131,072 | Long-context Evo 1 with linearly-scaled RoPE (131k context). |
| Taykhoom/Evo1-1.5-7B-8K | 8,192 | Evo 1.5: Evo 1 (8k) further trained on ~50% more pretraining tokens. |

Usage

Note on dtype. Evo1 was trained in bfloat16, with the Hyena poles / residues (modal-form filter parameters) kept in fp32 for numerical stability. Passing dtype=... to from_pretrained only affects the initial load precision (peak memory during loading) and does not change the inference dtype - Evo1Model.__init__ and Evo1ForCausalLM.__init__ unconditionally call to_bfloat16_except_poles_residues(), so the model always runs in bf16 with poles/residues in fp32. This is intentional: the trained activations are bf16-stable and fp16-unstable, and the modal-form filter requires fp32 for numerical stability - a single mixed config is the only valid one.

Note on attention backend. By default, from_pretrained selects attn_implementation="sdpa" (PyTorch's bundled scaled-dot-product-attention) - this works out of the box without flash_attn installed. The original togethercomputer / evo-design implementations use flash_attn unconditionally; for bit-exact reproduction of reference outputs, explicitly pass attn_implementation="flash_attention_2" (and pip install flash-attn). See Numerical equivalence across attention backends for the magnitude of the difference.
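
If a script should prefer flash_attention_2 when it is available and otherwise use the default, the choice can be made before loading. This is a minimal sketch, not part of this repo: pick_attn_implementation is a hypothetical helper, and it only checks that the flash_attn package can be found, not that its CUDA kernels work on the current GPU.

```python
import importlib.util

def pick_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash else "sdpa"

attn_impl = pick_attn_implementation()
print(attn_impl)
# Pass the result to from_pretrained via attn_implementation=attn_impl.
```

Automatic selection trades reproducibility for convenience: two machines may end up on different backends and therefore produce slightly different activations. Pin the backend explicitly when results must be comparable.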

Embedding generation (no LM head)

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # bit-exact with reference; or omit to default to "sdpa"
).cuda().eval()

seqs = ["ACGTACGTACGT", "GGGTTTAAACCC"]
inputs = tokenizer(seqs, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last_hidden  = out.last_hidden_state  # (B, T, 4096)
all_layers   = out.hidden_states      # tuple of (B, T, 4096), len = 34 (embed + 32 blocks + post-norm)
layer_12_emb = all_layers[12]         # often used as a "middle" representation

LM logits

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).cuda().eval()

inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits   # (1, T, 512)

Generation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).cuda().eval()

inputs = tokenizer(["ACGT"], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_k=4, temperature=1.0)
print(tokenizer.decode(out[0]))

generation_config.json ships with eos_token_id = 0 (the EOD byte) and pad_token_id = 1 so model.generate() stops naturally at the trained end-of-document token without needing extra kwargs. Note that the tokenizer itself does not add an EOS at encoding time - this matches the original Evo1 inference pipeline (only generation stops on EOS; embedding/scoring uses raw byte input).
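
What byte-level encoding without an appended EOS looks like can be sketched in plain Python. This is not the code in tokenization_evo1.py: encode_bytes and pad_batch are hypothetical helpers, and the sketch assumes that each token id equals the UTF-8 byte value and that padding is applied on the right.

```python
EOS_ID = 0   # EOD byte, per generation_config.json as described above
PAD_ID = 1   # pad token id, per generation_config.json as described above

def encode_bytes(seq):
    """Map a sequence to its UTF-8 byte values; no EOS is appended."""
    return list(seq.encode("utf-8"))

def pad_batch(batch_ids, pad_id=PAD_ID):
    """Right-pad every id list to the longest length in the batch (assumed side)."""
    width = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch_ids]

batch = pad_batch([encode_bytes("ACGT"), encode_bytes("AC")])
print(batch)                 # [[65, 67, 71, 84], [65, 67, 1, 1]]
print(EOS_ID in batch[0])    # False: encoding never adds the EOD byte
```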

Attention weights

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="eager",  # eager materializes the attention matrix; sdpa / flash fall back to eager when output_attentions=True (see Implementation Notes)
).cuda().eval()

inputs = tokenizer(["ACGTACGT"], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of length 32. Entries at indices not in [8, 16, 24]
# are None (Hyena blocks have no attention matrix). Entries at [8, 16, 24] are
# (B, num_heads, T, T) tensors.
attn_block_8 = out.attentions[8]
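
Because most entries of out.attentions are None, it is often convenient to keep only the populated ones. The sketch below uses a stand-in tuple so that it runs without the model; attention_by_layer is a hypothetical helper and the strings merely take the place of the real (B, num_heads, T, T) tensors.

```python
def attention_by_layer(attentions):
    """Map layer index -> attention entry, skipping the None placeholders."""
    return {i: a for i, a in enumerate(attentions) if a is not None}

# Stand-in for out.attentions: strings replace the real (B, H, T, T) tensors.
fake_attentions = tuple(
    f"attn@{i}" if i in (8, 16, 24) else None for i in range(32)
)
per_layer = attention_by_layer(fake_attentions)
print(sorted(per_layer))   # [8, 16, 24]
```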

Multi-GPU loading (optional)

Loading via accelerate's device_map is supported (_no_split_modules is set so each AttentionBlock / ParallelGatedConvBlock stays atomic on one device, with hidden state automatically transferred across device boundaries):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/Evo1-1-7B-8K", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "Taykhoom/Evo1-1-7B-8K",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto",     # auto-shard across all visible GPUs; falls back to single GPU if only one is present
).eval()

Requires pip install accelerate.

Fine-tuning

Standard HuggingFace conventions apply. For sequence-level tasks, pool last_hidden_state (or any intermediate hidden_states[i]) over the sequence dimension and feed the result into a downstream head.
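
One common way to turn per-position hidden states into a single sequence-level vector is a mean over the non-padding positions. The sketch below restates that in plain Python on toy lists; it is an assumption about a reasonable pooling choice, not something this repo prescribes. masked_mean_pool is a hypothetical helper, and in practice the same computation would be done with tensor operations on one batch element's hidden states and attention mask.

```python
def masked_mean_pool(hidden, mask):
    """Average the hidden vectors at positions where mask == 1.

    hidden: list of T vectors, each a list of D floats.
    mask:   list of T ints (1 = real token, 0 = padding).
    """
    kept = [vec for vec, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy input: T = 3 positions, D = 2 features, last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))   # [2.0, 3.0]
```

Masking matters because padded positions would otherwise pull the average toward whatever the model emits for pad tokens, as the 100.0 values in the toy input would.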

Implementation Notes

  • Custom attention module (attention.py). Replaces flash_attn.modules.mha.MHA with a small in-repo MHA class that supports attn_implementation="eager" / "sdpa" / "flash_attention_2". Parameter names (Wqkv, out_proj, rotary_emb.inv_freq) are preserved so existing checkpoints load unchanged. When output_attentions=True, the sdpa and flash paths automatically fall back to eager so the attention matrix is materialized.
  • Custom rotary embedding (rotary.py). When flash_attn is installed, we delegate to its Triton kernel (faster on long sequences). The pure-PyTorch fallback performs the rotary multiply in fp32 internally (then casts back) so that its results are bit-identical to the Triton kernel's; a bf16 multiply here introduces ~3e-2 error per layer that compounds to ~1% relative error across 32 layers.
  • Hyena engine (engine.py). Copied verbatim from the togethercomputer reference (FFT-based long convolution, modal-form prefill).
  • Cache subclass (cache.py). Evo1Cache(transformers.cache_utils.Cache) wraps the two block-type-specific inference param dataclasses (InferenceParams for attention KV cache, RecurrentInferenceParams for Hyena FIR window + IIR modal state). Exposes get_seq_length() / get_max_cache_shape() so HF's model.generate() can introspect cache state; falls through to cache["mha"] / cache["hyena"] for the model internals.
  • Tokenizer (tokenization_evo1.py). Byte-level UTF-8 with vocab_size = 512. Pad token is byte \x01. No CLS, no EOS appended at encoding time (matches original Evo1 inference). The _decode method is numpy-2.x compatible (the original np.uint8.clip(min=32, max=512) was an overflow on numpy 2).
  • Dependencies. torch, transformers, numpy, safetensors, huggingface_hub (only for from_pretrained downloads). flash_attn is only required if you pass attn_implementation="flash_attention_2".

Citation

@article{nguyen2024_evo,
  title   = {Sequence modeling and design from molecular to genome scale with {Evo}},
  author  = {Nguyen, Eric and Poli, Michael and Durrant, Matthew G. and Kang, Brian and Katrekar, Dhruva and Li, David B. and Bartie, Liam J. and Thomas, Armin W. and King, Samuel H. and Brixi, Garyk and Sullivan, Jeremy and Ng, Madelena Y. and Lewis, Ashley and Lou, Aaron and Ermon, Stefano and Baccus, Stephen A. and Hernandez-Boussard, Tina and {R{\'e}}, Christopher and Hsu, Patrick D. and Hie, Brian L.},
  journal = {Science},
  volume  = {386},
  number  = {6723},
  pages   = {eado9336},
  year    = {2024},
  doi     = {10.1126/science.ado9336}
}

Credits

Original Evo1 model and code by Nguyen et al. Source repo: evo-design/evo. Source checkpoint: togethercomputer/evo-1-8k-base.

The HuggingFace conversion code in this repo was authored primarily by Claude and reviewed manually by Taykhoom Dalal.

License

Apache 2.0, following the original Evo1 release.
