---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- biology
- genomics
- long-context
---
# GENERator-eukaryote-3b-base model
## **Important Notice**
If you are using GENERator for **sequence generation**, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:
1. Padding the sequence on the left with `'A'` (**left padding**);
2. Truncating the sequence from the left (**left truncation**).
This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer will append an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated `'AAAAAA'`.
We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
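
The effect of the length requirement can be illustrated with a toy splitter. This is only a sketch, assuming non-overlapping 6-mers read from the left and a trailing fragment that becomes `'<oov>'`, as described above. The helper names `toy_kmer_tokens`, `left_pad`, and `left_truncate` are hypothetical and are not part of the GENERator tokenizer.

```python
def toy_kmer_tokens(sequence, k=6, oov="<oov>"):
    """Toy illustration (NOT the real tokenizer): split into non-overlapping
    k-mers from the left; a trailing fragment shorter than k becomes `oov`."""
    full_length = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, full_length, k)]
    if len(sequence) % k:
        tokens.append(oov)
    return tokens


def left_pad(sequence, k=6, pad_char="A"):
    """Prepend pad_char until the length is a multiple of k."""
    return pad_char * (-len(sequence) % k) + sequence


def left_truncate(sequence, k=6):
    """Drop leading bases until the length is a multiple of k."""
    return sequence[len(sequence) % k:]


seq = "ATGAGGTGGCAAGAAATGGGCTACGT"  # 26 bases, so 26 % 6 == 2

print(toy_kmer_tokens(seq))
# ['ATGAGG', 'TGGCAA', 'GAAATG', 'GGCTAC', '<oov>']
print(toy_kmer_tokens(left_pad(seq)))
# ['AAAAAT', 'GAGGTG', 'GCAAGA', 'AATGGG', 'CTACGT']
print(toy_kmer_tokens(left_truncate(seq)))
# ['GAGGTG', 'GCAAGA', 'AATGGG', 'CTACGT']
```

Both fixes operate on the left end, so the final 6-mer always ends on the last base of the prompt and generation continues from a complete token.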
## About
In this repository, we present GENERator, a generative genomic foundation model with a context length of 98k base pairs and 3B parameters, trained on 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data give GENERator enhanced understanding and generation capabilities across a wide range of organisms.
For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://arxiv.org/abs/2502.07272). The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERator](https://github.com/GenerTeam/GENERator).
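
Because each token covers 6 bases, lengths in tokens and in base pairs differ by a factor of 6. The sketch below shows the conversion, ignoring special tokens such as BOS. The value `16_384` is an assumption chosen because it is consistent with a 98k bp context; the authoritative limit is `config.max_position_embeddings` of the loaded model. The helper names are hypothetical.

```python
K = 6  # bases per token (6-mer tokenizer)


def tokens_to_bp(num_tokens, k=K):
    """Number of bases covered by num_tokens k-mer tokens."""
    return num_tokens * k


def bp_to_tokens(num_bp, k=K):
    """Tokens needed for num_bp bases once the sequence is left-padded
    up to a multiple of k (ceiling division)."""
    return -(-num_bp // k)


# ASSUMPTION: illustrative token limit; read the real one from
# config.max_position_embeddings.
assumed_max_tokens = 16_384

print(tokens_to_bp(assumed_max_tokens))  # 98304
print(tokens_to_bp(32))                  # 192
print(bp_to_tokens(1000))                # 167
```

The same arithmetic applies to generation: `max_new_tokens=32` corresponds to up to 192 generated bases.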
## How to use
### Simple example 1: generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config
max_length = config.max_position_embeddings
# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

def left_padding(sequence, padding_char='A', multiple=6):
    """Left-pad the sequence so that its length is a multiple of `multiple`."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        padding_length = multiple - remainder
        return padding_char * padding_length + sequence
    return sequence

def left_truncation(sequence, multiple=6):
    """Drop leading bases so that the length is a multiple of `multiple`."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        return sequence[remainder:]
    return sequence

# Apply left_padding to all sequences
# padded_sequences = [left_padding(seq) for seq in sequences]
# Apply left_truncation to all sequences
truncated_sequences = [left_truncation(seq) for seq in sequences]
# Process the sequences
sequences = [tokenizer.bos_token + sequence for sequence in truncated_sequences]
# Tokenize the sequences
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
# Generate the sequences
# (top_k=1 with a near-zero temperature makes decoding effectively greedy)
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)
# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Print the decoded sequences
print(decoded_sequences)
# Nonsensical decoded sequences (e.g., 'AAAAAA') are expected here:
# the input sequences are too short to provide sufficient context.
```
### Simple example 2: embedding
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
# Get model configuration
config = model.config
max_length = config.max_position_embeddings
# Define input sequences
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]
# Truncate each sequence from the right to the largest multiple of 6
# that does not exceed its length, then prepend the BOS token
processed_sequences = [tokenizer.bos_token + seq[:len(seq)//6*6] for seq in sequences]
# Tokenization
tokenizer.padding_side = "right"
inputs = tokenizer(
    processed_sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
# Model inference
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)
hidden_states = outputs.hidden_states[-1]
attention_mask = inputs["attention_mask"]
# Option 1: Last token (EOS) embedding
last_token_indices = attention_mask.sum(dim=1) - 1
eos_embeddings = hidden_states[torch.arange(hidden_states.size(0)), last_token_indices, :]
# Option 2: Mean pooling over all tokens
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)
# Output
print("EOS (Last Token) Embeddings:", eos_embeddings)
print("Mean Pooling Embeddings:", mean_embeddings)
# ============================================================================
# Additional notes:
# - The preprocessing step ensures sequence lengths are multiples of 6 for the 6-mer tokenizer
# - For a causal LM, the last token embedding (EOS) is commonly used
# - Mean pooling averages over all non-padding tokens, including BOS and content tokens
# - The choice depends on your downstream task requirements
# - Both methods handle variable sequence lengths via attention mask
# ============================================================================
```
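
The two pooling options above can be checked on a tiny hand-made example. The following is a dependency-free sketch for a single right-padded sequence, using plain Python lists in place of tensors. The function names `masked_mean` and `last_real_token` are hypothetical and only mirror the logic of the PyTorch code above.

```python
def masked_mean(hidden, mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]


def last_real_token(hidden, mask):
    """Vector of the last non-padding token, assuming right padding
    (all 1s in the mask come before all 0s)."""
    return hidden[sum(mask) - 1]


# Three token positions with 2-dimensional hidden states;
# the third position is padding and holds arbitrary values.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(hidden, mask))      # [2.0, 3.0]
print(last_real_token(hidden, mask))  # [3.0, 4.0]
```

Without the mask, the padding vector `[100.0, 100.0]` would dominate the average, which is why the attention mask is applied before pooling.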
## Citation
```
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}
```