---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- biology
- genomics
- long-context
---

# GENERator-eukaryote-3b-base model

## **Important Notice**

If you are using GENERator for **sequence generation**, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:

1. Padding the sequence on the left with `'A'` (**left padding**);
2. Truncating the sequence from the left (**left truncation**).

This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer appends an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence, which can lead to uninformative subsequent generations, such as repeated `'AAAAAA'`.

We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
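The two options above can be sketched in plain Python (the helper names here are illustrative, not part of the library):

```python
def left_pad_to_multiple_of_6(sequence: str, padding_char: str = "A") -> str:
    """Option 1: pad on the left with 'A' until the length is a multiple of 6."""
    remainder = len(sequence) % 6
    return padding_char * (6 - remainder) + sequence if remainder else sequence

def left_truncate_to_multiple_of_6(sequence: str) -> str:
    """Option 2: drop leading bases until the length is a multiple of 6."""
    return sequence[len(sequence) % 6:]

seq = "ATGAGGTGGCAAGAAATGGGCTACAT"  # 26 bases, not a multiple of 6
print(left_pad_to_multiple_of_6(seq))       # 30 bases: 'AAAA' + seq
print(left_truncate_to_multiple_of_6(seq))  # 24 bases: seq[2:]
```

Either way, every 6 bases then map cleanly onto one 6-mer token and no `'<oov>'` token is produced.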

## About

In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. This extensive and diverse pre-training data endows GENERator with enhanced understanding and generation capabilities across a wide range of organisms.

For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://arxiv.org/abs/2502.07272). The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERator](https://github.com/GenerTeam/GENERator).
## How to use

### Simple example 1: generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

def left_padding(sequence, padding_char='A', multiple=6):
    """Pad on the left with `padding_char` until the length is a multiple of `multiple`."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        padding_length = multiple - remainder
        return padding_char * padding_length + sequence
    return sequence

def left_truncation(sequence, multiple=6):
    """Drop leading characters until the length is a multiple of `multiple`."""
    remainder = len(sequence) % multiple
    if remainder != 0:
        return sequence[remainder:]
    return sequence

# Apply left padding to all sequences:
# padded_sequences = [left_padding(seq) for seq in sequences]

# Apply left truncation to all sequences.
truncated_sequences = [left_truncation(seq) for seq in sequences]

# Prepend the BOS token to each sequence.
sequences = [tokenizer.bos_token + sequence for sequence in truncated_sequences]

# Tokenize the sequences.
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

# Decode and print the generated sequences.
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(decoded_sequences)

# Expect uninformative decoded sequences (e.g., 'AAAAAA'):
# these input sequences are too short to provide sufficient context.
```

### Simple example 2: embedding

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

# Get the model configuration.
config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Truncate each sequence to the nearest multiple of 6 and prepend the BOS token.
processed_sequences = [tokenizer.bos_token + seq[:len(seq) // 6 * 6] for seq in sequences]

# Tokenization.
tokenizer.padding_side = "right"
inputs = tokenizer(
    processed_sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Model inference.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

hidden_states = outputs.hidden_states[-1]
attention_mask = inputs["attention_mask"]

# Option 1: embedding of the last non-padding token (EOS).
last_token_indices = attention_mask.sum(dim=1) - 1
eos_embeddings = hidden_states[torch.arange(hidden_states.size(0)), last_token_indices, :]

# Option 2: mean pooling over all non-padding tokens.
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

# Output.
print("EOS (last token) embeddings:", eos_embeddings)
print("Mean pooling embeddings:", mean_embeddings)

# Notes:
# - The preprocessing step keeps sequence lengths at multiples of 6 for the 6-mer tokenizer.
# - For a causal LM, the last-token (EOS) embedding is commonly used.
# - Mean pooling averages over all non-padding tokens, including BOS and content tokens.
# - The choice between the two depends on your downstream task.
# - Both methods handle variable sequence lengths via the attention mask.
```
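The masked mean pooling above sums the hidden states at non-padding positions and divides by the number of real tokens. A minimal pure-Python sketch of that arithmetic, with toy numbers in place of real hidden states:

```python
# Toy example: 4 token positions, hidden size 2; the last position is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

# Sum hidden states where the mask is 1, then divide by the count of real tokens.
n_real = sum(attention_mask)
mean_embedding = [
    sum(h[d] * m for h, m in zip(hidden_states, attention_mask)) / n_real
    for d in range(2)
]
print(mean_embedding)  # [3.0, 4.0]; the padded position is ignored
```

This is exactly what the `expanded_mask` multiplication achieves in the batched torch version: padded positions contribute zero to both the numerator and the denominator.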

## Citation

```
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}
```