---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- biology
- genomics
- long-context
arxiv: 2502.07272
---

# GENERator-v2-eukaryote-1.2b-base model

## **Important Notice**
If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:
1. Padding the sequence on the left with `'A'` (**left padding**);
2. Truncating the sequence from the left (**left truncation**).

This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer will append an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated `'AAAAAA'`.

We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
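
To make the rule concrete, here is a minimal sketch (using this repository's tokenizer; the exact printed tokens depend on its vocabulary) that tokenizes a 23 nt sequence and its 24 nt counterpart, so you can inspect how the trailing partial 6-mer is handled:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "GenerTeam/GENERator-v2-eukaryote-1.2b-base", trust_remote_code=True
)

# 23 nt: not a multiple of 6, so the trailing 5 nt cannot map to a
# regular 6-mer token and is expected to surface as '<oov>'.
print(tokenizer.tokenize("ATGAGGTGGCAAGAAATGGGCTA"))

# 24 nt: a multiple of 6, so the sequence splits cleanly into 6-mers.
print(tokenizer.tokenize("ATGAGGTGGCAAGAAATGGGCTAC"))
```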

## About
In this repository, we present GENERator-v2, a generative genomic foundation model with enhanced performance in the eukaryotic domain. More technical details are coming soon...

Python scripts for downstream analysis are available on GitHub: [https://github.com/GenerTeam/GENERator](https://github.com/GenerTeam/GENERator).

## How to use
### Simple example 1: generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-v2-eukaryote-1.2b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-v2-eukaryote-1.2b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

def left_padding(sequence, padding_char='A', multiple=6):
    remainder = len(sequence) % multiple
    if remainder != 0:
        padding_length = multiple - remainder
        return padding_char * padding_length + sequence
    return sequence

def left_truncation(sequence, multiple=6):
    remainder = len(sequence) % multiple
    if remainder != 0:
        return sequence[remainder:]
    return sequence

# Apply left_padding to all sequences
# padded_sequences = [left_padding(seq) for seq in sequences]

# Apply left_truncation to all sequences
truncated_sequences = [left_truncation(seq) for seq in sequences]

# Prepend the BOS token to each sequence
sequences = [tokenizer.bos_token + sequence for sequence in truncated_sequences]

# Tokenize the sequences
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences
print(decoded_sequences)

# Expect uninformative decoded sequences (e.g., 'AAAAAA') here:
# these input sequences are too short to provide sufficient context.
```
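
The generation call above is effectively greedy decoding (`top_k=1` with near-zero temperature, and `do_sample` defaults to `False`). If you want diverse candidate sequences instead, the standard `transformers` sampling arguments apply. The sketch below continues from the variables defined above; the sampling values are illustrative, not tuned recommendations for this model:

```python
# Sampled generation: enable multinomial sampling explicitly.
with torch.inference_mode():
    sampled = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,          # sample instead of greedy decoding
        temperature=0.8,         # soften the next-token distribution
        top_p=0.95,              # nucleus sampling over the top 95% probability mass
        num_return_sequences=2,  # two candidate continuations per input
    )
print(tokenizer.batch_decode(sampled, skip_special_tokens=True))
```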

### Simple example 2: embedding

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-v2-eukaryote-1.2b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-v2-eukaryote-1.2b-base")

# Get model configuration
config = model.config
max_length = config.max_position_embeddings

# Define input sequences
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Prepend the BOS token and truncate each sequence to the nearest multiple of 6
processed_sequences = [tokenizer.bos_token + seq[:len(seq)//6*6] for seq in sequences]

# Tokenization
tokenizer.padding_side = "right"
inputs = tokenizer(
    processed_sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Model inference
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

hidden_states = outputs.hidden_states[-1]
attention_mask = inputs["attention_mask"]

# Option 1: Last-token (EOS) embedding
last_token_indices = attention_mask.sum(dim=1) - 1
eos_embeddings = hidden_states[torch.arange(hidden_states.size(0)), last_token_indices, :]

# Option 2: Mean pooling over all non-padding tokens
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

# Output
print("EOS (Last Token) Embeddings:", eos_embeddings)
print("Mean Pooling Embeddings:", mean_embeddings)

# Notes:
# - The preprocessing step keeps sequence lengths at multiples of 6 for the 6-mer tokenizer.
# - For a causal LM, the last-token (EOS) embedding is commonly used.
# - Mean pooling averages over all non-padding tokens, including BOS and content tokens.
# - The choice between the two depends on your downstream task.
# - Both methods handle variable sequence lengths via the attention mask.
```
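
As a quick downstream usage sketch, either embedding can feed a similarity computation. The lines below continue from the variables above and compare the two input sequences with cosine similarity (the printed numbers illustrate the API, they are not a benchmark):

```python
import torch.nn.functional as F

# Cosine similarity between the two sequences' embeddings.
eos_sim = F.cosine_similarity(eos_embeddings[0], eos_embeddings[1], dim=0)
mean_sim = F.cosine_similarity(mean_embeddings[0], mean_embeddings[1], dim=0)
print(f"EOS-embedding similarity: {eos_sim.item():.4f}")
print(f"Mean-pooled similarity:   {mean_sim.item():.4f}")
```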

## Citation
```
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}
```