---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- biology
- genomics
- long-context
---

# GENERator-eukaryote-3b-base model

## About

In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. This extensive and diverse pre-training data endows GENERator with enhanced understanding and generation capabilities across a wide range of organisms.
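
These figures can be checked against the released checkpoint. Below is a minimal sketch using standard `transformers` config fields; note that `max_position_embeddings` counts tokens, and because the tokenizer groups bases into k-mers, the token-level window is smaller than the 98k base-pair figure.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Inspect the released config (standard transformers field names assumed).
config = AutoConfig.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
print("Context length (tokens):", config.max_position_embeddings)

# Count parameters (~3B) by loading the weights; this downloads the full checkpoint.
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
print("Parameters:", sum(p.numel() for p in model.parameters()))
```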

For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://huggingface.co/papers/2502.07272).

Code: https://github.com/GenerTeam/GENERator

## How to use

### Simple example 1: generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Prepend the BOS token to each sequence.
sequences = [tokenizer.bos_token + sequence for sequence in sequences]

# Tokenize the sequences.
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate continuations with greedy decoding.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode and print the generated sequences.
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(decoded_sequences)

# Expect nonsensical continuations (e.g., runs of 'AAAAAA'):
# these inputs are too short to provide sufficient context.
```
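
For more varied outputs from longer prompts, sampling can be enabled instead of greedy decoding. A minimal sketch reusing `inputs` from above; the `temperature` and `top_p` values are illustrative assumptions, not tuned recommendations:

```python
# Sampling-based generation (illustrative, untuned settings).
with torch.inference_mode():
    sampled = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,    # enable multinomial sampling
        temperature=0.8,   # assumed illustrative value
        top_p=0.95,        # nucleus sampling cutoff
    )
print(tokenizer.batch_decode(sampled, skip_special_tokens=True))
```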

### Simple example 2: embedding

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize with add_special_tokens=True so that special tokens,
# such as the BOS and EOS tokens, are added at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to find the index of the last token in each sequence.
# With add_special_tokens=True and right padding, the last non-padding token is the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor of shape (batch_size, hidden_size).
seq_embeddings = torch.stack(seq_embeddings)

print("Sequence Embeddings:", seq_embeddings)
```
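
The EOS-token embedding is one pooling choice; mean pooling over non-padding positions is a common alternative summary. A minimal sketch reusing `hidden_states` and `attention_mask` from above (the pooling strategy is our assumption, not prescribed by the model):

```python
# Mean pooling over non-padding tokens (alternative, illustrative pooling strategy).
mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)            # (batch, seq_len, 1)
mean_embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
print("Mean-pooled embeddings shape:", mean_embeddings.shape)
```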

## Citation

```bibtex
@misc{wu2025generator,
  title={GENERator: A Long-Context Generative Genomic Foundation Model},
  author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
  year={2025},
  eprint={2502.07272},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.07272},
}
```