	---
	library_name: transformers
	license: mit
	pipeline_tag: text-generation
	tags:
	- biology
	- genomics
	- long-context
	---

	# GENERator-eukaryote-3b-base model

	## Important Notice
	If you are using GENERator for sequence generation, please ensure that the length of each input sequence is a multiple of 6. This can be achieved by either:
	1. Padding the sequence on the left with `'A'` (left padding);
	2. Truncating the sequence from the left (left truncation).

	This requirement arises because GENERator employs a 6-mer tokenizer. If the input sequence length is not a multiple of 6, the tokenizer will append an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated `'AAAAAA'`.

	We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
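
To make the constraint concrete, the sketch below splits a sequence into non-overlapping 6-mers in plain Python. It only illustrates the behavior described above and is not the actual GENERator tokenizer: the helper name `split_into_6mers` is hypothetical, and the literal `'<oov>'` string stands in for whatever token the real tokenizer emits for leftover bases.

```python
def split_into_6mers(sequence, k=6, oov_token="<oov>"):
    """Toy stand-in for a non-overlapping k-mer tokenizer (hypothetical helper)."""
    # Length of the prefix that divides evenly into k-mers.
    full_length = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, full_length, k)]
    # Any leftover bases cannot form a complete k-mer and become one OOV token.
    if full_length != len(sequence):
        tokens.append(oov_token)
    return tokens


aligned = split_into_6mers("ATGAGGTGGCAA")      # 12 bases: two complete 6-mers
unaligned = split_into_6mers("ATGAGGTGGCAAGA")  # 14 bases: two 6-mers plus 2 leftover bases
print(aligned)
print(unaligned)
```

In this toy version the 14-base input ends in an `'<oov>'` token, which is the situation that left padding or left truncation is meant to avoid.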


	## About
	In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data endow the GENERator with enhanced understanding and generation capabilities across various organisms.

	For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://arxiv.org/abs/2502.07272). The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERator](https://github.com/GenerTeam/GENERator).


	## How to use
	### Simple example 1: generation

	```python

	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load the tokenizer and model.
	tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
	config = model.config

	max_length = config.max_position_embeddings

	# Define input sequences.
	sequences = [
	    "ATGAGGTGGCAAGAAATGGGCTAC",
	    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
	]

	def left_padding(sequence, padding_char='A', multiple=6):
	    """Left-pad the sequence so that its length is a multiple of `multiple`."""
	    remainder = len(sequence) % multiple
	    if remainder != 0:
	        padding_length = multiple - remainder
	        return padding_char * padding_length + sequence
	    return sequence

	def left_truncation(sequence, multiple=6):
	    """Drop leading bases so that the length is a multiple of `multiple`."""
	    remainder = len(sequence) % multiple
	    if remainder != 0:
	        return sequence[remainder:]
	    return sequence

	# Apply left_padding to all sequences
	# padded_sequences = [left_padding(seq) for seq in sequences]

	# Apply left_truncation to all sequences
	truncated_sequences = [left_truncation(seq) for seq in sequences]

	# Process the sequences
	sequences = [tokenizer.bos_token + sequence for sequence in truncated_sequences]

	# Tokenize the sequences
	tokenizer.padding_side = "left"
	inputs = tokenizer(
	    sequences,
	    add_special_tokens=False,
	    return_tensors="pt",
	    padding=True,
	    truncation=True,
	    max_length=max_length
	)

	# Generate the sequences; these settings make decoding effectively greedy
	with torch.inference_mode():
	    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

	# Decode the generated sequences
	decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

	# Print the decoded sequences
	print(decoded_sequences)

	# Nonsensical decoded sequences (e.g., repeated 'AAAAAA') are expected here:
	# the input sequences are too short to provide sufficient context.
	```
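
Because each token covers 6 bases, `max_new_tokens` sets the generated length in 6-base units, so `max_new_tokens=32` corresponds to at most 192 generated base pairs. The arithmetic can be sketched as follows; `tokens_for_base_pairs` is a hypothetical helper, not part of the GENERator code base, and it assumes every generated token is an ordinary 6-mer rather than a special token.

```python
def tokens_for_base_pairs(num_base_pairs, k=6):
    """Smallest number of k-mer tokens covering `num_base_pairs` bases (hypothetical helper)."""
    if num_base_pairs < 0:
        raise ValueError("num_base_pairs must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return (num_base_pairs + k - 1) // k


print(tokens_for_base_pairs(192))   # 32
print(tokens_for_base_pairs(1000))  # 167
```

The result can be passed as `max_new_tokens` when a target length in base pairs is known; if the target is not a multiple of 6, the output may overshoot it by up to 5 bases.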

	### Simple example 2: embedding

	```python


	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load the tokenizer and model
	tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

	# Get model configuration
	config = model.config
	max_length = config.max_position_embeddings

	# Define input sequences
	sequences = [
	    "ATGAGGTGGCAAGAAATGGGCTAC",
	    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
	]

	# Truncate each sequence from the right to a multiple of 6 and prepend the BOS token
	processed_sequences = [tokenizer.bos_token + seq[:len(seq)//6*6] for seq in sequences]

	# Tokenization
	tokenizer.padding_side = "right"
	inputs = tokenizer(
	    processed_sequences,
	    add_special_tokens=True,
	    return_tensors="pt",
	    padding=True,
	    truncation=True,
	    max_length=max_length
	)

	# Model inference
	with torch.inference_mode():
	    outputs = model(**inputs, output_hidden_states=True)

	hidden_states = outputs.hidden_states[-1]
	attention_mask = inputs["attention_mask"]

	# Option 1: Last token (EOS) embedding
	last_token_indices = attention_mask.sum(dim=1) - 1
	eos_embeddings = hidden_states[torch.arange(hidden_states.size(0)), last_token_indices, :]

	# Option 2: Mean pooling over all tokens
	expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
	sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
	mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

	# Output
	print("EOS (Last Token) Embeddings:", eos_embeddings)
	print("Mean Pooling Embeddings:", mean_embeddings)

	# ============================================================================
	# Additional notes:
	# - The preprocessing step trims each sequence length to a multiple of 6, as the 6-mer tokenizer requires
	# - For a causal LM, the last-token (EOS) embedding is commonly used
	# - Mean pooling averages over all non-padding tokens, including special tokens such as BOS
	# - The choice depends on the requirements of the downstream task
	# - Both methods handle variable sequence lengths via the attention mask
	# ============================================================================

	```
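
The two pooling options can be checked on toy data without loading the model. The sketch below reimplements them with plain Python lists purely for illustration: the function names and the 2-dimensional "hidden states" are made up, and right padding is assumed, as in the example above.

```python
def last_token_embedding(hidden_states, attention_mask):
    """Return the vector at the last non-padded position (assumes right padding)."""
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]


def mean_pool_embedding(hidden_states, attention_mask):
    """Average the vectors at positions where the attention mask is 1."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# One toy sequence with 4 positions; the last position is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]

last_vec = last_token_embedding(toy_hidden, toy_mask)
mean_vec = mean_pool_embedding(toy_hidden, toy_mask)
print(last_vec)  # [5.0, 6.0]
print(mean_vec)  # [3.0, 4.0]
```

In both cases the padded position is ignored, which is what the attention mask achieves in the tensor version.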

	## Citation
	```
	@misc{wu2025generator,
	  title={GENERator: A Long-Context Generative Genomic Foundation Model},
	  author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
	  year={2025},
	  eprint={2502.07272},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL},
	  url={https://arxiv.org/abs/2502.07272},
	}
	```