RNARL / README.md

你çsglin

Modify README

a7d4876 11 months ago

6.09 kB

	---
	tags:
	- generation
	- protein-sequence
	- rna-sequence
	- pytorch
	---

	# Protein to RNA CDS Sequence Generation Model

	This model is a custom PyTorch model designed to generate RNA CDS sequences from protein sequences. It utilizes a custom transformer-based architecture incorporating an ESM-2 encoder and a Mixture-of-Experts (MoE) layer.

	## Model Architecture

	The model `ActorModel_encoder_esm2` is defined in `utils.py`.

	The key parameters used for instantiation are:

	- `d_model`: Dimension of the model's internal representation (768).
	- `nhead`: Number of attention heads (8).
	- `num_encoder_layers`: Number of transformer encoder layers (8).
	- `dim_feedforward`: Dimension of the feedforward network (`d_model * 2`).
	- `esm2_dim`: Dimension of the ESM-2 embeddings (1280 for esm2_t33_650M_UR50D).
	- `dropout`: Dropout rate (0.3).
	- `num_experts`: Number of experts in the MoE layer (6).
	- `top_k_experts`: Number of top experts to use (2).
	- `device`: The device to run the model on.

	## Files in this Repository

	- `homo_mrna.pt`: The PyTorch state_dict of the trained model for Homo sapiens mRNA.
	- `homo_circ.pt`: The PyTorch state_dict of the trained model for Homo sapiens circlar RNA.
	- `Arabidopsis.pt`: The PyTorch state_dict of the trained model for Arabidopsis thaliana mRNA.
	- `CR.pt`: The PyTorch state_dict of the trained model for Chlamydomonas reinhardtii mRNA.
	- `EscherichiaColi.pt`: The PyTorch state_dict of the trained model for Escherichia coli mRNA.
	- `PC.pt`: The PyTorch state_dict of the trained model for Penicillium chrysogenum mRNA.
	- `TK.pt`: The PyTorch state_dict of the trained model for Thermococcus kodakarensis KOD1 mRNA.
	- `utils.py`: Contains the definition of the `ActorModel_encoder_esm2` class and the `Tokenizer` class.
	- `transformer_encoder_MoE.py`: Contains the definition of the `Encoder` class
	- `README.md`: This file.

	## How to Load the Model

	Since this is a custom model, you need to download the `utils.py`,`transformer_encoder_MoE.py`, and the `.pt` file and then instantiate the model class and load the state dictionary.

	1. Download Files:
	You can download the files using the `huggingface_hub` library:
	```python
	from huggingface_hub import hf_hub_download
	import os

	repo_id = "sglin/RNARL"
	local_dir = "./my_RNARL"

	# Download model weights and utils.py
	hf_hub_download(repo_id=repo_id, filename="homo_mrna.pt", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="homo_circ.pt", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="Arabidopsis.pt", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="CR.pt", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="EscherichiaColi.pt", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="PC.pt", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="TK.pt", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="utils.py", local_dir=local_dir)
	hf_hub_download(repo_id=repo_id, filename="transformer_encoder_MoE.py", local_dir=local_dir)

	# Now utils.py,transformer_encoder_MoE.py and model weights are in ./my_RNARL
	```

	2. Import Model Class:

	```python
	# Assuming you are in or have added ./my_RNARL to your path
	# Example: If in local_dir
	# import sys
	# sys.path.append("./my_RNARL")
	# from utils import Tokenizer, ActorModel_encoder_esm2

	# Or if you copied utils.py to your current working directory:
	from utils import Tokenizer, ActorModel_encoder_esm2
	```

	3. Load ESM-2 (Dependency):
	The model requires the ESM-2 encoder. You'll need to load it separately, typically from Hugging Face Hub.
	```python
	from transformers import AutoTokenizer, EsmModel
	import torch

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	esm2_tokenizer = AutoTokenizer.from_pretrained("esm2_t33_650M_UR50D")
	esm2_model = EsmModel.from_pretrained("esm2_t33_650M_UR50D").to(device)
	esm2_model.eval()
	esm2_dim = esm2_model.config.hidden_size # Get the actual dimension
	```
	Note: Your original script used a local path (`./esm2_model_t33_650M_UR50D`). Users loading from the Hub will likely prefer loading directly from the official Hugging Face repo unless you explicitly provide the ESM-2 files in your repo (which is usually not necessary as they are already on the Hub).

	4. Instantiate Custom Model and Load Weights:
	Instantiate your `ActorModel_encoder_esm2` using the parameters from your training script and load the state dictionary.
	```python
	# Define the parameters used during training
	d_model = 768
	nhead = 8
	num_encoder_layers = 8
	dim_feedforward = d_model * 2 # or the exact value you used
	dropout = 0.3
	num_experts = 6
	top_k_experts = 2
	# vocab_size needs to match your Tokenizer
	tokenizer = Tokenizer() # Instantiate your custom tokenizer
	vocab_size = len(tokenizer.tokens) # Get vocab size from your tokenizer

	# Instantiate the model
	model = ActorModel_encoder_esm2(
	vocab_size=vocab_size,
	d_model=d_model,
	nhead=nhead,
	num_encoder_layers=num_encoder_layers,
	dim_feedforward=dim_feedforward,
	esm2_dim=esm2_dim, # Use the esm2_model's dimension
	dropout=dropout,
	num_experts=num_experts,
	top_k_experts=top_k_experts,
	device=device
	)

	# Load the state dictionary
	model_weights_path = os.path.join(local_dir, "homo_mrna.pt")
	model.load_state_dict(torch.load(model_weights_path, map_location=device))
	model.to(device)
	model.eval()

	print("Model loaded successfully!")

	# Now you can use the 'model' object for inference
	# Remember you also need your Tokenizer and the ESM-2 tokenizer/model
	```

	## Dependencies

	- `torch`
	- `transformers`
	- `huggingface_hub`
	- `pandas`
	- `numpy`
	- The specific ESM-2 model used (`esm2_t33_650M_UR50D` or the one you used).

	## License

	[ MIT, Apache 2.0]

	## Contact

	[linsg4521@sjtu.edu.cn]