| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - biology |
| - genomics |
| - llama |
| - fine-tuned |
| - plasmid |
| - gene-function |
| - genome-assembly |
| - gene-essentiality |
| pipeline_tag: text-generation |
| base_model: meta-llama/Meta-Llama-3.1-8B |
| --- |
| |
| # GenSyntax |
|
|
| GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation. |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | **Base Model** | Meta-Llama-3.1-8B | |
| | **Architecture** | LlamaForCausalLM | |
| | **Parameters** | ~8B | |
| | **Hidden Size** | 4096 | |
| | **Layers** | 32 | |
| | **Attention Heads** | 32 (GQA: 8 KV heads) | |
| | **Context Length** | 131,072 tokens | |
| | **Precision** | bfloat16 | |
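
The figures in the table are enough for a rough KV-cache estimate, which helps when judging how much of the 131,072-token context fits in memory. The sketch below is an illustration, not part of the GenSyntax code: it assumes the head dimension equals hidden size divided by attention heads (128) and a bfloat16 cache, and it ignores weights, activations, and framework overhead. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens, layers=32, kv_heads=8, hidden_size=4096,
                   attention_heads=32, bytes_per_value=2):
    """Approximate KV-cache size in bytes for one sequence of `num_tokens` tokens."""
    head_dim = hidden_size // attention_heads  # assumed: 4096 / 32 = 128
    # Each layer stores one key vector and one value vector per KV head per token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


GIB = 1024 ** 3
print(kv_cache_bytes(1) / 1024)       # KiB per token -> 128.0
print(kv_cache_bytes(131_072) / GIB)  # GiB at full context -> 16.0
```

Under these assumptions the cache costs about 128 KiB per token, so a single full-length context would need roughly 16 GiB for the cache alone.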
|
|
| ## Intended Use |
|
|
| GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks: |
|
|
| 1. **Plasmid Host Identification**: predict the bacterial host range of a plasmid from its sequence. |
| 2. **Gene Function Prediction**: infer the functional annotation of a gene given its sequence context. |
| 3. **Genome Assembly**: reconstruct genome sequences from contig fragments. |
| 4. **Gene Essentiality Prediction**: classify whether a gene is essential for cell survival. |
| 5. **Minimal Genome Derivation**: determine the minimal gene set required for a viable organism. |
|
|
| ## Hardware Requirements |
|
|
| A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference in bfloat16. On machines with several GPUs, loading with `device_map="auto"` spreads the model's layers across the available devices. |
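
As a back-of-the-envelope check on that requirement, the weight footprint can be estimated from the parameter count and the storage precision. This is a rough sketch: it takes the "~8B" figure from Model Details at face value (the exact count differs slightly), and the helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params=8e9, bytes_per_param=2):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024 ** 3


print(round(weight_memory_gib(), 1))                   # bfloat16 -> 14.9
print(round(weight_memory_gib(bytes_per_param=4), 1))  # float32  -> 29.8
```

In bfloat16 the weights take roughly 15 GiB, which leaves some headroom on a 24 GB card for activations and the KV cache. In float32 the weights alone would not fit.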
|
|
| ## How to Use |
|
|
| ### Load the Model |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| import torch |
| |
| model_path = "MoonTideF/GenSyntax" # or local path |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_path) |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_path, |
|     torch_dtype=torch.bfloat16,  # load weights in the released precision |
|     device_map="auto",  # place layers on the available GPU(s) |
| ) |
| ``` |
|
|
| ### Inference Scripts |
|
|
| Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and use the provided scripts: |
|
|
| ```bash |
| git clone https://github.com/nishiwen1214/GenSyntax.git |
| cd GenSyntax |
| pip install -r requirements.txt |
| ``` |
|
|
| #### Plasmid Host Identification |
|
|
| ```bash |
| python Plasmid_host_identification.py \ |
|     --model /path/to/GenSyntax \ |
|     --input-json-paths test_data/gene_task1_test_1000_format.json |
| ``` |
|
|
| #### Gene Function Prediction |
|
|
| ```bash |
| python Gene_function_prediction.py \ |
|     --model /path/to/GenSyntax \ |
|     --input-json-paths test_data/gene_task2_test_500_opts.json |
| ``` |
|
|
| #### Genome Assembly |
|
|
| ```bash |
| python Genome_assembly.py \ |
|     --model /path/to/GenSyntax \ |
|     --input-json-paths test_data/gene_task3_test_500_contig3_format.json |
| ``` |
|
|
| #### Gene Essentiality Prediction |
|
|
| ```bash |
| python Gene_essentiality_prediction.py \ |
|     --model /path/to/GenSyntax \ |
|     --input-json-paths test_data/gene_task4_test_1000_format.json |
| ``` |
|
|
| #### Minimal Genome Derivation |
|
|
| ```bash |
| python minimal_genome_inference.py \ |
|     --model /path/to/GenSyntax \ |
|     --input-json-paths test_data/bacteria_chromosomes_9-mini.json |
| ``` |
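
To run all five tasks in one go, the commands above can be driven from a small Python wrapper. This is a convenience sketch rather than something shipped with the repository: the script names, flags, and test files are copied from the commands above, while `TASKS`, `build_commands`, and `run_all` are hypothetical names. It assumes it is run from the repository root.

```python
import subprocess
import sys

# (script, test file) pairs copied from the five commands above.
TASKS = [
    ("Plasmid_host_identification.py", "test_data/gene_task1_test_1000_format.json"),
    ("Gene_function_prediction.py", "test_data/gene_task2_test_500_opts.json"),
    ("Genome_assembly.py", "test_data/gene_task3_test_500_contig3_format.json"),
    ("Gene_essentiality_prediction.py", "test_data/gene_task4_test_1000_format.json"),
    ("minimal_genome_inference.py", "test_data/bacteria_chromosomes_9-mini.json"),
]


def build_commands(model_path):
    """Return one argument list per task, using the current Python interpreter."""
    return [
        [sys.executable, script, "--model", model_path, "--input-json-paths", data]
        for script, data in TASKS
    ]


def run_all(model_path):
    """Run the tasks one after another, stopping at the first failure."""
    for cmd in build_commands(model_path):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


# Example, from the repository root:
# run_all("/path/to/GenSyntax")
```

Passing `check=True` makes a failing script raise `subprocess.CalledProcessError` immediately instead of silently continuing to the next task.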
|
|
| ## Training Data |
|
|
| The training and evaluation datasets are available on Hugging Face: |
|
|
| [GenSyntax Datasets on Hugging Face](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data) |
|
|
| The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction. |
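
One possible way to fetch the dataset and take a first look at a file is sketched below. It uses `snapshot_download` from `huggingface_hub` (an extra dependency, assumed to be installed). Because the JSON schema is not documented here, `describe_json` assumes nothing beyond valid JSON. The helper names and the `local_dir` default are hypothetical.

```python
import json

DATASET_REPO = "ShiwenNi/GenSyntax-data"  # dataset repository linked above


def download_dataset(local_dir="GenSyntax-data"):
    """Download the dataset repository and return the local folder path."""
    # Imported lazily so describe_json works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPO, repo_type="dataset", local_dir=local_dir
    )


def describe_json(path):
    """Summarise a JSON file without assuming any particular schema."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["length"] = len(data)
    # If the file is a list of records, report the keys of the first record.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        summary["first_record_keys"] = sorted(data[0])
    return summary


# Example:
# folder = download_dataset()
# print(describe_json("test_data/gene_task1_test_1000_format.json"))
```

Checking the record keys this way before running a script makes it easier to confirm that a custom input file matches the layout of the provided test sets.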
|
|
| ## Generation Config |
|
|
| | Parameter | Value | |
| |---|---| |
| | `temperature` | 0.6 | |
| | `top_p` | 0.9 | |
| | `do_sample` | True | |
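
These settings can be passed straight to `generate` when calling the model directly. The sketch below assumes `model` and `tokenizer` were loaded with `transformers` as shown under "Load the Model". The task-specific prompt format is not described here, so `prompt` is left to the caller. `max_new_tokens=256` is an assumed cap, not a documented default, and `SAMPLING` and `generate_text` are hypothetical names.

```python
# Sampling settings from the Generation Config table.
SAMPLING = {"do_sample": True, "temperature": 0.6, "top_p": 0.9}


def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Sample one completion for `prompt` and return only the new text."""
    # Tokenize and move the tensors to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, **SAMPLING)
    # Drop the prompt tokens so only the generated continuation is decoded.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example:
# print(generate_text(model, tokenizer, "<your task prompt here>"))
```

Because `do_sample` is `True`, repeated calls with the same prompt can return different outputs. Setting `do_sample=False` would give deterministic greedy decoding, at the cost of departing from the table above.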
|
|
| ## Citation |
|
|
| If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax). |
|
|
| ## License |
|
|
| This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). |
|
|