---
license: apache-2.0
language:
- en
tags:
- biology
- genomics
- llama
- fine-tuned
- plasmid
- gene-function
- genome-assembly
- gene-essentiality
pipeline_tag: text-generation
base_model: meta-llama/Meta-Llama-3.1-8B
---

# GenSyntax

GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.

## Model Details

| Property | Value |
|---|---|
| **Base Model** | Meta-Llama-3.1-8B |
| **Architecture** | LlamaForCausalLM |
| **Parameters** | ~8B |
| **Hidden Size** | 4096 |
| **Layers** | 32 |
| **Attention Heads** | 32 (GQA: 8 KV heads) |
| **Context Length** | 131,072 tokens |
| **Precision** | bfloat16 |

## Intended Use

GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:

1. **Plasmid Host Identification**: predict the bacterial host range of a plasmid from its sequence.
2. **Gene Function Prediction**: infer the functional annotation of a gene given its sequence context.
3. **Genome Assembly**: reconstruct genome sequences from contig fragments.
4. **Gene Essentiality Prediction**: classify whether a gene is essential for cell survival.
5. **Minimal Genome Derivation**: determine the minimal gene set required for a viable organism.

## Hardware Requirements

A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference. For faster throughput, multi-GPU setups are supported via `device_map="auto"`.

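
As a rough sanity check on the 24 GB figure, the weight footprint follows from the parameter count and precision listed under Model Details. The sketch below is back-of-the-envelope arithmetic only: the helper name `weight_memory_gib` is hypothetical, the parameter count is rounded to 8 billion, and the estimate ignores the KV cache and activations, which grow with prompt length and batch size.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# ~8B parameters (rounded, an assumption); bfloat16 stores 2 bytes per parameter.
weights_gib = weight_memory_gib(8e9)
print(f"Weights alone: ~{weights_gib:.1f} GiB")  # prints "Weights alone: ~14.9 GiB"
```

Whatever remains on the card after the weights are loaded is what is left for the KV cache and activations, so very long inputs may still call for shorter prompts or more than one GPU.
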
## How to Use

### Load the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

### Inference Scripts

Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and use the provided scripts:

```bash
git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt
```

#### Plasmid Host Identification

```bash
python Plasmid_host_identification.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task1_test_1000_format.json
```

#### Gene Function Prediction

```bash
python Gene_function_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task2_test_500_opts.json
```

#### Genome Assembly

```bash
python Genome_assembly.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task3_test_500_contig3_format.json
```

#### Gene Essentiality Prediction

```bash
python Gene_essentiality_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task4_test_1000_format.json
```

#### Minimal Genome Derivation

```bash
python minimal_genome_inference.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
```

## Training Data

The training and evaluation datasets are available on Hugging Face:

👉 [GenSyntax Datasets on HuggingFace](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data)

The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.

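
Before pointing a task script at your own data, it can help to check how the provided test files are laid out. The sketch below uses only the standard library and assumes nothing about the schema beyond the file being valid JSON. The helper name `summarize_json` is hypothetical and is not part of the GenSyntax repository.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Summarize a JSON file: top-level type, size, and first-record field names."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)

    summary = {"type": type(data).__name__, "size": None, "first_record_keys": None}
    if isinstance(data, (list, dict)):
        summary["size"] = len(data)
    # If the file is a list of records, report the field names of the first one.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        summary["first_record_keys"] = sorted(data[0])
    return summary


if __name__ == "__main__":
    # Path taken from the plasmid host identification command in How to Use;
    # adjust it to wherever the repository was cloned.
    example = Path("test_data/gene_task1_test_1000_format.json")
    if example.exists():
        print(summarize_json(example))
```

Matching the reported field names in your own input files is a reasonable starting point, though the scripts themselves remain the authority on what they accept.
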
## Generation Config

| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.9 |
| `do_sample` | True |

## Citation

If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax).

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).