---
license: apache-2.0
language:
- en
tags:
- biology
- genomics
- llama
- fine-tuned
- plasmid
- gene-function
- genome-assembly
- gene-essentiality
pipeline_tag: text-generation
base_model: meta-llama/Meta-Llama-3.1-8B
---
# GenSyntax
GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.
## Model Details
| Property | Value |
|---|---|
| **Base Model** | Meta-Llama-3.1-8B |
| **Architecture** | LlamaForCausalLM |
| **Parameters** | ~8B |
| **Hidden Size** | 4096 |
| **Layers** | 32 |
| **Attention Heads** | 32 (GQA: 8 KV heads) |
| **Context Length** | 131,072 tokens |
| **Precision** | bfloat16 |
## Intended Use
GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:
1. **Plasmid Host Identification** — predict the bacterial host range of a plasmid from its sequence.
2. **Gene Function Prediction** — infer the functional annotation of a gene given its sequence context.
3. **Genome Assembly** — reconstruct genome sequences from contig fragments.
4. **Gene Essentiality Prediction** — classify whether a gene is essential for cell survival.
5. **Minimal Genome Derivation** — determine the minimal gene set required for a viable organism.
## Hardware Requirements
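A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference in bfloat16. Multi-GPU setups are also supported via `device_map="auto"`, which distributes the model's layers across the available GPUs; this mainly adds memory headroom rather than per-request speed.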
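The figures in the Model Details table allow a rough back-of-the-envelope memory estimate. The sketch below is only an approximation: it assumes batch size 1, a bfloat16 KV cache, and exactly 8e9 parameters (the table says "~8B"), and it ignores activations and framework overhead. The function names are illustrative and not part of the GenSyntax repository.

```python
# Values copied from the Model Details table.
NUM_PARAMS = 8e9            # "~8B" in the table; treated as exactly 8e9 here (assumption)
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8            # grouped-query attention
BYTES_PER_VALUE = 2         # bfloat16

GIB = 1024 ** 3


def weight_bytes() -> float:
    """Approximate memory taken by the model weights alone."""
    return NUM_PARAMS * BYTES_PER_VALUE


def kv_cache_bytes(num_tokens: int) -> int:
    """Approximate KV-cache size for one sequence of num_tokens tokens."""
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS
    # One key and one value vector per layer and per KV head for every token.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * head_dim * BYTES_PER_VALUE
    return per_token * num_tokens


if __name__ == "__main__":
    print(f"weights: {weight_bytes() / GIB:.1f} GiB")
    for n in (4096, 32768, 131072):
        print(f"KV cache for {n} tokens: {kv_cache_bytes(n) / GIB:.1f} GiB")
```

Under these assumptions the weights come to roughly 14.9 GiB and the KV cache grows by 128 KiB per token, so a prompt that fills the whole 131,072-token context would need about 16 GiB of cache on top of the weights. That is more than a single 24 GB card can hold, which is one reason to shard across GPUs for very long inputs.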
## How to Use
### Load the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
### Inference Scripts
Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and use the provided scripts:
```bash
git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt
```
#### Plasmid Host Identification
```bash
python Plasmid_host_identification.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task1_test_1000_format.json
```
#### Gene Function Prediction
```bash
python Gene_function_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task2_test_500_opts.json
```
#### Genome Assembly
```bash
python Genome_assembly.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task3_test_500_contig3_format.json
```
#### Gene Essentiality Prediction
```bash
python Gene_essentiality_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task4_test_1000_format.json
```
#### Minimal Genome Derivation
```bash
python minimal_genome_inference.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
```
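To run all five tasks in one go, the commands above can be assembled programmatically. The following is a minimal sketch rather than a script shipped with the repository: the task keys and the helper names `build_command` and `run_all` are hypothetical, while the script names, flags, and test-file paths are copied from the commands above. It assumes it is executed from the root of the cloned repository.

```python
import subprocess
import sys

# Script names and test files as listed in the commands above.
# The dictionary keys are illustrative labels, not identifiers used by the repository.
TASKS = {
    "plasmid_host": ("Plasmid_host_identification.py",
                     "test_data/gene_task1_test_1000_format.json"),
    "gene_function": ("Gene_function_prediction.py",
                      "test_data/gene_task2_test_500_opts.json"),
    "genome_assembly": ("Genome_assembly.py",
                        "test_data/gene_task3_test_500_contig3_format.json"),
    "gene_essentiality": ("Gene_essentiality_prediction.py",
                          "test_data/gene_task4_test_1000_format.json"),
    "minimal_genome": ("minimal_genome_inference.py",
                       "test_data/bacteria_chromosomes_9-mini.json"),
}


def build_command(task: str, model_dir: str) -> list[str]:
    """Return the argument list for one task, using the current Python interpreter."""
    script, data = TASKS[task]
    return [sys.executable, script, "--model", model_dir, "--input-json-paths", data]


def run_all(model_dir: str) -> None:
    """Run every task in sequence and stop at the first failing script."""
    for task in TASKS:
        cmd = build_command(task, model_dir)
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_all("/path/to/GenSyntax")` would then execute the five scripts one after another; `check=True` makes a non-zero exit code raise `subprocess.CalledProcessError` instead of silently continuing.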
## Training Data
The training and evaluation datasets are available on HuggingFace:
👉 [GenSyntax Datasets on HuggingFace](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data)
The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.
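This card does not document the schema of the JSON files, so it can help to inspect a file before feeding it to a script. The helper below is a generic sketch using only the standard library; `summarize_json` is a hypothetical name, and it assumes the file is a single JSON document small enough to load into memory.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Report the top-level type, its size, and the keys of the first record (if any)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    size = len(data) if isinstance(data, (list, dict)) else None
    # Only look at the first element when the file is a non-empty list.
    first = data[0] if isinstance(data, list) and data else None
    first_keys = sorted(first) if isinstance(first, dict) else None
    return {"type": type(data).__name__, "size": size, "first_record_keys": first_keys}
```

For example, `summarize_json("test_data/gene_task1_test_1000_format.json")` would show whether the file is a list of records and which fields each record carries, without assuming any particular field names.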
## Generation Config
| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.9 |
| `do_sample` | True |
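The settings above can be passed directly to `generate` when calling the model outside the provided scripts. The sketch below assumes a `model` and `tokenizer` loaded as in "Load the Model"; the helper name `generate_text`, the default `max_new_tokens=64`, and any prompt you pass are assumptions, since the prompt format expected for each task is defined by the repository scripts and not described in this card.

```python
# Sampling parameters copied from the Generation Config table.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}


def generate_text(model, tokenizer, prompt, max_new_tokens=64):
    """Sample a continuation of prompt and return only the newly generated text.

    max_new_tokens is an illustrative default, not a value documented for GenSyntax.
    """
    # Tokenize and move the tensors to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        **GENERATION_KWARGS,
    )
    # The returned sequence starts with the prompt tokens, so slice them off.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Because `do_sample` is enabled, repeated calls with the same prompt can return different outputs; for reproducible comparisons, fix the random seed (for example with `torch.manual_seed`) before calling the helper.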
## Citation
If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax).
## License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).