---
license: apache-2.0
language:
- en
tags:
- biology
- genomics
- llama
- fine-tuned
- plasmid
- gene-function
- genome-assembly
- gene-essentiality
pipeline_tag: text-generation
base_model: meta-llama/Meta-Llama-3.1-8B
---

# GenSyntax

GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.

## Model Details

| Property | Value |
|---|---|
| **Base Model** | Meta-Llama-3.1-8B |
| **Architecture** | LlamaForCausalLM |
| **Parameters** | ~8B |
| **Hidden Size** | 4096 |
| **Layers** | 32 |
| **Attention Heads** | 32 (GQA: 8 KV heads) |
| **Context Length** | 131,072 tokens |
| **Precision** | bfloat16 |

## Intended Use

GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:

1. **Plasmid Host Identification** — predict the bacterial host range of a plasmid from its sequence.
2. **Gene Function Prediction** — infer the functional annotation of a gene given its sequence context.
3. **Genome Assembly** — reconstruct genome sequences from contig fragments.
4. **Gene Essentiality Prediction** — classify whether a gene is essential for cell survival.
5. **Minimal Genome Derivation** — determine the minimal gene set required for a viable organism.

## Hardware Requirements

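A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference. Multi-GPU setups are supported via `device_map="auto"`, which distributes the model's layers across the available GPUs; this mainly adds memory headroom for longer inputs and does not by itself run the GPUs in parallel.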
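As a rough check on the 24 GB figure, the values in the Model Details table can be turned into a memory estimate. The sketch below assumes exactly 8 billion parameters, 2 bytes per bfloat16 value, a head dimension equal to hidden size divided by attention heads, a bfloat16 KV cache, and a single sequence per batch. It ignores activations and framework overhead, so the results are best read as lower bounds.

```python
# Back-of-the-envelope memory estimate from the Model Details table.
# Assumptions (not from the model card): exactly 8e9 parameters, a bfloat16
# KV cache, batch size 1, and no allowance for activations or overhead.

N_PARAMS = 8e9          # "~8B" parameters, rounded
BYTES_PER_VALUE = 2     # bfloat16
HIDDEN_SIZE = 4096
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8          # grouped-query attention
GIB = 1024 ** 3


def weight_bytes() -> float:
    """Memory needed to hold the weights alone."""
    return N_PARAMS * BYTES_PER_VALUE


def kv_cache_bytes_per_token() -> int:
    """KV-cache cost of one token across all layers."""
    head_dim = HIDDEN_SIZE // N_HEADS  # 4096 / 32 = 128
    # One key vector and one value vector per KV head, per layer.
    return 2 * N_LAYERS * N_KV_HEADS * head_dim * BYTES_PER_VALUE


def kv_cache_bytes(n_tokens: int) -> int:
    """KV-cache cost of a single sequence of n_tokens tokens."""
    return n_tokens * kv_cache_bytes_per_token()


if __name__ == "__main__":
    print(f"weights: {weight_bytes() / GIB:.1f} GiB")
    for n_tokens in (8_192, 131_072):
        size = kv_cache_bytes(n_tokens) / GIB
        print(f"KV cache at {n_tokens} tokens: {size:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token. An 8,192-token input adds about 1 GiB on top of roughly 15 GiB of weights, while the full 131,072-token context would need about 16 GiB for the cache alone, which is more than a single 24 GB card has left once the weights are loaded.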

## How to Use

### Load the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

### Inference Scripts

Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and install its dependencies, then run the script for the task you need:

```bash
git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt
```

#### Plasmid Host Identification

```bash
python Plasmid_host_identification.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task1_test_1000_format.json
```

#### Gene Function Prediction

```bash
python Gene_function_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task2_test_500_opts.json
```

#### Genome Assembly

```bash
python Genome_assembly.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task3_test_500_contig3_format.json
```

#### Gene Essentiality Prediction

```bash
python Gene_essentiality_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task4_test_1000_format.json
```

#### Minimal Genome Derivation

```bash
python minimal_genome_inference.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
```
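To run all five tasks in one go, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_all` are hypothetical, and it assumes the wrapper is started from the root of the cloned repository so that the script and `test_data/` paths resolve.

```python
import subprocess
import sys

# Script names and test files are copied from the commands above.
TASKS = [
    ("Plasmid_host_identification.py", "test_data/gene_task1_test_1000_format.json"),
    ("Gene_function_prediction.py", "test_data/gene_task2_test_500_opts.json"),
    ("Genome_assembly.py", "test_data/gene_task3_test_500_contig3_format.json"),
    ("Gene_essentiality_prediction.py", "test_data/gene_task4_test_1000_format.json"),
    ("minimal_genome_inference.py", "test_data/bacteria_chromosomes_9-mini.json"),
]


def build_commands(model_dir: str) -> list[list[str]]:
    """Return one argument list per task, using the current interpreter."""
    return [
        [sys.executable, script, "--model", model_dir, "--input-json-paths", data]
        for script, data in TASKS
    ]


def run_all(model_dir: str) -> None:
    """Run the tasks one after another; stop at the first failing script."""
    for cmd in build_commands(model_dir):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    for cmd in build_commands("/path/to/GenSyntax"):
        print(" ".join(cmd))
```

Calling `run_all("/path/to/GenSyntax")` executes the scripts sequentially, so each one reloads the model; that keeps the wrapper simple at the cost of repeated load time.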

## Training Data

The training and evaluation datasets are available on HuggingFace:

👉 [GenSyntax Datasets on HuggingFace](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data)

The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.
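One way to fetch the data is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed; the target directory name and the helper names are illustrative. Because this card does not describe the JSON layout, `summarize_json` only reports the top-level type and size of a file and makes no assumption about its fields.

```python
import json

DATASET_REPO = "ShiwenNi/GenSyntax-data"  # dataset repository linked above


def download_dataset(local_dir: str = "GenSyntax-data") -> str:
    """Download the dataset repository and return the local folder path.

    `local_dir` is an illustrative default, not a path from the model card.
    """
    # Imported lazily so the rest of the module works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPO,
        repo_type="dataset",
        local_dir=local_dir,
    )


def summarize_json(path: str) -> tuple:
    """Return (top-level type name, number of entries or None) for a JSON file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size
```

After `download_dataset()`, calling `summarize_json` on one of the downloaded files is a quick way to see whether it holds a list of records or a single object before writing any task-specific parsing.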

## Generation Config

| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.9 |
| `do_sample` | True |
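
These settings correspond to keyword arguments of `generate` in `transformers`. The sketch below assumes a `model` and `tokenizer` loaded as in the "Load the Model" section. The helper name `generate_text`, the `max_new_tokens` default, and the plain-text prompt are placeholders: this card does not specify a prompt format, and the repository scripts define the prompts actually used for each task.

```python
# Sampling settings from the table above, as keyword arguments for generate().
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}


def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Sample a continuation of `prompt` and return only the new text.

    `max_new_tokens=256` is an illustrative default, not a value from the
    model card.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        **GENERATION_KWARGS,
    )
    # Drop the prompt tokens so only the generated part is decoded.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Because `do_sample` is enabled, repeated calls with the same prompt can return different outputs; fixing the random seed or switching to greedy decoding would make runs repeatable, at the cost of departing from the settings listed here.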

## Citation

If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax).

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).