---
license: apache-2.0
language:
- en
tags:
- biology
- genomics
- llama
- fine-tuned
- plasmid
- gene-function
- genome-assembly
- gene-essentiality
pipeline_tag: text-generation
base_model: meta-llama/Meta-Llama-3.1-8B
---

# GenSyntax

GenSyntax is a large language model for genomic sequence analysis and inference, fine-tuned from Meta-Llama-3.1-8B. It targets five genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.

## Model Details

| Property | Value |
|---|---|
| **Base Model** | Meta-Llama-3.1-8B |
| **Architecture** | LlamaForCausalLM |
| **Parameters** | ~8B |
| **Hidden Size** | 4096 |
| **Layers** | 32 |
| **Attention Heads** | 32 (GQA: 8 KV heads) |
| **Context Length** | 131,072 tokens |
| **Precision** | bfloat16 |

## Intended Use

GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:

1. **Plasmid Host Identification** — predict the bacterial host range of a plasmid from its sequence.
2. **Gene Function Prediction** — infer the functional annotation of a gene given its sequence context.
3. **Genome Assembly** — reconstruct genome sequences from contig fragments.
4. **Gene Essentiality Prediction** — classify whether a gene is essential for cell survival.
5. **Minimal Genome Derivation** — determine the minimal gene set required for a viable organism.

## Hardware Requirements

A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for bfloat16 inference. On machines with several GPUs, `device_map="auto"` spreads the model's layers across the available devices, which adds memory headroom for longer inputs.
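
As a rough sanity check on that figure, the memory footprint can be estimated from the Model Details table. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes exactly 8 billion parameters (the table says "~8B"), a bfloat16 KV cache, a head dimension of hidden size divided by attention heads, and it ignores activations and framework overhead. The function names are illustrative.

```python
# Values taken from the Model Details table.
NUM_PARAMS = 8_000_000_000   # assumption: "~8B" rounded to exactly 8e9
NUM_LAYERS = 32
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8             # grouped-query attention
HIDDEN_SIZE = 4096
BYTES_PER_VALUE = 2          # bfloat16

GIB = 1024 ** 3
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # 128


def weight_bytes():
    """Memory needed to hold the weights in bfloat16."""
    return NUM_PARAMS * BYTES_PER_VALUE


def kv_cache_bytes(num_tokens, batch_size=1):
    """Memory for cached keys and values (factor 2) across all layers."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens * batch_size


for tokens in (4_096, 32_768, 131_072):
    weights = weight_bytes() / GIB
    cache = kv_cache_bytes(tokens) / GIB
    print(f"{tokens:>7} tokens: weights {weights:.1f} GiB "
          f"+ KV cache {cache:.1f} GiB = {weights + cache:.1f} GiB")
```

Under these assumptions the KV cache costs 128 KiB per token, so short and medium-length inputs fit comfortably on a 24 GB card, while a full 131,072-token context would need about 16 GiB of cache on top of the weights and is a case for splitting the model across GPUs.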

## How to Use

### Load the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

### Inference Scripts

Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and install its dependencies, then run the script for each task from the repository root:

```bash
git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt
```

#### Plasmid Host Identification

```bash
python Plasmid_host_identification.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task1_test_1000_format.json
```

#### Gene Function Prediction

```bash
python Gene_function_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task2_test_500_opts.json
```

#### Genome Assembly

```bash
python Genome_assembly.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task3_test_500_contig3_format.json
```

#### Gene Essentiality Prediction

```bash
python Gene_essentiality_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task4_test_1000_format.json
```

#### Minimal Genome Derivation

```bash
python minimal_genome_inference.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
```
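
To run all five tasks in one go, the commands above can be wrapped in a small driver. This is a convenience sketch, not part of the repository: `build_commands` and `run_all` are hypothetical helper names, and the script and data paths are copied verbatim from the commands above, so it assumes you run it from the repository root.

```python
import subprocess
import sys

# (script, test file) pairs copied from the per-task commands above.
TASKS = [
    ("Plasmid_host_identification.py", "test_data/gene_task1_test_1000_format.json"),
    ("Gene_function_prediction.py", "test_data/gene_task2_test_500_opts.json"),
    ("Genome_assembly.py", "test_data/gene_task3_test_500_contig3_format.json"),
    ("Gene_essentiality_prediction.py", "test_data/gene_task4_test_1000_format.json"),
    ("minimal_genome_inference.py", "test_data/bacteria_chromosomes_9-mini.json"),
]


def build_commands(model_path):
    """Return one argument list per task, using the current Python interpreter."""
    return [
        [sys.executable, script, "--model", model_path, "--input-json-paths", data]
        for script, data in TASKS
    ]


def run_all(model_path):
    """Run the tasks one after another, stopping at the first failure."""
    for cmd in build_commands(model_path):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling `run_all("/path/to/GenSyntax")` executes the tasks sequentially, so the model is reloaded for each script; that is simple but not the fastest option.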

## Training Data

The training and evaluation datasets are available on HuggingFace:

👉 [GenSyntax Datasets on HuggingFace](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data)

The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.

## Generation Config

| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.9 |
| `do_sample` | True |
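
If you call `generate` directly rather than through the repository scripts, these settings can be passed as keyword arguments. The following is a minimal sketch: `generate_text` is a hypothetical helper, `max_new_tokens=256` is an arbitrary illustrative limit, and the prompt format each task expects is defined by the repository scripts, not by this example. It expects the `model` and `tokenizer` objects from the loading snippet in "Load the Model".

```python
# Sampling settings copied from the Generation Config table.
GENERATION_KWARGS = {"do_sample": True, "temperature": 0.6, "top_p": 0.9}


def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Sample a continuation of `prompt` with the settings above.

    `max_new_tokens` is an illustrative default, not a documented value.
    """
    # Tokenize and move the tensors to the device holding the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        **GENERATION_KWARGS,
    )
    # Drop the prompt tokens so only the newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True)
```

Because `do_sample` is enabled, repeated calls with the same prompt can return different outputs.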

## Citation

If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax).

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).