GenSyntax

GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.

Model Details

| Property | Value |
| --- | --- |
| Base Model | Meta-Llama-3.1-8B |
| Architecture | LlamaForCausalLM |
| Parameters | ~8B |
| Hidden Size | 4096 |
| Layers | 32 |
| Attention Heads | 32 (GQA: 8 KV heads) |
| Context Length | 131,072 tokens |
| Precision | bfloat16 |

Intended Use

GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:

  1. Plasmid Host Identification: predict the bacterial host range of a plasmid from its sequence.
  2. Gene Function Prediction: infer the functional annotation of a gene given its sequence context.
  3. Genome Assembly: reconstruct genome sequences from contig fragments.
  4. Gene Essentiality Prediction: classify whether a gene is essential for cell survival.
  5. Minimal Genome Derivation: determine the minimal gene set required for a viable organism.

Hardware Requirements

A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference in bfloat16. On multi-GPU machines, device_map="auto" distributes the model's layers across the available devices, which mainly adds memory headroom for longer inputs.
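
As a rough check on the memory requirement, the figures in the Model Details table allow a back-of-the-envelope estimate. The sketch below assumes 2 bytes per bfloat16 value, a head dimension of hidden size / attention heads = 128, and a KV cache stored in bfloat16. It ignores activations and framework overhead, so the numbers are lower bounds rather than measurements.

```python
# Back-of-the-envelope memory estimate from the Model Details table.
# Assumptions: 2 bytes per bfloat16 value, KV cache stored in bfloat16,
# activations and framework overhead ignored.

GIB = 1024 ** 3
BYTES_PER_VALUE = 2          # bfloat16

params = 8e9                 # "~8B" from the table (approximate)
hidden_size = 4096
layers = 32
attention_heads = 32
kv_heads = 8                 # grouped-query attention (GQA)
context_length = 131_072

head_dim = hidden_size // attention_heads  # 128

weight_bytes = params * BYTES_PER_VALUE

# Each token stores one key and one value vector per KV head per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * BYTES_PER_VALUE
kv_bytes_full_context = kv_bytes_per_token * context_length

print(f"weights:                    {weight_bytes / GIB:.1f} GiB")
print(f"KV cache per token:         {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache at 131,072 tokens: {kv_bytes_full_context / GIB:.1f} GiB")
```

Under these assumptions the weights take about 14.9 GiB and the KV cache adds 128 KiB per token, which comes to 16 GiB at the full 131,072-token context. A 24 GB card therefore holds the weights with room to spare, but inputs that approach the maximum context length may need a second GPU or a shorter context.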

How to Use

Load the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Inference Scripts

Clone the GenSyntax repository and use the provided scripts:

git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt

Plasmid Host Identification

python Plasmid_host_identification.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task1_test_1000_format.json

Gene Function Prediction

python Gene_function_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task2_test_500_opts.json

Genome Assembly

python Genome_assembly.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task3_test_500_contig3_format.json

Gene Essentiality Prediction

python Gene_essentiality_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task4_test_1000_format.json

Minimal Genome Derivation

python minimal_genome_inference.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
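
To run all five tasks in sequence, the commands above can be driven from a short Python script. This is a sketch and not part of the repository: build_command and run_all are hypothetical helpers, the script and data file names are copied from the commands above, and it assumes the script is run from the repository root with only the two flags shown.

```python
import subprocess
import sys

# Script and bundled test-data names, copied from the commands above.
TASKS = {
    "Plasmid Host Identification": (
        "Plasmid_host_identification.py",
        "test_data/gene_task1_test_1000_format.json",
    ),
    "Gene Function Prediction": (
        "Gene_function_prediction.py",
        "test_data/gene_task2_test_500_opts.json",
    ),
    "Genome Assembly": (
        "Genome_assembly.py",
        "test_data/gene_task3_test_500_contig3_format.json",
    ),
    "Gene Essentiality Prediction": (
        "Gene_essentiality_prediction.py",
        "test_data/gene_task4_test_1000_format.json",
    ),
    "Minimal Genome Derivation": (
        "minimal_genome_inference.py",
        "test_data/bacteria_chromosomes_9-mini.json",
    ),
}


def build_command(script, model_path, input_json):
    """Return the argument list for one inference script (hypothetical helper)."""
    return [
        sys.executable, script,
        "--model", model_path,
        "--input-json-paths", input_json,
    ]


def run_all(model_path):
    """Run every task in order, stopping at the first script that fails."""
    for name, (script, data) in TASKS.items():
        print(f"== {name} ==")
        subprocess.run(build_command(script, model_path, data), check=True)


# Example call (requires the cloned repository and the model weights):
# run_all("/path/to/GenSyntax")
```

Passing check=True makes a failing script raise CalledProcessError, so a broken environment is reported at the first task instead of after all five have been attempted.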

Training Data

The training and evaluation datasets are available on HuggingFace:

👉 GenSyntax Datasets on HuggingFace

The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.

Generation Config

| Parameter | Value |
| --- | --- |
| temperature | 0.6 |
| top_p | 0.9 |
| do_sample | True |
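
These settings correspond to keyword arguments of the generate method in transformers. The sketch below is illustrative only: generate_text is a hypothetical helper, max_new_tokens=256 is an assumed value that this model card does not specify, and the prompt format each task expects is defined by the repository's scripts rather than here.

```python
# Sampling settings from the Generation Config table.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}


def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Sample a completion for `prompt` (hypothetical helper).

    `model` and `tokenizer` are the objects returned by from_pretrained.
    max_new_tokens is an assumed default, not a value from the model card.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        **GENERATION_KWARGS,
    )
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because do_sample is True, repeated calls with the same prompt can return different outputs. For reproducible runs, one option is to fix the random seed beforehand, for example with torch.manual_seed.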

Citation

If you use GenSyntax in your research, please cite the corresponding paper and link to the GitHub repository.

License

This model is released under the Apache 2.0 License.
