---
license: apache-2.0
language:
- en
tags:
- biology
- genomics
- llama
- fine-tuned
- plasmid
- gene-function
- genome-assembly
- gene-essentiality
pipeline_tag: text-generation
base_model: meta-llama/Meta-Llama-3.1-8B
---
# GenSyntax
GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.
## Model Details
| Property | Value |
|---|---|
| **Base Model** | Meta-Llama-3.1-8B |
| **Architecture** | LlamaForCausalLM |
| **Parameters** | ~8B |
| **Hidden Size** | 4096 |
| **Layers** | 32 |
| **Attention Heads** | 32 (GQA: 8 KV heads) |
| **Context Length** | 131,072 tokens |
| **Precision** | bfloat16 |
## Intended Use
GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:
1. **Plasmid Host Identification**: predict the bacterial host range of a plasmid from its sequence.
2. **Gene Function Prediction**: infer the functional annotation of a gene given its sequence context.
3. **Genome Assembly**: reconstruct genome sequences from contig fragments.
4. **Gene Essentiality Prediction**: classify whether a gene is essential for cell survival.
5. **Minimal Genome Derivation**: determine the minimal gene set required for a viable organism.
## Hardware Requirements
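A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference in bfloat16. Passing `device_map="auto"` also lets the model be sharded across multiple GPUs, which mainly adds memory headroom (for example, for longer inputs) rather than raw speed.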
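As a rough check on the 24 GB figure, the sketch below estimates memory from the numbers in the Model Details table. It assumes bfloat16 (2 bytes per value) for both weights and KV cache, a head dimension of hidden size divided by attention heads, and it ignores activations and framework overhead, so treat the results as lower bounds rather than measurements.

```python
# Back-of-the-envelope memory estimate from the Model Details table.
# Assumptions: bfloat16 everywhere, head_dim = hidden_size / attention_heads,
# no activation or framework overhead.
N_PARAMS = 8e9        # "~8B" parameters (approximate)
HIDDEN_SIZE = 4096
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8        # grouped-query attention
BYTES_PER_VALUE = 2   # bfloat16
GIB = 2**30


def weight_gib(n_params=N_PARAMS, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed to hold the weights, in GiB."""
    return n_params * bytes_per_value / GIB


def kv_cache_bytes_per_token():
    """Bytes of KV cache stored per token (keys and values, all layers)."""
    head_dim = HIDDEN_SIZE // N_HEADS
    return 2 * N_LAYERS * N_KV_HEADS * head_dim * BYTES_PER_VALUE


def kv_cache_gib(n_tokens, batch_size=1):
    """KV cache size in GiB for a given sequence length and batch size."""
    return n_tokens * batch_size * kv_cache_bytes_per_token() / GIB


print(f"weights: {weight_gib():.1f} GiB")
for n_tokens in (8_192, 131_072):
    print(f"KV cache at {n_tokens} tokens: {kv_cache_gib(n_tokens):.1f} GiB")
```

Under these assumptions the weights take about 14.9 GiB and the KV cache about 128 KiB per token, so a full 131,072-token context would need roughly 16 GiB of cache on top of the weights. That is more than a single 24 GB card holds, so very long inputs are likely to need shorter prompts or more than one GPU.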
## How to Use
### Load the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # matches the precision listed in Model Details
    device_map="auto",           # place weights on the available GPU(s)
)
```
### Inference Scripts
Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and use the provided scripts:
```bash
git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt
```
#### Plasmid Host Identification
```bash
python Plasmid_host_identification.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task1_test_1000_format.json
```
#### Gene Function Prediction
```bash
python Gene_function_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task2_test_500_opts.json
```
#### Genome Assembly
```bash
python Genome_assembly.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task3_test_500_contig3_format.json
```
#### Gene Essentiality Prediction
```bash
python Gene_essentiality_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task4_test_1000_format.json
```
#### Minimal Genome Derivation
```bash
python minimal_genome_inference.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
```
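To run all five tasks in one go, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the `TASKS` mapping, `build_command`, and `run_all` are hypothetical helper names, while the script names, flags, and test-data paths are copied from the commands above. It assumes the working directory is the cloned `GenSyntax` checkout.

```python
import shlex
import subprocess

# Script and test file per task, copied from the commands in this README.
# The dictionary keys are hypothetical labels chosen for this sketch.
TASKS = {
    "plasmid_host_identification": (
        "Plasmid_host_identification.py",
        "test_data/gene_task1_test_1000_format.json",
    ),
    "gene_function_prediction": (
        "Gene_function_prediction.py",
        "test_data/gene_task2_test_500_opts.json",
    ),
    "genome_assembly": (
        "Genome_assembly.py",
        "test_data/gene_task3_test_500_contig3_format.json",
    ),
    "gene_essentiality_prediction": (
        "Gene_essentiality_prediction.py",
        "test_data/gene_task4_test_1000_format.json",
    ),
    "minimal_genome_derivation": (
        "minimal_genome_inference.py",
        "test_data/bacteria_chromosomes_9-mini.json",
    ),
}


def build_command(task, model_path, python="python"):
    """Return the argument list for one task's inference script."""
    script, data_path = TASKS[task]
    return [python, script, "--model", model_path, "--input-json-paths", data_path]


def run_all(model_path):
    """Run every task in sequence; stops at the first script that fails."""
    for task in TASKS:
        subprocess.run(build_command(task, model_path), check=True)


# Print the commands without executing them, as a dry run.
for task in TASKS:
    print(shlex.join(build_command(task, "/path/to/GenSyntax")))
```

Calling `run_all("/path/to/GenSyntax")` would then execute the scripts one after another. Running them sequentially keeps only one copy of the model in GPU memory at a time.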
## Training Data
The training and evaluation datasets are available on Hugging Face:
👉 [GenSyntax Datasets on Hugging Face](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data)
The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.
## Generation Config
| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.9 |
| `do_sample` | True |
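When calling the model directly rather than through the repository scripts, these values can be passed explicitly to `generate`. The sketch below assumes `model` and `tokenizer` were loaded as shown under "Load the Model". The helper name `generate_text` and the `max_new_tokens=256` default are illustrative choices, and the prompt format each task expects is not documented here (the inference scripts in the repository define it).

```python
# Sampling settings copied from the Generation Config table.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}


def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Sample a completion for `prompt` and return only the newly generated text.

    `max_new_tokens` is a placeholder default, not a value from the model card.
    """
    # Tokenize and move the tensors to the device holding the model's first layer.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        **GENERATION_KWARGS,
    )
    # Drop the prompt tokens so only the continuation is decoded.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True)
```

Because `do_sample` is enabled, repeated calls on the same prompt can return different outputs. For reproducible evaluation runs, fixing the random seed (for example with `torch.manual_seed`) before generation is one option.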
## Citation
If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax).
## License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).