	---
	license: apache-2.0
	language:
	- en
	tags:
	- biology
	- genomics
	- llama
	- fine-tuned
	- plasmid
	- gene-function
	- genome-assembly
	- gene-essentiality
	pipeline_tag: text-generation
	base_model: meta-llama/Meta-Llama-3.1-8B
	---

	# GenSyntax

	GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.

	## Model Details

	| Property | Value |
	|---|---|
	| Base Model | Meta-Llama-3.1-8B |
	| Architecture | LlamaForCausalLM |
	| Parameters | ~8B |
	| Hidden Size | 4096 |
	| Layers | 32 |
	| Attention Heads | 32 (GQA: 8 KV heads) |
	| Context Length | 131,072 tokens |
	| Precision | bfloat16 |

	## Intended Use

	GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:

	1. Plasmid Host Identification — predict the bacterial host range of a plasmid from its sequence.
	2. Gene Function Prediction — infer the functional annotation of a gene given its sequence context.
	3. Genome Assembly — reconstruct genome sequences from contig fragments.
	4. Gene Essentiality Prediction — classify whether a gene is essential for cell survival.
	5. Minimal Genome Derivation — determine the minimal gene set required for a viable organism.

	## Hardware Requirements

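	A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference: the bfloat16 weights of an ~8B-parameter model occupy roughly 16 GB, which leaves headroom for activations and the KV cache at moderate prompt lengths. On multi-GPU machines, `device_map="auto"` distributes the model's layers across the available devices.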
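The memory budget can be estimated from the values in the Model Details table. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes the KV cache is stored in bfloat16, treats "~8B" as exactly 8e9 parameters, and ignores activations and framework overhead. The function names are illustrative.

```python
# Rough VRAM estimate from the Model Details table.
# Assumptions: bfloat16 weights and KV cache (2 bytes per value),
# exactly 8e9 parameters, no activation or framework overhead.

BYTES_PER_VALUE = 2        # bfloat16
NUM_PARAMS = 8e9           # "~8B" in the table, approximate
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEADS = 32
NUM_KV_HEADS = 8           # grouped-query attention (GQA)

GIB = 1024 ** 3


def weight_memory_gib() -> float:
    """Memory taken by the model weights, in GiB."""
    return NUM_PARAMS * BYTES_PER_VALUE / GIB


def kv_cache_bytes_per_token() -> int:
    """Bytes of KV cache added by one token of context."""
    head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS
    # One key and one value vector per layer and per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * head_dim * BYTES_PER_VALUE


def kv_cache_gib(num_tokens: int) -> float:
    """KV cache size for a single sequence of `num_tokens` tokens, in GiB."""
    return num_tokens * kv_cache_bytes_per_token() / GIB


if __name__ == "__main__":
    print(f"weights: {weight_memory_gib():.1f} GiB")
    for n in (4_096, 32_768, 131_072):
        print(f"KV cache at {n} tokens: {kv_cache_gib(n):.1f} GiB")
```

Under these assumptions the KV cache grows by 128 KiB per token, so a prompt close to the full 131,072-token context would need about 16 GiB of cache on top of the weights. Very long inputs may therefore call for more than one 24 GB GPU.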

	## How to Use

	### Load the Model

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_path = "MoonTideF/GenSyntax" # or local path

	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModelForCausalLM.from_pretrained(
	    model_path,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)
	```

	### Inference Scripts

	Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and use the provided scripts:

	```bash
	git clone https://github.com/nishiwen1214/GenSyntax.git
	cd GenSyntax
	pip install -r requirements.txt
	```
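The script commands in this section pass a local directory to `--model`. One way to obtain such a directory is to download the model repository with `huggingface_hub`. This is a minimal sketch, assuming `huggingface_hub` is installed. The target directory `models/GenSyntax` and the helper name `download_model` are hypothetical.

```python
from pathlib import Path

MODEL_REPO_ID = "MoonTideF/GenSyntax"       # repo id used in the loading example
LOCAL_MODEL_DIR = Path("models/GenSyntax")  # hypothetical target directory


def download_model(local_dir: Path = LOCAL_MODEL_DIR) -> Path:
    """Download the model repository and return the local directory."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=MODEL_REPO_ID, local_dir=str(local_dir))
    return Path(path)

# Usage (downloads roughly 16 GB of weights):
#   model_dir = download_model()
#   ...then pass str(model_dir) as the --model argument.
```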

	#### Plasmid Host Identification

	```bash
	python Plasmid_host_identification.py \
	    --model /path/to/GenSyntax \
	    --input-json-paths test_data/gene_task1_test_1000_format.json
	```

	#### Gene Function Prediction

	```bash
	python Gene_function_prediction.py \
	    --model /path/to/GenSyntax \
	    --input-json-paths test_data/gene_task2_test_500_opts.json
	```

	#### Genome Assembly

	```bash
	python Genome_assembly.py \
	    --model /path/to/GenSyntax \
	    --input-json-paths test_data/gene_task3_test_500_contig3_format.json
	```

	#### Gene Essentiality Prediction

	```bash
	python Gene_essentiality_prediction.py \
	    --model /path/to/GenSyntax \
	    --input-json-paths test_data/gene_task4_test_1000_format.json
	```

	#### Minimal Genome Derivation

	```bash
	python minimal_genome_inference.py \
	    --model /path/to/GenSyntax \
	    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
	```
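All five scripts take the same two arguments, so they can be driven from one small wrapper. The sketch below only rebuilds the commands shown above. The task keys and the `build_command` and `run_all` helpers are hypothetical names, and it assumes the script is run from the root of the cloned repository.

```python
import shlex
import subprocess
import sys

# Script and test-file names are copied from the commands above;
# the dictionary keys are illustrative labels only.
TASKS = {
    "plasmid_host": ("Plasmid_host_identification.py",
                     "test_data/gene_task1_test_1000_format.json"),
    "gene_function": ("Gene_function_prediction.py",
                      "test_data/gene_task2_test_500_opts.json"),
    "genome_assembly": ("Genome_assembly.py",
                        "test_data/gene_task3_test_500_contig3_format.json"),
    "gene_essentiality": ("Gene_essentiality_prediction.py",
                          "test_data/gene_task4_test_1000_format.json"),
    "minimal_genome": ("minimal_genome_inference.py",
                       "test_data/bacteria_chromosomes_9-mini.json"),
}


def build_command(task: str, model_path: str) -> list[str]:
    """Return the argv list for one task, mirroring the shell commands."""
    script, data = TASKS[task]
    return [sys.executable, script,
            "--model", model_path,
            "--input-json-paths", data]


def run_all(model_path: str, dry_run: bool = True) -> None:
    """Print each command; execute it as well when dry_run is False."""
    for task in TASKS:
        cmd = build_command(task, model_path)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all("/path/to/GenSyntax")  # dry run: prints the five commands
```

Calling `run_all(model_path, dry_run=False)` runs the tasks one after another and stops at the first script that exits with an error.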

	## Training Data

	The training and evaluation datasets are available on Hugging Face:

	👉 [GenSyntax Datasets on Hugging Face](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data)

	The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.

	## Generation Config

	| Parameter | Value |
	|---|---|
	| `temperature` | 0.6 |
	| `top_p` | 0.9 |
	| `do_sample` | True |
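These settings can be passed explicitly to `generate` when calling the model directly from `transformers`. The following is a minimal sketch, assuming `model` and `tokenizer` were loaded as in "Load the Model". The helper name `generate_text` and the `max_new_tokens` default are assumptions, and the prompt format each task expects is not covered here; the inference scripts in the repository are the reference for that.

```python
# Sampling settings copied from the Generation Config table.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}


def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    """Sample a continuation of `prompt` from an already-loaded model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # assumption: not specified in this card
        **GENERATION_KWARGS,
    )
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because `do_sample` is enabled, repeated calls with the same prompt can return different outputs.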

	## Citation

	If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax).

	## License

	This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).