MoonTideF committed
Commit 2237443 (verified)
Parent: 33e7d46

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,139 @@
---
license: apache-2.0
language:
- en
tags:
- biology
- genomics
- llama
- fine-tuned
- plasmid
- gene-function
- genome-assembly
- gene-essentiality
pipeline_tag: text-generation
base_model: meta-llama/Meta-Llama-3.1-8B
---

# GenSyntax

GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.

## Model Details

| Property | Value |
|---|---|
| **Base Model** | Meta-Llama-3.1-8B |
| **Architecture** | LlamaForCausalLM |
| **Parameters** | ~8B |
| **Hidden Size** | 4096 |
| **Layers** | 32 |
| **Attention Heads** | 32 (GQA: 8 KV heads) |
| **Context Length** | 131,072 tokens |
| **Precision** | bfloat16 |

## Intended Use

GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:

1. **Plasmid Host Identification** — predict the bacterial host range of a plasmid from its sequence.
2. **Gene Function Prediction** — infer the functional annotation of a gene given its sequence context.
3. **Genome Assembly** — reconstruct genome sequences from contig fragments.
4. **Gene Essentiality Prediction** — classify whether a gene is essential for cell survival.
5. **Minimal Genome Derivation** — determine the minimal gene set required for a viable organism.

## Hardware Requirements

A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference: in bfloat16 the weights come to roughly 16 GB (8B parameters × 2 bytes, matching the ~16.1 GB `model.safetensors` file in this repository), which leaves headroom for activations and the KV cache at moderate context lengths. For faster throughput, multi-GPU setups are supported via `device_map="auto"`.

## How to Use

### Load the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
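
This repository also ships a chat template (`chat_template.jinja`) and default sampling settings (`generation_config.json`), so a prompt can be formatted with `apply_chat_template` and generated with the shipped defaults. The snippet below is a minimal sketch only: the message contents are placeholders, not the official GenSyntax task prompts (see the task scripts below for the real formats).

```python
# Illustrative generation sketch; the messages are placeholders, not the
# prompt formats used by the GenSyntax task scripts.
messages = [
    {"role": "system", "content": "You are a genomics assistant."},
    {"role": "user", "content": "Is the following gene likely essential? <sequence>"},
]

# Render the prompt with the repository's chat template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sampling defaults from generation_config.json
    temperature=0.6,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```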

### Inference Scripts

Clone the [GenSyntax repository](https://github.com/nishiwen1214/GenSyntax) and use the provided scripts:

```bash
git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt
```

#### Plasmid Host Identification

```bash
python Plasmid_host_identification.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task1_test_1000_format.json
```

#### Gene Function Prediction

```bash
python Gene_function_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task2_test_500_opts.json
```

#### Genome Assembly

```bash
python Genome_assembly.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task3_test_500_contig3_format.json
```

#### Gene Essentiality Prediction

```bash
python Gene_essentiality_prediction.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/gene_task4_test_1000_format.json
```

#### Minimal Genome Derivation

```bash
python minimal_genome_inference.py \
    --model /path/to/GenSyntax \
    --input-json-paths test_data/bacteria_chromosomes_9-mini.json
```

## Training Data

The training and evaluation datasets are available on HuggingFace:

👉 [GenSyntax Datasets on HuggingFace](https://huggingface.co/datasets/ShiwenNi/GenSyntax-data)

The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.
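
To pull a test set programmatically, the `datasets` library can read JSON files directly from that repository. The file name below is a sketch borrowed from the `test_data` paths above and is only an assumption about the dataset repository's layout; check its dataset card for the actual structure.

```python
from datasets import load_dataset

# Assumed file name (taken from the test_data paths above); the actual
# layout of ShiwenNi/GenSyntax-data may differ.
ds = load_dataset(
    "ShiwenNi/GenSyntax-data",
    data_files="gene_task1_test_1000_format.json",
    split="train",
)
print(ds[0])
```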

## Generation Config

| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_p` | 0.9 |
| `do_sample` | True |
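
These are the defaults stored in `generation_config.json`, so `model.generate()` picks them up automatically. They can also be loaded explicitly and overridden per call; a minimal sketch, reusing `model_path` and `inputs` from the examples above:

```python
from transformers import GenerationConfig

# Load the shipped defaults and override one value for a single call.
gen_config = GenerationConfig.from_pretrained(model_path)
gen_config.temperature = 0.3  # example override

outputs = model.generate(**inputs, generation_config=gen_config, max_new_tokens=128)
```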

## Citation

If you use GenSyntax in your research, please cite the corresponding paper and link to the [GitHub repository](https://github.com/nishiwen1214/GenSyntax).

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
chat_template.jinja ADDED
@@ -0,0 +1,4 @@
{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ 'System: ' + system_message + '<|end_of_text|>' + '
' }}{% endif %}{% for message in loop_messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ 'Human: ' + content + '<|end_of_text|>' + '
Assistant:' }}{% elif message['role'] == 'assistant' %}{{ content + '<|end_of_text|>' + '
' }}{% endif %}{% endfor %}
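
For reference, this template renders system turns as `System: ...`, user turns as `Human: ...`, and ends with an `Assistant:` cue for the model to complete, with `<|end_of_text|>` separating turns. A minimal sketch of the rendered output (the message contents are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MoonTideF/GenSyntax")
messages = [
    {"role": "system", "content": "You are a genomics assistant."},    # placeholder text
    {"role": "user", "content": "Which hosts can carry this plasmid?"},  # placeholder text
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
# System: You are a genomics assistant.<|end_of_text|>
# Human: Which hosts can carry this plasmid?<|end_of_text|>
# Assistant:
```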
config.json ADDED
@@ -0,0 +1,36 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": null,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_theta": 500000.0,
    "rope_type": "llama3"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.7.0",
  "use_cache": true,
  "vocab_size": 128256
}
generation_config.json ADDED
@@ -0,0 +1,9 @@
{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "temperature": 0.6,
  "top_p": 0.9,
  "transformers_version": "5.7.0"
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a579316488a768e25e8c49d571448bd2f36d0f1e2ffa85322c8d5abf5ed19d71
size 16060556616
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
size 17209920
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
{
  "backend": "tokenizers",
  "bos_token": "<|begin_of_text|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|end_of_text|>",
  "is_local": true,
  "local_files_only": false,
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ],
  "model_max_length": 131072,
  "pad_token": "<|end_of_text|>",
  "padding_side": "right",
  "split_special_tokens": false,
  "tokenizer_class": "TokenizersBackend"
}