loubnabnl (HF Staff) committed
Commit 9cb7d0e · verified · 1 Parent(s): 45967f0

Initial upload from HuggingFaceBio/carbon-8B-longctx-32k-from-1T-decay@step-227500 with new README

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,87 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ language:
+ - dna
+ tags:
+ - dna
+ - genomic
+ - transformers
+ ---
+
+ # Carbon-8B
+
+ A larger, higher-capacity member of the **Carbon** family of generative DNA foundation models.
+
+ Carbon-8B is the 8B-parameter sibling of [Carbon-3B](https://huggingface.co/HuggingFaceBio/Carbon-3B). It is intended for users who can afford additional inference cost in exchange for stronger downstream performance. For the full design rationale, tokenizer specification, evaluation protocol, and usage details, see the **[Carbon-3B model card](https://huggingface.co/HuggingFaceBio/Carbon-3B)** and the Carbon technical report; this card covers only what is specific to Carbon-8B.
+
+ ## Model Summary
+
+ - **8B-parameter decoder-only autoregressive model** trained on DNA and RNA sequences with a primary focus on eukaryotes.
+ - **Same hybrid tokenizer** as Carbon-3B (non-overlapping 6-mers for DNA + Qwen3 BPE for English text). Each DNA token encodes 6 bp. Wrap DNA inputs in `<dna>...</dna>`; see the Carbon-3B card for tokenizer details and usage caveats.
+ - **Native context: 32,768 tokens (≈ 196 kbp).** Carbon-8B was extended from an 8 k-context base with a long-context decay stage, so it natively handles 32 k tokens. Applying YaRN at 4× extrapolates to 128 k tokens (≈ 786 kbp).
+ - Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ repo = "HuggingFaceBio/Carbon-8B"
+ tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+ # Load the weights in bfloat16 and move the model to the GPU for inference.
+ model = AutoModelForCausalLM.from_pretrained(
+     repo, dtype=torch.bfloat16,
+ ).cuda().eval()
+
+ prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"  # 42 bp, a multiple of 6 bp
+ inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
+ # Greedy decoding; decode only the newly generated tokens.
+ out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+ print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
+ ```
+
+ ## Training
+
+ Carbon-8B follows the same pre-training recipe as Carbon-3B on the **[`HuggingFaceBio/carbon-pretraining-corpus`](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus)**, with the identical data mixture, for 1T DNA 6-mer tokens. The main recipe ingredients are:
+
+ - **Learning-rate schedule:** cosine (instead of the WSD schedule used for Carbon-3B).
+ - **Loss schedule:** after 100B tokens, the loss switches from cross-entropy to FNS loss for the remainder of training.
+ - **Pre-training:** 1T 6-mer tokens (≈ 6T DNA base pairs) with GBS=512 and seq=8192 (≈ 4.19 M tokens per step), on 32 nodes (TP=4, DP=64) in bfloat16 with AdamW. The training mixture stays the same in the decay phase: 70% Generator eukaryote data (with metadata, with dropout), 16% mRNA, 4% splice mRNA, and 10% prokaryote data.
+ - **Long-context extension stage.** After pre-training, Carbon-8B undergoes a long-context decay phase that extends the native context from 8,192 to 32,768 tokens (≈ 196 kbp). Applying YaRN at 4× extrapolates further, to 128 k tokens (≈ 786 kbp).
+
+ Training infrastructure, framework ([Megatron-LM-Carbon](https://github.com/huggingface/Megatron-LM-Carbon)), and conversion path ([Megatron-Bridge](https://github.com/NVIDIA/Megatron-Bridge)) are identical to Carbon-3B.
+
+ ## Evaluation
+
+ All evaluations are zero-shot and use the [public Carbon evaluation pipeline](https://github.com/huggingface/carbon/tree/main/evaluation). See the [Carbon-3B card](https://huggingface.co/HuggingFaceBio/Carbon-3B#evaluation) for the full task suite, metrics, and methodology.
+
+ ### Downstream tasks
+
+ | Category | Metric (%) | Carbon 8B | Carbon 3B | Evo2 7B (1M) |
+ |---|---|---|---|---|
+ | Generative | SR eukaryote | **64.03** | <u>61.50</u> | 59.83 |
+ | Variant effect prediction | BRCA2 AUROC | **85.60** | <u>84.64</u> | 83.52 |
+ | | TraitGym Mendelian AUPRC by-chrom | <u>36.81</u> | 34.24 | **38.36** |
+ | | ClinVar coding AUROC, 48 kb | <u>93.43</u> | 93.30 | **93.70** |
+ | | ClinVar non-coding AUROC, 48 kb | **91.98** | <u>91.56</u> | 90.03 |
+ | Perturbation | TATA v2 | <u>65.62</u> | **65.94** | 63.72 |
+ | | SYN v2 | **92.18** | 82.78 | <u>84.92</u> |
+
+ ### Genome-NIAH (long-context retrieval)
+
+ Genome-NIAH measures how well a DNA model actually *uses* its long context. See the [`hf-carbon/genome-niah` dataset card](https://huggingface.co/datasets/hf-carbon/genome-niah) for the benchmark design.
+
+ | Context length | Carbon 3B (native / YaRN 4×) | Carbon 8B (native / YaRN 4×) | Evo2 7B |
+ |------------------------|------------------------------|------------------------------|---------|
+ | 16 k tokens (98 kbp) | 0.73 / 0.91 | 0.78 / 0.89 | 0.97 |
+ | 32 k tokens (196 kbp) | 0.55 / 0.90 | 0.69 / 0.87 | 0.95 |
+ | 64 k tokens (393 kbp) | — / 0.79 | — / 0.86 | 0.80 |
+ | 128 k tokens (786 kbp) | — / 0.27 | — / 0.65 | *running* |
+
+ Carbon-8B retrieves reliably up to its 32 k native boundary; **YaRN 4×** recovers most of the loss at the 32 k → 64 k boundary and extends usable retrieval to ≈ 786 kbp.
+
+ ## Intended use
+
+ Generative modelling, variant-effect prediction, motif-perturbation analysis, and long-context retrieval on DNA sequences. For faster inference at shorter contexts, use **Carbon-3B**.
+
+ ## License
+
+ Apache 2.0.
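The context-length and throughput figures quoted in the README follow from values it states directly (6 bp per DNA token, a 32,768-token native context, GBS=512, seq=8192). The sketch below reproduces that arithmetic and builds a `rope_scaling` override for YaRN at 4×. The dictionary keys follow the `rope_scaling` format accepted by `LlamaConfig` in `transformers` 4.x; whether this is the exact setting behind the reported YaRN numbers is an assumption, and the helper names `tokens_to_kbp` and `yarn_override` are illustrative only.

```python
BP_PER_TOKEN = 6          # each DNA token is a non-overlapping 6-mer
NATIVE_CONTEXT = 32_768   # max_position_embeddings in config.json
YARN_FACTOR = 4.0         # "YaRN at 4x" from the README


def tokens_to_kbp(n_tokens: int) -> float:
    """Convert a DNA token count to kilobase pairs."""
    return n_tokens * BP_PER_TOKEN / 1_000


def yarn_override(native_context: int = NATIVE_CONTEXT,
                  factor: float = YARN_FACTOR) -> dict:
    """Hypothetical config overrides for YaRN (assumed, not taken from the repo)."""
    return {
        "rope_scaling": {
            "rope_type": "yarn",
            "factor": factor,
            "original_max_position_embeddings": native_context,
        },
        "max_position_embeddings": int(native_context * factor),
    }


# Tokens consumed per optimizer step: global batch size x sequence length.
tokens_per_step = 512 * 8_192

print(tokens_to_kbp(NATIVE_CONTEXT))                      # native window in kbp
print(tokens_to_kbp(int(NATIVE_CONTEXT * YARN_FACTOR)))   # YaRN 4x window in kbp
print(tokens_per_step)
print(yarn_override())
```

A dictionary like this could be set on the loaded config before instantiating the model; the Carbon-3B card remains the reference for the recommended settings.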
added_tokens.json ADDED
@@ -0,0 +1,28 @@
+ {
+ "</think>": 151668,
+ "</tool_call>": 151658,
+ "</tool_response>": 151666,
+ "<think>": 151667,
+ "<tool_call>": 151657,
+ "<tool_response>": 151665,
+ "<|box_end|>": 151649,
+ "<|box_start|>": 151648,
+ "<|endoftext|>": 151643,
+ "<|file_sep|>": 151664,
+ "<|fim_middle|>": 151660,
+ "<|fim_pad|>": 151662,
+ "<|fim_prefix|>": 151659,
+ "<|fim_suffix|>": 151661,
+ "<|im_end|>": 151645,
+ "<|im_start|>": 151644,
+ "<|image_pad|>": 151655,
+ "<|object_ref_end|>": 151647,
+ "<|object_ref_start|>": 151646,
+ "<|quad_end|>": 151651,
+ "<|quad_start|>": 151650,
+ "<|repo_name|>": 151663,
+ "<|video_pad|>": 151656,
+ "<|vision_end|>": 151653,
+ "<|vision_pad|>": 151654,
+ "<|vision_start|>": 151652
+ }
chat_template.jinja ADDED
@@ -0,0 +1,85 @@
+ {%- if tools %}
+ {{- '<|im_start|>system\n' }}
+ {%- if messages[0].role == 'system' %}
+ {{- messages[0].content + '\n\n' }}
+ {%- endif %}
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+ {%- for tool in tools %}
+ {{- "\n" }}
+ {{- tool | tojson }}
+ {%- endfor %}
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+ {%- if messages[0].role == 'system' %}
+ {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+ {%- endif %}
+ {%- endif %}
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+ {%- for message in messages[::-1] %}
+ {%- set index = (messages|length - 1) - loop.index0 %}
+ {%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+ {%- set ns.multi_step_tool = false %}
+ {%- set ns.last_query_index = index %}
+ {%- endif %}
+ {%- endfor %}
+ {%- for message in messages %}
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+ {%- elif message.role == "assistant" %}
+ {%- set content = message.content %}
+ {%- set reasoning_content = '' %}
+ {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
+ {%- set reasoning_content = message.reasoning_content %}
+ {%- else %}
+ {%- if '</think>' in message.content %}
+ {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
+ {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+ {%- endif %}
+ {%- endif %}
+ {%- if loop.index0 > ns.last_query_index %}
+ {%- if loop.last or (not loop.last and reasoning_content) %}
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+ {%- else %}
+ {{- '<|im_start|>' + message.role + '\n' + content }}
+ {%- endif %}
+ {%- else %}
+ {{- '<|im_start|>' + message.role + '\n' + content }}
+ {%- endif %}
+ {%- if message.tool_calls %}
+ {%- for tool_call in message.tool_calls %}
+ {%- if (loop.first and content) or (not loop.first) %}
+ {{- '\n' }}
+ {%- endif %}
+ {%- if tool_call.function %}
+ {%- set tool_call = tool_call.function %}
+ {%- endif %}
+ {{- '<tool_call>\n{"name": "' }}
+ {{- tool_call.name }}
+ {{- '", "arguments": ' }}
+ {%- if tool_call.arguments is string %}
+ {{- tool_call.arguments }}
+ {%- else %}
+ {{- tool_call.arguments | tojson }}
+ {%- endif %}
+ {{- '}\n</tool_call>' }}
+ {%- endfor %}
+ {%- endif %}
+ {{- '<|im_end|>\n' }}
+ {%- elif message.role == "tool" %}
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+ {{- '<|im_start|>user' }}
+ {%- endif %}
+ {{- '\n<tool_response>\n' }}
+ {{- message.content }}
+ {{- '\n</tool_response>' }}
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+ {{- '<|im_end|>\n' }}
+ {%- endif %}
+ {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+ {{- '<|im_start|>assistant\n' }}
+ {%- if enable_thinking is defined and enable_thinking is false %}
+ {{- '<think>\n\n</think>\n\n' }}
+ {%- endif %}
+ {%- endif %}
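The reverse loop near the top of the template locates the last user message that is not a wrapped tool response; assistant turns after that index are the ones rendered with a `<think>` block. A minimal Python sketch of that search is shown below. It mirrors the Jinja logic only; the function names are hypothetical and it is not part of the repository.

```python
from typing import Dict, List


def is_tool_response(content: str) -> bool:
    """True when a message body is wrapped in <tool_response> tags."""
    return content.startswith("<tool_response>") and content.endswith("</tool_response>")


def last_query_index(messages: List[Dict[str, str]]) -> int:
    """Index of the last genuine user query, as the template's reverse loop computes it."""
    # Walk the conversation backwards and stop at the first match.
    for index in range(len(messages) - 1, -1, -1):
        message = messages[index]
        if message["role"] == "user" and not is_tool_response(message["content"]):
            return index
    # The template's default when no genuine user query is found.
    return len(messages) - 1


conversation = [
    {"role": "user", "content": "Annotate this sequence."},
    {"role": "assistant", "content": "Calling the annotation tool."},
    {"role": "user", "content": "<tool_response>\n{}\n</tool_response>"},
    {"role": "assistant", "content": "Done."},
]
print(last_query_index(conversation))
```

Here the wrapped tool response at index 2 is skipped, so the search lands on the first message.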
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+ "architectures": [
+ "LlamaForCausalLM"
+ ],
+ "attention_bias": false,
+ "attention_dropout": 0.0,
+ "bos_token_id": 1,
+ "dtype": "float32",
+ "eos_token_id": 2,
+ "head_dim": 128,
+ "hidden_act": "silu",
+ "hidden_size": 4096,
+ "initializer_range": 0.02,
+ "intermediate_size": 14336,
+ "max_position_embeddings": 32768,
+ "mlp_bias": false,
+ "model_type": "llama",
+ "num_attention_heads": 32,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 8,
+ "pretraining_tp": 1,
+ "rms_norm_eps": 1e-05,
+ "rope_scaling": null,
+ "rope_theta": 5000000.0,
+ "tie_word_embeddings": false,
+ "transformers_version": "4.57.6",
+ "use_cache": true,
+ "vocab_size": 155776
+ }
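As a consistency check, the parameter count can be derived from this config alone and compared with the `total_parameters` value recorded in `model.safetensors.index.json`. The sketch below assumes the standard Llama layout implied by the weight names in that index (bias-free `q/k/v/o` projections, a gated MLP with `gate/up/down` projections, two RMSNorm weights per layer, one final norm, and untied input and output embeddings); the function name is illustrative.

```python
CONFIG = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "vocab_size": 155776,
    "tie_word_embeddings": False,
}


def count_llama_parameters(cfg: dict) -> int:
    """Count weights for a bias-free Llama-style decoder (assumed layout)."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]

    attention = h * q_dim + 2 * h * kv_dim + q_dim * h   # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]               # gate, up, down projections
    norms = 2 * h                                        # input + post-attention RMSNorm
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * h
    if not cfg["tie_word_embeddings"]:
        embeddings *= 2                                  # separate lm_head matrix

    return cfg["num_hidden_layers"] * per_layer + embeddings + h  # + final norm


total = count_llama_parameters(CONFIG)
print(total)       # parameter count derived from config.json
print(total * 4)   # size in bytes if every weight were stored as float32
```

Under these assumptions the result matches the index metadata (`total_parameters` and, at 4 bytes per weight, `total_size`).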
dna_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "k": 6,
+ "dna_start_id": 151669,
+ "dna_vocab_size": 4107,
+ "dna_special_tokens": [
+ "<dna>",
+ "</dna>",
+ "<oov>"
+ ]
+ }
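The numbers in `dna_config.json` line up with the other files in this commit: the DNA block starts right after the highest id in `added_tokens.json` and ends exactly at `vocab_size` from `config.json`. The sketch below checks that layout and shows what non-overlapping 6-mer chunking looks like. `split_kmers` is a hypothetical helper, not the repository's tokenizer; how the real tokenizer maps k-mers to ids, or handles a trailing remainder, is not shown in this diff.

```python
from typing import List

K = 6                          # "k" in dna_config.json
DNA_START_ID = 151_669         # "dna_start_id"
DNA_VOCAB_SIZE = 4_107         # "dna_vocab_size"
VOCAB_SIZE = 155_776           # "vocab_size" in config.json
LAST_ADDED_TOKEN_ID = 151_668  # "</think>", highest id in added_tokens.json

n_kmers = 4 ** K                             # all k-mers over the alphabet A, C, G, T
n_non_kmer_ids = DNA_VOCAB_SIZE - n_kmers    # ids left over for non-k-mer tokens
dna_end_id = DNA_START_ID + DNA_VOCAB_SIZE   # exclusive upper bound of the DNA block


def split_kmers(sequence: str, k: int = K) -> List[str]:
    """Hypothetical helper: cut a sequence into non-overlapping k-mers."""
    if len(sequence) % k != 0:
        raise ValueError(f"sequence length {len(sequence)} is not a multiple of {k}")
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


print(n_kmers, n_non_kmer_ids)
print(DNA_START_ID == LAST_ADDED_TOKEN_ID + 1, dna_end_id == VOCAB_SIZE)
print(split_kmers("ATGCGCTAGCTA"))
```

Of the ids not accounted for by the 4^6 k-mers, three are the special tokens named in `dna_special_tokens`; the role of the remaining ids is not described in this diff.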
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "transformers_version": "4.57.6"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b7517e3390bff6bbbb00bcf5ca809caa54300446789a58533aebf85b7f7d14a
+ size 2467334984
model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b259414045cc70d190a424128e94041f44cca03bfbec7f2b59626e039d9fc95b
+ size 2499909576
model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0059b3e8e3d8d824c2884fef90684ef4b279fbf466553f6b39a95bcd8a88f849
+ size 2499909616
model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a5c18abc7fb7c435dd1340e189da1fbf6414654ad4acc9b0b653605811e6d5e0
+ size 2416006472
model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:108ab1b5bf91fe27d811b96307438653238c2d409b10aea39253cfb1fc2c459c
+ size 2499909632
model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:56b8285a88e4efb5bec0ff2e96fc3a94516a113764a0f40d0d9aca2f37396973
+ size 2499909640
model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aeec8fcc37dc292661ab47fea3089783a3ca1a056b3b1c2932efc48b0e9b3836
+ size 1628463872
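The seven LFS pointer sizes above can be summed and compared with the parameter count recorded in `model.safetensors.index.json`. The sketch below does only that arithmetic. Reading the result as "weights stored in a 2-byte dtype plus a small per-file header" is an inference from the numbers, not something the commit states.

```python
# Sizes in bytes, copied from the seven Git LFS pointer files in this commit.
SHARD_SIZES = [
    2_467_334_984,
    2_499_909_576,
    2_499_909_616,
    2_416_006_472,
    2_499_909_632,
    2_499_909_640,
    1_628_463_872,
]
TOTAL_PARAMETERS = 8_255_705_088   # "total_parameters" in the index metadata
INDEX_TOTAL_SIZE = 33_022_820_352  # "total_size" in the index metadata

shard_bytes = sum(SHARD_SIZES)
bytes_per_param_on_disk = shard_bytes / TOTAL_PARAMETERS
bytes_per_param_in_index = INDEX_TOTAL_SIZE / TOTAL_PARAMETERS
# Bytes left over if every parameter takes exactly 2 bytes on disk.
overhead = shard_bytes - 2 * TOTAL_PARAMETERS

print(shard_bytes)
print(round(bytes_per_param_on_disk, 4), bytes_per_param_in_index)
print(overhead)
```

The shards come to roughly 2 bytes per parameter, while the index metadata and the `"dtype": "float32"` entry in `config.json` correspond to 4 bytes per parameter, so the recorded `total_size` appears to describe a float32 copy rather than the uploaded files.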
model.safetensors.index.json ADDED
@@ -0,0 +1,299 @@
+ {
+ "metadata": {
+ "total_parameters": 8255705088,
+ "total_size": 33022820352
+ },
+ "weight_map": {
+ "model.layers.10.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.22.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.16.input_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.30.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.30.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.6.input_layernorm.weight": "model-00002-of-00007.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.12.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.13.input_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.19.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.10.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.30.input_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.21.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.15.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.18.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.12.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+ "model.norm.weight": "model-00007-of-00007.safetensors",
+ "model.layers.23.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.14.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.6.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.23.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.30.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.21.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.11.input_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.4.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.13.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.31.mlp.gate_proj.weight": "model-00007-of-00007.safetensors",
+ "model.layers.22.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.29.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.4.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.15.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.26.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.16.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00007.safetensors",
+ "model.layers.11.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.16.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.31.mlp.up_proj.weight": "model-00007-of-00007.safetensors",
+ "model.layers.30.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.4.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.28.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00007.safetensors",
+ "model.layers.2.input_layernorm.weight": "model-00002-of-00007.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.15.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.24.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.28.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.13.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.28.input_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.5.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.12.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.28.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.3.input_layernorm.weight": "model-00002-of-00007.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+ "model.layers.7.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.31.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.5.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.4.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.10.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.17.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.4.input_layernorm.weight": "model-00002-of-00007.safetensors",
+ "model.layers.15.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.18.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.29.input_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.30.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.14.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.18.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.27.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.22.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.19.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.13.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.17.input_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.6.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.10.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.11.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.11.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.14.input_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.4.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.20.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.17.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.22.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.15.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.3.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.20.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.3.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.embed_tokens.weight": "model-00001-of-00007.safetensors",
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.10.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.8.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.3.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.28.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.15.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.11.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.19.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.11.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.21.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.10.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.13.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.10.input_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.26.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.14.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.18.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.22.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.7.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.9.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.17.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.27.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.2.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.18.input_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.27.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.9.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.5.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.13.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.24.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.17.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.5.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.26.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.19.input_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.5.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+ "model.layers.31.mlp.down_proj.weight": "model-00007-of-00007.safetensors",
+ "model.layers.11.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.15.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.3.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.17.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.25.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.16.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.27.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.29.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.29.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.6.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.7.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.16.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.30.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+ "lm_head.weight": "model-00007-of-00007.safetensors",
+ "model.layers.27.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.6.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.25.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.12.input_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.5.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
+ "model.layers.8.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.3.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.19.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.18.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.16.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.15.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.4.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.22.input_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.6.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.13.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.6.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.3.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.20.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.28.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.12.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.12.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.26.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.29.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.22.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.19.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.10.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+ "model.layers.12.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.26.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.28.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.14.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+ "model.layers.22.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.21.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.4.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.13.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.12.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+ "model.layers.17.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+ "model.layers.31.post_attention_layernorm.weight": "model-00007-of-00007.safetensors",
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
+ "model.layers.30.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.3.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+ "model.layers.29.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+ "model.layers.28.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+ "model.layers.16.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
236
+ "model.layers.7.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
237
+ "model.layers.3.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
238
+ "model.layers.11.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
239
+ "model.layers.5.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
240
+ "model.layers.18.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
241
+ "model.layers.23.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
242
+ "model.layers.29.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
243
+ "model.layers.20.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
244
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
245
+ "model.layers.7.input_layernorm.weight": "model-00002-of-00007.safetensors",
246
+ "model.layers.19.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
247
+ "model.layers.4.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
248
+ "model.layers.31.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
249
+ "model.layers.16.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
250
+ "model.layers.18.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
251
+ "model.layers.20.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
252
+ "model.layers.17.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
253
+ "model.layers.5.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
254
+ "model.layers.9.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
255
+ "model.layers.8.input_layernorm.weight": "model-00003-of-00007.safetensors",
256
+ "model.layers.25.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
257
+ "model.layers.19.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
258
+ "model.layers.6.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
259
+ "model.layers.29.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
260
+ "model.layers.18.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
261
+ "model.layers.30.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
262
+ "model.layers.11.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
263
+ "model.layers.13.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
264
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
265
+ "model.layers.17.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
266
+ "model.layers.8.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
267
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
268
+ "model.layers.2.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
269
+ "model.layers.16.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
270
+ "model.layers.21.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
271
+ "model.layers.25.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
272
+ "model.layers.21.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
273
+ "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
274
+ "model.layers.8.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
275
+ "model.layers.21.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
276
+ "model.layers.31.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
277
+ "model.layers.12.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
278
+ "model.layers.14.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
279
+ "model.layers.20.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
280
+ "model.layers.31.input_layernorm.weight": "model-00007-of-00007.safetensors",
281
+ "model.layers.28.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
282
+ "model.layers.5.input_layernorm.weight": "model-00002-of-00007.safetensors",
283
+ "model.layers.15.input_layernorm.weight": "model-00004-of-00007.safetensors",
284
+ "model.layers.7.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
285
+ "model.layers.29.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
286
+ "model.layers.24.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
287
+ "model.layers.19.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
288
+ "model.layers.10.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
289
+ "model.layers.20.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
290
+ "model.layers.26.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
291
+ "model.layers.26.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
292
+ "model.layers.31.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
293
+ "model.layers.27.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
294
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
295
+ "model.layers.14.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
296
+ "model.layers.14.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
297
+ "model.layers.27.mlp.gate_proj.weight": "model-00006-of-00007.safetensors"
298
+ }
299
+ }
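Note on the weight map above: each entry maps one tensor name to the shard file that stores it, and a single decoder layer can straddle two shards (layer 14 has `self_attn.o_proj` in shard 3 but `post_attention_layernorm` in shard 4). The sketch below groups tensors by layer to make such splits visible. It is only an illustration: `shards_per_layer` and the `sample` dictionary are hypothetical names, and the sample holds just a few entries copied from the diff rather than the full index.

```python
import re
from collections import defaultdict
from typing import Dict, List


def shards_per_layer(weight_map: Dict[str, str]) -> Dict[int, List[str]]:
    """Return, for each decoder layer index, the sorted shard files holding its tensors."""
    layers = defaultdict(set)
    for tensor_name, shard in weight_map.items():
        match = re.match(r"model\.layers\.(\d+)\.", tensor_name)
        if match:  # skip tensors that are not part of a numbered layer
            layers[int(match.group(1))].add(shard)
    return {layer: sorted(shards) for layer, shards in sorted(layers.items())}


# A handful of entries copied from the weight map in this commit.
sample = {
    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
    "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
    "model.layers.14.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
    "model.layers.14.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
}

result = shards_per_layer(sample)
for layer, shards in result.items():
    print(layer, shards)
```

A loader that wants a single layer therefore may need to open more than one shard file.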
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+ size 11422654
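Note on the pointer above: `tokenizer.json` is tracked with Git LFS, so the commit stores only this three-line pointer (spec version, SHA-256 object id, size in bytes) instead of the file itself. The following is a minimal sketch of how a downloaded file could be checked against such a pointer. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any library.

```python
import hashlib
from typing import Dict

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer file into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(data: bytes, pointer: Dict[str, str]) -> bool:
    """Check that data has the size and SHA-256 digest recorded in the pointer."""
    algo, _, digest = pointer["oid"].partition(":")
    return (
        algo == "sha256"
        and len(data) == int(pointer["size"])
        and hashlib.sha256(data).hexdigest() == digest
    )


pointer = parse_lfs_pointer(POINTER)
print(pointer["size"])
```

In practice one would read the downloaded `tokenizer.json` as bytes and pass it to `matches_pointer` together with the parsed pointer.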
tokenizer.py ADDED
@@ -0,0 +1,583 @@
+ """
+ HybridDNATokenizer: Combines Qwen3 BPE tokenization with DNA 6-mer tokenization.
+
+ DNA sequences wrapped in <dna>...</dna> tags are tokenized as 6-mers.
+ All other text uses Qwen3's BPE tokenization.
+
+ Supports token_mask for Fine-grained Nucleotide Supervision (FNS):
+     -2: padding token
+     -1: text token (BPE)
+     0: DNA special token (<dna>, </dna>, <oov>)
+     1-5: partial 6-mer token — valid_length real bases at positions [0, valid_length),
+         right-padded with 'A' at positions [valid_length, k) so loss can supervise
+         positions 0..valid_len-1 via pos_mask = (valid_len > pos)
+     6: full 6-mer
+ """
+
+ import os
+ import json
+ import itertools
+ from typing import List, Optional, Tuple, Dict, Union, Any
+
+ from transformers import PreTrainedTokenizer, AutoTokenizer, BatchEncoding
+
+
+ class HybridDNATokenizer(PreTrainedTokenizer):
+     """
+     Hybrid tokenizer combining Qwen3 BPE with DNA 6-mer tokenization.
+
+     DNA regions must be wrapped in <dna>...</dna> tags to be tokenized as 6-mers.
+     Without tags, DNA sequences are tokenized as regular BPE text.
+
+     For pure-DNA input (no metadata tokens), pass auto_dna_tags=True to have
+     <dna>...</dna> tags added automatically when they are absent. Do NOT set
+     this if the input may contain BPE metadata such as species tags
+     (<fungi_species> etc.) — those must appear outside <dna>...</dna> and would
+     be incorrectly k-mer encoded if auto-wrapping fired.
+     """
+
+     model_input_names = ["input_ids", "attention_mask"]
+
+     def __init__(
+         self,
+         base_tokenizer_path: Optional[str] = None,
+         k: int = 6,
+         auto_dna_tags: bool = False,
+         **kwargs
+     ):
+         self.k = k
+
+         # Load base tokenizer (Qwen3-4B-Base)
+         self._base_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Base")
+
+         # Get base vocabulary
+         self._base_vocab = self._base_tokenizer.get_vocab()
+         self._base_vocab_size = len(self._base_vocab)
+
+         # Initialize DNA vocabulary
+         self._init_dna_vocab()
+
+         # Build combined vocabulary
+         self._build_combined_vocab()
+
+         # Set special tokens
+         self._eos_token = kwargs.pop('eos_token', None) or "<|endoftext|>"
+         self._pad_token = kwargs.pop('pad_token', None) or self._base_tokenizer.pad_token or "<|endoftext|>"
+
+         # Initialize parent class
+         super().__init__(
+             eos_token=self._eos_token,
+             pad_token=self._pad_token,
+             **kwargs
+         )
+
+         self.special_tokens = self.dna_special_tokens + [self._eos_token, self._pad_token]
+         self.auto_dna_tags = auto_dna_tags
+
+     def _init_dna_vocab(self):
+         """Initialize DNA vocabulary (special tokens + k-mers + padding for 128 alignment)."""
+         bases = ['A', 'T', 'C', 'G']
+
+         # DNA special tokens
+         self.dna_special_tokens = ["<dna>", "</dna>", "<oov>"]
+
+         # Generate all k-mer combinations (4^k = 4096 for k=6)
+         self.kmers = [''.join(kmer) for kmer in itertools.product(bases, repeat=self.k)]
+
+         # DNA tokens start after base vocabulary
+         self.dna_start_id = self._base_vocab_size
+
+         # All DNA tokens get new IDs (no reuse of base vocab IDs, even for
+         # overlapping tokens like CCCCCC — they have different semantics in
+         # DNA context vs BPE context, per Qiuyi's recommendation)
+         base_dna_tokens = self.dna_special_tokens + self.kmers
+
+         # Calculate padding for 128 alignment
+         total_vocab_unpadded = self._base_vocab_size + len(base_dna_tokens)
+         target_vocab_size = ((total_vocab_unpadded + 127) // 128) * 128
+         num_padding_tokens = target_vocab_size - total_vocab_unpadded
+
+         # Add unused padding tokens
+         self.padding_tokens = [f"<unused_{i}>" for i in range(num_padding_tokens)]
+
+         # Create DNA token mappings — all get sequential new IDs
+         self.dna_token_to_id = {}
+         self.dna_id_to_token = {}
+
+         current_id = self.dna_start_id
+         for token in base_dna_tokens:
+             self.dna_token_to_id[token] = current_id
+             self.dna_id_to_token[current_id] = token
+             current_id += 1
+
+         # Add padding tokens
+         for token in self.padding_tokens:
+             self.dna_token_to_id[token] = current_id
+             self.dna_id_to_token[current_id] = token
+             current_id += 1
+
+         self.dna_vocab_size = len(base_dna_tokens) + len(self.padding_tokens)
+
+         # Set DNA special token IDs
+         self.dna_begin_token_id = self.dna_token_to_id["<dna>"]
+         self.dna_end_token_id = self.dna_token_to_id["</dna>"]
+         self.oov_token_id = self.dna_token_to_id["<oov>"]
+
+     def _build_combined_vocab(self):
+         """Build combined vocabulary (base + DNA)."""
+         self._vocab = self._base_vocab.copy()
+
+         for token, token_id in self.dna_token_to_id.items():
+             if token not in self._vocab:
+                 self._vocab[token] = token_id
+
+         self._id_to_token = {v: k for k, v in self._vocab.items()}
+         for token_id, token in self.dna_id_to_token.items():
+             if token_id not in self._id_to_token:
+                 self._id_to_token[token_id] = token
+
+     @property
+     def vocab_size(self) -> int:
+         return max(self._vocab.values()) + 1
+
+     def get_vocab(self) -> Dict[str, int]:
+         return self._vocab.copy()
+
+     def __len__(self):
+         # Override default (len(get_vocab())) because get_vocab() deduplicates
+         # CCCCCC which exists as both BPE (ID 91443) and DNA 6-mer (ID 154402).
+         return self.vocab_size
+
+     def _split_by_dna_tags(self, text: str) -> List[Tuple[str, bool]]:
+         segments = []
+         i = 0
+         n = len(text)
+
+         while i < n:
+             start_pos = text.find('<dna>', i)
+             end_pos = text.find('</dna>', i)
+
+             if start_pos == -1 and end_pos == -1:
+                 remaining = text[i:]
+                 if remaining:
+                     segments.append((remaining, False))
+                 break
+
+             if start_pos == -1 and end_pos != -1:
+                 dna_region = text[i:end_pos + 6]
+                 if dna_region:
+                     segments.append((dna_region, True))
+                 i = end_pos + 6
+                 continue
+
+             if start_pos != -1 and end_pos == -1:
+                 if i < start_pos:
+                     normal_text = text[i:start_pos]
+                     if normal_text:
+                         segments.append((normal_text, False))
+                 dna_region = text[start_pos:]
+                 if dna_region:
+                     segments.append((dna_region, True))
+                 break
+
+             if start_pos < end_pos:
+                 if i < start_pos:
+                     normal_text = text[i:start_pos]
+                     if normal_text:
+                         segments.append((normal_text, False))
+                 dna_region = text[start_pos:end_pos + 6]
+                 if dna_region:
+                     segments.append((dna_region, True))
+                 i = end_pos + 6
+             else:
+                 dna_region = text[i:end_pos + 6]
+                 if dna_region:
+                     segments.append((dna_region, True))
+                 i = end_pos + 6
+
+         return segments
+
+     def _parse_dna_region(self, dna_region: str) -> Tuple[str, bool, bool]:
+         if dna_region == '<dna>':
+             return '', True, False
+         elif dna_region == '</dna>':
+             return '', False, True
+
+         has_start = dna_region.startswith('<dna>')
+         has_end = dna_region.endswith('</dna>')
+
+         content = dna_region
+         if has_start:
+             content = content[5:]
+         if has_end and content.endswith('</dna>'):
+             content = content[:-6]
+
+         return content.strip(), has_start, has_end
+
+     def _process_dna_sequence(self, dna_seq: str) -> Dict:
+         k = self.k
+         dna_seq = dna_seq.upper()
+
+         kmer_tokens = []
+         valid_bases = set('ATCG')
+
+         def is_valid_kmer(kmer):
+             return len(kmer) == k and all(base in valid_bases for base in kmer)
+
+         for i in range(0, len(dna_seq) - k + 1, k):
+             kmer = dna_seq[i:i+k]
+             if is_valid_kmer(kmer):
+                 kmer_tokens.append(kmer)
+             else:
+                 kmer_tokens.append("<oov>")
+
+         processed_length = len(kmer_tokens) * k
+         remaining = dna_seq[processed_length:]
+         padding_length = 0
+         valid_length = k
+
+         if remaining:
+             padding_needed = k - len(remaining)
+             # Right-pad with A: real bases occupy positions [0, valid_length).
+             # The hybrid BP loss supervises positions 0..valid_len-1 via
+             # pos_mask = (valid_len > pos)
+             # so padding must be at the END, not the start.
+             padded = remaining + 'A' * padding_needed
+
+             if is_valid_kmer(padded):
+                 kmer_tokens.append(padded)
+             else:
+                 kmer_tokens.append("<oov>")
+
+             padding_length = padding_needed
+             valid_length = len(remaining)
+
+         return {
+             "kmer_tokens": kmer_tokens,
+             "padding_length": padding_length,
+             "valid_length": valid_length,
+         }
+
+     def _tokenize(self, text: str, **kwargs) -> List[str]:
+         return list(text)
+
+     def _convert_token_to_id(self, token: str) -> int:
+         if token in self.dna_token_to_id:
+             return self.dna_token_to_id[token]
+         return self._base_vocab.get(token, self._base_tokenizer.unk_token_id or 0)
+
+     def _convert_id_to_token(self, index: int) -> str:
+         if index in self.dna_id_to_token:
+             return self.dna_id_to_token[index]
+         return self._id_to_token.get(index, "<oov>")
+
+     def convert_tokens_to_string(self, tokens: List[str]) -> str:
+         return "".join(tokens)
+
+     def encode(
+         self,
+         text: str,
+         add_special_tokens: bool = False,
+         return_token_mask: bool = False,
+         auto_dna_tags: Optional[bool] = None,
+         **kwargs
+     ) -> Union[List[int], Tuple[List[int], List[int]]]:
+         use_auto = self.auto_dna_tags if auto_dna_tags is None else auto_dna_tags
+         if use_auto and '<dna>' not in text:
+             text = f'<dna>{text}</dna>'
+
+         segments = self._split_by_dna_tags(text)
+
+         token_ids = []
+         token_mask = [] if return_token_mask else None
+
+         for segment_content, is_dna in segments:
+             if is_dna:
+                 dna_content, has_start, has_end = self._parse_dna_region(segment_content)
+
+                 if has_start:
+                     token_ids.append(self.dna_begin_token_id)
+                     if return_token_mask:
+                         token_mask.append(0)
+
+                 if dna_content:
+                     result = self._process_dna_sequence(dna_content)
+
+                     for idx, kmer in enumerate(result["kmer_tokens"]):
+                         token_id = self.dna_token_to_id.get(kmer, self.oov_token_id)
+                         token_ids.append(token_id)
+
+                         if return_token_mask:
+                             if kmer == "<oov>":
+                                 token_mask.append(0)
+                             elif idx == len(result["kmer_tokens"]) - 1 and result["padding_length"] > 0:
+                                 token_mask.append(result["valid_length"])
+                             else:
+                                 token_mask.append(self.k)
+
+                 if has_end:
+                     token_ids.append(self.dna_end_token_id)
+                     if return_token_mask:
+                         token_mask.append(0)
+             else:
+                 base_ids = self._base_tokenizer.encode(
+                     segment_content,
+                     add_special_tokens=False
+                 )
+                 token_ids.extend(base_ids)
+                 if return_token_mask:
+                     token_mask.extend([-1] * len(base_ids))
+
+         # Do NOT append EOS when add_special_tokens=True. Qwen3 doesn't add
+         # BOS/EOS either, and appending EOS here breaks lighteval's
+         # tok_encode_pair: it relies on
+         # len(encode(ctx)) + len(encode(answer)) == len(encode(ctx + answer))
+         # which the extra EOS violates by shifting the split by 1.
+
+         if return_token_mask:
+             return token_ids, token_mask
+         return token_ids
+
+     def decode(
+         self,
+         token_ids: Union[int, List[int]],
+         skip_special_tokens: bool = False,
+         **kwargs
+     ) -> str:
+         if isinstance(token_ids, int):
+             token_ids = [token_ids]
+
+         if skip_special_tokens:
+             special_ids = {self.eos_token_id, self.pad_token_id}
+             token_ids = [tid for tid in token_ids if tid not in special_ids]
+
+         parts = []
+         i = 0
+
+         while i < len(token_ids):
+             tid = token_ids[i]
+
+             if tid == self.dna_begin_token_id:
+                 dna_tokens = []
+                 i += 1
+
+                 while i < len(token_ids) and token_ids[i] != self.dna_end_token_id:
+                     if token_ids[i] in self.dna_id_to_token:
+                         dna_tokens.append(self.dna_id_to_token[token_ids[i]])
+                     i += 1
+
+                 dna_seq = ''.join(dna_tokens)
+
+                 if skip_special_tokens:
+                     parts.append(dna_seq)
+                 else:
+                     parts.append(f"<dna>{dna_seq}")
+                     if i < len(token_ids) and token_ids[i] == self.dna_end_token_id:
+                         parts.append("</dna>")
+                 i += 1
+
+             elif tid in self.dna_id_to_token:
+                 # This branch handles k-mer tokens that appear without a <dna>
+                 # wrapper — the common generation case where <dna> was in the
+                 # prompt but only the generated portion is being decoded.
+                 # K-mer tokens are content, not special tokens, so always decode
+                 # them. Only drop true DNA special tokens (<dna>, </dna>, <oov>)
+                 # when skip_special_tokens=True.
+                 is_dna_special = tid in (self.dna_begin_token_id, self.dna_end_token_id, self.oov_token_id)
+                 if not (skip_special_tokens and is_dna_special):
+                     parts.append(self.dna_id_to_token[tid])
+                 i += 1
+
+             else:
+                 text_ids = []
+                 while i < len(token_ids):
+                     curr_id = token_ids[i]
+                     if curr_id in self.dna_id_to_token or curr_id == self.dna_begin_token_id:
+                         break
+                     text_ids.append(curr_id)
+                     i += 1
+
+                 if text_ids:
+                     decoded = self._base_tokenizer.decode(text_ids, skip_special_tokens=skip_special_tokens)
+                     parts.append(decoded)
+
+         return ''.join(parts)
+
+     def batch_decode(
+         self,
+         sequences: Union[List[int], List[List[int]], "torch.Tensor"],
+         skip_special_tokens: bool = False,
+         **kwargs
+     ) -> List[str]:
+         return [
+             self.decode(
+                 seq.tolist() if hasattr(seq, 'tolist') else list(seq),
+                 skip_special_tokens=skip_special_tokens,
+                 **kwargs
+             )
+             for seq in sequences
+         ]
+
+     def __call__(
+         self,
+         text: Union[str, List[str]],
+         add_special_tokens: bool = False,
+         padding: bool = False,
+         truncation: bool = False,
+         max_length: Optional[int] = None,
+         return_tensors: Optional[str] = None,
+         return_token_mask: bool = False,
+         auto_dna_tags: Optional[bool] = None,
+         **kwargs
+     ) -> Dict[str, Any]:
+         is_batch = isinstance(text, list)
+         texts = text if is_batch else [text]
+
+         all_ids = []
+         all_masks = [] if return_token_mask else None
+
+         for t in texts:
+             if return_token_mask:
+                 ids, mask = self.encode(t, add_special_tokens=add_special_tokens, return_token_mask=True, auto_dna_tags=auto_dna_tags)
+                 all_ids.append(ids)
+                 all_masks.append(mask)
+             else:
+                 ids = self.encode(t, add_special_tokens=add_special_tokens, return_token_mask=False, auto_dna_tags=auto_dna_tags)
+                 all_ids.append(ids)
+
+         if padding:
+             max_len = max(len(ids) for ids in all_ids)
+             if max_length:
+                 max_len = min(max_len, max_length)
+
+             padded_ids = []
+             attention_masks = []
+             padded_token_masks = [] if return_token_mask else None
+
+             for idx, ids in enumerate(all_ids):
+                 pad_len = max_len - len(ids)
+
+                 if pad_len > 0:
+                     ids = ids + [self.pad_token_id] * pad_len
+                     attn = [1] * (max_len - pad_len) + [0] * pad_len
+                     if return_token_mask:
+                         mask = all_masks[idx] + [-2] * pad_len
+                 else:
+                     ids = ids[:max_len]
+                     attn = [1] * max_len
+                     if return_token_mask:
+                         mask = all_masks[idx][:max_len]
+
+                 padded_ids.append(ids)
+                 attention_masks.append(attn)
+                 if return_token_mask:
+                     padded_token_masks.append(mask)
+
+             all_ids = padded_ids
+             all_masks = padded_token_masks
+         else:
+             attention_masks = [[1] * len(ids) for ids in all_ids]
+
+         result = {
+             "input_ids": all_ids if is_batch else all_ids[0],
+             "attention_mask": attention_masks if is_batch else attention_masks[0],
+         }
+
+         if return_token_mask:
+             result["token_mask"] = all_masks if is_batch else all_masks[0]
+
+         if return_tensors == "pt":
+             import torch
+             if is_batch:
+                 result["input_ids"] = torch.tensor(result["input_ids"])
+                 result["attention_mask"] = torch.tensor(result["attention_mask"])
+                 if return_token_mask:
+                     result["token_mask"] = torch.tensor(result["token_mask"])
+             else:
+                 result["input_ids"] = torch.tensor([result["input_ids"]])
+                 result["attention_mask"] = torch.tensor([result["attention_mask"]])
+                 if return_token_mask:
+                     result["token_mask"] = torch.tensor([result["token_mask"]])
+
+         return BatchEncoding(result, tensor_type=return_tensors)
+
+     def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+         vocab_file = os.path.join(
+             save_directory,
+             (filename_prefix + "-" if filename_prefix else "") + "vocab.json"
+         )
+
+         with open(vocab_file, "w", encoding="utf-8") as f:
+             json.dump(self._vocab, f, ensure_ascii=False, indent=2)
+
+         return (vocab_file,)
+
+     def save_pretrained(self, save_directory: str, **kwargs):
+         os.makedirs(save_directory, exist_ok=True)
+
+         # Save base tokenizer files
+         self._base_tokenizer.save_pretrained(save_directory)
+
+         # Save DNA config
+         dna_config = {
+             "k": self.k,
+             "dna_start_id": self.dna_start_id,
+             "dna_vocab_size": self.dna_vocab_size,
+             "dna_special_tokens": self.dna_special_tokens,
+             "auto_dna_tags": self.auto_dna_tags,
+         }
+
+         dna_config_path = os.path.join(save_directory, "dna_config.json")
+         with open(dna_config_path, "w", encoding="utf-8") as f:
+             json.dump(dna_config, f, indent=2)
+
+         # Update tokenizer_config.json with auto_map
+         config_path = os.path.join(save_directory, "tokenizer_config.json")
+
+         if os.path.exists(config_path):
+             with open(config_path, "r") as f:
+                 config = json.load(f)
+         else:
+             config = {}
+
+         config.update({
+             "tokenizer_class": "HybridDNATokenizer",
+             "auto_map": {
+                 "AutoTokenizer": ["tokenizer.HybridDNATokenizer", None]
+             },
+             "k": self.k,
+             "auto_dna_tags": self.auto_dna_tags,
+         })
+
+         with open(config_path, "w", encoding="utf-8") as f:
+             json.dump(config, f, indent=2, ensure_ascii=False)
+
+         # Copy this tokenizer.py to save directory
+         import shutil
+         src_py = os.path.abspath(__file__)
+         dst_py = os.path.join(save_directory, "tokenizer.py")
+         if os.path.exists(src_py) and src_py != dst_py:
+             shutil.copy2(src_py, dst_py)
+
+         return (save_directory,)
+
+     @classmethod
+     def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
+         k = 6
+         auto_dna_tags = False
+
+         dna_config_path = os.path.join(pretrained_model_name_or_path, "dna_config.json")
+         tok_config_path = os.path.join(pretrained_model_name_or_path, "tokenizer_config.json")
+
+         if os.path.exists(dna_config_path):
+             with open(dna_config_path, "r") as f:
+                 dna_config = json.load(f)
+             k = dna_config.get("k", 6)
+             auto_dna_tags = dna_config.get("auto_dna_tags", False)
+         elif os.path.exists(tok_config_path):
+             with open(tok_config_path, "r") as f:
+                 tok_config = json.load(f)
+             k = tok_config.get("k", 6)
+             auto_dna_tags = tok_config.get("auto_dna_tags", False)
+
+         return cls(base_tokenizer_path=pretrained_model_name_or_path, k=k, auto_dna_tags=auto_dna_tags, **kwargs)
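Note on `tokenizer.py` above: the core of the DNA path is non-overlapping 6-mer chunking with a right-padded tail, a per-token `token_mask`, and a vocabulary rounded up to a multiple of 128. The standalone sketch below re-implements just that arithmetic so it can be checked without downloading the Qwen3 tokenizer. `chunk_dna`, `kmer_id`, and `padded_vocab_size` are hypothetical helper names, and the base vocabulary size of 151669 is an assumption, chosen because it reproduces the `CCCCCC` ID 154402 quoted in the `__len__` comment.

```python
from typing import List, Tuple

BASES = "ATCG"             # same base order as the tokenizer's k-mer table
BASE_VOCAB_SIZE = 151669   # assumed size of the Qwen3 base vocabulary
N_DNA_SPECIAL = 3          # <dna>, </dna>, <oov>


def chunk_dna(seq: str, k: int = 6) -> Tuple[List[str], List[int]]:
    """Split seq into non-overlapping k-mers and return (tokens, token_mask)."""
    seq = seq.upper()
    tokens, mask = [], []
    for i in range(0, len(seq), k):
        chunk = seq[i:i + k]
        valid = len(chunk)                 # k for a full chunk, less for the tail
        chunk += "A" * (k - valid)         # right-pad a short tail with 'A'
        if all(base in BASES for base in chunk):
            tokens.append(chunk)
            mask.append(valid)
        else:
            tokens.append("<oov>")         # any non-ATCG base makes the chunk OOV
            mask.append(0)
    return tokens, mask


def kmer_id(kmer: str, base_vocab_size: int = BASE_VOCAB_SIZE) -> int:
    """ID of a k-mer when k-mers are enumerated in itertools.product order."""
    index = 0
    for base in kmer:
        index = index * 4 + BASES.index(base)
    return base_vocab_size + N_DNA_SPECIAL + index


def padded_vocab_size(base_vocab_size: int = BASE_VOCAB_SIZE, k: int = 6) -> int:
    """Total vocabulary size after rounding up to a multiple of 128."""
    total = base_vocab_size + N_DNA_SPECIAL + 4 ** k
    return ((total + 127) // 128) * 128


print(chunk_dna("ATCGATCGAT"))   # (['ATCGAT', 'CGATAA'], [6, 4])
print(kmer_id("CCCCCC"))         # 154402
print(padded_vocab_size())       # 155776
```

Under that assumed base size, the unpadded vocabulary is 151669 + 3 + 4096 = 155768 entries, so eight `<unused_i>` tokens are needed to reach 155776.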
tokenizer_config.json ADDED
@@ -0,0 +1,247 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151646": {
+ "content": "<|object_ref_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151647": {
+ "content": "<|object_ref_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151648": {
+ "content": "<|box_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151649": {
+ "content": "<|box_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151650": {
+ "content": "<|quad_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151651": {
+ "content": "<|quad_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151652": {
+ "content": "<|vision_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151653": {
+ "content": "<|vision_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151654": {
+ "content": "<|vision_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151655": {
+ "content": "<|image_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151656": {
+ "content": "<|video_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151657": {
+ "content": "<tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151658": {
+ "content": "</tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151659": {
+ "content": "<|fim_prefix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151660": {
+ "content": "<|fim_middle|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151661": {
+ "content": "<|fim_suffix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151662": {
+ "content": "<|fim_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151663": {
+ "content": "<|repo_name|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151664": {
+ "content": "<|file_sep|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151665": {
+ "content": "<tool_response>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151666": {
+ "content": "</tool_response>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151667": {
+ "content": "<think>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151668": {
+ "content": "</think>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "additional_special_tokens": [
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|object_ref_start|>",
+ "<|object_ref_end|>",
+ "<|box_start|>",
+ "<|box_end|>",
+ "<|quad_start|>",
+ "<|quad_end|>",
+ "<|vision_start|>",
+ "<|vision_end|>",
+ "<|vision_pad|>",
+ "<|image_pad|>",
+ "<|video_pad|>"
+ ],
+ "bos_token": null,
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|endoftext|>",
+ "errors": "replace",
+ "extra_special_tokens": {},
+ "model_max_length": 131072,
+ "pad_token": "<|endoftext|>",
+ "split_special_tokens": false,
+ "tokenizer_class": "HybridDNATokenizer",
+ "unk_token": null,
+ "auto_map": {
+ "AutoTokenizer": [
+ "tokenizer.HybridDNATokenizer",
+ null
+ ]
+ },
+ "k": 6,
+ "auto_dna_tags": false
+ }
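
The config above marks some added tokens as `"special": true` (for example `<|video_pad|>`) and others as `"special": false` (for example `<tool_call>` and `<think>`). A minimal sketch of how one might check that split offline is shown below. It parses a small fragment copied from this diff with the standard `json` module and does not load the real tokenizer. The top-level key `added_tokens_decoder` is assumed from the usual `tokenizer_config.json` layout, since it is not visible in this part of the diff, and `split_added_tokens` is a hypothetical helper name.

```python
import json

# Fragment copied from the tokenizer_config.json diff above (two entries only).
# Assumption: the per-ID entries live under "added_tokens_decoder".
CONFIG_FRAGMENT = """
{
  "added_tokens_decoder": {
    "151656": {"content": "<|video_pad|>", "special": true},
    "151657": {"content": "<tool_call>", "special": false}
  },
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "k": 6
}
"""


def split_added_tokens(config):
    """Return (special, non_special) token strings, ordered by token ID."""
    special, non_special = [], []
    entries = sorted(config["added_tokens_decoder"].items(), key=lambda kv: int(kv[0]))
    for _token_id, entry in entries:
        # Route each token string by its "special" flag.
        (special if entry["special"] else non_special).append(entry["content"])
    return special, non_special


config = json.loads(CONFIG_FRAGMENT)
special, non_special = split_added_tokens(config)
print("special:", special)
print("non-special:", non_special)
# The config reuses the same string for end-of-sequence and padding.
print("eos == pad:", config["eos_token"] == config["pad_token"])
```

Running the same helper on the full file (loaded with `json.load`) would list every added token. Note that the `"k": 6` and `"auto_dna_tags": false` fields are specific to `HybridDNATokenizer`; how that class interprets them is not shown in this diff.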
vocab.json ADDED
The diff for this file is too large to render. See raw diff