loubnabnl (HF Staff) committed 3 days ago

Commit 9cb7d0e (verified) · 1 Parent(s): 45967f0

Initial upload from HuggingFaceBio/carbon-8B-longctx-32k-from-1T-decay@step-227500 with new README

Files changed (21)

.gitattributes +1 -0
README.md +87 -0
added_tokens.json +28 -0
chat_template.jinja +85 -0
config.json +29 -0
dna_config.json +10 -0
generation_config.json +6 -0
merges.txt +0 -0
model-00001-of-00007.safetensors +3 -0
model-00002-of-00007.safetensors +3 -0
model-00003-of-00007.safetensors +3 -0
model-00004-of-00007.safetensors +3 -0
model-00005-of-00007.safetensors +3 -0
model-00006-of-00007.safetensors +3 -0
model-00007-of-00007.safetensors +3 -0
model.safetensors.index.json +299 -0
special_tokens_map.json +31 -0
tokenizer.json +3 -0
tokenizer.py +583 -0
tokenizer_config.json +247 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,87 @@

+---
+library_name: transformers
+license: apache-2.0
+language:
+  - dna
+tags:
+  - dna
+  - genomic
+  - transformers
+---
+# Carbon-8B
+A larger, higher-capacity member of the **Carbon** family of generative DNA foundation models.
+Carbon-8B is the 8B-parameter sibling of [Carbon-3B](https://huggingface.co/HuggingFaceBio/Carbon-3B). It is intended for users who can afford additional inference cost in exchange for stronger downstream performance. For the full design rationale, tokenizer specification, evaluation protocol, and usage details, please refer to the **[Carbon-3B model card](https://huggingface.co/HuggingFaceBio/Carbon-3B)** and the Carbon technical report — this card focuses only on what is specific to Carbon-8B.
+## Model Summary
+- **8B-parameter decoder-only autoregressive model** trained on DNA and RNA sequences with a primary focus on eukaryotes.
+- **Same hybrid tokenizer** as Carbon-3B (non-overlapping 6-mer for DNA + Qwen3 BPE for English text). Each DNA token encodes 6 bp. Wrap DNA inputs with `<dna>...</dna>` — see the Carbon-3B card for tokenizer details and usage caveats.
+- **Native context: 32,768 tokens (≈ 196 kbp).** Carbon-8B was extended with a long-context decay stage from an 8 k-context base, so it natively handles 32 k tokens. You can apply YaRN at 4× to extrapolate up to 128 k tokens (≈ 786 kbp).
+- Released as a standard Hugging Face causal LM (`LlamaForCausalLM`).
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+repo = "HuggingFaceBio/Carbon-8B"
+tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    repo, dtype=torch.bfloat16,
+).cuda().eval()
+prompt = "<dna>ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG"   # multiple of 6 bp
+inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
+out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
+```
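Note on the prompt format in the snippet above: the README states that each DNA token covers 6 bp, and the example prompt is annotated as a multiple of 6 bp. Below is a minimal sketch of preparing such a prompt. It assumes, without confirmation from this card, that trailing bases which do not fill a whole 6-mer are best trimmed before tokenization. `make_dna_prompt` and `count_dna_tokens` are hypothetical helpers, not part of the Carbon tokenizer.

```python
K = 6  # bases per DNA token, per the README and dna_config.json


def make_dna_prompt(seq: str, k: int = K) -> str:
    """Uppercase `seq`, trim it to a multiple of k bases, and open a <dna> span."""
    seq = seq.strip().upper()
    usable = len(seq) - (len(seq) % k)  # drop a trailing partial k-mer (assumption)
    return "<dna>" + seq[:usable]


def count_dna_tokens(prompt: str, k: int = K) -> int:
    """Number of k-mer tokens in the DNA part of a prompt built by make_dna_prompt."""
    return len(prompt.removeprefix("<dna>")) // k


# The README's 42 bp example with two extra bases appended to show the trimming.
raw = "ATGCGCTAGCTACGATCGATCGTAGCTAGCTAGCTAGCTACG" + "GT"
prompt = make_dna_prompt(raw)
print(prompt, count_dna_tokens(prompt))
```

How the actual tokenizer treats a partial trailing 6-mer is not shown in this diff, so the Carbon-3B card remains the reference for those caveats.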
+## Training
+Carbon-8B follows the Carbon-3B pre-training recipe on the **[`HuggingFaceBio/carbon-pretraining-corpus`](https://huggingface.co/datasets/HuggingFaceBio/carbon-pretraining-corpus)**, with the identical data mixture, for 1T DNA 6-mer tokens. The main recipe ingredients are:
+- **Learning-rate schedule:** cosine (instead of the WSD schedule used for Carbon-3B).
+- **Loss schedule:** after 100B tokens, the loss switches from cross-entropy to FNS loss for the remainder of training.
+- **Pre-training:** 1T 6-mer tokens (≈ 6T DNA base pairs) with a global batch size of 512 sequences of 8,192 tokens each (≈ 4.19 M tokens per step), on 32 nodes (TP=4, DP=64), in bfloat16 with AdamW. The training mixture stays the same through the decay phase: 70% Generator eukaryote data with metadata (with dropout), 16% mRNA, 4% splice mRNA, and 10% prokaryote data.
+- **Long-context extension stage:** after pre-training, Carbon-8B undergoes a long-context decay phase that extends the native context from 8,192 to 32,768 tokens (≈ 196 kbp). You can apply YaRN at 4× to further extrapolate to 128 k tokens (≈ 786 kbp).
+Training infrastructure, framework ([Megatron-LM-Carbon](https://github.com/huggingface/Megatron-LM-Carbon)), and conversion path ([Megatron-Bridge](https://github.com/NVIDIA/Megatron-Bridge)) are identical to Carbon-3B.
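The throughput figures in the Training section can be checked with a few lines of arithmetic. This sketch uses only numbers quoted in the README. It assumes that "1T" means 10^12 tokens and that TP × DP accounts for every GPU, with no pipeline or context parallelism; neither assumption is stated in the card.

```python
GLOBAL_BATCH_SIZE = 512            # sequences per optimizer step (GBS)
SEQ_LEN = 8192                     # tokens per sequence
TP, DP = 4, 64                     # tensor- and data-parallel sizes
NODES = 32
TOTAL_TOKENS = 1_000_000_000_000   # "1T" 6-mer tokens, taken as 10**12 (assumption)
BP_PER_TOKEN = 6

tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN
gpus = TP * DP                     # assumes TP x DP covers all GPUs
gpus_per_node = gpus // NODES
seqs_per_dp_rank = GLOBAL_BATCH_SIZE // DP
steps_for_budget = TOTAL_TOKENS / tokens_per_step
base_pairs = TOTAL_TOKENS * BP_PER_TOKEN

print(f"tokens per step:      {tokens_per_step:,}")
print(f"GPUs (per node):      {gpus} ({gpus_per_node})")
print(f"sequences per DP rank: {seqs_per_dp_rank}")
print(f"steps for 1T tokens:  {steps_for_budget:,.0f}")
print(f"base pairs seen:      {base_pairs:,}")
```

Under these assumptions, 512 × 8,192 gives 4,194,304 tokens per step, which matches the 4.19 M figure quoted in the README.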
+## Evaluation
+All evaluations are zero-shot and use the [public Carbon evaluation pipeline](https://github.com/huggingface/carbon/tree/main/evaluation). See the [Carbon-3B card](https://huggingface.co/HuggingFaceBio/Carbon-3B#evaluation) for the full task suite, metrics, and methodology.
+### Downstream tasks
+| Category | Metric (%) | Carbon 8B | Carbon 3B | Evo2 7B (1M) |
+|---|---|---|---|---|
+| Generative | SR eukaryote | **64.03** | <u>61.50</u> | 59.83 |
+| Variant effect prediction | BRCA2 AUROC | **85.60** | <u>84.64</u> | 83.52 |
+| | TraitGym Mendelian AUPRC by-chrom | <u>36.81</u> | 34.24 | **38.36** |
+| | ClinVar coding AUROC, 48 kb | <u>93.43</u> | 93.30 | **93.70** |
+| | ClinVar non-coding AUROC, 48 kb | **91.98** | <u>91.56</u> | 90.03 |
+| Perturbation | TATA v2 | <u>65.62</u> | **65.94** | 63.72 |
+| | SYN v2 | **92.18** | 82.78 | <u>84.92</u> |
+### Genome-NIAH (long-context retrieval)
+Genome-NIAH measures how well a DNA model actually *uses* its long context. See the [`hf-carbon/genome-niah` dataset card](https://huggingface.co/datasets/hf-carbon/genome-niah) for the benchmark design.
+| Context length         | Carbon 3B (native / YaRN 4×) | Carbon 8B (native / YaRN 4×) | Evo2 7B |
+|------------------------|------------------------------|------------------------------|---------|
+| 16 k tokens (98 kbp)   | 0.73 / 0.91                  | 0.78 / 0.89                  | 0.97    |
+| 32 k tokens (196 kbp)  | 0.55 / 0.90                  | 0.69 / 0.87                  | 0.95    |
+| 64 k tokens (393 kbp)  | — / 0.79                     | — / 0.86                     | 0.80    |
+| 128 k tokens (786 kbp) | — / 0.27                     | — / 0.65                     | *running* |
+Without YaRN, Carbon-8B scores 0.78 at 16 k tokens and 0.69 at its 32 k native boundary. **YaRN 4×** raises these to 0.89 and 0.87, holds 0.86 at 64 k tokens, and extends retrieval to 128 k tokens (≈ 786 kbp) with a score of 0.65.
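The kbp figures in the Genome-NIAH table follow from the 6 bp per token stated in the Model Summary. A small sketch of the conversion, assuming that "k tokens" means multiples of 1,024 and that kbp values are truncated rather than rounded (`tokens_to_kbp` is a hypothetical helper):

```python
BP_PER_TOKEN = 6  # each DNA token encodes one non-overlapping 6-mer


def tokens_to_kbp(tokens: int) -> int:
    """Convert a token count to whole kilobase pairs (truncated)."""
    return tokens * BP_PER_TOKEN // 1000


for k_tokens in (16, 32, 64, 128):
    tokens = k_tokens * 1024
    print(f"{k_tokens:>3} k tokens = {tokens:>7,} tokens ≈ {tokens_to_kbp(tokens)} kbp")
```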
+## Intended use
+Generative modelling, variant-effect prediction, motif-perturbation analysis, and long-context retrieval on DNA sequences. For faster inference at shorter contexts, use **Carbon-3B**.
+## License
+Apache 2.0.

added_tokens.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "</think>": 151668,
+  "</tool_call>": 151658,
+  "</tool_response>": 151666,
+  "<think>": 151667,
+  "<tool_call>": 151657,
+  "<tool_response>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

chat_template.jinja ADDED

	@@ -0,0 +1,85 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {{- messages[0].content + '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+        {%- set ns.multi_step_tool = false %}
+        {%- set ns.last_query_index = index %}
+    {%- endif %}
+{%- endfor %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- set content = message.content %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in message.content %}
+                {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
+                {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {%- if loop.last or (not loop.last and reasoning_content) %}
+                {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+            {%- else %}
+                {{- '<|im_start|>' + message.role + '\n' + content }}
+            {%- endif %}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '\n' + content }}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+    {%- if enable_thinking is defined and enable_thinking is false %}
+        {{- '<think>\n\n</think>\n\n' }}
+    {%- endif %}
+{%- endif %}
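For assistant messages that carry no separate `reasoning_content` field, the template above recovers the reasoning by splitting the message text on `</think>` and `<think>`. The Python below mirrors that fallback for illustration. `split_reasoning` is a hypothetical name, and the sketch covers only the string handling, not the rest of the template.

```python
def split_reasoning(message_content: str) -> tuple[str, str]:
    """Return (reasoning, answer), mirroring the template's '</think>' fallback."""
    content, reasoning = message_content, ""
    if "</think>" in message_content:
        # Everything after the last '</think>' is the visible answer.
        content = message_content.split("</think>")[-1].lstrip("\n")
        # Everything before the first '</think>', after the last '<think>', is reasoning.
        reasoning = (
            message_content.split("</think>")[0]
            .rstrip("\n")
            .split("<think>")[-1]
            .lstrip("\n")
        )
    return reasoning, content


print(split_reasoning("<think>\nGC-rich region\n</think>\n\nATGCGC"))
print(split_reasoning("ATGCGC"))
```

The template then re-emits the reasoning inside `<think>...</think>` only for assistant turns after the last user query; earlier turns are rendered with the answer alone.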

config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "max_position_embeddings": 32768,
+  "mlp_bias": false,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 8,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "rope_theta": 5000000.0,
+  "tie_word_embeddings": false,
+  "transformers_version": "4.57.6",
+  "use_cache": true,
+  "vocab_size": 155776
+}
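`config.json` ships with `"rope_scaling": null` and `"max_position_embeddings": 32768`, while the README mentions applying YaRN at 4× to reach 128 k tokens. The sketch below builds the config overrides one might try for that. The key names follow the `rope_scaling` convention used by Transformers for Llama-style models. The specific values, and whether they match what the Carbon authors used for their YaRN results, are assumptions. `yarn_overrides` is a hypothetical helper.

```python
def yarn_overrides(native_ctx: int = 32768, factor: float = 4.0) -> dict:
    """Config overrides for YaRN extrapolation from `native_ctx` by `factor`."""
    return {
        "rope_scaling": {
            "rope_type": "yarn",
            "factor": factor,
            "original_max_position_embeddings": native_ctx,
        },
        "max_position_embeddings": int(native_ctx * factor),
    }


overrides = yarn_overrides()
print(overrides)

# Possible use (untested here; needs the weights and a GPU):
#   from transformers import AutoConfig, AutoModelForCausalLM
#   config = AutoConfig.from_pretrained("HuggingFaceBio/Carbon-8B", **overrides)
#   model = AutoModelForCausalLM.from_pretrained("HuggingFaceBio/Carbon-8B", config=config)
```

The Carbon-3B card is the place to confirm the recommended YaRN settings before relying on these values.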

dna_config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "k": 6,
+  "dna_start_id": 151669,
+  "dna_vocab_size": 4107,
+  "dna_special_tokens": [
+    "<dna>",
+    "</dna>",
+    "<oov>"
+  ]
+}
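The numbers in `dna_config.json` line up with `config.json` and `added_tokens.json`, which a short consistency check makes visible. The sketch assumes a four-letter alphabet (A, C, G, T) for the 6-mers. What the remaining DNA-vocabulary slots hold is not shown in this diff, so they are only counted, not interpreted.

```python
K = 6
DNA_START_ID = 151669
DNA_VOCAB_SIZE = 4107
DNA_SPECIAL_TOKENS = ["<dna>", "</dna>", "<oov>"]
MODEL_VOCAB_SIZE = 155776     # "vocab_size" in config.json
LAST_ADDED_TEXT_ID = 151668   # "</think>", the highest ID in added_tokens.json

num_kmers = 4 ** K                              # assumes an A/C/G/T alphabet
dna_end_id = DNA_START_ID + DNA_VOCAB_SIZE      # exclusive upper bound
unaccounted = DNA_VOCAB_SIZE - num_kmers - len(DNA_SPECIAL_TOKENS)

print(f"possible 6-mers:          {num_kmers}")
print(f"DNA ID range:             [{DNA_START_ID}, {dna_end_id})")
print(f"starts right after text:  {DNA_START_ID == LAST_ADDED_TEXT_ID + 1}")
print(f"fills model vocabulary:   {dna_end_id == MODEL_VOCAB_SIZE}")
print(f"slots beyond k-mers+tags: {unaccounted}")
```

The DNA block starts immediately after the last text token and ends exactly at the model's `vocab_size`.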

generation_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "transformers_version": "4.57.6"
+}

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model-00001-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8b7517e3390bff6bbbb00bcf5ca809caa54300446789a58533aebf85b7f7d14a
+size 2467334984

model-00002-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b259414045cc70d190a424128e94041f44cca03bfbec7f2b59626e039d9fc95b
+size 2499909576

model-00003-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0059b3e8e3d8d824c2884fef90684ef4b279fbf466553f6b39a95bcd8a88f849
+size 2499909616

model-00004-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a5c18abc7fb7c435dd1340e189da1fbf6414654ad4acc9b0b653605811e6d5e0
+size 2416006472

model-00005-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:108ab1b5bf91fe27d811b96307438653238c2d409b10aea39253cfb1fc2c459c
+size 2499909632

model-00006-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:56b8285a88e4efb5bec0ff2e96fc3a94516a113764a0f40d0d9aca2f37396973
+size 2499909640

model-00007-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aeec8fcc37dc292661ab47fea3089783a3ca1a056b3b1c2932efc48b0e9b3836
+size 1628463872
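Each shard above is stored as a Git LFS pointer with three `key value` lines (`version`, `oid`, `size`). The sketch below parses one pointer and sums the seven shard sizes listed in this commit. `parse_lfs_pointer` is a hypothetical helper, and the parameter count is the `total_parameters` value from `model.safetensors.index.json`.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:aeec8fcc37dc292661ab47fea3089783a3ca1a056b3b1c2932efc48b0e9b3836
size 1628463872
"""
last_shard = parse_lfs_pointer(pointer)

# Sizes in bytes of shards 1..7, copied from the pointers in this commit.
shard_sizes = [
    2467334984, 2499909576, 2499909616, 2416006472,
    2499909632, 2499909640, 1628463872,
]
TOTAL_PARAMETERS = 8255705088  # from model.safetensors.index.json metadata

total_bytes = sum(shard_sizes)
print(last_shard["algo"], last_shard["size"])
print(f"total shard bytes:   {total_bytes:,}")
print(f"bytes per parameter: {total_bytes / TOTAL_PARAMETERS:.3f}")
```

The ratio comes out close to 2 bytes per parameter. The diff does not state which dtype the shards are stored in, so the figure is reported without drawing a conclusion from it.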

model.safetensors.index.json ADDED

	@@ -0,0 +1,299 @@

+{
+    "metadata": {
+        "total_parameters": 8255705088,
+        "total_size": 33022820352
+    },
+    "weight_map": {
+        "model.layers.10.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.22.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.16.input_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.30.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.30.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.6.input_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.7.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.12.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.13.input_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.19.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.10.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.30.input_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.21.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.15.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.18.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.22.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.12.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+        "model.norm.weight": "model-00007-of-00007.safetensors",
+        "model.layers.23.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.9.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.21.input_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.14.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.6.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.23.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.30.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.21.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.11.input_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.4.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.8.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.13.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.31.mlp.gate_proj.weight": "model-00007-of-00007.safetensors",
+        "model.layers.22.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.29.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.8.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.4.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.15.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.23.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.26.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.16.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.25.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.0.input_layernorm.weight": "model-00001-of-00007.safetensors",
+        "model.layers.11.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.8.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.24.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.16.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.27.input_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.31.mlp.up_proj.weight": "model-00007-of-00007.safetensors",
+        "model.layers.30.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.4.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.28.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.1.input_layernorm.weight": "model-00001-of-00007.safetensors",
+        "model.layers.2.input_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.24.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.23.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.15.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.24.input_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.24.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.28.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.13.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.28.input_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.5.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.12.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.28.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.20.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.3.input_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.7.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.7.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.31.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.5.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.4.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.10.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.17.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.4.input_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.15.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.18.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.27.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.29.input_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.30.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.14.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.18.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.27.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.22.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.19.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.13.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.17.input_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.6.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.10.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.11.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.11.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.14.input_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.23.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.4.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.20.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.17.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.22.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.15.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.9.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.24.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.3.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.23.input_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.20.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.3.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.9.input_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.embed_tokens.weight": "model-00001-of-00007.safetensors",
+        "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.10.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.8.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.3.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.0.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.28.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.15.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.11.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.19.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.11.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.21.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.10.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.13.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.10.input_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.26.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.2.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.14.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.18.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.22.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.24.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.7.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.9.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.17.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.27.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.2.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.18.input_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.27.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.9.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.25.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.5.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.20.input_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.13.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.24.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.17.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.5.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.26.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.19.input_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.5.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.31.mlp.down_proj.weight": "model-00007-of-00007.safetensors",
+        "model.layers.11.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.15.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.3.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.17.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.25.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.8.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.16.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.27.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.29.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.29.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.6.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.9.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.25.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.7.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.25.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.21.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.7.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.16.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.30.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.23.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+        "lm_head.weight": "model-00007-of-00007.safetensors",
+        "model.layers.27.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.25.input_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.6.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.25.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.12.input_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.5.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.1.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
+        "model.layers.8.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.3.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.19.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.18.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.16.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
+        "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.15.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.24.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.4.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.22.input_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.6.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.13.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.6.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.14.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.3.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.1.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.20.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.28.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.12.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.12.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.26.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.29.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.9.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.22.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.19.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.26.input_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.10.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.12.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.26.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.28.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.14.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.22.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.21.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.26.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.4.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.9.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.13.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.12.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.17.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.23.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.31.post_attention_layernorm.weight": "model-00007-of-00007.safetensors",
+        "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.30.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.3.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.29.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.28.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.16.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.7.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.3.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.11.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.5.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.18.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.23.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.29.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.20.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.7.input_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.19.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.4.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.31.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.16.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.18.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.20.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.17.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.5.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.9.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.8.input_layernorm.weight": "model-00003-of-00007.safetensors",
+        "model.layers.25.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.19.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.6.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.29.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.18.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.30.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
+        "model.layers.11.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.13.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.0.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.17.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.8.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.2.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.16.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.21.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
+        "model.layers.25.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.21.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.6.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.8.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.21.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.31.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.12.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.14.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.20.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.31.input_layernorm.weight": "model-00007-of-00007.safetensors",
+        "model.layers.28.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.5.input_layernorm.weight": "model-00002-of-00007.safetensors",
+        "model.layers.15.input_layernorm.weight": "model-00004-of-00007.safetensors",
+        "model.layers.7.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
+        "model.layers.29.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.24.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.19.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.10.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.20.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
+        "model.layers.26.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.26.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.31.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.27.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
+        "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
+        "model.layers.14.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
+        "model.layers.14.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
+        "model.layers.27.mlp.gate_proj.weight": "model-00006-of-00007.safetensors"
+    }
+}
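
The `weight_map` above assigns each parameter name to one of the seven shard files, and a layer's tensors can straddle two shards (layer 2 and layer 31 do in the entries shown). As a minimal sketch, assuming only the JSON layout visible in this diff, the map can be grouped by layer to see which shards a given layer needs. The helper name `shards_by_layer` and the `INDEX_EXCERPT` variable are hypothetical, and the excerpt holds just four entries copied from the listing.

```python
import json
import re
from collections import defaultdict

# A few entries copied from the weight_map above; the real index has many more.
INDEX_EXCERPT = json.loads("""
{
  "weight_map": {
    "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
    "model.layers.2.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
    "model.layers.31.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
    "model.layers.31.input_layernorm.weight": "model-00007-of-00007.safetensors"
  }
}
""")


def shards_by_layer(weight_map):
    """Return {layer_index: sorted list of shard files holding that layer's tensors}."""
    layers = defaultdict(set)
    for name, shard in weight_map.items():
        match = re.match(r"model\.layers\.(\d+)\.", name)
        if match:  # tensors outside model.layers.* are skipped
            layers[int(match.group(1))].add(shard)
    return {layer: sorted(files) for layer, files in layers.items()}


layer_shards = shards_by_layer(INDEX_EXCERPT["weight_map"])
print(layer_shards[2])
```

For a local checkout, the same function could be applied to the `weight_map` key of `model.safetensors.index.json` after loading it with `json.load`.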

special_tokens_map.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+size 11422654
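
The three lines above are a Git LFS pointer, not the tokenizer JSON itself: the real file is fetched by its SHA-256 object ID. A small sketch of reading such a pointer follows, assuming the space-separated `key value` layout shown in this diff. The function name `parse_lfs_pointer` and the returned dictionary keys are illustrative choices, not part of any library.

```python
# Pointer text copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["size"])
```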

tokenizer.py ADDED

	@@ -0,0 +1,583 @@

+"""
+HybridDNATokenizer: Combines Qwen3 BPE tokenization with DNA 6-mer tokenization.
+DNA sequences wrapped in <dna>...</dna> tags are tokenized as 6-mers.
+All other text uses Qwen3's BPE tokenization.
+Supports token_mask for Fine-grained Nucleotide Supervision (FNS):
+  -2: padding token
+  -1: text token (BPE)
+   0: DNA special token (<dna>, </dna>, <oov>)
+  1-5: partial 6-mer token with valid_length real bases at positions
+       [0, valid_length), right-padded with 'A' at positions [valid_length, k),
+       so the loss can supervise positions 0..valid_length-1 via
+       pos_mask = (valid_len > pos), where valid_len is this mask value
+   6: full 6-mer
+"""
+import os
+import json
+import itertools
+from typing import List, Optional, Tuple, Dict, Union, Any
+from transformers import PreTrainedTokenizer, AutoTokenizer, BatchEncoding
+class HybridDNATokenizer(PreTrainedTokenizer):
+    """
+    Hybrid tokenizer combining Qwen3 BPE with DNA 6-mer tokenization.
+    DNA regions must be wrapped in <dna>...</dna> tags to be tokenized as 6-mers.
+    Without tags, DNA sequences are tokenized as regular BPE text.
+    For pure-DNA input (no metadata tokens), pass auto_dna_tags=True to have
+    <dna>...</dna> tags added automatically when they are absent.  Do NOT set
+    this if the input may contain BPE metadata such as species tags
+    (<fungi_species> etc.) — those must appear outside <dna>...</dna> and would
+    be incorrectly k-mer encoded if auto-wrapping fired.
+    """
+    model_input_names = ["input_ids", "attention_mask"]
+    def __init__(
+        self,
+        base_tokenizer_path: Optional[str] = None,
+        k: int = 6,
+        auto_dna_tags: bool = False,
+        **kwargs
+    ):
+        self.k = k
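+        # Load base tokenizer (Qwen3-4B-Base). Note: base_tokenizer_path is
+        # currently unused; the base tokenizer is always loaded from this Hub ID.
+        self._base_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Base")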
+        # Get base vocabulary
+        self._base_vocab = self._base_tokenizer.get_vocab()
+        self._base_vocab_size = len(self._base_vocab)
+        # Initialize DNA vocabulary
+        self._init_dna_vocab()
+        # Build combined vocabulary
+        self._build_combined_vocab()
+        # Set special tokens
+        self._eos_token = kwargs.pop('eos_token', None) or "<|endoftext|>"
+        self._pad_token = kwargs.pop('pad_token', None) or self._base_tokenizer.pad_token or "<|endoftext|>"
+        # Initialize parent class
+        super().__init__(
+            eos_token=self._eos_token,
+            pad_token=self._pad_token,
+            **kwargs
+        )
+        self.special_tokens = self.dna_special_tokens + [self._eos_token, self._pad_token]
+        self.auto_dna_tags = auto_dna_tags
+    def _init_dna_vocab(self):
+        """Initialize DNA vocabulary (special tokens + k-mers + padding for 128 alignment)."""
+        bases = ['A', 'T', 'C', 'G']
+        # DNA special tokens
+        self.dna_special_tokens = ["<dna>", "</dna>", "<oov>"]
+        # Generate all k-mer combinations (4^k = 4096 for k=6)
+        self.kmers = [''.join(kmer) for kmer in itertools.product(bases, repeat=self.k)]
+        # DNA tokens start after base vocabulary
+        self.dna_start_id = self._base_vocab_size
+        # All DNA tokens get new IDs (no reuse of base vocab IDs, even for
+        # overlapping tokens like CCCCCC — they have different semantics in
+        # DNA context vs BPE context, per Qiuyi's recommendation)
+        base_dna_tokens = self.dna_special_tokens + self.kmers
+        # Calculate padding for 128 alignment
+        total_vocab_unpadded = self._base_vocab_size + len(base_dna_tokens)
+        target_vocab_size = ((total_vocab_unpadded + 127) // 128) * 128
+        num_padding_tokens = target_vocab_size - total_vocab_unpadded
+        # Add unused padding tokens
+        self.padding_tokens = [f"<unused_{i}>" for i in range(num_padding_tokens)]
+        # Create DNA token mappings — all get sequential new IDs
+        self.dna_token_to_id = {}
+        self.dna_id_to_token = {}
+        current_id = self.dna_start_id
+        for token in base_dna_tokens:
+            self.dna_token_to_id[token] = current_id
+            self.dna_id_to_token[current_id] = token
+            current_id += 1
+        # Add padding tokens
+        for token in self.padding_tokens:
+            self.dna_token_to_id[token] = current_id
+            self.dna_id_to_token[current_id] = token
+            current_id += 1
+        self.dna_vocab_size = len(base_dna_tokens) + len(self.padding_tokens)
+        # Set DNA special token IDs
+        self.dna_begin_token_id = self.dna_token_to_id["<dna>"]
+        self.dna_end_token_id = self.dna_token_to_id["</dna>"]
+        self.oov_token_id = self.dna_token_to_id["<oov>"]
+    def _build_combined_vocab(self):
+        """Build combined vocabulary (base + DNA)."""
+        self._vocab = self._base_vocab.copy()
+        for token, token_id in self.dna_token_to_id.items():
+            if token not in self._vocab:
+                self._vocab[token] = token_id
+        self._id_to_token = {v: k for k, v in self._vocab.items()}
+        for token_id, token in self.dna_id_to_token.items():
+            if token_id not in self._id_to_token:
+                self._id_to_token[token_id] = token
+    @property
+    def vocab_size(self) -> int:
+        return max(self._vocab.values()) + 1
+    def get_vocab(self) -> Dict[str, int]:
+        return self._vocab.copy()
+    def __len__(self):
+        # Override default (len(get_vocab())) because get_vocab() deduplicates
+        # CCCCCC which exists as both BPE (ID 91443) and DNA 6-mer (ID 154402).
+        return self.vocab_size
+    def _split_by_dna_tags(self, text: str) -> List[Tuple[str, bool]]:
+        """Split text into (segment, is_dna) pairs; DNA segments keep their <dna>/</dna> tags."""
+        segments = []
+        i = 0
+        n = len(text)
+        while i < n:
+            start_pos = text.find('<dna>', i)
+            end_pos = text.find('</dna>', i)
+            if start_pos == -1 and end_pos == -1:
+                remaining = text[i:]
+                if remaining:
+                    segments.append((remaining, False))
+                break
+            if start_pos == -1 and end_pos != -1:
+                dna_region = text[i:end_pos + 6]
+                if dna_region:
+                    segments.append((dna_region, True))
+                i = end_pos + 6
+                continue
+            if start_pos != -1 and end_pos == -1:
+                if i < start_pos:
+                    normal_text = text[i:start_pos]
+                    if normal_text:
+                        segments.append((normal_text, False))
+                dna_region = text[start_pos:]
+                if dna_region:
+                    segments.append((dna_region, True))
+                break
+            if start_pos < end_pos:
+                if i < start_pos:
+                    normal_text = text[i:start_pos]
+                    if normal_text:
+                        segments.append((normal_text, False))
+                dna_region = text[start_pos:end_pos + 6]
+                if dna_region:
+                    segments.append((dna_region, True))
+                i = end_pos + 6
+            else:
+                dna_region = text[i:end_pos + 6]
+                if dna_region:
+                    segments.append((dna_region, True))
+                i = end_pos + 6
+        return segments
+    def _parse_dna_region(self, dna_region: str) -> Tuple[str, bool, bool]:
+        """Strip <dna>/</dna> tags and return (content, has_start_tag, has_end_tag)."""
+        if dna_region == '<dna>':
+            return '', True, False
+        elif dna_region == '</dna>':
+            return '', False, True
+        has_start = dna_region.startswith('<dna>')
+        has_end = dna_region.endswith('</dna>')
+        content = dna_region
+        if has_start:
+            content = content[5:]
+        if has_end and content.endswith('</dna>'):
+            content = content[:-6]
+        return content.strip(), has_start, has_end
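+    def _process_dna_sequence(self, dna_seq: str) -> Dict:
+        """Chunk DNA into non-overlapping k-mers; pad a trailing partial chunk with 'A', map invalid chunks to <oov>."""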
+        k = self.k
+        dna_seq = dna_seq.upper()
+        kmer_tokens = []
+        valid_bases = set('ATCG')
+        def is_valid_kmer(kmer):
+            return len(kmer) == k and all(base in valid_bases for base in kmer)
+        for i in range(0, len(dna_seq) - k + 1, k):
+            kmer = dna_seq[i:i+k]
+            if is_valid_kmer(kmer):
+                kmer_tokens.append(kmer)
+            else:
+                kmer_tokens.append("<oov>")
+        processed_length = len(kmer_tokens) * k
+        remaining = dna_seq[processed_length:]
+        padding_length = 0
+        valid_length = k
+        if remaining:
+            padding_needed = k - len(remaining)
+            # Right-pad with A: real bases occupy positions [0, valid_length).
+            # The hybrid BP loss supervises positions 0..valid_len-1 via
+            #   pos_mask = (valid_len > pos)
+            # so padding must be at the END, not the start.
+            padded = remaining + 'A' * padding_needed
+            if is_valid_kmer(padded):
+                kmer_tokens.append(padded)
+            else:
+                kmer_tokens.append("<oov>")
+            padding_length = padding_needed
+            valid_length = len(remaining)
+        return {
+            "kmer_tokens": kmer_tokens,
+            "padding_length": padding_length,
+            "valid_length": valid_length,
+        }
+    def _tokenize(self, text: str, **kwargs) -> List[str]:
+        # Character-level fallback; encode() below bypasses this method.
+        return list(text)
+    def _convert_token_to_id(self, token: str) -> int:
+        if token in self.dna_token_to_id:
+            return self.dna_token_to_id[token]
+        return self._base_vocab.get(token, self._base_tokenizer.unk_token_id or 0)
+    def _convert_id_to_token(self, index: int) -> str:
+        if index in self.dna_id_to_token:
+            return self.dna_id_to_token[index]
+        return self._id_to_token.get(index, "<oov>")
+    def convert_tokens_to_string(self, tokens: List[str]) -> str:
+        return "".join(tokens)
+    def encode(
+        self,
+        text: str,
+        add_special_tokens: bool = False,
+        return_token_mask: bool = False,
+        auto_dna_tags: Optional[bool] = None,
+        **kwargs
+    ) -> Union[List[int], Tuple[List[int], List[int]]]:
+        use_auto = self.auto_dna_tags if auto_dna_tags is None else auto_dna_tags
+        if use_auto and '<dna>' not in text:
+            text = f'<dna>{text}</dna>'
+        segments = self._split_by_dna_tags(text)
+        token_ids = []
+        token_mask = [] if return_token_mask else None
+        for segment_content, is_dna in segments:
+            if is_dna:
+                dna_content, has_start, has_end = self._parse_dna_region(segment_content)
+                if has_start:
+                    token_ids.append(self.dna_begin_token_id)
+                    if return_token_mask:
+                        token_mask.append(0)
+                if dna_content:
+                    result = self._process_dna_sequence(dna_content)
+                    for idx, kmer in enumerate(result["kmer_tokens"]):
+                        token_id = self.dna_token_to_id.get(kmer, self.oov_token_id)
+                        token_ids.append(token_id)
+                        if return_token_mask:
+                            if kmer == "<oov>":
+                                token_mask.append(0)
+                            elif idx == len(result["kmer_tokens"]) - 1 and result["padding_length"] > 0:
+                                token_mask.append(result["valid_length"])
+                            else:
+                                token_mask.append(self.k)
+                if has_end:
+                    token_ids.append(self.dna_end_token_id)
+                    if return_token_mask:
+                        token_mask.append(0)
+            else:
+                base_ids = self._base_tokenizer.encode(
+                    segment_content,
+                    add_special_tokens=False
+                )
+                token_ids.extend(base_ids)
+                if return_token_mask:
+                    token_mask.extend([-1] * len(base_ids))
+        # Do NOT append EOS when add_special_tokens=True. Qwen3 doesn't add
+        # BOS/EOS either, and appending EOS here breaks lighteval's
+        # tok_encode_pair: it relies on
+        #   len(encode(ctx)) + len(encode(answer)) == len(encode(ctx + answer))
+        # which the extra EOS violates by shifting the split by 1.
+        if return_token_mask:
+            return token_ids, token_mask
+        return token_ids
+    def decode(
+        self,
+        token_ids: Union[int, List[int]],
+        skip_special_tokens: bool = False,
+        **kwargs
+    ) -> str:
+        if isinstance(token_ids, int):
+            token_ids = [token_ids]
+        if skip_special_tokens:
+            special_ids = {self.eos_token_id, self.pad_token_id}
+            token_ids = [tid for tid in token_ids if tid not in special_ids]
+        parts = []
+        i = 0
+        while i < len(token_ids):
+            tid = token_ids[i]
+            if tid == self.dna_begin_token_id:
+                dna_tokens = []
+                i += 1
+                while i < len(token_ids) and token_ids[i] != self.dna_end_token_id:
+                    if token_ids[i] in self.dna_id_to_token:
+                        dna_tokens.append(self.dna_id_to_token[token_ids[i]])
+                    i += 1
+                dna_seq = ''.join(dna_tokens)
+                if skip_special_tokens:
+                    parts.append(dna_seq)
+                else:
+                    parts.append(f"<dna>{dna_seq}")
+                    if i < len(token_ids) and token_ids[i] == self.dna_end_token_id:
+                        parts.append("</dna>")
+                        i += 1
+            elif tid in self.dna_id_to_token:
+                # This branch handles k-mer tokens that appear without a <dna>
+                # wrapper — the common generation case where <dna> was in the
+                # prompt but only the generated portion is being decoded.
+                # K-mer tokens are content, not special tokens, so always decode
+                # them.  Only drop true DNA special tokens (<dna>, </dna>, <oov>)
+                # when skip_special_tokens=True.
+                is_dna_special = tid in (self.dna_begin_token_id, self.dna_end_token_id, self.oov_token_id)
+                if not (skip_special_tokens and is_dna_special):
+                    parts.append(self.dna_id_to_token[tid])
+                i += 1
+            else:
+                text_ids = []
+                while i < len(token_ids):
+                    curr_id = token_ids[i]
+                    if curr_id in self.dna_id_to_token or curr_id == self.dna_begin_token_id:
+                        break
+                    text_ids.append(curr_id)
+                    i += 1
+                if text_ids:
+                    decoded = self._base_tokenizer.decode(text_ids, skip_special_tokens=skip_special_tokens)
+                    parts.append(decoded)
+        return ''.join(parts)
+    def batch_decode(
+        self,
+        sequences: Union[List[int], List[List[int]], "torch.Tensor"],
+        skip_special_tokens: bool = False,
+        **kwargs
+    ) -> List[str]:
+        return [
+            self.decode(
+                seq.tolist() if hasattr(seq, 'tolist') else list(seq),
+                skip_special_tokens=skip_special_tokens,
+                **kwargs
+            )
+            for seq in sequences
+        ]
+    def __call__(
+        self,
+        text: Union[str, List[str]],
+        add_special_tokens: bool = False,
+        padding: bool = False,
+        truncation: bool = False,
+        max_length: Optional[int] = None,
+        return_tensors: Optional[str] = None,
+        return_token_mask: bool = False,
+        auto_dna_tags: Optional[bool] = None,
+        **kwargs
+    ) -> Dict[str, Any]:
+        is_batch = isinstance(text, list)
+        texts = text if is_batch else [text]
+        all_ids = []
+        all_masks = [] if return_token_mask else None
+        for t in texts:
+            if return_token_mask:
+                ids, mask = self.encode(t, add_special_tokens=add_special_tokens, return_token_mask=True, auto_dna_tags=auto_dna_tags)
+                all_ids.append(ids)
+                all_masks.append(mask)
+            else:
+                ids = self.encode(t, add_special_tokens=add_special_tokens, return_token_mask=False, auto_dna_tags=auto_dna_tags)
+                all_ids.append(ids)
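+        # Honor truncation even when padding is disabled.
+        if truncation and max_length is not None:
+            all_ids = [ids[:max_length] for ids in all_ids]
+            if return_token_mask:
+                all_masks = [mask[:max_length] for mask in all_masks]
+        if padding: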
+            max_len = max(len(ids) for ids in all_ids)
+            if max_length:
+                max_len = min(max_len, max_length)
+            padded_ids = []
+            attention_masks = []
+            padded_token_masks = [] if return_token_mask else None
+            for idx, ids in enumerate(all_ids):
+                pad_len = max_len - len(ids)
+                if pad_len > 0:
+                    ids = ids + [self.pad_token_id] * pad_len
+                    attn = [1] * (max_len - pad_len) + [0] * pad_len
+                    if return_token_mask:
+                        mask = all_masks[idx] + [-2] * pad_len
+                else:
+                    ids = ids[:max_len]
+                    attn = [1] * max_len
+                    if return_token_mask:
+                        mask = all_masks[idx][:max_len]
+                padded_ids.append(ids)
+                attention_masks.append(attn)
+                if return_token_mask:
+                    padded_token_masks.append(mask)
+            all_ids = padded_ids
+            all_masks = padded_token_masks
+        else:
+            attention_masks = [[1] * len(ids) for ids in all_ids]
+        result = {
+            "input_ids": all_ids if is_batch else all_ids[0],
+            "attention_mask": attention_masks if is_batch else attention_masks[0],
+        }
+        if return_token_mask:
+            result["token_mask"] = all_masks if is_batch else all_masks[0]
+        if return_tensors == "pt":
+            import torch
+            if is_batch:
+                result["input_ids"] = torch.tensor(result["input_ids"])
+                result["attention_mask"] = torch.tensor(result["attention_mask"])
+                if return_token_mask:
+                    result["token_mask"] = torch.tensor(result["token_mask"])
+            else:
+                result["input_ids"] = torch.tensor([result["input_ids"]])
+                result["attention_mask"] = torch.tensor([result["attention_mask"]])
+                if return_token_mask:
+                    result["token_mask"] = torch.tensor([result["token_mask"]])
+        return BatchEncoding(result, tensor_type=return_tensors)
+    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        vocab_file = os.path.join(
+            save_directory,
+            (filename_prefix + "-" if filename_prefix else "") + "vocab.json"
+        )
+        with open(vocab_file, "w", encoding="utf-8") as f:
+            json.dump(self._vocab, f, ensure_ascii=False, indent=2)
+        return (vocab_file,)
+    def save_pretrained(self, save_directory: str, **kwargs):
+        os.makedirs(save_directory, exist_ok=True)
+        # Save base tokenizer files
+        self._base_tokenizer.save_pretrained(save_directory)
+        # Save DNA config
+        dna_config = {
+            "k": self.k,
+            "dna_start_id": self.dna_start_id,
+            "dna_vocab_size": self.dna_vocab_size,
+            "dna_special_tokens": self.dna_special_tokens,
+            "auto_dna_tags": self.auto_dna_tags,
+        }
+        dna_config_path = os.path.join(save_directory, "dna_config.json")
+        with open(dna_config_path, "w", encoding="utf-8") as f:
+            json.dump(dna_config, f, indent=2)
+        # Update tokenizer_config.json with auto_map
+        config_path = os.path.join(save_directory, "tokenizer_config.json")
+        if os.path.exists(config_path):
+            with open(config_path, "r", encoding="utf-8") as f:
+                config = json.load(f)
+        else:
+            config = {}
+        config.update({
+            "tokenizer_class": "HybridDNATokenizer",
+            "auto_map": {
+                "AutoTokenizer": ["tokenizer.HybridDNATokenizer", None]
+            },
+            "k": self.k,
+            "auto_dna_tags": self.auto_dna_tags,
+        })
+        with open(config_path, "w", encoding="utf-8") as f:
+            json.dump(config, f, indent=2, ensure_ascii=False)
+        # Copy this tokenizer.py to save directory
+        import shutil
+        src_py = os.path.abspath(__file__)
+        dst_py = os.path.join(save_directory, "tokenizer.py")
+        if os.path.exists(src_py) and src_py != dst_py:
+            shutil.copy2(src_py, dst_py)
+        return (save_directory,)
+    @classmethod
+    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
+        k = 6
+        auto_dna_tags = False
+        dna_config_path = os.path.join(pretrained_model_name_or_path, "dna_config.json")
+        tok_config_path = os.path.join(pretrained_model_name_or_path, "tokenizer_config.json")
+        if os.path.exists(dna_config_path):
+            with open(dna_config_path, "r", encoding="utf-8") as f:
+                dna_config = json.load(f)
+            k = dna_config.get("k", 6)
+            auto_dna_tags = dna_config.get("auto_dna_tags", False)
+        elif os.path.exists(tok_config_path):
+            with open(tok_config_path, "r", encoding="utf-8") as f:
+                tok_config = json.load(f)
+            k = tok_config.get("k", 6)
+            auto_dna_tags = tok_config.get("auto_dna_tags", False)
+        return cls(base_tokenizer_path=pretrained_model_name_or_path, k=k, auto_dna_tags=auto_dna_tags, **kwargs)
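
Two pieces of arithmetic in `tokenizer.py` are easier to follow with concrete numbers: the non-overlapping 6-mer chunking with right-padding (and the matching `token_mask` values), and the rounding of the vocabulary size up to a multiple of 128. The sketch below is a standalone re-implementation for illustration only; it does not import the tokenizer. The names `chunk_kmers` and `padded_vocab_size` are hypothetical, and the base vocabulary size of 151669 is an assumption inferred from the `CCCCCC` ID (154402) quoted in the `__len__` comment, not a value stated in the file.

```python
from typing import List, Tuple

VALID_BASES = set("ATCG")


def chunk_kmers(dna_seq: str, k: int = 6) -> Tuple[List[str], List[int]]:
    """Return (tokens, token_mask) for a DNA string.

    Mask values follow the docstring legend: k for a full k-mer,
    1..k-1 for a right-padded partial k-mer, 0 for <oov>.
    """
    seq = dna_seq.upper()
    tokens, mask = [], []
    for start in range(0, len(seq), k):
        chunk = seq[start:start + k]
        valid_length = len(chunk)
        # Pad at the END so real bases stay at positions [0, valid_length).
        padded = chunk + "A" * (k - valid_length)
        if all(base in VALID_BASES for base in padded):
            tokens.append(padded)
            mask.append(valid_length)
        else:
            tokens.append("<oov>")
            mask.append(0)
    return tokens, mask


def padded_vocab_size(base_vocab_size: int, k: int = 6, multiple: int = 128) -> int:
    """Base vocab + 3 DNA special tokens + 4**k k-mers, rounded up to `multiple`."""
    unpadded = base_vocab_size + 3 + 4 ** k
    return ((unpadded + multiple - 1) // multiple) * multiple


tokens, mask = chunk_kmers("ACGTACGTAC")
print(tokens)  # ['ACGTAC', 'GTACAA']
print(mask)    # [6, 4]

# Assumed base vocabulary size (see lead-in); 151669 + 3 + 4096 = 155768.
print(padded_vocab_size(151669))  # 155776
```

Because the padding base is a real nucleotide, decoding the last token yields `GTACAA`; only the mask value 4 records that the final two bases are filler.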

tokenizer_config.json ADDED

	@@ -0,0 +1,247 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "HybridDNATokenizer",
+  "unk_token": null,
+  "auto_map": {
+    "AutoTokenizer": [
+      "tokenizer.HybridDNATokenizer",
+      null
+    ]
+  },
+  "k": 6,
+  "auto_dna_tags": false
+}
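In the config above, token IDs 151643 through 151656 carry `"special": true`, while 151657 through 151668 (the tool, FIM, repo, and think markers) carry `"special": false`. The sketch below shows one way to separate the two groups when inspecting such a file. It is an illustration only: `split_added_tokens` is a hypothetical helper, and `CONFIG_EXCERPT` is a three-entry excerpt with the same shape as the file, not the full config.

```python
import json

# Three entries copied from the config above; the real file has 26.
CONFIG_EXCERPT = """
{
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "special": true},
    "151657": {"content": "<tool_call>", "special": false},
    "151667": {"content": "<think>", "special": false}
  },
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
}
"""


def split_added_tokens(config):
    """Hypothetical helper: split added tokens into special and regular maps.

    Keys in added_tokens_decoder are string token IDs, so they are converted
    to int for easier comparison with model output IDs.
    """
    special, regular = {}, {}
    for token_id, entry in config["added_tokens_decoder"].items():
        target = special if entry.get("special", False) else regular
        target[int(token_id)] = entry["content"]
    return special, regular


config = json.loads(CONFIG_EXCERPT)
special, regular = split_added_tokens(config)
```

Note that `eos_token` and `pad_token` are both `<|endoftext|>` in this config, so code that masks padding by token ID would also mask genuine end-of-sequence tokens unless it uses an attention mask instead.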

vocab.json ADDED

The diff for this file is too large to render. See raw diff