Taykhoom committed on
Commit
d94efee
·
verified ·
1 Parent(s): 23dd20e

Upload folder using huggingface_hub

LICENSE ADDED
@@ -0,0 +1,50 @@
+ GENBIO AI COMMUNITY LICENSE AGREEMENT
+
+ This GenBio AI Community License Agreement (the “License”) constitutes an agreement between you or the legal entity you represent (“you” or “your”) and GENBIO.AI, INC. (“GenBio”), governing your use of the GenBio Materials. If you are using the GenBio Materials on behalf of a legal entity, you represent and warrant to GenBio that you have full legal authority to act on behalf of that legal entity as applicable under the License. If you do not have the authority to accept this License or if you disagree with any or all of the License, you shall not use the GenBio Materials in any manner. By using or distributing any portion or element of the GenBio Materials, you imply your agreement to be bound by the License.
+
+ “GenBio Materials” means any datasets, code, model weights or any other materials provided by GenBio at the following GitHub Page https://github.com/genbio-ai or Hugging Face Page https://huggingface.co/genbio-ai, including any updates or modifications made from time to time, whether in Source or Object form, and is made available to you under this License.
+
+
+ 1. License Grant.
+ 1.1 License Scope. Subject to the terms of this License, GenBio grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under GenBio’s intellectual property or other rights owned by GenBio embodied in the GenBio Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the GenBio Materials for any Non-Commercial Purposes.
+ 1.2 Use Restrictions. Restricted activities in relation to the License or use of GenBio Materials include:
+ 1.2.1 You shall use the GenBio Materials, Contributions, Derivative Works, Outputs and Output Derivatives (as defined below) solely for Non-Commercial Purposes;
+ 1.2.2 You shall not, directly or indirectly: (a) use or provide access to any Outputs or Output Derivatives to train, optimize, improve, or otherwise enhance the functionality or performance of any machine learning models or related technologies that are similar to the GenBio Materials; (b) engage in any form of model distillation or other methods that would achieve the purposes described in subsection (a) above. Notwithstanding the foregoing, you may use Outputs and Output Derivatives to train, optimize, improve, or enhance the functionality or performance of: (i) The GenBio Materials itself; and (ii) downstream Derivative Works of the GenBio Materials;
+ 1.2.3 Your use of the GenBio Materials shall be subject to any additional terms and conditions that: (a) GenBio provides to you separately; or (b) GenBio otherwise makes available to you.
+
+ 2. Sharing and Distribution.
+ 2.1 Subject to Section 1, if you distribute or make available the GenBio Materials or a Derivative Work to a third party for your Non-Commercial Purposes, in Source or Object form, you shall:
+ 2.1.1 provide a copy of this License to that third party;
+ 2.1.2 retain the following attribution notice within a “Notice” text file distributed as a part of such copies: “This is licensed under the GenBio AI Community License Agreement, Copyright © GENBIO.AI, INC. All Rights Reserved”; and
+ 2.1.3 prominently display “Powered by GenBio AI” on a related website, user interface, blogpost, about page, or product documentation.
+ 2.2 If You create a Derivative Work, you may add your own attribution notice(s) to the “Notice” text file included with that Derivative Work, provided that you clearly indicate which attributions apply to the GenBio Materials and state in the “Notice” text file that you changed the GenBio Materials and how it was modified.
+
+ 3. Submission of Contribution.
+ Unless you explicitly state otherwise, any Contribution intentionally submitted for inclusion in the GenBio Materials by you to GenBio shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with GenBio regarding such Contributions.
+
+ 4. Export Control.
+ You shall comply with the applicable U.S. Foreign Corrupt Practices Act and all applicable export laws, restrictions and regulations of the U.S. Department of Commerce, and any other applicable U.S. and foreign authority.
+
+ 5. Disclaimer of Warranty.
+ GENBIO MATERIALS PROVIDED BY GENBIO OR ANY OUTPUT YOU RECEIVED ARE PROVIDED “AS IS.” EXCEPT TO THE EXTENT PROHIBITED BY LAW. GENBIO MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, WHETHER EXPRESS, IMPLIED OR OTHERWISE, REGARDING THE ACCURACY, COMPLETENESS OR PERFORMANCE OF THE SERVICES AND YOUR OUTPUT, OR WITH RESPECT TO SATISFACTORY QUALITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.
+
+ 6. Limitation of Liability.
+ In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the GenBio Materials (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
+
+ 7. General Terms.
+ 7.1 Relationship of Parties. You and GenBio are independent contractors, and nothing herein shall be deemed to constitute either party as the agent or representative of the other or both parties as joint venturers or partners for any purpose.
+ 7.2 Assignment. This License and the rights and obligations herein may not be assigned or transferred, in whole or in part, by You without the prior written consent of GenBio. Any assignment in violation of this provision is void. GenBio may freely assign or transfer this License, in whole or in part. This License shall be binding upon, and inure to the benefit of, the successors and permitted assigns of the parties.
+ 7.3 Governing Law. This License shall be governed, construed and interpreted in accordance with the laws of the State of California, without giving effect to principles of conflicts of law. Each of the parties to this License consents to the exclusive jurisdiction and venue of the courts of the state and federal courts of California.
+ 7.4 Severability. If any provision of this License is held to be invalid, illegal or unenforceable in any respect, that provision shall be limited or eliminated to the minimum extent necessary so that this License otherwise remains in full force and effect and enforceable.
+
+ 8. Definitions.
+ 8.1 “Commercial Entity” means any entity engaged in any activity intended for or directed toward commercial advantage or monetary compensation, including, without limitation, the development of any product or service intended to be sold or made available for a fee. For the purpose of this License, references to a Commercial Entity expressly exclude any universities, non-profit organizations, not-for-profit entities, research institutes and educational and government bodies.
+ 8.2 “Contribution” means any work of authorship, including the original version of the GenBio Materials and any modifications or additions to that GenBio Materials or Derivative Works thereof, that is intentionally submitted to GenBio for inclusion in the GenBio Materials by the copyright owner or by an individual or legal entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to GenBio or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, GenBio for the purpose of discussing and improving the GenBio Materials, but excluding Outputs and all communications that are conspicuously marked or otherwise designated in writing by the copyright owner as “Not a Contribution”.
+ 8.3 “Contributor” means GenBio and any individual or legal entity on behalf of whom a Contribution has been received by GenBio and subsequently incorporated within the GenBio Materials.
+ 8.4 “Derivative Work” means any work, whether in Source or Object form, that is based on (or derived from) the GenBio Materials and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the GenBio Materials and Derivative Works thereof.
+ 8.5 “Non-Commercial Purposes” means uses not intended for or directed toward commercial advantage or monetary compensation, or the facilitation of development of any product or service to be sold or made available for a fee. For the avoidance of doubt, the provision of Outputs as a service is not a Non-Commercial Purpose.
+ 8.6 “Object” means any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
+ 8.7 “Output” means any output, including any protein sequence, structure prediction, functional annotation, molecule, descriptions of a molecule, model, sequence, text, and/or image that is elicited directly or indirectly by, or otherwise made available to, you in connection with your use of the GenBio Materials, including, but not limited to, the use of AI-Powered Technology. For the avoidance of doubt, it includes any intermediate results, such as activations across model layers, intermediate outputs from model layers (e.g., attention maps), as well as gradients and embeddings produced by the GenBio Materials.
+ 8.8 “Output Derivatives” means any enhancements, modifications and derivative works of Outputs (including, but not limited to, any derivative sequences or molecules).
+ 8.9 “Source” means the preferred form for making modifications, including but not limited to GenBio Materials source code, documentation source, and configuration files.
+
README.md ADDED
@@ -0,0 +1,142 @@
+ ---
+ language:
+ - rna
+ library_name: transformers
+ tags:
+ - RNA
+ - language-model
+ license: other
+ ---
+
+ # AIDO.RNA-650M-CDS
+
+ 650M-parameter model fine-tuned on coding sequences (CDS). This is a standalone HuggingFace port that loads without the
+ [ModelGenerator](https://github.com/genbio-ai/ModelGenerator) package.
+
+ ## Architecture
+
+ | Parameter | Value |
+ |---|---|
+ | Layers | 33 |
+ | Attention heads | 20 |
+ | Embedding dimension | 1280 |
+ | Intermediate (MLP) size | 3392 |
+ | Vocabulary size | 16 |
+ | Positional encoding | RoPE (rotary_percent=1.0) |
+ | Normalization | LayerNorm |
+ | MLP activation | SwiGLU |
+ | Architecture | Pre-LN Transformer |
+ | Max sequence length | 1024 (training truncation; RoPE has no hard limit) |
+
+ **Vocabulary:** `[PAD]`, `[MASK]`, `[CLS]`, `[SEP]`, `[UNK]`, `A`, `G`, `C`, `T`, `U`, `N`,
+ `[BOS]`, `[EOS]`, `[UNUSED1]`, `[UNUSED2]`, `[UNUSED3]`
+
+ ## Pretraining
+
+ - **Objective:** Masked language modeling (MLM) on RNA sequences
+ - **Data:** 42M ncRNA sequences (pre-training) + CDS fine-tuning
+ - **Source checkpoint:** `genbio-ai/AIDO.RNA-650M-CDS`
+
+ ### Checkpoint selection
+
+ Prefer this checkpoint over the base 650M model for tasks involving coding regions or translation.
+
+ ## Parity Verification
+
+ Hidden-state representations were compared against the original `genbio-ai/AIDO.RNA-650M-CDS` weights at all
+ 34 representation levels (embedding + 33 transformer layers).
+ The final output matches the original within 1e-5 (max diff = 4.77e-06).
+ Intermediate layers differ by at most 3.81e-05; this is floating-point accumulation noise that the
+ final layer norm normalises away. Verified on GPU with PyTorch 2.7 / CUDA 12.
+
+ ## Related Models
+
+ See the full [AIDO.RNA collection](COLLECTION_URL).
+
+ | Model | Parameters | Data | Notes |
+ |---|---|---|---|
+ | [AIDO.RNA-1M-MARS](COLLECTION_URL) | 1M | MARS ncRNA | Smallest MARS variant |
+ | [AIDO.RNA-25M-MARS](COLLECTION_URL) | 25M | MARS ncRNA | Mid-size MARS variant |
+ | [AIDO.RNA-300M-MARS](COLLECTION_URL) | 300M | MARS ncRNA | Large MARS variant |
+ | [AIDO.RNA-650M](COLLECTION_URL) | 650M | 42M ncRNA | Base model |
+ | [AIDO.RNA-650M-CDS](COLLECTION_URL) | 650M | 42M ncRNA + CDS | CDS-adapted |
+ | [AIDO.RNA-1.6B](COLLECTION_URL) | 1.6B | 42M ncRNA | Largest base model |
+ | [AIDO.RNA-1.6B-CDS](COLLECTION_URL) | 1.6B | 42M ncRNA + CDS | Largest CDS-adapted |
+
+ ## Usage
+
+ ### Embedding generation
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
+ model = AutoModel.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
+ model.eval()
+
+ sequences = ["ACGUGCUAGCUAGCUA", "AUGCUAGCUAGCUAGC"]
+ enc = tokenizer(sequences, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     out = model(**enc)
+
+ cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 1280) -- CLS token
+ token_emb = out.last_hidden_state         # (batch, seq_len, 1280)
+
+ # Intermediate layers
+ with torch.no_grad():
+     out_all = model(**enc, output_hidden_states=True)
+ layer3_emb = out_all.hidden_states[3]  # index 0 is the embedding output
+ ```
+
+ ### MLM logits
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
+ model = AutoModelForMaskedLM.from_pretrained("Taykhoom/AIDO.RNA-650M-CDS", trust_remote_code=True)
+ model.eval()
+
+ enc = tokenizer(["ACGU[MASK]GCUA"], return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**enc).logits  # (1, seq_len, 16)
+ ```
+
+ ### Fine-tuning
+
+ Fine-tuning follows standard HF conventions. Use `cls_emb = out.last_hidden_state[:, 0, :]` (CLS token) as
+ input to a task-specific head for sequence-level tasks.
+
+ ## Implementation Notes
+
+ The original `genbio-ai/AIDO.RNA-650M-CDS` checkpoint requires the
+ [ModelGenerator](https://github.com/genbio-ai/ModelGenerator) package to load.
+ This port is a clean standalone re-implementation:
+
+ - All model logic is contained in `modeling_aidorna.py` and `configuration_aidorna.py`.
+ - `attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` are added
+   (not present in the original genbio-ai implementation).
+ - Architecture: pre-LN Transformer with SwiGLU MLP and RoPE positional embeddings.
+
+ ## Citation
+
+ ```bibtex
+ @article{zou2024_aidorna,
+   title = {A Large-Scale Foundation Model for {RNA} Function and Structure Prediction},
+   author = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
+   journal = {bioRxiv},
+   year = {2024},
+   doi = {10.1101/2024.11.28.625345}
+ }
+ ```
+
+ ## Credits
+
+ Original model and code by Zou et al. Source: [GitHub](https://github.com/genbio-ai/AIDO).
+ The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+ and reviewed manually by Taykhoom Dalal.
+
+ ## License
+
+ GenBio AI Community License, following the original repository. See `LICENSE` for details.
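
Note on the README's Parity Verification section: the level-by-level comparison it describes can be sketched as below. This is a minimal illustration, not the script used for the port. The names `max_abs_diff` and `check_parity` and the toy values are hypothetical, and plain nested lists stand in for the hidden-state tensors. With the real models, the levels would be the 34 entries returned when `output_hidden_states=True`.

```python
from typing import List, Sequence, Tuple


def max_abs_diff(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Largest element-wise absolute difference between two equally shaped 2-D lists."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def check_parity(ref_levels, port_levels, final_tol: float = 1e-5) -> Tuple[List[float], bool]:
    """Compare hidden states level by level (embedding output, then each layer).

    Only the final level is gated by `final_tol`, mirroring the README, which
    treats intermediate differences as accumulation noise.
    """
    if len(ref_levels) != len(port_levels):
        raise ValueError("both models must expose the same number of levels")
    diffs = [max_abs_diff(r, p) for r, p in zip(ref_levels, port_levels)]
    return diffs, diffs[-1] <= final_tol


# Toy data (hypothetical): 3 levels of a (2 tokens x 2 dims) hidden state.
ref = [[[0.0, 1.0], [2.0, 3.0]], [[0.5, 0.5], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]]
port = [[[0.0, 1.0], [2.0, 3.0]], [[0.5, 0.50003], [1.0, 1.0]], [[1.0, 2.000004], [3.0, 4.0]]]

diffs, final_ok = check_parity(ref, port)
print(diffs, final_ok)
```

Reporting the per-level maxima separately from the final-level gate keeps the intermediate drift visible without letting it fail the check.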
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "add_linear_bias": true,
+   "attention_probs_dropout_prob": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_aidorna.AIDORNAConfig",
+     "AutoModel": "modeling_aidorna.AIDORNAModel",
+     "AutoModelForMaskedLM": "modeling_aidorna.AIDORNAForMaskedLM"
+   },
+   "hidden_act": "swiglu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 1280,
+   "initializer_range": 0.02,
+   "intermediate_size": 3392,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 1024,
+   "model_type": "aidorna",
+   "normalization_type": "LayerNorm",
+   "num_attention_heads": 20,
+   "num_hidden_layers": 33,
+   "pad_token_id": 0,
+   "position_embedding_type": "rope",
+   "rotary_percent": 1.0,
+   "seq_len_interpolation_factor": null,
+   "transformers_version": "4.57.6",
+   "vocab_size": 16
+ }
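
Note on config.json: a few derived quantities follow directly from these values. The sketch below is a rough worked calculation, not part of the commit. It assumes the layer shapes defined in `modeling_aidorna.py` in this commit (separate query/key/value/output projections, a three-matrix SwiGLU MLP, two LayerNorms per layer plus a final one) and leaves out the MLM head, so the total is an encoder-only estimate. `linear_params` and `encoder_param_count` are hypothetical helper names.

```python
import json

# Subset of the committed config.json that the calculation needs.
CONFIG_JSON = """
{
  "hidden_size": 1280,
  "intermediate_size": 3392,
  "num_attention_heads": 20,
  "num_hidden_layers": 33,
  "vocab_size": 16,
  "add_linear_bias": true
}
"""

cfg = json.loads(CONFIG_JSON)


def linear_params(n_in: int, n_out: int, bias: bool) -> int:
    """Parameter count of a dense layer, with optional bias vector."""
    return n_in * n_out + (n_out if bias else 0)


def encoder_param_count(cfg: dict) -> int:
    h, m = cfg["hidden_size"], cfg["intermediate_size"]
    bias = cfg["add_linear_bias"]
    layer_norm = 2 * h                                   # weight + bias
    attention = 4 * linear_params(h, h, bias)            # query, key, value, output dense
    mlp = 2 * linear_params(h, m, bias) + linear_params(m, h, bias)  # up, gate, down
    per_layer = attention + mlp + 2 * layer_norm         # two norms per layer
    embeddings = cfg["vocab_size"] * h                   # word embeddings only (RoPE adds none)
    return cfg["num_hidden_layers"] * per_layer + embeddings + layer_norm  # + final norm


head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
n_params = encoder_param_count(cfg)
print(head_dim, n_params)
```

For these values the sketch gives a head dimension of 64 and roughly 647M encoder parameters, in line with the "650M" in the model name.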
configuration_aidorna.py ADDED
@@ -0,0 +1,55 @@
+ from transformers import PretrainedConfig
+
+
+ class AIDORNAConfig(PretrainedConfig):
+     model_type = "aidorna"
+
+     auto_map = {
+         "AutoConfig": "configuration_aidorna.AIDORNAConfig",
+         "AutoModel": "modeling_aidorna.AIDORNAModel",
+         "AutoModelForMaskedLM": "modeling_aidorna.AIDORNAForMaskedLM",
+     }
+
+     def __init__(
+         self,
+         vocab_size: int = 16,
+         hidden_size: int = 1280,
+         num_hidden_layers: int = 33,
+         num_attention_heads: int = 20,
+         intermediate_size: int = 3392,
+         hidden_act: str = "swiglu",
+         hidden_dropout_prob: float = 0.0,
+         attention_probs_dropout_prob: float = 0.0,
+         max_position_embeddings: int = 1024,
+         initializer_range: float = 0.02,
+         layer_norm_eps: float = 1e-5,
+         pad_token_id: int = 0,
+         position_embedding_type: str = "rope",
+         rotary_percent: float = 1.0,
+         seq_len_interpolation_factor=None,
+         normalization_type: str = "LayerNorm",
+         add_linear_bias: bool = True,
+         **kwargs,
+     ):
+         super().__init__(pad_token_id=pad_token_id, **kwargs)
+         self.auto_map = {
+             "AutoConfig": "configuration_aidorna.AIDORNAConfig",
+             "AutoModel": "modeling_aidorna.AIDORNAModel",
+             "AutoModelForMaskedLM": "modeling_aidorna.AIDORNAForMaskedLM",
+         }
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.intermediate_size = intermediate_size
+         self.hidden_act = hidden_act
+         self.hidden_dropout_prob = hidden_dropout_prob
+         self.attention_probs_dropout_prob = attention_probs_dropout_prob
+         self.max_position_embeddings = max_position_embeddings
+         self.initializer_range = initializer_range
+         self.layer_norm_eps = layer_norm_eps
+         self.position_embedding_type = position_embedding_type
+         self.rotary_percent = rotary_percent
+         self.seq_len_interpolation_factor = seq_len_interpolation_factor
+         self.normalization_type = normalization_type
+         self.add_linear_bias = add_linear_bias
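
Note on the RoPE-related config fields: how `rotary_percent` and `seq_len_interpolation_factor` shape the rotary angles can be sketched in plain Python. The arithmetic follows the `RotaryEmbedding`, `_rotate_half` and `apply_rotary_pos_emb` definitions in `modeling_aidorna.py` from this commit (base 10000, duplicated frequency halves). The function names `rope_angles` and `rotate` are hypothetical, and lists stand in for tensors, so treat it as an illustration rather than the model code.

```python
import math
from typing import List, Optional


def rope_angles(
    seq_len: int,
    head_dim: int,
    rotary_percent: float = 1.0,
    seq_len_interpolation_factor: Optional[float] = None,
    rotary_base: float = 10000.0,
) -> List[List[float]]:
    """Return a (seq_len x rot_dim) table of rotation angles."""
    dim = head_dim
    if rotary_percent < 1.0:
        dim = int(dim * rotary_percent)  # rotate only a prefix of each head
    inv_freq = [1.0 / (rotary_base ** (i / dim)) for i in range(0, dim, 2)]
    table = []
    for pos in range(seq_len):
        p = float(pos)
        if seq_len_interpolation_factor is not None:
            p = p / seq_len_interpolation_factor  # compress positions
        freqs = [p * f for f in inv_freq]
        table.append(freqs + freqs)  # same angles for both halves (rotate-half layout)
    return table


def rotate(vec: List[float], angles: List[float]) -> List[float]:
    """Apply rotate-half RoPE to the first len(angles) entries of `vec`."""
    rot_dim = len(angles)
    x, rest = vec[:rot_dim], vec[rot_dim:]
    half = rot_dim // 2
    rotated_half = [-v for v in x[half:]] + x[:half]
    out = [a * math.cos(t) + b * math.sin(t) for a, b, t in zip(x, rotated_half, angles)]
    return out + rest  # entries beyond rot_dim pass through unchanged


angles = rope_angles(seq_len=4, head_dim=64)  # head_dim = 1280 / 20
q = [1.0] * 64
q_rot = rotate(q, angles[3])
norm_before = math.sqrt(sum(v * v for v in q))
norm_after = math.sqrt(sum(v * v for v in q_rot))
print(norm_before, norm_after)
```

Each pair of coordinates is rotated by the same angle, so the vector norm is unchanged and position 0 is left as-is. With `rotary_percent=1.0`, as in this config, the whole 64-dimensional head is rotated.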
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d7173b5093ccda67d9206c08e3dba2de7d44b430ab658f34419e131e30f7211
+ size 2593642120
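
Note on the LFS pointer: the three `key value` lines above are all that Git stores for `model.safetensors`. The sketch below parses them and relates the byte size to a parameter count. `parse_lfs_pointer` is a hypothetical helper, and the 4-bytes-per-parameter figure is an assumption (float32 storage); the file also carries a small safetensors header, so the result is only an upper bound.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0d7173b5093ccda67d9206c08e3dba2de7d44b430ab658f34419e131e30f7211
size 2593642120
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_gib = pointer["size_bytes"] / 2**30

# Upper bound on the parameter count IF weights are stored as 4-byte floats
# (an assumption, not stated in the commit).
approx_params_fp32 = pointer["size_bytes"] // 4
print(pointer["hash_algo"], round(size_gib, 2), approx_params_fp32)
```

Under that assumption the file holds at most about 648M parameters, which fits the model's 650M label.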
modeling_aidorna.py ADDED
@@ -0,0 +1,470 @@
+ import math
+ from typing import Optional, Union
+
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from torch import nn
+
+ from transformers import PreTrainedModel
+ from transformers.activations import ACT2FN
+ from transformers.modeling_outputs import BaseModelOutput, MaskedLMOutput
+
+ from .configuration_aidorna import AIDORNAConfig
+
+
+ class AIDORNARMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+
+
+ def _get_norm_cls(config):
+     if config.normalization_type == "RMSNorm":
+         return AIDORNARMSNorm
+     return nn.LayerNorm
+
+
+ class RotaryEmbedding(nn.Module):
+     def __init__(self, kv_channels, rotary_percent=1.0, seq_len_interpolation_factor=None, rotary_base=10000):
+         super().__init__()
+         dim = kv_channels
+         if rotary_percent < 1.0:
+             dim = int(dim * rotary_percent)
+         self.seq_len_interpolation_factor = seq_len_interpolation_factor
+         inv_freq = 1.0 / (rotary_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+     def forward(self, max_seq_len, offset=0):
+         seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype) + offset
+         if self.seq_len_interpolation_factor is not None:
+             seq = seq / self.seq_len_interpolation_factor
+         freqs = torch.outer(seq, self.inv_freq)
+         emb = torch.cat((freqs, freqs), dim=-1)
+         return emb[:, None, None, :]  # (T, 1, 1, rot_dim*2)
+
+
+ def _rotate_half(x):
+     x1, x2 = torch.chunk(x, 2, dim=-1)
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(t, freqs):
+     rot_dim = freqs.shape[-1]
+     t_rot, t_pass = t[..., :rot_dim], t[..., rot_dim:]
+     cos_ = torch.cos(freqs).to(t.dtype)
+     sin_ = torch.sin(freqs).to(t.dtype)
+     t_rot = (t_rot * cos_) + (_rotate_half(t_rot) * sin_)
+     return torch.cat((t_rot, t_pass), dim=-1)
+
+
+ def _qkv_rope(q, k, rotary_pos_emb):
+     q = q.permute(2, 0, 1, 3).contiguous()  # (B,H,T,d) -> (T,B,H,d)
+     k = k.permute(2, 0, 1, 3).contiguous()
+     q_pos, k_pos = (rotary_pos_emb, rotary_pos_emb) if not isinstance(rotary_pos_emb, tuple) else rotary_pos_emb
+     q = apply_rotary_pos_emb(q, q_pos)
+     k = apply_rotary_pos_emb(k, k_pos)
+     q = q.permute(1, 2, 0, 3).contiguous()  # (T,B,H,d) -> (B,H,T,d)
+     k = k.permute(1, 2, 0, 3).contiguous()
+     return q, k
+
+
+ class AIDORNASelfAttention(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         assert config.hidden_size % config.num_attention_heads == 0
+         self.num_heads = config.num_attention_heads
+         self.head_dim = config.hidden_size // config.num_attention_heads
+         self.all_head_size = self.num_heads * self.head_dim
+
+         self.query = nn.Linear(config.hidden_size, self.all_head_size, bias=config.add_linear_bias)
+         self.key = nn.Linear(config.hidden_size, self.all_head_size, bias=config.add_linear_bias)
+         self.value = nn.Linear(config.hidden_size, self.all_head_size, bias=config.add_linear_bias)
+         self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+
+     def _project(self, x):
+         B, T, _ = x.size()
+         def t(z):
+             return z.view(B, T, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
+         return t(self.query(x)), t(self.key(x)), t(self.value(x))
+
+     def forward(self, hidden_states, attention_mask, output_attentions=False, rotary_pos_emb=None):
+         B, T, _ = hidden_states.size()
+         q, k, v = self._project(hidden_states)
+
+         if rotary_pos_emb is not None:
+             q, k = _qkv_rope(q, k, rotary_pos_emb)
+
+         scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.head_dim)
+         if attention_mask is not None:
+             scores = scores + attention_mask.to(scores.dtype)
+
+         probs = F.softmax(scores, dim=-1)
+         if attention_mask is not None:
+             probs = probs.masked_fill(attention_mask < -1e5, 0.0)
+         probs = self.dropout(probs)
+
+         context = torch.matmul(probs, v).permute(0, 2, 1, 3).contiguous().view(B, T, self.all_head_size)
+         return context, probs if output_attentions else None
+
+
+ class AIDORNASdpaAttention(AIDORNASelfAttention):
+     def forward(self, hidden_states, attention_mask, output_attentions=False, rotary_pos_emb=None):
+         if output_attentions:
+             return super().forward(hidden_states, attention_mask, output_attentions=True, rotary_pos_emb=rotary_pos_emb)
+
+         B, T, _ = hidden_states.size()
+         q, k, v = self._project(hidden_states)
+
+         if rotary_pos_emb is not None:
+             q, k = _qkv_rope(q, k, rotary_pos_emb)
+
+         attn_mask = attention_mask.to(q.dtype) if attention_mask is not None else None
+         context = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
+         context = context.permute(0, 2, 1, 3).contiguous().view(B, T, self.all_head_size)
+         return context, None
+
+
+ class AIDORNAFlashAttention2(AIDORNASelfAttention):
+     def forward(self, hidden_states, attention_mask, output_attentions=False, rotary_pos_emb=None):
+         if output_attentions:
+             return super().forward(hidden_states, attention_mask, output_attentions=True, rotary_pos_emb=rotary_pos_emb)
+
+         try:
+             from flash_attn import flash_attn_func
+             from flash_attn.bert_padding import pad_input, unpad_input
+         except ImportError as e:
+             raise ImportError(
+                 "flash_attn is required for attn_implementation='flash_attention_2'. "
+                 "Install with: pip install flash-attn --no-build-isolation"
+             ) from e
+
+         B, T, _ = hidden_states.size()
+         q, k, v = self._project(hidden_states)  # each (B, H, T, d)
+
+         if rotary_pos_emb is not None:
+             q, k = _qkv_rope(q, k, rotary_pos_emb)
+
+         # flash_attn layout: (B, T, H, d)
+         q, k, v = q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3)
+
+         orig_dtype = q.dtype
+         if q.dtype not in (torch.float16, torch.bfloat16):
+             q = q.to(torch.bfloat16)
+             k = k.to(torch.bfloat16)
+             v = v.to(torch.bfloat16)
+
+         # Derive key padding mask from extended additive attention mask.
+         # Position 0 (CLS) is always non-padding, so row 0 reliably reflects per-token pad status.
+         if attention_mask is not None:
+             key_padding_mask_bool = (attention_mask[:, 0, 0, :] > -1e10)  # True = attend
+             has_padding = not key_padding_mask_bool.all()
+         else:
+             key_padding_mask_bool = None
+             has_padding = False
+
+         if has_padding and key_padding_mask_bool is not None:
+             from flash_attn import flash_attn_varlen_func
+             q_u, indices, cu_seqlens, max_seqlen, _ = unpad_input(q, key_padding_mask_bool)
+             k_u, *_ = unpad_input(k, key_padding_mask_bool)
+             v_u, *_ = unpad_input(v, key_padding_mask_bool)
+             out_u = flash_attn_varlen_func(
+                 q_u, k_u, v_u,
+                 cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
+                 max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
+                 causal=False,
+             )
+             out = pad_input(out_u, indices, B, T)  # (B, T, H, d)
+         else:
+             out = flash_attn_func(q, k, v, causal=False)  # (B, T, H, d)
+
+         context = out.to(orig_dtype).contiguous().view(B, T, self.all_head_size)
+         return context, None
+
+
+ AIDORNA_ATTENTION_CLASSES = {
+     "eager": AIDORNASelfAttention,
+     "sdpa": AIDORNASdpaAttention,
+     "flash_attention_2": AIDORNAFlashAttention2,
+ }
+
+
+ class AIDORNASelfOutput(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size, bias=config.add_linear_bias)
+         self.dropout = nn.Dropout(config.hidden_dropout_prob)
+
+     def forward(self, hidden_states, residual):
+         return residual + self.dropout(self.dense(hidden_states))
+
+
+ class AIDORNAAttention(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         norm_cls = _get_norm_cls(config)
+         self.ln = norm_cls(config.hidden_size, eps=config.layer_norm_eps)
+         attn_cls = AIDORNA_ATTENTION_CLASSES[getattr(config, "_attn_implementation", "eager")]
+         self.self = attn_cls(config)
+         self.output = AIDORNASelfOutput(config)
+
+     def forward(self, hidden_states, attention_mask, output_attentions=False, rotary_pos_emb=None):
+         ln_out = self.ln(hidden_states)
+         attn_out, attn_weights = self.self(ln_out, attention_mask, output_attentions, rotary_pos_emb)
+         out = self.output(attn_out, hidden_states)
+         return out, attn_weights
+
+
+ class AIDORNAMLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.add_linear_bias)
+         self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.add_linear_bias)
+         self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=config.add_linear_bias)
+
+     def forward(self, x):
+         return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
+
+
+ class AIDORNALayer(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.attention = AIDORNAAttention(config)
+         norm_cls = _get_norm_cls(config)
+         self.ln = norm_cls(config.hidden_size, eps=config.layer_norm_eps)
+         self.mlp = AIDORNAMLP(config)
+         self.dropout = nn.Dropout(config.hidden_dropout_prob)
+
+     def forward(self, hidden_states, attention_mask, output_attentions=False, rotary_pos_emb=None):
+         attn_out, attn_weights = self.attention(hidden_states, attention_mask, output_attentions, rotary_pos_emb)
+         mlp_out = self.dropout(self.mlp(self.ln(attn_out)))
+         return attn_out + mlp_out, attn_weights
+
+
252
+ class AIDORNAEncoder(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.layer = nn.ModuleList([AIDORNALayer(config) for _ in range(config.num_hidden_layers)])
+         norm_cls = _get_norm_cls(config)
+         self.ln = norm_cls(config.hidden_size, eps=config.layer_norm_eps)
+         self.gradient_checkpointing = False
+
+     def forward(self, hidden_states, attention_mask, output_attentions, output_hidden_states, rotary_pos_emb=None):
+         all_hidden_states = (hidden_states,) if output_hidden_states else None
+         all_attentions = () if output_attentions else None
+
+         for layer in self.layer:
+             if self.gradient_checkpointing and self.training:
+                 hidden_states, attn_w = self._gradient_checkpointing_func(
+                     layer.__call__, hidden_states, attention_mask, output_attentions, rotary_pos_emb
+                 )
+             else:
+                 hidden_states, attn_w = layer(hidden_states, attention_mask, output_attentions, rotary_pos_emb)
+
+             if output_hidden_states:
+                 all_hidden_states = all_hidden_states + (hidden_states,)
+             if output_attentions:
+                 all_attentions = all_attentions + (attn_w,)
+
+         hidden_states = self.ln(hidden_states)
+
+         if output_hidden_states:
+             # Replace the last element with the final LN-normalised version
+             all_hidden_states = all_hidden_states[:-1] + (hidden_states,)
+
+         return hidden_states, all_hidden_states, all_attentions
+
+
+ class AIDORNAEmbeddings(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
+         self.dropout = nn.Dropout(config.hidden_dropout_prob)
+         self.position_embedding_type = config.position_embedding_type
+
+         if config.position_embedding_type == "absolute":
+             self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
+             self.register_buffer(
+                 "position_ids",
+                 torch.arange(config.max_position_embeddings).expand((1, -1)),
+                 persistent=False,
+             )
+
+     def forward(self, input_ids):
+         embeddings = self.word_embeddings(input_ids)
+         if self.position_embedding_type == "absolute":
+             seq_len = input_ids.size(1)
+             embeddings = embeddings + self.position_embeddings(self.position_ids[:, :seq_len])
+         return self.dropout(embeddings)
+
+
+ def _make_extended_attention_mask(attention_mask):
+     b1s = attention_mask.unsqueeze(1)  # (B, 1, T)
+     bs1 = attention_mask.unsqueeze(2)  # (B, T, 1)
+     bss = (b1s * bs1).unsqueeze(1)  # (B, 1, T, T)
+     return (bss < 0.5).float() * torch.finfo(torch.float).min
+
+
+ class AIDORNAPreTrainedModel(PreTrainedModel):
+     config_class = AIDORNAConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _supports_sdpa = True
+     _supports_flash_attn_2 = True
+
+     def _init_weights(self, module):
+         if isinstance(module, (nn.Linear, nn.Embedding)):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+         if isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+         if isinstance(module, AIDORNARMSNorm):
+             module.weight.data.fill_(1.0)
+         if isinstance(module, nn.Linear) and module.bias is not None:
+             module.bias.data.zero_()
+
+
+ class AIDORNAModel(AIDORNAPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+         self.embeddings = AIDORNAEmbeddings(config)
+         self.encoder = AIDORNAEncoder(config)
+
+         if config.position_embedding_type == "rope":
+             head_dim = config.hidden_size // config.num_attention_heads
+             self.rotary_pos_emb = RotaryEmbedding(
+                 kv_channels=head_dim,
+                 rotary_percent=config.rotary_percent,
+                 seq_len_interpolation_factor=config.seq_len_interpolation_factor,
+             )
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embeddings.word_embeddings
+
+     def set_input_embeddings(self, value):
+         self.embeddings.word_embeddings = value
+
+     def forward(
+         self,
+         input_ids=None,
+         attention_mask=None,
+         output_hidden_states=None,
+         output_attentions=None,
+         return_dict=None,
+     ):
+         output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         if input_ids is None:
+             raise ValueError("input_ids must be provided")
+
+         if attention_mask is None:
+             attention_mask = torch.ones_like(input_ids)
+
+         extended_mask = _make_extended_attention_mask(attention_mask)
+
+         rotary_pos_emb = None
+         if self.config.position_embedding_type == "rope":
+             rotary_pos_emb = self.rotary_pos_emb(input_ids.size(1))
+
+         hidden_states = self.embeddings(input_ids)
+         last_hidden_state, all_hidden_states, all_attentions = self.encoder(
+             hidden_states, extended_mask, output_attentions, output_hidden_states, rotary_pos_emb
+         )
+
+         if not return_dict:
+             return tuple(v for v in [last_hidden_state, all_hidden_states, all_attentions] if v is not None)
+
+         return BaseModelOutput(
+             last_hidden_state=last_hidden_state,
+             hidden_states=all_hidden_states,
+             attentions=all_attentions,
+         )
+
+
+ class AIDORNAPredictionHeadTransform(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.transform_act_fn = ACT2FN["gelu"]
+         norm_cls = _get_norm_cls(config)
+         self.LayerNorm = norm_cls(config.hidden_size, eps=config.layer_norm_eps)
+
+     def forward(self, hidden_states):
+         return self.LayerNorm(self.transform_act_fn(self.dense(hidden_states)))
+
+
+ class AIDORNALMPredictionHead(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.transform = AIDORNAPredictionHeadTransform(config)
+         self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+         self.bias = nn.Parameter(torch.zeros(config.vocab_size))
+         self.decoder.bias = self.bias
+
+     def forward(self, hidden_states):
+         return self.decoder(self.transform(hidden_states))
+
+
+ class AIDORNAOnlyMLMHead(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.predictions = AIDORNALMPredictionHead(config)
+
+     def forward(self, x):
+         return self.predictions(x)
+
+
+ class AIDORNAForMaskedLM(AIDORNAPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = AIDORNAModel(config)
+         self.cls = AIDORNAOnlyMLMHead(config)
+         self.post_init()
+
+     def get_output_embeddings(self):
+         return self.cls.predictions.decoder
+
+     def set_output_embeddings(self, new_embeddings):
+         self.cls.predictions.decoder = new_embeddings
+
+     def get_input_embeddings(self):
+         return self.model.get_input_embeddings()
+
+     def set_input_embeddings(self, value):
+         self.model.set_input_embeddings(value)
+
+     def forward(
+         self,
+         input_ids=None,
+         attention_mask=None,
+         labels=None,
+         output_hidden_states=None,
+         output_attentions=None,
+         return_dict=None,
+     ):
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+         out = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             output_hidden_states=output_hidden_states,
+             output_attentions=output_attentions,
+             return_dict=return_dict,
+         )
+         logits = self.cls(out[0] if not return_dict else out.last_hidden_state)
+         loss = None
+         if labels is not None:
+             loss = F.cross_entropy(logits.view(-1, self.config.vocab_size), labels.view(-1), ignore_index=-100)
+         if not return_dict:
+             return ((loss, logits) + out[1:]) if loss is not None else (logits,) + out[1:]
+         return MaskedLMOutput(loss=loss, logits=logits, hidden_states=out.hidden_states, attentions=out.attentions)
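Note on `_make_extended_attention_mask` above: it multiplies the padding mask with itself as an outer product, so position (i, j) is kept only when both token i and token j are real. A minimal pure-Python sketch for a single sequence follows. The names `pairwise_additive_mask` and `NEG_FILL` are hypothetical, and `-sys.float_info.max` is only a stand-in for `torch.finfo(torch.float).min`.

```python
import sys

# Stand-in for torch.finfo(torch.float).min; the exact value is an assumption of this sketch.
NEG_FILL = -sys.float_info.max


def pairwise_additive_mask(padding_mask):
    """Build a T x T additive mask from a length-T padding mask (1 = real token, 0 = padding).

    Entry (i, j) is 0.0 when both positions are real tokens, and NEG_FILL otherwise,
    mirroring the `(b1s * bs1) < 0.5` test in the helper above.
    """
    return [
        [0.0 if q * k >= 0.5 else NEG_FILL for k in padding_mask]
        for q in padding_mask
    ]


# Two real tokens followed by one padding token.
mask = pairwise_additive_mask([1, 1, 0])
for row in mask:
    print(["keep" if value == 0.0 else "mask" for value in row])
```

Because the mask is an outer product, the row belonging to a padded query position is masked in every column, so outputs at padded positions are best ignored downstream.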
special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "bos_token": "[BOS]",
+   "cls_token": "[CLS]",
+   "eos_token": "[EOS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenization_aidorna.py ADDED
@@ -0,0 +1,100 @@
+ import os
+ from typing import List, Optional
+
+ from transformers import PreTrainedTokenizer
+
+ _DEFAULT_VOCAB = [
+     "[PAD]", "[MASK]", "[CLS]", "[SEP]", "[UNK]",
+     "A", "G", "C", "T", "U", "N",
+     "[BOS]", "[EOS]", "[UNUSED1]", "[UNUSED2]", "[UNUSED3]",
+ ]
+
+
+ class AIDORNATokenizer(PreTrainedTokenizer):
+     vocab_files_names = {"vocab_file": "vocab.txt"}
+     model_input_names = ["input_ids", "attention_mask"]
+
+     def __init__(
+         self,
+         vocab_file=None,
+         unk_token="[UNK]",
+         cls_token="[CLS]",
+         pad_token="[PAD]",
+         mask_token="[MASK]",
+         sep_token="[SEP]",
+         bos_token="[BOS]",
+         eos_token="[EOS]",
+         **kwargs,
+     ):
+         if vocab_file is not None and os.path.isfile(vocab_file):
+             with open(vocab_file) as f:
+                 self.all_tokens = [line.strip() for line in f if line.strip()]
+         else:
+             self.all_tokens = list(_DEFAULT_VOCAB)
+
+         self._id_to_token = dict(enumerate(self.all_tokens))
+         self._token_to_id = {tok: idx for idx, tok in enumerate(self.all_tokens)}
+
+         super().__init__(
+             unk_token=unk_token,
+             cls_token=cls_token,
+             pad_token=pad_token,
+             mask_token=mask_token,
+             sep_token=sep_token,
+             bos_token=bos_token,
+             eos_token=eos_token,
+             **kwargs,
+         )
+
+         # Register all vocab tokens as no-split so the trie-based tokenizer matches
+         # single characters (A, G, C, T, U, N) and special tokens exactly.
+         self.unique_no_split_tokens = self.all_tokens
+         self._update_trie(self.unique_no_split_tokens)
+
+     @property
+     def vocab_size(self):
+         return len(self.all_tokens)
+
+     def get_vocab(self):
+         vocab = dict(self._token_to_id)
+         vocab.update(self.added_tokens_encoder)
+         return vocab
+
+     def _tokenize(self, text, **kwargs):
+         return text.split()
+
+     def _convert_token_to_id(self, token):
+         return self._token_to_id.get(token, self._token_to_id.get("[UNK]", 4))
+
+     def _convert_id_to_token(self, index):
+         return self._id_to_token.get(index, "[UNK]")
+
+     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+         cls = [self.cls_token_id]
+         sep = [self.sep_token_id]
+         if token_ids_1 is None:
+             return cls + token_ids_0 + sep
+         return cls + token_ids_0 + sep + cls + token_ids_1 + sep
+
+     def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
+         if already_has_special_tokens:
+             return [1 if t in self.all_special_ids else 0 for t in token_ids_0]
+         mask = [1] + [0] * len(token_ids_0) + [1]
+         if token_ids_1 is not None:
+             mask += [1] + [0] * len(token_ids_1) + [1]
+         return mask
+
+     def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
+         sep = [self.sep_token_id]
+         cls = [self.cls_token_id]
+         if token_ids_1 is None:
+             return [0] * (1 + len(token_ids_0) + 1)
+         return [0] * (1 + len(token_ids_0) + 1) + [1] * (1 + len(token_ids_1) + 1)
+
+     def save_vocabulary(self, save_directory, filename_prefix=None):
+         os.makedirs(save_directory, exist_ok=True)
+         fname = (filename_prefix + "-" if filename_prefix else "") + "vocab.txt"
+         path = os.path.join(save_directory, fname)
+         with open(path, "w") as f:
+             f.write("\n".join(self.all_tokens))
+         return (path,)
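Note on the tokenizer above: a single sequence is wrapped as `[CLS] ... [SEP]`, with ids taken from the token's position in the vocabulary. The following is a simplified sketch of that layout without depending on `transformers`. The helper names `encode_single` and `special_tokens_mask` are hypothetical. It assumes one token per character, whereas the real class relies on the trie to split the text, so inputs such as lowercase letters or multi-character unknown runs may behave differently there.

```python
# Vocabulary order copied from _DEFAULT_VOCAB above; a token's id is its index.
VOCAB = [
    "[PAD]", "[MASK]", "[CLS]", "[SEP]", "[UNK]",
    "A", "G", "C", "T", "U", "N",
    "[BOS]", "[EOS]", "[UNUSED1]", "[UNUSED2]", "[UNUSED3]",
]
TOKEN_TO_ID = {tok: idx for idx, tok in enumerate(VOCAB)}
SPECIAL_TOKENS = ("[PAD]", "[MASK]", "[CLS]", "[SEP]", "[UNK]", "[BOS]", "[EOS]")


def encode_single(sequence):
    """Map each character to its id (unknown characters become [UNK]) and wrap in [CLS] ... [SEP]."""
    unk_id = TOKEN_TO_ID["[UNK]"]
    body = [TOKEN_TO_ID.get(ch, unk_id) for ch in sequence]
    return [TOKEN_TO_ID["[CLS]"]] + body + [TOKEN_TO_ID["[SEP]"]]


def special_tokens_mask(ids):
    """Return 1 for special-token ids and 0 for nucleotide ids."""
    special_ids = {TOKEN_TO_ID[tok] for tok in SPECIAL_TOKENS}
    return [1 if i in special_ids else 0 for i in ids]


ids = encode_single("ACGU")
print(ids)
print(special_tokens_mask(ids))
```

A mask of this kind is typically what a masked-language-modelling data collator consults so that `[CLS]` and `[SEP]` are never selected for masking.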
tokenizer_config.json ADDED
@@ -0,0 +1,77 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "11": {
+       "content": "[BOS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "12": {
+       "content": "[EOS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[BOS]",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "eos_token": "[EOS]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1024,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "AIDORNATokenizer",
+   "unk_token": "[UNK]",
+   "auto_map": {
+     "AutoTokenizer": [
+       "tokenization_aidorna.AIDORNATokenizer",
+       null
+     ]
+   }
+ }
vocab.txt ADDED
@@ -0,0 +1,16 @@
+ [PAD]
+ [MASK]
+ [CLS]
+ [SEP]
+ [UNK]
+ A
+ G
+ C
+ T
+ U
+ N
+ [BOS]
+ [EOS]
+ [UNUSED1]
+ [UNUSED2]
+ [UNUSED3]
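Note on `vocab.txt` and `tokenizer_config.json`: the ids under `added_tokens_decoder` are expected to equal each token's zero-based line position in `vocab.txt`, since the tokenizer derives ids by enumerating the file. A small consistency check can be sketched as below. `find_mismatches` is a hypothetical helper, and the two literals are copied by hand from the files in this commit rather than read from disk.

```python
# Lines of vocab.txt, in file order (copied from the diff above).
VOCAB_LINES = [
    "[PAD]", "[MASK]", "[CLS]", "[SEP]", "[UNK]",
    "A", "G", "C", "T", "U", "N",
    "[BOS]", "[EOS]", "[UNUSED1]", "[UNUSED2]", "[UNUSED3]",
]

# id -> content pairs from "added_tokens_decoder" in tokenizer_config.json.
ADDED_TOKENS = {
    "0": "[PAD]", "1": "[MASK]", "2": "[CLS]", "3": "[SEP]",
    "4": "[UNK]", "11": "[BOS]", "12": "[EOS]",
}


def find_mismatches(vocab_lines, added_tokens):
    """Return (id, expected_token, actual_token) for every id that disagrees with vocab order."""
    mismatches = []
    for idx_str, content in added_tokens.items():
        idx = int(idx_str)
        actual = vocab_lines[idx] if idx < len(vocab_lines) else None
        if actual != content:
            mismatches.append((idx, content, actual))
    return mismatches


mismatches = find_mismatches(VOCAB_LINES, ADDED_TOKENS)
print(mismatches or "special-token ids agree with vocab.txt order")
```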