eamag committed (verified)
Commit 41a9776 · 1 Parent(s): c552a2c

End of training
README.md ADDED
---
library_name: transformers
license: mit
base_model: chandar-lab/NeoBERT
tags:
- generated_from_trainer
metrics:
- f1
model-index:
- name: NeoBERT-multiclass-classifier-ICLR
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# NeoBERT-multiclass-classifier-ICLR

This model is a fine-tuned version of [chandar-lab/NeoBERT](https://huggingface.co/chandar-lab/NeoBERT) on an unspecified dataset.
It achieves the following results on the evaluation set:
- Loss: 1.7258
- F1: 0.5134

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 20
- label_smoothing_factor: 0.1
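
As a sanity check on these hyperparameters, the step counts in the results table (45 steps per epoch, 900 steps over 20 epochs) imply a training set of at most about 720 examples at batch size 16. A quick back-of-the-envelope check in plain Python (assuming no gradient accumulation, which is not listed above):

```python
# Rough sanity check on the trainer configuration above.
# Assumes no gradient accumulation; the exact dataset size is not published.
train_batch_size = 16
num_epochs = 20
total_steps = 900  # final "Step" value in the results table

steps_per_epoch = total_steps // num_epochs
max_train_examples = steps_per_epoch * train_batch_size

print(steps_per_epoch)     # 45
print(max_train_examples)  # 720
```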

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| No log        | 1.0   | 45   | 1.2609          | 0.4641 |
| No log        | 2.0   | 90   | 1.3167          | 0.4641 |
| 1.2441        | 3.0   | 135  | 1.1799          | 0.5405 |
| 1.2441        | 4.0   | 180  | 1.2810          | 0.5386 |
| 1.0526        | 5.0   | 225  | 1.2742          | 0.5098 |
| 1.0526        | 6.0   | 270  | 1.4929          | 0.5030 |
| 0.7789        | 7.0   | 315  | 1.5076          | 0.5425 |
| 0.7789        | 8.0   | 360  | 1.6513          | 0.4908 |
| 0.5299        | 9.0   | 405  | 1.6172          | 0.5476 |
| 0.5299        | 10.0  | 450  | 1.7358          | 0.5389 |
| 0.5299        | 11.0  | 495  | 1.8935          | 0.4847 |
| 0.4185        | 12.0  | 540  | 1.8012          | 0.5152 |
| 0.4185        | 13.0  | 585  | 1.7241          | 0.5337 |
| 0.3614        | 14.0  | 630  | 1.7109          | 0.5257 |
| 0.3614        | 15.0  | 675  | 1.7233          | 0.5024 |
| 0.3527        | 16.0  | 720  | 1.7104          | 0.5147 |
| 0.3527        | 17.0  | 765  | 1.7282          | 0.5134 |
| 0.3513        | 18.0  | 810  | 1.7257          | 0.5134 |
| 0.3513        | 19.0  | 855  | 1.7263          | 0.5134 |
| 0.3511        | 20.0  | 900  | 1.7258          | 0.5134 |


### Framework versions

- Transformers 4.53.0
- Pytorch 2.7.1+cu126
- Datasets 3.6.0
- Tokenizers 0.21.2
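
## Usage

A hypothetical loading sketch. The repo id below is assumed from the committer and model name and may need adjusting; `trust_remote_code=True` is required because the model ships its own `model.py`:

```python
# Usage sketch (assumed repo id; requires transformers, torch, and xformers).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "eamag/NeoBERT-multiclass-classifier-ICLR"  # assumption, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("This paper extends diffusion models to graphs.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the argmax index back to one of the four labels in config.json
print(model.config.id2label[logits.argmax(-1).item()])
```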
config.json ADDED
```json
{
  "architectures": [
    "NeoBERTForSequenceClassification"
  ],
  "auto_map": {
    "AutoConfig": "model.NeoBERTConfig",
    "AutoModel": "model.NeoBERT",
    "AutoModelForMaskedLM": "model.NeoBERTLMHead",
    "AutoModelForSequenceClassification": "model.NeoBERTForSequenceClassification"
  },
  "classifier_init_range": 0.02,
  "decoder_init_range": 0.02,
  "dim_head": 64,
  "embedding_init_range": 0.02,
  "hidden_size": 768,
  "id2label": {
    "0": "Conceptual Integration",
    "1": "Cross-Domain Application",
    "2": "Direct Enhancement",
    "3": "Other"
  },
  "intermediate_size": 3072,
  "kwargs": {
    "architectures": [
      "NeoBERTLMHead"
    ],
    "attn_implementation": null,
    "auto_map": {
      "AutoConfig": "model.NeoBERTConfig",
      "AutoModel": "model.NeoBERT",
      "AutoModelForMaskedLM": "model.NeoBERTLMHead",
      "AutoModelForSequenceClassification": "model.NeoBERTForSequenceClassification"
    },
    "classifier_init_range": 0.02,
    "dim_head": 64,
    "kwargs": {
      "classifier_init_range": 0.02,
      "pretrained_model_name_or_path": "google-bert/bert-base-uncased",
      "trust_remote_code": true
    },
    "model_type": "neobert",
    "pretrained_model_name_or_path": "google-bert/bert-base-uncased",
    "torch_dtype": "float32",
    "transformers_version": "4.48.2",
    "trust_remote_code": true
  },
  "label2id": {
    "Conceptual Integration": "0",
    "Cross-Domain Application": "1",
    "Direct Enhancement": "2",
    "Other": "3"
  },
  "max_length": 4096,
  "model_type": "neobert",
  "norm_eps": 1e-05,
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "pad_token_id": 0,
  "pretrained_model_name_or_path": "google-bert/bert-base-uncased",
  "torch_dtype": "float32",
  "transformers_version": "4.53.0",
  "trust_remote_code": true,
  "vocab_size": 30522
}
```
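
A minimal sketch of how the `id2label` table in this config turns an argmax index into one of the four class names. Note that, per Hugging Face convention, the config stores the keys (and the `label2id` values) as strings; the values below are copied from `config.json`, and the example logits are made up:

```python
# Map a predicted class index to its label, mirroring id2label in config.json.
id2label = {
    0: "Conceptual Integration",
    1: "Cross-Domain Application",
    2: "Direct Enhancement",
    3: "Other",
}
label2id = {v: k for k, v in id2label.items()}

logits = [0.1, 2.3, 0.7, -1.0]  # hypothetical logits for one input
pred_id = max(range(len(logits)), key=logits.__getitem__)

print(id2label[pred_id])  # Cross-Domain Application
assert label2id[id2label[pred_id]] == pred_id
```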
model.py ADDED
```python
# From https://github.com/facebookresearch/llama/blob/main/llama/model.py

import torch
from torch import nn

from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from torch.nn.functional import scaled_dot_product_attention

from typing import Optional, Tuple
import numpy as np

from xformers.ops import SwiGLU

try:
    from flash_attn.flash_attn_interface import flash_attn_varlen_func

    FLASH_ATTN_AVAILABLE = True
except ImportError:
    FLASH_ATTN_AVAILABLE = False

from transformers import (
    PreTrainedModel,
    PretrainedConfig,
    DataCollatorForLanguageModeling,
)
from transformers.modeling_outputs import (
    BaseModelOutput,
    MaskedLMOutput,
    SequenceClassifierOutput,
)

from .rotary import precompute_freqs_cis, apply_rotary_emb


class DataCollatorWithPacking(DataCollatorForLanguageModeling):
    def __init__(self, pack_sequences=False, **kwargs):
        super().__init__(**kwargs)
        self.pack_sequences = pack_sequences

    def __call__(self, batch):
        if self.pack_sequences:
            # Add position_ids if not present
            if "position_ids" not in batch[0]:
                for item in batch:
                    item["position_ids"] = list(range(len(item["input_ids"])))

            # Pack the sequences into a single list
            input_ids_list = [item["input_ids"] for item in batch]
            position_ids_list = [item["position_ids"] for item in batch]
            seqlens = np.array([0] + [len(ids) for ids in input_ids_list])

            packed_batch = {
                "position_ids": np.concatenate(position_ids_list, axis=0),
                "input_ids": np.concatenate(input_ids_list, axis=0),
                "cu_seqlens": np.cumsum(seqlens),
                "max_seqlen": max(seqlens),
            }

            batch = super().__call__([packed_batch])
            batch["cu_seqlens"] = batch["cu_seqlens"].to(torch.int32).squeeze()
        else:
            batch = super().__call__(batch)
            batch["attention_mask"] = batch["attention_mask"].to(torch.bool)

        return batch
```
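
The `cu_seqlens` tensor built by `DataCollatorWithPacking` is the cumulative offsets that let flash-attention find each sequence's boundaries inside the single packed row. A dependency-free sketch of the same computation (`itertools.accumulate` standing in for `np.cumsum`, with made-up lengths):

```python
# How cu_seqlens delimits packed sequences: for lengths 3, 5, 2 concatenated
# into one flat buffer, sequence i occupies packed[cu_seqlens[i]:cu_seqlens[i+1]].
from itertools import accumulate

lengths = [3, 5, 2]
cu_seqlens = [0] + list(accumulate(lengths))
print(cu_seqlens)  # [0, 3, 8, 10]

packed = list(range(sum(lengths)))  # stand-in for concatenated input_ids
assert packed[cu_seqlens[1]:cu_seqlens[2]] == [3, 4, 5, 6, 7]
```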
```python
class NeoBERTConfig(PretrainedConfig):
    model_type = "neobert"

    # All config parameters must have a default value.
    def __init__(
        self,
        hidden_size: int = 768,
        num_hidden_layers: int = 28,
        num_attention_heads: int = 12,
        intermediate_size: int = 3072,
        embedding_init_range: float = 0.02,
        decoder_init_range: float = 0.02,
        norm_eps: float = 1e-06,
        vocab_size: int = 30522,
        pad_token_id: int = 0,
        max_length: int = 1024,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        if hidden_size % num_attention_heads != 0:
            raise ValueError("Hidden size must be divisible by the number of heads.")
        self.dim_head = hidden_size // num_attention_heads
        self.intermediate_size = intermediate_size
        self.embedding_init_range = embedding_init_range
        self.decoder_init_range = decoder_init_range
        self.norm_eps = norm_eps
        self.vocab_size = vocab_size
        self.pad_token_id = pad_token_id
        self.max_length = max_length
        self.kwargs = kwargs


class EncoderBlock(nn.Module):
    """Transformer encoder block."""

    def __init__(self, config: NeoBERTConfig):
        super().__init__()

        self.config = config

        # Attention
        self.qkv = nn.Linear(in_features=config.hidden_size, out_features=config.hidden_size * 3, bias=False)
        self.wo = nn.Linear(in_features=config.hidden_size, out_features=config.hidden_size, bias=False)

        # Feedforward network
        multiple_of = 8
        intermediate_size = int(2 * config.intermediate_size / 3)
        intermediate_size = multiple_of * ((intermediate_size + multiple_of - 1) // multiple_of)
        self.ffn = SwiGLU(config.hidden_size, intermediate_size, config.hidden_size, bias=False)

        # Layer norms
        self.attention_norm = nn.RMSNorm(config.hidden_size, config.norm_eps)
        self.ffn_norm = nn.RMSNorm(config.hidden_size, config.norm_eps)

    def forward(
        self,
        x: torch.Tensor,
        attention_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
        output_attentions: bool,
        max_seqlen: Optional[int] = None,
        cu_seqlens: Optional[torch.Tensor] = None,
    ):
        # Attention
        attn_output, attn_weights = self._att_block(
            self.attention_norm(x), attention_mask, freqs_cis, output_attentions, max_seqlen, cu_seqlens
        )

        # Residual
        x = x + attn_output

        # Feed-forward
        x = x + self.ffn(self.ffn_norm(x))

        return x, attn_weights

    def _att_block(
        self,
        x: torch.Tensor,
        attention_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
        output_attentions: bool,
        max_seqlen: Optional[int] = None,
        cu_seqlens: Optional[torch.Tensor] = None,
    ):
        batch_size, seq_len, _ = x.shape

        xq, xk, xv = (
            self.qkv(x)
            .view(batch_size, seq_len, self.config.num_attention_heads, self.config.dim_head * 3)
            .chunk(3, dim=-1)
        )

        xq, xk = apply_rotary_emb(xq, xk, freqs_cis)

        # Attn block
        attn_weights = None

        # Flash attention if the tensors are packed
        if cu_seqlens is not None:
            attn = flash_attn_varlen_func(
                q=xq.squeeze(0),
                k=xk.squeeze(0),
                v=xv.squeeze(0),
                cu_seqlens_q=cu_seqlens,
                cu_seqlens_k=cu_seqlens,
                max_seqlen_q=max_seqlen,
                max_seqlen_k=max_seqlen,
                dropout_p=0.0,
                causal=False,
            )
        # Eager attention if attention weights are needed in the output
        elif output_attentions:
            attn_weights = xq.permute(0, 2, 1, 3) @ xk.permute(0, 2, 3, 1) / (xq.size(-1) ** 0.5)
            if attention_mask is not None:
                # Mask out padding with -inf before the softmax; multiplying the
                # scores by a 0/1 mask would still leave masked positions with weight.
                attn_weights = attn_weights.masked_fill(~attention_mask.bool(), float("-inf"))
            attn_weights = attn_weights.softmax(-1)
            attn = attn_weights @ xv.permute(0, 2, 1, 3)
            attn = attn.transpose(1, 2)
        # Fall back to SDPA otherwise
        else:
            attn = scaled_dot_product_attention(
                query=xq.transpose(1, 2),
                key=xk.transpose(1, 2),
                value=xv.transpose(1, 2),
                attn_mask=attention_mask.bool() if attention_mask is not None else None,
                dropout_p=0.0,
            ).transpose(1, 2)

        return self.wo(attn.reshape(batch_size, seq_len, self.config.num_attention_heads * self.config.dim_head)), attn_weights
```
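
With the defaults in `NeoBERTConfig` (`intermediate_size=3072`), the two rounding lines in `EncoderBlock.__init__` resolve to a SwiGLU hidden width of 2048 (two thirds of 3072, rounded up to a multiple of 8). A quick check in plain Python:

```python
# FFN width computed exactly as in EncoderBlock.__init__:
# 2/3 of intermediate_size, rounded up to a multiple of 8.
intermediate_size = 3072  # NeoBERTConfig default
multiple_of = 8

w = int(2 * intermediate_size / 3)
w = multiple_of * ((w + multiple_of - 1) // multiple_of)
print(w)  # 2048
```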
```python
class NeoBERTPreTrainedModel(PreTrainedModel):
    config_class = NeoBERTConfig
    base_model_prefix = "model"
    _supports_cache_class = True

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.uniform_(-self.config.decoder_init_range, self.config.decoder_init_range)
        elif isinstance(module, nn.Embedding):
            module.weight.data.uniform_(-self.config.embedding_init_range, self.config.embedding_init_range)


class NeoBERT(NeoBERTPreTrainedModel):
    config_class = NeoBERTConfig

    def __init__(self, config: NeoBERTConfig):
        super().__init__(config)

        self.config = config

        self.encoder = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)

        # Ensures freqs_cis is moved to the same device as the model. Non-persistent buffers are not saved in the state_dict.
        freqs_cis = precompute_freqs_cis(config.hidden_size // config.num_attention_heads, config.max_length)
        self.register_buffer("freqs_cis", freqs_cis, persistent=False)

        self.transformer_encoder = nn.ModuleList()
        for _ in range(config.num_hidden_layers):
            self.transformer_encoder.append(EncoderBlock(config))

        self.layer_norm = nn.RMSNorm(config.hidden_size, config.norm_eps)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        max_seqlen: Optional[int] = None,
        cu_seqlens: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        output_hidden_states: bool = False,
        output_attentions: bool = False,
        **kwargs,
    ):
        # Initialize
        hidden_states, attentions = [], []

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")

        # Expand and repeat: (Batch, Length) -> (Batch, Heads, Length, Length)
        if attention_mask is not None:
            attention_mask = attention_mask.unsqueeze(1).unsqueeze(1).repeat(1, self.config.num_attention_heads, attention_mask.size(-1), 1)

        # Checks to be done if inputs are packed sequences
        if cu_seqlens is not None:
            assert FLASH_ATTN_AVAILABLE, "Flash-attention is not available. Please 'pip install flash_attn', or provide un-packed sequences."
            assert not output_attentions, "Output attentions is not supported when sequences are packed."
            assert max_seqlen is not None, "Missing max_seqlen. It must be provided when cu_seqlens is not None."
            assert (input_ids if input_ids is not None else inputs_embeds).shape[0] == 1, "Cumulative sequence lengths are provided but inputs are not packed."
            assert (input_ids if input_ids is not None else inputs_embeds).is_cuda, "Packing uses an implementation of flash-attention and is only supported on GPU."

        # RoPE
        freqs_cis = (
            self.freqs_cis[position_ids]
            if position_ids is not None
            else self.freqs_cis[: (input_ids if input_ids is not None else inputs_embeds).shape[1]].unsqueeze(0)
        )

        # Embedding
        x = self.encoder(input_ids) if input_ids is not None else inputs_embeds

        # Transformer encoder
        for layer in self.transformer_encoder:
            x, attn = layer(x, attention_mask, freqs_cis, output_attentions, max_seqlen, cu_seqlens)
            if output_hidden_states:
                hidden_states.append(x)
            if output_attentions:
                attentions.append(attn)

        # Final normalization layer
        x = self.layer_norm(x)

        # Return the output of the last hidden layer
        return BaseModelOutput(
            last_hidden_state=x,
            hidden_states=hidden_states if output_hidden_states else None,
            attentions=attentions if output_attentions else None,
        )


class NeoBERTLMHead(NeoBERTPreTrainedModel):
    config_class = NeoBERTConfig

    def __init__(self, config: NeoBERTConfig):
        super().__init__(config)

        self.config = config

        self.model = NeoBERT(config)
        self.decoder = nn.Linear(config.hidden_size, config.vocab_size)

        self.post_init()

    def forward(
        self,
        input_ids: torch.Tensor,
        position_ids: Optional[torch.Tensor] = None,
        max_seqlen: Optional[int] = None,
        cu_seqlens: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        output_hidden_states: bool = False,
        output_attentions: bool = False,
        **kwargs,
    ):
        output = self.model.forward(
            input_ids=input_ids,
            position_ids=position_ids,
            max_seqlen=max_seqlen,
            cu_seqlens=cu_seqlens,
            attention_mask=attention_mask,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
        )
        logits = self.decoder(output.last_hidden_state)

        return MaskedLMOutput(
            hidden_states=output.hidden_states if output_hidden_states else None,
            attentions=output.attentions if output_attentions else None,
            logits=logits,
        )


class NeoBERTForSequenceClassification(NeoBERTPreTrainedModel):
    config_class = NeoBERTConfig

    def __init__(self, config: NeoBERTConfig):
        super().__init__(config)

        self.config = config

        self.num_labels = getattr(config, "num_labels", 2)
        self.classifier_dropout = getattr(config, "classifier_dropout", 0.1)
        self.classifier_init_range = getattr(config, "classifier_init_range", 0.02)

        self.model = NeoBERT(config)

        self.dense = nn.Linear(self.config.hidden_size, self.config.hidden_size)
        self.dropout = nn.Dropout(self.classifier_dropout)
        self.classifier = nn.Linear(self.config.hidden_size, self.num_labels)

        self.post_init()

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.classifier_init_range)
            if module.bias is not None:
                module.bias.data.zero_()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        max_seqlen: Optional[int] = None,
        cu_seqlens: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        output_hidden_states: bool = False,
        output_attentions: bool = False,
        labels: Optional[torch.Tensor] = None,
        return_dict: Optional[bool] = None,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        output = self.model.forward(
            input_ids=input_ids,
            position_ids=position_ids,
            max_seqlen=max_seqlen,
            cu_seqlens=cu_seqlens,
            attention_mask=attention_mask,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
        )
        hidden_states = output.last_hidden_state

        # Pool the [CLS] token, then apply a tanh-activated dense head
        x = hidden_states[:, 0, :]
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)

        logits = self.classifier(x)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            result = (logits,)
            return ((loss,) + result) if loss is not None else result

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=output.hidden_states if output_hidden_states else None,
            attentions=output.attentions if output_attentions else None,
        )
```
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:ae1e6bdafb1fe7f334874ad2659ff30275c460a3bb9cecf892c443639a1b15da
size 889056736
rotary.py ADDED
```python
# From https://github.com/facebookresearch/llama/blob/main/llama/model.py

import torch
from typing import Tuple


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    """
    Precompute the frequency tensor for complex exponentials (cis) with given dimensions.

    This function calculates a frequency tensor with complex exponentials using the given dimension 'dim'
    and the end index 'end'. The 'theta' parameter scales the frequencies.
    The returned tensor contains complex values in complex64 data type.

    Args:
        dim (int): Dimension of the frequency tensor.
        end (int): End index for precomputing frequencies.
        theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.

    Returns:
        torch.Tensor: Precomputed frequency tensor with complex exponentials.
    """
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)
    freqs = torch.outer(t, freqs).float()
    return torch.polar(torch.ones_like(freqs), freqs)


def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    assert freqs_cis.shape[1:] == (x.shape[1], x.shape[-1])
    return freqs_cis.contiguous().unsqueeze(2)


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Apply rotary embeddings to input tensors using the given frequency tensor.

    This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
    frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
    is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
    returned as real tensors.

    Args:
        xq (torch.Tensor): Query tensor to apply rotary embeddings.
        xk (torch.Tensor): Key tensor to apply rotary embeddings.
        freqs_cis (torch.Tensor): Precomputed frequency tensor for complex exponentials.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.
    """
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
```
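
`apply_rotary_emb` views each consecutive pair of query/key features as one complex number and multiplies it by `e^(i*m*theta)` for position `m`, i.e. a plain 2-D rotation. A dependency-free sketch of that core operation on a single made-up feature pair (note the rotation preserves the vector's norm):

```python
# Rotate one (x0, x1) feature pair by position angle m*theta, mirroring the
# complex multiplication inside apply_rotary_emb.
import cmath
import math

def rotate_pair(x0, x1, m, theta):
    rotated = complex(x0, x1) * cmath.exp(1j * m * theta)
    return rotated.real, rotated.imag

x0, x1 = 1.0, 0.0
y0, y1 = rotate_pair(x0, x1, m=1, theta=math.pi / 2)
print(round(y0, 6), round(y1, 6))  # 0.0 1.0 (a 90-degree rotation)

# Rotation preserves the norm of the pair
assert math.isclose(math.hypot(y0, y1), math.hypot(x0, x1))
```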
runs/Jul08_17-29-30_ip-10-192-12-113/events.out.tfevents.1751995772.ip-10-192-12-113.3586.0 ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:08dc7cc8aa176e5df2477496fec924f21a48082c337ccafc66619ded727cdd7e
size 14825
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c2b0af3650a600c4fb9128195fe5607db67e13ae2aa7a457b00984c63922b7bf
size 5841