Commit 60b91bb (verified) by estebancarlin · 1 parent: 54f1dc3

Initial upload: BitMar Epoch 1 - 99,686,013 tokens processed

README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - bitmar
+ - multimodal
+ - babylm
+ - cross-modal
+ datasets:
+ - babylm_multimodal
+ metrics:
+ - bleu
+ - cross_modal_similarity
+ ---
+
+ # BitMar 100M Token Model
+
+ This model was trained for the BabyLM challenge under a 100-million-token budget; the first epoch processed 99,686,013 tokens.
+
+ ## Training Details
+ - Token budget: 100,000,000
+ - Epochs completed: 1
+ - Tokens processed: 99,686,013
+ - Cross-modal similarity: 0.3418
+
+ ## Model Architecture
+ - Text encoder: 4 layers, 128 hidden size
+ - Vision encoder: DINOv2 features compressed to 128
+ - Episodic memory: 32 slots
+
+ ## Usage
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # config.json maps AutoModel to the custom class in modeling_bitmar.py,
+ # so loading requires trust_remote_code=True.
+ model = AutoModel.from_pretrained("euhidaman/bitmar-attention-multimodal", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("euhidaman/bitmar-attention-multimodal")
+ ```
+
+
+ ## Training Status
+ - **Status**: In Progress (Epoch 1)
+ - **Tokens Processed**: 99,686,013
+ - **Best Cross-modal Similarity**: 0.3418
config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "architectures": ["BitMarModel"],
+   "auto_map": {
+     "AutoConfig": "modeling_bitmar.BitMarConfig",
+     "AutoModel": "modeling_bitmar.BitMarModel"
+   },
+   "model_type": "bitmar",
+   "vocab_size": 50257,
+   "text_encoder_dim": 128,
+   "text_encoder_layers": 4,
+   "text_encoder_heads": 4,
+   "text_decoder_dim": 128,
+   "text_decoder_layers": 4,
+   "text_decoder_heads": 4,
+   "vision_encoder_dim": 768,
+   "vision_latent_size": 128,
+   "vision_hidden_size": 64,
+   "vision_compression_method": "learned_compression",
+   "vision_spatial_pooling": true,
+   "vision_pool_size": 2,
+   "fusion_hidden_size": 128,
+   "fusion_num_heads": 4,
+   "fusion_num_layers": 2,
+   "memory_size": 32,
+   "episode_dim": 128,
+   "memory_alpha": 0.2,
+   "direct_writing": true,
+   "memory_compression": true,
+   "max_seq_len": 256,
+   "dropout": 0.15,
+   "torch_dtype": "float32",
+   "transformers_version": "4.36.0",
+   "use_cache": true,
+   "tie_word_embeddings": true,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-5,
+   "pad_token_id": 50256,
+   "bos_token_id": 50256,
+   "eos_token_id": 50256,
+   "sep_token_id": null,
+   "decoder_start_token_id": null
+ }
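As a rough cross-check of the configuration above, the text encoder's parameter count can be derived by hand. The sketch below assumes the layer layout of `modeling_bitmar.py` in this commit (four biased attention projections, a gated MLP with `mlp_ratio = 4.0`, two LayerNorms per block, learned positional embeddings). The function name `text_encoder_param_count` is hypothetical, and buffers such as the quantization scales are not counted.

```python
def text_encoder_param_count(vocab_size, dim, layers, max_seq_len, mlp_ratio=4.0):
    """Estimate trainable parameters of the text encoder (assumed layout)."""
    hidden = int(dim * mlp_ratio)
    # Token embeddings plus learned positional embeddings
    embeddings = vocab_size * dim + max_seq_len * dim
    # q, k, v and output projections, each with a bias
    attention = 4 * (dim * dim + dim)
    # up and gate projections (dim -> hidden), then down projection (hidden -> dim)
    mlp = 2 * (dim * hidden + hidden) + (hidden * dim + dim)
    # Two LayerNorms per block, each with a weight and a bias vector
    norms = 2 * (2 * dim)
    per_block = attention + mlp + norms
    final_norm = 2 * dim
    return embeddings + layers * per_block + final_norm


# Values taken from config.json in this commit
token_embedding_params = 50257 * 128
encoder_params = text_encoder_param_count(
    vocab_size=50257, dim=128, layers=4, max_seq_len=256
)
print(f"token embeddings: {token_embedding_params:,}")
print(f"text encoder total: {encoder_params:,}")
```

Under these assumptions the token embedding table accounts for most of the encoder's parameters, which is typical for a 128-dimensional model with a 50,257-entry vocabulary.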
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modeling_bitmar.py ADDED
@@ -0,0 +1,829 @@
+ """
+ BitMar Model for Hugging Face Transformers
+ BitNet-quantized Vision-Language Episodic Memory Transformer
+ """
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import logging
+ import math
+ import os
+ import pickle
+ import gzip
+ from typing import Dict, List, Optional, Tuple, Union
+ from transformers import PreTrainedModel, PretrainedConfig
+ from transformers.modeling_outputs import CausalLMOutput, BaseModelOutput
+
+ logger = logging.getLogger(__name__)
+
+
+ class BitMarConfig(PretrainedConfig):
+     """Configuration class for BitMar model"""
+
+     model_type = "bitmar"
+
+     def __init__(
+         self,
+         vocab_size: int = 50257,
+         text_encoder_dim: int = 128,
+         text_encoder_layers: int = 4,
+         text_encoder_heads: int = 4,
+         text_decoder_dim: int = 128,
+         text_decoder_layers: int = 4,
+         text_decoder_heads: int = 4,
+         vision_encoder_dim: int = 768,
+         vision_latent_size: int = 128,
+         vision_hidden_size: int = 64,
+         vision_compression_method: str = "learned_compression",
+         vision_spatial_pooling: bool = True,
+         vision_pool_size: int = 2,
+         fusion_hidden_size: int = 128,
+         fusion_num_heads: int = 4,
+         fusion_num_layers: int = 2,
+         memory_size: int = 32,
+         episode_dim: int = 128,
+         memory_alpha: float = 0.2,
+         direct_writing: bool = True,
+         memory_compression: bool = True,
+         max_seq_len: int = 256,
+         dropout: float = 0.15,
+         initializer_range: float = 0.02,
+         layer_norm_epsilon: float = 1e-5,
+         use_cache: bool = True,
+         tie_word_embeddings: bool = True,
+         pad_token_id: int = 50256,
+         bos_token_id: int = 50256,
+         eos_token_id: int = 50256,
+         **kwargs
+     ):
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             **kwargs
+         )
+
+         self.vocab_size = vocab_size
+         self.text_encoder_dim = text_encoder_dim
+         self.text_encoder_layers = text_encoder_layers
+         self.text_encoder_heads = text_encoder_heads
+         self.text_decoder_dim = text_decoder_dim
+         self.text_decoder_layers = text_decoder_layers
+         self.text_decoder_heads = text_decoder_heads
+         self.vision_encoder_dim = vision_encoder_dim
+         self.vision_latent_size = vision_latent_size
+         self.vision_hidden_size = vision_hidden_size
+         self.vision_compression_method = vision_compression_method
+         self.vision_spatial_pooling = vision_spatial_pooling
+         self.vision_pool_size = vision_pool_size
+         self.fusion_hidden_size = fusion_hidden_size
+         self.fusion_num_heads = fusion_num_heads
+         self.fusion_num_layers = fusion_num_layers
+         self.memory_size = memory_size
+         self.episode_dim = episode_dim
+         self.memory_alpha = memory_alpha
+         self.direct_writing = direct_writing
+         self.memory_compression = memory_compression
+         self.max_seq_len = max_seq_len
+         self.dropout = dropout
+         self.initializer_range = initializer_range
+         self.layer_norm_epsilon = layer_norm_epsilon
+         self.use_cache = use_cache
+         self.tie_word_embeddings = tie_word_embeddings
+
+
+ class BitNetLinear(nn.Module):
+     """1.58-bit Linear layer following BitNet b1.58 architecture"""
+
+     def __init__(self, in_features: int, out_features: int, bias: bool = True):
+         super().__init__()
+         self.in_features = in_features
+         self.out_features = out_features
+
+         self.weight = nn.Parameter(torch.randn(out_features, in_features))
+         self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
+
+         self.register_buffer('weight_scale', torch.ones(1))
+         self.register_buffer('input_scale', torch.ones(1))
+
+     def quantize_weights_1_58_bit(self, weight: torch.Tensor) -> torch.Tensor:
+         # Absmean scaling, then ternary rounding to {-1, 0, +1}
+         scale = weight.abs().mean()
+         self.weight_scale.data = scale.clamp(min=1e-5, max=1e3)
+
+         weight_norm = torch.clamp(weight / self.weight_scale, min=-10.0, max=10.0)
+         threshold = 2.0 / 3.0
+
+         quantized = torch.zeros_like(weight_norm)
+         quantized[weight_norm > threshold] = 1.0
+         quantized[weight_norm < -threshold] = -1.0
+
+         return quantized
+
+     def quantize_activations_8bit(self, x: torch.Tensor) -> torch.Tensor:
+         # Asymmetric per-tensor 8-bit quantization followed by dequantization
+         x_clamped = torch.clamp(x, min=-1e6, max=1e6)
+         x_min, x_max = x_clamped.min(), x_clamped.max()
+
+         range_val = x_max - x_min
+         if range_val < 1e-8:
+             return x_clamped
+
+         scale = range_val / 255.0
+         self.input_scale.data = scale.clamp(min=1e-8, max=1e3)
+
+         zero_point = (-x_min / scale).round().clamp(0, 255)
+         quantized = ((x_clamped / scale) + zero_point).round().clamp(0, 255)
+         dequantized = scale * (quantized - zero_point)
+
+         return dequantized
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         if self.training:
+             # Straight-through estimator: the forward pass sees the quantized
+             # values, while gradients flow to the latent weights and inputs as
+             # if quantization were the identity. Without this, the ternary
+             # weights built by index assignment carry no gradient to self.weight.
+             quantized_weight = self.quantize_weights_1_58_bit(self.weight)
+             weight = self.weight + (quantized_weight - self.weight).detach()
+             x = x + (self.quantize_activations_8bit(x) - x).detach()
+         else:
+             # Evaluation uses the full-precision latent weights
+             weight = self.weight
+
+         output = F.linear(x, weight, self.bias)
+         return output
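The ternary weight rule used by `quantize_weights_1_58_bit` can be followed on a handful of numbers. The sketch below is a plain-Python restatement under the same assumptions as the method above (absmean scale with a lower clamp of `1e-5`, threshold `2/3`); the helper name `quantize_ternary` is hypothetical, and the upper clamps from the original are omitted because they do not affect the sign decision for ordinary inputs.

```python
def quantize_ternary(weights, threshold=2.0 / 3.0, min_scale=1e-5):
    """Map each weight to -1, 0 or +1 after dividing by the mean absolute value."""
    scale = max(sum(abs(w) for w in weights) / len(weights), min_scale)
    quantized = []
    for w in weights:
        normalized = w / scale
        if normalized > threshold:
            quantized.append(1)
        elif normalized < -threshold:
            quantized.append(-1)
        else:
            quantized.append(0)
    return quantized, scale


example_weights = [0.9, -0.1, 0.05, -1.2, 0.4]
ternary, absmean = quantize_ternary(example_weights)
print(ternary, round(absmean, 4))
```

Weights whose magnitude is below roughly two thirds of the mean absolute value collapse to zero; everything else keeps only its sign.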
+
+
+ class BitNetMLP(nn.Module):
+     """BitNet MLP block with 1.58-bit quantization"""
+
+     def __init__(self, dim: int, hidden_dim: int, dropout: float = 0.1):
+         super().__init__()
+         self.up_proj = BitNetLinear(dim, hidden_dim)
+         self.gate_proj = BitNetLinear(dim, hidden_dim)
+         self.down_proj = BitNetLinear(hidden_dim, dim)
+         self.dropout = nn.Dropout(dropout)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         gate = torch.sigmoid(self.gate_proj(x))
+         up = F.silu(self.up_proj(x))
+         return self.dropout(self.down_proj(gate * up))
+
+
+ class BitNetAttention(nn.Module):
+     """Multi-head attention with BitNet quantization"""
+
+     def __init__(
+         self,
+         dim: int,
+         num_heads: int,
+         dropout: float = 0.1,
+         bias: bool = True
+     ):
+         super().__init__()
+         self.dim = dim
+         self.num_heads = num_heads
+         self.head_dim = dim // num_heads
+
+         assert self.head_dim * num_heads == dim
+
+         self.q_proj = BitNetLinear(dim, dim, bias=bias)
+         self.k_proj = BitNetLinear(dim, dim, bias=bias)
+         self.v_proj = BitNetLinear(dim, dim, bias=bias)
+         self.out_proj = BitNetLinear(dim, dim, bias=bias)
+
+         self.dropout = nn.Dropout(dropout)
+         self.scale = self.head_dim ** -0.5
+
+     def forward(
+         self,
+         query: torch.Tensor,
+         key: torch.Tensor,
+         value: torch.Tensor,
+         mask: Optional[torch.Tensor] = None
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         B, L, D = query.shape
+
+         q = self.q_proj(query).view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
+         k = self.k_proj(key).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
+         v = self.v_proj(value).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
+
+         attn_weights = torch.matmul(q, k.transpose(-2, -1)) * self.scale
+
+         if mask is not None:
+             # A [B, L_k] padding mask is expanded to [B, 1, 1, L_k]; a mask that
+             # is already 4-D (such as the decoder's causal mask) is used as is.
+             # Unsqueezing a 4-D mask again would no longer broadcast against
+             # the [B, heads, L_q, L_k] attention scores.
+             if mask.dim() == 2:
+                 mask = mask[:, None, None, :]
+             attn_weights = attn_weights.masked_fill(mask == 0, float('-inf'))
+
+         attn_weights = F.softmax(attn_weights, dim=-1)
+         attn_weights = self.dropout(attn_weights)
+
+         attn_output = torch.matmul(attn_weights, v)
+         attn_output = attn_output.transpose(1, 2).contiguous().view(B, L, D)
+         attn_output = self.out_proj(attn_output)
+
+         return attn_output, attn_weights
+
+
+ class BitNetTransformerBlock(nn.Module):
+     """BitNet Transformer block with quantized components"""
+
+     def __init__(
+         self,
+         dim: int,
+         num_heads: int,
+         mlp_ratio: float = 4.0,
+         dropout: float = 0.1
+     ):
+         super().__init__()
+         self.norm1 = nn.LayerNorm(dim)
+         self.attention = BitNetAttention(dim, num_heads, dropout)
+         self.norm2 = nn.LayerNorm(dim)
+         self.mlp = BitNetMLP(dim, int(dim * mlp_ratio), dropout)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         mask: Optional[torch.Tensor] = None
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         # Self-attention with residual
+         norm_x = self.norm1(x)
+         attn_out, attn_weights = self.attention(norm_x, norm_x, norm_x, mask)
+         x = x + attn_out
+
+         # MLP with residual
+         x = x + self.mlp(self.norm2(x))
+
+         return x, attn_weights
+
+
+ class BitNetTextEncoder(nn.Module):
+     """BitNet-based text encoder"""
+
+     def __init__(
+         self,
+         vocab_size: int,
+         dim: int,
+         num_layers: int,
+         num_heads: int,
+         max_seq_len: int = 512,
+         dropout: float = 0.1
+     ):
+         super().__init__()
+         self.dim = dim
+         self.embedding = nn.Embedding(vocab_size, dim)
+         self.pos_embedding = nn.Embedding(max_seq_len, dim)
+         self.dropout = nn.Dropout(dropout)
+
+         self.layers = nn.ModuleList([
+             BitNetTransformerBlock(dim, num_heads, dropout=dropout)
+             for _ in range(num_layers)
+         ])
+
+         self.norm = nn.LayerNorm(dim)
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None
+     ) -> Tuple[torch.Tensor, List[torch.Tensor]]:
+         B, L = input_ids.shape
+
+         # Token embeddings + positional embeddings
+         positions = torch.arange(L, device=input_ids.device).unsqueeze(0)
+         x = self.embedding(input_ids) + self.pos_embedding(positions)
+         x = self.dropout(x)
+
+         # Apply transformer layers
+         attention_weights = []
+         for layer in self.layers:
+             x, attn = layer(x, attention_mask)
+             attention_weights.append(attn)
+
+         x = self.norm(x)
+         return x, attention_weights
+
+
+ class BitNetTextDecoder(nn.Module):
+     """BitNet-based text decoder with causal masking"""
+
+     def __init__(
+         self,
+         vocab_size: int,
+         dim: int,
+         num_layers: int,
+         num_heads: int,
+         max_seq_len: int = 512,
+         dropout: float = 0.1
+     ):
+         super().__init__()
+         self.dim = dim
+         self.max_seq_len = max_seq_len
+         self.embedding = nn.Embedding(vocab_size, dim)
+         self.pos_embedding = nn.Embedding(max_seq_len, dim)
+         self.dropout = nn.Dropout(dropout)
+
+         self.layers = nn.ModuleList([
+             BitNetTransformerBlock(dim, num_heads, dropout=dropout)
+             for _ in range(num_layers)
+         ])
+
+         self.norm = nn.LayerNorm(dim)
+         self.lm_head = BitNetLinear(dim, vocab_size, bias=False)
+
+         # Create causal mask
+         self.register_buffer(
+             "causal_mask",
+             torch.tril(torch.ones(max_seq_len, max_seq_len)).unsqueeze(0).unsqueeze(0)
+         )
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None
+     ) -> Dict[str, torch.Tensor]:
+
+         if input_ids is not None:
+             B, L = input_ids.shape
+             positions = torch.arange(L, device=input_ids.device).unsqueeze(0)
+             x = self.embedding(input_ids) + self.pos_embedding(positions)
+         else:
+             x = inputs_embeds
+             B, L, _ = x.shape
+
+         x = self.dropout(x)
+
+         # Create causal mask
+         causal_mask = self.causal_mask[:, :, :L, :L]
+         if attention_mask is not None:
+             causal_mask = causal_mask * attention_mask.unsqueeze(1).unsqueeze(2)
+
+         # Apply transformer layers
+         attention_weights = []
+         for layer in self.layers:
+             x, attn = layer(x, causal_mask)
+             attention_weights.append(attn)
+
+         x = self.norm(x)
+         logits = self.lm_head(x)
+
+         outputs = {"logits": logits, "hidden_states": x, "attentions": attention_weights}
+
+         if labels is not None:
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
+             outputs["loss"] = loss
+
+         return outputs
+
+
+ class EpisodicMemory(nn.Module):
+     """Episodic Memory mechanism inspired by Larimar"""
+
+     def __init__(
+         self,
+         memory_size: int,
+         episode_dim: int,
+         alpha: float = 0.1,
+         direct_writing: bool = True,
+         observation_noise_std: float = 1e-6,
+         external_storage: bool = False,
+         memory_storage_path: str = None,
+         compression_enabled: bool = True,
+         lazy_loading: bool = False
+     ):
+         super().__init__()
+         self.memory_size = memory_size
+         self.episode_dim = episode_dim
+         self.alpha = alpha
+         self.direct_writing = direct_writing
+         self.observation_noise_std = observation_noise_std
+         self.external_storage = external_storage
+         self.memory_storage_path = memory_storage_path
+         self.compression_enabled = compression_enabled
+         self.lazy_loading = lazy_loading
+
+         # Initialize memory
+         self.register_buffer('memory', torch.randn(memory_size, episode_dim))
+         self.register_buffer('write_head', torch.zeros(1, dtype=torch.long))
+         self.register_buffer('memory_age', torch.zeros(memory_size))
+
+         # Statistics
+         self.register_buffer('episode_mean', torch.zeros(episode_dim))
+         self.register_buffer('episode_std', torch.ones(episode_dim))
+         self.register_buffer('update_count', torch.zeros(1))
+
+     def write_memory(self, episode: torch.Tensor) -> torch.Tensor:
+         batch_size = episode.size(0)
+
+         if self.direct_writing:
+             # Direct writing to memory
+             for i in range(batch_size):
+                 write_pos = self.write_head.item()
+                 self.memory[write_pos] = episode[i].detach()
+                 self.memory_age[write_pos] = 0
+                 self.write_head = (self.write_head + 1) % self.memory_size
+
+         # Add observation noise
+         if self.observation_noise_std > 0:
+             noise = torch.randn_like(episode) * self.observation_noise_std
+             episode = episode + noise
+
+         return episode
+
+     def read_memory(self, query: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+         batch_size, query_dim = query.shape
+
+         # Compute similarities
+         similarities = F.cosine_similarity(
+             query.unsqueeze(1),
+             self.memory.unsqueeze(0),
+             dim=-1
+         )
+
+         # Apply softmax to get attention weights
+         attention_weights = F.softmax(similarities / 0.1, dim=-1)
+
+         # Weighted sum of memory
+         retrieved = torch.sum(
+             attention_weights.unsqueeze(-1) * self.memory.unsqueeze(0),
+             dim=1
+         )
+
+         return retrieved, attention_weights
+
+     def forward(self, episode: torch.Tensor, mode: str = "read_write") -> Tuple[torch.Tensor, torch.Tensor]:
+         if mode == "write":
+             return self.write_memory(episode), torch.zeros(episode.size(0), self.memory_size, device=episode.device)
+         elif mode == "read":
+             return self.read_memory(episode)
+         else:  # read_write
+             # Write to memory
+             written_episode = self.write_memory(episode)
+             # Read from memory
+             retrieved, attention_weights = self.read_memory(episode)
+             return retrieved, attention_weights
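The write and read paths of `EpisodicMemory` amount to a ring buffer plus a softmax over cosine similarities with temperature 0.1. The following plain-Python sketch mirrors that logic on small lists so the behavior can be traced by hand. The class name `RingMemory` and the helper `cosine` are hypothetical, the slots start at zero rather than random values, and observation noise is left out.

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small floor on the norms to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (max(norm_a, eps) * max(norm_b, eps))


class RingMemory:
    def __init__(self, slots, dim):
        self.slots = [[0.0] * dim for _ in range(slots)]
        self.write_head = 0

    def write(self, episode):
        # Overwrite the slot under the write head, then advance with wrap-around
        self.slots[self.write_head] = list(episode)
        self.write_head = (self.write_head + 1) % len(self.slots)

    def read(self, query, temperature=0.1):
        # Softmax over cosine similarities, computed in a numerically stable way
        sims = [cosine(query, slot) for slot in self.slots]
        peak = max(sims)
        exps = [math.exp((s - peak) / temperature) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of all slots
        retrieved = [
            sum(w * slot[d] for w, slot in zip(weights, self.slots))
            for d in range(len(query))
        ]
        return retrieved, weights


memory = RingMemory(slots=2, dim=2)
for episode in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    memory.write(episode)  # the third write wraps around and replaces slot 0
retrieved, weights = memory.read([0.0, 1.0])
print(memory.slots, weights)
```

Because the temperature is low, the read is close to a nearest-neighbour lookup: the slot most aligned with the query dominates the weighted sum.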
+
+
+ class CrossModalFusion(nn.Module):
+     """Cross-modal fusion module for text and vision features"""
+
+     def __init__(
+         self,
+         text_dim: int,
+         vision_dim: int,
+         hidden_dim: int,
+         num_heads: int = 8,
+         num_layers: int = 2
+     ):
+         super().__init__()
+         self.text_dim = text_dim
+         self.vision_dim = vision_dim
+         self.hidden_dim = hidden_dim
+
+         # Project to same dimension
+         self.text_proj = BitNetLinear(text_dim, hidden_dim)
+         self.vision_proj = BitNetLinear(vision_dim, hidden_dim)
+
+         # Cross-attention layers
+         self.cross_attention = nn.ModuleList([
+             BitNetAttention(hidden_dim, num_heads)
+             for _ in range(num_layers)
+         ])
+
+         self.norm = nn.LayerNorm(hidden_dim)
+
+     def forward(
+         self,
+         text_features: torch.Tensor,
+         vision_features: torch.Tensor
+     ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]:
+
+         # Project to same dimension
+         text_proj = self.text_proj(text_features)
+         vision_proj = self.vision_proj(vision_features)
+
+         # Cross-modal attention
+         fused_features = text_proj
+         attention_maps = {}
+
+         for i, cross_attn in enumerate(self.cross_attention):
+             fused_features, attn_weights = cross_attn(
+                 fused_features, vision_proj, vision_proj
+             )
+             attention_maps[f'cross_attn_{i}'] = attn_weights
+
+         fused_features = self.norm(fused_features)
+
+         return fused_features, attention_maps
+
+
+ class VisionEncoder(nn.Module):
+     """Quantized Vision Encoder for DINOv2 features"""
+
+     def __init__(
+         self,
+         input_dim: int = 768,
+         hidden_dim: int = 512,
+         output_dim: int = 768,
+         num_layers: int = 2
+     ):
+         super().__init__()
+
+         layers = []
+         layers.append(BitNetLinear(input_dim, hidden_dim))
+         layers.append(nn.ReLU())
+
+         for _ in range(num_layers - 1):
+             layers.append(BitNetLinear(hidden_dim, hidden_dim))
+             layers.append(nn.ReLU())
+
+         layers.append(BitNetLinear(hidden_dim, output_dim))
+
+         self.encoder = nn.Sequential(*layers)
+
+     def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
+         return self.encoder(vision_features)
+
+
+ class BitMarModel(PreTrainedModel):
+     """
+     BitMar: BitNet-quantized Vision-Language Episodic Memory Transformer
+     Compatible with Hugging Face Transformers
+     """
+
+     config_class = BitMarConfig
+     base_model_prefix = "bitmar"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["BitNetTransformerBlock", "EpisodicMemory"]
+
+     def __init__(self, config: BitMarConfig):
+         super().__init__(config)
+         self.config = config
+
+         # Text encoder
+         self.text_encoder = BitNetTextEncoder(
+             vocab_size=config.vocab_size,
+             dim=config.text_encoder_dim,
+             num_layers=config.text_encoder_layers,
+             num_heads=config.text_encoder_heads,
+             max_seq_len=config.max_seq_len,
+             dropout=config.dropout
+         )
+
+         # Text decoder
+         self.text_decoder = BitNetTextDecoder(
+             vocab_size=config.vocab_size,
+             dim=config.text_decoder_dim,
+             num_layers=config.text_decoder_layers,
+             num_heads=config.text_decoder_heads,
+             max_seq_len=config.max_seq_len,
+             dropout=config.dropout
+         )
+
+         # Vision encoder
+         self.vision_encoder = VisionEncoder(
+             input_dim=config.vision_encoder_dim,
+             hidden_dim=config.vision_hidden_size,
+             output_dim=config.vision_latent_size
+         )
+
+         # Cross-modal fusion
+         self.cross_modal_fusion = CrossModalFusion(
+             text_dim=config.text_encoder_dim,
+             vision_dim=config.vision_latent_size,
+             hidden_dim=config.fusion_hidden_size,
+             num_heads=config.fusion_num_heads,
+             num_layers=config.fusion_num_layers
+         )
+
+         # Episodic memory
+         self.episodic_memory = EpisodicMemory(
+             memory_size=config.memory_size,
+             episode_dim=config.episode_dim,
+             alpha=config.memory_alpha,
+             direct_writing=config.direct_writing,
+             compression_enabled=config.memory_compression
+         )
+
+         # Projection from concatenated [text, vision] features to the episode
+         # dimension. It is created here rather than lazily inside
+         # create_episode so that it is part of the state dict and visible to
+         # the optimizer.
+         self.episode_proj = nn.Linear(
+             config.text_encoder_dim + config.vision_latent_size,
+             config.episode_dim
+         )
+
+         # Initialize weights
+         self.post_init()
+
+     def _init_weights(self, module):
+         """Initialize the weights"""
+         if isinstance(module, (nn.Linear, BitNetLinear)):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+         elif isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+
+     def encode_text(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> Tuple[torch.Tensor, List[torch.Tensor]]:
+         return self.text_encoder(input_ids, attention_mask)
+
+     def encode_vision(self, vision_features: torch.Tensor) -> torch.Tensor:
+         return self.vision_encoder(vision_features)
+
+     def create_episode(
+         self,
+         text_features: torch.Tensor,
+         vision_latent: torch.Tensor,
+         attention_weights: Dict[str, torch.Tensor]
+     ) -> torch.Tensor:
+         # Average-pool each modality over its sequence dimension
+         text_pooled = text_features.mean(dim=1)  # [B, D]
+         vision_pooled = vision_latent.mean(dim=1)  # [B, D]
+
+         # Concatenate the pooled features
+         episode = torch.cat([text_pooled, vision_pooled], dim=-1)
+
+         # Project to episode dimension if needed
+         if episode.size(-1) != self.config.episode_dim:
+             episode = self.episode_proj(episode)
+
+         return episode
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.FloatTensor] = None,
+         vision_features: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         mode: str = "train",
+         step: int = 0,
+         has_vision: Optional[torch.Tensor] = None,
+         **kwargs
+     ) -> Union[Tuple, CausalLMOutput]:
+
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         # The text encoder and decoder both need token ids
+         if input_ids is None:
+             raise ValueError("input_ids must be provided")
+
+         # Handle missing vision features
+         if vision_features is None:
+             batch_size = input_ids.size(0)
+             vision_features = torch.zeros(batch_size, 196, self.config.vision_encoder_dim, device=input_ids.device)
+
+         # Encode text
+         text_features, text_attentions = self.encode_text(input_ids, attention_mask)
+
+         # Encode vision
+         vision_latent = self.encode_vision(vision_features)
+
+         # Cross-modal fusion
+         fused_features, fusion_attentions = self.cross_modal_fusion(text_features, vision_latent)
+
+         # Create episode for memory
+         episode = self.create_episode(text_features, vision_latent, fusion_attentions)
+
+         # Episodic memory interaction
+         retrieved_memory, memory_weights = self.episodic_memory(episode, mode="read_write")
+
+         # Text generation with decoder
+         decoder_outputs = self.text_decoder(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             labels=labels
+         )
+
+         # Prepare outputs
+         loss = decoder_outputs.get("loss", None)
+         logits = decoder_outputs["logits"]
+         hidden_states = decoder_outputs["hidden_states"] if output_hidden_states else None
+         attentions = decoder_outputs["attentions"] if output_attentions else None
+
+         if return_dict:
+             return CausalLMOutput(
+                 loss=loss,
+                 logits=logits,
+                 hidden_states=hidden_states,
+                 attentions=attentions,
+             )
+         else:
+             outputs = (logits,)
+             if loss is not None:
+                 outputs = (loss,) + outputs
+             if hidden_states is not None:
+                 outputs = outputs + (hidden_states,)
+             if attentions is not None:
+                 outputs = outputs + (attentions,)
+             return outputs
+
+     def generate(
+         self,
+         input_ids: torch.LongTensor,
+         attention_mask: Optional[torch.FloatTensor] = None,
+         vision_features: Optional[torch.FloatTensor] = None,
+         max_length: int = 100,
+         temperature: float = 0.7,
+         top_p: float = 0.9,
+         do_sample: bool = True,
+         **kwargs
+     ) -> torch.LongTensor:
+         """Simple generation method"""
+
+         batch_size = input_ids.size(0)
+         device = input_ids.device
+
+         # Handle missing vision features
+         if vision_features is None:
+             vision_features = torch.zeros(batch_size, 196, self.config.vision_encoder_dim, device=device)
+
+         generated = input_ids.clone()
+
+         for _ in range(max_length - input_ids.size(1)):
+             # Get model outputs
+             with torch.no_grad():
+                 outputs = self.forward(
+                     input_ids=generated,
+                     attention_mask=attention_mask,
+                     vision_features=vision_features,
+                     return_dict=True
+                 )
+
+             # Get next token logits
+             next_token_logits = outputs.logits[:, -1, :] / temperature
+
+             if do_sample:
+                 # Apply top-p sampling
+                 sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+                 cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+                 # Remove tokens with cumulative probability above the threshold
+                 sorted_indices_to_remove = cumulative_probs > top_p
+                 sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+                 sorted_indices_to_remove[..., 0] = 0
+
+                 indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
+                 next_token_logits[indices_to_remove] = float('-inf')
+
+                 # Sample from the filtered distribution
+                 probs = F.softmax(next_token_logits, dim=-1)
+                 next_token = torch.multinomial(probs, num_samples=1)
+             else:
+                 # Greedy decoding
+                 next_token = next_token_logits.argmax(dim=-1, keepdim=True)
+
+             # Append to generated sequence
+             generated = torch.cat([generated, next_token], dim=-1)
+
+             # Update attention mask (new_ones keeps the mask's dtype and device)
+             if attention_mask is not None:
+                 attention_mask = torch.cat([
+                     attention_mask,
+                     attention_mask.new_ones((batch_size, 1))
+                 ], dim=-1)
+
+             # Stop if EOS token is generated
+             if (next_token == self.config.eos_token_id).all():
+                 break
+
+         return generated
+
+     def prepare_inputs_for_generation(
+         self,
+         input_ids,
+         past_key_values=None,
+         attention_mask=None,
+         vision_features=None,
+         **kwargs
+     ):
+         """Prepare inputs for generation"""
+         return {
+             "input_ids": input_ids,
+             "attention_mask": attention_mask,
+             "vision_features": vision_features,
+             "use_cache": kwargs.get("use_cache", True),
+         }
+
+
+ # Register the model with transformers
+ from transformers import AutoConfig, AutoModel, AutoModelForCausalLM
+
+ AutoConfig.register("bitmar", BitMarConfig)
+ AutoModel.register(BitMarConfig, BitMarModel)
+ AutoModelForCausalLM.register(BitMarConfig, BitMarModel)
+
+
+ def count_parameters(model: nn.Module) -> Dict[str, int]:
+     """Count model parameters"""
+     total_params = sum(p.numel() for p in model.parameters())
+     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+
+     return {
+         "total_parameters": total_params,
+         "trainable_parameters": trainable_params,
+         "non_trainable_parameters": total_params - trainable_params
+     }
+
+
+ def create_bitmar_model(config: Dict) -> BitMarModel:
+     """Create BitMar model from config dictionary"""
+     bitmar_config = BitMarConfig(**config)
+     model = BitMarModel(bitmar_config)
+     return model
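The nucleus (top-p) filter inside `generate` keeps a token as long as the probability mass of the tokens ranked before it has not yet exceeded `top_p`, which guarantees that the most likely token always survives. A minimal plain-Python sketch of that selection rule follows; the function name `top_p_keep` is hypothetical and it operates on an already normalized probability list rather than on logits.

```python
def top_p_keep(probs, top_p):
    """Return the indices kept by top-p filtering, most probable first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    mass_before = 0.0  # probability mass of the tokens ranked above the current one
    for index in order:
        # Mirrors the shifted mask in generate(): a token is dropped only when
        # the tokens ahead of it already exceed the threshold.
        if mass_before > top_p:
            break
        kept.append(index)
        mass_before += probs[index]
    return kept


distribution = [0.5, 0.3, 0.15, 0.05]
print(top_p_keep(distribution, top_p=0.7))
print(top_p_keep(distribution, top_p=0.9))
```

With `top_p=0.7` the first two tokens are kept, since the second token is the one that pushes the cumulative mass past the threshold. The kept tokens would then be renormalized before sampling.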
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a90cd9981271cc1f56d76c5ddecec018cc2f28c749cce233eb1cbaf9b35552e0
3
+ size 86128991
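The LFS pointer above records a checkpoint of 86,128,991 bytes, while `performance_targets.max_model_size_mb` in `training_metadata.json` is 50. The arithmetic below compares the two. It assumes the target refers to the on-disk size of this file, which the source does not state; the constant names are illustrative.

```python
CHECKPOINT_BYTES = 86_128_991  # "size" field of the pytorch_model.bin LFS pointer
TARGET_MB = 50                 # performance_targets.max_model_size_mb

size_mb = CHECKPOINT_BYTES / 1_000_000  # decimal megabytes
size_mib = CHECKPOINT_BYTES / 2**20     # binary mebibytes

# The target's unit is not specified, so check both interpretations.
within_target = size_mb <= TARGET_MB or size_mib <= TARGET_MB

print(f"checkpoint: {size_mb:.2f} MB ({size_mib:.2f} MiB), target: {TARGET_MB}")
print("within target:", within_target)
```

Under either unit the file is larger than 50, so the target presumably measures something other than this checkpoint (for example a quantized export) or has not been met yet; the diff does not settle which.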
tokenizer.json ADDED
@@ -0,0 +1,48 @@
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 50256,
8
+ "content": "<|endoftext|>",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": true,
13
+ "special": true
14
+ }
15
+ ],
16
+ "normalizer": {
17
+ "type": "NFC"
18
+ },
19
+ "pre_tokenizer": {
20
+ "type": "ByteLevel",
21
+ "add_prefix_space": false,
22
+ "trim_offsets": true,
23
+ "use_regex": true
24
+ },
25
+ "post_processor": {
26
+ "type": "ByteLevel",
27
+ "add_prefix_space": false,
28
+ "trim_offsets": true,
29
+ "use_regex": true
30
+ },
31
+ "decoder": {
32
+ "type": "ByteLevel",
33
+ "add_prefix_space": false,
34
+ "trim_offsets": true,
35
+ "use_regex": true
36
+ },
37
+ "model": {
38
+ "type": "BPE",
39
+ "dropout": null,
40
+ "unk_token": null,
41
+ "continuing_subword_prefix": null,
42
+ "end_of_word_suffix": null,
43
+ "fuse_unk": false,
44
+ "byte_fallback": false,
45
+ "vocab": {},
46
+ "merges": []
47
+ }
48
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "tokenizer_class": "GPT2Tokenizer",
3
+ "auto_map": {
4
+ "AutoTokenizer": ["transformers", "GPT2Tokenizer"]
5
+ },
6
+ "bos_token": "<|endoftext|>",
7
+ "eos_token": "<|endoftext|>",
8
+ "pad_token": "<|endoftext|>",
9
+ "unk_token": "<|endoftext|>",
10
+ "add_prefix_space": false,
11
+ "model_max_length": 1024,
12
+ "special_tokens_map_file": null,
13
+ "name_or_path": "gpt2",
14
+ "tokenizer_type": "GPT2Tokenizer"
15
+ }
training_metadata.json ADDED
@@ -0,0 +1,261 @@
1
+ {
2
+ "epoch": 0,
3
+ "global_step": 99498,
4
+ "tokens_processed": 99686013,
5
+ "target_tokens": 100000000,
6
+ "best_similarity": 0.34183505177497864,
7
+ "training_config": {
8
+ "model": {
9
+ "vocab_size": 50257,
10
+ "text_encoder_dim": 128,
11
+ "text_encoder_layers": 4,
12
+ "text_encoder_heads": 4,
13
+ "text_decoder_dim": 128,
14
+ "text_decoder_layers": 4,
15
+ "text_decoder_heads": 4,
16
+ "vision_encoder_dim": 768,
17
+ "vision_latent_size": 128,
18
+ "vision_hidden_size": 64,
19
+ "vision_compression_method": "learned_compression",
20
+ "vision_spatial_pooling": true,
21
+ "vision_pool_size": 2,
22
+ "fusion_hidden_size": 128,
23
+ "fusion_num_heads": 4,
24
+ "fusion_num_layers": 2,
25
+ "memory_size": 32,
26
+ "episode_dim": 128,
27
+ "memory_alpha": 0.2,
28
+ "direct_writing": true,
29
+ "memory_compression": true,
30
+ "enable_adaptive_training": true,
31
+ "max_seq_len": 256,
32
+ "dropout": 0.15
33
+ },
34
+ "token_constraints": {
35
+ "total_tokens": 100000000,
36
+ "caption_tokens": 50000000,
37
+ "text_tokens": 50000000,
38
+ "enforce_exact_count": true,
39
+ "uniform_sampling": true,
40
+ "alignment_priority": "perfect_alignment",
41
+ "preserve_image_caption_pairs": true,
42
+ "strict_alignment_validation": true
43
+ },
44
+ "vision_feature_reduction": {
45
+ "enabled": true,
46
+ "method": "learned_compression",
47
+ "target_dim": 64,
48
+ "spatial_pooling": true,
49
+ "pool_method": "attention",
50
+ "hidden_dim": 128,
51
+ "learnable": true,
52
+ "preserve_spatial_info": true
53
+ },
54
+ "data": {
55
+ "dataset_dir": "../babylm_dataset",
56
+ "text_encoder_name": "gpt2",
57
+ "max_seq_length": 256,
58
+ "count_tokens": true,
59
+ "target_caption_tokens": 50000000,
60
+ "target_text_tokens": 50000000,
61
+ "token_counting_method": "gpt2",
62
+ "batch_size": 64,
63
+ "num_workers": 6,
64
+ "pin_memory": true,
65
+ "persistent_workers": true,
66
+ "mix_ratio": 0.5,
67
+ "shuffle_datasets": true,
68
+ "ensure_alignment": true,
69
+ "validate_alignment": true,
70
+ "alignment_verification": "strict",
71
+ "never_break_pairs": true,
72
+ "alignment_check_frequency": 1000,
73
+ "use_validation": false,
74
+ "train_only": true
75
+ },
76
+ "attention_analysis": {
77
+ "track_top_k": 5,
78
+ "log_every_n_steps": 200,
79
+ "viz_every_n_epochs": 3,
80
+ "save_head_patterns": true,
81
+ "analyze_memory_attention": true,
82
+ "analyze_cross_modal": true,
83
+ "track_token_alignment": true
84
+ },
85
+ "adaptive_training": {
86
+ "enabled": true,
87
+ "similarity_window_size": 200,
88
+ "drop_threshold": 0.12,
89
+ "min_steps_between_interventions": 800,
90
+ "freeze_duration_steps": 1500,
91
+ "loss_rebalance_factor": 2.0,
92
+ "similarity_smoothing_alpha": 0.15
93
+ },
94
+ "training": {
95
+ "max_epochs": 10,
96
+ "accumulate_grad_batches": 2,
97
+ "gradient_clip_val": 0.3,
98
+ "val_check_interval": 1000,
99
+ "scheduler": "cosine_with_restarts",
100
+ "min_lr": 5e-05,
101
+ "warmup_steps": 1000,
102
+ "learning_rate": 0.0002,
103
+ "weight_decay": 0.02,
104
+ "optimizer": "adamw8bit",
105
+ "scheduler_config": {
106
+ "T_0": 1000,
107
+ "T_mult": 2,
108
+ "eta_min_ratio": 0.1
109
+ },
110
+ "cross_modal_loss_weight": 1.5,
111
+ "text_generation_loss_weight": 1.0,
112
+ "memory_regularization_weight": 0.1,
113
+ "alignment_consistency_weight": 0.5,
114
+ "track_token_usage": true,
115
+ "log_token_progress": true,
116
+ "stop_at_token_limit": false,
117
+ "validate_alignment_every_n_steps": 500,
118
+ "log_alignment_metrics": true,
119
+ "alignment_loss_scaling": "adaptive"
120
+ },
121
+ "wandb": {
122
+ "project": "bitmar-100M-attention-epochs",
123
+ "entity": "babylm-ntust",
124
+ "api_key": null,
125
+ "log_every_n_steps": 100,
126
+ "log_attention": true,
127
+ "log_memory": true,
128
+ "log_gradients": true,
129
+ "log_token_usage": true,
130
+ "log_cross_modal_similarity": true,
131
+ "log_alignment_quality": true,
132
+ "log_caption_image_matching": true,
133
+ "save_code": true,
134
+ "create_plots": true,
135
+ "plot_attention_heatmaps": true,
136
+ "plot_memory_usage": true,
137
+ "plot_token_distribution": true,
138
+ "plot_alignment_metrics": true,
139
+ "log_memory_evolution": true,
140
+ "plot_memory_evolution_heatmap": true,
141
+ "plot_memory_diversity": true,
142
+ "plot_memory_access_patterns": true,
143
+ "memory_visualization_frequency": 5000,
144
+ "memory_snapshot_frequency": 10000,
145
+ "track_memory_metrics": [
146
+ "memory_diversity_score",
147
+ "memory_specialization_score",
148
+ "memory_usage_entropy",
149
+ "cross_modal_memory_ratio",
150
+ "memory_slot_utilization",
151
+ "memory_update_frequency",
152
+ "memory_retrieval_accuracy"
153
+ ]
154
+ },
155
+ "evaluation": {
156
+ "metrics": [
157
+ "bleu",
158
+ "rouge",
159
+ "cross_modal_similarity",
160
+ "memory_efficiency"
161
+ ],
162
+ "generate_samples": true,
163
+ "num_samples": 20,
164
+ "max_generation_length": 32,
165
+ "temperature": 0.8,
166
+ "top_p": 0.9,
167
+ "evaluate_alignment": true,
168
+ "alignment_metrics": [
169
+ "cosine_similarity",
170
+ "retrieval_accuracy",
171
+ "caption_image_matching",
172
+ "cross_modal_retrieval"
173
+ ],
174
+ "alignment_threshold": 0.8,
175
+ "validate_pairs_during_eval": true
176
+ },
177
+ "output": {
178
+ "checkpoint_dir": "checkpoints_100M_dataset",
179
+ "log_dir": "logs_100M_dataset",
180
+ "attention_dir": "attention_100M_dataset",
181
+ "memory_dir": "memory_100M_dataset",
182
+ "results_dir": "results_100M_dataset",
183
+ "token_logs_dir": "token_logs_100M_dataset"
184
+ },
185
+ "memory_optimization": {
186
+ "use_gradient_checkpointing": true,
187
+ "use_fp16": true,
188
+ "use_int8_vision": false,
189
+ "empty_cache_frequency": 10,
190
+ "max_memory_slots_in_ram": 16,
191
+ "compress_episodic_memory": true,
192
+ "vision_feature_caching": false,
193
+ "vision_batch_processing": true,
194
+ "tie_word_embeddings": true,
195
+ "use_shared_attention": false
196
+ },
197
+ "performance_targets": {
198
+ "max_model_size_mb": 50,
199
+ "target_cross_modal_similarity": 0.75,
200
+ "target_text_generation_quality": 0.6,
201
+ "memory_efficiency_threshold": 0.8
202
+ },
203
+ "flops_tracking": {
204
+ "enabled": true,
205
+ "log_frequency": 100,
206
+ "save_statistics": true,
207
+ "estimate_theoretical": true,
208
+ "track_peak_performance": true,
209
+ "log_to_wandb": true,
210
+ "detailed_breakdown": true,
211
+ "memory_bandwidth_tracking": false,
212
+ "efficiency_analysis": true,
213
+ "track_components": [
214
+ "attention",
215
+ "feedforward",
216
+ "layer_norm",
217
+ "embeddings",
218
+ "vision_encoder",
219
+ "cross_modal_fusion"
220
+ ]
221
+ },
222
+ "token_tracking": {
223
+ "log_frequency": 1000,
224
+ "save_token_distribution": true,
225
+ "monitor_caption_text_ratio": true,
226
+ "enforce_token_limits": false,
227
+ "early_stopping_on_limit": false,
228
+ "track_alignment_quality": true,
229
+ "log_misaligned_samples": true,
230
+ "alignment_quality_threshold": 0.7,
231
+ "save_alignment_statistics": true,
232
+ "correlate_flops_with_tokens": true,
233
+ "log_computational_efficiency": true,
234
+ "track_throughput_vs_quality": true
235
+ },
236
+ "huggingface_hub": {
237
+ "enabled": true,
238
+ "repo_id": "euhidaman/bitmar-attention-multimodal",
239
+ "private": true,
240
+ "upload_after_epoch": true,
241
+ "upload_final_model": true,
242
+ "commit_message_template": "BitMar 100M tokens - Epoch {epoch} - {tokens_processed:,} tokens processed",
243
+ "create_model_card": true,
244
+ "model_card_template": "---\nlanguage: en\nlicense: mit\ntags:\n- bitmar\n- multimodal\n- babylm\n- cross-modal\ndatasets:\n- babylm_multimodal\nmetrics:\n- bleu\n- cross_modal_similarity\n---\n\n# BitMar 100M Token Model\n\nThis model was trained on exactly 100 million tokens as part of the BabyLM challenge.\n\n## Training Details\n- Total tokens: 100,000,000\n- Epochs completed: {epoch}\n- Tokens processed: {tokens_processed:,}\n- Cross-modal similarity: {best_similarity:.4f}\n\n## Model Architecture\n- Text encoder: {text_encoder_layers} layers, {text_encoder_dim} hidden size\n- Vision encoder: DiNOv2 features compressed to {vision_latent_size}\n- Episodic memory: {memory_size} slots\n\n## Usage\n```python\nfrom transformers import AutoModel, AutoTokenizer\n\nmodel = AutoModel.from_pretrained(\"{repo_id}\")\ntokenizer = AutoTokenizer.from_pretrained(\"{repo_id}\")\n```\n"
245
+ },
246
+ "attention_sinks": {
247
+ "enabled": true,
248
+ "attention_sink_size": 4,
249
+ "attention_sink_window_size": 1020,
250
+ "inject_to_text_encoder": true,
251
+ "inject_to_text_decoder": true,
252
+ "position_shift_enabled": true,
253
+ "cache_compression": true,
254
+ "adaptive_window_size": false,
255
+ "memory_efficient_attention": true,
256
+ "preserve_episodic_memory": true,
257
+ "preserve_quantization": true,
258
+ "preserve_cross_modal_fusion": true
259
+ }
260
+ }
261
+ }
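A few quantities can be derived from the metadata above: how far the run is from the 100M-token target, the average number of tokens per step, and where learning-rate restarts would fall. This is a rough sketch under two assumptions that the source does not confirm: that `global_step` counts scheduler steps, and that `cosine_with_restarts` follows the same `T_0`/`T_mult` semantics as PyTorch's `CosineAnnealingWarmRestarts` (each cycle is `T_mult` times longer than the previous one). The helper `restart_steps` is hypothetical.

```python
from typing import List

# Values copied from training_metadata.json.
metadata = {
    "global_step": 99_498,
    "tokens_processed": 99_686_013,
    "target_tokens": 100_000_000,
}
scheduler_config = {"T_0": 1000, "T_mult": 2}

tokens_per_step = metadata["tokens_processed"] / metadata["global_step"]
progress = metadata["tokens_processed"] / metadata["target_tokens"]
remaining = metadata["target_tokens"] - metadata["tokens_processed"]


def restart_steps(t_0: int, t_mult: int, up_to_step: int) -> List[int]:
    """Steps at which a warm restart begins, assuming geometric cycle growth."""
    steps = []
    boundary, cycle = t_0, t_0
    while boundary <= up_to_step:
        steps.append(boundary)
        cycle *= t_mult      # the next cycle is t_mult times longer
        boundary += cycle
    return steps


restarts = restart_steps(
    scheduler_config["T_0"], scheduler_config["T_mult"], metadata["global_step"]
)

print(f"tokens per step: {tokens_per_step:.2f}")
print(f"progress: {progress:.2%}, remaining tokens: {remaining:,}")
print("restart steps:", restarts)
```

The remaining-token count shows that the run stopped 313,987 tokens short of the configured target, which is consistent with `stop_at_token_limit` and `enforce_token_limits` both being `false`.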
vocab.json ADDED
The diff for this file is too large to render. See raw diff