Ma7ee7 committed on
Commit 233d036 · verified · 1 Parent(s): 9e9f4fd

Upload Meet25M base model as safetensors

README.md ADDED
@@ -0,0 +1,54 @@
+ ---
+ language:
+ - en
+ license: other
+ library_name: pytorch
+ tags:
+ - causal-lm
+ - from-scratch
+ - gpt
+ - safetensors
+ - small-language-model
+ - meet25m
+ ---
+
+ # Meet25M Base
+
+ A small GPT-style causal language model trained from scratch.
+
+ ## Model
+
+ - Architecture: GPT-style decoder-only Transformer
+ - Approx size: ~25M parameters
+ - Context length: 1024
+ - Tokenizer: custom byte-level BPE
+ - Positional encoding: RoPE
+ - Normalization: RMSNorm
+ - MLP: SwiGLU
+ - Embeddings: tied input/output embeddings
+
+ ## Training Data Mix
+
+ Target pretraining mix:
+
+ - FineWeb-Edu
+ - FineWeb general
+ - Wikipedia
+ - OpenWebMath
+ - Project Gutenberg
+ - StackOverflow / Stack Exchange style posts
+ - CodeSearchNet
+
+ Total target: ~250M training tokens.
+
+ ## Files
+
+ - `model.safetensors` — safetensors checkpoint
+ - `config.json` — model config
+ - `tokenizer/` — tokenizer files
+ - `safetensors_info.json` — checkpoint metadata
+
+ ## Loading
+
+ This is not a standard Transformers `AutoModelForCausalLM` checkpoint.
+ Use the custom GPT class from the training script and load `model.safetensors`.
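
Note on loading: the custom GPT class and the tensor names it expects are not part of this commit, so a reasonable first step is to list what the checkpoint actually contains. The sketch below uses only the Python standard library and relies on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names `read_safetensors_header` and `summarize` are illustrative and do not come from the training script.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header):
    """Map each tensor name to (dtype, shape, element count)."""
    return {
        name: (entry["dtype"], entry["shape"], prod(entry["shape"]))
        for name, entry in header.items()
    }
```

Calling `summarize(read_safetensors_header("model.safetensors"))` on a downloaded copy should give one entry per stored tensor. Those names and shapes can then be compared against the `state_dict` keys of whichever GPT class is used to load the weights.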
config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "vocab_size": 16384,
+   "block_size": 1024,
+   "n_layer": 8,
+   "n_embd": 384,
+   "n_head": 6,
+   "dropout": 0.0,
+   "pad_id": 0,
+   "eos_id": 2,
+   "weight_format": "safetensors",
+   "tied_embeddings": true
+ }
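
Note on the config: these fields are enough to sketch a rough parameter count, with one gap. The SwiGLU hidden width is not recorded in `config.json`, so the sketch takes it as a hypothetical `mlp_hidden` argument. It also assumes bias-free linear layers, full multi-head attention, and one RMSNorm weight vector per norm, none of which this commit confirms.

```python
def estimate_params(cfg, mlp_hidden):
    """Rough parameter count for a bias-free GPT-style decoder.

    Assumptions (not confirmed by the commit): no bias terms, Q/K/V/output
    projections of size d x d, a three-matrix SwiGLU MLP, two RMSNorm weight
    vectors per block plus one final norm, and no learned position table
    (RoPE has no weights). mlp_hidden is hypothetical; config.json omits it.
    """
    d = cfg["n_embd"]
    embed = cfg["vocab_size"] * d      # token embedding matrix
    attn = 4 * d * d                   # Q, K, V, and output projections
    mlp = 3 * d * mlp_hidden           # gate, up, and down projections
    norms = 2 * d                      # pre-attention and pre-MLP norms
    total = embed + cfg["n_layer"] * (attn + mlp + norms) + d
    if not cfg["tied_embeddings"]:
        total += embed                 # separate output projection
    return total


config = {
    "vocab_size": 16384,
    "block_size": 1024,
    "n_layer": 8,
    "n_embd": 384,
    "n_head": 6,
    "tied_embeddings": True,
}

# Per-head width implied by the config: n_embd must divide evenly by n_head.
head_dim, remainder = divmod(config["n_embd"], config["n_head"])
```

Trying a few values of `mlp_hidden` and comparing the result with the README's "~25M parameters" is one way to narrow down which MLP width the training script used.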
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:81f258ca51f4a107ac10439a25f62fbd9e25b96f692464d637a0276211769619
+ size 106986416
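
Note on the LFS pointer: the three lines above are a Git LFS pointer, not the weights themselves. The `oid` is the SHA-256 of the real file and `size` is its length in bytes, so a downloaded `model.safetensors` can be checked against them. A minimal sketch follows; the helper names `parse_lfs_pointer` and `matches_pointer` are illustrative.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into (sha256_hex, size_in_bytes)."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Check a local file's size and SHA-256 against an LFS pointer."""
    expected_digest, expected_size = parse_lfs_pointer(pointer_text)
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Hash in chunks so a ~100 MB checkpoint is never fully in memory.
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
            size += len(chunk)
    return size == expected_size and hasher.hexdigest() == expected_digest


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:81f258ca51f4a107ac10439a25f62fbd9e25b96f692464d637a0276211769619
size 106986416
"""
```

`matches_pointer("model.safetensors", POINTER)` returns `True` only if both the byte count and the digest agree with the pointer committed here.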
safetensors_info.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name": "Meet25M-Base",
+   "safetensors_file": "model.safetensors",
+   "source_checkpoint": "model.pt",
+   "num_tensors": 59,
+   "size_bytes": 106986416,
+   "config": {
+     "vocab_size": 16384,
+     "block_size": 1024,
+     "n_layer": 8,
+     "n_embd": 384,
+     "n_head": 6,
+     "dropout": 0.0,
+     "pad_id": 0,
+     "eos_id": 2,
+     "weight_format": "safetensors",
+     "tied_embeddings": true
+   }
+ }
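
Note on the duplicated config: `safetensors_info.json` embeds a second copy of the model config, and the two copies can drift apart if only one file is edited later. A small sketch of a consistency check is shown below, assuming both files have been loaded with `json.load`. The helper name `config_mismatches` is illustrative.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit null


def config_mismatches(config, info):
    """Return the sorted keys whose values differ between config.json and
    the "config" object embedded in safetensors_info.json."""
    embedded = info["config"]
    keys = set(config) | set(embedded)
    return sorted(
        k for k in keys
        if config.get(k, _MISSING) != embedded.get(k, _MISSING)
    )
```

An empty list means the two copies agree. Any returned key names a field that exists in only one file or holds different values in each.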
tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "backend": "tokenizers",
+   "bos_token": "<bos>",
+   "eos_token": "<eos>",
+   "model_max_length": 1024,
+   "pad_token": "<pad>",
+   "tokenizer_class": "TokenizersBackend",
+   "unk_token": "<unk>"
+ }