bgraudt committed on
Commit dead189 · verified · 1 Parent(s): 9368ba4

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +10 -10
  2. config.json +1 -1
  3. model.safetensors +2 -2
  4. tokenizer.json +0 -0
README.md CHANGED
@@ -15,7 +15,7 @@ tags:
  - rope
  - rmsnorm
  model-index:
- - name: Mythos-172M
+ - name: Mythos-194M
  results: []
  widget:
  - text: "The history of artificial intelligence begins with"
@@ -31,7 +31,7 @@ inference:
  
  <div align="center">
  
- # Mythos-172M
+ # Mythos-194M
  
  **A decoder-only language model built from scratch — LLaMA-compatible weights.**
  
@@ -44,7 +44,7 @@ inference:
  
  ---
  
- > ⚠️ **Research preview.** Debug checkpoint — trained on ~21 M tokens with vocab 3 252 for 5 000 steps. Intended to verify the architecture, not for downstream use. A production 500 M checkpoint will supersede it.
+ > **Production release.** Full pre-training run.
  
  ## Model Summary
  
@@ -71,7 +71,7 @@ toolchains — no custom code or `trust_remote_code` required.
  
  | Component | Choice | Value |
  |---|---|---:|
- | Parameters | — | **172 M** |
+ | Parameters | — | **194 M** |
  | Hidden layers | Pre-norm decoder blocks | 24 |
  | Hidden size | `d_model` | 768 |
  | Intermediate size | SwiGLU hidden | 2048 |
@@ -82,7 +82,7 @@ toolchains — no custom code or `trust_remote_code` required.
  | Normalization | **RMSNorm** (pre-norm) | ε = 1e-05 |
  | Activation | **SwiGLU** | — |
  | Tied embeddings | Embedding ↔ LM head | ✅ |
- | Vocabulary | ByteLevel BPE | 3,252 |
+ | Vocabulary | ByteLevel BPE | 31,021 |
  | Context length | Max sequence | 2,048 |
  
  ## Quickstart
@@ -118,27 +118,27 @@ python llama.cpp/convert_hf_to_gguf.py mythos
  
  ### Data
  
- - **Corpus:** Wikipedia (English 20231101 snapshot) — 5 000 articles, ~21 M tokens
- - **Tokenizer:** ByteLevel BPE trained from scratch, vocab size **3,252**
+ - **Corpus:** mixed web + code (details in the GitHub repo)
+ - **Tokenizer:** ByteLevel BPE trained from scratch, vocab size **31,021**
  - **Training context:** 512 tokens
  
  ### Hyperparameters
  
  | | |
  |---|---:|
- | Steps | 5,000 |
+ | Steps | 16,000 |
  | Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
  | LR schedule | Cosine decay, 2 000-step warmup |
  | Peak learning rate | 3 × 10⁻⁴ |
  | Precision | bfloat16 mixed |
- | Hardware | Apple M2 (MPS) |
+ | Hardware | A100 40 GB |
  
  ## Limitations and Intended Use
  
  - **Base model only** — no instruction tuning, no RLHF, no safety alignment.
  - English-only; non-English performance is poor.
  - May reproduce biases and factual errors from the training distribution.
- - Tiny vocabulary (3 252 tokens) severely caps fluency — intended as an architecture demo.
+
  - Not suitable for medical, legal, financial, or other high-stakes applications.
  
  ## Citation
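The hyperparameter table in the README diff lists a cosine-decay schedule with a 2 000-step warmup, a peak learning rate of 3 × 10⁻⁴, and 16,000 steps. A minimal sketch of what such a schedule might look like follows. The linear warmup shape, the decay floor of zero, and the helper name `lr_at` are assumptions for illustration, since the README does not specify them.

```python
import math

PEAK_LR = 3e-4        # from the README table
WARMUP_STEPS = 2_000  # from the README table
TOTAL_STEPS = 16_000  # from the README table (new value in this commit)
MIN_LR = 0.0          # assumption: the README gives no decay floor


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to MIN_LR over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1_000, 2_000, 9_000, 16_000):
        print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

Under these assumptions the rate reaches its peak at step 2 000 and falls to half the peak at step 9 000, the midpoint of the decay phase.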
config.json CHANGED
@@ -9,7 +9,7 @@
  "num_attention_heads": 12,
  "num_key_value_heads": 4,
  "head_dim": 64,
- "vocab_size": 3252,
+ "vocab_size": 31021,
  "max_position_embeddings": 2048,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
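The values in `config.json` and the README architecture table are enough for a rough parameter count. The sketch below assumes the usual LLaMA layout: bias-free q/k/v/o projections, three SwiGLU matrices, two RMSNorm weights per block, and one final norm, with the tied embedding counted once. The function name `llama_param_count` is illustrative, and the layer count, hidden size, and intermediate size are taken from the README table rather than from the part of `config.json` shown here.

```python
def llama_param_count(vocab_size: int, hidden: int, layers: int,
                      intermediate: int, n_heads: int, n_kv_heads: int,
                      head_dim: int, tied: bool = True) -> int:
    """Estimate parameters for an assumed bias-free LLaMA-style decoder."""
    embed = vocab_size * hidden
    attn = hidden * n_heads * head_dim           # q_proj
    attn += 2 * hidden * n_kv_heads * head_dim   # k_proj and v_proj (grouped-query)
    attn += n_heads * head_dim * hidden          # o_proj
    mlp = 3 * hidden * intermediate              # gate, up, down
    norms = 2 * hidden                           # two RMSNorm weights per block
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied:
        total += embed                           # separate LM head
    return total


if __name__ == "__main__":
    shared = dict(hidden=768, layers=24, intermediate=2048,
                  n_heads=12, n_kv_heads=4, head_dim=64)
    for vocab in (3252, 31021):
        n = llama_param_count(vocab_size=vocab, **shared)
        print(f"vocab {vocab:>6}: {n:,} parameters")
```

With these inputs the estimate is 153,530,112 for the old vocabulary and 174,856,704 for the new one. Both are below the 172 M and 194 M figures in the README, which may reflect a different counting convention or tensors not visible in this diff, so the estimate is best treated as a sanity check only.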
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4ff47b104ec6bed36b144db50483986275af0c4bf946d1edc824bff50270a653
- size 614144704
+ oid sha256:e53a1840fddf1373dac13b2c3745b50a4a3ca5fcba7e668984081f5a7a5c4e0a
+ size 699451136
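The two Git LFS pointers record the checkpoint growing from 614,144,704 to 699,451,136 bytes. As a rough consistency check, the sketch below divides that growth by the number of embedding elements added by the vocabulary change. It assumes the tied embedding matrix is the only tensor whose shape changed and takes the hidden size of 768 from the README table. The variable names are illustrative.

```python
OLD_SIZE = 614_144_704  # bytes, from the old LFS pointer
NEW_SIZE = 699_451_136  # bytes, from the new LFS pointer
OLD_VOCAB = 3_252       # old vocab_size in config.json
NEW_VOCAB = 31_021      # new vocab_size in config.json
HIDDEN = 768            # hidden size from the README table

# Growth of the file on disk.
added_bytes = NEW_SIZE - OLD_SIZE
# Extra embedding entries, assuming one tied (vocab x hidden) matrix.
added_elements = (NEW_VOCAB - OLD_VOCAB) * HIDDEN
bytes_per_element = added_bytes / added_elements

print(f"added bytes:       {added_bytes:,}")
print(f"added elements:    {added_elements:,}")
print(f"bytes per element: {bytes_per_element:.4f}")
```

A ratio close to 4 would be consistent with 32-bit storage of the weights. The file also carries a metadata header, so the ratio need not be exact.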
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff