MediaStreamAI committed on
Commit 523b5e9 · verified · 1 Parent(s): 5605f9a

Fix training hyperparameters: seq=2048, effective batch=8 (was incorrectly listed as 512/32)

Files changed (1)
  1. README.md +311 -5
README.md CHANGED
@@ -1,5 +1,311 @@
1
- ---
2
- license: other
3
- license_name: mother-ai-beta
4
- license_link: LICENSE
5
- ---
1
+ ---
2
+ license: other
3
+ license_name: msai-sovereign
4
+ license_link: LICENSE
5
+ language:
6
+ - en
7
+ - cy
8
+ - ga
9
+ - gd
10
+ tags:
11
+ - sovereign-ai
12
+ - uk
13
+ - reasoning
14
+ - msai
15
+ - mother-core
16
+ pipeline_tag: text-generation
17
+ library_name: pytorch
18
+ ---
19
+
20
+ # MOTHER CORE V2 - chunk 450 (W2.7)
21
+
22
+ **Sovereign UK AI** built from scratch by **MediaStream AI Limited (MSAI)**.
23
+
24
+ This is a development checkpoint released for **MSAI team and partner testing only**. It is **not** a released model and **not** intended for production use. Eval performance is partial; the model is mid-training.
25
+
26
+ ---
27
+
28
+ ## 1. Model Summary
29
+
30
+ | Field | Value |
31
+ |---|---|
32
+ | Model | MOTHER CORE V2 |
33
+ | Checkpoint | chunk 450 (W2.7 stage) |
34
+ | Parameters | 6.877B |
35
+ | Architecture | Custom transformer (RoPE, GQA, RMSNorm, SwiGLU FFN, memory gate) |
36
+ | Layers | 48 |
37
+ | Hidden dimension | 3,072 |
38
+ | Attention heads | 24 (head_dim 128) |
39
+ | KV heads | 6 (GQA ratio 4:1) |
40
+ | FFN multiplier | 4.0 (intermediate 12,288) |
41
+ | Max sequence length | 4,096 |
42
+ | Vocabulary | 50,258 (SentencePiece) |
43
+ | RoPE θ | 10,000 |
44
+ | RMSNorm ε | 1e-5 |
45
+ | Tied embeddings | No (separate `lm_head`) |
46
+ | Weights dtype (this release) | bfloat16 |
47
+ | Training dtype | float32 |
48
+
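The head layout in the table can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `kv_head_for` is hypothetical, and it assumes the common grouped-query convention in which consecutive query heads share one KV head. The actual `mother_core` implementation may group heads differently.

```python
# Dimensions copied from the Model Summary table.
HIDDEN_DIM = 3072
N_HEADS = 24
N_KV_HEADS = 6
HEAD_DIM = 128
FFN_MULT = 4.0

# Derived sizes implied by the table.
GROUP_SIZE = N_HEADS // N_KV_HEADS      # query heads per KV head (4)
KV_DIM = N_KV_HEADS * HEAD_DIM          # width of the K and V projections (768)
FFN_DIM = int(HIDDEN_DIM * FFN_MULT)    # SwiGLU intermediate width (12288)


def kv_head_for(query_head: int) -> int:
    """Map a query head to its KV head, ASSUMING consecutive grouping."""
    if not 0 <= query_head < N_HEADS:
        raise ValueError(f"query_head must be in [0, {N_HEADS})")
    return query_head // GROUP_SIZE


if __name__ == "__main__":
    # The table is self-consistent if heads x head_dim equals the hidden size.
    assert N_HEADS * HEAD_DIM == HIDDEN_DIM
    print("GQA ratio:", f"{GROUP_SIZE}:1")
    print("KV projection width:", KV_DIM)
    print("FFN intermediate:", FFN_DIM)
    print("query head 23 reads KV head", kv_head_for(23))
```

Under this assumption, query heads 0 to 3 read KV head 0, heads 4 to 7 read KV head 1, and so on.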
49
+ This is a **from-scratch sovereign build**. It is not a fine-tune of any external model (Llama, Qwen, Mistral, GPT, etc.). Training, tokenisation, architecture, and corpus are all proprietary to MSAI.
50
+
51
+ ---
52
+
53
+ ## 2. Status
54
+
55
+ | Metric | Value |
56
+ |---|---|
57
+ | Training stage | W2.7 (mid-curriculum) |
58
+ | Most recent chunk eval | 47/105 @ chunk 450 |
59
+ | Scope | math, science, reasoning, chain-of-thought, UK knowledge, Celtic languages, MOTHER identity |
60
+ | Out of scope (separate future models) | code generation, creative writing, vision |
61
+
62
+ This release is for **internal team testing**. It will fail on tasks outside its training scope.
63
+
64
+ Eval score has risen and loss has fallen at every recorded checkpoint since chunk 300:
65
+
66
+ | Chunk | Eval | Loss |
67
+ |---|---|---|
68
+ | 300 | 36/105 | 2.47 |
69
+ | 350 | 37/105 | 2.05 |
70
+ | 400 | 45/105 | 2.01 |
71
+ | **450** | **47/105** | **1.74** |
72
+
73
+ W2.7 will continue to chunk 650, after which the W2.8 corpus addition (~330,000 records spanning agentic orchestration, multi-step reasoning, tool use, memory synthesis) will be merged for the next training phase.
74
+
75
+ ---
76
+
77
+ ## 3. Locked Inference Rules
78
+
79
+ **Deviation from these rules produces incorrect or degenerate output.** They are not suggestions; they are the inference recipe the model was trained against.
80
+
81
+ | Setting | Value | Reason |
82
+ |---|---|---|
83
+ | Prompt format | `Question:\n\n{question}\n\nAnswer:` | Exact whitespace. Model is OOD without it. |
84
+ | BOS token | id=1, `<s>` | Always prepended; model was trained with BOS at position 0 |
85
+ | EOS token | id=2, `</s>` | Stop generation on emission |
86
+ | PAD token | id=0, `<pad>` | Training only |
87
+ | Sampling | **Greedy argmax** | No temperature, no top-k, no top-p |
88
+ | Repetition penalty | 1.3 (frequency-scaled, count ≥ 2) | Higher values collapse output |
89
+ | n-gram blocking | 4-gram, no repeat | Prevents loop output |
90
+ | Max new tokens | 200 | Hard cap |
91
+ | BOS in output | Banned | Never emit BOS during generation |
92
+ | EOS in output | Allowed after first token | Early stop signal |
93
+
94
+ ### Reference code
95
+
96
+ A working reference is included as `inference.py` in this repo. The canonical implementation lives in `mother_train_7b.py::_generate_greedy()` in the MSAI training repository. **Use `inference.py` from this repo or load `mother_train_7b._generate_greedy` directly.** Re-implementations frequently get the recipe wrong.
97
+
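To make two of the locked rules concrete, here is a minimal sketch of the prompt template and of 4-gram blocking. It is not a substitute for `inference.py`. The function names `build_prompt` and `banned_next_tokens` are hypothetical, and whether the reference implementation counts prompt tokens when checking for repeated n-grams is not stated here, so treat that detail as an assumption.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the locked template (exact whitespace matters)."""
    return f"Question:\n\n{question}\n\nAnswer:"


def banned_next_tokens(generated: list[int], n: int = 4) -> set[int]:
    """Return token ids that would complete an n-gram already in `generated`.

    ASSUMPTION: `generated` holds whatever history the blocking rule should
    consider; the reference code may or may not include the prompt tokens.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    if len(generated) < n:
        return set()  # no complete n-gram exists yet, so nothing can repeat
    prefix = generated[len(generated) - (n - 1):]  # the last n-1 tokens
    banned = set()
    for start in range(len(generated) - n + 1):
        # If an earlier window matches the current prefix, the token that
        # followed it would recreate the same n-gram, so it is banned.
        if generated[start:start + n - 1] == prefix:
            banned.add(generated[start + n - 1])
    return banned


if __name__ == "__main__":
    print(repr(build_prompt("What is the capital of Scotland?")))
    history = [5, 6, 7, 8, 9, 5, 6, 7]
    print("banned next ids:", banned_next_tokens(history))
```

During greedy decoding, the banned ids would be masked out of the logits before taking the argmax.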
98
+ ---
99
+
100
+ ## 4. Architecture Detail
101
+
102
+ ```
103
+ MotherCoreModel
104
+ ├── tok_emb [50258, 3072]
105
+ ├── blocks × 48
106
+ │   └── each:
107
+ │       ├── attn (GQA)
108
+ │       │   ├── wq [3072, 3072]  # 24 heads × 128 dim
109
+ │       │   ├── wk [768, 3072]   # 6 KV heads × 128 dim
110
+ │       │   ├── wv [768, 3072]
111
+ │       │   └── wo [3072, 3072]
112
+ │       ├── ff (SwiGLU)
113
+ │       │   ├── w1 [12288, 3072]
114
+ │       │   ├── w2 [12288, 3072]
115
+ │       │   └── w3 [3072, 12288]
116
+ │       ├── norm_attn (RMSNorm)
117
+ │       └── norm_ff (RMSNorm)
118
+ ├── norm_f [3072]
119
+ ├── lm_head [50258, 3072]  # NOT tied to tok_emb
120
+ └── memory_gate [1, 3072] + bias[1]
121
+ ```
122
+
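The shapes in the tree above can be summed to cross-check the 6.877B parameter figure. This is a back-of-the-envelope sketch: it assumes the linear layers carry no bias terms and that each RMSNorm holds a single scale vector of width 3072, neither of which is stated explicitly in this README.

```python
# Shapes copied from the architecture tree.
VOCAB = 50258
DIM = 3072
KV_DIM = 768
FFN_DIM = 12288
N_LAYERS = 48


def count_parameters() -> int:
    """Sum parameter counts, ASSUMING bias-free linears and scale-only RMSNorm."""
    attn = 2 * DIM * DIM + 2 * KV_DIM * DIM   # wq + wo, then wk + wv
    ffn = 3 * FFN_DIM * DIM                   # w1, w2, w3
    norms = 2 * DIM                           # norm_attn + norm_ff
    per_block = attn + ffn + norms
    embeddings = 2 * VOCAB * DIM              # tok_emb + untied lm_head
    final_norm = DIM                          # norm_f
    memory_gate = DIM + 1                     # [1, 3072] weight + bias[1]
    return N_LAYERS * per_block + embeddings + final_norm + memory_gate


if __name__ == "__main__":
    total = count_parameters()
    print(f"{total:,} parameters (~{total / 1e9:.3f}B)")
```

Under these assumptions the total rounds to 6.877B, matching the Model Summary table.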
123
+ ### Memory gate
124
+
125
+ `memory_gate` is a sigmoid-gated single-dimension projection from the last hidden state. It is **trained but not active in inference output**: it is reserved for downstream integration with MOTHER ROBOTICS (an item/object/situational/historical awareness model) and external memory systems. Its activation is exposed in the forward pass return dict but does not affect token logits.
126
+
127
+ Forward return:
128
+ ```
129
+ {
130
+ "logits": [B, T, vocab],
131
+ "loss": scalar or None,
132
+ "aux_loss": scalar (MoE; unused here, fixed=0),
133
+ "past_key_values": List[(K,V)] or None,
134
+ "hidden_states": List[Tensor] or None,
135
+ "last_hidden_state": [B, T, dim],
136
+ "gate": [B, 1]  # detached, informational only
137
+ }
138
+ ```
139
+
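For readers unfamiliar with this kind of gate, the sketch below shows a sigmoid applied to a single-output linear projection, which is one plausible reading of the description above. It uses plain Python lists for one sequence; the function name `memory_gate` and the choice of which hidden vector is fed in (assumed here to be the final-layer state at the last position) are assumptions, not details confirmed by the `mother_core` source.

```python
import math


def memory_gate(last_hidden: list[float], weight: list[float], bias: float) -> float:
    """Compute sigmoid(w . h + b) for one hidden vector (hypothetical sketch)."""
    if len(last_hidden) != len(weight):
        raise ValueError("hidden vector and weight must have the same length")
    z = sum(h * w for h, w in zip(last_hidden, weight)) + bias
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    hidden = [0.2, -0.1, 0.4]   # stand-in for a 3072-wide hidden state
    weight = [0.5, 0.5, 0.5]    # stand-in for the [1, 3072] gate weight
    print("gate activation:", memory_gate(hidden, weight, bias=0.0))
```

The result is a value between 0 and 1 per sequence, which matches the `[B, 1]` shape listed for `gate`.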
140
+ ---
141
+
142
+ ## 5. Training
143
+
144
+ ### Corpus (W2.7)
145
+
146
+ | Category | Records |
147
+ |---|---|
148
+ | Reasoning + chain-of-thought | ~390,000 |
149
+ | UK general knowledge | ~210,000 |
150
+ | Math & arithmetic (digit-spaced) | ~165,000 |
151
+ | Identity & self-knowledge (MOTHER, MSAI) | ~32,000 |
152
+ | Celtic languages (Welsh, Irish, Scottish Gaelic) | ~28,000 |
153
+ | Science | ~88,000 |
154
+ | Misc (chat, instruct skeleton) | ~135,000 |
155
+ | **Total** | **~1.05M** |
156
+
157
+ ### Hyperparameters
158
+
159
+ | Setting | Value |
160
+ |---|---|
161
+ | Learning rate | 1e-5 |
162
+ | Gradient clip | 10.0 |
163
+ | Effective batch size | 8 (BATCH_PHYSICAL=1 × GRAD_ACCUM_STEPS=8) |
164
+ | Sequence length (training) | 2048 |
165
+ | Optimiser | AdamW (β₁=0.9, β₂=0.95) |
166
+ | Weight decay | 0.1 |
167
+ | Warmup steps | 100 |
168
+ | Layer-wise LR scaling | from chunk 10 onward |
169
+ | Hardware | NVIDIA GB10 Blackwell (Grace–Blackwell unified memory, 128GB) |
170
+ | Training site | MSAI Wright Avenue, Dundee (sovereign UK infrastructure) |
171
+
172
+ Training was performed at sequence length **2048** using physical microbatches of 1 with gradient accumulation of 8 (effective batch = 8). The architecture supports 4,096-token inference; going from 2048 to 4096 is a modest RoPE extrapolation, but long-context behaviour at the full 4,096 tokens has not been benchmarked at this checkpoint.
173
+
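The batch arithmetic above works out as follows. This is a small sketch using the constant names from the hyperparameter table; the helper `optimizer_steps` is hypothetical, and the tokens-per-step figure is an upper bound that assumes every sequence fills the full 2048 positions.

```python
# Values copied from the Hyperparameters table.
BATCH_PHYSICAL = 1
GRAD_ACCUM_STEPS = 8
SEQ_LEN = 2048

# One optimiser update sees this many sequences.
effective_batch = BATCH_PHYSICAL * GRAD_ACCUM_STEPS

# Upper bound on tokens per update, ASSUMING full-length sequences.
max_tokens_per_step = effective_batch * SEQ_LEN


def optimizer_steps(n_microbatches: int) -> int:
    """Number of complete optimiser updates after `n_microbatches` forward/backward passes."""
    if n_microbatches < 0:
        raise ValueError("n_microbatches must be non-negative")
    return n_microbatches // GRAD_ACCUM_STEPS


if __name__ == "__main__":
    print("effective batch:", effective_batch)
    print("max tokens per optimiser step:", max_tokens_per_step)
    print("updates after 100 microbatches:", optimizer_steps(100))
```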
174
+ ---
175
+
176
+ ## 6. Sovereign Build Posture
177
+
178
+ MOTHER CORE is part of MSAI's sovereign AI stack, built end-to-end in the UK on UK-resident infrastructure. The training, weights, tokeniser, and corpus are owned by MSAI. The training datacentres are MSAI-operated (Wright Avenue, Dundee; with additional sites in Durham and Manchester). No US cloud provider is in the inference or training path.
179
+
180
+ This positioning matters for UK government, defence, and regulated-enterprise customers where data residency, GDPR, and supply-chain provenance are mandatory.
181
+
182
+ ---
183
+
184
+ ## 7. Intended Use & Out-of-Scope Use
185
+
186
+ **In scope (this checkpoint):**
187
+ - Reasoning and chain-of-thought tasks at modest difficulty
188
+ - UK general knowledge questions
189
+ - Welsh / Irish / Scottish Gaelic short-form questions
190
+ - MOTHER-identity Q&A
191
+ - Arithmetic on small integers (with digit-spaced inputs for ≥3-digit numbers)
192
+
193
+ **Out of scope (this checkpoint):**
194
+ - Code generation (separate model, MOTHER CODE, planned)
195
+ - Creative writing (separate model, MOTHER LLM, planned)
196
+ - Long-form (>1,000 token) generation
197
+ - Multi-turn dialogue (training is single-turn Q/A)
198
+ - Anything safety-critical, medical, legal, or financial advisory
199
+ - Real-time information (model has no internet access at inference)
200
+
201
+ ---
202
+
203
+ ## 8. Evaluation
204
+
205
+ The internal eval suite at chunk 450 scores **47/105 (44.8%)** across:
206
+
207
+ - Identity: 6/6 (100%)
208
+ - UK knowledge: 9/12
209
+ - Reasoning (multi-step): 14/35
210
+ - Arithmetic: 5/15
211
+ - Science: 7/12
212
+ - Celtic languages: 4/9
213
+ - Chain-of-thought: 2/16
214
+
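The per-category scores above can be tallied to confirm the headline figure. The sketch below simply re-enters the numbers from the list; the dictionary name `RESULTS` is illustrative.

```python
# (passed, total) per category, copied from the list above.
RESULTS = {
    "Identity": (6, 6),
    "UK knowledge": (9, 12),
    "Reasoning (multi-step)": (14, 35),
    "Arithmetic": (5, 15),
    "Science": (7, 12),
    "Celtic languages": (4, 9),
    "Chain-of-thought": (2, 16),
}

total_passed = sum(passed for passed, _ in RESULTS.values())
total_items = sum(total for _, total in RESULTS.values())
overall_pct = round(100 * total_passed / total_items, 1)

if __name__ == "__main__":
    for name, (passed, total) in RESULTS.items():
        print(f"{name:<24} {passed:>2}/{total:<3} {100 * passed / total:5.1f}%")
    print(f"{'Overall':<24} {total_passed}/{total_items} {overall_pct}%")
```

The categories sum to 47 of 105, which is consistent with the 44.8% stated above.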
215
+ Persistent gaps at chunk 450:
216
+ - Arithmetic on multi-digit numbers (training fix in progress; see W2.8 plan)
217
+ - Multi-step reasoning beyond 3 hops
218
+ - Welsh and Irish (smaller corpus volume than other categories)
219
+
220
+ Eval suite and methodology are MSAI-internal. Public benchmarks (MMLU, GSM8K) have **not** been run against this checkpoint, and results would not be directly comparable with other models because the training corpus and tokeniser are sovereign.
221
+
222
+ ---
223
+
224
+ ## 9. Limitations & Known Failure Modes
225
+
226
+ 1. **Single-turn only**: no chat-style multi-turn coherence
227
+ 2. **Format-brittle**: the `Question:\n\n...\n\nAnswer:` template is required; other formats produce OOD output
228
+ 3. **No tool use / no agent loop** at this checkpoint (W2.8 corpus will add this)
229
+ 4. **No code generation**: even simple Python will fail; not in scope
230
+ 5. **No retrieval / no internet**: closed-book knowledge only, as of training cutoff
231
+ 6. **Arithmetic on multi-digit numbers**: requires digit-spaced input (`1 5 + 2 7`) to perform reliably
232
+ 7. **`weights_only=False` required** when loading from `.pt` with `torch.load`; this repo ships `.safetensors` instead, which is safer
233
+ 8. **High repetition penalty (>1.4) collapses output**: stick to 1.3
234
+
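The digit-spaced input format mentioned in item 6 can be produced with a small preprocessing helper. This is a sketch: the function name `digit_space` and the `min_digits` threshold are hypothetical, and it handles plain integers only (decimals, signs, and thousands separators are not treated specially).

```python
import re


def digit_space(expression: str, min_digits: int = 2) -> str:
    """Insert spaces between the digits of every integer with at least `min_digits` digits."""
    if min_digits < 1:
        raise ValueError("min_digits must be >= 1")
    pattern = re.compile(r"\d{%d,}" % min_digits)
    # Each matched run of digits is rewritten with single spaces between digits.
    return pattern.sub(lambda match: " ".join(match.group(0)), expression)


if __name__ == "__main__":
    print(digit_space("15 + 27"))                    # 1 5 + 2 7
    print(digit_space("15 + 127", min_digits=3))     # 15 + 1 2 7
```

Setting `min_digits=3` spaces only numbers of three or more digits, which corresponds to the threshold given in the Intended Use section.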
235
+ ---
236
+
237
+ ## 10. Usage
238
+
239
+ ### Quick test from a clean Python environment
240
+
241
+ ```bash
242
+ pip install torch safetensors sentencepiece huggingface_hub
243
+ ```
244
+
245
+ You also need the `mother_core` package source available (architecture is custom; no Transformers integration yet). Clone the MSAI training repo or copy `mother_core/` into your `PYTHONPATH`.
246
+
247
+ ```python
248
+ import importlib.util
249
+ from huggingface_hub import snapshot_download
250
+ repo_dir = snapshot_download(repo_id="MediaStreamAI/MOTHER_CORE_V2")
251
+ # Load inference.py from the downloaded snapshot as a module
252
+ spec = importlib.util.spec_from_file_location("inf", f"{repo_dir}/inference.py")
253
+ inf = importlib.util.module_from_spec(spec)
254
+ spec.loader.exec_module(inf)
255
+ model, tok = inf.load_model_and_tokenizer(repo_dir)
256
+ print(inf.generate_greedy(model, tok, "What is the capital of Scotland?"))
257
+ ```
258
+
259
+ Or run the inference script directly:
260
+
261
+ ```bash
262
+ python inference.py "What is the capital of Scotland?"
263
+ ```
264
+
265
+ ### File map
266
+
267
+ | File | Purpose |
268
+ |---|---|
269
+ | `model-00001-of-00003.safetensors` | Weights, shard 1/3 |
270
+ | `model-00002-of-00003.safetensors` | Weights, shard 2/3 |
271
+ | `model-00003-of-00003.safetensors` | Weights, shard 3/3 |
272
+ | `model.safetensors.index.json` | Shard index |
273
+ | `config.json` | Architecture spec |
274
+ | `tokenizer.model` | SentencePiece vocab |
275
+ | `tokenizer_config.json` | Tokeniser config (`add_bos_token=true` required) |
276
+ | `special_tokens_map.json` | BOS/EOS/PAD/UNK ids |
277
+ | `inference.py` | Reference inference with locked rules |
278
+ | `README.md` | This file |
279
+
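Before loading, it can help to confirm that a downloaded snapshot contains every file in the table above. The sketch below uses only the standard library; the helper name `missing_files` is hypothetical and the file list is copied from the File map (excluding `README.md`).

```python
from pathlib import Path

# File names copied from the File map table.
EXPECTED_FILES = [
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "model.safetensors.index.json",
    "config.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "inference.py",
]


def missing_files(repo_dir: str) -> list[str]:
    """Return the expected file names that are absent from `repo_dir`."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    absent = missing_files(target)
    print("all files present" if not absent else f"missing: {absent}")
```

Pass the directory returned by `snapshot_download` as the argument to check a fresh download.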
280
+ ---
281
+
282
+ ## 11. License
283
+
284
+ **MSAI Sovereign License: Internal & Partner Use Only.**
285
+
286
+ This model is the proprietary work of MediaStream AI Limited. It is released to authorised team members and contracted partners for evaluation and integration purposes. Redistribution, commercial use, or training other models on this model's outputs requires written permission from MSAI.
287
+
288
+ For licensing enquiries: contact MediaStream AI Limited via the company website.
289
+
290
+ ---
291
+
292
+ ## 12. Citation
293
+
294
+ ```
295
+ @misc{msai-mother-core-2026,
296
+ title = {MOTHER CORE V2: Sovereign UK AI},
297
+ author = {{MediaStream AI Limited}},
298
+ year = {2026},
299
+ note = {Chunk 450, W2.7 mid-training checkpoint},
300
+ url = {https://huggingface.co/MediaStreamAI/MOTHER_CORE_V2}
301
+ }
302
+ ```
303
+
304
+ ---
305
+
306
+ ## 13. Contact
307
+
308
+ - Organisation: MediaStream AI Limited (MSAI)
309
+ - Founder & CEO: Christopher Kenna
310
+ - Web: https://mediastreamai.com
311
+ - Infrastructure: UK sovereign (Dundee, Durham, Manchester)