Create Weird Training Idea: Embeddings First, Everything Else Later.html
Weird Training Idea: Embeddings First, Everything Else Later.html
ADDED
|
@@ -0,0 +1,184 @@
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="UTF-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 6 |
+
<title>A Weird Training Idea: Embeddings First, Everything Else Later | FMN-GPT - CompactAI</title>
|
| 7 |
+
<link rel="stylesheet" href="bluesheet.css">
|
| 8 |
+
<link rel="preconnect" href="https://fonts.googleapis.com">
|
| 9 |
+
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
| 10 |
+
<link href="https://fonts.googleapis.com/css2?family=Geist:wght@400;500;600;700&family=Geist+Mono&display=swap" rel="stylesheet">
|
| 11 |
+
<style>
|
| 12 |
+
:root {
|
| 13 |
+
--blue-900: #000000;
|
| 14 |
+
--blue-800: #0a0a0a;
|
| 15 |
+
--blue-700: #111111;
|
| 16 |
+
--blue-600: #1a1a1a;
|
| 17 |
+
--blue-500: #333333;
|
| 18 |
+
--blue-400: #555555;
|
| 19 |
+
--blue-300: #777777;
|
| 20 |
+
--blue-200: #888888;
|
| 21 |
+
--blue-100: #aaaaaa;
|
| 22 |
+
--white: #ffffff;
|
| 23 |
+
--white-soft: #f5f5f5;
|
| 24 |
+
--white-muted: #e0e0e0;
|
| 25 |
+
--grid-line: rgba(255, 255, 255, 0.03);
|
| 26 |
+
--grid-line-major: rgba(255, 255, 255, 0.06);
|
| 27 |
+
--accent: #ededed;
|
| 28 |
+
--accent-muted: #888888;
|
| 29 |
+
--font-sans: 'Geist', -apple-system, BlinkMacSystemFont, sans-serif;
|
| 30 |
+
--font-mono: 'Geist Mono', 'SF Mono', 'Fira Code', monospace;
|
| 31 |
+
--container-max: 1100px;
|
| 32 |
+
}
|
| 33 |
+
* { box-sizing: border-box; margin: 0; padding: 0; }
|
| 34 |
+
html { font-size: 16px; scroll-behavior: smooth; }
|
| 35 |
+
body { font-family: var(--font-sans); background: var(--blue-900); color: var(--white-muted); line-height: 1.7; -webkit-font-smoothing: antialiased; }
|
| 36 |
+
a { color: var(--white); text-decoration: none; transition: color 0.15s ease; }
|
| 37 |
+
a:hover { color: var(--accent); }
|
| 38 |
+
.container { max-width: var(--container-max); margin: 0 auto; padding: 0 24px; }
|
| 39 |
+
nav { position: fixed; top: 0; left: 0; right: 0; z-index: 100; background: rgba(0, 0, 0, 0.85); backdrop-filter: blur(12px); border-bottom: 1px solid var(--blue-600); padding: 16px 0; }
|
| 40 |
+
nav .container { display: flex; justify-content: space-between; align-items: center; }
|
| 41 |
+
.nav-brand { font-size: 18px; font-weight: 600; color: var(--white); display: flex; align-items: center; gap: 8px; }
|
| 42 |
+
.nav-brand span { color: var(--accent); }
|
| 43 |
+
.nav-links { display: flex; gap: 32px; }
|
| 44 |
+
.nav-links a { font-size: 14px; font-weight: 500; color: var(--blue-200); }
|
| 45 |
+
.nav-links a:hover { color: var(--white); }
|
| 46 |
+
.post { padding: 140px 0 80px; }
|
| 47 |
+
.post-back { display: inline-block; color: var(--blue-200); font-size: 14px; margin-bottom: 32px; }
|
| 48 |
+
.post-back:hover { color: var(--accent); }
|
| 49 |
+
.post-back::before { content: '← '; }
|
| 50 |
+
.post-meta { display: flex; gap: 12px; margin-bottom: 20px; }
|
| 51 |
+
.post-date { font-size: 13px; color: var(--blue-200); font-family: var(--font-mono); }
|
| 52 |
+
.post-tag { font-size: 11px; font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; color: var(--white); background: rgba(255, 255, 255, 0.08); padding: 4px 10px; border-radius: 4px; }
|
| 53 |
+
.post h1 { font-size: 36px; font-weight: 700; color: var(--white); margin-bottom: 32px; line-height: 1.2; letter-spacing: -0.02em; }
|
| 54 |
+
.post-body p { font-size: 17px; line-height: 1.8; margin-bottom: 24px; color: var(--blue-200); }
|
| 55 |
+
.post-body p:first-of-type { font-size: 20px; color: var(--white-muted); }
|
| 56 |
+
.post-body h2 { font-size: 24px; font-weight: 600; color: var(--white); margin: 48px 0 20px; }
|
| 57 |
+
.post-body blockquote { border-left: 3px solid var(--accent); padding: 20px 24px; margin: 32px 0; background: var(--blue-800); border-radius: 0 8px 8px 0; }
|
| 58 |
+
.post-body blockquote p { font-size: 16px; font-style: italic; color: var(--blue-200); margin: 0; }
|
| 59 |
+
.post-body hr { border: none; height: 1px; background: var(--blue-600); margin: 48px 0; }
|
| 60 |
+
.code-block { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; margin: 24px 0; font-family: var(--font-mono); font-size: 13px; overflow-x: auto; }
|
| 61 |
+
.code-block .comment { color: var(--blue-200); font-style: italic; display: block; margin-top: 4px; }
|
| 62 |
+
.stats-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; margin: 24px 0; }
|
| 63 |
+
.stat-card { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; text-align: center; }
|
| 64 |
+
.stat-card .number { font-size: 32px; font-weight: 700; color: var(--accent); font-family: var(--font-mono); }
|
| 65 |
+
.stat-card .label { font-size: 13px; color: var(--blue-200); margin-top: 8px; }
|
| 66 |
+
.post-footer { margin-top: 48px; padding-top: 32px; border-top: 1px solid var(--blue-600); }
|
| 67 |
+
.post-footer p { font-size: 14px; color: var(--blue-200); font-style: italic; margin: 0; }
|
| 68 |
+
footer { padding: 40px 0; background: var(--blue-800); border-top: 1px solid var(--blue-600); text-align: center; }
|
| 69 |
+
footer p { color: var(--blue-200); font-size: 14px; margin-bottom: 8px; }
|
| 70 |
+
footer a { color: var(--blue-200); }
|
| 71 |
+
footer a:hover { color: var(--accent); }
|
| 72 |
+
@media (max-width: 768px) { .post h1 { font-size: 28px; } .nav-links { display: none; } .stats-grid { grid-template-columns: 1fr; } }
|
| 73 |
+
|
| 74 |
+
</style>
|
| 75 |
+
|
| 76 |
+
</head>
|
| 77 |
+
<body>
|
| 78 |
+
<nav>
|
| 79 |
+
<div class="container">
|
| 80 |
+
<a href="index.html" class="nav-brand"><span>/</span>FMN-GPT</a>
|
| 81 |
+
<div class="nav-links">
|
| 82 |
+
<a href="blog.html">Blog</a>
|
| 83 |
+
<a href="status.html">Model Status</a>
|
| 84 |
+
<a href="https://huggingface.co/CompactAI-O" target="_blank">HuggingFace Org</a>
|
| 85 |
+
</div>
|
| 86 |
+
</div>
|
| 87 |
+
</nav>
|
| 88 |
+
<main>
|
| 89 |
+
<article class="post">
|
| 90 |
+
<div class="container">
|
| 91 |
+
<a href="blog.html" class="post-back">Back to Blog</a>
|
| 92 |
+
<header>
|
| 93 |
+
<div class="post-meta">
|
| 94 |
+
<span class="post-date">2026-05-15</span>
|
| 95 |
+
<span class="post-tag">Architecture Ideas</span>
|
| 96 |
+
</div>
|
| 97 |
+
<h1>A Weird Training Idea: Embeddings First, Everything Else Later</h1>
|
| 98 |
+
</header>
|
| 99 |
+
<div class="post-body">
|
| 100 |
+
<p>I have a weird idea. It might be stupid. It might be brilliant. I do not know yet. The idea is simple. Instead of training the entire model at once, train the embedding layers first. Then train the rest of the model to utilize those embeddings. Two stages. Two objectives. One hope.</p>
|
| 101 |
+
<blockquote>
|
| 102 |
+
<p>Weird ideas are the only ideas worth having. The standard approaches are well documented. The weird approaches are where progress hides.</p>
|
| 103 |
+
</blockquote>
|
| 104 |
+
<h2>The Core Concept</h2>
|
| 105 |
+
<p>Here is how it would work. Stage one takes 100 billion tokens and shoves them into the embedding layers. The embeddings learn to represent tokens in a rich, structured space. The rest of the model sits idle. Stage two freezes those embeddings. The rest of the model trains on another 100 billion tokens, learning to use the fixed embeddings as input.</p>
|
| 106 |
+
<div class="code-block">
|
| 107 |
+
<span class="comment"># Two-stage training in pseudocode</span><br>
|
| 108 |
+
# Stage 1: Train embeddings only<br>
|
| 109 |
+
embeddings = EmbeddingLayer(vocab_size, hidden_dim)<br>
|
| 110 |
+
for token_batch in dataset_100B:<br>
|
| 111 |
+
loss = embedding_objective(token_batch, embeddings)<br>
|
| 112 |
+
embeddings.update(loss)<br>
|
| 113 |
+
<br>
|
| 114 |
+
# Stage 2: Freeze embeddings, train the rest<br>
|
| 115 |
+
embeddings.freeze()<br>
|
| 116 |
+
model = Transformer(rest_of_architecture)<br>
|
| 117 |
+
for token_batch in dataset_100B_round2:<br>
|
| 118 |
+
embedded = embeddings(token_batch)<br>
|
| 119 |
+
loss = language_modeling_objective(embedded, model)<br>
|
| 120 |
+
model.update(loss)<br>
|
| 121 |
+
<span class="comment"># Simple. Weird. Possibly broken.</span>
|
| 122 |
+
</div>
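<p>A minimal runnable sketch of the pseudocode above, in pure Python. Everything specific here is an assumption for illustration: the post does not define the stage-one embedding objective, so this sketch uses a skip-gram-style task (predict the neighbouring tokens), and "the rest of the model" is reduced to a single softmax layer rather than a Transformer. The names sgd_epoch and two_stage_train are hypothetical.</p>

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_epoch(pairs, emb, head, lr, train_emb, train_head):
    """One SGD pass over (input, target) pairs; returns mean cross-entropy.

    Logits are head @ emb[input]. The two flags decide which parameters
    receive updates, which is how freezing is modelled in this toy.
    """
    total_loss = 0.0
    for x, y in pairs:
        h = emb[x][:]  # copy, so the head update uses the pre-update vector
        probs = softmax([sum(w * v for w, v in zip(row, h)) for row in head])
        total_loss += -math.log(probs[y])
        grad_logits = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]
        grad_h = [sum(grad_logits[k] * head[k][d] for k in range(len(head)))
                  for d in range(len(h))]
        if train_head:
            for k, row in enumerate(head):
                for d in range(len(h)):
                    row[d] -= lr * grad_logits[k] * h[d]
        if train_emb:
            for d in range(len(h)):
                emb[x][d] -= lr * grad_h[d]
    return total_loss / len(pairs)


def two_stage_train(tokens, vocab_size, dim=8, epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    emb = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

    # Stage 1 (assumed objective): predict the neighbours on both sides.
    # The context head exists only to train the embeddings and is discarded.
    context_head = [[0.0] * dim for _ in range(vocab_size)]
    window_pairs = [(tokens[i], tokens[j])
                    for i in range(len(tokens))
                    for j in (i - 1, i + 1)
                    if 0 <= j < len(tokens)]
    for _ in range(epochs):
        sgd_epoch(window_pairs, emb, context_head, lr, train_emb=True, train_head=True)

    # Snapshot, so callers can check that stage 2 leaves the embeddings alone.
    snapshot = [row[:] for row in emb]

    # Stage 2: embeddings frozen, a fresh head learns next-token prediction.
    lm_head = [[0.0] * dim for _ in range(vocab_size)]
    next_pairs = list(zip(tokens, tokens[1:]))
    losses = [sgd_epoch(next_pairs, emb, lm_head, lr, train_emb=False, train_head=True)
              for _ in range(epochs)]
    return emb, snapshot, losses
```

<p>Calling two_stage_train([0, 1, 2, 3] * 25, vocab_size=4) returns the embeddings, the stage-one snapshot, and the per-epoch stage-two losses. Comparing the first two confirms the freeze held. A real version would swap the softmax head for the full model and keep the same freeze-then-train structure.</p>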
|
| 123 |
+
<p>The intuition is that embeddings carry most of the semantic weight. If embeddings are well trained, the rest of the model has an easier job. It does not need to learn what tokens mean. It only needs to learn how to combine those meanings. That division of labor might accelerate learning. It might also create a bottleneck. Both outcomes are informative.</p>
|
| 124 |
+
<h2>Why This Might Help</h2>
|
| 125 |
+
<div class="stats-grid">
|
| 126 |
+
<div class="stat-card">
|
| 127 |
+
<div class="number">2</div>
|
| 128 |
+
<div class="label">Training Stages</div>
|
| 129 |
+
</div>
|
| 130 |
+
<div class="stat-card">
|
| 131 |
+
<div class="number">200B</div>
|
| 132 |
+
<div class="label">Total Tokens</div>
|
| 133 |
+
</div>
|
| 134 |
+
<div class="stat-card">
|
| 135 |
+
<div class="number">???</div>
|
| 136 |
+
<div class="label">Actual Benefit</div>
|
| 137 |
+
</div>
|
| 138 |
+
<div class="stat-card">
|
| 139 |
+
<div class="number">∞</div>
|
| 140 |
+
<div class="label">My Curiosity</div>
|
| 141 |
+
</div>
|
| 142 |
+
</div>
|
| 143 |
+
<p>Training embeddings separately could reduce interference. When all parameters update simultaneously, gradients compete. Embedding gradients might conflict with attention gradients. Separating the phases isolates the objectives. Each phase optimizes a clear target. That clarity might improve convergence.</p>
|
| 144 |
+
<p>Fixed embeddings could also stabilize training. The rest of the model sees consistent input representations. It does not need to adapt to shifting embeddings. That stability might allow deeper architectures or higher learning rates. It might also limit expressivity. The trade-off is the experiment.</p>
|
| 145 |
+
<h2>Potential Problems</h2>
|
| 146 |
+
<p>The embeddings might overfit to stage one objectives. They might learn representations that are optimal for embedding loss but suboptimal for language modeling. The rest of the model then inherits those limitations. It cannot request better embeddings. It must work with what it is given.</p>
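<p>One way to probe this risk before committing to a full run, sketched under assumptions: compare the gradient that the stage-one objective puts on an embedding vector with the gradient the language modeling objective would put on the same vector. A cosine similarity near -1 suggests the two objectives pull the embedding in opposite directions. The function name and the flat-list gradient format are hypothetical, and this is a diagnostic I am proposing, not something the experiment above already includes.</p>

```python
import math


def cosine_similarity(grad_a, grad_b):
    """Cosine of the angle between two gradient vectors (lists of floats).

    Returns 0.0 when either vector is all zeros, since no direction is defined.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```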
|
| 147 |
+
<p>There is also the risk of capacity mismatch. If embeddings are too expressive, the rest of the model might underutilize them. If embeddings are too constrained, the rest of the model might struggle to compensate. Finding the right balance requires tuning. Tuning requires time. Time is the resource I lack.</p>
|
| 148 |
+
<blockquote>
|
| 149 |
+
<p>Every architectural choice is a bet. This bet is that staged training reduces interference. The house always wins. I am placing the bet anyway.</p>
|
| 150 |
+
</blockquote>
|
| 151 |
+
<h2>How I Might Test It</h2>
|
| 152 |
+
<p>I would start small. A 1M-parameter model. A 500-token vocabulary. Two training stages of 10B tokens each. Compare against a baseline trained end-to-end on the same 20B tokens. Measure loss, perplexity, and downstream task performance. The difference, if any, will guide the next iteration.</p>
|
| 153 |
+
<div class="code-block">
|
| 154 |
+
<span class="comment"># Minimal experiment design</span><br>
|
| 155 |
+
Baseline: Train full model on 20B tokens<br>
|
| 156 |
+
Experiment: Stage 1 embeddings on 10B, Stage 2 rest on 10B<br>
|
| 157 |
+
Metrics: WikiText-2 loss, BLiMP accuracy, ARC-Easy score<br>
|
| 158 |
+
Decision: Keep the approach if metrics improve by 5%+<br>
|
| 159 |
+
<span class="comment"># Measure twice. Cut once. Iterate slowly.</span>
|
| 160 |
+
</div>
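<p>The decision rule can be sketched in a few lines. Two assumptions are baked in, because the design above leaves them open: "5%" is read as a relative improvement over the baseline, and every metric has to clear the bar. Loss counts as improved when it goes down, accuracy when it goes up. Perplexity is the exponential of the mean loss in nats. All function names are hypothetical.</p>

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity from mean cross-entropy measured in nats."""
    return math.exp(mean_loss_nats)


def relative_improvement(baseline, experiment, lower_is_better):
    """Fractional improvement over the baseline; positive means better."""
    if lower_is_better:
        return (baseline - experiment) / baseline
    return (experiment - baseline) / baseline


def keep_staged_training(results, threshold=0.05):
    """results maps metric name -> (baseline, experiment, lower_is_better).

    Assumed rule: keep the approach only if every metric improves by at
    least `threshold` relative to the baseline.
    """
    return all(
        relative_improvement(base, exp, lower) >= threshold
        for base, exp, lower in results.values()
    )


if __name__ == "__main__":
    # Made-up numbers, only to show the call shape. Not measured results.
    example = {
        "wikitext2_loss": (4.00, 3.70, True),
        "blimp_accuracy": (0.60, 0.64, False),
        "arc_easy_accuracy": (0.30, 0.31, False),
    }
    print(keep_staged_training(example))
```

<p>Requiring every metric to pass is the strict reading. Averaging the improvements or weighting the loss more heavily would be just as defensible, and the choice should be fixed before looking at results.</p>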
|
| 161 |
+
<p>If the staged approach helps at small scale, I would scale up. Larger vocabularies. More tokens. Deeper architectures. The principle remains the same. The implementation grows. The risk grows too. That is the nature of research.</p>
|
| 162 |
+
<h2>Why I Am Thinking About This</h2>
|
| 163 |
+
<p>I train tiny models. I care about efficiency. I care about making every parameter count. Embeddings are a large fraction of tiny model parameters. If embeddings can be trained more effectively, the whole model benefits. That potential motivates the idea.</p>
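<p>How large that fraction is depends on vocabulary size and hidden width, so it is worth computing for a given setup. A rough sketch: the embedding table has vocab_size × hidden_dim parameters, doubled if the output projection is not tied to it. The hidden width of 128 and the tied output layer in the comment below are assumptions for illustration, not settings taken from FMN-GPT.</p>

```python
def embedding_params(vocab_size, hidden_dim, tied_output=True):
    """Parameters in the token embedding table.

    If the output projection has its own weights (untied), count it too.
    """
    table = vocab_size * hidden_dim
    return table if tied_output else 2 * table


def embedding_fraction(vocab_size, hidden_dim, total_params, tied_output=True):
    """Share of the total parameter budget spent on embeddings."""
    return embedding_params(vocab_size, hidden_dim, tied_output) / total_params


# With a 500-token vocabulary, an assumed hidden_dim of 128, and a
# 1M-parameter budget, the tied embedding table holds 500 * 128 = 64,000
# parameters, i.e. 6.4% of the budget. At a 4,096-token vocabulary the same
# width gives 524,288 parameters, a little over half the budget.
```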
|
| 164 |
+
<p>I also like weird ideas. I like asking questions that might have simple answers. I like exploring concepts that might just work. This is how I learn. This is how I have fun. This is how I end up with blog posts about staged embedding training.</p>
|
| 165 |
+
<h2>Final Thoughts</h2>
|
| 166 |
+
<p>Train embeddings first. Freeze them. Train the rest to use them. The idea is simple. The implementation is uncertain. The outcome is unknown. That uncertainty is the point. That uncertainty is the opportunity.</p>
|
| 167 |
+
<p>I might test this idea. I might fail. I might learn something. That is the cycle. That is the process. That is how progress happens in my tiny corner of AI research.</p>
|
| 168 |
+
<p>If you have thoughts on this idea, let me know. If you have tried something similar, share your results. If you think this is a terrible idea, you are probably right. But terrible ideas sometimes lead to good ones. That is how innovation works. That is how I work.</p>
|
| 169 |
+
<hr>
|
| 170 |
+
</div>
|
| 171 |
+
<footer class="post-footer">
|
| 172 |
+
<p>Current status: Weird idea documented. Experiment design drafted. Risks acknowledged. Optimism maintained. Progress is weird. Curiosity is higher.</p>
|
| 173 |
+
</footer>
|
| 174 |
+
</div>
|
| 175 |
+
</article>
|
| 176 |
+
</main>
|
| 177 |
+
<footer>
|
| 178 |
+
<div class="container">
|
| 179 |
+
<p>Built with curiosity over compute</p>
|
| 180 |
+
<p>FMN-GPT by <a href="https://huggingface.co/CompactAI-O" target="_blank">CompactAI-O</a> | 2026</p>
|
| 181 |
+
</div>
|
| 182 |
+
</footer>
|
| 183 |
+
</body>
|
| 184 |
+
</html>
|