---
language: en
tags:
- causal-lm
- gqa
- rope
- swiglu
license: apache-2.0
---
# Exp-1 
> A cautionary tale about transformers, sleep deprivation, and the discovery that one missing residual connection can destroy several days/weeks of your life.
---
## A Poem Written During a Mental Breakdown

I built a brain of matrices to learn the human tongue,  
I fed it billions of tokens and watched it while it sung.  
But all it sang were months and days and scattered dictionary parts,  
I cried out to the GPU, “Why break my coder heart?”  

I tore my hair, I cursed the code, I blamed the learning rate,  
I thought the dataset was trash, I thought it was my fate.  
Then by sheer dumb luck I scrolled and saw the dreaded line,  
The FFN was stuffed inside the attention’s pure design.  

My models weren’t just acting dumb or trapped in slow decay,  
I’d built a headless feed-forward and thrown the heads away.

---
## Introduction
Welcome to **Exp-1**.

* The **“Exp”** stands for Experiment.
* The **“1”** stands for the first experiment that accidentally became a forensic investigation into my own incompetence.

Originally, this model was never supposed to be important. I am a solo developer and needed a small-scale training run on Kaggle to verify something extremely boring: **checkpoint recovery**.

That’s it. No revolutionary architecture. No grand research ambitions. No plans for world domination. I simply wanted proof that my training pipeline could survive Kaggle’s session timeouts and resume correctly after interruptions.

The plan was straightforward:
* Train a highly compressed transformer.
* Feed it roughly 6 billion tokens.
* Save checkpoints.
* Resume checkpoints.
* Verify nothing exploded.
* Move on with life.
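
The save/resume part of that plan can be sketched with nothing but the standard library. This is a toy illustration, not the actual Exp-1 pipeline: `save_checkpoint`, `load_checkpoint`, `train`, the JSON format, and the fake "weight" update are all hypothetical stand-ins for real model and optimizer state. The one idea worth stealing is the write-then-rename step, which keeps a session timeout from leaving a half-written checkpoint behind.

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    # Write to a temp file in the same directory, then rename atomically,
    # so a kill mid-write never leaves a truncated checkpoint at `path`.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    # Return the saved state, or a fresh one if no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)

def train(path, total_steps, save_every, stop_after=None):
    # Toy "training": the update depends only on the step number, so a
    # resumed run must reproduce an uninterrupted run exactly.
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["weight"] += 0.5 * (state["step"] + 1)
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulate the session being killed
    return state

with tempfile.TemporaryDirectory() as workdir:
    interrupted_path = os.path.join(workdir, "interrupted.json")
    # First session dies at step 7; the last checkpoint on disk is step 5.
    train(interrupted_path, total_steps=10, save_every=5, stop_after=7)
    # Second session resumes from step 5 and finishes the run.
    resumed = train(interrupted_path, total_steps=10, save_every=5)
    # Reference run that is never interrupted.
    uninterrupted = train(os.path.join(workdir, "clean.json"), total_steps=10, save_every=5)

print(resumed == uninterrupted)  # True
```

Comparing the resumed run against an uninterrupted one is the "verify nothing exploded" step: if the two final states differ, something that matters was left out of the checkpoint.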

Instead, Exp-1 became one of the most educational disasters I have ever experienced. For 23 consecutive hours, I babysat this run like a nervous parent watching a toddler approach a staircase. And somewhere around hour eighteen, I discovered a horrifying truth: **This model wasn’t merely underperforming. It was fundamentally broken.**

---
## The Dark Ages

Before Exp-1, there were two earlier models:
* **Ant-5M**
* **Ant-10M**

Both exhibited the exact same mysterious behavior. Training appeared healthy. Loss decreased. Gradients looked normal. Memory usage looked normal. Nothing crashed. Nothing exploded. Everything looked mathematically correct.

Yet, whenever I generated text, the models behaved as if they had suffered severe neurological damage. Instead of producing coherent language, they generated things like:

> *House School University Library Center Point March December November October July April February Player Number Game II III IV V VI*

The models clearly understood which words belonged together, but they had absolutely no idea how language worked.

* **Verbs?** Gone.
* **Grammar?** Missing.
* **Context?** Never heard of it.
* **Sentence structure?** Optional, apparently.

The models behaved like someone had dumped an entire dictionary onto the floor and asked them to organize it into piles. Surprisingly good at categorization; terrible at speaking.

---
## The Five Stages of Machine Learning Grief

1. **Stage 1: Denial** — *“The model is fine.”*
2. **Stage 2: Anger** — *“The dataset is garbage.”*
3. **Stage 3: Bargaining** — *“Maybe if I lower the learning rate.”*
4. **Stage 4: Depression** — *“I have wasted days.”*
5. **Stage 5: Reading Your Own Code** — *“Oh.”*

---
## The Discovery

During the Exp-1 run, I began investigating generation outputs more aggressively. Eventually, I found a single line hidden deep inside my Transformer block. A line that would later become known as...

### The Code of Absolute Shame
```python
return x + self.ffn(self.n2(x + self.attn(self.n1(x))))
```
At first glance, it appears harmless. It is not harmless. It is a crime scene. 

A standard transformer block should contain two independent residual pathways:
```text
Input ──> Attention ──> Residual Add ──> FFN ──> Residual Add ──> Output
```
What I accidentally built looked more like this, with the residual structure mangled beyond recognition:
```text
Input ──> Attention ──> FFN ──> Output
```
The attention output was never added to the residual stream on its own. It existed only as part of the FFN's input, so any sequence information the heads gathered had to survive a trip through the FFN before it could reach the next layer.
The FFN essentially hijacked the block. The attention heads still existed, computed things, consumed parameters, and ate up compute, but functionally, they were screaming into the void.
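
For comparison, here is a minimal framework-free sketch of both layouts side by side. It assumes a pre-norm block (the original line normalizes before attention), reuses the names `attn`, `ffn`, `n1`, and `n2` from the line of shame, and swaps real tensors for hypothetical scalar stand-ins. The exact fix applied to the codebase is not shown in this card, so treat `prenorm_block` as the textbook layout rather than the actual patch.

```python
def buggy_block(x, attn, ffn, n1, n2):
    # The original line: the attention output exists only inside the FFN's
    # input, and the final add uses the raw x instead of x + attention.
    return x + ffn(n2(x + attn(n1(x))))

def prenorm_block(x, attn, ffn, n1, n2):
    # Textbook pre-norm layout: each sublayer adds into the residual stream.
    h = x + attn(n1(x))      # first residual add (attention)
    return h + ffn(n2(h))    # second residual add (FFN)

# Hypothetical toy stand-ins: scalars instead of tensors.
identity = lambda v: v       # pretend the norms do nothing
toy_attn = lambda v: 10.0    # pretend attention contributes +10
zero_ffn = lambda v: 0.0     # an FFN that outputs nothing

buggy_out = buggy_block(1.0, toy_attn, zero_ffn, identity, identity)
fixed_out = prenorm_block(1.0, toy_attn, zero_ffn, identity, identity)

print(buggy_out)  # 1.0  -> the attention contribution is gone
print(fixed_out)  # 11.0 -> the attention contribution survives
```

With the FFN silenced, the buggy block returns its input untouched, while the textbook block still carries the attention output forward. That is the whole bug in two print statements.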

---
## What Exactly Was This Model Learning?

The funny part is that the model wasn’t completely useless. It learned *something*. Just not language. It learned **statistical neighborhoods**. It learned that:

* *January* belongs near *February*.
* *School* belongs near *University*.
* *Player* belongs near *Game*.
* *Roman numerals* belong near other *Roman numerals*.

Exp-1 became a very expensive semantic clustering machine. It could identify relationships, but it could not explain them. It could recognize concepts, but it could not communicate them. It was basically a giant autocomplete engine for word categories.

---

## The Benchmark Comedy Show

The most absurd part of this entire story is that I still evaluated the model. And somehow, it scored surprisingly well on several benchmarks. 
Remember: **This model effectively had a broken transformer block.** Yet, it still managed results that occasionally look respectable. Apparently, modern benchmarks are willing to tolerate more nonsense than I expected.

### Benchmark Results

| Benchmark | Metric | Score |
| :--- | :--- | :--- |
| **COPA** | acc | 0.6300 |
| **BoolQ** | acc | 0.6085 |
| **BLiMP** | acc | 0.5361 |
| **WinoGrande** | acc | 0.4996 |
| **TruthfulQA MC2** | acc | 0.4981 |
| **PIQA** | acc_norm | 0.4923 |
| **OpenBookQA** | acc_norm | 0.2980 |
| **ARC-Challenge** | acc_norm | 0.2755 |
| **ARC-Easy** | acc_norm | 0.2710 |
| **HellaSwag** | acc_norm | 0.2684 |
| **SWAG** | acc_norm | 0.2589 |
| **MMLU** | acc | 0.2416 |
| **RACE** | acc | 0.2239 |
| **CommonsenseQA** | acc | 0.2080 |
| **SciQ** | acc_norm | 0.2000 |
| **LAMBADA** | acc | 0.0000 |
| **WikiText-2** | word_perplexity | 39,301,278.79 |
| **WikiText-2** | byte_perplexity | 26.31 |
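
The two WikiText-2 rows look wildly different, but they can describe the same total loss at different granularities. The sketch below assumes both numbers were computed from the same summed negative log-likelihood, normalized by word count in one case and byte count in the other, which is how evaluation harnesses typically report them. Under that assumption, the ratio of their logarithms is just the average number of bytes per word.

```python
import math

word_perplexity = 39_301_278.79  # WikiText-2 row from the table
byte_perplexity = 26.31          # WikiText-2 row from the table

# Assumption: both values share one total negative log-likelihood (NLL):
#   word_perplexity = exp(NLL / n_words)
#   byte_perplexity = exp(NLL / n_bytes)
# Dividing the logs cancels NLL and leaves n_bytes / n_words.
bytes_per_word = math.log(word_perplexity) / math.log(byte_perplexity)

# The same loss expressed as bits per byte.
bits_per_byte = math.log2(byte_perplexity)

print(f"implied bytes per word: {bytes_per_word:.2f}")
print(f"bits per byte: {bits_per_byte:.2f}")
```

The implied ratio comes out a little over 5 bytes per word, which is plausible for English text, so the two rows are at least consistent with each other. The 39 million figure is the byte-level failure compounded over every byte of every word.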

### Observations

* **COPA: 63%** — Somehow capable of causal reasoning. Possibly by accident.
* **BoolQ: 60.85%** — Sounds respectable until you remember that always answering “yes” scores roughly 62% on this dataset.
* **WinoGrande: ~50%** — Equivalent to a very confident coin flip.
* **LAMBADA: 0.0%** — Finally, a benchmark that noticed something was wrong.
* **WikiText Word Perplexity: 39 Million** — The benchmark system politely informed me that my language model was not, in fact, a language model.
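
To put the accuracy rows in context, it helps to subtract what a uniform random guesser would score. The sketch below copies the scores from the results table. The `num_choices` values are my assumption about the usual multiple-choice setup for each benchmark (ARC, for example, has a handful of questions with a different number of options). TruthfulQA MC2, LAMBADA, and WikiText-2 are left out because they are not simple pick-one-of-k accuracy.

```python
# Scores copied from the results table.
scores = {
    "COPA": 0.6300, "BoolQ": 0.6085, "BLiMP": 0.5361, "WinoGrande": 0.4996,
    "PIQA": 0.4923, "OpenBookQA": 0.2980, "ARC-Challenge": 0.2755,
    "ARC-Easy": 0.2710, "HellaSwag": 0.2684, "SWAG": 0.2589, "MMLU": 0.2416,
    "RACE": 0.2239, "CommonsenseQA": 0.2080, "SciQ": 0.2000,
}

# Assumption: typical number of answer options per question.
num_choices = {
    "COPA": 2, "BoolQ": 2, "BLiMP": 2, "WinoGrande": 2, "PIQA": 2,
    "OpenBookQA": 4, "ARC-Challenge": 4, "ARC-Easy": 4, "HellaSwag": 4,
    "SWAG": 4, "MMLU": 4, "RACE": 4, "CommonsenseQA": 5, "SciQ": 4,
}

# Margin over a uniform random guesser (1 / number of options).
margins = {name: scores[name] - 1.0 / num_choices[name] for name in scores}

for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    chance = 1.0 / num_choices[name]
    print(f"{name:15s} score={scores[name]:.4f} chance={chance:.4f} margin={margin:+.4f}")
```

Under these assumptions, COPA has the largest margin (about +0.13) and SciQ the most negative (about -0.05), with most of the rest within a few points of zero. Uniform chance is only a rough baseline: on an imbalanced dataset, a majority-class guesser can beat it without understanding anything.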

---

## Scientific Conclusions

After extensive research, I can confidently report the following:

1. Residual connections are important.
2. Reading your own code is important.
3. Benchmarks can occasionally be terrifying.
4. Sleep is probably useful.
5. A decreasing loss curve does not guarantee intelligence.
6. The GPU is innocent.
7. The dataset was innocent.
8. The optimizer was innocent.
9. **The bug was guilty.**

---

## Why Exp-1 Matters

Despite everything, Exp-1 achieved every engineering objective:
* Checkpoint recovery works.
* Training resumes correctly.
* VRAM usage was optimized.
* The training pipeline became substantially cleaner.
* Several architectural weaknesses were identified and fixed.

Most importantly: **The transformer implementation is now actually a transformer.** Exp-1 transformed from a checkpoint validation run into one of the most valuable debugging sessions of my career.

---

## Legacy

Exp-1 will remain on the Hub permanently. Not because it is powerful. Not because it is useful. Not because it advances the state of the art. **But because every researcher deserves a monument to their worst mistake.**

Future models may become larger. Future models may become smarter. Future models may achieve meaningful benchmarks. But none of them will ever teach me as much as the model that spent 6 billion tokens learning how to sort words into piles.

### Final Verdict

* Exp-1 cannot speak.
* Exp-1 cannot reason.
* Exp-1 cannot write coherent English.

But Exp-1 did accomplish something remarkable. It successfully located the single line of code that had been quietly sabotaging an entire family of models. 

The bug is dead. The pipeline is stable. The checkpoints work. The transformer has functioning residual streams. And for the first time in months, **the attention heads are no longer screaming into the void.**

> *"Every researcher eventually trains a terrible model. The lucky ones discover why."*