Archaea-74M-V1.1
The Return to Familiar Ground
Archaea-74M-V1.1 is a continuation of Archaea-74M, a 74 million parameter decoder-only language model built as part of my ongoing journey into training language models from scratch.
Unlike many model releases that begin with a completely new architecture, a larger parameter count, or a dramatic redesign, Archaea-74M-V1.1 started with a much simpler idea: What if I revisited something that already worked?
The original Archaea-74M was one of the earliest models I trained that genuinely felt like a language model rather than a machine learning experiment desperately trying to convince me it understood English. It wasn't perfect. It wasn't state-of-the-art. It certainly wasn't going to challenge frontier models. But it could generate coherent text, complete prompts, and hold together enough language structure to prove that building language models from scratch was actually possible. For a solo developer, that was an important milestone.
Over time, however, my attention shifted elsewhere. New datasets appeared. New architectures seemed interesting. New experiments promised bigger results. And like many developers before me, I became convinced that the next project would surely be the one that solved everything.
Reality, as usual, had other plans.
A Brief Detour Into Chaos
Following the original Archaea release, I spent months exploring increasingly ambitious projects. Some worked. Some taught valuable lessons. Some taught lessons so valuable that I would have preferred not to learn them.
One particular project eventually became known as Exp-1. What started as a routine experiment slowly transformed into a full-scale investigation into my own codebase. The model behaved strangely. The outputs looked suspicious. The benchmarks made just enough sense to be confusing. After far more debugging than I would like to admit, Exp-1 ultimately revealed a Transformer implementation bug that had quietly affected multiple experimental projects.
The experience answered many questions. It also raised several new ones. Most importantly, it left me wanting a temporary break from constantly chasing the next architecture. After enough debugging sessions, enough evaluation runs, and enough moments spent staring at Python files wondering how exactly I had managed to create these problems for myself, I decided to change direction.
Instead of building something entirely new, I returned to something familiar. I returned to Archaea. Not because it was my newest model. Not because it was my largest model. But because it was one of the projects that reminded me why I started building language models in the first place.
Archaea was never perfect. But it was solid. And sometimes the best way to move forward is not to start over. Sometimes the best way forward is to improve something that already works.
That decision became Archaea-74M-V1.1.
What Is Archaea-74M-V1.1?
Archaea-74M-V1.1 is a refined continuation of the original Archaea-74M model. The objectives were straightforward:
- Preserve the strengths of the original model.
- Improve overall language understanding.
- Improve knowledge-focused benchmark performance.
- Increase general capability without changing the underlying architecture.
This release is not a new model family. It is not a complete redesign. It is simply a better version of Archaea. Sometimes the most satisfying improvements come from refinement rather than reinvention.
Model Overview
| Attribute | Value |
|---|---|
| Model ID | GODELEV/Archaea-74M-V1.1 |
| Parameters | ~74 Million |
| Architecture | Decoder-only Transformer |
| Attention | Grouped Query Attention (GQA) |
| Context Length | 1024 tokens |
| Tokenizer | GPT-2 |
| Precision | BF16 |
| Framework | PyTorch + Transformers |
| License | MIT |
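A minimal sketch of loading the model for text generation, using the Model ID, precision, and context length from the table above. It assumes the checkpoint loads through the standard Transformers auto classes without custom code, which this card does not confirm. The helper names `max_new_tokens_for` and `generate` are hypothetical and exist only for illustration.

```python
MODEL_ID = "GODELEV/Archaea-74M-V1.1"  # Model ID from the overview table
CONTEXT_LENGTH = 1024                   # context length (tokens) from the overview table


def max_new_tokens_for(prompt_tokens: int, requested: int) -> int:
    """Clamp the generation budget so prompt + output fits the context window."""
    if prompt_tokens >= CONTEXT_LENGTH:
        raise ValueError("prompt already fills the context window")
    return min(requested, CONTEXT_LENGTH - prompt_tokens)


def generate(prompt: str, requested_new_tokens: int = 64) -> str:
    """Greedy-decode a continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so the clamp helper above works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repository ships weights and config that the auto
    # classes can load directly.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    budget = max_new_tokens_for(inputs["input_ids"].shape[1], requested_new_tokens)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=budget, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("The archaea are")` downloads the checkpoint on first use. Because this is a base model rather than an instruction-tuned one, it is best prompted with text to continue, not with questions or commands.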
Notable Improvements
The goal of V1.1 was not to chase a single benchmark. The goal was to improve the model's overall capability profile. Several evaluations showed measurable gains compared to the original Archaea-74M release.
| Benchmark | Archaea-74M | Archaea-74M-V1.1 | Gain (pp) |
|---|---|---|---|
| SciQ | 57.70% | 68.20% | +10.50 |
| OpenBookQA | 26.00% | 30.60% | +4.60 |
| ARC-Easy | 39.06% | 41.71% | +2.65 |
| ARC-Challenge | 22.70% | 24.23% | +1.53 |
| BLiMP | 74.91% | 76.45% | +1.54 |
| SWAG | 41.98% | 43.32% | +1.34 |
| PIQA | 58.54% | 59.25% | +0.71 |
The largest improvement was on SciQ, where the model gained 10.50 percentage points over the original release. OpenBookQA, ARC-Easy, ARC-Challenge, BLiMP, SWAG, and PIQA showed smaller gains, ranging from +0.71 (PIQA) to +4.60 (OpenBookQA). This suggests that the additional training helped the model make better use of knowledge already present within its parameters while also improving general language understanding.
As with every machine learning project ever created, some benchmarks were considerably more cooperative than others. The benchmark table and I have agreed not to discuss those topics further.
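The gains in the comparison table are plain differences in percentage points, not relative improvements. As a small sketch, the arithmetic can be reproduced from the two score columns. The function name `gain_points` is hypothetical, and the scores are copied from the table in this section.

```python
# Scores in percent, copied from the comparison table: (Archaea-74M, Archaea-74M-V1.1).
SCORES = {
    "SciQ": (57.70, 68.20),
    "OpenBookQA": (26.00, 30.60),
    "ARC-Easy": (39.06, 41.71),
    "ARC-Challenge": (22.70, 24.23),
    "BLiMP": (74.91, 76.45),
    "SWAG": (41.98, 43.32),
    "PIQA": (58.54, 59.25),
}


def gain_points(old: float, new: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(new - old, 2)


for name, (old, new) in SCORES.items():
    print(f"{name:<14} {old:6.2f} -> {new:6.2f}  {gain_points(old, new):+.2f} pp")
```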
Evaluation Results
All benchmarks were evaluated with the EleutherAI LM Evaluation Harness in a 0-shot setting.
| Benchmark | Metric | Score |
|---|---|---|
| BLiMP | acc | 76.45% |
| SciQ | acc_norm | 68.20% |
| PIQA | acc_norm | 59.25% |
| COPA | acc | 55.00% |
| WinoGrande | acc | 51.78% |
| BoolQ | acc | 47.74% |
| TruthfulQA MC2 | acc | 46.14% |
| SWAG | acc_norm | 43.32% |
| ARC-Easy | acc_norm | 41.71% |
| OpenBookQA | acc_norm | 30.60% |
| HellaSwag | acc_norm | 27.05% |
| RACE | acc | 25.36% |
| ARC-Challenge | acc_norm | 24.23% |
| MMLU | acc | 23.31% |
| CommonsenseQA | acc | 19.49% |
| LAMBADA | acc | 9.94% |
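A hedged sketch of how a 0-shot run like the one above might be set up through the harness's Python API (`lm_eval.simple_evaluate`). The task names, the batch size, and the helper names `build_eval_kwargs` and `run_eval` are assumptions for illustration. This card does not state the exact harness version, task variants, or settings that produced the table, so scores from this sketch may not match it exactly.

```python
MODEL_ID = "GODELEV/Archaea-74M-V1.1"

# Assumed harness task names for a subset of the benchmarks in the table.
TASKS = [
    "sciq", "piqa", "arc_easy", "arc_challenge",
    "openbookqa", "hellaswag", "winogrande", "boolq",
]


def build_eval_kwargs(tasks=TASKS, batch_size=8):
    """Keyword arguments for a 0-shot run with the Hugging Face backend."""
    return {
        "model": "hf",
        "model_args": f"pretrained={MODEL_ID},dtype=bfloat16",
        "tasks": list(tasks),
        "num_fewshot": 0,          # 0-shot, as stated above the table
        "batch_size": batch_size,  # assumption; not specified in this card
    }


def run_eval(**overrides):
    """Run the harness and return its per-task results dictionary."""
    import lm_eval  # imported lazily; requires the `lm_eval` package

    kwargs = {**build_eval_kwargs(), **overrides}
    return lm_eval.simple_evaluate(**kwargs)["results"]
```

Calling `run_eval()` downloads the model and the benchmark datasets, so it needs network access and can take a while on CPU.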
Lessons From Revisiting an Older Model
One of the unexpected advantages of returning to an older project is perspective. When Archaea-74M was originally trained, my primary objective was proving that I could build a language model from scratch.
Now the objective is different. Now I care about understanding why a model improves. Why a benchmark increases. Why some tasks respond immediately to additional training while others stubbornly refuse to move.
Archaea-74M-V1.1 is therefore more than another checkpoint. It represents experience gained through multiple projects, multiple experiments, multiple mistakes, and multiple lessons learned along the way. Fortunately, most of those lessons were learned by the models rather than me. At least that is what I choose to believe.
Future Work
While V1.1 represents a meaningful improvement over the original release, there is still considerable room for growth. Potential future directions include:
- Expanded training datasets
- Longer context lengths
- Instruction tuning
- Improved reasoning performance
- Larger successors within the Archaea family
- Additional benchmark coverage
The current long-term research strategy remains highly sophisticated and is summarized below:
Make model better.
Early results appear promising.
Final Thoughts
Archaea-74M was one of the first projects that convinced me that training language models from scratch was actually possible. Archaea-74M-V1.1 is a reminder that progress does not always come from starting over. Sometimes progress comes from returning to an older project with more experience, better tools, and a clearer understanding of what you are trying to achieve.
This release is the result of that philosophy. It is not a new beginning. It is not an ending. It is simply the next chapter in the story of Archaea.