Archaea-74M-V1.1

The Return to Familiar Ground

Archaea-74M-V1.1 is a continuation of Archaea-74M, a 74-million-parameter decoder-only language model built as part of my ongoing journey into training language models from scratch.

Unlike many model releases that begin with a completely new architecture, a larger parameter count, or a dramatic redesign, Archaea-74M-V1.1 started with a much simpler idea: What if I revisited something that already worked?

The original Archaea-74M was one of the earliest models I trained that genuinely felt like a language model rather than a machine learning experiment desperately trying to convince me it understood English. It wasn't perfect. It wasn't state-of-the-art. It certainly wasn't going to challenge frontier models. But it could generate coherent text, complete prompts, and hold together enough language structure to prove that building language models from scratch was actually possible. For a solo developer, that was an important milestone.

Over time, however, my attention shifted elsewhere. New datasets appeared. New architectures seemed interesting. New experiments promised bigger results. And like many developers before me, I became convinced that the next project would surely be the one that solved everything.

Reality, as usual, had other plans.

A Brief Detour Into Chaos

Following the original Archaea release, I spent months exploring increasingly ambitious projects. Some worked. Some taught valuable lessons. Some taught lessons so valuable that I would have preferred not to learn them.

One particular project eventually became known as Exp-1. What started as a routine experiment slowly transformed into a full-scale investigation into my own codebase. The model behaved strangely. The outputs looked suspicious. The benchmarks made just enough sense to be confusing. After far more debugging than I would like to admit, Exp-1 ultimately revealed a Transformer implementation bug that had quietly affected multiple experimental projects.

The experience answered many questions. It also raised several new ones. Most importantly, it left me wanting a temporary break from constantly chasing the next architecture. After enough debugging sessions, enough evaluation runs, and enough moments spent staring at Python files wondering how exactly I had managed to create these problems for myself, I decided to change direction.

Instead of building something entirely new, I returned to something familiar. I returned to Archaea. Not because it was my newest model. Not because it was my largest model. But because it was one of the projects that reminded me why I started building language models in the first place.

Archaea was never perfect. But it was solid. And sometimes the best way to move forward is not to start over. Sometimes the best way forward is to improve something that already works.

That decision became Archaea-74M-V1.1.

What Is Archaea-74M-V1.1?

Archaea-74M-V1.1 is a refined continuation of the original Archaea-74M model. The objectives were straightforward:

Preserve the strengths of the original model.
Improve overall language understanding.
Improve knowledge-focused benchmark performance.
Increase general capability without changing the underlying architecture.

This release is not a new model family. It is not a complete redesign. It is simply a better version of Archaea. Sometimes the most satisfying improvements come from refinement rather than reinvention.

Model Overview

Attribute	Value
Model ID	GODELEV/Archaea-74M-V1.1
Parameters	~74 Million
Architecture	Decoder-only Transformer
Attention	Grouped Query Attention (GQA)
Context Length	1024 tokens
Tokenizer	GPT-2
Precision	BF16
Framework	PyTorch + Transformers
License	MIT
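
For anyone who wants to try the model, here is a minimal loading sketch rather than a tested recipe. It assumes the repository loads through the standard Transformers `AutoModelForCausalLM` and `AutoTokenizer` classes; if the checkpoint ships custom modeling code, `trust_remote_code=True` may be required. The helper names `budget_new_tokens` and `generate` and the sampling settings are illustrative choices, not part of the release.

```python
MODEL_ID = "GODELEV/Archaea-74M-V1.1"  # from the Model Overview table
CONTEXT_LENGTH = 1024                   # from the Model Overview table


def budget_new_tokens(prompt_tokens: int, requested: int,
                      context_length: int = CONTEXT_LENGTH) -> int:
    """Cap generated tokens so that prompt + output fits in the context window."""
    remaining = max(context_length - prompt_tokens, 0)
    return min(requested, remaining)


def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Hypothetical helper: load the checkpoint and sample a continuation."""
    # Imported lazily so the budget helper above works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Assumption: the repo is compatible with the standard causal-LM auto class.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    n_new = budget_new_tokens(inputs["input_ids"].shape[1], max_new_tokens)
    if n_new == 0:
        raise ValueError("prompt already fills the 1024-token context")

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=n_new,
            do_sample=True,
            top_p=0.9,  # illustrative sampling setting
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizer has no pad token
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("The ocean is"))
```

The budget helper exists because the context length is 1024 tokens: a long prompt leaves less room for generated text, and asking for more than the remaining space would overrun the window.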

Notable Improvements

The goal of V1.1 was not to chase a single benchmark. The goal was to improve the model's overall capability profile. Seven benchmarks showed measurable gains over the original Archaea-74M release.

Benchmark	Archaea-74M	Archaea-74M-V1.1	Gain (percentage points)
SciQ	57.70%	68.20%	+10.50
OpenBookQA	26.00%	30.60%	+4.60
ARC-Easy	39.06%	41.71%	+2.65
ARC-Challenge	22.70%	24.23%	+1.53
BLiMP	74.91%	76.45%	+1.54
SWAG	41.98%	43.32%	+1.34
PIQA	58.54%	59.25%	+0.71

The most significant improvement appeared in SciQ, where the model gained 10.50 percentage points over the original release. OpenBookQA, ARC-Easy, ARC-Challenge, BLiMP, SWAG, and PIQA also improved, by smaller margins ranging from 0.71 to 4.60 points. This suggests that the additional training helped the model make better use of knowledge already present within its parameters while improving general language understanding.
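
The gains quoted above are plain differences between the two score columns, in percentage points. A short sketch that recomputes them from the table values, useful as a sanity check if the table is ever updated (the variable and function names are illustrative):

```python
# (Archaea-74M, Archaea-74M-V1.1) scores in percent, copied from the table above.
scores = {
    "SciQ": (57.70, 68.20),
    "OpenBookQA": (26.00, 30.60),
    "ARC-Easy": (39.06, 41.71),
    "ARC-Challenge": (22.70, 24.23),
    "BLiMP": (74.91, 76.45),
    "SWAG": (41.98, 43.32),
    "PIQA": (58.54, 59.25),
}


def compute_gains(table):
    """Return the absolute gain in percentage points for each benchmark."""
    return {name: round(new - old, 2) for name, (old, new) in table.items()}


gains = compute_gains(scores)

# Print benchmarks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{gain:+.2f}")
```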

As with every machine learning project ever created, some benchmarks were considerably more cooperative than others. The benchmark table and I have agreed not to discuss those topics further.

Evaluation Results

All benchmarks were evaluated with the EleutherAI LM Evaluation Harness in a 0-shot setting.

Benchmark	Metric	Score
BLiMP	acc	76.45%
SciQ	acc_norm	68.20%
PIQA	acc_norm	59.25%
COPA	acc	55.00%
WinoGrande	acc	51.78%
BoolQ	acc	47.74%
TruthfulQA MC2	acc	46.14%
SWAG	acc_norm	43.32%
ARC-Easy	acc_norm	41.71%
OpenBookQA	acc_norm	30.60%
HellaSwag	acc_norm	27.05%
RACE	acc	25.36%
ARC-Challenge	acc_norm	24.23%
MMLU	acc	23.31%
CommonsenseQA	acc	19.49%
LAMBADA	acc	9.94%
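
Raw accuracy is easier to read next to a random-guessing baseline. The sketch below compares each multiple-choice score with uniform guessing, assuming the usual number of answer options per task (for example, ARC items mostly have four options, though a few have three or five). The option counts are my assumptions, not values reported by the harness. TruthfulQA MC2 and LAMBADA are left out because a simple guessing baseline does not apply to them.

```python
# benchmark: (score in percent from the table above, assumed number of options)
results = {
    "BLiMP": (76.45, 2),
    "SciQ": (68.20, 4),
    "PIQA": (59.25, 2),
    "COPA": (55.00, 2),
    "WinoGrande": (51.78, 2),
    "BoolQ": (47.74, 2),
    "SWAG": (43.32, 4),
    "ARC-Easy": (41.71, 4),
    "OpenBookQA": (30.60, 4),
    "HellaSwag": (27.05, 4),
    "RACE": (25.36, 4),
    "ARC-Challenge": (24.23, 4),
    "MMLU": (23.31, 4),
    "CommonsenseQA": (19.49, 5),
}


def margin_over_chance(score: float, options: int) -> float:
    """Score minus the uniform-guessing baseline, in percentage points."""
    return round(score - 100.0 / options, 2)


margins = {name: margin_over_chance(s, k) for name, (s, k) in results.items()}

# Print benchmarks from largest to smallest margin over chance.
for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    score, options = results[name]
    print(f"{name:<14}{score:6.2f}  chance {100.0 / options:5.1f}  margin {margin:+.2f}")
```

A margin near zero means the score is hard to distinguish from guessing under these assumptions, which is a useful caveat when reading the lower half of the table.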

Lessons From Revisiting an Older Model

One of the unexpected advantages of returning to an older project is perspective. When Archaea-74M was originally trained, my primary objective was proving that I could build a language model from scratch.

Now the objective is different. Now I care about understanding why a model improves. Why a benchmark increases. Why some tasks respond immediately to additional training while others stubbornly refuse to move.

Archaea-74M-V1.1 is therefore more than another checkpoint. It represents experience gained through multiple projects, multiple experiments, multiple mistakes, and multiple lessons learned along the way. Fortunately, most of those lessons were learned by the models rather than me. At least that is what I choose to believe.

Future Work

While V1.1 represents a meaningful improvement over the original release, there is still considerable room for growth. Potential future directions include:

Expanded training datasets
Longer context lengths
Instruction tuning
Improved reasoning performance
Larger successors within the Archaea family
Additional benchmark coverage

The current long-term research strategy remains highly sophisticated and is summarized below:

Make model better.

Early results appear promising.

Final Thoughts

Archaea-74M was one of the first projects that convinced me that training language models from scratch was actually possible. Archaea-74M-V1.1 is a reminder that progress does not always come from starting over. Sometimes progress comes from returning to an older project with more experience, better tools, and a clearer understanding of what you are trying to achieve.

This release is the result of that philosophy. It is not a new beginning. It is not an ending. It is simply the next chapter in the story of Archaea.
