---
language: en
tags:
- causal-lm
- gqa
- rope
- swiglu
license: apache-2.0
datasets:
- GODELEV/Archaea-5M-T
pipeline_tag: text-generation
---

# Ant-10M

Ant-10M is a 9.90-million-parameter, decoder-only, Llama-style transformer. It was designed, configured, and trained from scratch as a pure engineering sandbox. The project's primary objectives were to probe how scaling laws behave for Small Language Models (SLMs), evaluate an extremely small tokenizer vocabulary, test ultra-compact hidden representations, and validate training-loop stability on tightly constrained hardware.

This model is a direct technical continuation of its predecessor, Ant-5M. It makes structural changes intended to prevent the architectural collapse observed in that earlier iteration and tests how small a sub-10M-parameter network can be while still training stably.

---

## Important Disclaimer and Evaluation Frame

**Ant-10M outputs absolute gibberish and has no semantic coherence, conversational capacity, grammatical structure, or factual reasoning.**

When interacting with this model or interpreting its metrics, keep the following engineering constraints in mind:

1. **The Vocabulary Suffocation:** The model is trained with a highly restricted custom vocabulary of 4,096 tokens. This forces standard English text to be split aggressively into short character fragments and syllables during tokenization.
2. **Perplexity Interpretation Trap:** The low validation perplexity reached during training (`12.57`) is a **token-level perplexity**, not a word-level perplexity. Because the vocabulary is so small, each token covers only a few characters, so the average loss per token can be low even when the text as a whole is poorly modeled. Word-level evaluations (like WikiText-2) divide the same total loss by the number of words instead. Every word spans several tokens, so the per-word loss is much larger and the reported perplexity explodes (`88,520,100.69`).
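
The gap between the byte-level and word-level WikiText-2 figures reported later in this card can be checked with a few lines of arithmetic. The sketch below assumes both metrics are computed in the usual way, as the exponential of the total negative log-likelihood divided by a unit count (bytes or words) over the same text. Under that assumption, the ratio of the two log-perplexities is simply the average number of bytes per word.

```python
import math

# Figures reported in this card for WikiText-2.
byte_perplexity = 30.62
word_perplexity = 88_520_100.69

# Assumption: both metrics share one total negative log-likelihood (NLL)
# and differ only in the unit count used to normalize it.
nll_per_byte = math.log(byte_perplexity)
nll_per_word = math.log(word_perplexity)

# The ratio is the average number of bytes per word implied by the two figures.
implied_bytes_per_word = nll_per_word / nll_per_byte

print(f"NLL per byte: {nll_per_byte:.3f} nats")
print(f"NLL per word: {nll_per_word:.3f} nats")
print(f"Implied bytes per word: {implied_bytes_per_word:.2f}")
```

The implied ratio is a little over five bytes per word, which is a plausible word length for English text. In other words, the enormous word-level number is the same per-byte loss compounded over every byte of a word, not a separate failure.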

This model is not a functional assistant. It is a mathematical log of a successful optimization and convergence experiment.

---

## Technical Architecture Specification

Ant-10M scales the internal hidden representation width of the network while maintaining an efficient attention execution path. It relies on a balanced width-to-depth ratio designed to maximize token processing speed on consumer-tier systems.

* **Total Parameters:** 9.90 Million (`9,902,464`)
* **Layers (`num_hidden_layers`):** 12
* **Hidden Size (`hidden_size`):** 256
* **Intermediate Size (`intermediate_size`):** 704
* **Attention Heads (`num_attention_heads`):** 4
* **Key-Value Heads (`num_key_value_heads`):** 2 (Grouped-Query Attention ratio of 2:1)
* **Head Dimension (`head_dim`):** 64
* **Max Sequence Length (`max_position_embeddings`):** 1,024 tokens
* **Vocabulary Size (`vocab_size`):** 4,096 (Custom trained BPE tokenizer)
* **Activation Function:** SiLU (SwiGLU variant without linear biases)
* **Positional Embeddings:** Rotary Position Embeddings (RoPE) with a native base frequency ($\theta$) of 10,000.0
* **Weight Tying:** `tie_word_embeddings: true` (The input embedding and the final output projection share a single weight matrix, which saves parameters)
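
As a rough cross-check of the specification above, the parameter budget can be estimated from the listed configuration values. This is a sketch that assumes a standard Llama-style layout: bias-free projections, three SwiGLU matrices per layer, two RMSNorm weight vectors per layer, and one final norm. The exact layout of the released checkpoint is not spelled out here, so the estimate may differ slightly from the reported total.

```python
# Configuration values listed in this card.
vocab_size = 4096
hidden_size = 256
intermediate_size = 704
num_layers = 12
num_heads = 4
num_kv_heads = 2
head_dim = 64

# Token embedding; counted once because the output projection is tied to it.
embedding = vocab_size * hidden_size

# Attention projections without biases. With GQA, K and V use fewer heads than Q.
q_proj = hidden_size * num_heads * head_dim
k_proj = hidden_size * num_kv_heads * head_dim
v_proj = hidden_size * num_kv_heads * head_dim
o_proj = num_heads * head_dim * hidden_size
attention = q_proj + k_proj + v_proj + o_proj

# SwiGLU feed-forward: gate, up, and down projections, no biases.
mlp = 3 * hidden_size * intermediate_size

# Assumed layout: two RMSNorm weight vectors per layer plus one final norm.
norms_per_layer = 2 * hidden_size
final_norm = hidden_size

per_layer = attention + mlp + norms_per_layer
total = embedding + num_layers * per_layer + final_norm

print(f"embedding: {embedding:,}")
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the estimate comes out at about 9.90 million parameters, with the 12 transformer layers accounting for most of the budget and the tied embedding for roughly a tenth.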

---

## Hardware and Training Infrastructure Metadata

The model was successfully pre-trained in a single continuous session lasting **9.63 hours (approx. 10 hours)**.

* **Hardware Used:** 1x NVIDIA T4 GPU (16GB VRAM) via Kaggle Compute Engine
* **Tokens Seen:** 2,979,215,382 (~3 Billion tokens)
* **Throughput:** Steady **81,520 to 83,000 tokens per second**
* **Precision:** `torch.float16` Automatic Mixed Precision (AMP)
* **Optimization Framework:** AdamW Optimizer with a Cosine Learning Rate Decay Schedule and a linear warmup phase peaking at step 200 ($4.0 \times 10^{-4}$)
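
The shape of a linear-warmup, cosine-decay schedule can be sketched as follows. This is an illustration only: the function name `lr_at`, the warmup starting from zero, the decay horizon of 1,200 steps, and the floor of zero are all assumptions, since this card states only the peak value and the warmup length. The logged learning rates in the training table may therefore differ from what this sketch produces.

```python
import math

def lr_at(step, peak_lr=4.0e-4, warmup_steps=200, total_steps=1200, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (illustrative values)."""
    if step < warmup_steps:
        # Linear ramp from zero up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Half-cosine from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 100, 200, 700, 1200):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

In a PyTorch training loop, a function like this would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the base learning rate.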

---

<img src="graph.png" alt="Ant-10M Pre-training Metrics Summary" width="1000"/>

---

## Training Dynamics and Convergence Curves

The training run completed without gradient explosions, numerical underflow, or loss divergence. Training and validation loss stayed within 0.004 of each other at every logged step, which indicates no measurable overfitting over the roughly 3 billion tokens seen.

| Metric | Step 80 (Early Warmup) | Step 200 (Warmup Peak) | Step 600 (Mid-Run) | Step 1200 (Final Convergence) |
| --- | --- | --- | --- | --- |
| **Training Loss** | 5.0837 | 3.8214 | 2.7231 | **2.5303** |
| **Validation Loss** | — | 3.8174 | 2.7217 | **2.5314** |
| **Token Perplexity** | 161.37 | 45.49 | 15.22 | **12.57** |
| **Learning Rate** | $2.46 \times 10^{-4}$ | $4.00 \times 10^{-4}$ | $2.31 \times 10^{-4}$ | $4.29 \times 10^{-5}$ |
| **Gradient Norm** | 2.0964 | 0.8142 | 0.4431 | 0.3189 |
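
The token perplexity row is the exponential of the cross-entropy loss. The short check below reproduces three of the logged values, assuming the loss is measured in nats (the default for PyTorch cross-entropy) and that the validation loss is used wherever one was logged.

```python
import math

# (step, cross-entropy loss in nats, perplexity reported in the table)
rows = [
    (80, 5.0837, 161.37),    # training loss; no validation loss logged at this step
    (200, 3.8174, 45.49),    # validation loss
    (1200, 2.5314, 12.57),   # validation loss
]

for step, loss, reported in rows:
    perplexity = math.exp(loss)
    print(f"step {step:>4}: exp({loss}) = {perplexity:.2f} (reported {reported})")
```

Because the relationship is exponential, a small drop in loss late in training still moves perplexity noticeably.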

---

## Downstream Benchmarks: A Comparative Post-Mortem

To understand the developmental step forward taken by Ant-10M, its zero-shot performance is compared below against its older sibling, [Ant-5M](https://huggingface.co/GODELEV/Ant-5M).

Ant-5M suffered a catastrophic structural collapse caused by severe architectural imbalances: a very small hidden size (128) stretched over an overly deep stack (11 layers), combined with an excessive Grouped-Query Attention bottleneck. As a result, Ant-5M trapped itself in degenerate loops, endlessly repeating single words like "Sciences" or URL punctuation.

Ant-10M completely eliminates these degenerate loops. However, because its vocabulary is still compressed to 4,096 tokens, it continues to struggle on standard language evaluations that score whole words.

### Standard Language Benchmarks

| Benchmark Dataset | Metric Type | Ant-5M (The Catastrophe) | Ant-10M (One Step Ahead) |
| --- | --- | --- | --- |
| **ARC-Challenge** | `acc_norm` | 0.2442 (Below Random Guess) | **0.2747** (Above Random Guess) |
| **ARC-Easy** | `acc_norm` | 0.2319 | **0.2542** |
| **PIQA** | `acc_norm` | 0.4951 | **0.5032** |
| **WinoGrande** | `acc` | 0.4885 | **0.4964** |
| **MMLU** | `acc` | 0.2412 | **0.2543** |
| **SciQ** | `acc_norm` | 0.1980 | **0.2150** |
| **BoolQ** | `acc` | 0.3621 | **0.3782** |
| **HellaSwag** | `acc_norm` | 0.2514 | **0.2672** |
| **WikiText-2** | `byte_perplexity` | 48.91 | **30.62** |
| **WikiText-2** | `word_perplexity` | Run Crashed / Diverged | **88,520,100.69** (Token Splitting Artifact) |
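
To put the accuracy rows in context, each Ant-10M score can be compared with a random-guess level. The chance levels below are assumptions based on the usual number of answer options per task (four for ARC, MMLU, SciQ, and HellaSwag; two for PIQA, WinoGrande, and BoolQ). They are not values reported in this card, and a few ARC questions have a different number of options.

```python
# (benchmark, Ant-10M score from the table, assumed random-guess accuracy)
results = [
    ("ARC-Challenge", 0.2747, 0.25),
    ("ARC-Easy", 0.2542, 0.25),
    ("PIQA", 0.5032, 0.50),
    ("WinoGrande", 0.4964, 0.50),
    ("MMLU", 0.2543, 0.25),
    ("SciQ", 0.2150, 0.25),
    ("BoolQ", 0.3782, 0.50),
    ("HellaSwag", 0.2672, 0.25),
]

# Positive margin: above the assumed chance level; negative: below it.
margins = {name: score - chance for name, score, chance in results}
for name, margin in margins.items():
    print(f"{name:<14} {margin:+.4f}")
```

Margins of a percentage point or two in either direction are small enough that they should be read as "near chance" rather than as evidence of capability.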

### Mathematical Reasoning Evaluation: Arithmark-2.0

Arithmark-2.0 evaluates the latent computational capacity of tiny models by asking them to solve basic arithmetic expressions containing varying numbers of operators. Each question is multiple choice with 4 options, so the random-guess baseline is 25.0%.

| Arithmark-2.0 Slice | Ant-5M Score | Ant-10M Score |
| --- | --- | --- |
| **Overall Accuracy** | 22.10% (Fails Floor) | **25.44%** (Crosses Floor) |
| **1 Operator (Easy)** | 23.40% | **26.40%** |
| **2 Operators (Medium)** | 21.90% | **26.93%** |
| **3 Operators (Hard)** | 20.10% | **20.80%** |

### Key Takeaways from the Data

* **ARC-Challenge Progression:** Ant-5M scored below the random multiple-choice baseline (25.0%). Ant-10M moves past the baseline to reach **27.47%**. This is consistent with the wider hidden dimension (256) letting the model pick up some structural signal instead of emitting repetitive tokens, although the margin over chance is small.
* **Arithmark Numerical Floor:** Ant-5M stayed below the 25% guessing baseline on every slice, while Ant-10M clears it on 1-operator and 2-operator strings. On 3-operator strings accuracy drops back to 20.80%. A likely cause is that tracking multi-step parenthesis tokens exceeds what a 256-dimensional hidden state can represent.
* **Byte-Perplexity Improvement:** Byte-level perplexity on WikiText-2 improved substantially, dropping from 48.91 to **30.62**, which suggests that the 12-layer model captures raw character patterns better than its predecessor.

---

## Verification and Weights Inspection

To verify the weights of Ant-10M, explore its layers, or inspect its token-fragment outputs, load it with the standard Hugging Face Transformers `AutoModelForCausalLM` and `AutoTokenizer` classes as shown below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GODELEV/Ant-10M"

# Use float16 on GPU; fall back to float32 on CPU, where half precision
# can be slow or unsupported for some operators.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Load the custom tokenizer and model architecture
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)
model.eval()

# Set up raw text input
prompt = "The basic principles of small language models require"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample with a high repetition penalty to counter the narrow vocabulary space
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.5,
    )

# Decode tokens back into text fragments
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Output Fragments:")
print(generated_text)
```