---
language: en
tags:
- causal-lm
- gqa
- rope
- swiglu
license: apache-2.0
datasets:
- GODELEV/Archaea-5M-T
pipeline_tag: text-generation
---

# Ant-10M

Ant-10M is a 9.90-million-parameter, decoder-only, Llama-style transformer. It was designed, configured, and trained from scratch as a pure engineering sandbox. The project's primary objectives were to explore the empirical boundaries of Small Language Model (SLM) scaling laws, evaluate extreme tokenizer constraints, test ultra-compact hidden representation geometries, and validate training loop stability on highly constrained hardware.

This model is a direct technical continuation of its predecessor, Ant-5M. It implements structural changes intended to prevent the architectural collapse observed in that earlier iteration, and it probes what a sub-10M-parameter network can stabilize.

---

## Important Disclaimer and Evaluation Frame

**Ant-10M outputs absolute gibberish and possesses no semantic coherence, conversational capacity, structural grammar, or factual reasoning.**

When interacting with this model or interpreting its metrics, keep the following engineering constraints in mind:

1. **The Vocabulary Suffocation:** The model is trained with a highly restricted custom vocabulary of 4,096 tokens. This forces standard English text to be aggressively shattered into small character fragments and syllables during tokenization.
2. **Perplexity Interpretation Trap:** The low validation perplexity achieved during training (`12.57`) is a **per-token perplexity over sub-word fragments**, not a standard word-level perplexity. Because the tokenizer space is highly compressed, the model is optimizing over a narrow probability distribution of tiny token shards.
   Standard word-level evaluations (like WikiText-2) register enormous perplexity values (`88,520,100.69`) because word-level perplexity normalizes the total loss by the number of whitespace-delimited words, and each word here is spread across many token fragments whose losses accumulate.

This model is not a functional assistant. It is a mathematical log of a successful optimization and convergence experiment.

---

## Technical Architecture Specification

Ant-10M scales the internal hidden representation width of the network while maintaining an efficient attention execution path. It relies on a balanced width-to-depth ratio designed to maximize token processing speed on consumer-tier systems.

* **Total Parameters:** 9.90 Million (`9,902,464`)
* **Layers (`num_hidden_layers`):** 12
* **Hidden Size (`hidden_size`):** 256
* **Intermediate Size (`intermediate_size`):** 704
* **Attention Heads (`num_attention_heads`):** 4
* **Key-Value Heads (`num_key_value_heads`):** 2 (Grouped-Query Attention ratio of 2:1)
* **Head Dimension (`head_dim`):** 64
* **Max Sequence Length (`max_position_embeddings`):** 1,024 tokens
* **Vocabulary Size (`vocab_size`):** 4,096 (custom-trained BPE tokenizer)
* **Activation Function:** SiLU (SwiGLU variant without linear biases)
* **Positional Embeddings:** Rotary Position Embeddings (RoPE) with a native base frequency ($\theta$) of 10,000.0
* **Weight Tying:** `tie_word_embeddings: true` (the input embedding and the final output projection share the same weight matrix to save parameters)

---

## Hardware and Training Infrastructure Metadata

The model was successfully pre-trained in a single continuous session lasting **9.63 hours (approx. 10 hours)**.
* **Hardware Used:** 1x NVIDIA T4 GPU (16GB VRAM) via Kaggle Compute Engine
* **Tokens Seen:** 2,979,215,382 (~3 billion tokens)
* **Engine Velocity:** Steady throughput of **81,520 to 83,000 tokens per second**
* **Precision:** `torch.float16` Automatic Mixed Precision (AMP)
* **Optimization Framework:** AdamW optimizer with a cosine learning rate decay schedule and a linear warmup phase peaking at step 200 ($4.0 \times 10^{-4}$)

---

[Figure: Ant-10M Pre-training Metrics Summary]

---

## Training Dynamics and Convergence Curves

The training loop ran without gradient explosions, numerical underflow, or loss divergence. Training loss and validation loss tracked each other closely (within about 0.004 at every logged step below), indicating little overfitting across the 3-billion-token dataset.

| Metric | Step 80 (Initialization) | Step 200 (Warmup Peak) | Step 600 (Mid-Run) | Step 1200 (Final Convergence) |
| --- | --- | --- | --- | --- |
| **Training Loss** | 5.0837 | 3.8214 | 2.7231 | **2.5303** |
| **Validation Loss** | n/a | 3.8174 | 2.7217 | **2.5314** |
| **Token Perplexity** | 161.37 | 45.49 | 15.22 | **12.57** |
| **Learning Rate** | $2.46 \times 10^{-4}$ | $4.00 \times 10^{-4}$ | $2.31 \times 10^{-4}$ | $4.29 \times 10^{-5}$ |
| **Gradient Norm** | 2.0964 | 0.8142 | 0.4431 | 0.3189 |

---

## Downstream Benchmarks: A Comparative Post-Mortem

To understand the developmental step forward taken by Ant-10M, its zero-shot performance is compared below against its older sibling, [Ant-5M](https://huggingface.co/GODELEV/Ant-5M).

Ant-5M suffered a catastrophic structural collapse due to severe architectural imbalances: a microscopic hidden size (128) forced into an overly deep structure (11 layers), combined with an excessive Grouped-Query Attention bottleneck. This caused Ant-5M to trap itself in endless degenerate loops, constantly repeating single words like "Sciences" or URL punctuation. Ant-10M completely eliminates these degenerate loops.
However, because its vocabulary is still compressed to 4,096 tokens, it remains choked during standard language evaluations that rely on whole-word assemblies.

### Standard Language Benchmarks

| Benchmark Dataset | Metric Type | Ant-5M (The Catastrophe) | Ant-10M (One Step Ahead) |
| --- | --- | --- | --- |
| **ARC-Challenge** | `acc_norm` | 0.2442 (Below Random Guess) | **0.2747** (Above Random Guess) |
| **ARC-Easy** | `acc_norm` | 0.2319 | **0.2542** |
| **PIQA** | `acc_norm` | 0.4951 | **0.5032** |
| **WinoGrande** | `acc` | 0.4885 | **0.4964** |
| **MMLU** | `acc` | 0.2412 | **0.2543** |
| **SciQ** | `acc_norm` | 0.1980 | **0.2150** |
| **BoolQ** | `acc` | 0.3621 | **0.3782** |
| **HellaSwag** | `acc_norm` | 0.2514 | **0.2672** |
| **WikiText-2** | `byte_perplexity` | 48.91 | **30.62** |
| **WikiText-2** | `word_perplexity` | Run Crashed / Diverged | **88,520,100.69** (Token Splitting Artifact) |

### Mathematical Reasoning Evaluation: Arithmark-2.0

Arithmark-2.0 evaluates the latent computational capacity of tiny models by asking them to solve basic arithmetic strings containing varying numbers of operators. Because each multiple-choice question offers 4 options, the random baseline is 25.0%.

| Arithmark-2.0 Slice | Ant-5M Score | Ant-10M Score |
| --- | --- | --- |
| **Overall Accuracy** | 22.10% (Fails Floor) | **25.44%** (Crosses Floor) |
| **1 Operator (Easy)** | 23.40% | **26.40%** |
| **2 Operators (Medium)** | 21.90% | **26.93%** |
| **3 Operators (Hard)** | 20.10% | **20.80%** |

### Key Takeaways from the Data

* **ARC-Challenge Progression:** Ant-5M scored below the random multiple-choice baseline (25.0%). Ant-10M breaks past the baseline to reach **27.47%**, which suggests that widening the hidden dimension to 256 allowed the attention heads to map structural positioning signals instead of outputting repetitive tokens.
* **Arithmark Numerical Floor:** While Ant-5M stayed below the guessing baseline on every slice, Ant-10M cleared the 25% baseline on 1-operator and 2-operator strings. At 3 operators, tracking multi-step parenthesis tokens appears to exceed what the model's 256-dimensional hidden state can hold, and accuracy drops back to 20.80%.
* **Byte-Perplexity Improvement:** Compression performance on raw character patterns improved significantly, dropping from 48.91 to **30.62**, consistent with denser use of the 12 transformer layers.

---

## Verification and Weights Inspection

To verify the weights of Ant-10M, explore its layers, or inspect its token-fragment output distributions, use the standard Hugging Face Transformers API as shown below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GODELEV/Ant-10M"

# Load the custom tokenizer and model architecture
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Set up raw text input
prompt = "The basic principles of small language models require"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate using high repetition penalties to counter the narrow vocabulary space
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.5
    )

# Decode tokens back into structural text fragments
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Output Fragments:")
print(generated_text)
```
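The figures reported in this card can also be cross-checked by hand, without downloading any weights. The sketch below is not taken from the original training code. It assumes a standard bias-free Llama layout (q/k/v/o projections, a three-matrix SwiGLU MLP, two RMSNorm weight vectors per layer, one final norm, and tied embeddings) and tallies parameters from the architecture values listed above. It also converts the reported final validation loss into a perplexity. The function name `llama_param_count` and the variable names are illustrative, not part of any library.

```python
import math


def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads, head_dim,
                      tie_word_embeddings=True):
    """Tally weights for an assumed bias-free Llama-style decoder."""
    embed = vocab_size * hidden_size
    q_proj = hidden_size * num_heads * head_dim
    kv_proj = 2 * hidden_size * num_kv_heads * head_dim  # k_proj + v_proj (GQA)
    o_proj = num_heads * head_dim * hidden_size
    mlp = 3 * hidden_size * intermediate_size            # gate, up, down
    norms = 2 * hidden_size                              # two RMSNorms per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = hidden_size
    # With tied embeddings the output projection adds no new weights.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


# Architecture values copied from the specification in this card.
ant_10m_params = llama_param_count(
    vocab_size=4096, hidden_size=256, intermediate_size=704,
    num_layers=12, num_heads=4, num_kv_heads=2, head_dim=64,
)

# Token perplexity is exp(cross-entropy loss in nats).
final_val_perplexity = math.exp(2.5314)

print(f"Estimated parameters: {ant_10m_params:,}")
print(f"Perplexity from final validation loss: {final_val_perplexity:.2f}")
```

Under these assumptions the tally comes to 9,902,336, which is 128 parameters short of the 9,902,464 reported above; the card does not say which tensors account for the difference. The perplexity computed from the final validation loss rounds to 12.57, matching the reported token perplexity.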