	---
	language: en
	tags:
	- causal-lm
	- gqa
	- rope
	- swiglu
	license: apache-2.0
	datasets:
	- GODELEV/Archaea-5M-T
	pipeline_tag: text-generation
	---

	# Ant-10M

	Ant-10M is a 9.90-million-parameter, decoder-only, Llama-style transformer. It was designed, configured, and trained from scratch as a pure engineering sandbox. The project had four objectives: explore the empirical limits of Small Language Model (SLM) scaling laws, evaluate an extremely constrained tokenizer, test ultra-compact hidden representation sizes, and validate training-loop stability on highly constrained hardware.

	This model serves as a direct technical continuation of its predecessor, Ant-5M, implementing critical structural changes to prevent the architectural collapse observed in that earlier iteration and pushing the boundaries of what a sub-10M parameter network can stabilize.

	---

	## Important Disclaimer and Evaluation Frame

	Ant-10M outputs absolute gibberish and possesses no semantic coherency, conversational capacity, structural grammar, or factual reasoning.

	When interacting with this model or interpreting its metrics, keep the following engineering constraints in mind:

	1. The Vocabulary Suffocation: The model is trained using a highly restricted custom vocabulary size of 4,096 tokens. This forces standard English text to be aggressively shattered into microscopic character fragments and syllables during tokenization.
	2. Perplexity Interpretation Trap: The low validation perplexity reached during training (`12.57`) is a token-level perplexity, not a word-level one. Because the vocabulary is so small, each prediction covers only a short fragment of a word, so the per-token loss understates how hard the text is to model. Word-level evaluations such as WikiText-2 divide the same total negative log-likelihood by the number of words rather than the number of tokens. With every word split into several fragments, the per-word loss is the sum of several per-token losses, and exponentiating it yields an enormous value (`88,520,100.69`).
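
The gap between the two WikiText-2 figures reported later in this card can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming both metrics come from a harness that defines byte perplexity as `exp(NLL / bytes)` and word perplexity as `exp(NLL / words)` over the same text; that definition is an assumption, not something this card states. Under it, the ratio of the two log-perplexities is simply the average number of bytes per word.

```python
import math

# Values copied from the WikiText-2 rows of the benchmark table.
byte_perplexity = 30.62
word_perplexity = 88_520_100.69

# Assumed definitions: both metrics exponentiate the same total negative
# log-likelihood (NLL), normalized by bytes or by words respectively.
nll_per_byte = math.log(byte_perplexity)
nll_per_word = math.log(word_perplexity)

# If the assumption holds, this ratio is the average bytes per word.
implied_bytes_per_word = nll_per_word / nll_per_byte

print(f"NLL per byte: {nll_per_byte:.3f} nats")
print(f"NLL per word: {nll_per_word:.3f} nats")
print(f"Implied bytes per word: {implied_bytes_per_word:.2f}")
```

The implied ratio is a little over 5 bytes per word, a plausible figure for English text. That suggests the huge word-level number is the byte-level result re-expressed per word rather than a separate failure.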

	This model is not a functional assistant. It is a mathematical log of a successful optimization and convergence experiment.

	---

	## Technical Architecture Specification

	Ant-10M scales the internal hidden representation width of the network while maintaining an efficient attention execution path. It relies on a balanced width-to-depth ratio designed to maximize token processing speed on consumer-tier systems.

	* Total Parameters: 9.90 Million (`9,902,464`)
	* Layers (`num_hidden_layers`): 12
	* Hidden Size (`hidden_size`): 256
	* Intermediate Size (`intermediate_size`): 704
	* Attention Heads (`num_attention_heads`): 4
	* Key-Value Heads (`num_key_value_heads`): 2 (Grouped-Query Attention ratio of 2:1)
	* Head Dimension (`head_dim`): 64
	* Max Sequence Length (`max_position_embeddings`): 1,024 tokens
	* Vocabulary Size (`vocab_size`): 4,096 (Custom trained BPE tokenizer)
	* Activation Function: SiLU (SwiGLU variant without linear biases)
	* Positional Embeddings: Rotary Position Embeddings (RoPE) with a native base frequency ($\theta$) of 10,000.0
	* Weight Tying: `tie_word_embeddings: true` (Input embedding and final output projection share an identical tensor matrix to optimize parameter allocation)
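
The listed hyperparameters are enough to estimate the parameter count by hand. The sketch below uses a hypothetical helper, `llama_param_count`, and assumes a standard Llama-style layout: bias-free linear projections, two RMSNorm weight vectors per layer plus one final norm, and a tied output head. None of these names come from the model's own code.

```python
# Hypothetical helper (not from the Ant-10M code base): estimates the
# parameter count of a Llama-style decoder from its hyperparameters.
# Assumes bias-free linear layers and RMSNorm layers with a weight vector only.

def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads, head_dim,
                      tie_word_embeddings=True):
    embed = vocab_size * hidden_size

    # Attention projections; K and V are smaller because of grouped-query attention.
    q_proj = hidden_size * num_heads * head_dim
    k_proj = hidden_size * num_kv_heads * head_dim
    v_proj = hidden_size * num_kv_heads * head_dim
    o_proj = num_heads * head_dim * hidden_size
    attention = q_proj + k_proj + v_proj + o_proj

    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size

    # Two RMSNorm weight vectors per layer, plus one final norm.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms
    final_norm = hidden_size

    # A tied output head reuses the embedding matrix and adds nothing.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


total = llama_param_count(
    vocab_size=4096, hidden_size=256, intermediate_size=704,
    num_layers=12, num_heads=4, num_kv_heads=2, head_dim=64,
)
untied = llama_param_count(4096, 256, 704, 12, 4, 2, 64,
                           tie_word_embeddings=False)
print(f"tied:   {total:,}")
print(f"untied: {untied:,}")
```

Under these assumptions the estimate is 9,902,336, which is 128 below the reported 9,902,464; the card does not say what accounts for the difference. Untying the embeddings would add another 1,048,576 parameters, roughly 10% of the model, which shows why weight tying matters at this scale.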

	---

	## Hardware and Training Infrastructure Metadata

	The model was successfully pre-trained in a single continuous session lasting 9.63 hours (approx. 10 hours).

	* Hardware Used: 1x NVIDIA T4 GPU (16GB VRAM) via Kaggle Compute Engine
	* Tokens Seen: 2,979,215,382 (~3 Billion tokens)
	* Engine Velocity: Steady operational throughput of 81,520 to 83,000 tokens per second
	* Precision: `torch.float16` Automatic Mixed Precision (AMP)
	* Optimization Framework: AdamW Optimizer with a Cosine Learning Rate Decay Schedule and a linear warmup phase peaking at step 200 ($4.0 \times 10^{-4}$)
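
As a rough consistency check, the token count and the throughput range above can be turned into an expected wall-clock time. This is a back-of-the-envelope sketch: it assumes throughput stayed inside the quoted range for the whole run and ignores evaluation and checkpointing pauses.

```python
# Figures copied from the list above.
tokens_seen = 2_979_215_382
throughput_range = (81_520, 83_000)  # tokens per second (low, high)


def hours_for(tokens, tokens_per_second):
    """Wall-clock hours needed to process `tokens` at a constant rate."""
    return tokens / tokens_per_second / 3600


slowest = hours_for(tokens_seen, throughput_range[0])
fastest = hours_for(tokens_seen, throughput_range[1])
print(f"Expected duration: {fastest:.2f} h to {slowest:.2f} h")
```

Both ends of the range land at roughly 10 hours, in line with the approximate figure quoted above.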

	---
	<img src="graph.png" alt="Ant-10M Pre-training Metrics Summary" width="1000"/>

	---

	## Training Dynamics and Convergence Curves

	The training loop ran without gradient explosions, numerical underflow, or loss divergence. Training and validation loss stayed within about 0.004 of each other at every logged checkpoint, which indicates no measurable overfitting across the roughly 3 billion tokens seen.

	| Metric | Step 80 (Initialization) | Step 200 (Warmup Peak) | Step 600 (Mid-Run) | Step 1200 (Final Convergence) |
	| --- | --- | --- | --- | --- |
	| Training Loss | 5.0837 | 3.8214 | 2.7231 | 2.5303 |
	| Validation Loss | n/a | 3.8174 | 2.7217 | 2.5314 |
	| Token Perplexity | 161.37 | 45.49 | 15.22 | 12.57 |
	| Learning Rate | $2.46 \times 10^{-4}$ | $4.00 \times 10^{-4}$ | $2.31 \times 10^{-4}$ | $4.29 \times 10^{-5}$ |
	| Gradient Norm | 2.0964 | 0.8142 | 0.4431 | 0.3189 |
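
The Token Perplexity row is the exponential of the cross-entropy loss, measured in nats per token. The sketch below recomputes it for three of the logged checkpoints. Which loss each perplexity pairs with (training loss at step 80, validation loss afterwards) is inferred from the numbers and is not stated in the card.

```python
import math

# (step, loss, reported perplexity) taken from the table above.
# The choice of training vs. validation loss is an inference, not a stated fact.
checkpoints = [
    (80, 5.0837, 161.37),   # training loss (no validation loss logged)
    (200, 3.8174, 45.49),   # validation loss
    (1200, 2.5314, 12.57),  # validation loss
]


def perplexity(loss_nats):
    """Token-level perplexity for a mean cross-entropy loss in nats."""
    return math.exp(loss_nats)


for step, loss, reported in checkpoints:
    print(f"step {step:>4}: exp({loss}) = {perplexity(loss):.2f} "
          f"(reported {reported})")
```

Because the relationship is exponential, the drop of about 1.29 nats between step 200 and step 1200 corresponds to a roughly 3.6x reduction in perplexity.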

	---

	## Downstream Benchmarks: A Comparative Post-Mortem

	To understand the developmental step forward taken by Ant-10M, its zero-shot performance is compared below against its older sibling, [Ant-5M](https://huggingface.co/GODELEV/Ant-5M).

	Ant-5M suffered a catastrophic structural collapse caused by severe architectural imbalances: a very small hidden size (128) stretched across an overly deep stack (11 layers), combined with an excessive Grouped-Query Attention bottleneck. As a result, Ant-5M trapped itself in endless degenerate loops, repeating single words such as "Sciences" or URL punctuation.

	Ant-10M completely eliminates these degenerate loops. However, because its vocabulary is still heavily compressed down to 4,096 tokens, it remains choked during standard language evaluations that rely on whole-word assemblies.

	### Standard Language Benchmarks

	| Benchmark Dataset | Metric Type | Ant-5M (The Catastrophe) | Ant-10M (One Step Ahead) |
	| --- | --- | --- | --- |
	| ARC-Challenge | `acc_norm` | 0.2442 (Below Random Guess) | 0.2747 (Above Random Guess) |
	| ARC-Easy | `acc_norm` | 0.2319 | 0.2542 |
	| PIQA | `acc_norm` | 0.4951 | 0.5032 |
	| WinoGrande | `acc` | 0.4885 | 0.4964 |
	| MMLU | `acc` | 0.2412 | 0.2543 |
	| SciQ | `acc_norm` | 0.1980 | 0.2150 |
	| BoolQ | `acc` | 0.3621 | 0.3782 |
	| HellaSwag | `acc_norm` | 0.2514 | 0.2672 |
	| WikiText-2 | `byte_perplexity` | 48.91 | 30.62 |
	| WikiText-2 | `word_perplexity` | Run Crashed / Diverged | 88,520,100.69 (Token Splitting Artifact) |

	### Mathematical Reasoning Evaluation: Arithmark-2.0

	Arithmark-2.0 evaluates the latent computational capacity of tiny models by asking them to solve basic arithmetic strings containing varying numbers of operators. Because each question offers 4 answer choices, the random-guess baseline is 25.0%.

	| Arithmark-2.0 Slice | Ant-5M Score | Ant-10M Score |
	| --- | --- | --- |
	| Overall Accuracy | 22.10% (Fails Floor) | 25.44% (Crosses Floor) |
	| 1 Operator (Easy) | 23.40% | 26.40% |
	| 2 Operators (Medium) | 21.90% | 26.93% |
	| 3 Operators (Hard) | 20.10% | 20.80% |

	### Key Takeaways from the Data

	* ARC-Challenge Progression: Ant-5M scored below the random multiple-choice baseline (25.0%). Ant-10M moves past it to 27.47%. This is consistent with the wider hidden dimension (256) letting the model pick up some positional and structural signal instead of emitting repetitive tokens, although a margin of roughly 2.5 points over chance is small and should be read cautiously.
	* Arithmark Numerical Floor: Ant-5M scored below the 25% guessing baseline on every slice. Ant-10M clears it on 1-operator and 2-operator strings (26.40% and 26.93%) but falls back to 20.80% at 3 operators, which suggests that tracking multi-step expressions exceeds what a 256-dimensional hidden state can support at this scale.
	* Byte-Perplexity Improvement: WikiText-2 byte perplexity fell from 48.91 to 30.62, a reduction of about 37%, indicating that the wider 12-layer network models raw character patterns better than Ant-5M did.
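
When a score sits only a few points above the guessing baseline, it helps to express the margin in standard errors. The sketch below uses a hypothetical helper, `z_score_vs_chance`, with a plain binomial approximation. The question count of 1,172 is an assumption: it is the size of the standard ARC-Challenge test split, and this card does not say which split was evaluated.

```python
import math


def z_score_vs_chance(accuracy, n_questions, chance=0.25):
    """Distance of `accuracy` from the chance rate, in binomial standard errors."""
    standard_error = math.sqrt(chance * (1 - chance) / n_questions)
    return (accuracy - chance) / standard_error


# Assumption: evaluated on the 1,172-question ARC-Challenge test split.
arc_challenge_n = 1172

z_ant_10m = z_score_vs_chance(0.2747, arc_challenge_n)
z_ant_5m = z_score_vs_chance(0.2442, arc_challenge_n)
print(f"Ant-10M: {z_ant_10m:+.2f} standard errors from chance")
print(f"Ant-5M:  {z_ant_5m:+.2f} standard errors from chance")
```

Under that assumption Ant-10M sits roughly two standard errors above chance, while Ant-5M is within half a standard error of it. The same helper can be applied to the Arithmark slices once their question counts are known.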

	---

	## Verification and Weights Inspection

	To verify the weights of Ant-10M, explore its layers, or inspect its token-fragment outputs, load it with the Hugging Face Transformers `AutoModelForCausalLM` and `AutoTokenizer` classes as shown below.

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "GODELEV/Ant-10M"

	# Load the custom tokenizer and model architecture
	# (device_map="auto" requires the `accelerate` package)
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	# Set up raw text input and move it to the model's device
	prompt = "The basic principles of small language models require"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	# Generate using high repetition penalties to counter the narrow vocabulary space
	with torch.no_grad():
	    outputs = model.generate(
	        **inputs,
	        max_new_tokens=32,
	        do_sample=True,
	        temperature=0.7,
	        top_p=0.9,
	        repetition_penalty=1.5,
	    )

	# Decode tokens back into structural text fragments
	generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print("Generated Output Fragments:")
	print(generated_text)
	```