| --- |
| license: apache-2.0 |
| language: |
| - en |
| pipeline_tag: text-generation |
| tags: |
| - causal-lm |
| - gpt |
| - small-language-model |
| - arithmetic |
| - custom-tokenizer |
| - custom-code |
| - safetensors |
| - lm-evaluation-harness |
| datasets: |
| - openbmb/Ultra-FineWeb |
| - HuggingFaceFW/fineweb-edu |
| - HuggingFaceTB/finemath |
| - HuggingFaceTB/smollm-corpus |
| --- |
|
|
| # Atom2.7m |
|
|
| Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path. |
|
|
| The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark, where Atom2.7m scores 69.24% accuracy. That is above the published scores of SmolLM2-1.7B (66.12%) and Qwen2.5-0.5B (63.04%), with only 2.74M parameters. |
|
|
| The result shows the leverage of domain-specific design: with arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models roughly 180 to 620 times larger by nominal parameter count. |
|
|
| ## Model Details |
|
|
| - Architecture: decoder-only GPT |
| - Parameters: 2,738,880 |
| - Layers: 5 |
| - Hidden size: 192 |
| - Attention heads: 4 |
| - KV heads: 2 |
| - Attention: grouped-query causal self-attention with RoPE and XSA projection |
| - Context length: 512 |
| - Vocabulary size: 4,096 |
| - Token embeddings: tied input/output embeddings |
| - Arithmetic feature embeddings: |
|   - `place_vocab_size`: 66 |
|   - `role_vocab_size`: 12 |
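
The dimensions listed above imply a few derived sizes. The sketch below is plain arithmetic under the assumption of a conventional grouped-query layout (head dimension equal to hidden size divided by the number of attention heads); the checkpoint's `model.py` is the authority and may differ, for example in how the XSA projection is shaped.

```python
# Illustrative arithmetic only; the input values come from the Model Details list.
hidden_size = 192
num_heads = 4
num_kv_heads = 2
vocab_size = 4096
total_params = 2_738_880

# Assumption: conventional GQA layout, not verified against model.py.
head_dim = hidden_size // num_heads              # per-head width
queries_per_kv = num_heads // num_kv_heads       # query heads sharing one KV head
q_proj_out = num_heads * head_dim                # query projection width
kv_proj_out = num_kv_heads * head_dim            # key (or value) projection width

# Tied input/output embeddings are counted once.
tied_embedding_params = vocab_size * hidden_size
embedding_share = tied_embedding_params / total_params

print(head_dim, queries_per_kv, q_proj_out, kv_proj_out)
print(tied_embedding_params, f"{embedding_share:.1%}")
```

Under these assumptions, the tied token embedding table alone accounts for a bit under a third of the parameter budget, which is one reason a 4,096-entry vocabulary suits a model of this size.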
|
|
| ## Tokenizer |
|
|
| Use this model with `trust_remote_code=True`. The submission includes an `AtomTokenizer` remote-code wrapper in `tokenization_atom.py` so standard Hugging Face callers can use `AutoTokenizer.from_pretrained(...)`. |
|
|
| The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially: |
|
|
| - digits `0`-`9` are atomic and never BPE-merged |
| - digit spans are emitted least-significant-digit first |
| - `+ - * / = ( )` are isolated atomic tokens |
| - whitespace is isolated from text |
| - arithmetic feature IDs are derived by the model from token IDs at inference time |
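
The span rules above can be sketched as a pre-segmentation step that runs before BPE. This is a minimal illustration, not the code in `tokenization_atom.py`: the function name `presegment`, the regular expression, and the choice to keep each whitespace run as a single piece are all assumptions made for the example.

```python
import re

# Hypothetical splitter: ASCII digit runs, the listed operators,
# whitespace runs, and everything else (left for ordinary BPE).
_SPAN = re.compile(r"[0-9]+|[+\-*/=()]|\s+|[^0-9+\-*/=()\s]+")


def presegment(text: str) -> list[str]:
    """Split text into pieces that follow the rules in the list above."""
    pieces: list[str] = []
    for span in _SPAN.findall(text):
        if span.isdigit():
            # One piece per digit, least-significant digit first.
            pieces.extend(reversed(span))
        else:
            # Operator, whitespace run, or ordinary text.
            pieces.append(span)
    return pieces


print(presegment("12 + 34 ="))
```

For `"12 + 34 ="` the digit spans come out as `2, 1` and `4, 3`, so the model sees the ones digit of each number before the tens digit, which matches the order in which carries propagate.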
|
|
| Training and custom tooling may still pass aligned `place_ids` and `role_ids`, but generic inference and evaluation only need `input_ids` and `attention_mask`. |
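
To make the idea of aligned `place_ids` concrete, here is one way such IDs could be derived from a piece sequence that is already in least-significant-first order. The numbering scheme is an assumption for illustration (0 for non-digit pieces, 1 upward within each digit run, capped at 65 so that IDs fit a 66-entry table); the released model derives its feature IDs from token IDs internally and may use a different scheme. Role IDs are not sketched because their definition is not given here.

```python
def derive_place_ids(pieces: list[str], max_place: int = 65) -> list[int]:
    """Assign a hypothetical place ID to each piece.

    0 marks a non-digit piece; 1, 2, 3, ... count digits inside a run,
    where 1 is the least-significant digit (pieces are LSD-first).
    """
    place_ids: list[int] = []
    run_position = 0
    for piece in pieces:
        if len(piece) == 1 and piece in "0123456789":
            # Continue the current digit run, clamping at max_place.
            run_position = min(run_position + 1, max_place)
            place_ids.append(run_position)
        else:
            # Any non-digit piece ends the current run.
            run_position = 0
            place_ids.append(0)
    return place_ids


pieces = ["2", "1", " ", "+", " ", "4", "3", " ", "="]  # "12 + 34 =" in LSD-first order
print(derive_place_ids(pieces))
```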
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_dir = "."  # directory containing the model files (here, the current directory) |
| |
| # trust_remote_code=True is needed for the custom model and tokenizer classes. |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_dir, |
|     trust_remote_code=True, |
| ).eval() |
| tokenizer = AutoTokenizer.from_pretrained( |
|     model_dir, |
|     trust_remote_code=True, |
| ) |
| |
| text = "12 + 34 =" |
| inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False) |
| |
| # Forward pass without gradient tracking. |
| with torch.no_grad(): |
|     outputs = model(**inputs) |
| ``` |
|
|
| ## Evaluation |
|
|
| ### ArithMark 2.0 |
|
|
| Use the included benchmark script: |
|
|
| ```bash |
| python benchmark_fusion_arithmark.py \ |
|   --checkpoint . \ |
|   --data-path arithmark_2.0.jsonl \ |
|   --batch-size 64 \ |
|   --device cuda \ |
|   --output benchmark_results/fusion_arithmark_2.0_results.json |
| ``` |
|
|
| ### lm-evaluation-harness |
|
|
| For lm-evaluation-harness tasks, use the standard `hf` model with remote code enabled: |
|
|
| ```bash |
| lm_eval \ |
|   --model hf \ |
|   --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \ |
|   --tasks hellaswag,arc_easy,arc_challenge,piqa \ |
|   --device cuda:0 \ |
|   --batch_size auto:1 \ |
|   --output_path benchmark_results/lm_eval |
| ``` |
|
|
| `max_length=548` is passed to the lm-evaluation-harness wrapper so that long |
| multiple-choice continuations do not trip the harness assertion that a |
| continuation must fit inside the model window. The tokenizer also advertises |
| `model_max_length=548`, which matches the longest sequence observed in this eval run. |
| The checkpoint was trained with a 512-token context, but the RoPE |
| implementation can score this slightly longer harness window. If a task variant |
| contains longer continuations, reduce the batch size or set `max_length` to the |
| longest sequence found. |
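
One way to find "the longest sequence found" for a task is to tokenize each context and continuation pair and take the maximum combined length. The helper below is a sketch: `longest_request_length` and `toy_encode` are hypothetical names, and the harness's own length accounting (special tokens, whitespace handling at the context boundary) may differ by a few tokens, so treat the result as a lower bound to round up from.

```python
from typing import Callable, Iterable


def longest_request_length(
    pairs: Iterable[tuple[str, str]],
    encode: Callable[[str], list[int]],
) -> int:
    """Return the largest len(encode(context)) + len(encode(continuation))."""
    longest = 0
    for context, continuation in pairs:
        longest = max(longest, len(encode(context)) + len(encode(continuation)))
    return longest


def toy_encode(text: str) -> list[int]:
    """Stand-in encoder for the demo: one ID per whitespace-separated word."""
    return list(range(len(text.split())))


demo_pairs = [("a b c", "d e"), ("a", "b")]
print(longest_request_length(demo_pairs, toy_encode))
```

With the real tokenizer loaded, the `encode` argument could be `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`.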
|
|
| ## Results |
|
|
| | Benchmark | Metric | Value | |
| | --- | --- | ---: | |
| | ArithMark 2.0 | acc | 0.6924 | |
| | arc_challenge | acc_norm | 0.2099 | |
| | arc_easy | acc_norm | 0.3161 | |
| | hellaswag | acc_norm | 0.2701 | |
| | piqa | acc_norm | 0.5299 | |
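
For a sense of the sampling uncertainty on the ArithMark figure, the accuracy can be converted back to a count and given a binomial standard error. This is a rough sketch that assumes the 2,500 examples are independent and uses the normal approximation; it is not part of the benchmark script.

```python
import math

n_examples = 2500   # ArithMark 2.0 size stated above
accuracy = 0.6924   # reported accuracy

# Number of correct answers implied by the reported accuracy.
n_correct = round(accuracy * n_examples)

# Binomial standard error and an approximate 95% interval.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_examples)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(n_correct, f"{std_err:.4f}", f"{low:.4f}", f"{high:.4f}")
```

The standard error comes out a little under one percentage point, which is the scale to keep in mind when comparing scores that are a few points apart.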
|
|
| ## Training Data |
|
|
| The pretraining mixture targeted about 3.5B tokens: |
|
|
| - Ultra-FineWeb: 900M |
| - FineWeb-Edu: 900M |
| - FineMath: 450M |
| - Cosmopedia-v2: 337.5M |
| - UltraData-Math-L2-preview: 337.5M |
| - Ultra-FineWeb-L3-en-QA-Synthetic: 225M |
| - Synthetic-Arithmetic: 350M |
|
|
| Synthetic-Arithmetic consists of canonical integer-equation data. The training curriculum is included as `pretraining_curriculum.json`. |
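
The per-source targets can be checked against the stated total and turned into mixture shares. The sketch below only restates the numbers from the list above; grouping FineMath, UltraData-Math-L2-preview, and Synthetic-Arithmetic as "math-oriented" is a labeling choice made for this example, not a category defined by the training curriculum.

```python
# Target token counts in millions, copied from the list above.
mixture_millions = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}

total_millions = sum(mixture_millions.values())
shares = {name: tokens / total_millions for name, tokens in mixture_millions.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Illustrative grouping (an assumption, see the note above).
math_sources = ("FineMath", "UltraData-Math-L2-preview", "Synthetic-Arithmetic")
math_share = sum(shares[name] for name in math_sources)
print(f"total: {total_millions:.0f}M, math-oriented share: {math_share:.1%}")
```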
|
|
| ## Limitations |
|
|
| - This is a very small model and should be treated as an experimental research artifact. |
| - Use `trust_remote_code=True` so `AutoTokenizer` applies the digit-span transform. |
| - Numeric text is represented least-significant-digit first internally. |
| - Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats. |
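
Because digits are stored least-significant first, any tooling that inspects the internal form directly may need to flip digit runs back before displaying them. The helper below is a hypothetical convenience, not part of the released files, and it is only relevant if the decode path you use returns digits in internal order; the `AtomTokenizer` wrapper may already restore the usual order on decode.

```python
import re


def restore_digit_order(text: str) -> str:
    """Reverse every run of ASCII digits, leaving all other characters untouched."""
    return re.sub(r"[0-9]+", lambda match: match.group(0)[::-1], text)


# Internal LSD-first form of "12 + 34 = 46".
print(restore_digit_order("21 + 43 = 64"))
```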
|
|
| ## Files |
|
|
| - `model.safetensors`: model weights |
| - `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: custom model code |
| - `tokenizer.json`, `tokenization_atom.py`: tokenizer files and remote-code wrapper |
| - `benchmark_fusion_arithmark.py`: ArithMark evaluation |
| - `arithmark_2.0.jsonl`: local ArithMark 2.0 data for the standalone benchmark script |
| - `pretraining_curriculum.json`: training curriculum |
|
|
| ## References / Design Influences |
|
|
| - [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - additive positional information in Transformer inputs |
| - [Exclusive Self Attention](https://arxiv.org/abs/2603.09078) - related attention work on reducing self-position dominance in sequence modeling |
| - [Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure](https://arxiv.org/abs/2405.20671) - coupling digit positions by arithmetic significance |
| - [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) - digit-position embeddings for arithmetic |
|
|