	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- causal-lm
	- gpt
	- small-language-model
	- arithmetic
	- custom-tokenizer
	- custom-code
	- safetensors
	- lm-evaluation-harness
	datasets:
	- openbmb/Ultra-FineWeb
	- HuggingFaceFW/fineweb-edu
	- HuggingFaceTB/finemath
	- HuggingFaceTB/smollm-corpus
	---

	![bg](bg.png)

	# Atom2.7m

	Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.

	The main result is on [ArithMark 2.0](https://huggingface.co/datasets/AxiomicLabs/ArithMark-2.0), a 2,500-example integer-arithmetic continuation benchmark, where Atom2.7m scores 69.24% accuracy. That is above the published scores of SmolLM2-1.7B (66.12%) and Qwen2.5-0.5B (63.04%), with only 2.74M parameters.

	The result shows the leverage of domain-specific design. With arithmetic-aware tokenization and digit features, Atom2.7m reaches the same ArithMark score band as models hundreds of times larger.
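As a rough sanity check on the figures above, the sketch below converts the reported accuracies into correct-answer counts and computes parameter ratios. It assumes each model was scored on all 2,500 examples, and it uses the nominal sizes implied by the model names (1.7B and 0.5B), which are approximations rather than exact parameter counts.

```python
ARITHMARK_EXAMPLES = 2_500
ATOM_PARAMS = 2_738_880

# (reported ArithMark 2.0 accuracy, parameter count)
# The two baseline sizes are nominal values assumed from the model names.
models = {
    "Atom2.7m": (0.6924, ATOM_PARAMS),
    "SmolLM2-1.7B": (0.6612, 1_700_000_000),
    "Qwen2.5-0.5B": (0.6304, 500_000_000),
}

summary = {}
for name, (accuracy, params) in models.items():
    summary[name] = {
        # Number of examples answered correctly, assuming the full benchmark was used.
        "correct": round(accuracy * ARITHMARK_EXAMPLES),
        # How many times larger the model is than Atom2.7m.
        "size_ratio": params / ATOM_PARAMS,
    }
    print(f"{name}: {summary[name]['correct']} correct, "
          f"{summary[name]['size_ratio']:.1f}x Atom2.7m parameters")
```

Under these assumptions the baselines are roughly 180x and 620x the size of Atom2.7m.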

	## Model Details

	- Architecture: decoder-only GPT
	- Parameters: 2,738,880
	- Layers: 5
	- Hidden size: 192
	- Attention heads: 4
	- KV heads: 2
	- Attention: grouped-query causal self-attention with RoPE and XSA projection
	- Context length: 512
	- Vocabulary size: 4,096
	- Token embeddings: tied input/output embeddings
	- Arithmetic feature embeddings:
	  - `place_vocab_size`: 66
	  - `role_vocab_size`: 12
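A few derived quantities follow directly from the listed hyperparameters. The sketch below assumes the usual convention that the per-head dimension is the hidden size divided by the number of attention heads; the actual values live in `config.json` and the custom model code, which this example does not read.

```python
hidden_size = 192
num_heads = 4
num_kv_heads = 2
vocab_size = 4_096
total_params = 2_738_880

# Assumed convention: each attention head gets an equal slice of the hidden size.
head_dim = hidden_size // num_heads

# Grouped-query attention: several query heads share one key/value head.
queries_per_kv_head = num_heads // num_kv_heads

# Width of the key (or value) projection output under the assumption above.
kv_proj_width = num_kv_heads * head_dim

# Tied input/output embeddings are stored once.
tied_embedding_params = vocab_size * hidden_size
embedding_share = tied_embedding_params / total_params

print(f"head_dim={head_dim}, queries_per_kv_head={queries_per_kv_head}, "
      f"kv_proj_width={kv_proj_width}")
print(f"tied embedding table: {tied_embedding_params:,} parameters "
      f"({embedding_share:.1%} of the model)")
```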

	## Tokenizer

	Use this model with `trust_remote_code=True`. The submission includes an `AtomTokenizer` remote-code wrapper in `tokenization_atom.py` so standard Hugging Face callers can use `AutoTokenizer.from_pretrained(...)`.

	The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:

	- digits `0`-`9` are atomic and never BPE-merged
	- digit spans are emitted least-significant-digit first
	- `+ - * / = ( )` are isolated atomic tokens
	- whitespace is isolated from text
	- arithmetic feature IDs are derived by the model from token IDs at inference time
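The splitting rules above can be illustrated with a minimal pre-tokenization sketch. This is not the code in `tokenization_atom.py`: the function name `pretokenize` and the regular expression are hypothetical, each whitespace run is assumed to become one piece, and ordinary text is returned as-is instead of being passed on to byte-level BPE.

```python
import re

# One alternative per span type: ASCII digit runs, a single isolated operator,
# a whitespace run, or any other run of ordinary text.
_SPAN = re.compile(
    r"(?P<digits>[0-9]+)"
    r"|(?P<op>[+\-*/=()])"
    r"|(?P<space>\s+)"
    r"|(?P<text>[^0-9+\-*/=()\s]+)"
)


def pretokenize(text: str) -> list[str]:
    """Split text into pieces, emitting digit spans least-significant-digit first."""
    pieces: list[str] = []
    for match in _SPAN.finditer(text):
        if match.lastgroup == "digits":
            # One atomic piece per digit, in reversed order (units digit first).
            pieces.extend(reversed(match.group()))
        else:
            # Operators and whitespace stay isolated; ordinary text would
            # go to byte-level BPE in a real tokenizer.
            pieces.append(match.group())
    return pieces


print(pretokenize("12 + 34 ="))
```

In this sketch `12` is emitted as `2`, `1`, so digits of equal significance in the operands and in the result line up at the same offset within their spans.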

	Training and custom tooling may still pass aligned `place_ids` and `role_ids`, but generic inference and evaluation only need `input_ids` and `attention_mask`.
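To make the idea of deriving digit features from the token sequence concrete, here is a minimal sketch of one possible place-ID scheme. The scheme itself is an assumption for illustration (ID 0 for non-digit tokens, then 1, 2, 3, ... counting up within each digit span); the real model works on token IDs and also produces `role_ids`, which this sketch does not attempt.

```python
DIGITS = set("0123456789")


def derive_place_ids(tokens: list[str]) -> list[int]:
    """Assign a hypothetical place ID to each token.

    Digit tokens are numbered 1, 2, 3, ... within their span. Because spans are
    emitted least-significant-digit first, ID 1 is the units digit, ID 2 the
    tens digit, and so on. Every non-digit token gets ID 0 and resets the count.
    """
    place_ids: list[int] = []
    place = 0
    for token in tokens:
        place = place + 1 if token in DIGITS else 0
        place_ids.append(place)
    return place_ids


# "12 + 34 =" after least-significant-digit-first splitting.
tokens = ["2", "1", " ", "+", " ", "4", "3", " ", "="]
print(derive_place_ids(tokens))
```

Because the IDs depend only on the token sequence, a caller in this sketch would not need to supply them separately, which matches the note that generic inference only needs `input_ids` and `attention_mask`.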

	## Usage

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_dir = "."

	# Both the model and the tokenizer rely on custom code shipped with the checkpoint.
	model = AutoModelForCausalLM.from_pretrained(
	    model_dir,
	    trust_remote_code=True,
	).eval()
	tokenizer = AutoTokenizer.from_pretrained(
	    model_dir,
	    trust_remote_code=True,
	)

	text = "12 + 34 ="
	inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

	# Forward pass without gradient tracking.
	with torch.no_grad():
	    outputs = model(**inputs)
	```

	## Evaluation

	### ArithMark 2.0

	Use the included benchmark script:

	```bash
	python benchmark_fusion_arithmark.py \
	  --checkpoint . \
	  --data-path arithmark_2.0.jsonl \
	  --batch-size 64 \
	  --device cuda \
	  --output benchmark_results/fusion_arithmark_2.0_results.json
	```

	### lm-evaluation-harness

	For lm-evaluation-harness tasks, use the standard `hf` model with remote code enabled:

	```bash
	lm_eval \
	  --model hf \
	  --model_args pretrained=.,trust_remote_code=True,dtype=bfloat16,max_length=548 \
	  --tasks hellaswag,arc_easy,arc_challenge,piqa \
	  --device cuda:0 \
	  --batch_size auto:1 \
	  --output_path benchmark_results/lm_eval
	```

	`max_length=548` is passed to the lm-evaluation-harness wrapper so long
	multiple-choice continuations do not trip the harness assertion that a
	continuation must fit inside the model window. The tokenizer also advertises
	`model_max_length=548`, matching the longest sequence observed in this eval run.
	The checkpoint was trained with a 512-token context, but the RoPE
	implementation can score this slightly longer harness window. If a task
	variant contains longer continuations, set `max_length` to the longest
	sequence found, or reduce the batch size.
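One way to pick a `max_length` for a new task variant is to measure the longest context-plus-continuation sequence before running the harness. The sketch below is a standalone approximation, not part of lm-evaluation-harness: `longest_request_length` is a hypothetical helper, and `encode` stands in for any function that maps a string to token IDs.

```python
from typing import Callable, Iterable


def longest_request_length(
    requests: Iterable[tuple[str, str]],
    encode: Callable[[str], list[int]],
) -> int:
    """Return the largest token count over all (context, continuation) pairs.

    Each pair is encoded as one concatenated string, which approximates how a
    scored sequence is built from a context and its candidate continuation.
    """
    return max((len(encode(context + continuation)) for context, continuation in requests), default=0)


# Toy encoder for demonstration only: one token per UTF-8 byte.
def byte_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))


toy_requests = [
    ("Question: 2 + 2 =", " 4"),
    ("Question: what is ten plus five?", " fifteen"),
]
print(longest_request_length(toy_requests, byte_encode))
```

With the real tokenizer, `encode` could be something like `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`, and the result would be a candidate value for `max_length`.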

	## Results

	| Benchmark | Metric | Value |
	| --- | --- | ---: |
	| ArithMark 2.0 | acc | 0.6924 |
	| arc_challenge | acc_norm | 0.2099 |
	| arc_easy | acc_norm | 0.3161 |
	| hellaswag | acc_norm | 0.2701 |
	| piqa | acc_norm | 0.5299 |

	## Training Data

	The pretraining mixture targeted about 3.5B tokens:

	- Ultra-FineWeb: 900M
	- FineWeb-Edu: 900M
	- FineMath: 450M
	- Cosmopedia-v2: 337.5M
	- UltraData-Math-L2-preview: 337.5M
	- Ultra-FineWeb-L3-en-QA-Synthetic: 225M
	- Synthetic-Arithmetic: 350M

	Synthetic-Arithmetic is canonical integer equation data. The training curriculum is included as `pretraining_curriculum.json`.
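The per-source budgets listed above sum to the 3.5B-token target. The short sketch below checks that total and prints each source's share of the mixture; it reads the `M` suffix as millions of tokens and does not consult `pretraining_curriculum.json`.

```python
# Target token budgets per source, in millions of tokens (as listed above).
mixture_millions = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}

total_millions = sum(mixture_millions.values())

# Fraction of the whole mixture contributed by each source.
shares = {name: tokens / total_millions for name, tokens in mixture_millions.items()}

print(f"total: {total_millions / 1000:.1f}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Synthetic-Arithmetic accounts for 10% of the target mixture.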

	## Limitations

	- This is a very small model and should be treated as an experimental research artifact.
	- Use `trust_remote_code=True` so `AutoTokenizer` applies the digit-span transform.
	- Numeric text is represented least-significant-digit first internally.
	- Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.

	## Files

	- `model.safetensors`: model weights
	- `config.json`, `config.py`, `configuration_gpt.py`, `model.py`: model configuration and custom model code
	- `tokenizer.json`, `tokenization_atom.py`: tokenizer files and remote-code wrapper
	- `benchmark_fusion_arithmark.py`: ArithMark evaluation
	- `arithmark_2.0.jsonl`: local ArithMark 2.0 data for the standalone benchmark script
	- `pretraining_curriculum.json`: training curriculum

	## References / Design Influences

	- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - additive positional information in Transformer inputs
	- [Exclusive Self Attention](https://arxiv.org/abs/2603.09078) - related attention work on reducing self-position dominance in sequence modeling
	- [Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure](https://arxiv.org/abs/2405.20671) - coupling digit positions by arithmetic significance
	- [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/abs/2405.17399) - digit-position embeddings for arithmetic