# AnCoder-1.0B-Base
Anchored bidirectional diffusion language model built on Qwen3-0.6B.
- **Architecture**: 28 anchor layers + 28 denoiser layers, hidden-state ("hid") connection, all weights tied
- **Parameters**: 1.04B unique
- **Base model**: Qwen/Qwen3-0.6B
- **Training**: 50k steps of continued pretraining on token-packed streams (block_size=2048),
  uniform noise schedule, anchor_weight=0.1, all-position anchor supervision,
  shifted AR alignment (BOS-prepend trick on the Qwen3 lm_head); see the loss sketch after this list
- **Endpoint**: SWA over the last 5 saved checkpoints (steps 46k–50k, 1k stride)
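The training bullet names the key hyperparameters but not the objective itself. Below is a minimal sketch of one way the pieces could fit together, assuming a standard absorbing-state setup: denoiser (DLM) loss on the corrupted positions only, anchor loss on every position, and `anchor_weight` scaling the anchor term. The function name and the exact weighting are assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 151660      # <|fim_middle|> used as the absorbing mask token (see Usage below)
ANCHOR_WEIGHT = 0.1   # anchor_weight from the training configuration

def training_loss(model, tokens):
    """Hypothetical loss for one packed block of shape (B, 2048)."""
    # Uniform noise schedule: draw a masking rate t ~ U(0, 1) per sequence,
    # then corrupt each position independently with probability t.
    t = torch.rand(tokens.size(0), 1, device=tokens.device)
    masked = torch.rand_like(tokens, dtype=torch.float) < t
    noisy = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)

    out = model(input_ids=noisy)
    # Denoiser loss on the masked positions; anchor supervised at all positions.
    dlm_loss = F.cross_entropy(out.logits[masked], tokens[masked])
    anchor_loss = F.cross_entropy(out.anchor_logits.flatten(0, 1), tokens.flatten())
    return dlm_loss + ANCHOR_WEIGHT * anchor_loss
```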
## Training Progress (1000-entry SMA over final 5k steps, ending at 50k)
| Metric | Value |
|--------|-------|
| Loss | 1.1645 |
| DLM loss | 1.0037 |
| Anchor loss | 1.1015 |
| DLM accuracy | 60.84% |
| Anchor accuracy | 58.12% |
## Architecture
AnCoder uses an anchor-denoiser architecture for absorbing-state diffusion language modeling:
- **Anchor**: Full bidirectional Qwen3 (28 layers) processes masked input
- **Denoiser**: Full bidirectional Qwen3 (28 layers) refines anchor's hidden states
- **Connection**: Anchor hidden states passed directly to denoiser (hid mode)
- **Weight tying**: All embeddings and lm_heads share the same weight matrix
- **Shifted AR alignment**: BOS is prepended at forward time and the trailing
position is dropped before lm_head, so the AR-pretrained Qwen3 head operates
on its native "predict-next" alignment under the bidirectional diffusion loss.
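A hypothetical sketch of the forward path described by these bullets follows. The function and argument names (`ancoder_forward`, `anchor`, `denoiser`, `embed`, `lm_head`, `bos_id`) are illustrative, and the exact way BOS is injected is an assumption inferred from the description above.

```python
import torch

def ancoder_forward(anchor, denoiser, embed, lm_head, input_ids, bos_id):
    """Sketch of the anchor -> denoiser pass with shifted AR alignment."""
    # Prepend BOS at forward time: the sequence grows from length L to L+1.
    bos = torch.full_like(input_ids[:, :1], bos_id)
    x = embed(torch.cat([bos, input_ids], dim=1))

    h_anchor = anchor(x)             # 28 bidirectional Qwen3 layers over the masked input
    h_denoiser = denoiser(h_anchor)  # "hid" connection: anchor hidden states fed straight in

    # Drop the trailing position before the (tied) lm_head so position i scores
    # token i+1, matching the head's autoregressive pretraining alignment.
    anchor_logits = lm_head(h_anchor[:, :-1])
    logits = lm_head(h_denoiser[:, :-1])
    return logits, anchor_logits
```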
## SWA Endpoint
Stage-1 training reaches a plateau around step 25k. Training continues to 50k
to give the trajectory time to settle in the basin, and the last 5 saved
checkpoints (steps 46000, 47000, 48000, 49000, 50000) are then tail-averaged
to reduce noise in the final weights. The averaged model's tied weights are
deduplicated at save time, yielding a single ~1.87 GB safetensors file.
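For illustration, a tail average like this can be computed offline with `safetensors`. The checkpoint paths, output filename, and bf16 cast below are assumptions, not the actual export script.

```python
import torch
from safetensors.torch import load_file, save_file

# Hypothetical checkpoint layout; the real paths are not part of this card.
paths = [f"checkpoints/step_{s}/model.safetensors" for s in range(46000, 50001, 1000)]

# Running mean in fp32 over the five tail checkpoints.
avg = {}
for p in paths:
    for k, v in load_file(p).items():
        avg[k] = avg.get(k, 0) + v.float() / len(paths)

# Cast back down and write a single file (bf16 storage is assumed here).
save_file({k: v.to(torch.bfloat16) for k, v in avg.items()}, "ancoder-swa.safetensors")
```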
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("EER6/AnCoder-1.0B-Base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("EER6/AnCoder-1.0B-Base")
model.eval()

# For diffusion inference, pad short prompts with mask tokens (<|fim_middle|>, 151660)
# rather than the actual pad token (<|endoftext|>, 151643).
tokenizer.pad_token_id = 151660

inputs = tokenizer("def fibonacci(n):", return_tensors="pt",
                   padding="max_length", max_length=2048)
with torch.no_grad():
    outputs = model(**inputs)

outputs.logits         # (B, L, V) denoiser predictions
outputs.anchor_logits  # (B, L, V) anchor predictions
```
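The card does not specify a sampler; the call above is a single denoising pass. Purely as an illustration, a greedy confidence-based unmasking loop over the `<|fim_middle|>` positions might look like the following. This is an assumed sampler, not the authors' decoding procedure, and it ignores the attention mask for brevity.

```python
import torch

MASK_ID = 151660  # <|fim_middle|> treated as the mask token, per the comment above

@torch.no_grad()
def greedy_unmask(model, input_ids, steps=8):
    """Assumed sampler: repeatedly commit the most confident masked predictions."""
    ids = input_ids.clone()
    per_step = max(1, int((ids == MASK_ID).sum()) // steps)
    while True:
        mask = ids == MASK_ID
        if not mask.any():
            break
        logits = model(input_ids=ids).logits
        probs, preds = logits.softmax(-1).max(-1)  # (B, L) confidence and argmax
        conf = probs.masked_fill(~mask, -1.0)      # only consider masked slots
        k = min(per_step, int(mask.sum()))
        top = conf.view(-1).topk(k).indices        # flat topk; assumes batch size 1
        ids.view(-1)[top] = preds.reshape(-1)[top]
    return ids

# Example: fill the mask-padded prompt built in the Usage snippet above.
filled = greedy_unmask(model, inputs["input_ids"])
print(tokenizer.decode(filled[0], skip_special_tokens=True))
```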