# AnCoder-1.0B-Base

Anchored bidirectional diffusion language model built on Qwen3-0.6B.

- **Architecture**: 28 anchor layers + 28 denoiser layers, hidden-state (hid) connection, all weights tied
- **Parameters**: 1.04B unique (counted after weight tying)
- **Base model**: Qwen/Qwen3-0.6B
- **Training**: 50k steps of continued pretraining on token-packed streams (block_size=2048),
  uniform noise schedule, anchor_weight=0.1, all-position anchor supervision,
  shifted AR alignment (BOS-prepend trick on the Qwen3 lm_head); see the loss sketch after this list
- **Endpoint**: SWA over the last 5 saved checkpoints (steps 46k–50k, 1k stride)
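
The training bullet compresses several choices, so here is a minimal sketch of the objective it implies: a mask rate drawn uniformly per sequence, denoiser loss on the masked positions, anchor loss on all positions scaled by anchor_weight. The `model(noisy)` call returning a `(logits, anchor_logits)` pair, and every other name here, is an illustrative assumption rather than the repo's actual training code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 151660  # <|fim_middle|>, used as the absorbing mask token

def diffusion_step_loss(model, input_ids, anchor_weight=0.1):
    # Uniform noise schedule: draw a mask rate t ~ U(0, 1) per sequence.
    t = torch.rand(input_ids.size(0), 1, device=input_ids.device)
    corrupt = torch.rand(input_ids.shape, device=input_ids.device) < t
    noisy = input_ids.masked_fill(corrupt, MASK_ID)

    logits, anchor_logits = model(noisy)  # assumed to return a (B, L, V) pair

    # Denoiser is supervised on the masked (absorbed) positions.
    dlm_loss = F.cross_entropy(logits[corrupt], input_ids[corrupt])
    # Anchor is supervised at all positions, down-weighted by anchor_weight.
    anchor_loss = F.cross_entropy(anchor_logits.flatten(0, 1), input_ids.flatten())
    return dlm_loss + anchor_weight * anchor_loss
```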

## Training Progress (1000-entry SMA over final 5k steps, ending at 50k)

| | Metric | Value | |
| |--------|-------| |
| | Loss | 1.1645 | |
| | DLM loss | 1.0037 | |
| | Anchor loss | 1.1015 | |
| | DLM accuracy | 60.84% | |
| | Anchor accuracy | 58.12% | |

## Architecture

AnCoder uses an anchor-denoiser architecture for absorbing-state diffusion language modeling:

- **Anchor**: Full bidirectional Qwen3 stack (28 layers) processes the masked input
- **Denoiser**: A second full bidirectional Qwen3 stack (28 layers) refines the anchor's hidden states
- **Connection**: Anchor hidden states are passed directly to the denoiser (hid mode)
- **Weight tying**: All embeddings and lm_heads share the same weight matrix
- **Shifted AR alignment**: BOS is prepended at forward time and the trailing
  position is dropped before the lm_head, so the AR-pretrained Qwen3 head operates
  on its native "predict-next" alignment under the bidirectional diffusion loss (sketched below)
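
To make the anchor→denoiser dataflow and the shifted AR alignment concrete, here is a self-contained sketch with toy 2-layer backbones. The class name, the tiny encoders, and the exact plumbing are illustrative assumptions, not the released module layout.

```python
import torch
import torch.nn as nn

class AnchorDenoiserSketch(nn.Module):
    """Illustrative anchor-denoiser stack with a tied lm_head."""

    def __init__(self, anchor, denoiser, embed: nn.Embedding, bos_id: int):
        super().__init__()
        self.anchor, self.denoiser, self.embed = anchor, denoiser, embed
        # Weight tying: the lm_head shares the embedding matrix.
        self.lm_head = nn.Linear(embed.embedding_dim, embed.num_embeddings, bias=False)
        self.lm_head.weight = embed.weight
        self.bos_id = bos_id

    def forward(self, input_ids):
        # Shifted AR alignment: prepend BOS so the AR-pretrained head keeps its
        # native "position i predicts token i+1" layout.
        bos = torch.full_like(input_ids[:, :1], self.bos_id)
        x = self.embed(torch.cat([bos, input_ids], dim=1))
        h_anchor = self.anchor(x)    # bidirectional pass over the masked input
        h = self.denoiser(h_anchor)  # 'hid' connection: hidden states passed directly
        # Drop the trailing position so logits[:, i] scores input_ids[:, i].
        return self.lm_head(h[:, :-1]), self.lm_head(h_anchor[:, :-1])

def tiny_encoder(d=64):
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)  # no causal mask: bidirectional

embed = nn.Embedding(1000, 64)
model = AnchorDenoiserSketch(tiny_encoder(), tiny_encoder(), embed, bos_id=0)
logits, anchor_logits = model(torch.randint(1, 1000, (2, 16)))  # each (2, 16, 1000)
```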

## SWA Endpoint

Stage-1 training plateaus around step 25k. Training nonetheless continues to 50k
to let the trajectory settle into the basin, and the last 5 saved checkpoints
(steps 46000, 47000, 48000, 49000, 50000) are then tail-averaged to reduce noise
in the final weights. The averaged model's tied weights are deduplicated at save
time, yielding a single ~1.87 GB safetensors file.
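
The tail average itself is plain uniform checkpoint averaging. A minimal sketch, assuming `.pt` state dicts at hypothetical paths (the real artifacts are safetensors):

```python
import torch

# Hypothetical checkpoint layout; adjust paths/format to the actual saves.
paths = [f"checkpoint-{step}/model.pt" for step in range(46000, 50001, 1000)]
states = [torch.load(p, map_location="cpu") for p in paths]

# Uniform average of every tensor across the 5 checkpoints.
avg = {name: torch.stack([s[name].float() for s in states]).mean(0)
       for name in states[0]}
torch.save(avg, "swa-endpoint.pt")
```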

## Usage

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("EER6/AnCoder-1.0B-Base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("EER6/AnCoder-1.0B-Base")

# For diffusion inference, pad short prompts with the mask token (<|fim_middle|>, id 151660)
# rather than the actual pad token (<|endoftext|>, id 151643); a toy sampling loop follows below.
tokenizer.pad_token_id = 151660
inputs = tokenizer("def fibonacci(n):", return_tensors="pt", padding="max_length", max_length=2048)

outputs = model(**inputs)
outputs.logits         # (B, L, V) denoiser predictions
outputs.anchor_logits  # (B, L, V) anchor predictions
```
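
A single forward pass is one denoising step; sampling iterates it. Continuing from the snippet above, here is a deliberately simple confidence-based unmasking loop, not the repo's official sampler: each round commits the highest-confidence argmax predictions at still-masked positions. Calling the model with `input_ids` only (so every mask slot is attended to) is an assumption about the intended inference setup.

```python
import torch

MASK_ID = 151660  # <|fim_middle|>
ids = inputs["input_ids"].clone()

model.eval()
for step in range(8):  # the number of refinement rounds is a free choice
    masked = ids == MASK_ID
    if not masked.any():
        break
    with torch.no_grad():
        logits = model(input_ids=ids).logits      # (B, L, V)
    conf, pred = logits.softmax(-1).max(-1)       # confidence and argmax per slot
    conf = conf.masked_fill(~masked, -1.0)        # rank only still-masked slots
    k = max(1, int(masked.sum()) // (8 - step))   # shrinking unmask budget
    top = conf.view(-1).topk(k).indices
    ids.view(-1)[top] = pred.view(-1)[top]        # commit the most confident tokens

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```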