# AnCoder-1.0B-Base
Anchored bidirectional diffusion language model built on Qwen3-0.6B.
- **Architecture**: 28 anchor layers + 28 denoiser layers, hidden-state ("hid") connection, all weights tied
- **Parameters**: 1.04B unique
- **Base model**: Qwen/Qwen3-0.6B
- **Training**: 50k steps of continued pretraining on token-packed streams (block_size=2048),
  uniform noise schedule, anchor_weight=0.1, all-position anchor supervision,
  shifted AR alignment (BOS-prepend trick on the Qwen3 lm_head); see the loss sketch after this list
- **Endpoint**: SWA over the last 5 saved checkpoints (steps 46k–50k, 1k stride)
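The training bullet names the key hyperparameters but not the objective itself. Below is a minimal sketch of one way the pieces could fit together, assuming a standard absorbing-state setup: denoiser (DLM) loss on the corrupted positions only, anchor loss on every position, and `anchor_weight` scaling the anchor term. The function name and the exact weighting are assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 151660      # <|fim_middle|> used as the absorbing mask token (see Usage below)
ANCHOR_WEIGHT = 0.1   # anchor_weight from the training configuration

def training_loss(model, tokens):
    """Hypothetical loss for one packed block of shape (B, 2048)."""
    # Uniform noise schedule: draw a masking rate t ~ U(0, 1) per sequence,
    # then corrupt each position independently with probability t.
    t = torch.rand(tokens.size(0), 1, device=tokens.device)
    masked = torch.rand_like(tokens, dtype=torch.float) < t
    noisy = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)

    out = model(input_ids=noisy)
    # Denoiser loss on the masked positions; anchor supervised at all positions.
    dlm_loss = F.cross_entropy(out.logits[masked], tokens[masked])
    anchor_loss = F.cross_entropy(out.anchor_logits.flatten(0, 1), tokens.flatten())
    return dlm_loss + ANCHOR_WEIGHT * anchor_loss
```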
## Training Progress (1000-entry SMA over final 5k steps, ending at 50k)
| Metric | Value |
|--------|-------|
| Loss | 1.1645 |
| DLM loss | 1.0037 |
| Anchor loss | 1.1015 |
| DLM accuracy | 60.84% |
| Anchor accuracy | 58.12% |
## Architecture
AnCoder uses an anchor-denoiser architecture for absorbing-state diffusion language modeling:
- **Anchor**: Full bidirectional Qwen3 (28 layers) processes masked input
- **Denoiser**: Full bidirectional Qwen3 (28 layers) refines anchor's hidden states
- **Connection**: Anchor hidden states passed directly to denoiser (hid mode)
- **Weight tying**: All embeddings and lm_heads share the same weight matrix
- **Shifted AR alignment**: BOS is prepended at forward time and the trailing
position is dropped before lm_head, so the AR-pretrained Qwen3 head operates
on its native "predict-next" alignment under the bidirectional diffusion loss.
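A hypothetical sketch of the forward path described by these bullets follows. The function and argument names (`ancoder_forward`, `anchor`, `denoiser`, `embed`, `lm_head`, `bos_id`) are illustrative, and the exact way BOS is injected is an assumption inferred from the description above.

```python
import torch

def ancoder_forward(anchor, denoiser, embed, lm_head, input_ids, bos_id):
    """Sketch of the anchor -> denoiser pass with shifted AR alignment."""
    # Prepend BOS at forward time: the sequence grows from length L to L+1.
    bos = torch.full_like(input_ids[:, :1], bos_id)
    x = embed(torch.cat([bos, input_ids], dim=1))

    h_anchor = anchor(x)             # 28 bidirectional Qwen3 layers over the masked input
    h_denoiser = denoiser(h_anchor)  # "hid" connection: anchor hidden states fed straight in

    # Drop the trailing position before the (tied) lm_head so position i scores
    # token i+1, matching the head's autoregressive pretraining alignment.
    anchor_logits = lm_head(h_anchor[:, :-1])
    logits = lm_head(h_denoiser[:, :-1])
    return logits, anchor_logits
```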
## SWA Endpoint
Stage-1 training reaches a plateau around step 25k. Training continues to 50k
to give the trajectory time to settle in the basin, and the last 5 saved
checkpoints (steps 46000, 47000, 48000, 49000, 50000) are then tail-averaged
to reduce noise in the final weights. The averaged model's tied weights are
deduplicated at save time, yielding a single ~1.87 GB safetensors file.
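For illustration, a tail average like this can be computed offline with `safetensors`. The checkpoint paths, output filename, and bf16 cast below are assumptions, not the actual export script.

```python
import torch
from safetensors.torch import load_file, save_file

# Hypothetical checkpoint layout; the real paths are not part of this card.
paths = [f"checkpoints/step_{s}/model.safetensors" for s in range(46000, 50001, 1000)]

# Running mean in fp32 over the five tail checkpoints.
avg = {}
for p in paths:
    for k, v in load_file(p).items():
        avg[k] = avg.get(k, 0) + v.float() / len(paths)

# Cast back down and write a single file (bf16 storage is assumed here).
save_file({k: v.to(torch.bfloat16) for k, v in avg.items()}, "ancoder-swa.safetensors")
```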
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("EER6/AnCoder-1.0B-Base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("EER6/AnCoder-1.0B-Base")
model.eval()

# For diffusion inference, pad short prompts with mask tokens (<|fim_middle|>, 151660)
# rather than the actual pad token (<|endoftext|>, 151643).
tokenizer.pad_token_id = 151660

inputs = tokenizer("def fibonacci(n):", return_tensors="pt",
                   padding="max_length", max_length=2048)
with torch.no_grad():
    outputs = model(**inputs)

outputs.logits         # (B, L, V) denoiser predictions
outputs.anchor_logits  # (B, L, V) anchor predictions
```
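The card does not specify a sampler; the call above is a single denoising pass. Purely as an illustration, a greedy confidence-based unmasking loop over the `<|fim_middle|>` positions might look like the following. This is an assumed sampler, not the authors' decoding procedure, and it ignores the attention mask for brevity.

```python
import torch

MASK_ID = 151660  # <|fim_middle|> treated as the mask token, per the comment above

@torch.no_grad()
def greedy_unmask(model, input_ids, steps=8):
    """Assumed sampler: repeatedly commit the most confident masked predictions."""
    ids = input_ids.clone()
    per_step = max(1, int((ids == MASK_ID).sum()) // steps)
    while True:
        mask = ids == MASK_ID
        if not mask.any():
            break
        logits = model(input_ids=ids).logits
        probs, preds = logits.softmax(-1).max(-1)  # (B, L) confidence and argmax
        conf = probs.masked_fill(~mask, -1.0)      # only consider masked slots
        k = min(per_step, int(mask.sum()))
        top = conf.view(-1).topk(k).indices        # flat topk; assumes batch size 1
        ids.view(-1)[top] = preds.reshape(-1)[top]
    return ids

# Example: fill the mask-padded prompt built in the Usage snippet above.
filled = greedy_unmask(model, inputs["input_ids"])
print(tokenizer.decode(filled[0], skip_special_tokens=True))
```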