README: report 50k training steps (matches truncated log + SWA window)
README.md
CHANGED
@@ -5,7 +5,7 @@ Anchored bidirectional diffusion language model built on Qwen3-0.6B.
 - **Architecture**: 28 anchor layers + 28 denoiser layers, hid connection, all weights tied
 - **Parameters**: 1.04B unique
 - **Base model**: Qwen/Qwen3-0.6B
-- **Training**:
+- **Training**: 50k steps continued pretraining, token-packed streams (block_size=2048),
 uniform noise schedule, anchor_weight=0.1, all-position anchor supervision,
 shifted AR alignment (BOS-prepend trick on Qwen3 lm_head)
 - **Endpoint**: SWA over the last 5 saved checkpoints (steps 46k–50k, 1k stride)
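The Endpoint line implies a specific checkpoint window: five checkpoints at a 1k stride ending at step 50k. The sketch below checks that arithmetic and shows one plausible form of the averaging. It assumes equal weights, which the README does not state. Plain Python lists stand in for parameter tensors, and the names `swa_steps` and `average_checkpoints` are hypothetical, not taken from the repository.

```python
from typing import Dict, List


def swa_steps(last_step: int = 50_000, stride: int = 1_000, count: int = 5) -> List[int]:
    """Return the steps of the last `count` checkpoints saved every `stride` steps."""
    first = last_step - stride * (count - 1)
    return list(range(first, last_step + 1, stride))


def average_checkpoints(
    checkpoints: List[Dict[str, List[float]]],
) -> Dict[str, List[float]]:
    """Equal-weight average of each named parameter across checkpoints.

    Assumption: uniform weighting; the README only says "SWA over the last 5".
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Dict[str, List[float]] = {}
    for name, first_values in checkpoints[0].items():
        sums = [0.0] * len(first_values)
        for ckpt in checkpoints:
            for i, value in enumerate(ckpt[name]):
                sums[i] += value
        averaged[name] = [s / n for s in sums]
    return averaged


if __name__ == "__main__":
    print(swa_steps())  # the 46k-50k window at a 1k stride
    toy = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(toy))
```

With real checkpoints the same loop would run over tensors from each saved state dict rather than Python lists. How tied weights are handled during averaging is not described in the README and is left out here.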