Step 5000: Initial Foundational Pre-training Weight Drop
README.md ADDED
@@ -0,0 +1,19 @@
# DiffReaper-Talk

A 1.5B-parameter Discrete Diffusion Language Model (dLLM) optimized for parallel token prediction, trained on general text corpora during the foundational pre-training phase.

## Summary

DiffReaper-Talk uses a Transformer-based discrete diffusion architecture to predict multiple tokens in parallel. This approach avoids the sequential bottleneck of standard autoregressive generation.
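
As a rough illustration of why parallel prediction removes the autoregressive bottleneck, the sketch below runs a generic mask-and-remask denoising loop of the kind used by many discrete diffusion LMs. The function name, the mask-token handling, and the confidence-based unmasking schedule are assumptions for illustration, not the released DiffReaper-Talk API, and the card's actual objective (denoising in a continuous embedding space) may use a different noising process.

```python
# Illustrative sketch only: the model call signature, the mask-token id, and
# the confidence-based unmasking schedule are assumptions, not the actual
# DiffReaper-Talk interface.
import torch


@torch.no_grad()
def parallel_denoise(model, prompt_ids, gen_len=64, steps=8, mask_id=0):
    """Fill `gen_len` masked positions appended to `prompt_ids`.

    Each step predicts every masked position in a single forward pass; the
    most confident predictions are committed and the rest stay masked, so
    generation takes `steps` passes instead of `gen_len` passes.
    """
    device = prompt_ids.device
    masks = torch.full((1, gen_len), mask_id, dtype=prompt_ids.dtype, device=device)
    x = torch.cat([prompt_ids, masks], dim=1)

    for step in range(steps):
        logits = model(x)                              # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)

        still_masked = x == mask_id
        if not still_masked.any():
            break
        # Commit a growing fraction of the remaining masked positions,
        # highest confidence first (a common remasking schedule for dLLMs).
        k = max(1, int(still_masked.sum().item() * (step + 1) / steps))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        commit = conf.topk(k, dim=-1).indices
        x.scatter_(1, commit, pred.gather(1, commit))
    return x
```

An autoregressive decoder would instead need one forward pass per generated token.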

## Technical Details

- **Architecture:** 24-Layer Transformer Encoder (see the configuration sketch after this list)
- **Embedding Dim:** 2048
- **Attention Heads:** 16
- **Parameters:** ~1.5 Billion
- **Hardware:** 1x NVIDIA A100 (80 GB VRAM)
- **Objective:** Markovian Discrete Denoising (Continuous Embedding Space)
- **Precision:** Mixed BF16
- **Context Window:** 1024 Tokens
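
For concreteness, the hyperparameters listed above correspond roughly to a backbone like the one sketched below. The vocabulary size, the feed-forward width (4 × the embedding dimension), and the use of learned positional embeddings are not stated in this card and are assumptions for illustration only.

```python
# Hypothetical configuration matching the listed hyperparameters. Vocabulary
# size, feed-forward width, and learned positional embeddings are assumptions;
# they are not specified in this model card.
import torch
import torch.nn as nn


class DiffReaperTalkSketch(nn.Module):
    def __init__(self, vocab_size=50_304, d_model=2048, n_heads=16,
                 n_layers=24, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.tok_emb(ids) + self.pos_emb(pos)
        # Encoder-only, non-causal attention: every position can attend to
        # the full 1024-token context, which is what allows all masked
        # positions to be denoised in parallel.
        h = self.encoder(h)
        return self.lm_head(h)
```

With these assumed values the model comes out to roughly 1.4-1.5 billion parameters (`sum(p.numel() for p in model.parameters())`), consistent with the ~1.5B figure above.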
## Current Status

Foundational pre-training is in progress. Logic and domain-specific (code) training will be applied post-convergence.