Update README.md
README.md
CHANGED
@@ -63,6 +63,8 @@ This is an **experimental research model** designed to explore hybrid architectures
 ## Training Details
 
 - **Training Context Window:** 4096 tokens
+- **Training GPU:** AMD MI300X x 1 (takes 68 hrs)
+- **Training Strategy:** 8-bit MLP quantization; frozen embedding, MLP, and head; DeepSpeed Stage 1
 - **Base Model Initialization:** Weights initialized from Reka-flash3 21B
 - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers
 
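The training strategy described in the diff (frozen embedding, MLP, and head; DeepSpeed ZeRO Stage 1) could be expressed roughly as below. This is a hypothetical sketch, not the actual training code: the parameter-name prefixes in `FROZEN_PREFIXES` are assumptions about how the model names its modules, and the DeepSpeed config is a minimal illustration of Stage 1.

```python
# Hypothetical sketch (not from this repo): a minimal DeepSpeed ZeRO Stage 1
# config plus a helper that marks embedding/MLP/head parameters as frozen.

ds_config = {
    "train_micro_batch_size_per_gpu": 1,          # assumed value
    "zero_optimization": {"stage": 1},            # ZeRO Stage 1: shard optimizer states
    "bf16": {"enabled": True},                    # assumed precision
}

# Assumed module-name prefixes for the frozen parts (embedding, MLP, head).
FROZEN_PREFIXES = ("embed", "mlp", "lm_head")

def trainable(name: str) -> bool:
    """Return True if a dotted parameter name should receive gradients."""
    return not any(part.startswith(FROZEN_PREFIXES) for part in name.split("."))

# In real code one would loop over model.named_parameters() and set
# p.requires_grad = trainable(name), leaving only the RWKV/attention
# blocks to be updated.
```

With this policy, `model.layers.0.self_attn.q_proj.weight` stays trainable while `model.layers.0.mlp.gate_proj.weight` and `lm_head.weight` are frozen.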
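The architecture conversion (RWKV blocks everywhere except 6 GQA layers) implies a per-layer type assignment. A sketch of one possible placement policy is below; the total layer count and the even-spacing rule are assumptions for illustration, as the card does not state where the 6 GQA layers sit.

```python
# Hypothetical sketch: assign a block type ("rwkv" or "gqa") to each decoder
# layer, spreading the GQA layers evenly across the stack. The layer count
# and the spacing policy are assumptions, not taken from the model card.

NUM_LAYERS = 48        # assumed total decoder layers
NUM_GQA_LAYERS = 6     # per the card: 6 strategically placed GQA layers

def gqa_layer_indices(num_layers: int, num_gqa: int) -> list[int]:
    """Place GQA layers at the center of num_gqa equal-width bands."""
    stride = num_layers / num_gqa
    return [round(i * stride + stride / 2) for i in range(num_gqa)]

gqa_set = set(gqa_layer_indices(NUM_LAYERS, NUM_GQA_LAYERS))
layer_types = ["gqa" if i in gqa_set else "rwkv" for i in range(NUM_LAYERS)]
```

Any real conversion would pick the GQA positions empirically; the helper only shows the bookkeeping involved.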