Update README.md
README.md
CHANGED
@@ -63,6 +63,8 @@ This is an **experimental research model** designed to explore hybrid architectures
 ## Training Details
 
 - **Training Context Window:** 4096 tokens
+- **Training GPU:** AMD MI300X x 1 (takes 68 hrs)
+- **Training Strategy:** 8-bit MLP quantization; frozen embedding, MLP, and head; DeepSpeed Stage 1
 - **Base Model Initialization:** Weights initialized from Reka-flash3 21B
 - **Architecture Conversion:** Transformer attention blocks systematically replaced with RWKV blocks, except for 6 strategically placed GQA layers
 
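The training strategy described in the diff (frozen embedding, MLP, and head; DeepSpeed ZeRO Stage 1) could be expressed roughly as below. This is a hypothetical sketch, not the actual training code: the parameter-name prefixes in `FROZEN_PREFIXES` are assumptions about how the model names its modules, and the DeepSpeed config is a minimal illustration of Stage 1.

```python
# Hypothetical sketch (not from this repo): a minimal DeepSpeed ZeRO Stage 1
# config plus a helper that marks embedding/MLP/head parameters as frozen.

ds_config = {
    "train_micro_batch_size_per_gpu": 1,          # assumed value
    "zero_optimization": {"stage": 1},            # ZeRO Stage 1: shard optimizer states
    "bf16": {"enabled": True},                    # assumed precision
}

# Assumed module-name prefixes for the frozen parts (embedding, MLP, head).
FROZEN_PREFIXES = ("embed", "mlp", "lm_head")

def trainable(name: str) -> bool:
    """Return True if a dotted parameter name should receive gradients."""
    return not any(part.startswith(FROZEN_PREFIXES) for part in name.split("."))

# In real code one would loop over model.named_parameters() and set
# p.requires_grad = trainable(name), leaving only the RWKV/attention
# blocks to be updated.
```

With this policy, `model.layers.0.self_attn.q_proj.weight` stays trainable while `model.layers.0.mlp.gate_proj.weight` and `lm_head.weight` are frozen.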
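The architecture conversion (RWKV blocks everywhere except 6 GQA layers) implies a per-layer type assignment. A sketch of one possible placement policy is below; the total layer count and the even-spacing rule are assumptions for illustration, as the card does not state where the 6 GQA layers sit.

```python
# Hypothetical sketch: assign a block type ("rwkv" or "gqa") to each decoder
# layer, spreading the GQA layers evenly across the stack. The layer count
# and the spacing policy are assumptions, not taken from the model card.

NUM_LAYERS = 48        # assumed total decoder layers
NUM_GQA_LAYERS = 6     # per the card: 6 strategically placed GQA layers

def gqa_layer_indices(num_layers: int, num_gqa: int) -> list[int]:
    """Place GQA layers at the center of num_gqa equal-width bands."""
    stride = num_layers / num_gqa
    return [round(i * stride + stride / 2) for i in range(num_gqa)]

gqa_set = set(gqa_layer_indices(NUM_LAYERS, NUM_GQA_LAYERS))
layer_types = ["gqa" if i in gqa_set else "rwkv" for i in range(NUM_LAYERS)]
```

Any real conversion would pick the GQA positions empirically; the helper only shows the bookkeeping involved.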