# DAARTH-7B-A1B
A 7-billion parameter (1B active) hybrid Mamba + Mixture-of-Experts language model pretrained from scratch on 73.4B tokens of NVIDIA ClimbMix data.
## Model Overview
DAARTH-7B-A1B adapts the Nemotron-Nano-V3 architecture into a 7B-total / 1B-active-parameter configuration. The model uses a hybrid layer design that combines Mamba (state-space) layers, Mixture-of-Experts (MoE) layers, and standard attention layers. Training completed in under five days of wall-clock time on a single 8×H100 node.
**Developed by:** David Yang, Alex Luu, Aryan Bansal, Rishi Athavale, Timothy Gao, Harsha Polavaram

**Acknowledgements:** The authors gratefully acknowledge NVIDIA for providing computational resources and technical support that contributed to the results of this technical report.
## Architecture
| Parameter | Value |
|---|---|
| Total Parameters | ~7B |
| Active Parameters | ~1B |
| Hidden Size | 2048 |
| Num Layers | 38 |
| Layer Order | MEM*EMEM*EMEM*EMEM*EMEM*EMEMEM*EMEMEME |
| Attention Heads | 16 |
| Head Dimension | 128 |
| Routed Experts | 64 |
| Top-K Routing | 4 |
| Shared Experts | 1 |
| Expert FFN Size | 1408 |
| Shared Expert FFN Size | 2816 |
| Vocabulary Size | 131,072 |
| Tokenizer | Nemotron Nano V3 tokenizer |
In the layer order notation: M = Mamba layer, E = Mixture-of-Experts layer, * = Attention layer.
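The layer-order string can be decoded programmatically; a quick sanity check (all counts derived from the string itself) confirms it matches the 38-layer depth listed in the table:

```python
# Count layer types in the DAARTH-7B-A1B layer-order string from the table above.
# M = Mamba (state-space), E = Mixture-of-Experts, * = attention.
from collections import Counter

LAYER_ORDER = "MEM*EMEM*EMEM*EMEM*EMEM*EMEMEM*EMEMEME"

counts = Counter(LAYER_ORDER)
print(len(LAYER_ORDER))                       # 38 layers total, matching "Num Layers"
print(counts["M"], counts["E"], counts["*"])  # 16 Mamba, 16 MoE, 6 attention
```

Attention layers are sparse (6 of 38), consistent with the hybrid design's goal of keeping most of the sequence mixing in cheaper Mamba layers.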
## Training Details
### Training Data
The model was trained on 73.4B tokens sampled from ClimbMix, a 400B-token dataset from the Nemotron-CLIMB project. ClimbMix was constructed by combining Nemotron-CC and SmolLM-Corpus, embedding and clustering the data into 20 semantic clusters, and using an iterative proxy-model-based search (building on the RegMix framework) to find an optimal cluster mixture. ClimbMix was chosen over a hand-designed data mixture after proxy evaluations showed it outperformed the manual curation on multiple benchmarks.
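The final step of a ClimbMix-style pipeline, once the proxy-model search has produced per-cluster mixture weights, is to allocate the token budget across the semantic clusters. A minimal sketch, assuming hypothetical placeholder weights rather than the actual ClimbMix mixture:

```python
# Sketch: allocate a fixed token budget across semantic clusters according to
# mixture weights found by a proxy-model search. The random weights below are
# hypothetical placeholders, not the real ClimbMix cluster weights.
import random

NUM_CLUSTERS = 20
TOKEN_BUDGET = 73_400_000_000  # 73.4B tokens

random.seed(0)
raw = [random.random() for _ in range(NUM_CLUSTERS)]
weights = [w / sum(raw) for w in raw]  # normalize to a probability distribution

tokens_per_cluster = [round(w * TOKEN_BUDGET) for w in weights]
# Rounding perturbs each cluster by at most half a token's weight
assert abs(sum(tokens_per_cluster) - TOKEN_BUDGET) < NUM_CLUSTERS
```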
### Training Configuration
| Detail | Value |
|---|---|
| Hardware | 8× NVIDIA H100 80GB |
| Wall-Clock Time | 4 days, 16 hours, 19 minutes, 48 seconds |
| Tokens Consumed | 73.4B |
| Optimizer | Muon (with gradient resizing) |
| LR Schedule | Warmup-Stable-Decay (MiniCPM) |
| Framework | Megatron-LM |
Training was stable throughout, with no NaN events or loss spikes. Training and validation loss decreased smoothly over the course of the run.
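The Warmup-Stable-Decay schedule referenced in the table can be sketched as a simple piecewise function: linear warmup, a long constant plateau, then a short decay to a floor. The phase lengths, peak/minimum rates, and linear decay shape below are illustrative assumptions, not the run's actual hyperparameters:

```python
# Minimal sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule as
# popularized by MiniCPM. All constants here are illustrative assumptions.
def wsd_lr(step, total_steps, peak_lr=3e-3, min_lr=3e-4,
           warmup_frac=0.01, decay_frac=0.1):
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                    # linear warmup to the peak
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:                      # long stable plateau
        return peak_lr
    frac = (step - stable_end) / decay_steps   # short linear decay tail
    return peak_lr + frac * (min_lr - peak_lr)
```

A practical attraction of WSD over cosine decay is that the stable phase can be extended (or a checkpoint branched into a fresh decay phase) without committing to a total step count up front.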
### Final Loss & Perplexity
| Metric | Validation Set | Test Set |
|---|---|---|
| Loss | 2.0819 | 2.0875 |
| Perplexity | 8.0194 | 8.0648 |
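The perplexity rows follow directly from the loss rows, since perplexity is the exponential of the cross-entropy loss in nats (small differences from the table come from rounding the reported losses):

```python
# Perplexity = exp(cross-entropy loss in nats).
import math

val_ppl = math.exp(2.0819)   # ≈ 8.02, matching the validation perplexity
test_ppl = math.exp(2.0875)  # ≈ 8.06, matching the test perplexity
print(round(val_ppl, 3), round(test_ppl, 3))
```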
## Evaluation
All benchmarks were run using NeMo Evaluator (which leverages lm-evaluation-harness and vLLM under the hood). MMLU, MMLU Pro, and MMLU Redux use 5-shot prompting; all others use 0-shot. We report acc_norm (length-normalized log-likelihood) where applicable.
| Benchmark | Random Baseline | DAARTH-7B-A1B |
|---|---|---|
| MMLU (5-shot) | 0.250 | 0.377 |
| MMLU Pro (5-shot) | 0.100 | 0.091 |
| PIQA (0-shot) | 0.500 | 0.796 |
| Winogrande (0-shot) | 0.500 | 0.609 |
| HellaSwag (0-shot) | 0.250 | 0.684 |
| ARC-Challenge (0-shot) | 0.250 | 0.480 |
| ARC-Easy (0-shot) | 0.250 | 0.735 |
| MMLU Redux (5-shot) | 0.250 | 0.342 |
The model substantially outperforms the random baseline on every benchmark except MMLU Pro, where it scores slightly below chance; MMLU Pro is an especially challenging benchmark whose harder distractor answer choices depress scores at this scale.
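The acc_norm metric mentioned above can be illustrated with a toy example: each answer choice's summed log-probability is divided by its length before the argmax, so longer choices are not penalized merely for containing more tokens. The log-probabilities below are made-up numbers for illustration only:

```python
# Toy illustration of acc_norm (length-normalized log-likelihood) scoring.
# Values are hypothetical summed token log-probs for two answer choices.
choices = {
    "a ball": -12.0,
    "a large red ball": -20.0,
}

def pick_acc_norm(scores):
    # Divide each choice's total log-prob by its character length, then argmax.
    return max(scores, key=lambda c: scores[c] / len(c))

print(pick_acc_norm(choices))  # "a large red ball" (-1.25/char beats -2.0/char)
```

A raw log-likelihood argmax would pick "a ball" here; normalization flips the decision, which is why acc_norm is the standard report for multiple-choice tasks with uneven answer lengths.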
## Scaling Rationale
The 6-day compute budget on 8×H100 at ~150k tokens/sec yields roughly 78B tokens. While Chinchilla scaling laws were derived for dense models and the optimal ratio for hybrid MoE architectures remains an open research question, 73.4B tokens far exceeds the Chinchilla-optimal budget for the active parameter count (1B, ~20B tokens) while remaining below the optimum for the total parameter count (7B, ~140B tokens). Evidence from OLMoE, OLMo Hybrid, and Qwen suggests that small models continue to benefit well beyond the Chinchilla-optimal ratio.
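The implied tokens-per-parameter ratios, against the ~20 tokens/parameter Chinchilla heuristic for dense models, work out as follows:

```python
# Tokens-per-parameter ratios implied by the 73.4B-token budget, compared to
# the ~20 tokens/param Chinchilla heuristic (derived for dense models).
TOKENS = 73.4e9

active_ratio = TOKENS / 1e9  # per active parameter (1B)
total_ratio = TOKENS / 7e9   # per total parameter (7B)
print(round(active_ratio, 1))  # 73.4 tokens/param, well past the ~20 heuristic
print(round(total_ratio, 1))   # 10.5 tokens/param, about half the heuristic
```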
## Data Curation Study
Before selecting ClimbMix, the team conducted an extensive study of data mixture strategies for MMLU-optimized pretraining under constrained token budgets. Key findings from this study include:
- Knowledge-dense sources disproportionately help MMLU. FineWeb-Edu showed ~12% relative MMLU gain over unfiltered web data.
- Format alignment matters. FLAN's multi-task Q&A data, which includes multiple-choice formats, directly aligns with MMLU's evaluation structure.
- Multi-epoch training on high-quality data is acceptable at constrained budgets (~8 epochs optimal per Feng et al., 2024).
- Code has limited MMLU relevance. Only 2 of 57 MMLU subjects touch computer science.
A hand-designed 9-source, two-phase mixture was developed based on these findings but was ultimately superseded by ClimbMix after proxy evaluation comparisons.
### Proxy Evaluation: Curated Mix vs. ClimbMix
Evaluated at 2,000 steps with global batch size 128:
| Benchmark | Curated @2k | ClimbMix @2k |
|---|---|---|
| PIQA | 0.656 | 0.720 |
| WinoGrande | 0.502 | 0.506 |
| HellaSwag | 0.335 | 0.332 |
| ARC-Easy | 0.503 | 0.587 |
| ARC-Challenge | 0.196 | 0.273 |
| LAMBADA | 0.163 | 0.155 |
| MMLU | 0.230 | 0.239 |
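All numbers below are taken straight from the proxy table; a quick aggregation shows ClimbMix ahead on average despite small regressions on HellaSwag and LAMBADA:

```python
# Per-benchmark deltas (ClimbMix minus curated mix) from the proxy
# evaluation table above, and their mean.
curated = {"PIQA": 0.656, "WinoGrande": 0.502, "HellaSwag": 0.335,
           "ARC-Easy": 0.503, "ARC-Challenge": 0.196,
           "LAMBADA": 0.163, "MMLU": 0.230}
climbmix = {"PIQA": 0.720, "WinoGrande": 0.506, "HellaSwag": 0.332,
            "ARC-Easy": 0.587, "ARC-Challenge": 0.273,
            "LAMBADA": 0.155, "MMLU": 0.239}

deltas = {k: round(climbmix[k] - curated[k], 3) for k in curated}
mean_delta = sum(deltas.values()) / len(deltas)
print(deltas)
print(round(mean_delta, 4))  # +0.0324 average across the seven proxies
```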
## Limitations
- This is a base pretrained model — it has not been instruction-tuned or aligned via RLHF/DPO. It is not suitable for direct use in conversational applications.
- Performance on MMLU Pro is near random, indicating limited capacity for complex multi-step reasoning at this scale and token budget.
- The model was trained on English data only and is not intended for multilingual use.
- At 73.4B tokens, the model has seen a relatively small amount of data compared to production-scale models.
## Training Logs
Full training logs are available on Weights & Biases: DAARTH-7B Technical Report — W&B
## Citation
```bibtex
@misc{daarth7b2025,
  title={DAARTH-7B-A1B: Hybrid-MoE Pretrained from Scratch},
  author={Yang, David and Luu, Alex and Bansal, Aryan and Athavale, Rishi and Gao, Timothy and Polavaram, Harsha},
  year={2026}
}
```