DAARTH-7B-A1B

A 7-billion parameter (1B active) hybrid Mamba + Mixture-of-Experts language model pretrained from scratch on 73.4B tokens of NVIDIA ClimbMix data.

Model Overview

DAARTH-7B-A1B adapts the Nemotron-Nano-V3 architecture into a 7B-total / 1B-active-parameter configuration. The model uses a hybrid layer design combining Mamba (state-space) layers, Mixture-of-Experts (MoE) layers, and standard attention layers. Training completed in under five days on a single 8×H100 node.

Developed by: David Yang, Alex Luu, Aryan Bansal, Rishi Athavale, Timothy Gao, Harsha Polavaram

Acknowledgements: The authors gratefully acknowledge NVIDIA for providing computational resources and technical support that have contributed to the results of this technical report.

Architecture

| Parameter | Value |
|---|---|
| Total Parameters | ~7B |
| Active Parameters | ~1B |
| Hidden Size | 2048 |
| Num Layers | 38 |
| Layer Order | MEM*EMEM*EMEM*EMEM*EMEM*EMEMEM*EMEMEME |
| Attention Heads | 16 |
| Head Dimension | 128 |
| Routed Experts | 64 |
| Top-K Routing | 4 |
| Shared Experts | 1 |
| Expert FFN Size | 1408 |
| Shared Expert FFN Size | 2816 |
| Vocabulary Size | 131,072 |
| Tokenizer | Nemotron Nano V3 tokenizer |

In the layer order notation: M = Mamba layer, E = Mixture-of-Experts layer, * = Attention layer.
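The layer-order string can be checked programmatically against the table. A quick sketch that tallies each layer type (the string and the 38-layer count come from the architecture table above):

```python
from collections import Counter

# Layer-order string from the architecture table:
# M = Mamba (state-space), E = Mixture-of-Experts, * = attention.
LAYER_ORDER = "MEM*EMEM*EMEM*EMEM*EMEM*EMEMEM*EMEMEME"

counts = Counter(LAYER_ORDER)
print(len(LAYER_ORDER), dict(counts))  # 38 {'M': 16, 'E': 16, '*': 6}
```

So the 38 layers break down as 16 Mamba, 16 MoE, and 6 attention layers.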

Training Details

Training Data

The model was trained on 73.4B tokens sampled from ClimbMix, a 400B-token dataset from the Nemotron-CLIMB project. ClimbMix was constructed by combining Nemotron-CC and SmolLM-Corpus, embedding and clustering the data into 20 semantic clusters, and using an iterative proxy-model-based search (building on the RegMix framework) to find an optimal cluster mixture. ClimbMix was chosen over a hand-designed data mixture after proxy evaluations showed it outperformed the manual curation on multiple benchmarks.
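The cluster-mixture search can be pictured with a minimal sketch: sample candidate weightings over the 20 clusters, score each with a cheap proxy, and keep the best. This is only illustrative of the search loop, not the actual CLIMB/RegMix pipeline — in particular, `proxy_score` here is a toy stand-in for what is really a trained proxy model (and RegMix fits a regression over proxy runs rather than ranking raw samples):

```python
import random

def sample_mixture(n_clusters, rng):
    """Random point on the simplex: normalized uniform draws per cluster."""
    w = [rng.random() for _ in range(n_clusters)]
    s = sum(w)
    return [x / s for x in w]

def search_mixture(proxy_score, n_clusters=20, n_candidates=64, seed=0):
    """Sample candidate mixtures and keep the one the proxy scores highest."""
    rng = random.Random(seed)
    candidates = [sample_mixture(n_clusters, rng) for _ in range(n_candidates)]
    return max(candidates, key=proxy_score)

# Toy stand-in objective: prefer more weight on cluster 0.
best = search_mixture(proxy_score=lambda w: w[0])
```

The real pipeline iterates this loop, refining the candidate distribution around the best-scoring mixtures.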

Training Configuration

| Detail | Value |
|---|---|
| Hardware | 8× NVIDIA H100 80GB |
| Wall-Clock Time | 4 days, 16 hours, 19 minutes, 48 seconds |
| Tokens Consumed | 73.4B |
| Optimizer | Muon (with gradient resizing) |
| LR Schedule | Warmup-Stable-Decay (MiniCPM) |
| Framework | Megatron-LM |

Training was stable throughout, with no NaN events or loss spikes. Training and validation loss decreased smoothly over the course of the run.
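The Warmup-Stable-Decay schedule listed above can be sketched as follows. The peak learning rate, warmup/decay fractions, and linear decay shape here are illustrative assumptions, not the values used in this run (MiniCPM-style WSD implementations vary in the decay shape):

```python
def wsd_lr(step, total_steps, peak_lr=3e-3, warmup_frac=0.01,
           decay_frac=0.1, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay.
    All hyperparameters here are illustrative, not the run's actual values."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)  # linear warmup
    if step < stable_end:
        return peak_lr                                # stable plateau
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress    # final decay

print(wsd_lr(0, 1000), wsd_lr(500, 1000), wsd_lr(1000, 1000))
```

The appeal of WSD over cosine decay is that the long constant plateau lets training be extended (or checkpointed mid-run) without committing to a total step count up front.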

Final Loss & Perplexity

| Metric | Validation Set | Test Set |
|---|---|---|
| Loss | 2.0819 | 2.0875 |
| Perplexity | 8.0194 | 8.0648 |
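As a consistency check, the reported perplexities are simply the exponentials of the reported losses (tiny discrepancies in the last digit come from the losses in the table being rounded):

```python
import math

val_loss, test_loss = 2.0819, 2.0875
print(round(math.exp(val_loss), 4))   # ~8.02, matching the table to ~3 decimals
print(round(math.exp(test_loss), 4))  # ~8.06
```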

Evaluation

All benchmarks were run using NeMo Evaluator (which leverages lm-evaluation-harness and vLLM under the hood). MMLU and MMLU Pro use 5-shot prompting; all others use 0-shot. We report acc_norm (length-normalized log-likelihood) where applicable.
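Length-normalized scoring divides each answer continuation's summed log-likelihood by its length (byte length, in lm-evaluation-harness) before taking the argmax, so longer answers are not penalized merely for containing more tokens. A minimal sketch with made-up scores:

```python
def pick_acc_norm(logliks, byte_lens):
    """Index of the choice with the best length-normalized log-likelihood."""
    scores = [ll / n for ll, n in zip(logliks, byte_lens)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical log-likelihoods for four answer continuations and their byte lengths.
logliks = [-12.0, -9.0, -20.0, -15.0]
byte_lens = [8, 3, 10, 5]
print(pick_acc_norm(logliks, byte_lens))  # 0 — raw log-likelihood would pick 1
```

In this toy example normalization flips the prediction: choice 1 has the highest raw log-likelihood, but choice 0 wins per byte.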

| Benchmark | Random Baseline | DAARTH-7B-A1B |
|---|---|---|
| MMLU (5-shot) | 0.250 | 0.377 |
| MMLU Pro (5-shot) | 0.100 | 0.091 |
| PIQA (0-shot) | 0.500 | 0.796 |
| Winogrande (0-shot) | 0.500 | 0.609 |
| HellaSwag (0-shot) | 0.250 | 0.684 |
| ARC-Challenge (0-shot) | 0.250 | 0.480 |
| ARC-Easy (0-shot) | 0.250 | 0.735 |
| MMLU Redux (5-shot) | 0.250 | 0.342 |

The model substantially outperforms random guessing on every benchmark except MMLU Pro, whose ten answer options (random baseline 0.100) and harder distractors make it especially challenging at this scale and token budget.

Scaling Rationale

The 6-day compute budget on 8×H100 at 150k tokens/sec yields approximately 70B tokens. While Chinchilla scaling laws were derived for dense models and the optimal ratio for hybrid MoE architectures remains an open research question, 73.4B tokens exceeds Chinchilla-optimal for the active parameter count (1B) and approaches Chinchilla-optimal for total parameters (7B). Evidence from OLMoE, OLMo Hybrid, and Qwen suggests that small models continue to benefit well beyond the Chinchilla-optimal ratio.
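The budget arithmetic can be made explicit. The nominal figure comes out slightly above the quoted ~70B; the gap is presumably checkpointing/evaluation overhead. The 20-tokens-per-parameter rule of thumb below is the usual Chinchilla approximation, not a value from the report:

```python
tokens = 150_000 * 6 * 24 * 3600        # nominal 6-day budget at 150k tok/s
print(tokens / 1e9)                     # 77.76 (B tokens, before overhead)

chinchilla_active = 20 * 1_000_000_000  # ~20 tok/param heuristic, 1B active
chinchilla_total = 20 * 7_000_000_000   # same heuristic, 7B total
print(73.4e9 > chinchilla_active)       # True: well past optimal for active params
print(73.4e9 / chinchilla_total)        # ~0.52: about halfway for total params
```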

Data Curation Study

Before selecting ClimbMix, the team conducted an extensive study of data mixture strategies for MMLU-optimized pretraining under constrained token budgets. Key findings from this study include:

  • Knowledge-dense sources disproportionately help MMLU. FineWeb-Edu showed ~12% relative MMLU gain over unfiltered web data.
  • Format alignment matters. FLAN's multi-task Q&A data, which includes multiple-choice formats, directly aligns with MMLU's evaluation structure.
  • Multi-epoch training on high-quality data is acceptable at constrained budgets (~8 epochs optimal per Feng et al., 2024).
  • Code has limited MMLU relevance. Only 2 of 57 MMLU subjects touch computer science.

A hand-designed 9-source, two-phase mixture was developed based on these findings but was ultimately superseded by ClimbMix after proxy evaluation comparisons.

Proxy Evaluation: Curated Mix vs. ClimbMix

Evaluated at 2,000 steps with global batch size 128:

| Benchmark | Curated @2k | ClimbMix @2k |
|---|---|---|
| PIQA | 0.656 | 0.720 |
| WinoGrande | 0.502 | 0.506 |
| HellaSwag | 0.335 | 0.332 |
| ARC-Easy | 0.503 | 0.587 |
| ARC-Challenge | 0.196 | 0.273 |
| LAMBADA | 0.163 | 0.155 |
| MMLU | 0.230 | 0.239 |

Limitations

  • This is a base pretrained model — it has not been instruction-tuned or aligned via RLHF/DPO. It is not suitable for direct use in conversational applications.
  • Performance on MMLU Pro is near random, indicating limited capacity for complex multi-step reasoning at this scale and token budget.
  • The model was trained on English data only and is not intended for multilingual use.
  • At 73.4B tokens, the model has seen a relatively small amount of data compared to production-scale models.

Training Logs

Full training logs are available on Weights & Biases: DAARTH-7B Technical Report — W&B

Citation

@misc{daarth7b2025,
  title={DAARTH-7B-A1B: Hybrid-MoE Pretrained from Scratch},
  author={Yang, David and Luu, Alex and Bansal, Aryan and Athavale, Rishi and Gao, Timothy and Polavaram, Harsha},
  year={2025}
}