Overview
daVinci-LLM-3B is a 3B-parameter base language model aimed at making pretraining a transparent and reproducible scientific process. We release not only the final weights but also training trajectories, intermediate checkpoints, data processing decisions, and 200+ ablation studies covering data quality, mixture design, training dynamics, and evaluation validity. The model reaches an overall score of 51.72 across 19 benchmarks, approaching or matching larger 7B-scale models such as OLMo-3 7B.
The model follows a two-stage curriculum over ~8T tokens:
- Stage 1 (6T tokens): broad pretraining over diverse web-scale corpora.
- Stage 2 (2T tokens): structured QA and reasoning-heavy data to amplify math and code reasoning.
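The two-stage budget above can be written down as a minimal schedule table; the stage labels here are illustrative, not the release's internal names:

```python
# Illustrative two-stage token budget (in trillions of tokens),
# matching the curriculum described in the card.
stages = {
    "stage1_web_pretrain": 6.0,   # broad pretraining over diverse web-scale corpora
    "stage2_qa_reasoning": 2.0,   # structured QA and reasoning-heavy data
}

total_tokens_t = sum(stages.values())
print(f"total: ~{total_tokens_t:.0f}T tokens")  # ~8T, as stated
```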
Key Features
- Fully transparent pretraining pipeline: data processing logic, mixtures, logs, and checkpoints are publicly documented.
- Data Darwinism framework: a systematic L0–L9 taxonomy for data processing depth.
- Large-scale ablations: 200+ controlled experiments with both positive and negative results.
Intended Use
- Research: pretraining science, data quality studies, training dynamics, evaluation stability.
- General capabilities: broad language understanding, math/science reasoning, and code generation.
This is a base model and is not instruction- or safety-aligned. Additional safety evaluation and alignment are required for production deployment.
Architecture
- Type: Decoder-only Transformer (Qwen2 family)
- Parameters: ~3.09B
- Layers: 36
- Hidden size: 2048
- Attention heads: 16 (GQA, KV heads = 2)
- MLP: SwiGLU, intermediate size 11008
- Position encoding: RoPE (base = 10000)
- Context length: 4096
- Tokenizer: Qwen2 tokenizer (151,936 vocab)
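The listed dimensions can be sanity-checked against the ~3.09B parameter count with a back-of-envelope calculation. Note this sketch assumes Qwen2-style QKV biases and tied input/output embeddings, which are assumptions rather than documented details of this release:

```python
# Back-of-envelope parameter count from the architecture listed above.
vocab, hidden, layers = 151_936, 2048, 36
heads, kv_heads, inter = 16, 2, 11008

head_dim = hidden // heads            # 128
kv_dim = kv_heads * head_dim          # 256 (GQA: shared K/V heads)

embed = vocab * hidden                           # input embedding (assumed tied with output)
attn = 2 * hidden * hidden + 2 * hidden * kv_dim # q_proj, o_proj + k_proj, v_proj
attn_bias = hidden + 2 * kv_dim                  # Qwen2-style QKV biases (assumption)
mlp = 3 * hidden * inter                         # SwiGLU: gate, up, down projections
norms = 2 * hidden                               # two RMSNorms per layer

per_layer = attn + attn_bias + mlp + norms
total = embed + layers * per_layer + hidden      # + final RMSNorm
print(f"~{total / 1e9:.2f}B parameters")         # ~3.09B, matching the stated size
```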
Data and Processing
The training corpus spans general web text, code, science, and QA sources. Each dataset is annotated with a Data Darwinism level (L0–L9), and multiple sources receive L4/L5 generative refinement and cognitive completion.
Major categories:
- General: Common Crawl–based corpora (L3).
- Code: GitHub crawls + Nemotron code datasets (L3/L5).
- Science/Math: MegaMath, Nemotron-CC-Math, and Darwin-Science series (L3–L5).
- QA: multi-source QA data with rejection sampling (L5).
Training Recipe (Summary)
- Stage 1: 6T tokens with progressively adjusted mixtures (shifting weight from web text to code/science).
- Stage 2: 2T tokens, with the structured-QA share increased from 30% to 70% for stronger reasoning and problem-solving.
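The 30% → 70% QA ramp can be sketched as a schedule function over Stage 2's token budget. The linear shape here is an assumption for illustration; the release does not specify the interpolation:

```python
def qa_weight(tokens_seen_t: float, start: float = 0.30, end: float = 0.70,
              stage_tokens_t: float = 2.0) -> float:
    """Illustrative linear ramp of the structured-QA mixture weight
    across Stage 2 (schedule shape is an assumption, not documented)."""
    frac = min(max(tokens_seen_t / stage_tokens_t, 0.0), 1.0)
    return start + (end - start) * frac

print(qa_weight(0.0))  # 0.30 at the start of Stage 2
print(qa_weight(2.0))  # 0.70 at the end of Stage 2
```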
Evaluation (Summary)
- Benchmarks: 19 tasks spanning General, Code, and Science/Math.
- Tooling: lm-eval-harness.
- Result: Overall average 51.72, comparable to OLMo-3 7B.
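A typical lm-eval-harness invocation might look like the following; the repo id, task list, and settings here are placeholders for illustration, not the release's actual evaluation config:

```shell
# Hypothetical example: repo id and tasks are assumptions.
lm_eval \
  --model hf \
  --model_args pretrained=daVinci-LLM-3B,dtype=bfloat16 \
  --tasks mmlu,gsm8k,humaneval \
  --batch_size 8
```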
License
Apache-2.0
Citation