
Overview

daVinci-LLM-3B is a 3B-parameter base language model aimed at making pretraining a transparent and reproducible scientific process. We release not only the final weights but also training trajectories, intermediate checkpoints, data-processing decisions, and 200+ ablation studies covering data quality, mixture design, training dynamics, and evaluation validity. The model reaches an overall score of 51.72 across 19 benchmarks, approaching or matching larger 7B-scale models such as OLMo-3 7B.

[Figure: per-benchmark scores for daVinci-LLM-3B]

The model follows a two-stage curriculum over ~8T tokens:

  • Stage 1 (6T tokens): broad pretraining over diverse web-scale corpora.
  • Stage 2 (2T tokens): structured QA and reasoning-heavy data to amplify math and code reasoning.

Key Features

  • Fully transparent pretraining pipeline: data processing logic, mixtures, logs, and checkpoints are publicly documented.
  • Data Darwinism framework: a systematic L0–L9 taxonomy for data processing depth.
  • Large-scale ablations: 200+ controlled experiments with both positive and negative results.

Intended Use

  • Research: pretraining science, data quality studies, training dynamics, evaluation stability.
  • General capabilities: broad language understanding, math/science reasoning, and code generation.

This is a base model and is not instruction- or safety-aligned. Additional safety evaluation and alignment are required for production deployment.

Architecture

  • Type: Decoder-only Transformer (Qwen2 family)
  • Parameters: ~3.09B
  • Layers: 36
  • Hidden size: 2048
  • Attention heads: 16 (GQA, KV heads = 2)
  • MLP: SwiGLU, intermediate size 11008
  • Position encoding: RoPE (base = 10000)
  • Context length: 4096
  • Tokenizer: Qwen2 tokenizer (151,936 vocab)
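The hyperparameters above can be sanity-checked against the stated ~3.09B parameter count. The sketch below is an illustration, not the release code: it assumes tied input/output embeddings and omits biases and norm weights (assumptions not confirmed by this card), and tallies only the large weight matrices.

```python
# Approximate parameter count for daVinci-LLM-3B from the table above.
# Assumptions (not confirmed by the card): tied input/output embeddings,
# no attention/MLP biases, LayerNorm/RMSNorm weights ignored.
VOCAB, HIDDEN, LAYERS = 151_936, 2048, 36
HEADS, KV_HEADS, INTERMEDIATE = 16, 2, 11008

head_dim = HIDDEN // HEADS         # 128
kv_dim = KV_HEADS * head_dim       # 256 (GQA: K/V projections are narrower)

embedding = VOCAB * HIDDEN         # shared with the output head if tied
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # Q, O + K, V
mlp = 3 * HIDDEN * INTERMEDIATE    # SwiGLU: gate, up, and down projections
per_layer = attention + mlp

total = embedding + LAYERS * per_layer
print(f"{total / 1e9:.2f}B")       # ≈ 3.09B
```

The total lands within rounding distance of the advertised 3.09B only with tied embeddings, which suggests the output head shares weights with the input embedding.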

Data and Processing

The training corpus spans general web text, code, science, and QA sources. Each dataset is annotated with a Data Darwinism level (L0–L9), and multiple sources receive L4/L5 generative refinement and cognitive completion.

Major categories:

  • General: Common Crawl–based corpora (L3).
  • Code: GitHub crawls + Nemotron code datasets (L3/L5).
  • Science/Math: MegaMath, Nemotron-CC-Math, and Darwin-Science series (L3–L5).
  • QA: multi-source QA data with rejection sampling (L5).

Training Recipe (Summary)

  • Stage 1: 6T tokens with progressively adjusted mixtures (shifting weight from web text to code/science).
  • Stage 2: 2T tokens with structured QA (30% → 70%) for stronger reasoning and problem-solving.
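The Stage 2 QA ramp can be sketched as a mixture-weight schedule. Note the card states only the endpoints (30% → 70% over 2T tokens); the linear shape and the helper below are assumptions for illustration, not the released training configuration.

```python
# Illustrative ramp of the structured-QA mixture weight during Stage 2.
# Assumption: linear interpolation between the stated 30% and 70% endpoints;
# the actual schedule shape is not specified in this card.
def qa_weight(tokens_seen: float, stage_tokens: float = 2e12,
              start: float = 0.30, end: float = 0.70) -> float:
    """QA mixture weight after `tokens_seen` tokens of Stage 2."""
    frac = min(max(tokens_seen / stage_tokens, 0.0), 1.0)
    return start + (end - start) * frac

for t in (0.0, 0.5e12, 1e12, 2e12):
    print(f"{t / 1e12:.1f}T tokens -> QA weight {qa_weight(t):.2f}")
```

Under this assumed schedule, the QA share passes 50% at the Stage 2 midpoint (1T tokens) and is clamped at 70% thereafter.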

Evaluation (Summary)

  • Benchmarks: 19 tasks spanning General, Code, and Science/Math.
  • Tooling: lm-eval-harness.
  • Result: Overall average 51.72, comparable to OLMo-3 7B.

License

Apache-2.0

Citation

