sage / docs /Datasets.md
sage002's picture
SAGE model repository : Updating some model checkpoints
3d2114e verified

SAGE β€” 5 Billion Token Dataset Downloader

Automatically downloads ~5B tokens from free, public Hugging Face datasets and saves them as JSONL files in your data/raw/ directory, fully compatible with the SAGE training pipeline.


Token Budget

File Source Tokens
general_web.jsonl FineWeb 2.5B
code.jsonl The Stack v2 (Python, JS, Rust, Go, C++ and more) 1.0B
math_science.jsonl OpenWebMath 0.5B
multilingual.jsonl Wikipedia (20+ languages) 0.5B
synthetic.jsonl OpenHermes 2.5 (instruction data) 0.5B
Total ~5.0B tokens

Estimated disk space: ~20–25 GB
Estimated download time: 2–8 hours depending on your connection
Cost: 100% free, no account required


Requirements

System

  • Python 3.9+
  • 25 GB free disk space
  • Stable internet connection

Python packages

pip install datasets huggingface_hub tqdm

Usage

Basic β€” download everything

python debug/download_5b_tokens.py --output-dir data/raw

Test run β€” 1% of data to verify everything works

python debug/download_5b_tokens.py --output-dir data/raw --scale 0.01

Resume β€” continue after an internet cutout

python debug/download_5b_tokens.py --output-dir data/raw --resume

Download only one specific file

python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl

Download multiple specific files

python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl math_science.jsonl

All Flags

Flag Default Description
--output-dir data/raw Directory where JSONL files are saved
--resume off Skip files that already hit their token target
--only all files Download only the specified file(s)
--scale 1.0 Scale all token targets (e.g. 0.1 = 10% of 5B = 500M tokens)

Output Format

Every record written to disk follows this structure with at minimum a text field, making it directly compatible with the SAGE pipeline:

{ "text": "your training sample here", "source": "fineweb", "language": "en" }

Data Sources

1. FineWeb β€” general_web.jsonl

  • Dataset: HuggingFaceFW/fineweb (sample-10BT subset)
  • What it is: A pre-shuffled, deduplicated 10B-token slice of web-crawl text, one of the cleanest freely available web datasets
  • Why it's used: Broad general language coverage, essential for fluent text generation

2. The Stack v2 β€” code.jsonl

  • Dataset: bigcode/the-stack-v2-train-smol-ids
  • What it is: Source code across 10 programming languages: Python, JavaScript, TypeScript, Rust, Go, C++, Java, Bash, SQL, and HTML
  • Why it's used: Teaches the model programming syntax, logic, and structure

3. OpenWebMath β€” math_science.jsonl

  • Dataset: open-web-math/open-web-math
  • What it is: 14.7B tokens of mathematical content extracted from the web, including LaTeX, proofs, and problem sets
  • Why it's used: Improves numerical reasoning and scientific language understanding

4. Wikipedia β€” multilingual.jsonl

  • Dataset: wikimedia/wikipedia (20231101 dumps)
  • Languages: English, Spanish, French, German, Chinese, Japanese, Portuguese, Arabic, Russian, Hindi, Italian, Korean, Dutch, Polish, Swedish, Turkish, Vietnamese, Indonesian, Ukrainian, Persian
  • Why it's used: Clean, factual, encyclopedic text across 20 languages

5. OpenHermes 2.5 β€” synthetic.jsonl

  • Dataset: teknium/OpenHermes-2.5
  • What it is: ~1M high-quality instruction-following pairs formatted as ### Instruction / ### Response conversations
  • Why it's used: Teaches the model to follow instructions and produce structured, helpful responses

What Happens After Download

Once all files are ready, continue with the standard SAGE pipeline:

Train the tokenizer

python -m tokenizer.train_tokenizer \
  --input data/raw/general_web.jsonl \
          data/raw/code.jsonl \
          data/raw/math_science.jsonl \
          data/raw/multilingual.jsonl \
          data/raw/synthetic.jsonl \
  --model-prefix tokenizer/tokenizer \
  --vocab-size 32000

Build parquet shards

python -m data.pipeline \
  --tokenizer-model tokenizer/tokenizer.model \
  --output-dir data/processed \
  --shard-size 128

Start training

python -m train.trainer \
  --model-config configs/model/1b.yaml \
  --schedule-config configs/train/schedule.yaml \
  --train-shards data/processed/shard-00000.parquet \
  --validation-shards data/processed/shard-00001.parquet \
  --output-dir runs/sage-1b

Troubleshooting

Download stalls or disconnects
Run with --resume to pick up exactly where you left off. The writer appends to existing files and counts already-written tokens before continuing.

A specific language or dataset fails
The downloader catches errors per-source and logs a warning, then moves on. The other files are unaffected. Re-run with --only <filename> --resume to retry just that file.

Running out of disk space mid-download
Use --scale 0.5 to target 2.5B tokens total (~10–12 GB) instead of the full 5B. The model will be slightly less capable but the pipeline will still work end to end.

Slow download speed
All datasets are streamed β€” data is downloaded and written record by record, so you never need to load the entire dataset at once. If speed is consistently low, try running overnight or on a cloud VM closer to Hugging Face's CDN.


Full Script

"""
SAGE β€” 5 Billion Token Dataset Downloader
==========================================
Downloads ~5B tokens from free Hugging Face datasets and saves them
as JSONL files in your data/raw/ directory, ready for the SAGE pipeline.

Token budget breakdown:
  general_web.jsonl    β†’  2.5B tokens  (FineWeb)
  code.jsonl           β†’  1.0B tokens  (The Stack v2 - Python, JS, Rust, Go, C++)
  math_science.jsonl   β†’  0.5B tokens  (OpenWebMath)
  multilingual.jsonl   β†’  0.5B tokens  (Wikipedia 20+ languages)
  synthetic.jsonl      β†’  0.5B tokens  (OpenHermes instruction data)
  ─────────────────────────────────────
  TOTAL                β†’  ~5.0B tokens

Usage:
  pip install datasets huggingface_hub tqdm
  python debug/download_5b_tokens.py --output-dir data/raw
  python debug/download_5b_tokens.py --output-dir data/raw --resume
"""

import argparse
import json
import sys
import time
from pathlib import Path

missing = []
try:
    from datasets import load_dataset
except ImportError:
    missing.append("datasets")
try:
    from tqdm import tqdm
except ImportError:
    missing.append("tqdm")

if missing:
    print(f"[ERROR] Missing packages: {', '.join(missing)}")
    print(f"  Run:  pip install {' '.join(missing)}")
    sys.exit(1)


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def human_tokens(n: int) -> str:
    if n >= 1_000_000_000:
        return f"{n/1_000_000_000:.2f}B"
    if n >= 1_000_000:
        return f"{n/1_000_000:.1f}M"
    return f"{n:,}"

def human_bytes(n: int) -> str:
    for unit in ["B", "KB", "MB", "GB"]:
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"


class JSONLWriter:
    def __init__(self, path: Path, target_tokens: int, resume: bool = False):
        self.path = path
        self.target_tokens = target_tokens
        self.tokens_written = 0
        self.records_written = 0

        if resume and path.exists():
            print(f"  [resume] Counting existing tokens in {path.name}...")
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    try:
                        rec = json.loads(line)
                        self.tokens_written += estimate_tokens(rec.get("text", ""))
                        self.records_written += 1
                    except json.JSONDecodeError:
                        pass
            print(f"  [resume] Already have {human_tokens(self.tokens_written)} / {human_tokens(target_tokens)}")
            self._file = open(path, "a", encoding="utf-8", buffering=1024 * 1024)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            self._file = open(path, "w", encoding="utf-8", buffering=1024 * 1024)

    @property
    def done(self) -> bool:
        return self.tokens_written >= self.target_tokens

    def write(self, record: dict) -> int:
        text = record.get("text", "")
        if not text or len(text.strip()) < 50:
            return 0
        toks = estimate_tokens(text)
        self._file.write(json.dumps(record, ensure_ascii=False) + "\n")
        self.tokens_written += toks
        self.records_written += 1
        return toks

    def close(self):
        self._file.flush()
        self._file.close()

    def __enter__(self): return self
    def __exit__(self, *_): self.close()


def download_general_web(writer):
    print("\n[1/5] general_web.jsonl β€” FineWeb")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  web tokens")
    ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)
    for sample in ds:
        if writer.done: break
        bar.update(writer.write({"text": sample["text"], "source": "fineweb",
                                  "url": sample.get("url", ""), "language": "en"}))
    bar.close()
    print(f"  βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_code(writer):
    print("\n[2/5] code.jsonl β€” The Stack v2")
    LANGUAGES = [("python","Python"),("javascript","JavaScript"),("typescript","TypeScript"),
                 ("rust","Rust"),("go","Go"),("cpp","C++"),("java","Java"),
                 ("bash","Bash"),("sql","SQL"),("html","HTML")]
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  code tokens")
    tokens_per_lang = writer.target_tokens // len(LANGUAGES)
    for lang_id, lang_name in LANGUAGES:
        if writer.done: break
        lang_tokens = 0
        print(f"    β†’ {lang_name}...")
        try:
            ds = load_dataset("bigcode/the-stack-v2-train-smol-ids",
                              data_dir=f"data/{lang_id}", split="train",
                              streaming=True, trust_remote_code=True)
            for sample in ds:
                if writer.done or lang_tokens >= tokens_per_lang: break
                content = sample.get("content", "") or sample.get("text", "")
                if not content: continue
                t = writer.write({"text": content, "source": "the_stack_v2",
                                   "language": lang_id})
                bar.update(t); lang_tokens += t
        except Exception as e:
            print(f"    [warn] {lang_name} failed ({e}), skipping.")
    bar.close()
    print(f"  βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_math(writer):
    print("\n[3/5] math_science.jsonl β€” OpenWebMath")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  math tokens")
    ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
    for sample in ds:
        if writer.done: break
        bar.update(writer.write({"text": sample["text"], "source": "open_web_math",
                                  "url": sample.get("url", "")}))
    bar.close()
    print(f"  βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_multilingual(writer):
    print("\n[4/5] multilingual.jsonl β€” Wikipedia (20 languages)")
    LANGUAGES = [("en","English"),("es","Spanish"),("fr","French"),("de","German"),
                 ("zh","Chinese"),("ja","Japanese"),("pt","Portuguese"),("ar","Arabic"),
                 ("ru","Russian"),("hi","Hindi"),("it","Italian"),("ko","Korean"),
                 ("nl","Dutch"),("pl","Polish"),("sv","Swedish"),("tr","Turkish"),
                 ("vi","Vietnamese"),("id","Indonesian"),("uk","Ukrainian"),("fa","Persian")]
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  multilingual tokens")
    tokens_per_lang = writer.target_tokens // len(LANGUAGES)
    for lang_code, lang_name in LANGUAGES:
        if writer.done: break
        lang_tokens = 0
        try:
            ds = load_dataset("wikimedia/wikipedia", f"20231101.{lang_code}",
                              split="train", streaming=True, trust_remote_code=True)
            for sample in ds:
                if writer.done or lang_tokens >= tokens_per_lang: break
                text = sample.get("text", "")
                if not text: continue
                t = writer.write({"text": text, "source": "wikipedia",
                                   "language": lang_code, "title": sample.get("title","")})
                bar.update(t); lang_tokens += t
        except Exception as e:
            print(f"\n    [warn] Wikipedia {lang_name} failed: {e}")
    bar.close()
    print(f"  βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_synthetic(writer):
    print("\n[5/5] synthetic.jsonl β€” OpenHermes 2.5")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  synthetic tokens")
    ds = load_dataset("teknium/OpenHermes-2.5", split="train", streaming=True)
    rounds = 0
    while not writer.done and rounds < 10:
        for sample in ds:
            if writer.done: break
            convs = sample.get("conversations", [])
            parts = []
            for turn in convs:
                role, value = turn.get("from",""), turn.get("value","")
                if role == "human":   parts.append(f"### Instruction\n{value}")
                elif role == "gpt":   parts.append(f"### Response\n{value}")
            text = "\n\n".join(parts) or sample.get("text","")
            if not text: continue
            bar.update(writer.write({"text": text, "source": "openhermes_2.5",
                                      "task": "instruction_following"}))
        rounds += 1
    bar.close()
    print(f"  βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


TARGETS = {
    "general_web.jsonl":  2_500_000_000,
    "code.jsonl":         1_000_000_000,
    "math_science.jsonl":   500_000_000,
    "multilingual.jsonl":   500_000_000,
    "synthetic.jsonl":      500_000_000,
}
DOWNLOADERS = {
    "general_web.jsonl":  download_general_web,
    "code.jsonl":         download_code,
    "math_science.jsonl": download_math,
    "multilingual.jsonl": download_multilingual,
    "synthetic.jsonl":    download_synthetic,
}


def main():
    parser = argparse.ArgumentParser(description="Download ~5B tokens for SAGE training.")
    parser.add_argument("--output-dir", default="data/raw")
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--only", nargs="+", choices=list(TARGETS.keys()))
    parser.add_argument("--scale", type=float, default=1.0)
    args = parser.parse_args()

    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files_to_run = args.only or list(TARGETS.keys())
    total_target = sum(int(TARGETS[f] * args.scale) for f in files_to_run)

    print("=" * 60)
    print("  SAGE β€” 5 Billion Token Downloader")
    print("=" * 60)
    print(f"  Output dir : {out_dir.resolve()}")
    print(f"  Resume     : {args.resume}")
    print(f"  Scale      : {args.scale}x")
    print(f"  Target     : {human_tokens(total_target)} tokens")
    print(f"  Est. disk  : ~{total_target // 40_000_000} GB")
    print("=" * 60)

    grand_start = time.time()
    grand_tokens = 0

    for filename in files_to_run:
        target = int(TARGETS[filename] * args.scale)
        with JSONLWriter(out_dir / filename, target, resume=args.resume) as writer:
            if writer.done:
                print(f"\n[skip] {filename} already complete ({human_tokens(writer.tokens_written)} tokens)")
                grand_tokens += writer.tokens_written
                continue
            t0 = time.time()
            DOWNLOADERS[filename](writer)
            elapsed = time.time() - t0
            grand_tokens += writer.tokens_written
            size = (out_dir / filename).stat().st_size
            print(f"  Time: {elapsed/60:.1f} min  |  Size: {human_bytes(size)}")

    elapsed_total = time.time() - grand_start
    print("\n" + "=" * 60)
    print(f"  DONE β€” {human_tokens(grand_tokens)} tokens downloaded")
    print(f"  Total time: {elapsed_total/3600:.2f} hours")
    print(f"  Files: {out_dir.resolve()}/")
    print("=" * 60)


if __name__ == "__main__":
    main()