# SAGE — 5 Billion Token Dataset Downloader

Automatically downloads ~5B tokens from free, public Hugging Face datasets and saves them as JSONL files in your `data/raw/` directory, fully compatible with the SAGE training pipeline. Token counts are estimates: the script assumes roughly 4 characters per token rather than running a tokenizer.

---

## Token Budget

| File | Source | Tokens |
|---|---|---|
| `general_web.jsonl` | FineWeb | 2.5B |
| `code.jsonl` | The Stack v2 (Python, JS, Rust, Go, C++ and more) | 1.0B |
| `math_science.jsonl` | OpenWebMath | 0.5B |
| `multilingual.jsonl` | Wikipedia (20 languages) | 0.5B |
| `synthetic.jsonl` | OpenHermes 2.5 (instruction data) | 0.5B |
| **Total** | | **~5.0B tokens** |

**Estimated disk space:** ~20–25 GB

**Estimated download time:** 2–8 hours depending on your connection

**Cost:** Free. Gated sources (The Stack v2, at the time of writing) may additionally require a free Hugging Face account and acceptance of the dataset terms.

---

## Requirements

### System

- Python 3.9+
- 25 GB free disk space
- Stable internet connection

### Python packages

```bash
pip install datasets huggingface_hub tqdm
```

---

## Usage

### Basic — download everything

```bash
python debug/download_5b_tokens.py --output-dir data/raw
```

### Test run — 1% of data to verify everything works

```bash
python debug/download_5b_tokens.py --output-dir data/raw --scale 0.01
```

### Resume — continue after an internet cutout

```bash
python debug/download_5b_tokens.py --output-dir data/raw --resume
```

### Download only one specific file

```bash
python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl
```

### Download multiple specific files

```bash
python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl math_science.jsonl
```

---

## All Flags

| Flag | Default | Description |
|---|---|---|
| `--output-dir` | `data/raw` | Directory where JSONL files are saved |
| `--resume` | off | Append to existing files and skip those that already hit their token target; without it, existing files are overwritten |
| `--only` | all files | Download only the specified file(s) |
| `--scale` | `1.0` | Scale all token targets (e.g. `0.1` = 10% of 5B = 500M tokens) |

---

## Output Format

Every record written to disk is a JSON object with at minimum a `text` field, which makes it directly compatible with the SAGE pipeline. The other fields vary by source:

```json
{
  "text": "your training sample here",
  "source": "fineweb",
  "language": "en"
}
```

Records whose `text` is shorter than 50 characters are dropped.

---

## Data Sources

### 1. FineWeb — `general_web.jsonl`

- **Dataset:** `HuggingFaceFW/fineweb` (sample-10BT subset)
- **What it is:** A pre-shuffled, deduplicated 10B-token slice of web-crawl text, one of the cleanest freely available web datasets
- **Why it's used:** Broad general language coverage, essential for fluent text generation

### 2. The Stack v2 — `code.jsonl`

- **Dataset:** `bigcode/the-stack-v2-train-smol-ids`
- **What it is:** Source code across 10 programming languages: Python, JavaScript, TypeScript, Rust, Go, C++, Java, Bash, SQL, and HTML
- **Why it's used:** Teaches the model programming syntax, logic, and structure
- **Caveat:** The Stack v2 is published as an index of file identifiers, with file contents fetched separately from Software Heritage. If the streamed records carry no `content` or `text` field, the downloader skips them and `code.jsonl` may stay empty.

### 3. OpenWebMath — `math_science.jsonl`

- **Dataset:** `open-web-math/open-web-math`
- **What it is:** 14.7B tokens of mathematical content extracted from the web, including LaTeX, proofs, and problem sets
- **Why it's used:** Improves numerical reasoning and scientific language understanding

### 4. Wikipedia — `multilingual.jsonl`

- **Dataset:** `wikimedia/wikipedia` (20231101 dumps)
- **Languages:** English, Spanish, French, German, Chinese, Japanese, Portuguese, Arabic, Russian, Hindi, Italian, Korean, Dutch, Polish, Swedish, Turkish, Vietnamese, Indonesian, Ukrainian, Persian
- **Why it's used:** Clean, factual, encyclopedic text across 20 languages

### 5. OpenHermes 2.5 — `synthetic.jsonl`

- **Dataset:** `teknium/OpenHermes-2.5`
- **What it is:** ~1M high-quality instruction-following pairs formatted as `### Instruction` / `### Response` conversations
- **Why it's used:** Teaches the model to follow instructions and produce structured, helpful responses

---

## What Happens After Download

Once all files are ready, continue with the standard SAGE pipeline:

### Train the tokenizer

```bash
python -m tokenizer.train_tokenizer \
  --input data/raw/general_web.jsonl \
          data/raw/code.jsonl \
          data/raw/math_science.jsonl \
          data/raw/multilingual.jsonl \
          data/raw/synthetic.jsonl \
  --model-prefix tokenizer/tokenizer \
  --vocab-size 32000
```

### Build parquet shards

```bash
python -m data.pipeline \
  --tokenizer-model tokenizer/tokenizer.model \
  --output-dir data/processed \
  --shard-size 128
```

### Start training

```bash
python -m train.trainer \
  --model-config configs/model/1b.yaml \
  --schedule-config configs/train/schedule.yaml \
  --train-shards data/processed/shard-00000.parquet \
  --validation-shards data/processed/shard-00001.parquet \
  --output-dir runs/sage-1b
```

---

## Troubleshooting

**Download stalls or disconnects**
Run with `--resume` to continue toward each file's token target. The writer counts the tokens already on disk, then appends to the existing file until the target is reached. Each source stream restarts from its beginning, so records written before the interruption can be written again.

**A specific language or dataset fails**
For the per-language sources (code and Wikipedia), the downloader catches the error, logs a warning, and moves on to the next language. The other files are unaffected. Re-run with `--only <filename> --resume` to retry just that file.

**Running out of disk space mid-download**
Use `--scale 0.5` to target 2.5B tokens total (~10–12 GB) instead of the full 5B. The model will be slightly less capable, but the pipeline will still work end to end.

**Slow download speed**
All datasets are streamed: data is downloaded and written record by record, so you never need to load the entire dataset at once. If speed is consistently low, try running overnight or on a cloud VM closer to Hugging Face's CDN.

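
Because the downloader reopens each stream from the start when run with `--resume`, a resumed file can contain repeated records. The following is a minimal sketch of an exact-duplicate filter that could be run on a finished file. It is not part of the SAGE pipeline; the function name `dedupe_jsonl` and the choice of SHA-1 over the `text` field are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def dedupe_jsonl(src: Path, dst: Path) -> tuple[int, int]:
    """Copy src to dst, keeping only the first record for each distinct "text".

    Returns (records_kept, records_dropped). Lines that are not valid JSON
    objects are counted as dropped. Hypothetical helper, not part of SAGE.
    """
    seen: set[bytes] = set()
    kept = dropped = 0
    with open(src, "r", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                dropped += 1
                continue
            if not isinstance(rec, dict):
                dropped += 1
                continue
            # Hash the text so the set stores 20 bytes per record, not the text.
            text = str(rec.get("text") or "")
            digest = hashlib.sha1(text.encode("utf-8")).digest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            fout.write(line if line.endswith("\n") else line + "\n")
            kept += 1
    return kept, dropped
```

A call such as `dedupe_jsonl(Path("data/raw/general_web.jsonl"), Path("data/raw/general_web.dedup.jsonl"))` writes a filtered copy and leaves the original untouched. The set of digests lives in memory, so very large files may need a disk-backed alternative.
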

---

## Full Script

```python
"""
SAGE — 5 Billion Token Dataset Downloader
==========================================
Downloads ~5B tokens from free Hugging Face datasets and saves them as
JSONL files in your data/raw/ directory, ready for the SAGE pipeline.

Token budget breakdown:
  general_web.jsonl   → 2.5B tokens  (FineWeb)
  code.jsonl          → 1.0B tokens  (The Stack v2 - Python, JS, Rust, Go, C++)
  math_science.jsonl  → 0.5B tokens  (OpenWebMath)
  multilingual.jsonl  → 0.5B tokens  (Wikipedia, 20 languages)
  synthetic.jsonl     → 0.5B tokens  (OpenHermes instruction data)
  ─────────────────────────────────────
  TOTAL               → ~5.0B tokens

Usage:
  pip install datasets huggingface_hub tqdm
  python debug/download_5b_tokens.py --output-dir data/raw
  python debug/download_5b_tokens.py --output-dir data/raw --resume
"""

import argparse
import json
import sys
import time
from pathlib import Path

# Report every missing dependency at once instead of failing on the first.
missing = []
try:
    from datasets import load_dataset
except ImportError:
    missing.append("datasets")
try:
    from tqdm import tqdm
except ImportError:
    missing.append("tqdm")

if missing:
    print(f"[ERROR] Missing packages: {', '.join(missing)}")
    print(f" Run: pip install {' '.join(missing)}")
    sys.exit(1)


def estimate_tokens(text: str) -> int:
    # Heuristic: roughly 4 characters per token; no tokenizer is run.
    return max(1, len(text) // 4)


def human_tokens(n: int) -> str:
    if n >= 1_000_000_000:
        return f"{n/1_000_000_000:.2f}B"
    if n >= 1_000_000:
        return f"{n/1_000_000:.1f}M"
    return f"{n:,}"


def human_bytes(n: float) -> str:
    for unit in ["B", "KB", "MB", "GB"]:
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"


class JSONLWriter:
    """Writes records to a JSONL file until an estimated token target is met."""

    def __init__(self, path: Path, target_tokens: int, resume: bool = False):
        self.path = path
        self.target_tokens = target_tokens
        self.tokens_written = 0
        self.records_written = 0
        if resume and path.exists():
            # Re-count what is already on disk, then append to it.
            print(f" [resume] Counting existing tokens in {path.name}...")
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    try:
                        rec = json.loads(line)
                        self.tokens_written += estimate_tokens(rec.get("text", ""))
                        self.records_written += 1
                    except json.JSONDecodeError:
                        pass
            print(f" [resume] Already have {human_tokens(self.tokens_written)} / {human_tokens(target_tokens)}")
            self._file = open(path, "a", encoding="utf-8", buffering=1024 * 1024)
        else:
            # Fresh run: any existing file at this path is overwritten.
            path.parent.mkdir(parents=True, exist_ok=True)
            self._file = open(path, "w", encoding="utf-8", buffering=1024 * 1024)

    @property
    def done(self) -> bool:
        return self.tokens_written >= self.target_tokens

    def write(self, record: dict) -> int:
        """Write one record; return its estimated tokens (0 if skipped)."""
        text = record.get("text", "")
        if not text or len(text.strip()) < 50:
            return 0
        toks = estimate_tokens(text)
        self._file.write(json.dumps(record, ensure_ascii=False) + "\n")
        self.tokens_written += toks
        self.records_written += 1
        return toks

    def close(self):
        self._file.flush()
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()


def download_general_web(writer):
    print("\n[1/5] general_web.jsonl — FineWeb")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc=" web tokens")
    ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)
    for sample in ds:
        if writer.done:
            break
        bar.update(writer.write({"text": sample["text"], "source": "fineweb",
                                 "url": sample.get("url", ""), "language": "en"}))
    bar.close()
    print(f" ✓ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_code(writer):
    print("\n[2/5] code.jsonl — The Stack v2")
    LANGUAGES = [("python", "Python"), ("javascript", "JavaScript"), ("typescript", "TypeScript"),
                 ("rust", "Rust"), ("go", "Go"), ("cpp", "C++"), ("java", "Java"),
                 ("bash", "Bash"), ("sql", "SQL"), ("html", "HTML")]
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc=" code tokens")
    # Split the budget evenly across languages.
    tokens_per_lang = writer.target_tokens // len(LANGUAGES)
    for lang_id, lang_name in LANGUAGES:
        if writer.done:
            break
        lang_tokens = 0
        print(f" → {lang_name}...")
        try:
            ds = load_dataset("bigcode/the-stack-v2-train-smol-ids",
                              data_dir=f"data/{lang_id}",
                              split="train", streaming=True)
            for sample in ds:
                if writer.done or lang_tokens >= tokens_per_lang:
                    break
                # Records without inline file contents are skipped.
                content = sample.get("content", "") or sample.get("text", "")
                if not content:
                    continue
                t = writer.write({"text": content, "source": "the_stack_v2",
                                  "language": lang_id})
                bar.update(t)
                lang_tokens += t
        except Exception as e:
            print(f" [warn] {lang_name} failed ({e}), skipping.")
    bar.close()
    print(f" ✓ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_math(writer):
    print("\n[3/5] math_science.jsonl — OpenWebMath")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc=" math tokens")
    ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
    for sample in ds:
        if writer.done:
            break
        bar.update(writer.write({"text": sample["text"], "source": "open_web_math",
                                 "url": sample.get("url", "")}))
    bar.close()
    print(f" ✓ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_multilingual(writer):
    print("\n[4/5] multilingual.jsonl — Wikipedia (20 languages)")
    LANGUAGES = [("en", "English"), ("es", "Spanish"), ("fr", "French"), ("de", "German"),
                 ("zh", "Chinese"), ("ja", "Japanese"), ("pt", "Portuguese"), ("ar", "Arabic"),
                 ("ru", "Russian"), ("hi", "Hindi"), ("it", "Italian"), ("ko", "Korean"),
                 ("nl", "Dutch"), ("pl", "Polish"), ("sv", "Swedish"), ("tr", "Turkish"),
                 ("vi", "Vietnamese"), ("id", "Indonesian"), ("uk", "Ukrainian"), ("fa", "Persian")]
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc=" multilingual tokens")
    # Split the budget evenly across languages.
    tokens_per_lang = writer.target_tokens // len(LANGUAGES)
    for lang_code, lang_name in LANGUAGES:
        if writer.done:
            break
        lang_tokens = 0
        try:
            ds = load_dataset("wikimedia/wikipedia", f"20231101.{lang_code}",
                              split="train", streaming=True)
            for sample in ds:
                if writer.done or lang_tokens >= tokens_per_lang:
                    break
                text = sample.get("text", "")
                if not text:
                    continue
                t = writer.write({"text": text, "source": "wikipedia",
                                  "language": lang_code,
                                  "title": sample.get("title", "")})
                bar.update(t)
                lang_tokens += t
        except Exception as e:
            print(f"\n [warn] Wikipedia {lang_name} failed: {e}")
    bar.close()
    print(f" ✓ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_synthetic(writer):
    print("\n[5/5] synthetic.jsonl — OpenHermes 2.5")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc=" synthetic tokens")
    ds = load_dataset("teknium/OpenHermes-2.5", split="train", streaming=True)
    # If one pass over the dataset falls short of the target, replay the
    # stream (up to 10 passes in total). Replayed passes repeat samples.
    rounds = 0
    while not writer.done and rounds < 10:
        for sample in ds:
            if writer.done:
                break
            convs = sample.get("conversations", [])
            parts = []
            for turn in convs:
                role, value = turn.get("from", ""), turn.get("value", "")
                if role == "human":
                    parts.append(f"### Instruction\n{value}")
                elif role == "gpt":
                    parts.append(f"### Response\n{value}")
            text = "\n\n".join(parts) or sample.get("text", "")
            if not text:
                continue
            bar.update(writer.write({"text": text, "source": "openhermes_2.5",
                                     "task": "instruction_following"}))
        rounds += 1
    bar.close()
    print(f" ✓ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


TARGETS = {
    "general_web.jsonl": 2_500_000_000,
    "code.jsonl": 1_000_000_000,
    "math_science.jsonl": 500_000_000,
    "multilingual.jsonl": 500_000_000,
    "synthetic.jsonl": 500_000_000,
}

DOWNLOADERS = {
    "general_web.jsonl": download_general_web,
    "code.jsonl": download_code,
    "math_science.jsonl": download_math,
    "multilingual.jsonl": download_multilingual,
    "synthetic.jsonl": download_synthetic,
}


def main():
    parser = argparse.ArgumentParser(description="Download ~5B tokens for SAGE training.")
    parser.add_argument("--output-dir", default="data/raw")
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--only", nargs="+", choices=list(TARGETS.keys()))
    parser.add_argument("--scale", type=float, default=1.0)
    args = parser.parse_args()

    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files_to_run = args.only or list(TARGETS.keys())
    total_target = sum(int(TARGETS[f] * args.scale) for f in files_to_run)

    print("=" * 60)
    print(" SAGE — 5 Billion Token Downloader")
    print("=" * 60)
    print(f" Output dir : {out_dir.resolve()}")
    print(f" Resume : {args.resume}")
    print(f" Scale : {args.scale}x")
    print(f" Target : {human_tokens(total_target)} tokens")
    # ~4 characters per estimated token, so roughly 4 bytes per token on disk
    # (5B tokens is about 20 GB, before JSON overhead).
    print(f" Est. disk : ~{total_target * 4 / 1e9:.1f} GB")
    print("=" * 60)

    grand_start = time.time()
    grand_tokens = 0
    for filename in files_to_run:
        target = int(TARGETS[filename] * args.scale)
        with JSONLWriter(out_dir / filename, target, resume=args.resume) as writer:
            if writer.done:
                print(f"\n[skip] {filename} already complete ({human_tokens(writer.tokens_written)} tokens)")
                grand_tokens += writer.tokens_written
                continue
            t0 = time.time()
            DOWNLOADERS[filename](writer)
            elapsed = time.time() - t0
            grand_tokens += writer.tokens_written
        # Stat the file after the writer has closed so buffered data is counted.
        size = (out_dir / filename).stat().st_size
        print(f" Time: {elapsed/60:.1f} min | Size: {human_bytes(size)}")

    elapsed_total = time.time() - grand_start
    print("\n" + "=" * 60)
    print(f" DONE — {human_tokens(grand_tokens)} tokens downloaded")
    print(f" Total time: {elapsed_total/3600:.2f} hours")
    print(f" Files: {out_dir.resolve()}/")
    print("=" * 60)


if __name__ == "__main__":
    main()
```
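
After a run finishes, it can be useful to confirm how much each file actually holds. The sketch below re-counts records and estimated tokens with the same 4-characters-per-token heuristic the script uses. The helper name `summarize_jsonl` is hypothetical, and the `data/raw` path in the example assumes the default `--output-dir`.

```python
import json
from pathlib import Path


def estimate_tokens(text: str) -> int:
    # Same heuristic as the downloader: roughly 4 characters per token.
    return max(1, len(text) // 4)


def summarize_jsonl(path: Path) -> dict:
    """Count valid records, estimated tokens, and malformed lines in a file.

    Hypothetical helper for checking downloader output; not part of SAGE.
    """
    records = tokens = bad_lines = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                bad_lines += 1
                continue
            if not isinstance(rec, dict):
                bad_lines += 1
                continue
            tokens += estimate_tokens(str(rec.get("text") or ""))
            records += 1
    return {"records": records, "tokens": tokens, "bad_lines": bad_lines}


if __name__ == "__main__":
    # Assumes the default output directory; adjust if --output-dir was changed.
    for p in sorted(Path("data/raw").glob("*.jsonl")):
        print(p.name, summarize_jsonl(p))
```

A non-zero `bad_lines` count usually points at a line that was cut off when a download was interrupted.
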