SAGE: 5 Billion Token Dataset Downloader
Automatically downloads ~5B tokens from free, public Hugging Face datasets and saves them as JSONL files in your data/raw/ directory, fully compatible with the SAGE training pipeline.
Token Budget
| File | Source | Tokens |
|---|---|---|
| general_web.jsonl | FineWeb | 2.5B |
| code.jsonl | The Stack v2 (Python, JS, Rust, Go, C++ and more) | 1.0B |
| math_science.jsonl | OpenWebMath | 0.5B |
| multilingual.jsonl | Wikipedia (20 languages) | 0.5B |
| synthetic.jsonl | OpenHermes 2.5 (instruction data) | 0.5B |
| Total | | ~5.0B |
Estimated disk space: ~20-25 GB
Estimated download time: 2-8 hours depending on your connection
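Cost: free. Most sources need no account; gated datasets on Hugging Face (The Stack v2 is one at the time of writing) require logging in and accepting the dataset terms first.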
Requirements
System
- Python 3.9+
- 25 GB free disk space
- Stable internet connection
Python packages
pip install datasets huggingface_hub tqdm
Usage
Basic: download everything
python debug/download_5b_tokens.py --output-dir data/raw
Test run: 1% of data to verify everything works
python debug/download_5b_tokens.py --output-dir data/raw --scale 0.01
Resume: continue after an internet cutout
python debug/download_5b_tokens.py --output-dir data/raw --resume
Download only one specific file
python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl
Download multiple specific files
python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl math_science.jsonl
All Flags
| Flag | Default | Description |
|---|---|---|
| --output-dir | data/raw | Directory where JSONL files are saved |
| --resume | off | Skip files that already hit their token target |
| --only | all files | Download only the specified file(s) |
| --scale | 1.0 | Scale all token targets (e.g. 0.1 = 10% of 5B = 500M tokens) |
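To make the interaction between --scale and --only concrete, here is a minimal sketch of the arithmetic. The target numbers are copied from the TARGETS table in the full script; the helper name scaled_targets is hypothetical and exists only for this illustration.

```python
# Per-file token targets, copied from the TARGETS table in the script.
TARGETS = {
    "general_web.jsonl": 2_500_000_000,
    "code.jsonl": 1_000_000_000,
    "math_science.jsonl": 500_000_000,
    "multilingual.jsonl": 500_000_000,
    "synthetic.jsonl": 500_000_000,
}


def scaled_targets(scale, only=None):
    """Return {filename: token_target} after applying --scale and --only.

    Hypothetical helper: it mirrors int(TARGETS[f] * args.scale) from main().
    """
    names = only or list(TARGETS)
    # int() truncates toward zero, exactly as the script does.
    return {name: int(TARGETS[name] * scale) for name in names}


if __name__ == "__main__":
    plan = scaled_targets(0.1, only=["code.jsonl", "math_science.jsonl"])
    for name, tokens in plan.items():
        print(f"{name}: {tokens:,} tokens")
    print(f"total: {sum(plan.values()):,} tokens")
```

Note that --scale applies per file, so combining it with --only shrinks only the files you selected.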
Output Format
Every record written to disk follows this structure with at minimum a text field, making it directly compatible with the SAGE pipeline:
{ "text": "your training sample here", "source": "fineweb", "language": "en" }
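Before handing the files to the rest of the pipeline, it can be useful to confirm that every line really matches this shape. The following is a minimal sketch of such a check, not part of the SAGE tooling; check_record and check_file are hypothetical names, and the only rule enforced is the one stated above (valid JSON object with a non-empty text string).

```python
import json


def check_record(line):
    """Return an error string for a bad JSONL line, or None if it looks usable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return "record is not a JSON object"
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        return "missing or empty 'text' field"
    return None


def check_file(path, max_errors=5):
    """Scan a JSONL file and return (records_ok, first_few_errors)."""
    ok, errors = 0, []
    with open(path, "r", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            problem = check_record(line)
            if problem is None:
                ok += 1
            elif len(errors) < max_errors:
                errors.append(f"line {number}: {problem}")
    return ok, errors
```

Extra fields such as source, language, url, or title are ignored by this check, since only text is required.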
Data Sources
1. FineWeb -> general_web.jsonl
- Dataset: HuggingFaceFW/fineweb (sample-10BT subset)
- What it is: A pre-shuffled, deduplicated 10B-token slice of web-crawl text, one of the cleanest freely available web datasets
- Why it's used: Broad general language coverage, essential for fluent text generation
2. The Stack v2 -> code.jsonl
- Dataset: bigcode/the-stack-v2-train-smol-ids
- What it is: Source code across 10 programming languages: Python, JavaScript, TypeScript, Rust, Go, C++, Java, Bash, SQL, and HTML
- Why it's used: Teaches the model programming syntax, logic, and structure
3. OpenWebMath -> math_science.jsonl
- Dataset: open-web-math/open-web-math
- What it is: 14.7B tokens of mathematical content extracted from the web, including LaTeX, proofs, and problem sets
- Why it's used: Improves numerical reasoning and scientific language understanding
4. Wikipedia -> multilingual.jsonl
- Dataset: wikimedia/wikipedia (20231101 dumps)
- Languages: English, Spanish, French, German, Chinese, Japanese, Portuguese, Arabic, Russian, Hindi, Italian, Korean, Dutch, Polish, Swedish, Turkish, Vietnamese, Indonesian, Ukrainian, Persian
- Why it's used: Clean, factual, encyclopedic text across 20 languages
5. OpenHermes 2.5 -> synthetic.jsonl
- Dataset: teknium/OpenHermes-2.5
- What it is: ~1M high-quality instruction-following conversations, which the script flattens into ### Instruction / ### Response text
- Why it's used: Teaches the model to follow instructions and produce structured, helpful responses
What Happens After Download
Once all files are ready, continue with the standard SAGE pipeline:
Train the tokenizer
python -m tokenizer.train_tokenizer \
--input data/raw/general_web.jsonl \
data/raw/code.jsonl \
data/raw/math_science.jsonl \
data/raw/multilingual.jsonl \
data/raw/synthetic.jsonl \
--model-prefix tokenizer/tokenizer \
--vocab-size 32000
Build parquet shards
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 128
Start training
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet \
--validation-shards data/processed/shard-00001.parquet \
--output-dir runs/sage-1b
Troubleshooting
Download stalls or disconnects
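Run with --resume to continue toward the token target. The writer counts the tokens already in each existing file and appends to it rather than overwriting it. Each source stream restarts from its first record, however, so a resumed file can contain records that were already written before the interruption; deduplicate afterward if that matters for your run. Without --resume, existing files are overwritten.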
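Because a resumed run appends to the file while the script's streaming loops start again from the first record of each source, repeated records are possible. One way to clean this up afterward is sketched below. It is an illustration under stated assumptions, not part of the downloader: dedupe_jsonl is a hypothetical helper, duplicates are detected by an exact hash of the text field only, and the set of hashes is held in memory.

```python
import hashlib
import json


def dedupe_jsonl(src_path, dst_path):
    """Copy src to dst, keeping only the first record for each distinct text.

    Returns (kept, dropped). Malformed lines count as dropped.
    """
    seen = set()
    kept = dropped = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            try:
                text = json.loads(line).get("text", "")
            except (json.JSONDecodeError, AttributeError):
                dropped += 1  # e.g. a record cut off when the download died
                continue
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            # Normalise the line ending so the output stays one record per line.
            dst.write(line if line.endswith("\n") else line + "\n")
            kept += 1
    return kept, dropped
```

Writing to a separate destination path keeps the original file intact until you have checked the counts.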
A specific language or dataset fails
The downloader catches errors per-source and logs a warning, then moves on. The other files are unaffected. Re-run with --only <filename> --resume to retry just that file.
Running out of disk space mid-download
Use --scale 0.5 to target 2.5B tokens total (~10-12 GB) instead of the full 5B. A model trained on half the data will likely be somewhat less capable, but the pipeline still works end to end.
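To choose a scale value up front, the disk figures above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes about 4 bytes of text per token (matching the script's len(text) // 4 estimate) plus 20% for JSON keys and metadata; both numbers and both helper names are assumptions for illustration. Non-Latin text uses more than one byte per character in UTF-8, so multilingual.jsonl may come out larger than this estimate.

```python
import shutil

BYTES_PER_TOKEN = 4      # assumption: mirrors the len(text) // 4 token estimate
OVERHEAD_PERCENT = 20    # assumption: JSON keys, metadata fields, newlines


def estimated_bytes(total_tokens):
    """Rough on-disk size in bytes for a given token target (integer math)."""
    return total_tokens * BYTES_PER_TOKEN * (100 + OVERHEAD_PERCENT) // 100


def largest_scale_that_fits(free_bytes, full_tokens=5_000_000_000):
    """Largest --scale value, capped at 1.0, whose estimate fits in free_bytes."""
    return min(1.0, free_bytes / estimated_bytes(full_tokens))


if __name__ == "__main__":
    free = shutil.disk_usage(".").free
    scale = largest_scale_that_fits(free)
    print(f"free: {free / 1e9:.1f} GB, suggested --scale <= {scale:.2f}")
```

Leaving some headroom below the suggested value is sensible, since the Hugging Face cache and later pipeline stages also need space.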
Slow download speed
All datasets are streamed: data is downloaded and written record by record, so you never need to load the entire dataset at once. If speed is consistently low, try running overnight or on a cloud VM closer to Hugging Face's CDN.
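The record-by-record pattern can be sketched independently of any particular dataset. In the example below, take_until_budget is a hypothetical generator that consumes any iterable of records and stops once an estimated token budget is reached; in the real script the iterable would be the object returned by load_dataset(..., streaming=True). The 50-character minimum and the 4-characters-per-token estimate are taken from the script.

```python
def take_until_budget(stream, budget_tokens, min_chars=50):
    """Yield records from an iterable until the estimated token budget is met.

    Only one record is held at a time, so memory use does not grow with
    the size of the underlying dataset.
    """
    used = 0
    for record in stream:
        if used >= budget_tokens:
            break
        text = record.get("text", "")
        if len(text.strip()) < min_chars:
            continue  # same short-text filter as JSONLWriter.write
        used += max(1, len(text) // 4)  # ~4 characters per token
        yield record


if __name__ == "__main__":
    fake_stream = ({"text": "x" * 400} for _ in range(1_000_000))
    kept = sum(1 for _ in take_until_budget(fake_stream, budget_tokens=250))
    print(f"records kept: {kept}")
```

Because the generator stops pulling from the stream as soon as the budget is met, the remaining records are never downloaded.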
Full Script
"""
SAGE - 5 Billion Token Dataset Downloader
==========================================
Downloads ~5B tokens from free Hugging Face datasets and saves them
as JSONL files in your data/raw/ directory, ready for the SAGE pipeline.

Token budget breakdown:
    general_web.jsonl   - 2.5B tokens (FineWeb)
    code.jsonl          - 1.0B tokens (The Stack v2 - Python, JS, Rust, Go, C++)
    math_science.jsonl  - 0.5B tokens (OpenWebMath)
    multilingual.jsonl  - 0.5B tokens (Wikipedia, 20 languages)
    synthetic.jsonl     - 0.5B tokens (OpenHermes instruction data)
    -------------------------------------
    TOTAL               - ~5.0B tokens

Usage:
    pip install datasets huggingface_hub tqdm
    python debug/download_5b_tokens.py --output-dir data/raw
    python debug/download_5b_tokens.py --output-dir data/raw --resume
"""
import argparse
import json
import sys
import time
from pathlib import Path

missing = []
try:
    from datasets import load_dataset
except ImportError:
    missing.append("datasets")
try:
    from tqdm import tqdm
except ImportError:
    missing.append("tqdm")
if missing:
    print(f"[ERROR] Missing packages: {', '.join(missing)}")
    print(f"        Run: pip install {' '.join(missing)}")
    sys.exit(1)
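

def estimate_tokens(text: str) -> int:
    """Rough token count: about 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)


def human_tokens(n: int) -> str:
    """Format a token count as e.g. '2.50B', '12.3M', or '45,678'."""
    if n >= 1_000_000_000:
        return f"{n/1_000_000_000:.2f}B"
    if n >= 1_000_000:
        return f"{n/1_000_000:.1f}M"
    return f"{n:,}"


def human_bytes(n: float) -> str:
    """Format a byte count using binary units (1 KB = 1024 B)."""
    for unit in ["B", "KB", "MB", "GB"]:
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"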


class JSONLWriter:
    """Appends records to a JSONL file and tracks an estimated token count."""

    def __init__(self, path: Path, target_tokens: int, resume: bool = False):
        self.path = path
        self.target_tokens = target_tokens
        self.tokens_written = 0
        self.records_written = 0
        if resume and path.exists():
            # Re-count what is already on disk, then append to it.
            print(f"  [resume] Counting existing tokens in {path.name}...")
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    try:
                        rec = json.loads(line)
                        self.tokens_written += estimate_tokens(rec.get("text", ""))
                        self.records_written += 1
                    except json.JSONDecodeError:
                        pass  # ignore a partially written final line
            print(f"  [resume] Already have {human_tokens(self.tokens_written)} / {human_tokens(target_tokens)}")
            self._file = open(path, "a", encoding="utf-8", buffering=1024 * 1024)
        else:
            # Fresh run: any existing file at this path is overwritten.
            path.parent.mkdir(parents=True, exist_ok=True)
            self._file = open(path, "w", encoding="utf-8", buffering=1024 * 1024)

    @property
    def done(self) -> bool:
        return self.tokens_written >= self.target_tokens

    def write(self, record: dict) -> int:
        """Write one record; return its token estimate, or 0 if it was skipped."""
        text = record.get("text", "")
        if not text or len(text.strip()) < 50:
            return 0
        toks = estimate_tokens(text)
        self._file.write(json.dumps(record, ensure_ascii=False) + "\n")
        self.tokens_written += toks
        self.records_written += 1
        return toks

    def close(self):
        self._file.flush()
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, *_):
        self.close()


def download_general_web(writer):
    print("\n[1/5] general_web.jsonl <- FineWeb")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  web tokens")
    ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)
    for sample in ds:
        if writer.done:
            break
        bar.update(writer.write({"text": sample["text"], "source": "fineweb",
                                 "url": sample.get("url", ""), "language": "en"}))
    bar.close()
    print(f"  done: {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_code(writer):
    print("\n[2/5] code.jsonl <- The Stack v2")
    LANGUAGES = [("python", "Python"), ("javascript", "JavaScript"), ("typescript", "TypeScript"),
                 ("rust", "Rust"), ("go", "Go"), ("cpp", "C++"), ("java", "Java"),
                 ("bash", "Bash"), ("sql", "SQL"), ("html", "HTML")]
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  code tokens")
    # Split the budget evenly across languages.
    tokens_per_lang = writer.target_tokens // len(LANGUAGES)
    for lang_id, lang_name in LANGUAGES:
        if writer.done:
            break
        lang_tokens = 0
        print(f"  -> {lang_name}...")
        try:
            # Caveat (to verify against the dataset card): this dataset may be
            # gated and may ship file IDs rather than file contents, in which
            # case every sample below is skipped for lack of "content"/"text".
            ds = load_dataset("bigcode/the-stack-v2-train-smol-ids",
                              data_dir=f"data/{lang_id}", split="train",
                              streaming=True, trust_remote_code=True)
            for sample in ds:
                if writer.done or lang_tokens >= tokens_per_lang:
                    break
                content = sample.get("content", "") or sample.get("text", "")
                if not content:
                    continue
                t = writer.write({"text": content, "source": "the_stack_v2",
                                  "language": lang_id})
                bar.update(t)
                lang_tokens += t
        except Exception as e:
            # One language failing should not abort the others.
            print(f"  [warn] {lang_name} failed ({e}), skipping.")
    bar.close()
    print(f"  done: {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_math(writer):
    print("\n[3/5] math_science.jsonl <- OpenWebMath")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  math tokens")
    ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
    for sample in ds:
        if writer.done:
            break
        bar.update(writer.write({"text": sample["text"], "source": "open_web_math",
                                 "url": sample.get("url", "")}))
    bar.close()
    print(f"  done: {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_multilingual(writer):
    print("\n[4/5] multilingual.jsonl <- Wikipedia (20 languages)")
    LANGUAGES = [("en", "English"), ("es", "Spanish"), ("fr", "French"), ("de", "German"),
                 ("zh", "Chinese"), ("ja", "Japanese"), ("pt", "Portuguese"), ("ar", "Arabic"),
                 ("ru", "Russian"), ("hi", "Hindi"), ("it", "Italian"), ("ko", "Korean"),
                 ("nl", "Dutch"), ("pl", "Polish"), ("sv", "Swedish"), ("tr", "Turkish"),
                 ("vi", "Vietnamese"), ("id", "Indonesian"), ("uk", "Ukrainian"), ("fa", "Persian")]
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  multilingual tokens")
    tokens_per_lang = writer.target_tokens // len(LANGUAGES)
    for lang_code, lang_name in LANGUAGES:
        if writer.done:
            break
        lang_tokens = 0
        try:
            ds = load_dataset("wikimedia/wikipedia", f"20231101.{lang_code}",
                              split="train", streaming=True, trust_remote_code=True)
            for sample in ds:
                if writer.done or lang_tokens >= tokens_per_lang:
                    break
                text = sample.get("text", "")
                if not text:
                    continue
                t = writer.write({"text": text, "source": "wikipedia",
                                  "language": lang_code, "title": sample.get("title", "")})
                bar.update(t)
                lang_tokens += t
        except Exception as e:
            print(f"\n  [warn] Wikipedia {lang_name} failed: {e}")
    bar.close()
    print(f"  done: {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


def download_synthetic(writer):
    print("\n[5/5] synthetic.jsonl <- OpenHermes 2.5")
    bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
               unit="tok", unit_scale=True, desc="  synthetic tokens")
    ds = load_dataset("teknium/OpenHermes-2.5", split="train", streaming=True)
    # The dataset may be smaller than the target, so re-iterate it up to 10
    # times; every extra pass repeats records that were already written.
    rounds = 0
    while not writer.done and rounds < 10:
        for sample in ds:
            if writer.done:
                break
            convs = sample.get("conversations", [])
            parts = []
            for turn in convs:
                role, value = turn.get("from", ""), turn.get("value", "")
                if role == "human":
                    parts.append(f"### Instruction\n{value}")
                elif role == "gpt":
                    parts.append(f"### Response\n{value}")
            text = "\n\n".join(parts) or sample.get("text", "")
            if not text:
                continue
            bar.update(writer.write({"text": text, "source": "openhermes_2.5",
                                     "task": "instruction_following"}))
        rounds += 1
    bar.close()
    print(f"  done: {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")


TARGETS = {
    "general_web.jsonl": 2_500_000_000,
    "code.jsonl": 1_000_000_000,
    "math_science.jsonl": 500_000_000,
    "multilingual.jsonl": 500_000_000,
    "synthetic.jsonl": 500_000_000,
}

DOWNLOADERS = {
    "general_web.jsonl": download_general_web,
    "code.jsonl": download_code,
    "math_science.jsonl": download_math,
    "multilingual.jsonl": download_multilingual,
    "synthetic.jsonl": download_synthetic,
}


def main():
    parser = argparse.ArgumentParser(description="Download ~5B tokens for SAGE training.")
    parser.add_argument("--output-dir", default="data/raw")
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--only", nargs="+", choices=list(TARGETS.keys()))
    parser.add_argument("--scale", type=float, default=1.0)
    args = parser.parse_args()

    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files_to_run = args.only or list(TARGETS.keys())
    total_target = sum(int(TARGETS[f] * args.scale) for f in files_to_run)

    print("=" * 60)
    print("  SAGE - 5 Billion Token Downloader")
    print("=" * 60)
    print(f"  Output dir : {out_dir.resolve()}")
    print(f"  Resume     : {args.resume}")
    print(f"  Scale      : {args.scale}x")
    print(f"  Target     : {human_tokens(total_target)} tokens")
    # estimate_tokens assumes ~4 characters per token, so budget ~4 bytes each.
    # (The previous divisor of 40_000_000 reported ~125 GB for the full run.)
    print(f"  Est. disk  : ~{human_bytes(total_target * 4)}")
    print("=" * 60)

    grand_start = time.time()
    grand_tokens = 0
    for filename in files_to_run:
        target = int(TARGETS[filename] * args.scale)
        with JSONLWriter(out_dir / filename, target, resume=args.resume) as writer:
            if writer.done:
                print(f"\n[skip] {filename} already complete ({human_tokens(writer.tokens_written)} tokens)")
                grand_tokens += writer.tokens_written
                continue
            t0 = time.time()
            DOWNLOADERS[filename](writer)
        # The writer is closed here, so the size on disk includes buffered data.
        elapsed = time.time() - t0
        grand_tokens += writer.tokens_written
        size = (out_dir / filename).stat().st_size
        print(f"  Time: {elapsed/60:.1f} min | Size: {human_bytes(size)}")

    elapsed_total = time.time() - grand_start
    print("\n" + "=" * 60)
    print(f"  DONE - {human_tokens(grand_tokens)} tokens downloaded")
    print(f"  Total time: {elapsed_total/3600:.2f} hours")
    print(f"  Files: {out_dir.resolve()}/")
    print("=" * 60)


if __name__ == "__main__":
    main()