| # SAGE β 5 Billion Token Dataset Downloader | |
| Automatically downloads ~5B tokens from free, public Hugging Face datasets and saves them as JSONL files in your `data/raw/` directory, fully compatible with the SAGE training pipeline. | |
| --- | |
| ## Token Budget | |
| | File | Source | Tokens | | |
| |---|---|---| | |
| | `general_web.jsonl` | FineWeb | 2.5B | | |
| | `code.jsonl` | The Stack v2 (Python, JS, Rust, Go, C++ and more) | 1.0B | | |
| | `math_science.jsonl` | OpenWebMath | 0.5B | | |
| | `multilingual.jsonl` | Wikipedia (20+ languages) | 0.5B | | |
| | `synthetic.jsonl` | OpenHermes 2.5 (instruction data) | 0.5B | | |
| | **Total** | | **~5.0B tokens** | | |
| **Estimated disk space:** ~20β25 GB | |
| **Estimated download time:** 2β8 hours depending on your connection | |
| **Cost:** 100% free, no account required | |
| --- | |
| ## Requirements | |
| ### System | |
| - Python 3.9+ | |
| - 25 GB free disk space | |
| - Stable internet connection | |
| ### Python packages | |
| ```bash | |
| pip install datasets huggingface_hub tqdm | |
| ``` | |
| --- | |
| ## Usage | |
| ### Basic β download everything | |
| ```bash | |
| python debug/download_5b_tokens.py --output-dir data/raw | |
| ``` | |
| ### Test run β 1% of data to verify everything works | |
| ```bash | |
| python debug/download_5b_tokens.py --output-dir data/raw --scale 0.01 | |
| ``` | |
| ### Resume β continue after an internet cutout | |
| ```bash | |
| python debug/download_5b_tokens.py --output-dir data/raw --resume | |
| ``` | |
| ### Download only one specific file | |
| ```bash | |
| python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl | |
| ``` | |
| ### Download multiple specific files | |
| ```bash | |
| python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl math_science.jsonl | |
| ``` | |
| --- | |
| ## All Flags | |
| | Flag | Default | Description | | |
| |---|---|---| | |
| | `--output-dir` | `data/raw` | Directory where JSONL files are saved | | |
| | `--resume` | off | Skip files that already hit their token target | | |
| | `--only` | all files | Download only the specified file(s) | | |
| | `--scale` | `1.0` | Scale all token targets (e.g. `0.1` = 10% of 5B = 500M tokens) | | |
| --- | |
| ## Output Format | |
| Every record written to disk follows this structure with at minimum a `text` field, making it directly compatible with the SAGE pipeline: | |
| ```json | |
| { "text": "your training sample here", "source": "fineweb", "language": "en" } | |
| ``` | |
| --- | |
| ## Data Sources | |
| ### 1. FineWeb β `general_web.jsonl` | |
| - **Dataset:** `HuggingFaceFW/fineweb` (sample-10BT subset) | |
| - **What it is:** A pre-shuffled, deduplicated 10B-token slice of web-crawl text, one of the cleanest freely available web datasets | |
| - **Why it's used:** Broad general language coverage, essential for fluent text generation | |
| ### 2. The Stack v2 β `code.jsonl` | |
| - **Dataset:** `bigcode/the-stack-v2-train-smol-ids` | |
| - **What it is:** Source code across 10 programming languages: Python, JavaScript, TypeScript, Rust, Go, C++, Java, Bash, SQL, and HTML | |
| - **Why it's used:** Teaches the model programming syntax, logic, and structure | |
| ### 3. OpenWebMath β `math_science.jsonl` | |
| - **Dataset:** `open-web-math/open-web-math` | |
| - **What it is:** 14.7B tokens of mathematical content extracted from the web, including LaTeX, proofs, and problem sets | |
| - **Why it's used:** Improves numerical reasoning and scientific language understanding | |
| ### 4. Wikipedia β `multilingual.jsonl` | |
| - **Dataset:** `wikimedia/wikipedia` (20231101 dumps) | |
| - **Languages:** English, Spanish, French, German, Chinese, Japanese, Portuguese, Arabic, Russian, Hindi, Italian, Korean, Dutch, Polish, Swedish, Turkish, Vietnamese, Indonesian, Ukrainian, Persian | |
| - **Why it's used:** Clean, factual, encyclopedic text across 20 languages | |
| ### 5. OpenHermes 2.5 β `synthetic.jsonl` | |
| - **Dataset:** `teknium/OpenHermes-2.5` | |
| - **What it is:** ~1M high-quality instruction-following pairs formatted as `### Instruction` / `### Response` conversations | |
| - **Why it's used:** Teaches the model to follow instructions and produce structured, helpful responses | |
| --- | |
| ## What Happens After Download | |
| Once all files are ready, continue with the standard SAGE pipeline: | |
| ### Train the tokenizer | |
| ```bash | |
| python -m tokenizer.train_tokenizer \ | |
| --input data/raw/general_web.jsonl \ | |
| data/raw/code.jsonl \ | |
| data/raw/math_science.jsonl \ | |
| data/raw/multilingual.jsonl \ | |
| data/raw/synthetic.jsonl \ | |
| --model-prefix tokenizer/tokenizer \ | |
| --vocab-size 32000 | |
| ``` | |
| ### Build parquet shards | |
| ```bash | |
| python -m data.pipeline \ | |
| --tokenizer-model tokenizer/tokenizer.model \ | |
| --output-dir data/processed \ | |
| --shard-size 128 | |
| ``` | |
| ### Start training | |
| ```bash | |
| python -m train.trainer \ | |
| --model-config configs/model/1b.yaml \ | |
| --schedule-config configs/train/schedule.yaml \ | |
| --train-shards data/processed/shard-00000.parquet \ | |
| --validation-shards data/processed/shard-00001.parquet \ | |
| --output-dir runs/sage-1b | |
| ``` | |
| --- | |
| ## Troubleshooting | |
| **Download stalls or disconnects** | |
| Run with `--resume` to pick up exactly where you left off. The writer appends to existing files and counts already-written tokens before continuing. | |
| **A specific language or dataset fails** | |
| The downloader catches errors per-source and logs a warning, then moves on. The other files are unaffected. Re-run with `--only <filename> --resume` to retry just that file. | |
| **Running out of disk space mid-download** | |
| Use `--scale 0.5` to target 2.5B tokens total (~10β12 GB) instead of the full 5B. The model will be slightly less capable but the pipeline will still work end to end. | |
| **Slow download speed** | |
| All datasets are streamed β data is downloaded and written record by record, so you never need to load the entire dataset at once. If speed is consistently low, try running overnight or on a cloud VM closer to Hugging Face's CDN. | |
| --- | |
| ## Full Script | |
| ```python | |
| """ | |
| SAGE β 5 Billion Token Dataset Downloader | |
| ========================================== | |
| Downloads ~5B tokens from free Hugging Face datasets and saves them | |
| as JSONL files in your data/raw/ directory, ready for the SAGE pipeline. | |
| Token budget breakdown: | |
| general_web.jsonl β 2.5B tokens (FineWeb) | |
| code.jsonl β 1.0B tokens (The Stack v2 - Python, JS, Rust, Go, C++) | |
| math_science.jsonl β 0.5B tokens (OpenWebMath) | |
| multilingual.jsonl β 0.5B tokens (Wikipedia 20+ languages) | |
| synthetic.jsonl β 0.5B tokens (OpenHermes instruction data) | |
| βββββββββββββββββββββββββββββββββββββ | |
| TOTAL β ~5.0B tokens | |
| Usage: | |
| pip install datasets huggingface_hub tqdm | |
| python debug/download_5b_tokens.py --output-dir data/raw | |
| python debug/download_5b_tokens.py --output-dir data/raw --resume | |
| """ | |
| import argparse | |
| import json | |
| import sys | |
| import time | |
| from pathlib import Path | |
| missing = [] | |
| try: | |
| from datasets import load_dataset | |
| except ImportError: | |
| missing.append("datasets") | |
| try: | |
| from tqdm import tqdm | |
| except ImportError: | |
| missing.append("tqdm") | |
| if missing: | |
| print(f"[ERROR] Missing packages: {', '.join(missing)}") | |
| print(f" Run: pip install {' '.join(missing)}") | |
| sys.exit(1) | |
| def estimate_tokens(text: str) -> int: | |
| return max(1, len(text) // 4) | |
| def human_tokens(n: int) -> str: | |
| if n >= 1_000_000_000: | |
| return f"{n/1_000_000_000:.2f}B" | |
| if n >= 1_000_000: | |
| return f"{n/1_000_000:.1f}M" | |
| return f"{n:,}" | |
| def human_bytes(n: int) -> str: | |
| for unit in ["B", "KB", "MB", "GB"]: | |
| if n < 1024: | |
| return f"{n:.1f} {unit}" | |
| n /= 1024 | |
| return f"{n:.1f} TB" | |
| class JSONLWriter: | |
| def __init__(self, path: Path, target_tokens: int, resume: bool = False): | |
| self.path = path | |
| self.target_tokens = target_tokens | |
| self.tokens_written = 0 | |
| self.records_written = 0 | |
| if resume and path.exists(): | |
| print(f" [resume] Counting existing tokens in {path.name}...") | |
| with open(path, "r", encoding="utf-8") as f: | |
| for line in f: | |
| try: | |
| rec = json.loads(line) | |
| self.tokens_written += estimate_tokens(rec.get("text", "")) | |
| self.records_written += 1 | |
| except json.JSONDecodeError: | |
| pass | |
| print(f" [resume] Already have {human_tokens(self.tokens_written)} / {human_tokens(target_tokens)}") | |
| self._file = open(path, "a", encoding="utf-8", buffering=1024 * 1024) | |
| else: | |
| path.parent.mkdir(parents=True, exist_ok=True) | |
| self._file = open(path, "w", encoding="utf-8", buffering=1024 * 1024) | |
| @property | |
| def done(self) -> bool: | |
| return self.tokens_written >= self.target_tokens | |
| def write(self, record: dict) -> int: | |
| text = record.get("text", "") | |
| if not text or len(text.strip()) < 50: | |
| return 0 | |
| toks = estimate_tokens(text) | |
| self._file.write(json.dumps(record, ensure_ascii=False) + "\n") | |
| self.tokens_written += toks | |
| self.records_written += 1 | |
| return toks | |
| def close(self): | |
| self._file.flush() | |
| self._file.close() | |
| def __enter__(self): return self | |
| def __exit__(self, *_): self.close() | |
| def download_general_web(writer): | |
| print("\n[1/5] general_web.jsonl β FineWeb") | |
| bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written, | |
| unit="tok", unit_scale=True, desc=" web tokens") | |
| ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", | |
| split="train", streaming=True) | |
| for sample in ds: | |
| if writer.done: break | |
| bar.update(writer.write({"text": sample["text"], "source": "fineweb", | |
| "url": sample.get("url", ""), "language": "en"})) | |
| bar.close() | |
| print(f" β {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records") | |
| def download_code(writer): | |
| print("\n[2/5] code.jsonl β The Stack v2") | |
| LANGUAGES = [("python","Python"),("javascript","JavaScript"),("typescript","TypeScript"), | |
| ("rust","Rust"),("go","Go"),("cpp","C++"),("java","Java"), | |
| ("bash","Bash"),("sql","SQL"),("html","HTML")] | |
| bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written, | |
| unit="tok", unit_scale=True, desc=" code tokens") | |
| tokens_per_lang = writer.target_tokens // len(LANGUAGES) | |
| for lang_id, lang_name in LANGUAGES: | |
| if writer.done: break | |
| lang_tokens = 0 | |
| print(f" β {lang_name}...") | |
| try: | |
| ds = load_dataset("bigcode/the-stack-v2-train-smol-ids", | |
| data_dir=f"data/{lang_id}", split="train", | |
| streaming=True, trust_remote_code=True) | |
| for sample in ds: | |
| if writer.done or lang_tokens >= tokens_per_lang: break | |
| content = sample.get("content", "") or sample.get("text", "") | |
| if not content: continue | |
| t = writer.write({"text": content, "source": "the_stack_v2", | |
| "language": lang_id}) | |
| bar.update(t); lang_tokens += t | |
| except Exception as e: | |
| print(f" [warn] {lang_name} failed ({e}), skipping.") | |
| bar.close() | |
| print(f" β {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records") | |
| def download_math(writer): | |
| print("\n[3/5] math_science.jsonl β OpenWebMath") | |
| bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written, | |
| unit="tok", unit_scale=True, desc=" math tokens") | |
| ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True) | |
| for sample in ds: | |
| if writer.done: break | |
| bar.update(writer.write({"text": sample["text"], "source": "open_web_math", | |
| "url": sample.get("url", "")})) | |
| bar.close() | |
| print(f" β {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records") | |
| def download_multilingual(writer): | |
| print("\n[4/5] multilingual.jsonl β Wikipedia (20 languages)") | |
| LANGUAGES = [("en","English"),("es","Spanish"),("fr","French"),("de","German"), | |
| ("zh","Chinese"),("ja","Japanese"),("pt","Portuguese"),("ar","Arabic"), | |
| ("ru","Russian"),("hi","Hindi"),("it","Italian"),("ko","Korean"), | |
| ("nl","Dutch"),("pl","Polish"),("sv","Swedish"),("tr","Turkish"), | |
| ("vi","Vietnamese"),("id","Indonesian"),("uk","Ukrainian"),("fa","Persian")] | |
| bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written, | |
| unit="tok", unit_scale=True, desc=" multilingual tokens") | |
| tokens_per_lang = writer.target_tokens // len(LANGUAGES) | |
| for lang_code, lang_name in LANGUAGES: | |
| if writer.done: break | |
| lang_tokens = 0 | |
| try: | |
| ds = load_dataset("wikimedia/wikipedia", f"20231101.{lang_code}", | |
| split="train", streaming=True, trust_remote_code=True) | |
| for sample in ds: | |
| if writer.done or lang_tokens >= tokens_per_lang: break | |
| text = sample.get("text", "") | |
| if not text: continue | |
| t = writer.write({"text": text, "source": "wikipedia", | |
| "language": lang_code, "title": sample.get("title","")}) | |
| bar.update(t); lang_tokens += t | |
| except Exception as e: | |
| print(f"\n [warn] Wikipedia {lang_name} failed: {e}") | |
| bar.close() | |
| print(f" β {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records") | |
| def download_synthetic(writer): | |
| print("\n[5/5] synthetic.jsonl β OpenHermes 2.5") | |
| bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written, | |
| unit="tok", unit_scale=True, desc=" synthetic tokens") | |
| ds = load_dataset("teknium/OpenHermes-2.5", split="train", streaming=True) | |
| rounds = 0 | |
| while not writer.done and rounds < 10: | |
| for sample in ds: | |
| if writer.done: break | |
| convs = sample.get("conversations", []) | |
| parts = [] | |
| for turn in convs: | |
| role, value = turn.get("from",""), turn.get("value","") | |
| if role == "human": parts.append(f"### Instruction\n{value}") | |
| elif role == "gpt": parts.append(f"### Response\n{value}") | |
| text = "\n\n".join(parts) or sample.get("text","") | |
| if not text: continue | |
| bar.update(writer.write({"text": text, "source": "openhermes_2.5", | |
| "task": "instruction_following"})) | |
| rounds += 1 | |
| bar.close() | |
| print(f" β {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records") | |
| TARGETS = { | |
| "general_web.jsonl": 2_500_000_000, | |
| "code.jsonl": 1_000_000_000, | |
| "math_science.jsonl": 500_000_000, | |
| "multilingual.jsonl": 500_000_000, | |
| "synthetic.jsonl": 500_000_000, | |
| } | |
| DOWNLOADERS = { | |
| "general_web.jsonl": download_general_web, | |
| "code.jsonl": download_code, | |
| "math_science.jsonl": download_math, | |
| "multilingual.jsonl": download_multilingual, | |
| "synthetic.jsonl": download_synthetic, | |
| } | |
| def main(): | |
| parser = argparse.ArgumentParser(description="Download ~5B tokens for SAGE training.") | |
| parser.add_argument("--output-dir", default="data/raw") | |
| parser.add_argument("--resume", action="store_true") | |
| parser.add_argument("--only", nargs="+", choices=list(TARGETS.keys())) | |
| parser.add_argument("--scale", type=float, default=1.0) | |
| args = parser.parse_args() | |
| out_dir = Path(args.output_dir) | |
| out_dir.mkdir(parents=True, exist_ok=True) | |
| files_to_run = args.only or list(TARGETS.keys()) | |
| total_target = sum(int(TARGETS[f] * args.scale) for f in files_to_run) | |
| print("=" * 60) | |
| print(" SAGE β 5 Billion Token Downloader") | |
| print("=" * 60) | |
| print(f" Output dir : {out_dir.resolve()}") | |
| print(f" Resume : {args.resume}") | |
| print(f" Scale : {args.scale}x") | |
| print(f" Target : {human_tokens(total_target)} tokens") | |
| print(f" Est. disk : ~{total_target // 40_000_000} GB") | |
| print("=" * 60) | |
| grand_start = time.time() | |
| grand_tokens = 0 | |
| for filename in files_to_run: | |
| target = int(TARGETS[filename] * args.scale) | |
| with JSONLWriter(out_dir / filename, target, resume=args.resume) as writer: | |
| if writer.done: | |
| print(f"\n[skip] {filename} already complete ({human_tokens(writer.tokens_written)} tokens)") | |
| grand_tokens += writer.tokens_written | |
| continue | |
| t0 = time.time() | |
| DOWNLOADERS[filename](writer) | |
| elapsed = time.time() - t0 | |
| grand_tokens += writer.tokens_written | |
| size = (out_dir / filename).stat().st_size | |
| print(f" Time: {elapsed/60:.1f} min | Size: {human_bytes(size)}") | |
| elapsed_total = time.time() - grand_start | |
| print("\n" + "=" * 60) | |
| print(f" DONE β {human_tokens(grand_tokens)} tokens downloaded") | |
| print(f" Total time: {elapsed_total/3600:.2f} hours") | |
| print(f" Files: {out_dir.resolve()}/") | |
| print("=" * 60) | |
| if __name__ == "__main__": | |
| main() | |
| ``` |