sage / docs /Datasets.md
sage002's picture
SAGE model repository : Updating some model checkpoints
3d2114e verified
# SAGE β€” 5 Billion Token Dataset Downloader
Automatically downloads ~5B tokens from free, public Hugging Face datasets and saves them as JSONL files in your `data/raw/` directory, fully compatible with the SAGE training pipeline.
---
## Token Budget
| File | Source | Tokens |
|---|---|---|
| `general_web.jsonl` | FineWeb | 2.5B |
| `code.jsonl` | The Stack v2 (Python, JS, Rust, Go, C++ and more) | 1.0B |
| `math_science.jsonl` | OpenWebMath | 0.5B |
| `multilingual.jsonl` | Wikipedia (20+ languages) | 0.5B |
| `synthetic.jsonl` | OpenHermes 2.5 (instruction data) | 0.5B |
| **Total** | | **~5.0B tokens** |
**Estimated disk space:** ~20–25 GB
**Estimated download time:** 2–8 hours depending on your connection
**Cost:** 100% free, no account required
---
## Requirements
### System
- Python 3.9+
- 25 GB free disk space
- Stable internet connection
### Python packages
```bash
pip install datasets huggingface_hub tqdm
```
---
## Usage
### Basic β€” download everything
```bash
python debug/download_5b_tokens.py --output-dir data/raw
```
### Test run β€” 1% of data to verify everything works
```bash
python debug/download_5b_tokens.py --output-dir data/raw --scale 0.01
```
### Resume β€” continue after an internet cutout
```bash
python debug/download_5b_tokens.py --output-dir data/raw --resume
```
### Download only one specific file
```bash
python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl
```
### Download multiple specific files
```bash
python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl math_science.jsonl
```
---
## All Flags
| Flag | Default | Description |
|---|---|---|
| `--output-dir` | `data/raw` | Directory where JSONL files are saved |
| `--resume` | off | Skip files that already hit their token target |
| `--only` | all files | Download only the specified file(s) |
| `--scale` | `1.0` | Scale all token targets (e.g. `0.1` = 10% of 5B = 500M tokens) |
---
## Output Format
Every record written to disk follows this structure with at minimum a `text` field, making it directly compatible with the SAGE pipeline:
```json
{ "text": "your training sample here", "source": "fineweb", "language": "en" }
```
---
## Data Sources
### 1. FineWeb β€” `general_web.jsonl`
- **Dataset:** `HuggingFaceFW/fineweb` (sample-10BT subset)
- **What it is:** A pre-shuffled, deduplicated 10B-token slice of web-crawl text, one of the cleanest freely available web datasets
- **Why it's used:** Broad general language coverage, essential for fluent text generation
### 2. The Stack v2 β€” `code.jsonl`
- **Dataset:** `bigcode/the-stack-v2-train-smol-ids`
- **What it is:** Source code across 10 programming languages: Python, JavaScript, TypeScript, Rust, Go, C++, Java, Bash, SQL, and HTML
- **Why it's used:** Teaches the model programming syntax, logic, and structure
### 3. OpenWebMath β€” `math_science.jsonl`
- **Dataset:** `open-web-math/open-web-math`
- **What it is:** 14.7B tokens of mathematical content extracted from the web, including LaTeX, proofs, and problem sets
- **Why it's used:** Improves numerical reasoning and scientific language understanding
### 4. Wikipedia β€” `multilingual.jsonl`
- **Dataset:** `wikimedia/wikipedia` (20231101 dumps)
- **Languages:** English, Spanish, French, German, Chinese, Japanese, Portuguese, Arabic, Russian, Hindi, Italian, Korean, Dutch, Polish, Swedish, Turkish, Vietnamese, Indonesian, Ukrainian, Persian
- **Why it's used:** Clean, factual, encyclopedic text across 20 languages
### 5. OpenHermes 2.5 β€” `synthetic.jsonl`
- **Dataset:** `teknium/OpenHermes-2.5`
- **What it is:** ~1M high-quality instruction-following pairs formatted as `### Instruction` / `### Response` conversations
- **Why it's used:** Teaches the model to follow instructions and produce structured, helpful responses
---
## What Happens After Download
Once all files are ready, continue with the standard SAGE pipeline:
### Train the tokenizer
```bash
python -m tokenizer.train_tokenizer \
--input data/raw/general_web.jsonl \
data/raw/code.jsonl \
data/raw/math_science.jsonl \
data/raw/multilingual.jsonl \
data/raw/synthetic.jsonl \
--model-prefix tokenizer/tokenizer \
--vocab-size 32000
```
### Build parquet shards
```bash
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 128
```
### Start training
```bash
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet \
--validation-shards data/processed/shard-00001.parquet \
--output-dir runs/sage-1b
```
---
## Troubleshooting
**Download stalls or disconnects**
Run with `--resume` to pick up exactly where you left off. The writer appends to existing files and counts already-written tokens before continuing.
**A specific language or dataset fails**
The downloader catches errors per-source and logs a warning, then moves on. The other files are unaffected. Re-run with `--only <filename> --resume` to retry just that file.
**Running out of disk space mid-download**
Use `--scale 0.5` to target 2.5B tokens total (~10–12 GB) instead of the full 5B. The model will be slightly less capable but the pipeline will still work end to end.
**Slow download speed**
All datasets are streamed β€” data is downloaded and written record by record, so you never need to load the entire dataset at once. If speed is consistently low, try running overnight or on a cloud VM closer to Hugging Face's CDN.
---
## Full Script
```python
"""
SAGE β€” 5 Billion Token Dataset Downloader
==========================================
Downloads ~5B tokens from free Hugging Face datasets and saves them
as JSONL files in your data/raw/ directory, ready for the SAGE pipeline.
Token budget breakdown:
general_web.jsonl β†’ 2.5B tokens (FineWeb)
code.jsonl β†’ 1.0B tokens (The Stack v2 - Python, JS, Rust, Go, C++)
math_science.jsonl β†’ 0.5B tokens (OpenWebMath)
multilingual.jsonl β†’ 0.5B tokens (Wikipedia 20+ languages)
synthetic.jsonl β†’ 0.5B tokens (OpenHermes instruction data)
─────────────────────────────────────
TOTAL β†’ ~5.0B tokens
Usage:
pip install datasets huggingface_hub tqdm
python debug/download_5b_tokens.py --output-dir data/raw
python debug/download_5b_tokens.py --output-dir data/raw --resume
"""
import argparse
import json
import sys
import time
from pathlib import Path
missing = []
try:
from datasets import load_dataset
except ImportError:
missing.append("datasets")
try:
from tqdm import tqdm
except ImportError:
missing.append("tqdm")
if missing:
print(f"[ERROR] Missing packages: {', '.join(missing)}")
print(f" Run: pip install {' '.join(missing)}")
sys.exit(1)
def estimate_tokens(text: str) -> int:
return max(1, len(text) // 4)
def human_tokens(n: int) -> str:
if n >= 1_000_000_000:
return f"{n/1_000_000_000:.2f}B"
if n >= 1_000_000:
return f"{n/1_000_000:.1f}M"
return f"{n:,}"
def human_bytes(n: int) -> str:
for unit in ["B", "KB", "MB", "GB"]:
if n < 1024:
return f"{n:.1f} {unit}"
n /= 1024
return f"{n:.1f} TB"
class JSONLWriter:
def __init__(self, path: Path, target_tokens: int, resume: bool = False):
self.path = path
self.target_tokens = target_tokens
self.tokens_written = 0
self.records_written = 0
if resume and path.exists():
print(f" [resume] Counting existing tokens in {path.name}...")
with open(path, "r", encoding="utf-8") as f:
for line in f:
try:
rec = json.loads(line)
self.tokens_written += estimate_tokens(rec.get("text", ""))
self.records_written += 1
except json.JSONDecodeError:
pass
print(f" [resume] Already have {human_tokens(self.tokens_written)} / {human_tokens(target_tokens)}")
self._file = open(path, "a", encoding="utf-8", buffering=1024 * 1024)
else:
path.parent.mkdir(parents=True, exist_ok=True)
self._file = open(path, "w", encoding="utf-8", buffering=1024 * 1024)
@property
def done(self) -> bool:
return self.tokens_written >= self.target_tokens
def write(self, record: dict) -> int:
text = record.get("text", "")
if not text or len(text.strip()) < 50:
return 0
toks = estimate_tokens(text)
self._file.write(json.dumps(record, ensure_ascii=False) + "\n")
self.tokens_written += toks
self.records_written += 1
return toks
def close(self):
self._file.flush()
self._file.close()
def __enter__(self): return self
def __exit__(self, *_): self.close()
def download_general_web(writer):
print("\n[1/5] general_web.jsonl β€” FineWeb")
bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
unit="tok", unit_scale=True, desc=" web tokens")
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
split="train", streaming=True)
for sample in ds:
if writer.done: break
bar.update(writer.write({"text": sample["text"], "source": "fineweb",
"url": sample.get("url", ""), "language": "en"}))
bar.close()
print(f" βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")
def download_code(writer):
print("\n[2/5] code.jsonl β€” The Stack v2")
LANGUAGES = [("python","Python"),("javascript","JavaScript"),("typescript","TypeScript"),
("rust","Rust"),("go","Go"),("cpp","C++"),("java","Java"),
("bash","Bash"),("sql","SQL"),("html","HTML")]
bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
unit="tok", unit_scale=True, desc=" code tokens")
tokens_per_lang = writer.target_tokens // len(LANGUAGES)
for lang_id, lang_name in LANGUAGES:
if writer.done: break
lang_tokens = 0
print(f" β†’ {lang_name}...")
try:
ds = load_dataset("bigcode/the-stack-v2-train-smol-ids",
data_dir=f"data/{lang_id}", split="train",
streaming=True, trust_remote_code=True)
for sample in ds:
if writer.done or lang_tokens >= tokens_per_lang: break
content = sample.get("content", "") or sample.get("text", "")
if not content: continue
t = writer.write({"text": content, "source": "the_stack_v2",
"language": lang_id})
bar.update(t); lang_tokens += t
except Exception as e:
print(f" [warn] {lang_name} failed ({e}), skipping.")
bar.close()
print(f" βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")
def download_math(writer):
print("\n[3/5] math_science.jsonl β€” OpenWebMath")
bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
unit="tok", unit_scale=True, desc=" math tokens")
ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
for sample in ds:
if writer.done: break
bar.update(writer.write({"text": sample["text"], "source": "open_web_math",
"url": sample.get("url", "")}))
bar.close()
print(f" βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")
def download_multilingual(writer):
print("\n[4/5] multilingual.jsonl β€” Wikipedia (20 languages)")
LANGUAGES = [("en","English"),("es","Spanish"),("fr","French"),("de","German"),
("zh","Chinese"),("ja","Japanese"),("pt","Portuguese"),("ar","Arabic"),
("ru","Russian"),("hi","Hindi"),("it","Italian"),("ko","Korean"),
("nl","Dutch"),("pl","Polish"),("sv","Swedish"),("tr","Turkish"),
("vi","Vietnamese"),("id","Indonesian"),("uk","Ukrainian"),("fa","Persian")]
bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
unit="tok", unit_scale=True, desc=" multilingual tokens")
tokens_per_lang = writer.target_tokens // len(LANGUAGES)
for lang_code, lang_name in LANGUAGES:
if writer.done: break
lang_tokens = 0
try:
ds = load_dataset("wikimedia/wikipedia", f"20231101.{lang_code}",
split="train", streaming=True, trust_remote_code=True)
for sample in ds:
if writer.done or lang_tokens >= tokens_per_lang: break
text = sample.get("text", "")
if not text: continue
t = writer.write({"text": text, "source": "wikipedia",
"language": lang_code, "title": sample.get("title","")})
bar.update(t); lang_tokens += t
except Exception as e:
print(f"\n [warn] Wikipedia {lang_name} failed: {e}")
bar.close()
print(f" βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")
def download_synthetic(writer):
print("\n[5/5] synthetic.jsonl β€” OpenHermes 2.5")
bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
unit="tok", unit_scale=True, desc=" synthetic tokens")
ds = load_dataset("teknium/OpenHermes-2.5", split="train", streaming=True)
rounds = 0
while not writer.done and rounds < 10:
for sample in ds:
if writer.done: break
convs = sample.get("conversations", [])
parts = []
for turn in convs:
role, value = turn.get("from",""), turn.get("value","")
if role == "human": parts.append(f"### Instruction\n{value}")
elif role == "gpt": parts.append(f"### Response\n{value}")
text = "\n\n".join(parts) or sample.get("text","")
if not text: continue
bar.update(writer.write({"text": text, "source": "openhermes_2.5",
"task": "instruction_following"}))
rounds += 1
bar.close()
print(f" βœ“ {human_tokens(writer.tokens_written)} tokens | {writer.records_written:,} records")
TARGETS = {
"general_web.jsonl": 2_500_000_000,
"code.jsonl": 1_000_000_000,
"math_science.jsonl": 500_000_000,
"multilingual.jsonl": 500_000_000,
"synthetic.jsonl": 500_000_000,
}
DOWNLOADERS = {
"general_web.jsonl": download_general_web,
"code.jsonl": download_code,
"math_science.jsonl": download_math,
"multilingual.jsonl": download_multilingual,
"synthetic.jsonl": download_synthetic,
}
def main():
parser = argparse.ArgumentParser(description="Download ~5B tokens for SAGE training.")
parser.add_argument("--output-dir", default="data/raw")
parser.add_argument("--resume", action="store_true")
parser.add_argument("--only", nargs="+", choices=list(TARGETS.keys()))
parser.add_argument("--scale", type=float, default=1.0)
args = parser.parse_args()
out_dir = Path(args.output_dir)
out_dir.mkdir(parents=True, exist_ok=True)
files_to_run = args.only or list(TARGETS.keys())
total_target = sum(int(TARGETS[f] * args.scale) for f in files_to_run)
print("=" * 60)
print(" SAGE β€” 5 Billion Token Downloader")
print("=" * 60)
print(f" Output dir : {out_dir.resolve()}")
print(f" Resume : {args.resume}")
print(f" Scale : {args.scale}x")
print(f" Target : {human_tokens(total_target)} tokens")
print(f" Est. disk : ~{total_target // 40_000_000} GB")
print("=" * 60)
grand_start = time.time()
grand_tokens = 0
for filename in files_to_run:
target = int(TARGETS[filename] * args.scale)
with JSONLWriter(out_dir / filename, target, resume=args.resume) as writer:
if writer.done:
print(f"\n[skip] {filename} already complete ({human_tokens(writer.tokens_written)} tokens)")
grand_tokens += writer.tokens_written
continue
t0 = time.time()
DOWNLOADERS[filename](writer)
elapsed = time.time() - t0
grand_tokens += writer.tokens_written
size = (out_dir / filename).stat().st_size
print(f" Time: {elapsed/60:.1f} min | Size: {human_bytes(size)}")
elapsed_total = time.time() - grand_start
print("\n" + "=" * 60)
print(f" DONE β€” {human_tokens(grand_tokens)} tokens downloaded")
print(f" Total time: {elapsed_total/3600:.2f} hours")
print(f" Files: {out_dir.resolve()}/")
print("=" * 60)
if __name__ == "__main__":
main()
```