sage / docs /Datasets.md

SAGE model repository : Updating some model checkpoints

3d2114e verified about 1 month ago

17.7 kB

	# SAGE — 5 Billion Token Dataset Downloader

	Automatically downloads ~5B tokens from free, public Hugging Face datasets and saves them as JSONL files in your `data/raw/` directory, fully compatible with the SAGE training pipeline.

	---

	## Token Budget

	\| File \| Source \| Tokens \|
	\|---\|---\|---\|
	\| `general_web.jsonl` \| FineWeb \| 2.5B \|
	\| `code.jsonl` \| The Stack v2 (Python, JS, Rust, Go, C++ and more) \| 1.0B \|
	\| `math_science.jsonl` \| OpenWebMath \| 0.5B \|
	\| `multilingual.jsonl` \| Wikipedia (20+ languages) \| 0.5B \|
	\| `synthetic.jsonl` \| OpenHermes 2.5 (instruction data) \| 0.5B \|
	\| Total \| \| ~5.0B tokens \|

	Estimated disk space: ~20–25 GB
	Estimated download time: 2–8 hours depending on your connection
	Cost: 100% free, no account required

	---

	## Requirements

	### System
	- Python 3.9+
	- 25 GB free disk space
	- Stable internet connection

	### Python packages

	```bash
	pip install datasets huggingface_hub tqdm
	```

	---

	## Usage

	### Basic — download everything

	```bash
	python debug/download_5b_tokens.py --output-dir data/raw
	```

	### Test run — 1% of data to verify everything works

	```bash
	python debug/download_5b_tokens.py --output-dir data/raw --scale 0.01
	```

	### Resume — continue after an internet cutout

	```bash
	python debug/download_5b_tokens.py --output-dir data/raw --resume
	```

	### Download only one specific file

	```bash
	python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl
	```

	### Download multiple specific files

	```bash
	python debug/download_5b_tokens.py --output-dir data/raw --only code.jsonl math_science.jsonl
	```

	---

	## All Flags

	\| Flag \| Default \| Description \|
	\|---\|---\|---\|
	\| `--output-dir` \| `data/raw` \| Directory where JSONL files are saved \|
	\| `--resume` \| off \| Skip files that already hit their token target \|
	\| `--only` \| all files \| Download only the specified file(s) \|
	\| `--scale` \| `1.0` \| Scale all token targets (e.g. `0.1` = 10% of 5B = 500M tokens) \|

	---

	## Output Format

	Every record written to disk follows this structure with at minimum a `text` field, making it directly compatible with the SAGE pipeline:

	```json
	{ "text": "your training sample here", "source": "fineweb", "language": "en" }
	```

	---

	## Data Sources

	### 1. FineWeb — `general_web.jsonl`
	- Dataset: `HuggingFaceFW/fineweb` (sample-10BT subset)
	- What it is: A pre-shuffled, deduplicated 10B-token slice of web-crawl text, one of the cleanest freely available web datasets
	- Why it's used: Broad general language coverage, essential for fluent text generation

	### 2. The Stack v2 — `code.jsonl`
	- Dataset: `bigcode/the-stack-v2-train-smol-ids`
	- What it is: Source code across 10 programming languages: Python, JavaScript, TypeScript, Rust, Go, C++, Java, Bash, SQL, and HTML
	- Why it's used: Teaches the model programming syntax, logic, and structure

	### 3. OpenWebMath — `math_science.jsonl`
	- Dataset: `open-web-math/open-web-math`
	- What it is: 14.7B tokens of mathematical content extracted from the web, including LaTeX, proofs, and problem sets
	- Why it's used: Improves numerical reasoning and scientific language understanding

	### 4. Wikipedia — `multilingual.jsonl`
	- Dataset: `wikimedia/wikipedia` (20231101 dumps)
	- Languages: English, Spanish, French, German, Chinese, Japanese, Portuguese, Arabic, Russian, Hindi, Italian, Korean, Dutch, Polish, Swedish, Turkish, Vietnamese, Indonesian, Ukrainian, Persian
	- Why it's used: Clean, factual, encyclopedic text across 20 languages

	### 5. OpenHermes 2.5 — `synthetic.jsonl`
	- Dataset: `teknium/OpenHermes-2.5`
	- What it is: ~1M high-quality instruction-following pairs formatted as `### Instruction` / `### Response` conversations
	- Why it's used: Teaches the model to follow instructions and produce structured, helpful responses

	---

	## What Happens After Download

	Once all files are ready, continue with the standard SAGE pipeline:

	### Train the tokenizer

	```bash
	python -m tokenizer.train_tokenizer \
	--input data/raw/general_web.jsonl \
	data/raw/code.jsonl \
	data/raw/math_science.jsonl \
	data/raw/multilingual.jsonl \
	data/raw/synthetic.jsonl \
	--model-prefix tokenizer/tokenizer \
	--vocab-size 32000
	```

	### Build parquet shards

	```bash
	python -m data.pipeline \
	--tokenizer-model tokenizer/tokenizer.model \
	--output-dir data/processed \
	--shard-size 128
	```

	### Start training

	```bash
	python -m train.trainer \
	--model-config configs/model/1b.yaml \
	--schedule-config configs/train/schedule.yaml \
	--train-shards data/processed/shard-00000.parquet \
	--validation-shards data/processed/shard-00001.parquet \
	--output-dir runs/sage-1b
	```

	---

	## Troubleshooting

	Download stalls or disconnects
	Run with `--resume` to pick up exactly where you left off. The writer appends to existing files and counts already-written tokens before continuing.

	A specific language or dataset fails
	The downloader catches errors per-source and logs a warning, then moves on. The other files are unaffected. Re-run with `--only <filename> --resume` to retry just that file.

	Running out of disk space mid-download
	Use `--scale 0.5` to target 2.5B tokens total (~10–12 GB) instead of the full 5B. The model will be slightly less capable but the pipeline will still work end to end.

	Slow download speed
	All datasets are streamed — data is downloaded and written record by record, so you never need to load the entire dataset at once. If speed is consistently low, try running overnight or on a cloud VM closer to Hugging Face's CDN.

	---

	## Full Script

	```python
	"""
	SAGE — 5 Billion Token Dataset Downloader
	==========================================
	Downloads ~5B tokens from free Hugging Face datasets and saves them
	as JSONL files in your data/raw/ directory, ready for the SAGE pipeline.

	Token budget breakdown:
	general_web.jsonl → 2.5B tokens (FineWeb)
	code.jsonl → 1.0B tokens (The Stack v2 - Python, JS, Rust, Go, C++)
	math_science.jsonl → 0.5B tokens (OpenWebMath)
	multilingual.jsonl → 0.5B tokens (Wikipedia 20+ languages)
	synthetic.jsonl → 0.5B tokens (OpenHermes instruction data)
	─────────────────────────────────────
	TOTAL → ~5.0B tokens

	Usage:
	pip install datasets huggingface_hub tqdm
	python debug/download_5b_tokens.py --output-dir data/raw
	python debug/download_5b_tokens.py --output-dir data/raw --resume
	"""

	import argparse
	import json
	import sys
	import time
	from pathlib import Path

	missing = []
	try:
	from datasets import load_dataset
	except ImportError:
	missing.append("datasets")
	try:
	from tqdm import tqdm
	except ImportError:
	missing.append("tqdm")

	if missing:
	print(f"[ERROR] Missing packages: {', '.join(missing)}")
	print(f" Run: pip install {' '.join(missing)}")
	sys.exit(1)


	def estimate_tokens(text: str) -> int:
	return max(1, len(text) // 4)

	def human_tokens(n: int) -> str:
	if n >= 1_000_000_000:
	return f"{n/1_000_000_000:.2f}B"
	if n >= 1_000_000:
	return f"{n/1_000_000:.1f}M"
	return f"{n:,}"

	def human_bytes(n: int) -> str:
	for unit in ["B", "KB", "MB", "GB"]:
	if n < 1024:
	return f"{n:.1f} {unit}"
	n /= 1024
	return f"{n:.1f} TB"


	class JSONLWriter:
	def __init__(self, path: Path, target_tokens: int, resume: bool = False):
	self.path = path
	self.target_tokens = target_tokens
	self.tokens_written = 0
	self.records_written = 0

	if resume and path.exists():
	print(f" [resume] Counting existing tokens in {path.name}...")
	with open(path, "r", encoding="utf-8") as f:
	for line in f:
	try:
	rec = json.loads(line)
	self.tokens_written += estimate_tokens(rec.get("text", ""))
	self.records_written += 1
	except json.JSONDecodeError:
	pass
	print(f" [resume] Already have {human_tokens(self.tokens_written)} / {human_tokens(target_tokens)}")
	self._file = open(path, "a", encoding="utf-8", buffering=1024 * 1024)
	else:
	path.parent.mkdir(parents=True, exist_ok=True)
	self._file = open(path, "w", encoding="utf-8", buffering=1024 * 1024)

	@property
	def done(self) -> bool:
	return self.tokens_written >= self.target_tokens

	def write(self, record: dict) -> int:
	text = record.get("text", "")
	if not text or len(text.strip()) < 50:
	return 0
	toks = estimate_tokens(text)
	self._file.write(json.dumps(record, ensure_ascii=False) + "\n")
	self.tokens_written += toks
	self.records_written += 1
	return toks

	def close(self):
	self._file.flush()
	self._file.close()

	def __enter__(self): return self
	def __exit__(self, *_): self.close()


	def download_general_web(writer):
	print("\n[1/5] general_web.jsonl — FineWeb")
	bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
	unit="tok", unit_scale=True, desc=" web tokens")
	ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
	split="train", streaming=True)
	for sample in ds:
	if writer.done: break
	bar.update(writer.write({"text": sample["text"], "source": "fineweb",
	"url": sample.get("url", ""), "language": "en"}))
	bar.close()
	print(f" ✓ {human_tokens(writer.tokens_written)} tokens \| {writer.records_written:,} records")


	def download_code(writer):
	print("\n[2/5] code.jsonl — The Stack v2")
	LANGUAGES = [("python","Python"),("javascript","JavaScript"),("typescript","TypeScript"),
	("rust","Rust"),("go","Go"),("cpp","C++"),("java","Java"),
	("bash","Bash"),("sql","SQL"),("html","HTML")]
	bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
	unit="tok", unit_scale=True, desc=" code tokens")
	tokens_per_lang = writer.target_tokens // len(LANGUAGES)
	for lang_id, lang_name in LANGUAGES:
	if writer.done: break
	lang_tokens = 0
	print(f" → {lang_name}...")
	try:
	ds = load_dataset("bigcode/the-stack-v2-train-smol-ids",
	data_dir=f"data/{lang_id}", split="train",
	streaming=True, trust_remote_code=True)
	for sample in ds:
	if writer.done or lang_tokens >= tokens_per_lang: break
	content = sample.get("content", "") or sample.get("text", "")
	if not content: continue
	t = writer.write({"text": content, "source": "the_stack_v2",
	"language": lang_id})
	bar.update(t); lang_tokens += t
	except Exception as e:
	print(f" [warn] {lang_name} failed ({e}), skipping.")
	bar.close()
	print(f" ✓ {human_tokens(writer.tokens_written)} tokens \| {writer.records_written:,} records")


	def download_math(writer):
	print("\n[3/5] math_science.jsonl — OpenWebMath")
	bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
	unit="tok", unit_scale=True, desc=" math tokens")
	ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
	for sample in ds:
	if writer.done: break
	bar.update(writer.write({"text": sample["text"], "source": "open_web_math",
	"url": sample.get("url", "")}))
	bar.close()
	print(f" ✓ {human_tokens(writer.tokens_written)} tokens \| {writer.records_written:,} records")


	def download_multilingual(writer):
	print("\n[4/5] multilingual.jsonl — Wikipedia (20 languages)")
	LANGUAGES = [("en","English"),("es","Spanish"),("fr","French"),("de","German"),
	("zh","Chinese"),("ja","Japanese"),("pt","Portuguese"),("ar","Arabic"),
	("ru","Russian"),("hi","Hindi"),("it","Italian"),("ko","Korean"),
	("nl","Dutch"),("pl","Polish"),("sv","Swedish"),("tr","Turkish"),
	("vi","Vietnamese"),("id","Indonesian"),("uk","Ukrainian"),("fa","Persian")]
	bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
	unit="tok", unit_scale=True, desc=" multilingual tokens")
	tokens_per_lang = writer.target_tokens // len(LANGUAGES)
	for lang_code, lang_name in LANGUAGES:
	if writer.done: break
	lang_tokens = 0
	try:
	ds = load_dataset("wikimedia/wikipedia", f"20231101.{lang_code}",
	split="train", streaming=True, trust_remote_code=True)
	for sample in ds:
	if writer.done or lang_tokens >= tokens_per_lang: break
	text = sample.get("text", "")
	if not text: continue
	t = writer.write({"text": text, "source": "wikipedia",
	"language": lang_code, "title": sample.get("title","")})
	bar.update(t); lang_tokens += t
	except Exception as e:
	print(f"\n [warn] Wikipedia {lang_name} failed: {e}")
	bar.close()
	print(f" ✓ {human_tokens(writer.tokens_written)} tokens \| {writer.records_written:,} records")


	def download_synthetic(writer):
	print("\n[5/5] synthetic.jsonl — OpenHermes 2.5")
	bar = tqdm(total=writer.target_tokens, initial=writer.tokens_written,
	unit="tok", unit_scale=True, desc=" synthetic tokens")
	ds = load_dataset("teknium/OpenHermes-2.5", split="train", streaming=True)
	rounds = 0
	while not writer.done and rounds < 10:
	for sample in ds:
	if writer.done: break
	convs = sample.get("conversations", [])
	parts = []
	for turn in convs:
	role, value = turn.get("from",""), turn.get("value","")
	if role == "human": parts.append(f"### Instruction\n{value}")
	elif role == "gpt": parts.append(f"### Response\n{value}")
	text = "\n\n".join(parts) or sample.get("text","")
	if not text: continue
	bar.update(writer.write({"text": text, "source": "openhermes_2.5",
	"task": "instruction_following"}))
	rounds += 1
	bar.close()
	print(f" ✓ {human_tokens(writer.tokens_written)} tokens \| {writer.records_written:,} records")


	TARGETS = {
	"general_web.jsonl": 2_500_000_000,
	"code.jsonl": 1_000_000_000,
	"math_science.jsonl": 500_000_000,
	"multilingual.jsonl": 500_000_000,
	"synthetic.jsonl": 500_000_000,
	}
	DOWNLOADERS = {
	"general_web.jsonl": download_general_web,
	"code.jsonl": download_code,
	"math_science.jsonl": download_math,
	"multilingual.jsonl": download_multilingual,
	"synthetic.jsonl": download_synthetic,
	}


	def main():
	parser = argparse.ArgumentParser(description="Download ~5B tokens for SAGE training.")
	parser.add_argument("--output-dir", default="data/raw")
	parser.add_argument("--resume", action="store_true")
	parser.add_argument("--only", nargs="+", choices=list(TARGETS.keys()))
	parser.add_argument("--scale", type=float, default=1.0)
	args = parser.parse_args()

	out_dir = Path(args.output_dir)
	out_dir.mkdir(parents=True, exist_ok=True)
	files_to_run = args.only or list(TARGETS.keys())
	total_target = sum(int(TARGETS[f] * args.scale) for f in files_to_run)

	print("=" * 60)
	print(" SAGE — 5 Billion Token Downloader")
	print("=" * 60)
	print(f" Output dir : {out_dir.resolve()}")
	print(f" Resume : {args.resume}")
	print(f" Scale : {args.scale}x")
	print(f" Target : {human_tokens(total_target)} tokens")
	print(f" Est. disk : ~{total_target // 40_000_000} GB")
	print("=" * 60)

	grand_start = time.time()
	grand_tokens = 0

	for filename in files_to_run:
	target = int(TARGETS[filename] * args.scale)
	with JSONLWriter(out_dir / filename, target, resume=args.resume) as writer:
	if writer.done:
	print(f"\n[skip] {filename} already complete ({human_tokens(writer.tokens_written)} tokens)")
	grand_tokens += writer.tokens_written
	continue
	t0 = time.time()
	DOWNLOADERS[filename](writer)
	elapsed = time.time() - t0
	grand_tokens += writer.tokens_written
	size = (out_dir / filename).stat().st_size
	print(f" Time: {elapsed/60:.1f} min \| Size: {human_bytes(size)}")

	elapsed_total = time.time() - grand_start
	print("\n" + "=" * 60)
	print(f" DONE — {human_tokens(grand_tokens)} tokens downloaded")
	print(f" Total time: {elapsed_total/3600:.2f} hours")
	print(f" Files: {out_dir.resolve()}/")
	print("=" * 60)


	if __name__ == "__main__":
	main()
	```