	---
	license: apache-2.0
	tags:
	- dataset-preprocessing
	- llm-pretraining
	- tokenization
	- deduplication
	- data-pipeline
	- ml-intern
	---

	# GrandLine

	Deterministic shard-first dataset preprocessing for LLM pretraining.

	GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute.

	---

	## Key Properties

	| Property | Guarantee |
	|----------|-----------|
	| Deterministic | Same input + same config + same code version = same output |
	| Shard-first | Every shard processed independently, no global state |
	| Resumable | Skip already-processed shards on restart |
	| Scalable | 4 CPUs → 224 CPUs without architectural changes |
	| Dataset-aware | Pipeline compiled per dataset trust level |
	| Fast | BLAKE3 hashing, batch tokenization, zstd parquet output |
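
The resumable and shard-first properties can be illustrated with a small sketch. It assumes completion is recorded as one empty marker file per shard under `state/progress/`; the `.done` suffix and the helper names `marker_path`, `pending_shards`, and `mark_done` are hypothetical and may not match GrandLine's actual layout.

```python
from pathlib import Path


def marker_path(progress_dir: Path, shard: Path) -> Path:
    # Hypothetical naming scheme: one empty "<shard name>.done" file per finished shard.
    return progress_dir / f"{shard.name}.done"


def pending_shards(shards: list[Path], progress_dir: Path) -> list[Path]:
    """Return unfinished shards in lexicographic order (stable across restarts)."""
    ordered = sorted(shards, key=lambda p: p.as_posix())
    return [s for s in ordered if not marker_path(progress_dir, s).exists()]


def mark_done(progress_dir: Path, shard: Path) -> None:
    # Intended to be called only after the shard's output has been fully written.
    progress_dir.mkdir(parents=True, exist_ok=True)
    marker_path(progress_dir, shard).touch()
```

Because each shard carries its own marker, a restarted run only needs to list the progress directory to decide what is left to do.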

	---

	## Installation

	```bash
	# Development install
	pip install -e ".[dev]"

	# Or install dependencies directly
	pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm
	```

	Requirements: Python ≥ 3.11

	---

	## Quick Start

	### 1. Inspect a dataset

	```bash
	python scripts/inspect_dataset.py --source /path/to/parquet/dir/
	python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train
	```

	### 2. Process a dataset

	```bash
	python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml
	```

	### 3. With executor tuning

	```bash
	# On Kaggle (conservative memory)
	python scripts/run_dataset.py \
	    --config configs/datasets/fineweb_edu.yaml \
	    --executor configs/executors/kaggle.yaml

	# On a large CPU machine
	python scripts/run_dataset.py \
	    --config configs/datasets/dclm.yaml \
	    --executor configs/executors/local_224cpu.yaml
	```

	### 4. With CLI overrides

	```bash
	python scripts/run_dataset.py \
	    --config configs/datasets/fineweb_edu.yaml \
	    packing.max_seq_len=4096 \
	    tokenizer.name=Qwen/Qwen3-4B
	```
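
As a rough illustration of how `dotted.key=value` overrides like the ones above could be merged into a loaded config, here is a minimal sketch. The function name `apply_overrides` and the value-parsing rule (Python literals first, raw string otherwise) are assumptions; the parsing in `scripts/run_dataset.py` may work differently.

```python
import ast
from typing import Any


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Apply `dotted.key=value` overrides to a nested config dict, in place."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(part, {})
        try:
            # Parse Python literals such as ints and floats.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything else (e.g. a model name) is kept as a plain string.
            value = raw_value
        node[leaf] = value
    return config


cfg = {"packing": {"max_seq_len": 2048}, "tokenizer": {"name": "Qwen/Qwen3-0.6B"}}
apply_overrides(cfg, ["packing.max_seq_len=4096", "tokenizer.name=Qwen/Qwen3-4B"])
```

After the call, `cfg["packing"]["max_seq_len"]` is the integer `4096` and `cfg["tokenizer"]["name"]` is the string `"Qwen/Qwen3-4B"`.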

	### 5. Merge manifests

	```bash
	python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu
	```

	---

	## Architecture

	### Pipeline Levels

	GrandLine compiles dataset-specific pipelines based on trust level:

	| Level | Target | Blocks |
	|-------|--------|--------|
	| 0 | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack |
	| 1 | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack |
	| 2 | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack |
	| 3 | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack |
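
The level table above can be read as a lookup from trust level to an ordered block list. The sketch below encodes it that way; the snake_case block identifiers and the function `compile_block_names` are hypothetical, and the names actually registered in `registry.py` may differ.

```python
# Block names per trust level, transcribed from the table above.
LEVEL_BLOCKS: dict[int, list[str]] = {
    0: ["normalize", "exact_dedup", "tokenize", "pack"],
    1: ["normalize", "length_filter", "score_filter", "exact_dedup", "tokenize", "pack"],
    2: ["normalize", "length_filter", "alpha_ratio", "exact_dedup", "tokenize", "pack"],
    3: ["normalize", "lang_id", "heuristics", "exact_dedup", "near_dup", "tokenize", "pack"],
}


def compile_block_names(trust_level: int) -> list[str]:
    """Return the ordered block names for a trust level."""
    if trust_level not in LEVEL_BLOCKS:
        raise ValueError(f"unknown trust level: {trust_level}")
    # Return a copy so callers cannot mutate the shared table.
    return list(LEVEL_BLOCKS[trust_level])
```

Every level starts with `normalize` and ends with `tokenize` then `pack`; lower trust only adds filtering and deduplication steps in between.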

	### Core Principle

	> Do not waste compute redoing work already done upstream.

	If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them.

	---

	## Project Structure

	```
	grandline/
	├── pyproject.toml              # Package definition
	├── README.md                   # This file
	├── configs/
	│   ├── global.yaml             # Default settings
	│   ├── datasets/               # Per-dataset pipeline configs
	│   │   ├── fineweb_edu.yaml
	│   │   ├── fineweb2.yaml
	│   │   ├── dclm.yaml
	│   │   ├── cosmopedia.yaml
	│   │   ├── the_stack_v2.yaml
	│   │   ├── finemath.yaml
	│   │   └── pes2o.yaml
	│   ├── executors/              # Machine-specific tuning
	│   │   ├── kaggle.yaml
	│   │   ├── colab.yaml
	│   │   └── local_224cpu.yaml
	│   └── tokenizers/
	│       └── qwen3.yaml
	├── scripts/
	│   ├── run_dataset.py          # Main processing entry point
	│   ├── inspect_dataset.py      # Dataset inspection utility
	│   └── merge_manifests.py      # Merge per-shard manifests
	├── state/                      # Runtime state (gitignored in practice)
	│   ├── progress/               # Shard completion markers
	│   ├── manifests/              # Per-shard processing manifests
	│   ├── cache/                  # Pipeline output cache
	│   └── dedup.duckdb            # Persistent dedup database
	├── tests/
	│   ├── test_determinism.py
	│   ├── test_dedup.py
	│   ├── test_filters.py
	│   ├── test_tokenize.py
	│   └── test_pack.py
	└── src/grandline/
	    ├── __init__.py
	    ├── types.py                # Document, TokenizedDocument, PackedSequence
	    ├── pipeline.py             # Pipeline compilation and execution
	    ├── registry.py             # Block name → class registry
	    ├── runtime.py              # Shard-level execution engine
	    ├── manifest.py             # Shard/dataset manifest I/O
	    ├── cache.py                # Pipeline output caching
	    ├── io.py                   # Shard reading (parquet, JSONL, HF)
	    ├── hashing.py              # BLAKE3 hashing utilities
	    ├── dedup_store.py          # DuckDB-backed dedup state
	    ├── writer.py               # Parquet output writer
	    ├── blocks/
	    │   ├── base.py             # Block, FilterBlock, TransformBlock
	    │   ├── normalize.py        # Unicode/whitespace normalization
	    │   ├── filters.py          # Length, score, language, alpha filters
	    │   ├── dedup.py            # Exact deduplication block
	    │   ├── tokenize.py         # Batch tokenization block
	    │   └── pack.py             # Deterministic greedy packing
	    ├── pipelines/
	    │   ├── curated_web.py      # FineWeb-Edu, DCLM, FineWeb2
	    │   ├── code.py             # The Stack v2
	    │   ├── math.py             # FineMath, OpenWebMath
	    │   ├── papers.py           # PeS2o, arXiv
	    │   └── synthetic.py        # Cosmopedia
	    └── util/
	        ├── logging.py          # Structured logging
	        ├── paths.py            # Project path management
	        └── config.py           # YAML config loading + validation
	```

	---

	## Dataset Configuration

	Each dataset gets a YAML config specifying its pipeline:

	```yaml
	name: fineweb_edu
	pipeline_type: curated_web
	trust_level: 0

	source:
	  repo: "HuggingFaceFW/fineweb-edu-score-2"
	  paths: ["/data/fineweb-edu/"]
	  text_column: "text"

	metadata_columns:
	  - "score"
	  - "url"

	score_filter:
	  field: "score"
	  threshold: 3.0

	tokenizer:
	  name: "Qwen/Qwen3-0.6B"
	  batch_size: 512

	packing:
	  max_seq_len: 2048
	  eos_id: 151645
	```

	---

	## Output Format

	GrandLine produces compressed parquet files with this schema:

	| Column | Type | Description |
	|--------|------|-------------|
	| `input_ids` | `list<int32>` | Packed token IDs (padded to max_seq_len) |
	| `seq_lens` | `list<int32>` | Per-document lengths within the packed sequence |
	| `total_tokens` | `int32` | Non-padding token count |
	| `shard_id` | `string` | Provenance: which input shard produced this |

	The `seq_lens` field enables document-level attention masking with FlashAttention `varlen` APIs, preventing cross-document attention contamination.
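
Variable-length attention kernels typically expect cumulative sequence boundaries (often called `cu_seqlens`) rather than raw lengths. The sketch below shows one way to derive them, plus a per-position document index for building block-diagonal masks. Both helper names are hypothetical, and converting the result to an int32 tensor for a specific kernel is left to the training code.

```python
from itertools import accumulate


def cu_seqlens_from_seq_lens(seq_lens: list[int]) -> list[int]:
    """Convert per-document lengths into cumulative boundaries, starting at 0."""
    return [0, *accumulate(seq_lens)]


def document_ids(seq_lens: list[int], max_seq_len: int) -> list[int]:
    """Label each position with its document index; padding positions get -1."""
    ids = [doc for doc, n in enumerate(seq_lens) for _ in range(n)]
    return ids + [-1] * (max_seq_len - len(ids))


# A packed row holding a 4-token and a 2-token document, padded to length 8.
boundaries = cu_seqlens_from_seq_lens([4, 2])
positions = document_ids([4, 2], max_seq_len=8)
```

Here `boundaries` is `[0, 4, 6]` and `positions` is `[0, 0, 0, 0, 1, 1, -1, -1]`; two positions may attend to each other only if they share a non-negative document index.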

	---

	## Performance Choices

	| Choice | Why |
	|--------|-----|
	| BLAKE3 | 3-4x faster than SHA-256, deterministic, parallelizable |
	| DuckDB | Persistent dedup state, fast hash lookups, no external service |
	| Parquet + zstd | Columnar, compressed, random-access, ecosystem-compatible |
	| Rust-backed tokenizers | Batch tokenization releases GIL, ~10x faster than Python |
	| Streaming first-fit packing | Memory-bounded, deterministic, no sorting needed |
	| Single-pass processing | Read shard once, apply all transforms, write once |
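
To make the streaming first-fit packing row concrete, here is a minimal sketch that emits rows shaped like the output schema (`input_ids`, `seq_lens`, `total_tokens`). Several details are assumptions rather than GrandLine's confirmed behavior: the `max_open_bins` bound, `pad_id=0`, appending `eos_id` after every document, and truncating documents that exceed `max_seq_len`.

```python
from typing import Iterable, Iterator


def _emit(bin_docs: list[list[int]], max_seq_len: int, pad_id: int) -> dict:
    # Flatten the documents of one bin and pad up to the fixed sequence length.
    flat = [tok for doc in bin_docs for tok in doc]
    return {
        "input_ids": flat + [pad_id] * (max_seq_len - len(flat)),
        "seq_lens": [len(doc) for doc in bin_docs],
        "total_tokens": len(flat),
    }


def pack_first_fit(
    docs: Iterable[list[int]],
    max_seq_len: int,
    eos_id: int,
    pad_id: int = 0,          # assumed padding ID
    max_open_bins: int = 4,   # assumed memory bound on partially filled bins
) -> Iterator[dict]:
    open_bins: list[list[list[int]]] = []
    for doc in docs:
        # Assumption: over-long documents are truncated so the EOS token still fits.
        tokens = doc[: max_seq_len - 1] + [eos_id]
        for bin_docs in open_bins:
            # First fit: place the document in the oldest open bin with room.
            if sum(len(d) for d in bin_docs) + len(tokens) <= max_seq_len:
                bin_docs.append(tokens)
                break
        else:
            open_bins.append([tokens])
            if len(open_bins) > max_open_bins:
                # Flush the oldest bin to keep memory bounded.
                yield _emit(open_bins.pop(0), max_seq_len, pad_id)
    for bin_docs in open_bins:
        yield _emit(bin_docs, max_seq_len, pad_id)


rows = list(pack_first_fit([[1, 2, 3], [4, 5], [6]], max_seq_len=6, eos_id=9, max_open_bins=2))
```

The result depends only on the order of the input documents, so no sorting or random seed is involved; with the inputs above, the third document fills the remaining space in the first row instead of opening a new one.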

	---

	## Running Tests

	```bash
	# Install dev dependencies
	pip install -e ".[dev]"

	# Run all tests
	pytest tests/ -v

	# Run specific test
	pytest tests/test_pack.py -v
	```

	---

	## Determinism Guarantees

	GrandLine ensures:

	- Stable sort order: shards processed in lexicographic order
	- Stable hashes: BLAKE3 with fixed encoding (UTF-8)
	- Stable pipeline fingerprints: hash of (config + block signatures + versions)
	- Stable shard outputs: same shard file → same output file, always
	- No randomness: no random seeds, no non-deterministic parallelism
	- No hidden state: dedup store is the only persistent state, fully deterministic
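
The stable-hash and dedup-store bullets above can be sketched as follows. To stay dependency-free, this uses `hashlib.blake2b` from the standard library as a stand-in for BLAKE3 and an in-memory `set` as a stand-in for the DuckDB store. The normalization rule (NFC plus whitespace collapsing) is also an assumption.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Assumed normalization: NFC plus whitespace collapsing; the real block may differ.
    return " ".join(unicodedata.normalize("NFC", text).split())


def content_hash(text: str) -> str:
    # Fixed UTF-8 encoding keeps the digest stable across platforms.
    # blake2b is a standard-library stand-in for BLAKE3 in this sketch.
    return hashlib.blake2b(normalize(text).encode("utf-8"), digest_size=16).hexdigest()


def dedup(texts: list[str], seen: set[str]) -> list[str]:
    """Keep the first occurrence of each normalized text; `seen` persists across calls."""
    kept = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

Because the first occurrence wins, the outcome depends on the order in which shards reach the store, which is why a fixed lexicographic shard order matters for reproducibility.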

	The pipeline fingerprint includes:
	- Dataset config values
	- Block names and versions
	- Block parameters
	- Tokenizer identity
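
One plausible way to turn the inputs listed above into a stable fingerprint is to hash a canonical JSON serialization. This is only a sketch: the function name, the payload layout, and the use of `hashlib.blake2b` (a standard-library stand-in for BLAKE3) are assumptions.

```python
import hashlib
import json
from typing import Any


def pipeline_fingerprint(
    config: dict[str, Any],
    blocks: list[dict[str, Any]],  # e.g. {"name": ..., "version": ..., "params": {...}}
    tokenizer_id: str,
) -> str:
    payload = {"config": config, "blocks": blocks, "tokenizer": tokenizer_id}
    # Sorted keys and fixed separators give the same byte string for equal inputs,
    # regardless of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=16).hexdigest()
```

Block order is deliberately preserved (lists are not sorted), since reordering blocks changes what the pipeline computes; any change to a block version or parameter yields a different fingerprint.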

	---

	## Supported Datasets

	| Dataset | Pipeline | Trust Level | Notes |
	|---------|----------|-------------|-------|
	| FineWeb-Edu | curated_web | 0 | Educational quality scored |
	| FineWeb2 | curated_web | 1 | Multilingual web |
	| DCLM | curated_web | 0 | Model-based quality selection |
	| The Stack v2 | code | 1 | License-filtered source code |
	| FineMath | math | 1 | Mathematical web content |
	| PeS2o | papers | 0 | Open access academic papers |
	| Cosmopedia | synthetic | 2 | LLM-generated textbooks |

	---

	## License

	Apache-2.0

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = 'dignity045/grandline'
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id)
	```

	For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.