# GrandLine

Deterministic, shard-first dataset preprocessing for LLM pretraining.

GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute.
## Key Properties
| Property | Guarantee |
|---|---|
| Deterministic | Same input + same config + same code version = same output |
| Shard-first | Every shard processed independently, no global state |
| Resumable | Skip already-processed shards on restart |
| Scalable | 4 CPUs → 224 CPUs without architectural changes |
| Dataset-aware | Pipeline compiled per dataset trust level |
| Fast | BLAKE3 hashing, batch tokenization, zstd parquet output |
## Installation

```bash
# Development install
pip install -e ".[dev]"

# Or install dependencies directly
pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm
```

Requirements: Python ≥ 3.11
## Quick Start

### 1. Inspect a dataset

```bash
python scripts/inspect_dataset.py --source /path/to/parquet/dir/
python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train
```
### 2. Process a dataset

```bash
python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml
```
### 3. With executor tuning

```bash
# On Kaggle (conservative memory)
python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  --executor configs/executors/kaggle.yaml

# On a large CPU machine
python scripts/run_dataset.py \
  --config configs/datasets/dclm.yaml \
  --executor configs/executors/local_224cpu.yaml
```
### 4. With CLI overrides

```bash
python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  packing.max_seq_len=4096 \
  tokenizer.name=Qwen/Qwen3-4B
```
### 5. Merge manifests

```bash
python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu
```
## Architecture

### Pipeline Levels
GrandLine compiles dataset-specific pipelines based on trust level (a sketch of the dispatch follows the table):
| Level | Target | Blocks |
|---|---|---|
| 0 | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack |
| 1 | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack |
| 2 | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack |
| 3 | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack |
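In code, the trust-level dispatch might look roughly like this. The `LEVEL_BLOCKS` table and `compile_pipeline` helper are hypothetical stand-ins for what `src/grandline/pipeline.py` and `registry.py` actually do, shown only to make the table concrete:

```python
# Illustrative sketch only: names here are hypothetical, not GrandLine's API.
# Block names mirror the trust-level table above.
LEVEL_BLOCKS = {
    0: ["normalize", "exact_dedup", "tokenize", "pack"],
    1: ["normalize", "length_filter", "score_filter", "exact_dedup", "tokenize", "pack"],
    2: ["normalize", "length_filter", "alpha_ratio", "exact_dedup", "tokenize", "pack"],
    3: ["normalize", "lang_id", "heuristics", "exact_dedup", "near_dedup", "tokenize", "pack"],
}

def compile_pipeline(trust_level: int, registry: dict) -> list:
    """Resolve block names to block instances for a given trust level."""
    return [registry[name]() for name in LEVEL_BLOCKS[trust_level]]
```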
### Core Principle

**Do not waste compute redoing work already done upstream.**

If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them.
## Project Structure

```
grandline/
├── pyproject.toml            # Package definition
├── README.md                 # This file
├── configs/
│   ├── global.yaml           # Default settings
│   ├── datasets/             # Per-dataset pipeline configs
│   │   ├── fineweb_edu.yaml
│   │   ├── fineweb2.yaml
│   │   ├── dclm.yaml
│   │   ├── cosmopedia.yaml
│   │   ├── the_stack_v2.yaml
│   │   ├── finemath.yaml
│   │   └── pes2o.yaml
│   ├── executors/            # Machine-specific tuning
│   │   ├── kaggle.yaml
│   │   ├── colab.yaml
│   │   └── local_224cpu.yaml
│   └── tokenizers/
│       └── qwen3.yaml
├── scripts/
│   ├── run_dataset.py        # Main processing entry point
│   ├── inspect_dataset.py    # Dataset inspection utility
│   └── merge_manifests.py    # Merge per-shard manifests
├── state/                    # Runtime state (gitignored in practice)
│   ├── progress/             # Shard completion markers
│   ├── manifests/            # Per-shard processing manifests
│   ├── cache/                # Pipeline output cache
│   └── dedup.duckdb          # Persistent dedup database
├── tests/
│   ├── test_determinism.py
│   ├── test_dedup.py
│   ├── test_filters.py
│   ├── test_tokenize.py
│   └── test_pack.py
└── src/grandline/
    ├── __init__.py
    ├── types.py              # Document, TokenizedDocument, PackedSequence
    ├── pipeline.py           # Pipeline compilation and execution
    ├── registry.py           # Block name → class registry
    ├── runtime.py            # Shard-level execution engine
    ├── manifest.py           # Shard/dataset manifest I/O
    ├── cache.py              # Pipeline output caching
    ├── io.py                 # Shard reading (parquet, JSONL, HF)
    ├── hashing.py            # BLAKE3 hashing utilities
    ├── dedup_store.py        # DuckDB-backed dedup state
    ├── writer.py             # Parquet output writer
    ├── blocks/
    │   ├── base.py           # Block, FilterBlock, TransformBlock
    │   ├── normalize.py      # Unicode/whitespace normalization
    │   ├── filters.py        # Length, score, language, alpha filters
    │   ├── dedup.py          # Exact deduplication block
    │   ├── tokenize.py       # Batch tokenization block
    │   └── pack.py           # Deterministic greedy packing
    ├── pipelines/
    │   ├── curated_web.py    # FineWeb-Edu, DCLM, FineWeb2
    │   ├── code.py           # The Stack v2
    │   ├── math.py           # FineMath, OpenWebMath
    │   ├── papers.py         # PeS2o, arXiv
    │   └── synthetic.py      # Cosmopedia
    └── util/
        ├── logging.py        # Structured logging
        ├── paths.py          # Project path management
        └── config.py         # YAML config loading + validation
```
## Dataset Configuration
Each dataset gets a YAML config specifying its pipeline:
```yaml
name: fineweb_edu
pipeline_type: curated_web
trust_level: 0

source:
  repo: "HuggingFaceFW/fineweb-edu-score-2"
  paths: ["/data/fineweb-edu/"]
  text_column: "text"
  metadata_columns:
    - "score"
    - "url"

score_filter:
  field: "score"
  threshold: 3.0

tokenizer:
  name: "Qwen/Qwen3-0.6B"
  batch_size: 512

packing:
  max_seq_len: 2048
  eos_id: 151645
```
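A minimal sketch of loading such a config and applying a dotted CLI override (Quick Start step 4). The `load_config` helper below is illustrative; GrandLine's real loading and validation live in `src/grandline/util/config.py`:

```python
import yaml

def load_config(path: str, overrides: list[str] | None = None) -> dict:
    """Load a dataset YAML and apply dotted-path overrides like
    'packing.max_seq_len=4096'. Illustrative sketch only; GrandLine's
    actual loader (util/config.py) also validates the result."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for item in overrides or []:
        dotted, raw = item.split("=", 1)
        *parents, leaf = dotted.split(".")
        node = cfg
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = yaml.safe_load(raw)  # reuse YAML parsing so "4096" becomes an int
    return cfg

cfg = load_config("configs/datasets/fineweb_edu.yaml",
                  ["packing.max_seq_len=4096"])
```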
## Output Format

GrandLine produces compressed parquet files with this schema:

| Column | Type | Description |
|---|---|---|
| `input_ids` | `list<int32>` | Packed token IDs (padded to `max_seq_len`) |
| `seq_lens` | `list<int32>` | Per-document lengths within the packed sequence |
| `total_tokens` | `int32` | Non-padding token count |
| `shard_id` | `string` | Provenance: which input shard produced this |

The `seq_lens` field enables document-level attention masking with FlashAttention varlen APIs, preventing cross-document attention contamination.
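For example, a data loader can turn `seq_lens` into the cumulative boundaries that FlashAttention's varlen interface consumes. A minimal sketch using pyarrow and torch; the parquet path is a placeholder:

```python
import pyarrow.parquet as pq
import torch

# Placeholder path: substitute a real GrandLine output file.
table = pq.read_table("output/fineweb_edu/shard_00000.parquet")
row = table.slice(0, 1).to_pylist()[0]

input_ids = torch.tensor(row["input_ids"], dtype=torch.int32)
seq_lens = torch.tensor(row["seq_lens"], dtype=torch.int32)

# Varlen attention APIs take cumulative boundaries: [0, l0, l0+l1, ...].
# Each document then attends only within its own span, so packed
# neighbors cannot contaminate each other.
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
max_seqlen = int(seq_lens.max())
# cu_seqlens and max_seqlen feed e.g. flash_attn's varlen attention call.
```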
## Performance Choices
| Choice | Why |
|---|---|
| BLAKE3 | 3-4x faster than SHA-256, deterministic, parallelizable |
| DuckDB | Persistent dedup state, fast hash lookups, no external service |
| Parquet + zstd | Columnar, compressed, random-access, ecosystem-compatible |
| Rust-backed tokenizers | Batch tokenization releases the GIL, ~10x faster than Python |
| Streaming first-fit packing | Memory-bounded, deterministic, no sorting needed (see sketch below) |
| Single-pass processing | Read shard once, apply all transforms, write once |
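On the packing row: a minimal sketch of streaming first-fit packing, to make the idea concrete. The `max_open` pool bound and flush-oldest policy are assumptions for illustration; GrandLine's actual block is `src/grandline/blocks/pack.py`:

```python
def pack_streaming(docs, max_seq_len: int, max_open: int = 8):
    """First-fit packing over a stream of tokenized documents.

    Keeps at most `max_open` partially filled sequences; each document
    goes into the first open sequence with room, in arrival order, so
    output is deterministic and never requires sorting or buffering a
    whole shard. Illustrative sketch, not GrandLine's exact code.
    """
    open_seqs = []  # each entry: {"ids": [...], "lens": [...]}
    for ids in docs:
        ids = ids[:max_seq_len]  # assumption: truncate overlong documents
        for seq in open_seqs:
            if len(seq["ids"]) + len(ids) <= max_seq_len:
                seq["ids"].extend(ids)       # first open sequence with room
                seq["lens"].append(len(ids))
                break
        else:
            if len(open_seqs) == max_open:
                yield open_seqs.pop(0)       # flush oldest to bound memory
            open_seqs.append({"ids": list(ids), "lens": [len(ids)]})
    yield from open_seqs                     # flush remaining, in order
```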
## Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run a specific test
pytest tests/test_pack.py -v
```
## Determinism Guarantees
GrandLine ensures:
- Stable sort order: shards processed in lexicographic order
- Stable hashes: BLAKE3 with fixed encoding (UTF-8)
- Stable pipeline fingerprints: hash of (config + block signatures + versions)
- Stable shard outputs: same shard file → same output file, always
- No randomness: no random seeds, no non-deterministic parallelism
- No hidden state: dedup store is the only persistent state, fully deterministic
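On that last guarantee: a minimal sketch of what a DuckDB-backed exact-dedup store can look like. The table name and schema are assumptions; GrandLine's real store is `src/grandline/dedup_store.py`:

```python
import blake3
import duckdb

# Persistent dedup state lives in a single DuckDB file (see Project Structure).
con = duckdb.connect("state/dedup.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS seen_hashes (h VARCHAR PRIMARY KEY)")

def is_duplicate(text: str) -> bool:
    """Return True if this document's BLAKE3 hash was already recorded,
    recording it otherwise. Illustrative sketch only."""
    h = blake3.blake3(text.encode("utf-8")).hexdigest()
    hit = con.execute("SELECT 1 FROM seen_hashes WHERE h = ?", [h]).fetchone()
    if hit is None:
        con.execute("INSERT INTO seen_hashes VALUES (?)", [h])
    return hit is not None
```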
Pipeline fingerprint includes:
- Dataset config values
- Block names and versions
- Block parameters
- Tokenizer identity
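A minimal sketch of such a fingerprint, hashing canonical JSON with BLAKE3; the payload field names are assumptions for illustration:

```python
import json
import blake3

def pipeline_fingerprint(config: dict, blocks: list[dict], tokenizer: str) -> str:
    """Hash everything that affects output: any change to config values,
    block names/versions/parameters, or tokenizer identity changes the
    fingerprint. Canonical JSON (sorted keys) keeps the hash stable."""
    payload = {
        "config": config,
        "blocks": blocks,        # e.g. [{"name": "normalize", "version": "1", "params": {...}}]
        "tokenizer": tokenizer,  # e.g. "Qwen/Qwen3-0.6B"
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return blake3.blake3(canonical.encode("utf-8")).hexdigest()
```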
## Supported Datasets
| Dataset | Pipeline | Trust Level | Notes |
|---|---|---|---|
| FineWeb-Edu | curated_web | 0 | Educational quality scored |
| FineWeb2 | curated_web | 1 | Multilingual web |
| DCLM | curated_web | 0 | Model-based quality selection |
| The Stack v2 | code | 1 | License-filtered source code |
| FineMath | math | 1 | Mathematical web content |
| PeS2o | papers | 0 | Open access academic papers |
| Cosmopedia | synthetic | 2 | LLM-generated textbooks |
## License
Apache-2.0