GrandLine

Deterministic shard-first dataset preprocessing for LLM pretraining.

GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute.


Key Properties

| Property | Guarantee |
|---|---|
| Deterministic | Same input + same config + same code version = same output |
| Shard-first | Every shard processed independently, no global state |
| Resumable | Skip already-processed shards on restart |
| Scalable | 4 CPUs → 224 CPUs without architectural changes |
| Dataset-aware | Pipeline compiled per dataset trust level |
| Fast | BLAKE3 hashing, batch tokenization, zstd parquet output |

Installation

# Development install
pip install -e ".[dev]"

# Or install dependencies directly
pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm

Requirements: Python ≥ 3.11


Quick Start

1. Inspect a dataset

python scripts/inspect_dataset.py --source /path/to/parquet/dir/
python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train

2. Process a dataset

python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml

3. With executor tuning

# On Kaggle (conservative memory)
python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  --executor configs/executors/kaggle.yaml

# On a large CPU machine
python scripts/run_dataset.py \
  --config configs/datasets/dclm.yaml \
  --executor configs/executors/local_224cpu.yaml

4. With CLI overrides

python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  packing.max_seq_len=4096 \
  tokenizer.name=Qwen/Qwen3-4B

5. Merge manifests

python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu

Architecture

Pipeline Levels

GrandLine compiles a dataset-specific pipeline based on the dataset's trust level (a minimal sketch of this compilation step follows the table):

| Level | Target | Blocks |
|---|---|---|
| 0 | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack |
| 1 | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack |
| 2 | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack |
| 3 | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack |
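
A minimal sketch of what this compilation can look like: map each trust level to an ordered list of block names and resolve the names through a registry. The block names, LEVEL_BLOCKS table, and compile_pipeline helper below are illustrative assumptions, not GrandLine's actual registry API.

# Illustrative sketch: trust level -> ordered block names -> instantiated blocks.
LEVEL_BLOCKS = {
    0: ["normalize", "exact_dedup", "tokenize", "pack"],
    1: ["normalize", "length_filter", "score_filter", "exact_dedup", "tokenize", "pack"],
    2: ["normalize", "length_filter", "alpha_ratio", "exact_dedup", "tokenize", "pack"],
    3: ["normalize", "lang_id", "heuristics", "exact_dedup", "near_dup", "tokenize", "pack"],
}

def compile_pipeline(trust_level: int, registry: dict, config: dict) -> list:
    """Instantiate the block sequence for a dataset's trust level."""
    return [registry[name](config.get(name, {})) for name in LEVEL_BLOCKS[trust_level]]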

Core Principle

Do not waste compute redoing work already done upstream.

If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them.
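
For example, a score filter can simply read a quality score the upstream curation already produced rather than rescoring documents. A hedged sketch under that reading; the Document shape and filter interface shown are assumptions, not GrandLine's block API.

# Illustrative filter that trusts an upstream quality score instead of recomputing one.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    metadata: dict

def keep_by_upstream_score(doc: Document, field: str = "score", threshold: float = 3.0) -> bool:
    """Keep the document if its upstream score meets the configured threshold."""
    score = doc.metadata.get(field)
    return score is not None and float(score) >= threshold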


Project Structure

grandline/
├── pyproject.toml              # Package definition
├── README.md                   # This file
├── configs/
│   ├── global.yaml             # Default settings
│   ├── datasets/               # Per-dataset pipeline configs
│   │   ├── fineweb_edu.yaml
│   │   ├── fineweb2.yaml
│   │   ├── dclm.yaml
│   │   ├── cosmopedia.yaml
│   │   ├── the_stack_v2.yaml
│   │   ├── finemath.yaml
│   │   └── pes2o.yaml
│   ├── executors/              # Machine-specific tuning
│   │   ├── kaggle.yaml
│   │   ├── colab.yaml
│   │   └── local_224cpu.yaml
│   └── tokenizers/
│       └── qwen3.yaml
├── scripts/
│   ├── run_dataset.py          # Main processing entry point
│   ├── inspect_dataset.py      # Dataset inspection utility
│   └── merge_manifests.py      # Merge per-shard manifests
├── state/                      # Runtime state (gitignored in practice)
│   ├── progress/               # Shard completion markers
│   ├── manifests/              # Per-shard processing manifests
│   ├── cache/                  # Pipeline output cache
│   └── dedup.duckdb            # Persistent dedup database
├── tests/
│   ├── test_determinism.py
│   ├── test_dedup.py
│   ├── test_filters.py
│   ├── test_tokenize.py
│   └── test_pack.py
└── src/grandline/
    ├── __init__.py
    ├── types.py                # Document, TokenizedDocument, PackedSequence
    ├── pipeline.py             # Pipeline compilation and execution
    ├── registry.py             # Block name → class registry
    ├── runtime.py              # Shard-level execution engine
    ├── manifest.py             # Shard/dataset manifest I/O
    ├── cache.py                # Pipeline output caching
    ├── io.py                   # Shard reading (parquet, JSONL, HF)
    ├── hashing.py              # BLAKE3 hashing utilities
    ├── dedup_store.py          # DuckDB-backed dedup state
    ├── writer.py               # Parquet output writer
    ├── blocks/
    │   ├── base.py             # Block, FilterBlock, TransformBlock
    │   ├── normalize.py        # Unicode/whitespace normalization
    │   ├── filters.py          # Length, score, language, alpha filters
    │   ├── dedup.py            # Exact deduplication block
    │   ├── tokenize.py         # Batch tokenization block
    │   └── pack.py             # Deterministic greedy packing
    ├── pipelines/
    │   ├── curated_web.py      # FineWeb-Edu, DCLM, FineWeb2
    │   ├── code.py             # The Stack v2
    │   ├── math.py             # FineMath, OpenWebMath
    │   ├── papers.py           # PeS2o, arXiv
    │   └── synthetic.py        # Cosmopedia
    └── util/
        ├── logging.py          # Structured logging
        ├── paths.py            # Project path management
        └── config.py           # YAML config loading + validation
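
The state/ layout above is what drives resumability: a shard is processed only if its completion marker does not already exist. A minimal sketch of that check, assuming one marker file per shard under state/progress/ (the file naming is illustrative, not GrandLine's exact format):

# Illustrative resume check: skip shards whose completion marker already exists.
from pathlib import Path

def pending_shards(shards: list[str], progress_dir: Path) -> list[str]:
    """Return shards without a completion marker, in stable lexicographic order."""
    remaining = []
    for shard in sorted(shards):                     # stable processing order
        marker = progress_dir / (Path(shard).name + ".done")
        if not marker.exists():
            remaining.append(shard)
    return remaining

def mark_done(shard: str, progress_dir: Path) -> None:
    """Write the completion marker once a shard's output is fully written."""
    progress_dir.mkdir(parents=True, exist_ok=True)
    (progress_dir / (Path(shard).name + ".done")).touch()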

Dataset Configuration

Each dataset gets a YAML config specifying its pipeline:

name: fineweb_edu
pipeline_type: curated_web
trust_level: 0

source:
  repo: "HuggingFaceFW/fineweb-edu-score-2"
  paths: ["/data/fineweb-edu/"]
  text_column: "text"

metadata_columns:
  - "score"
  - "url"

score_filter:
  field: "score"
  threshold: 3.0

tokenizer:
  name: "Qwen/Qwen3-0.6B"
  batch_size: 512

packing:
  max_seq_len: 2048
  eos_id: 151645
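
CLI overrides like packing.max_seq_len=4096 (Quick Start step 4) are merged into this YAML structure via dotted keys. A minimal sketch of that merge, assuming the config is loaded as a plain nested dict; the load_config helper is an illustration, not GrandLine's util/config.py:

# Illustrative sketch: load a dataset config and apply dotted-key CLI overrides.
import yaml

def load_config(path: str, overrides: list[str]) -> dict:
    with open(path) as f:
        config = yaml.safe_load(f)
    for override in overrides:
        dotted_key, raw_value = override.split("=", 1)
        value = yaml.safe_load(raw_value)            # "4096" -> int, "true" -> bool
        node = config
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config

config = load_config("configs/datasets/fineweb_edu.yaml",
                     ["packing.max_seq_len=4096", "tokenizer.name=Qwen/Qwen3-4B"])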

Output Format

GrandLine produces compressed parquet files with this schema:

| Column | Type | Description |
|---|---|---|
| input_ids | list<int32> | Packed token IDs (padded to max_seq_len) |
| seq_lens | list<int32> | Per-document lengths within the packed sequence |
| total_tokens | int32 | Non-padding token count |
| shard_id | string | Provenance: which input shard produced this |

The seq_lens field enables document-level attention masking with FlashAttention varlen APIs, preventing cross-document attention contamination.
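
For example, seq_lens can be turned into the cumulative-offset form (commonly called cu_seqlens) that varlen attention kernels expect, so each packed document attends only to itself. A hedged sketch; the output path is hypothetical and the column names follow the schema above:

# Illustrative sketch: read one packed row and build cu_seqlens for a varlen kernel.
import pyarrow.parquet as pq
import torch

table = pq.read_table("output/fineweb_edu/shard_00000.parquet")   # hypothetical path
row = table.slice(0, 1).to_pylist()[0]

seq_lens = torch.tensor(row["seq_lens"], dtype=torch.int32)
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
# cu_seqlens, together with row["total_tokens"], is what varlen attention APIs
# typically take to prevent cross-document attention within a packed sequence.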


Performance Choices

| Choice | Why |
|---|---|
| BLAKE3 | 3-4x faster than SHA-256, deterministic, parallelizable |
| DuckDB | Persistent dedup state, fast hash lookups, no external service |
| Parquet + zstd | Columnar, compressed, random-access, ecosystem-compatible |
| Rust-backed tokenizers | Batch tokenization releases the GIL, ~10x faster than pure Python |
| Streaming first-fit packing | Memory-bounded, deterministic, no sorting needed (see the sketch below) |
| Single-pass processing | Read each shard once, apply all transforms, write once |
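
The packing choice deserves a concrete illustration: documents are placed into the first open sequence they fit in, and sequences are emitted once they can no longer grow, keeping only a bounded number of open sequences in memory. This is a hedged sketch of one such streaming first-fit variant, not GrandLine's pack.py:

# Illustrative streaming first-fit packer over token-id lists.
from typing import Iterable, Iterator

def pack_first_fit(docs: Iterable[list[int]], max_seq_len: int,
                   max_open: int = 8) -> Iterator[list[list[int]]]:
    """Yield groups of documents whose total token count fits in max_seq_len."""
    open_seqs: list[tuple[int, list[list[int]]]] = []    # (used_tokens, documents)
    for ids in docs:
        ids = ids[:max_seq_len]                           # truncate oversized documents
        for i, (used, group) in enumerate(open_seqs):
            if used + len(ids) <= max_seq_len:            # first open sequence that fits
                open_seqs[i] = (used + len(ids), group + [ids])
                break
        else:
            if len(open_seqs) == max_open:                # bounded memory: flush oldest
                yield open_seqs.pop(0)[1]
            open_seqs.append((len(ids), [ids]))
    for _, group in open_seqs:                            # flush remaining open sequences
        yield group

Because documents are consumed in input order and no sorting or randomness is involved, the same shard always packs into the same sequences.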

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_pack.py -v
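
tests/test_determinism.py covers the guarantee that matters most in practice: processing the same shard twice must produce byte-identical output. A hedged sketch of what such a test can look like; the process_shard import is an assumed entry point, not the actual test code:

# Illustrative determinism test: process the same shard twice, compare BLAKE3 digests.
import blake3
from grandline.runtime import process_shard  # assumed name, for illustration only

def test_same_shard_same_output(tmp_path):
    out_a = process_shard("shard_00000.parquet", out_dir=tmp_path / "a")
    out_b = process_shard("shard_00000.parquet", out_dir=tmp_path / "b")
    assert blake3.blake3(out_a.read_bytes()).hexdigest() == \
           blake3.blake3(out_b.read_bytes()).hexdigest()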

Determinism Guarantees

GrandLine ensures:

  • Stable sort order: shards processed in lexicographic order
  • Stable hashes: BLAKE3 with fixed encoding (UTF-8)
  • Stable pipeline fingerprints: hash of (config + block signatures + versions)
  • Stable shard outputs: same shard file β†’ same output file, always
  • No randomness: no random seeds, no non-deterministic parallelism
  • No hidden state: dedup store is the only persistent state, fully deterministic

The pipeline fingerprint includes (a sketch of computing it follows this list):

  • Dataset config values
  • Block names and versions
  • Block parameters
  • Tokenizer identity
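
A minimal sketch of how such a fingerprint can be computed: canonicalize the inputs, encode as UTF-8, and hash with BLAKE3. The field names below are illustrative, not GrandLine's exact manifest format:

# Illustrative pipeline fingerprint: BLAKE3 over a canonical (sorted-key) JSON
# encoding of everything that can change the output.
import json
import blake3

def pipeline_fingerprint(dataset_config: dict, blocks: list[dict], tokenizer_name: str) -> str:
    payload = {
        "dataset_config": dataset_config,
        "blocks": blocks,        # e.g. [{"name": "normalize", "version": "1", "params": {...}}]
        "tokenizer": tokenizer_name,
    }
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return blake3.blake3(canonical.encode("utf-8")).hexdigest()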

Supported Datasets

| Dataset | Pipeline | Trust Level | Notes |
|---|---|---|---|
| FineWeb-Edu | curated_web | 0 | Educational quality scored |
| FineWeb2 | curated_web | 1 | Multilingual web |
| DCLM | curated_web | 0 | Model-based quality selection |
| The Stack v2 | code | 1 | License-filtered source code |
| FineMath | math | 1 | Mathematical web content |
| PeS2o | papers | 0 | Open access academic papers |
| Cosmopedia | synthetic | 2 | LLM-generated textbooks |

License

Apache-2.0

Generated by ML Intern

This repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

