GrandLine

Deterministic shard-first dataset preprocessing for LLM pretraining.

GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute.


Key Properties

| Property | Guarantee |
|---|---|
| Deterministic | Same input + same config + same code version = same output |
| Shard-first | Every shard processed independently, no global state |
| Resumable | Skip already-processed shards on restart |
| Scalable | 4 CPUs → 224 CPUs without architectural changes |
| Dataset-aware | Pipeline compiled per dataset trust level |
| Fast | BLAKE3 hashing, batch tokenization, zstd parquet output |

Installation

# Development install
pip install -e ".[dev]"

# Or install dependencies directly
pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm

Requirements: Python ≥ 3.11


Quick Start

1. Inspect a dataset

python scripts/inspect_dataset.py --source /path/to/parquet/dir/
python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train

2. Process a dataset

python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml

3. With executor tuning

# On Kaggle (conservative memory)
python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  --executor configs/executors/kaggle.yaml

# On a large CPU machine
python scripts/run_dataset.py \
  --config configs/datasets/dclm.yaml \
  --executor configs/executors/local_224cpu.yaml

4. With CLI overrides

python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  packing.max_seq_len=4096 \
  tokenizer.name=Qwen/Qwen3-4B

5. Merge manifests

python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu

Architecture

Pipeline Levels

GrandLine compiles a dataset-specific pipeline based on the dataset's trust level (a minimal sketch of this compilation step follows the table):

| Level | Target | Blocks |
|---|---|---|
| 0 | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack |
| 1 | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack |
| 2 | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack |
| 3 | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack |
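
A minimal sketch of what this compilation can look like: map each trust level to an ordered list of block names and resolve the names through a registry. The block names, LEVEL_BLOCKS table, and compile_pipeline helper below are illustrative assumptions, not GrandLine's actual registry API.

# Illustrative sketch: trust level -> ordered block names -> instantiated blocks.
LEVEL_BLOCKS = {
    0: ["normalize", "exact_dedup", "tokenize", "pack"],
    1: ["normalize", "length_filter", "score_filter", "exact_dedup", "tokenize", "pack"],
    2: ["normalize", "length_filter", "alpha_ratio", "exact_dedup", "tokenize", "pack"],
    3: ["normalize", "lang_id", "heuristics", "exact_dedup", "near_dup", "tokenize", "pack"],
}

def compile_pipeline(trust_level: int, registry: dict, config: dict) -> list:
    """Instantiate the block sequence for a dataset's trust level."""
    return [registry[name](config.get(name, {})) for name in LEVEL_BLOCKS[trust_level]]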

Core Principle

Do not waste compute redoing work already done upstream.

If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them.
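
For example, a score filter can simply read a quality score the upstream curation already produced rather than rescoring documents. A hedged sketch under that reading; the Document shape and filter interface shown are assumptions, not GrandLine's block API.

# Illustrative filter that trusts an upstream quality score instead of recomputing one.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    metadata: dict

def keep_by_upstream_score(doc: Document, field: str = "score", threshold: float = 3.0) -> bool:
    """Keep the document if its upstream score meets the configured threshold."""
    score = doc.metadata.get(field)
    return score is not None and float(score) >= threshold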


Project Structure

grandline/
├── pyproject.toml              # Package definition
├── README.md                   # This file
├── configs/
│   ├── global.yaml             # Default settings
│   ├── datasets/               # Per-dataset pipeline configs
│   │   ├── fineweb_edu.yaml
│   │   ├── fineweb2.yaml
│   │   ├── dclm.yaml
│   │   ├── cosmopedia.yaml
│   │   ├── the_stack_v2.yaml
│   │   ├── finemath.yaml
│   │   └── pes2o.yaml
│   ├── executors/              # Machine-specific tuning
│   │   ├── kaggle.yaml
│   │   ├── colab.yaml
│   │   └── local_224cpu.yaml
│   └── tokenizers/
│       └── qwen3.yaml
├── scripts/
│   ├── run_dataset.py          # Main processing entry point
│   ├── inspect_dataset.py      # Dataset inspection utility
│   └── merge_manifests.py      # Merge per-shard manifests
├── state/                      # Runtime state (gitignored in practice)
│   ├── progress/               # Shard completion markers
│   ├── manifests/              # Per-shard processing manifests
│   ├── cache/                  # Pipeline output cache
│   └── dedup.duckdb            # Persistent dedup database
├── tests/
│   ├── test_determinism.py
│   ├── test_dedup.py
│   ├── test_filters.py
│   ├── test_tokenize.py
│   └── test_pack.py
└── src/grandline/
    ├── __init__.py
    ├── types.py                # Document, TokenizedDocument, PackedSequence
    ├── pipeline.py             # Pipeline compilation and execution
    ├── registry.py             # Block name → class registry
    ├── runtime.py              # Shard-level execution engine
    ├── manifest.py             # Shard/dataset manifest I/O
    ├── cache.py                # Pipeline output caching
    ├── io.py                   # Shard reading (parquet, JSONL, HF)
    ├── hashing.py              # BLAKE3 hashing utilities
    ├── dedup_store.py          # DuckDB-backed dedup state
    ├── writer.py               # Parquet output writer
    ├── blocks/
    │   ├── base.py             # Block, FilterBlock, TransformBlock
    │   ├── normalize.py        # Unicode/whitespace normalization
    │   ├── filters.py          # Length, score, language, alpha filters
    │   ├── dedup.py            # Exact deduplication block
    │   ├── tokenize.py         # Batch tokenization block
    │   └── pack.py             # Deterministic greedy packing
    ├── pipelines/
    │   ├── curated_web.py      # FineWeb-Edu, DCLM, FineWeb2
    │   ├── code.py             # The Stack v2
    │   ├── math.py             # FineMath, OpenWebMath
    │   ├── papers.py           # PeS2o, arXiv
    │   └── synthetic.py        # Cosmopedia
    └── util/
        ├── logging.py          # Structured logging
        ├── paths.py            # Project path management
        └── config.py           # YAML config loading + validation
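
The state/ layout above is what drives resumability: a shard is processed only if its completion marker does not already exist. A minimal sketch of that check, assuming one marker file per shard under state/progress/ (the file naming is illustrative, not GrandLine's exact format):

# Illustrative resume check: skip shards whose completion marker already exists.
from pathlib import Path

def pending_shards(shards: list[str], progress_dir: Path) -> list[str]:
    """Return shards without a completion marker, in stable lexicographic order."""
    remaining = []
    for shard in sorted(shards):                     # stable processing order
        marker = progress_dir / (Path(shard).name + ".done")
        if not marker.exists():
            remaining.append(shard)
    return remaining

def mark_done(shard: str, progress_dir: Path) -> None:
    """Write the completion marker once a shard's output is fully written."""
    progress_dir.mkdir(parents=True, exist_ok=True)
    (progress_dir / (Path(shard).name + ".done")).touch()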

Dataset Configuration

Each dataset gets a YAML config specifying its pipeline:

name: fineweb_edu
pipeline_type: curated_web
trust_level: 0

source:
  repo: "HuggingFaceFW/fineweb-edu-score-2"
  paths: ["/data/fineweb-edu/"]
  text_column: "text"

metadata_columns:
  - "score"
  - "url"

score_filter:
  field: "score"
  threshold: 3.0

tokenizer:
  name: "Qwen/Qwen3-0.6B"
  batch_size: 512

packing:
  max_seq_len: 2048
  eos_id: 151645
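
CLI overrides like packing.max_seq_len=4096 (Quick Start step 4) are merged into this YAML structure via dotted keys. A minimal sketch of that merge, assuming the config is loaded as a plain nested dict; the load_config helper is an illustration, not GrandLine's util/config.py:

# Illustrative sketch: load a dataset config and apply dotted-key CLI overrides.
import yaml

def load_config(path: str, overrides: list[str]) -> dict:
    with open(path) as f:
        config = yaml.safe_load(f)
    for override in overrides:
        dotted_key, raw_value = override.split("=", 1)
        value = yaml.safe_load(raw_value)            # "4096" -> int, "true" -> bool
        node = config
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config

config = load_config("configs/datasets/fineweb_edu.yaml",
                     ["packing.max_seq_len=4096", "tokenizer.name=Qwen/Qwen3-4B"])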

Output Format

GrandLine produces compressed parquet files with this schema:

| Column | Type | Description |
|---|---|---|
| input_ids | list<int32> | Packed token IDs (padded to max_seq_len) |
| seq_lens | list<int32> | Per-document lengths within the packed sequence |
| total_tokens | int32 | Non-padding token count |
| shard_id | string | Provenance: which input shard produced this |

The seq_lens field enables document-level attention masking with FlashAttention varlen APIs, preventing cross-document attention contamination.
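
For example, seq_lens can be turned into the cumulative-offset form (commonly called cu_seqlens) that varlen attention kernels expect, so each packed document attends only to itself. A hedged sketch; the output path is hypothetical and the column names follow the schema above:

# Illustrative sketch: read one packed row and build cu_seqlens for a varlen kernel.
import pyarrow.parquet as pq
import torch

table = pq.read_table("output/fineweb_edu/shard_00000.parquet")   # hypothetical path
row = table.slice(0, 1).to_pylist()[0]

seq_lens = torch.tensor(row["seq_lens"], dtype=torch.int32)
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)
# cu_seqlens, together with row["total_tokens"], is what varlen attention APIs
# typically take to prevent cross-document attention within a packed sequence.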


Performance Choices

| Choice | Why |
|---|---|
| BLAKE3 | 3-4x faster than SHA-256, deterministic, parallelizable |
| DuckDB | Persistent dedup state, fast hash lookups, no external service |
| Parquet + zstd | Columnar, compressed, random-access, ecosystem-compatible |
| Rust-backed tokenizers | Batch tokenization releases the GIL, ~10x faster than pure Python |
| Streaming first-fit packing | Memory-bounded, deterministic, no sorting needed (see the sketch below) |
| Single-pass processing | Read each shard once, apply all transforms, write once |
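
The packing choice deserves a concrete illustration: documents are placed into the first open sequence they fit in, and sequences are emitted once they can no longer grow, keeping only a bounded number of open sequences in memory. This is a hedged sketch of one such streaming first-fit variant, not GrandLine's pack.py:

# Illustrative streaming first-fit packer over token-id lists.
from typing import Iterable, Iterator

def pack_first_fit(docs: Iterable[list[int]], max_seq_len: int,
                   max_open: int = 8) -> Iterator[list[list[int]]]:
    """Yield groups of documents whose total token count fits in max_seq_len."""
    open_seqs: list[tuple[int, list[list[int]]]] = []    # (used_tokens, documents)
    for ids in docs:
        ids = ids[:max_seq_len]                           # truncate oversized documents
        for i, (used, group) in enumerate(open_seqs):
            if used + len(ids) <= max_seq_len:            # first open sequence that fits
                open_seqs[i] = (used + len(ids), group + [ids])
                break
        else:
            if len(open_seqs) == max_open:                # bounded memory: flush oldest
                yield open_seqs.pop(0)[1]
            open_seqs.append((len(ids), [ids]))
    for _, group in open_seqs:                            # flush remaining open sequences
        yield group

Because documents are consumed in input order and no sorting or randomness is involved, the same shard always packs into the same sequences.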

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_pack.py -v
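
tests/test_determinism.py covers the guarantee that matters most in practice: processing the same shard twice must produce byte-identical output. A hedged sketch of what such a test can look like; the process_shard import is an assumed entry point, not the actual test code:

# Illustrative determinism test: process the same shard twice, compare BLAKE3 digests.
import blake3
from grandline.runtime import process_shard  # assumed name, for illustration only

def test_same_shard_same_output(tmp_path):
    out_a = process_shard("shard_00000.parquet", out_dir=tmp_path / "a")
    out_b = process_shard("shard_00000.parquet", out_dir=tmp_path / "b")
    assert blake3.blake3(out_a.read_bytes()).hexdigest() == \
           blake3.blake3(out_b.read_bytes()).hexdigest()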

Determinism Guarantees

GrandLine ensures:

  • Stable sort order: shards processed in lexicographic order
  • Stable hashes: BLAKE3 with fixed encoding (UTF-8)
  • Stable pipeline fingerprints: hash of (config + block signatures + versions)
  • Stable shard outputs: same shard file β†’ same output file, always
  • No randomness: no random seeds, no non-deterministic parallelism
  • No hidden state: dedup store is the only persistent state, fully deterministic

The pipeline fingerprint includes (a sketch of computing it follows this list):

  • Dataset config values
  • Block names and versions
  • Block parameters
  • Tokenizer identity
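
A minimal sketch of how such a fingerprint can be computed: canonicalize the inputs, encode as UTF-8, and hash with BLAKE3. The field names below are illustrative, not GrandLine's exact manifest format:

# Illustrative pipeline fingerprint: BLAKE3 over a canonical (sorted-key) JSON
# encoding of everything that can change the output.
import json
import blake3

def pipeline_fingerprint(dataset_config: dict, blocks: list[dict], tokenizer_name: str) -> str:
    payload = {
        "dataset_config": dataset_config,
        "blocks": blocks,        # e.g. [{"name": "normalize", "version": "1", "params": {...}}]
        "tokenizer": tokenizer_name,
    }
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return blake3.blake3(canonical.encode("utf-8")).hexdigest()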

Supported Datasets

| Dataset | Pipeline | Trust Level | Notes |
|---|---|---|---|
| FineWeb-Edu | curated_web | 0 | Educational quality scored |
| FineWeb2 | curated_web | 1 | Multilingual web |
| DCLM | curated_web | 0 | Model-based quality selection |
| The Stack v2 | code | 1 | License-filtered source code |
| FineMath | math | 1 | Mathematical web content |
| PeS2o | papers | 0 | Open access academic papers |
| Cosmopedia | synthetic | 2 | LLM-generated textbooks |

License

Apache-2.0

Generated by ML Intern

This repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

