---
license: apache-2.0
tags:
- dataset-preprocessing
- llm-pretraining
- tokenization
- deduplication
- data-pipeline
- ml-intern
---

# GrandLine

**Deterministic shard-first dataset preprocessing for LLM pretraining.**

GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute.

---

## Key Properties

| Property | Guarantee |
|----------|-----------|
| **Deterministic** | Same input + same config + same code version = same output |
| **Shard-first** | Every shard is processed independently; the dedup store is the only shared state |
| **Resumable** | Skip already-processed shards on restart |
| **Scalable** | 4 CPUs → 224 CPUs without architectural changes |
| **Dataset-aware** | Pipeline compiled per dataset trust level |
| **Fast** | BLAKE3 hashing, batch tokenization, zstd parquet output |
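
As an illustration of the resumable, shard-first behavior in the table above, here is a minimal sketch. It assumes one empty marker file per finished shard under a progress directory; the `.done` suffix and the function names `pending_shards` and `mark_done` are hypothetical and are not taken from GrandLine's code.

```python
from pathlib import Path


def _marker_for(shard_path: str, progress_dir: Path) -> Path:
    # Hypothetical naming scheme: "<shard file name>.done"
    return progress_dir / (Path(shard_path).name + ".done")


def pending_shards(shard_paths, progress_dir):
    """Return shards without a completion marker, in lexicographic order."""
    progress_dir = Path(progress_dir)
    return [
        shard
        for shard in sorted(str(p) for p in shard_paths)
        if not _marker_for(shard, progress_dir).exists()
    ]


def mark_done(shard_path, progress_dir):
    """Record completion; call this only after the shard's output is fully written."""
    progress_dir = Path(progress_dir)
    progress_dir.mkdir(parents=True, exist_ok=True)
    _marker_for(str(shard_path), progress_dir).touch()
```

Writing the marker as the last step means an interrupted shard is simply reprocessed on restart.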

---

## Installation

```bash
# Development install
pip install -e ".[dev]"

# Or install dependencies directly
pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm
```

**Requirements:** Python ≥ 3.11

---

## Quick Start

### 1. Inspect a dataset

```bash
python scripts/inspect_dataset.py --source /path/to/parquet/dir/
python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train
```

### 2. Process a dataset

```bash
python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml
```

### 3. Tune the executor

```bash
# On Kaggle (conservative memory)
python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  --executor configs/executors/kaggle.yaml

# On a large CPU machine
python scripts/run_dataset.py \
  --config configs/datasets/dclm.yaml \
  --executor configs/executors/local_224cpu.yaml
```

### 4. Override config values from the CLI

```bash
python scripts/run_dataset.py \
  --config configs/datasets/fineweb_edu.yaml \
  packing.max_seq_len=4096 \
  tokenizer.name=Qwen/Qwen3-4B
```
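
The dotted `key=value` overrides above could be merged into a loaded config roughly as follows. This is a sketch under assumptions, not the actual logic in `run_dataset.py`: the helper names `parse_scalar` and `apply_overrides` are hypothetical, and only integers, floats, and strings are handled.

```python
import copy


def parse_scalar(raw: str):
    """Interpret a CLI value as int, then float, else leave it as a string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `a.b=value` overrides applied."""
    result = copy.deepcopy(config)  # never mutate the caller's config
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections
        node[leaf] = parse_scalar(raw)
    return result
```

For example, `apply_overrides(cfg, ["packing.max_seq_len=4096"])` replaces only that one value and leaves the rest of `cfg` untouched.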

### 5. Merge manifests

```bash
python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu
```

---

## Architecture

### Pipeline Levels

GrandLine compiles dataset-specific pipelines based on trust level:

| Level | Target | Blocks |
|-------|--------|--------|
| **0** | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack |
| **1** | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack |
| **2** | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack |
| **3** | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack |
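
Read as data, the table above is a mapping from trust level to an ordered block list. The sketch below encodes it that way; the block identifiers are illustrative labels derived from the table and may not match the names used in `registry.py`, and `blocks_for_level` is a hypothetical helper.

```python
# Ordered block lists per trust level, transcribed from the table above.
LEVEL_BLOCKS: dict[int, tuple[str, ...]] = {
    0: ("normalize", "exact_dedup", "tokenize", "pack"),
    1: ("normalize", "length_filter", "score_filter",
        "exact_dedup", "tokenize", "pack"),
    2: ("normalize", "length_filter", "alpha_ratio",
        "exact_dedup", "tokenize", "pack"),
    3: ("normalize", "lang_id", "heuristics",
        "exact_dedup", "near_dup", "tokenize", "pack"),
}


def blocks_for_level(trust_level: int) -> tuple[str, ...]:
    """Look up the block sequence for a trust level, failing loudly if unknown."""
    try:
        return LEVEL_BLOCKS[trust_level]
    except KeyError:
        raise ValueError(f"unknown trust level: {trust_level!r}") from None
```

Every level starts with normalization and ends with tokenize and pack; higher levels only add filtering and dedup work in between.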

### Core Principle

> Do not waste compute redoing work already done upstream.

If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them.

---

## Project Structure

```
grandline/
├── pyproject.toml              # Package definition
├── README.md                   # This file
├── configs/
│   ├── global.yaml             # Default settings
│   ├── datasets/               # Per-dataset pipeline configs
│   │   ├── fineweb_edu.yaml
│   │   ├── fineweb2.yaml
│   │   ├── dclm.yaml
│   │   ├── cosmopedia.yaml
│   │   ├── the_stack_v2.yaml
│   │   ├── finemath.yaml
│   │   └── pes2o.yaml
│   ├── executors/              # Machine-specific tuning
│   │   ├── kaggle.yaml
│   │   ├── colab.yaml
│   │   └── local_224cpu.yaml
│   └── tokenizers/
│       └── qwen3.yaml
├── scripts/
│   ├── run_dataset.py          # Main processing entry point
│   ├── inspect_dataset.py      # Dataset inspection utility
│   └── merge_manifests.py      # Merge per-shard manifests
├── state/                      # Runtime state (gitignored in practice)
│   ├── progress/               # Shard completion markers
│   ├── manifests/              # Per-shard processing manifests
│   ├── cache/                  # Pipeline output cache
│   └── dedup.duckdb            # Persistent dedup database
├── tests/
│   ├── test_determinism.py
│   ├── test_dedup.py
│   ├── test_filters.py
│   ├── test_tokenize.py
│   └── test_pack.py
└── src/grandline/
    ├── __init__.py
    ├── types.py                # Document, TokenizedDocument, PackedSequence
    ├── pipeline.py             # Pipeline compilation and execution
    ├── registry.py             # Block name → class registry
    ├── runtime.py              # Shard-level execution engine
    ├── manifest.py             # Shard/dataset manifest I/O
    ├── cache.py                # Pipeline output caching
    ├── io.py                   # Shard reading (parquet, JSONL, HF)
    ├── hashing.py              # BLAKE3 hashing utilities
    ├── dedup_store.py          # DuckDB-backed dedup state
    ├── writer.py               # Parquet output writer
    ├── blocks/
    │   ├── base.py             # Block, FilterBlock, TransformBlock
    │   ├── normalize.py        # Unicode/whitespace normalization
    │   ├── filters.py          # Length, score, language, alpha filters
    │   ├── dedup.py            # Exact deduplication block
    │   ├── tokenize.py         # Batch tokenization block
    │   └── pack.py             # Deterministic greedy packing
    ├── pipelines/
    │   ├── curated_web.py      # FineWeb-Edu, DCLM, FineWeb2
    │   ├── code.py             # The Stack v2
    │   ├── math.py             # FineMath, OpenWebMath
    │   ├── papers.py           # PeS2o, arXiv
    │   └── synthetic.py        # Cosmopedia
    └── util/
        ├── logging.py          # Structured logging
        ├── paths.py            # Project path management
        └── config.py           # YAML config loading + validation
```

---

## Dataset Configuration

Each dataset gets a YAML config specifying its pipeline:

```yaml
name: fineweb_edu
pipeline_type: curated_web
trust_level: 0

source:
  repo: "HuggingFaceFW/fineweb-edu-score-2"
  paths: ["/data/fineweb-edu/"]
  text_column: "text"

metadata_columns:
  - "score"
  - "url"

score_filter:
  field: "score"
  threshold: 3.0

tokenizer:
  name: "Qwen/Qwen3-0.6B"
  batch_size: 512

packing:
  max_seq_len: 2048
  eos_id: 151645
```

---

## Output Format

GrandLine produces compressed parquet files with this schema:

| Column | Type | Description |
|--------|------|-------------|
| `input_ids` | `list<int32>` | Packed token IDs (padded to max_seq_len) |
| `seq_lens` | `list<int32>` | Per-document lengths within the packed sequence |
| `total_tokens` | `int32` | Non-padding token count |
| `shard_id` | `string` | Provenance: which input shard produced this |

The `seq_lens` field enables document-level attention masking with FlashAttention `varlen` APIs, preventing cross-document attention contamination.
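
Variable-length attention kernels generally take a flat, unpadded token stream plus cumulative document offsets. The sketch below derives both from rows in the schema above, using plain Python lists; the function name `flatten_for_varlen` is hypothetical, and converting the results to `int32` tensors for a specific kernel is left to the training code.

```python
from itertools import accumulate


def flatten_for_varlen(rows):
    """Strip padding from packed rows and build cumulative document offsets.

    Returns (flat_ids, cu_seqlens, max_seqlen), where cu_seqlens has one more
    entry than the number of documents and starts at 0.
    """
    flat_ids: list[int] = []
    doc_lens: list[int] = []
    for row in rows:
        n = row["total_tokens"]
        if sum(row["seq_lens"]) != n:
            raise ValueError("seq_lens must sum to total_tokens")
        flat_ids.extend(row["input_ids"][:n])  # drop trailing padding
        doc_lens.extend(row["seq_lens"])
    cu_seqlens = [0, *accumulate(doc_lens)]
    return flat_ids, cu_seqlens, max(doc_lens, default=0)
```

Document `i` then occupies `flat_ids[cu_seqlens[i]:cu_seqlens[i + 1]]`, which is the boundary information that keeps attention from crossing documents.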

---

## Performance Choices

| Choice | Why |
|--------|-----|
| **BLAKE3** | 3-4x faster than SHA-256, deterministic, parallelizable |
| **DuckDB** | Persistent dedup state, fast hash lookups, no external service |
| **Parquet + zstd** | Columnar, compressed, random-access, ecosystem-compatible |
| **Rust-backed tokenizers** | Batch tokenization releases the GIL, ~10x faster than pure-Python tokenization |
| **Streaming first-fit packing** | Memory-bounded, deterministic, no sorting needed |
| **Single-pass processing** | Read shard once, apply all transforms, write once |
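
To make the streaming first-fit row concrete, here is a minimal sketch that emits rows shaped like the output schema. Several details are assumptions rather than GrandLine's actual behavior: an EOS token is appended to every document, over-long documents are split into `max_seq_len` chunks, at most `max_open_bins` partial sequences are held in memory, and the names `OpenBin` and `pack_first_fit` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove targets one exact bin
class OpenBin:
    input_ids: list[int] = field(default_factory=list)
    seq_lens: list[int] = field(default_factory=list)


def pack_first_fit(docs, max_seq_len, eos_id, pad_id, max_open_bins=4):
    """Yield packed rows; each piece goes into the first open bin with room."""
    open_bins: list[OpenBin] = []

    def emit(b: OpenBin) -> dict:
        total = len(b.input_ids)
        return {
            "input_ids": b.input_ids + [pad_id] * (max_seq_len - total),
            "seq_lens": b.seq_lens,
            "total_tokens": total,
        }

    for doc in docs:
        tokens = list(doc) + [eos_id]
        for start in range(0, len(tokens), max_seq_len):
            piece = tokens[start:start + max_seq_len]
            for b in open_bins:
                if len(b.input_ids) + len(piece) <= max_seq_len:
                    target = b
                    break
            else:
                # No bin has room: evict the oldest bin if at capacity.
                if len(open_bins) >= max_open_bins:
                    yield emit(open_bins.pop(0))
                target = OpenBin()
                open_bins.append(target)
            target.input_ids.extend(piece)
            target.seq_lens.append(len(piece))
            if len(target.input_ids) == max_seq_len:
                open_bins.remove(target)
                yield emit(target)

    for b in open_bins:  # flush partial bins in creation order
        yield emit(b)
```

Because bins are scanned and flushed in a fixed order and no sorting or sampling is involved, the same document stream always yields the same rows.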

---

## Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_pack.py -v
```

---

## Determinism Guarantees

GrandLine ensures:

- **Stable sort order**: shards processed in lexicographic order
- **Stable hashes**: BLAKE3 with fixed encoding (UTF-8)
- **Stable pipeline fingerprints**: hash of (config + block signatures + versions)
- **Stable shard outputs**: same shard file → same output file, always
- **No randomness**: no random seeds, no non-deterministic parallelism
- **No hidden state**: dedup store is the only persistent state, fully deterministic
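
The stable-hash and dedup-store items above can be sketched with the standard library alone. Note the substitutions: `hashlib.blake2b` stands in for BLAKE3 and `sqlite3` stands in for DuckDB so the example runs without extra dependencies. NFC normalization before hashing and the names `content_hash`, `DedupStore`, and `exact_dedup` are also assumptions.

```python
import hashlib
import sqlite3
import unicodedata


def content_hash(text: str) -> bytes:
    """Hash the UTF-8 bytes of the NFC-normalized text (stand-in for BLAKE3)."""
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=16).digest()


class DedupStore:
    """Persistent set of seen hashes (sqlite3 here, standing in for DuckDB)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (hash BLOB PRIMARY KEY)"
        )

    def add_if_new(self, digest: bytes) -> bool:
        """Insert the hash; return True only if it was not already present."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (hash) VALUES (?)", (digest,)
        )
        self.conn.commit()
        return cur.rowcount == 1


def exact_dedup(texts, store: DedupStore):
    """Yield each text the first time its content hash is seen."""
    for text in texts:
        if store.add_if_new(content_hash(text)):
            yield text
```

Which copy of a duplicate survives depends on the order in which documents reach the store, which is why a fixed shard order matters for reproducibility.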

Pipeline fingerprint includes:
- Dataset config values
- Block names and versions
- Block parameters
- Tokenizer identity
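
One way to fold the inputs listed above into a single stable fingerprint is to hash a canonical JSON serialization. This sketch is an assumption about the approach, not GrandLine's implementation: `pipeline_fingerprint` is a hypothetical function, and `hashlib.blake2b` is used in place of BLAKE3 to keep the example within the standard library.

```python
import hashlib
import json


def pipeline_fingerprint(config: dict, blocks: list[dict], tokenizer_id: str) -> str:
    """Hash config values, block descriptors, and tokenizer identity.

    `blocks` is an ordered list of dicts such as
    {"name": ..., "version": ..., "params": {...}}; block order is significant.
    """
    payload = {"config": config, "blocks": blocks, "tokenizer": tokenizer_id}
    # sort_keys and fixed separators make the serialization independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=16).hexdigest()
```

Any change to a config value, block version, block parameter, or tokenizer name produces a different fingerprint, which can then be used to invalidate cached shard outputs.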

---

## Supported Datasets

| Dataset | Pipeline | Trust Level | Notes |
|---------|----------|-------------|-------|
| FineWeb-Edu | curated_web | 0 | Educational quality scored |
| FineWeb2 | curated_web | 1 | Multilingual web |
| DCLM | curated_web | 0 | Model-based quality selection |
| The Stack v2 | code | 1 | License-filtered source code |
| FineMath | math | 1 | Mathematical web content |
| PeS2o | papers | 0 | Open access academic papers |
| Cosmopedia | synthetic | 2 | LLM-generated textbooks |

---

## License

Apache-2.0

<!-- ml-intern-provenance -->
## Generated by ML Intern

This repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
