---
license: apache-2.0
tags:
- dataset-preprocessing
- llm-pretraining
- tokenization
- deduplication
- data-pipeline
- ml-intern
---
# GrandLine
**Deterministic shard-first dataset preprocessing for LLM pretraining.**
GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute.
---
## Key Properties
| Property | Guarantee |
|----------|-----------|
| **Deterministic** | Same input + same config + same code version = same output |
| **Shard-first** | Every shard is processed independently; the dedup store is the only shared state |
| **Resumable** | Skip already-processed shards on restart |
| **Scalable** | 4 CPUs → 224 CPUs without architectural changes |
| **Dataset-aware** | Pipeline compiled per dataset trust level |
| **Fast** | BLAKE3 hashing, batch tokenization, zstd parquet output |
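
The shard-first and resumable properties can be sketched as a loop that visits shards in sorted order and skips any shard that already has a completion marker. This is a minimal illustration, not GrandLine's runtime: `run_shards`, `process_shard`, and the `.done` marker naming are hypothetical.

```python
from pathlib import Path
from typing import Callable


def run_shards(
    shard_dir: Path,
    progress_dir: Path,
    process_shard: Callable[[Path], None],
) -> list[str]:
    """Process each pending shard in lexicographic order; return the names processed."""
    progress_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    # Sorting by file name gives the same order on every run.
    for shard in sorted(shard_dir.glob("*.parquet"), key=lambda p: p.name):
        marker = progress_dir / f"{shard.name}.done"  # hypothetical marker name
        if marker.exists():
            continue  # finished in an earlier run
        process_shard(shard)
        marker.touch()  # written only after the shard succeeds
        processed.append(shard.name)
    return processed
```
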
---
## Installation
```bash
# Development install
pip install -e ".[dev]"
# Or install dependencies directly
pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm
```
**Requirements:** Python ≥ 3.11
---
## Quick Start
### 1. Inspect a dataset
```bash
python scripts/inspect_dataset.py --source /path/to/parquet/dir/
python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train
```
### 2. Process a dataset
```bash
python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml
```
### 3. With executor tuning
```bash
# On Kaggle (conservative memory)
python scripts/run_dataset.py \
--config configs/datasets/fineweb_edu.yaml \
--executor configs/executors/kaggle.yaml
# On a large CPU machine
python scripts/run_dataset.py \
--config configs/datasets/dclm.yaml \
--executor configs/executors/local_224cpu.yaml
```
### 4. With CLI overrides
```bash
python scripts/run_dataset.py \
--config configs/datasets/fineweb_edu.yaml \
packing.max_seq_len=4096 \
tokenizer.name=Qwen/Qwen3-4B
```
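
Overrides of the form `section.key=value` presumably map onto nested config keys. The sketch below shows one way such an override could be applied to a loaded config dictionary; `apply_overrides` and its value-coercion rules are assumptions for illustration, not GrandLine's actual parser.

```python
import copy
from typing import Any


def _coerce(raw: str) -> Any:
    """Turn a CLI string into int or float when possible; otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with each 'a.b.c=value' override applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, sep, raw_value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections
        node[leaf] = _coerce(raw_value)
    return result


config = {"packing": {"max_seq_len": 2048}, "tokenizer": {"name": "Qwen/Qwen3-0.6B"}}
updated = apply_overrides(
    config, ["packing.max_seq_len=4096", "tokenizer.name=Qwen/Qwen3-4B"]
)
```
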
### 5. Merge manifests
```bash
python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu
```
---
## Architecture
### Pipeline Levels
GrandLine compiles dataset-specific pipelines based on trust level:
| Level | Target | Blocks |
|-------|--------|--------|
| **0** | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack |
| **1** | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack |
| **2** | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack |
| **3** | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack |
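
The table above can be read as a lookup from trust level to an ordered list of block names. A minimal sketch follows; the block identifiers and `compile_pipeline` are hypothetical names chosen for illustration and may not match the names in GrandLine's registry.

```python
# Hypothetical block identifiers mirroring the table above.
PIPELINE_BLOCKS: dict[int, tuple[str, ...]] = {
    0: ("normalize", "exact_dedup", "tokenize", "pack"),
    1: ("normalize", "length_filter", "score_filter", "exact_dedup", "tokenize", "pack"),
    2: ("normalize", "length_filter", "alpha_ratio", "exact_dedup", "tokenize", "pack"),
    3: ("normalize", "lang_id", "heuristics", "exact_dedup", "near_dedup", "tokenize", "pack"),
}


def compile_pipeline(trust_level: int) -> tuple[str, ...]:
    """Return the ordered block names for a trust level."""
    try:
        return PIPELINE_BLOCKS[trust_level]
    except KeyError:
        raise ValueError(f"unknown trust level: {trust_level}") from None
```

Higher trust levels skip the filtering blocks, which is where the compute saving comes from.
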
### Core Principle
> Do not waste compute redoing work already done upstream.
If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them.
---
## Project Structure
```
grandline/
├── pyproject.toml  # Package definition
├── README.md  # This file
├── configs/
│   ├── global.yaml  # Default settings
│   ├── datasets/  # Per-dataset pipeline configs
│   │   ├── fineweb_edu.yaml
│   │   ├── fineweb2.yaml
│   │   ├── dclm.yaml
│   │   ├── cosmopedia.yaml
│   │   ├── the_stack_v2.yaml
│   │   ├── finemath.yaml
│   │   └── pes2o.yaml
│   ├── executors/  # Machine-specific tuning
│   │   ├── kaggle.yaml
│   │   ├── colab.yaml
│   │   └── local_224cpu.yaml
│   └── tokenizers/
│       └── qwen3.yaml
├── scripts/
│   ├── run_dataset.py  # Main processing entry point
│   ├── inspect_dataset.py  # Dataset inspection utility
│   └── merge_manifests.py  # Merge per-shard manifests
├── state/  # Runtime state (gitignored in practice)
│   ├── progress/  # Shard completion markers
│   ├── manifests/  # Per-shard processing manifests
│   ├── cache/  # Pipeline output cache
│   └── dedup.duckdb  # Persistent dedup database
├── tests/
│   ├── test_determinism.py
│   ├── test_dedup.py
│   ├── test_filters.py
│   ├── test_tokenize.py
│   └── test_pack.py
└── src/grandline/
    ├── __init__.py
    ├── types.py  # Document, TokenizedDocument, PackedSequence
    ├── pipeline.py  # Pipeline compilation and execution
    ├── registry.py  # Block name → class registry
    ├── runtime.py  # Shard-level execution engine
    ├── manifest.py  # Shard/dataset manifest I/O
    ├── cache.py  # Pipeline output caching
    ├── io.py  # Shard reading (parquet, JSONL, HF)
    ├── hashing.py  # BLAKE3 hashing utilities
    ├── dedup_store.py  # DuckDB-backed dedup state
    ├── writer.py  # Parquet output writer
    ├── blocks/
    │   ├── base.py  # Block, FilterBlock, TransformBlock
    │   ├── normalize.py  # Unicode/whitespace normalization
    │   ├── filters.py  # Length, score, language, alpha filters
    │   ├── dedup.py  # Exact deduplication block
    │   ├── tokenize.py  # Batch tokenization block
    │   └── pack.py  # Deterministic greedy packing
    ├── pipelines/
    │   ├── curated_web.py  # FineWeb-Edu, DCLM, FineWeb2
    │   ├── code.py  # The Stack v2
    │   ├── math.py  # FineMath, OpenWebMath
    │   ├── papers.py  # PeS2o, arXiv
    │   └── synthetic.py  # Cosmopedia
    └── util/
        ├── logging.py  # Structured logging
        ├── paths.py  # Project path management
        └── config.py  # YAML config loading + validation
```
---
## Dataset Configuration
Each dataset gets a YAML config specifying its pipeline:
```yaml
name: fineweb_edu
pipeline_type: curated_web
trust_level: 0
source:
  repo: "HuggingFaceFW/fineweb-edu-score-2"
  paths: ["/data/fineweb-edu/"]
  text_column: "text"
  metadata_columns:
    - "score"
    - "url"
score_filter:
  field: "score"
  threshold: 3.0
tokenizer:
  name: "Qwen/Qwen3-0.6B"
  batch_size: 512
packing:
  max_seq_len: 2048
  eos_id: 151645
```
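
As an illustration of how a `score_filter` entry like the one above might be applied, the sketch below keeps documents whose score field meets the threshold. Whether GrandLine treats the threshold as inclusive, and how it handles missing scores, is not stated in this README; both behaviours below are assumptions, and the function name is hypothetical.

```python
from typing import Iterable, Iterator


def score_filter(docs: Iterable[dict], field: str, threshold: float) -> Iterator[dict]:
    """Yield documents whose `field` value is at least `threshold`."""
    for doc in docs:
        value = doc.get(field)
        # Assumption: documents without a score are dropped, and the bound is inclusive.
        if value is not None and value >= threshold:
            yield doc


docs = [
    {"text": "a", "score": 3.4},
    {"text": "b", "score": 2.9},
    {"text": "c"},
]
kept = list(score_filter(docs, field="score", threshold=3.0))
```
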
---
## Output Format
GrandLine produces compressed parquet files with this schema:
| Column | Type | Description |
|--------|------|-------------|
| `input_ids` | `list<int32>` | Packed token IDs (padded to max_seq_len) |
| `seq_lens` | `list<int32>` | Per-document lengths within the packed sequence |
| `total_tokens` | `int32` | Non-padding token count |
| `shard_id` | `string` | Provenance: which input shard produced this |
The `seq_lens` field enables document-level attention masking with FlashAttention `varlen` APIs, preventing cross-document attention contamination.
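
Variable-length attention kernels typically take cumulative sequence boundaries rather than per-document lengths. As a sketch, `seq_lens` can be converted to such offsets with a running sum. The helper name `cu_seqlens_from_seq_lens` is hypothetical, and the exact tensor type and arguments expected depend on the attention library in use.

```python
from itertools import accumulate


def cu_seqlens_from_seq_lens(seq_lens: list[int]) -> list[int]:
    """Return cumulative offsets [0, l0, l0 + l1, ...] for one packed row."""
    return [0, *accumulate(seq_lens)]


# A packed row holding three documents of 5, 3, and 8 tokens.
offsets = cu_seqlens_from_seq_lens([5, 3, 8])
# Document i occupies input_ids[offsets[i]:offsets[i + 1]].
```
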
---
## Performance Choices
| Choice | Why |
|--------|-----|
| **BLAKE3** | 3-4x faster than SHA-256, deterministic, parallelizable |
| **DuckDB** | Persistent dedup state, fast hash lookups, no external service |
| **Parquet + zstd** | Columnar, compressed, random-access, ecosystem-compatible |
| **Rust-backed tokenizers** | Batch tokenization releases the GIL, ~10x faster than pure-Python tokenizers |
| **Streaming first-fit packing** | Memory-bounded, deterministic, no sorting needed |
| **Single-pass processing** | Read shard once, apply all transforms, write once |
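
To make the streaming first-fit choice concrete, here is a sketch that keeps a bounded set of open rows and places each document in the first row with enough room. The function name, the `max_open` bound, truncation of over-long documents, and the use of `eos_id` as both separator and padding are assumptions for illustration, not GrandLine's actual packer.

```python
from typing import Iterable, Iterator


def pack_first_fit(
    docs: Iterable[list[int]],
    max_seq_len: int,
    eos_id: int,
    max_open: int = 4,
) -> Iterator[dict]:
    """Pack token lists into fixed-length rows, first-fit over a bounded set of open rows."""
    open_rows: list[list[list[int]]] = []  # each open row is a list of documents

    def used(row):
        return sum(len(d) for d in row)

    def emit(row):
        seq_lens = [len(d) for d in row]  # lengths include the trailing EOS
        ids = [t for d in row for t in d]
        total = len(ids)
        ids += [eos_id] * (max_seq_len - total)  # assumption: pad with eos_id
        return {"input_ids": ids, "seq_lens": seq_lens, "total_tokens": total}

    for tokens in docs:
        # Assumption: truncate long documents so the EOS token always fits.
        doc = list(tokens)[: max_seq_len - 1] + [eos_id]
        for row in open_rows:
            if used(row) + len(doc) <= max_seq_len:
                row.append(doc)  # first row with room wins
                break
        else:
            open_rows.append([doc])
            if len(open_rows) > max_open:
                yield emit(open_rows.pop(0))  # flush the oldest row to bound memory
    for row in open_rows:
        yield emit(row)
```

Because rows are scanned in a fixed order and no sorting is involved, the same input order always yields the same packed rows.
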
---
## Running Tests
```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests
pytest tests/ -v
# Run specific test
pytest tests/test_pack.py -v
```
---
## Determinism Guarantees
GrandLine ensures:
- **Stable sort order**: shards processed in lexicographic order
- **Stable hashes**: BLAKE3 with fixed encoding (UTF-8)
- **Stable pipeline fingerprints**: hash of (config + block signatures + versions)
- **Stable shard outputs**: same shard file → same output file, always
- **No randomness**: no random seeds, no non-deterministic parallelism
- **No hidden state**: dedup store is the only persistent state, fully deterministic
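
A minimal sketch of stable hashing combined with exact deduplication follows. It uses `hashlib.blake2b` from the standard library as a stand-in for BLAKE3, and an in-memory set as a stand-in for the DuckDB-backed store; `content_hash` and `exact_dedup` are hypothetical names.

```python
import hashlib
from typing import Iterable, Iterator


def content_hash(text: str) -> str:
    """Hash text under a fixed UTF-8 encoding so the digest is stable across runs."""
    # Stand-in: hashlib.blake2b from the standard library, not BLAKE3.
    return hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()


def exact_dedup(texts: Iterable[str], seen: set[str]) -> Iterator[str]:
    """Yield each text whose hash is new; `seen` stands in for the dedup store."""
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            yield text


seen: set[str] = set()
first_shard = list(exact_dedup(["alpha", "beta", "alpha"], seen))
second_shard = list(exact_dedup(["beta", "gamma"], seen))
```

Since the store carries over from one shard to the next, which copy of a duplicate survives depends on the order in which shards are processed, which is why a stable shard order matters.
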
Pipeline fingerprint includes:
- Dataset config values
- Block names and versions
- Block parameters
- Tokenizer identity
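
One way to compute such a fingerprint is to serialize the listed inputs to a canonical form and hash the result. The sketch below assumes a JSON-serializable config and uses `hashlib.blake2b` as a standard-library stand-in for BLAKE3; `pipeline_fingerprint` and the payload layout are hypothetical.

```python
import hashlib
import json


def pipeline_fingerprint(config: dict, blocks: list[dict], tokenizer_id: str) -> str:
    """Hash a canonical JSON encoding of everything that affects the output."""
    payload = {
        "config": config,
        "blocks": blocks,  # each entry: name, version, params
        "tokenizer": tokenizer_id,
    }
    # sort_keys and fixed separators make the encoding independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    # Stand-in: hashlib.blake2b from the standard library, not BLAKE3.
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=16).hexdigest()
```

Any change to a block version or parameter changes the fingerprint, so outputs produced by an older pipeline are not mistaken for current ones.
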
---
## Supported Datasets
| Dataset | Pipeline | Trust Level | Notes |
|---------|----------|-------------|-------|
| FineWeb-Edu | curated_web | 0 | Educational quality scored |
| FineWeb2 | curated_web | 1 | Multilingual web |
| DCLM | curated_web | 0 | Model-based quality selection |
| The Stack v2 | code | 1 | License-filtered source code |
| FineMath | math | 1 | Mathematical web content |
| PeS2o | papers | 0 | Open access academic papers |
| Cosmopedia | synthetic | 2 | LLM-generated textbooks |
---
## License
Apache-2.0
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = 'dignity045/grandline'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.