| --- |
| license: apache-2.0 |
| tags: |
| - dataset-preprocessing |
| - llm-pretraining |
| - tokenization |
| - deduplication |
| - data-pipeline |
| - ml-intern |
| --- |
| |
| # GrandLine |
|
|
| **Deterministic shard-first dataset preprocessing for LLM pretraining.** |
|
|
| GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute. |
|
|
| --- |
|
|
| ## Key Properties |
|
|
| | Property | Guarantee | |
| |----------|-----------| |
| | **Deterministic** | Same input + same config + same code version = same output | |
| | **Shard-first** | Every shard processed independently, no global state | |
| | **Resumable** | Skip already-processed shards on restart | |
| | **Scalable** | 4 CPUs → 224 CPUs without architectural changes | |
| | **Dataset-aware** | Pipeline compiled per dataset trust level | |
| | **Fast** | BLAKE3 hashing, batch tokenization, zstd parquet output | |
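
The shard-first and resumable properties can be illustrated with a small sketch. It assumes a hypothetical marker-file layout (one `<shard name>.done` file per finished shard under a progress directory); `run_shards` and the marker naming are illustrative and are not taken from GrandLine's source.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_shards(
    shards: Iterable[str],
    progress_dir: Path,
    process: Callable[[str], None],
) -> list[str]:
    """Process shards in lexicographic order, skipping finished ones.

    Returns the shards that were processed during this call.
    """
    progress_dir.mkdir(parents=True, exist_ok=True)
    processed_now = []
    for shard in sorted(shards):  # stable, deterministic order
        # Hypothetical marker name: "<shard file name>.done"
        marker = progress_dir / f"{Path(shard).name}.done"
        if marker.exists():
            continue  # finished in an earlier run
        process(shard)
        marker.touch()  # written only after the shard succeeded
        processed_now.append(shard)
    return processed_now


if __name__ == "__main__":
    # Demo: the second call only handles the shard added since the first.
    with tempfile.TemporaryDirectory() as tmp:
        handled: list[str] = []
        run_shards(["b.parquet", "a.parquet"], Path(tmp), handled.append)
        run_shards(["a.parquet", "b.parquet", "c.parquet"], Path(tmp), handled.append)
        print(handled)
```

Because each shard only consults its own marker, a restart needs no coordination between workers.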
|
|
| --- |
|
|
| ## Installation |
|
|
| ```bash |
| # Development install |
| pip install -e ".[dev]" |
| |
| # Or install dependencies directly |
| pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm |
| ``` |
|
|
| **Requirements:** Python ≥ 3.11 |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ### 1. Inspect a dataset |
|
|
| ```bash |
| python scripts/inspect_dataset.py --source /path/to/parquet/dir/ |
| python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train |
| ``` |
|
|
| ### 2. Process a dataset |
|
|
| ```bash |
| python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml |
| ``` |
|
|
| ### 3. With executor tuning |
|
|
| ```bash |
| # On Kaggle (conservative memory) |
| python scripts/run_dataset.py \ |
| --config configs/datasets/fineweb_edu.yaml \ |
| --executor configs/executors/kaggle.yaml |
| |
| # On a large CPU machine |
| python scripts/run_dataset.py \ |
| --config configs/datasets/dclm.yaml \ |
| --executor configs/executors/local_224cpu.yaml |
| ``` |
|
|
| ### 4. With CLI overrides |
|
|
| ```bash |
| python scripts/run_dataset.py \ |
| --config configs/datasets/fineweb_edu.yaml \ |
| packing.max_seq_len=4096 \ |
| tokenizer.name=Qwen/Qwen3-4B |
| ``` |
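
As a rough illustration of what dotted `key=value` overrides imply, the sketch below merges them into a nested config dictionary. The function names and the value-parsing rule (Python literals, falling back to the raw string) are assumptions made for this example, not GrandLine's actual parser.

```python
import ast
import copy
from typing import Any


def parse_value(raw: str) -> Any:
    """Interpret numbers and other Python literals; otherwise keep the string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted key=value overrides applied."""
    result = copy.deepcopy(config)  # never mutate the loaded config
    for item in overrides:
        dotted_key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})  # create missing sections
        node[leaf] = parse_value(raw)
    return result


base = {"packing": {"max_seq_len": 2048}, "tokenizer": {"name": "Qwen/Qwen3-0.6B"}}
merged = apply_overrides(
    base, ["packing.max_seq_len=4096", "tokenizer.name=Qwen/Qwen3-4B"]
)
```

Since overrides change config values, they would also change the pipeline fingerprint described under Determinism Guarantees.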
|
|
| ### 5. Merge manifests |
|
|
| ```bash |
| python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu |
| ``` |
|
|
| --- |
|
|
| ## Architecture |
|
|
| ### Pipeline Levels |
|
|
| GrandLine compiles dataset-specific pipelines based on trust level: |
|
|
| | Level | Target | Blocks | |
| |-------|--------|--------| |
| | **0** | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack | |
| | **1** | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack | |
| | **2** | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack | |
| | **3** | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack | |
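
The table above can be read as a lookup from trust level to an ordered block list. The following is a minimal sketch of that idea; the block identifiers and the `compile_pipeline` name are hypothetical spellings of the table entries, not the names used in `registry.py`.

```python
# Ordered block names per trust level, transcribed from the table above.
# The snake_case identifiers are assumptions made for this sketch.
BLOCKS_BY_TRUST_LEVEL: dict[int, tuple[str, ...]] = {
    0: ("normalize", "exact_dedup", "tokenize", "pack"),
    1: ("normalize", "length_filter", "score_filter",
        "exact_dedup", "tokenize", "pack"),
    2: ("normalize", "length_filter", "alpha_ratio",
        "exact_dedup", "tokenize", "pack"),
    3: ("normalize", "lang_id", "heuristics",
        "exact_dedup", "near_dup", "tokenize", "pack"),
}


def compile_pipeline(trust_level: int) -> tuple[str, ...]:
    """Return the ordered block names for a dataset's trust level."""
    try:
        return BLOCKS_BY_TRUST_LEVEL[trust_level]
    except KeyError:
        raise ValueError(f"unknown trust level: {trust_level}") from None


for level in sorted(BLOCKS_BY_TRUST_LEVEL):
    print(level, " -> ".join(compile_pipeline(level)))
```

Lower trust levels run fewer blocks, which is where the compute savings for curated datasets come from.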
|
|
| ### Core Principle |
|
|
| > Do not waste compute redoing work already done upstream. |
|
|
| If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them. |
|
|
| --- |
|
|
| ## Project Structure |
|
|
| ``` |
| grandline/ |
| ├── pyproject.toml                  # Package definition |
| ├── README.md                       # This file |
| ├── configs/ |
| │   ├── global.yaml                 # Default settings |
| │   ├── datasets/                   # Per-dataset pipeline configs |
| │   │   ├── fineweb_edu.yaml |
| │   │   ├── fineweb2.yaml |
| │   │   ├── dclm.yaml |
| │   │   ├── cosmopedia.yaml |
| │   │   ├── the_stack_v2.yaml |
| │   │   ├── finemath.yaml |
| │   │   └── pes2o.yaml |
| │   ├── executors/                  # Machine-specific tuning |
| │   │   ├── kaggle.yaml |
| │   │   ├── colab.yaml |
| │   │   └── local_224cpu.yaml |
| │   └── tokenizers/ |
| │       └── qwen3.yaml |
| ├── scripts/ |
| │   ├── run_dataset.py              # Main processing entry point |
| │   ├── inspect_dataset.py          # Dataset inspection utility |
| │   └── merge_manifests.py          # Merge per-shard manifests |
| ├── state/                          # Runtime state (gitignored in practice) |
| │   ├── progress/                   # Shard completion markers |
| │   ├── manifests/                  # Per-shard processing manifests |
| │   ├── cache/                      # Pipeline output cache |
| │   └── dedup.duckdb                # Persistent dedup database |
| ├── tests/ |
| │   ├── test_determinism.py |
| │   ├── test_dedup.py |
| │   ├── test_filters.py |
| │   ├── test_tokenize.py |
| │   └── test_pack.py |
| └── src/grandline/ |
|     ├── __init__.py |
|     ├── types.py                    # Document, TokenizedDocument, PackedSequence |
|     ├── pipeline.py                 # Pipeline compilation and execution |
|     ├── registry.py                 # Block name → class registry |
|     ├── runtime.py                  # Shard-level execution engine |
|     ├── manifest.py                 # Shard/dataset manifest I/O |
|     ├── cache.py                    # Pipeline output caching |
|     ├── io.py                       # Shard reading (parquet, JSONL, HF) |
|     ├── hashing.py                  # BLAKE3 hashing utilities |
|     ├── dedup_store.py              # DuckDB-backed dedup state |
|     ├── writer.py                   # Parquet output writer |
|     ├── blocks/ |
|     │   ├── base.py                 # Block, FilterBlock, TransformBlock |
|     │   ├── normalize.py            # Unicode/whitespace normalization |
|     │   ├── filters.py              # Length, score, language, alpha filters |
|     │   ├── dedup.py                # Exact deduplication block |
|     │   ├── tokenize.py             # Batch tokenization block |
|     │   └── pack.py                 # Deterministic greedy packing |
|     ├── pipelines/ |
|     │   ├── curated_web.py          # FineWeb-Edu, DCLM, FineWeb2 |
|     │   ├── code.py                 # The Stack v2 |
|     │   ├── math.py                 # FineMath, OpenWebMath |
|     │   ├── papers.py               # PeS2o, arXiv |
|     │   └── synthetic.py            # Cosmopedia |
|     └── util/ |
|         ├── logging.py              # Structured logging |
|         ├── paths.py                # Project path management |
|         └── config.py               # YAML config loading + validation |
| ``` |
|
|
| --- |
|
|
| ## Dataset Configuration |
|
|
| Each dataset gets a YAML config specifying its pipeline: |
|
|
| ```yaml |
| name: fineweb_edu |
| pipeline_type: curated_web |
| trust_level: 0 |
| |
| source: |
| repo: "HuggingFaceFW/fineweb-edu-score-2" |
| paths: ["/data/fineweb-edu/"] |
| text_column: "text" |
| |
| metadata_columns: |
| - "score" |
| - "url" |
| |
| score_filter: |
| field: "score" |
| threshold: 3.0 |
| |
| tokenizer: |
| name: "Qwen/Qwen3-0.6B" |
| batch_size: 512 |
| |
| packing: |
| max_seq_len: 2048 |
| eos_id: 151645 |
| ``` |
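
To make the `score_filter` section concrete, here is a hedged sketch of how such a filter could act on documents. It assumes documents are plain dictionaries, that the comparison is inclusive (`>=`), and that documents without the field are dropped; none of these details are confirmed by the config itself.

```python
from typing import Iterable, Iterator


def score_filter(
    docs: Iterable[dict], field: str, threshold: float
) -> Iterator[dict]:
    """Yield documents whose `field` value meets the threshold.

    Assumptions for this sketch: inclusive comparison, and documents
    missing the field are dropped rather than kept.
    """
    for doc in docs:
        value = doc.get(field)
        if value is not None and float(value) >= threshold:
            yield doc


docs = [
    {"text": "kept", "score": 3.4},
    {"text": "dropped, below threshold", "score": 2.1},
    {"text": "dropped, no score"},
    {"text": "kept at the boundary", "score": 3.0},
]
kept = list(score_filter(docs, field="score", threshold=3.0))
```

The filter reuses the upstream `score` column instead of rescoring documents, in line with the core principle above.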
|
|
| --- |
|
|
| ## Output Format |
|
|
| GrandLine produces compressed parquet files with this schema: |
|
|
| | Column | Type | Description | |
| |--------|------|-------------| |
| | `input_ids` | `list<int32>` | Packed token IDs (padded to max_seq_len) | |
| | `seq_lens` | `list<int32>` | Per-document lengths within the packed sequence | |
| | `total_tokens` | `int32` | Non-padding token count | |
| | `shard_id` | `string` | Provenance: which input shard produced this | |
|
|
| The `seq_lens` field enables document-level attention masking with FlashAttention `varlen` APIs, preventing cross-document attention contamination. |
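
Variable-length attention kernels typically take cumulative sequence boundaries rather than per-document lengths. The sketch below derives those boundaries, plus a per-token document index, from one row's `seq_lens`. The helper names and the `-1` padding marker are assumptions for illustration, and the conversion to `int32` tensors is left to the training code.

```python
from itertools import accumulate


def cu_seqlens(seq_lens: list[int]) -> list[int]:
    """Cumulative document boundaries: [0, l1, l1 + l2, ...]."""
    return [0, *accumulate(seq_lens)]


def document_ids(seq_lens: list[int], max_seq_len: int, pad_id: int = -1) -> list[int]:
    """Per-token document index, with `pad_id` marking padding positions."""
    ids: list[int] = []
    for doc_index, length in enumerate(seq_lens):
        ids.extend([doc_index] * length)
    ids.extend([pad_id] * (max_seq_len - len(ids)))
    return ids


row_seq_lens = [3, 2, 4]  # three documents packed into one row
boundaries = cu_seqlens(row_seq_lens)
token_docs = document_ids(row_seq_lens, max_seq_len=12)
```

Two tokens may attend to each other only if they share a document index, which is what keeps attention from crossing document boundaries.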
|
|
| --- |
|
|
| ## Performance Choices |
|
|
| | Choice | Why | |
| |--------|-----| |
| | **BLAKE3** | 3-4x faster than SHA-256, deterministic, parallelizable | |
| | **DuckDB** | Persistent dedup state, fast hash lookups, no external service | |
| | **Parquet + zstd** | Columnar, compressed, random-access, ecosystem-compatible | |
| | **Rust-backed tokenizers** | Batch tokenization releases GIL, ~10x faster than Python | |
| | **Streaming first-fit packing** | Memory-bounded, deterministic, no sorting needed | |
| | **Single-pass processing** | Read shard once, apply all transforms, write once | |
|
|
| --- |
|
|
| ## Running Tests |
|
|
| ```bash |
| # Install dev dependencies |
| pip install -e ".[dev]" |
| |
| # Run all tests |
| pytest tests/ -v |
| |
| # Run specific test |
| pytest tests/test_pack.py -v |
| ``` |
|
|
| --- |
|
|
| ## Determinism Guarantees |
|
|
| GrandLine ensures: |
|
|
| - **Stable sort order**: shards processed in lexicographic order |
| - **Stable hashes**: BLAKE3 with fixed encoding (UTF-8) |
| - **Stable pipeline fingerprints**: hash of (config + block signatures + versions) |
| - **Stable shard outputs**: same shard file → same output file, always |
| - **No randomness**: no random seeds, no non-deterministic parallelism |
| - **No hidden state**: dedup store is the only persistent state, fully deterministic |
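
The interaction between stable hashes and the dedup store can be illustrated with a small sketch. It swaps in two stand-ins so that it runs with the standard library alone: `hashlib.blake2b` replaces BLAKE3, and an in-memory `set` replaces the DuckDB store. The NFC-plus-strip normalization is also an assumption.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def content_hash(text: str) -> str:
    """Hash normalized text with a fixed UTF-8 encoding.

    BLAKE2b is a stand-in for BLAKE3, which needs a third-party package.
    """
    normalized = unicodedata.normalize("NFC", text).strip()
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=16).hexdigest()


def exact_dedup(texts: Iterable[str], seen: set[str]) -> Iterator[str]:
    """Yield texts whose hash is not yet in `seen`, recording new hashes."""
    for text in texts:
        digest = content_hash(text)
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        yield text


seen_hashes: set[str] = set()  # stand-in for the persistent dedup store
first_shard = list(exact_dedup(["alpha", "beta", "alpha "], seen_hashes))
second_shard = list(exact_dedup(["beta", "gamma"], seen_hashes))
```

In this sketch the first occurrence of a document wins, so the result depends on processing order; a fixed shard order is what makes the outcome repeatable.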
|
|
| Pipeline fingerprint includes: |
| - Dataset config values |
| - Block names and versions |
| - Block parameters |
| - Tokenizer identity |
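
One way to derive a fingerprint from the inputs listed above is to hash a canonical JSON serialization. The sketch below does that; the payload layout and the function name are assumptions, and `hashlib.blake2b` stands in for BLAKE3 so that the example needs only the standard library.

```python
import hashlib
import json


def pipeline_fingerprint(config: dict, blocks: list[dict], tokenizer_id: str) -> str:
    """Return a hex digest that changes whenever any input changes.

    `blocks` is an ordered list of {"name", "version", "params"} entries.
    Keys are sorted so that dictionary ordering never affects the digest.
    """
    payload = {"config": config, "blocks": blocks, "tokenizer": tokenizer_id}
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    # BLAKE2b here is a stand-in for BLAKE3 (third-party package).
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=32).hexdigest()


blocks = [
    {"name": "normalize", "version": "1", "params": {}},
    {"name": "pack", "version": "1", "params": {"max_seq_len": 2048}},
]
fp_a = pipeline_fingerprint({"name": "fineweb_edu", "trust_level": 0}, blocks, "Qwen/Qwen3-0.6B")
fp_b = pipeline_fingerprint({"trust_level": 0, "name": "fineweb_edu"}, blocks, "Qwen/Qwen3-0.6B")
fp_c = pipeline_fingerprint({"name": "fineweb_edu", "trust_level": 0}, blocks, "Qwen/Qwen3-4B")
```

A fingerprint like this can key cached outputs and progress markers, so a config or tokenizer change invalidates stale shards instead of silently reusing them.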
|
|
| --- |
|
|
| ## Supported Datasets |
|
|
| | Dataset | Pipeline | Trust Level | Notes | |
| |---------|----------|-------------|-------| |
| | FineWeb-Edu | curated_web | 0 | Educational quality scored | |
| | FineWeb2 | curated_web | 1 | Multilingual web | |
| | DCLM | curated_web | 0 | Model-based quality selection | |
| | The Stack v2 | code | 1 | License-filtered source code | |
| | FineMath | math | 1 | Mathematical web content | |
| | PeS2o | papers | 0 | Open access academic papers | |
| | Cosmopedia | synthetic | 2 | LLM-generated textbooks | |
| |
| --- |
| |
| ## License |
| |
| Apache-2.0 |
| |
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
| |
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
| |
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
| |
| ## Usage |
| |
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_id = 'dignity045/grandline' |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained(model_id) |
| ``` |
| |
| For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class. |
| |