---
license: apache-2.0
tags:
  - dataset-preprocessing
  - llm-pretraining
  - tokenization
  - deduplication
  - data-pipeline
  - ml-intern
---

# GrandLine

**Deterministic shard-first dataset preprocessing for LLM pretraining.**

GrandLine turns heterogeneous, partially curated datasets into reproducible, tokenized, packed training artifacts with minimal wasted compute.

---

## Key Properties

| Property | Guarantee |
|----------|-----------|
| **Deterministic** | Same input + same config + same code version = same output |
| **Shard-first** | Every shard processed independently, no global state |
| **Resumable** | Skip already-processed shards on restart |
| **Scalable** | 4 CPUs → 224 CPUs without architectural changes |
| **Dataset-aware** | Pipeline compiled per dataset trust level |
| **Fast** | BLAKE3 hashing, batch tokenization, zstd parquet output |

---

## Installation

```bash
# Development install
pip install -e ".[dev]"

# Or install dependencies directly
pip install blake3 duckdb pyarrow transformers tokenizers pyyaml click datasets tqdm
```

**Requirements:** Python ≥ 3.11

---

## Quick Start

### 1. Inspect a dataset

```bash
python scripts/inspect_dataset.py --source /path/to/parquet/dir/
python scripts/inspect_dataset.py --hf-repo "HuggingFaceFW/fineweb-edu-score-2" --split train
```

### 2. Process a dataset

```bash
python scripts/run_dataset.py --config configs/datasets/fineweb_edu.yaml
```

### 3. With executor tuning

```bash
# On Kaggle (conservative memory)
python scripts/run_dataset.py \
    --config configs/datasets/fineweb_edu.yaml \
    --executor configs/executors/kaggle.yaml

# On a large CPU machine
python scripts/run_dataset.py \
    --config configs/datasets/dclm.yaml \
    --executor configs/executors/local_224cpu.yaml
```

### 4. With CLI overrides

```bash
python scripts/run_dataset.py \
    --config configs/datasets/fineweb_edu.yaml \
    packing.max_seq_len=4096 \
    tokenizer.name=Qwen/Qwen3-4B
```

### 5. Merge manifests

```bash
python scripts/merge_manifests.py --state-dir state/ --dataset fineweb_edu
```

---

## Architecture

### Pipeline Levels

GrandLine compiles dataset-specific pipelines based on trust level:

| Level | Target | Blocks |
|-------|--------|--------|
| **0** | Highly curated (FineWeb-Edu, DCLM) | normalize → exact dedup → tokenize → pack |
| **1** | Curated with scores (FineWeb2, FineMath) | normalize → length filter → score filter → exact dedup → tokenize → pack |
| **2** | Synthetic/mixed (Cosmopedia) | normalize → length filter → alpha ratio → exact dedup → tokenize → pack |
| **3** | Raw web (uncurated) | normalize → lang ID → heuristics → exact dedup → near-dup → tokenize → pack |

### Core Principle

> Do not waste compute redoing work already done upstream.

If a dataset already has language labels, quality scores, deduplication, or curation metadata, GrandLine reuses those signals instead of recomputing them.

---

## Project Structure

```
grandline/
├── pyproject.toml              # Package definition
├── README.md                   # This file
├── configs/
│   ├── global.yaml             # Default settings
│   ├── datasets/               # Per-dataset pipeline configs
│   │   ├── fineweb_edu.yaml
│   │   ├── fineweb2.yaml
│   │   ├── dclm.yaml
│   │   ├── cosmopedia.yaml
│   │   ├── the_stack_v2.yaml
│   │   ├── finemath.yaml
│   │   └── pes2o.yaml
│   ├── executors/              # Machine-specific tuning
│   │   ├── kaggle.yaml
│   │   ├── colab.yaml
│   │   └── local_224cpu.yaml
│   └── tokenizers/
│       └── qwen3.yaml
├── scripts/
│   ├── run_dataset.py          # Main processing entry point
│   ├── inspect_dataset.py      # Dataset inspection utility
│   └── merge_manifests.py      # Merge per-shard manifests
├── state/                      # Runtime state (gitignored in practice)
│   ├── progress/               # Shard completion markers
│   ├── manifests/              # Per-shard processing manifests
│   ├── cache/                  # Pipeline output cache
│   └── dedup.duckdb            # Persistent dedup database
├── tests/
│   ├── test_determinism.py
│   ├── test_dedup.py
│   ├── test_filters.py
│   ├── test_tokenize.py
│   └── test_pack.py
└── src/grandline/
    ├── __init__.py
    ├── types.py                # Document, TokenizedDocument, PackedSequence
    ├── pipeline.py             # Pipeline compilation and execution
    ├── registry.py             # Block name → class registry
    ├── runtime.py              # Shard-level execution engine
    ├── manifest.py             # Shard/dataset manifest I/O
    ├── cache.py                # Pipeline output caching
    ├── io.py                   # Shard reading (parquet, JSONL, HF)
    ├── hashing.py              # BLAKE3 hashing utilities
    ├── dedup_store.py          # DuckDB-backed dedup state
    ├── writer.py               # Parquet output writer
    ├── blocks/
    │   ├── base.py             # Block, FilterBlock, TransformBlock
    │   ├── normalize.py        # Unicode/whitespace normalization
    │   ├── filters.py          # Length, score, language, alpha filters
    │   ├── dedup.py            # Exact deduplication block
    │   ├── tokenize.py         # Batch tokenization block
    │   └── pack.py             # Deterministic greedy packing
    ├── pipelines/
    │   ├── curated_web.py      # FineWeb-Edu, DCLM, FineWeb2
    │   ├── code.py             # The Stack v2
    │   ├── math.py             # FineMath, OpenWebMath
    │   ├── papers.py           # PeS2o, arXiv
    │   └── synthetic.py        # Cosmopedia
    └── util/
        ├── logging.py          # Structured logging
        ├── paths.py            # Project path management
        └── config.py           # YAML config loading + validation
```

---

## Dataset Configuration

Each dataset gets a YAML config specifying its pipeline:

```yaml
name: fineweb_edu
pipeline_type: curated_web
trust_level: 0

source:
  repo: "HuggingFaceFW/fineweb-edu-score-2"
  paths: ["/data/fineweb-edu/"]
  text_column: "text"
  metadata_columns:
    - "score"
    - "url"

score_filter:
  field: "score"
  threshold: 3.0

tokenizer:
  name: "Qwen/Qwen3-0.6B"
  batch_size: 512

packing:
  max_seq_len: 2048
  eos_id: 151645
```

---

## Output Format

GrandLine produces compressed parquet files with this schema:

| Column | Type | Description |
|--------|------|-------------|
| `input_ids` | `list` | Packed token IDs (padded to max_seq_len) |
| `seq_lens` | `list` | Per-document lengths within the packed sequence |
| `total_tokens` | `int32` | Non-padding token count |
| `shard_id` | `string` | Provenance: which input shard produced this |

The `seq_lens` field enables document-level attention masking with FlashAttention `varlen` APIs, preventing cross-document attention contamination.

---

## Performance Choices

| Choice | Why |
|--------|-----|
| **BLAKE3** | 3-4x faster than SHA-256, deterministic, parallelizable |
| **DuckDB** | Persistent dedup state, fast hash lookups, no external service |
| **Parquet + zstd** | Columnar, compressed, random-access, ecosystem-compatible |
| **Rust-backed tokenizers** | Batch tokenization releases GIL, ~10x faster than Python |
| **Streaming first-fit packing** | Memory-bounded, deterministic, no sorting needed |
| **Single-pass processing** | Read shard once, apply all transforms, write once |

---

## Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test
pytest tests/test_pack.py -v
```

---

## Determinism Guarantees

GrandLine ensures:

- **Stable sort order**: shards processed in lexicographic order
- **Stable hashes**: BLAKE3 with fixed encoding (UTF-8)
- **Stable pipeline fingerprints**: hash of (config + block signatures + versions)
- **Stable shard outputs**: same shard file → same output file, always
- **No randomness**: no random seeds, no non-deterministic parallelism
- **No hidden state**: dedup store is the only persistent state, fully deterministic

Pipeline fingerprint includes:

- Dataset config values
- Block names and versions
- Block parameters
- Tokenizer identity

---

## Supported Datasets

| Dataset | Pipeline | Trust Level | Notes |
|---------|----------|-------------|-------|
| FineWeb-Edu | curated_web | 0 | Educational quality scored |
| FineWeb2 | curated_web | 1 | Multilingual web |
| DCLM | curated_web | 0 | Model-based quality selection |
| The Stack v2 | code | 1 | License-filtered source code |
| FineMath | math | 1 | Mathematical web content |
| PeS2o | papers | 0 | Open access academic papers |
| Cosmopedia | synthetic | 2 | LLM-generated textbooks |

---

## License

Apache-2.0

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'dignity045/grandline'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
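
## Example: Packing and `seq_lens`

The Output Format section describes rows with padded `input_ids`, per-document `seq_lens`, and a `total_tokens` count. The sketch below shows one way such rows could be produced. It is a simplified greedy packer that keeps a single open sequence, not GrandLine's actual `pack.py`. The names `PackedRow`, `pack_greedy`, and `cu_seqlens`, the `pad_id` parameter, and the choice to truncate over-long documents are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import Iterable, Iterator, List


@dataclass
class PackedRow:
    input_ids: List[int]  # token IDs, padded to max_seq_len
    seq_lens: List[int]   # per-document lengths inside this row
    total_tokens: int     # number of non-padding tokens


def _make_row(buf: List[int], lens: List[int], max_seq_len: int, pad_id: int) -> PackedRow:
    padded = buf + [pad_id] * (max_seq_len - len(buf))
    return PackedRow(input_ids=padded, seq_lens=list(lens), total_tokens=len(buf))


def pack_greedy(
    docs: Iterable[List[int]],
    max_seq_len: int,
    eos_id: int,
    pad_id: int = 0,  # hypothetical padding ID
) -> Iterator[PackedRow]:
    """Pack documents in input order; output depends only on input order."""
    buf: List[int] = []
    lens: List[int] = []
    for tokens in docs:
        # Assumption: append EOS, then truncate documents that exceed one row.
        piece = (list(tokens) + [eos_id])[:max_seq_len]
        if buf and len(buf) + len(piece) > max_seq_len:
            yield _make_row(buf, lens, max_seq_len, pad_id)
            buf, lens = [], []
        buf.extend(piece)
        lens.append(len(piece))
    if buf:
        yield _make_row(buf, lens, max_seq_len, pad_id)


def cu_seqlens(seq_lens: List[int]) -> List[int]:
    """Cumulative document offsets, the form varlen attention kernels typically take."""
    return [0] + list(accumulate(seq_lens))
```

Because the packer never reorders or samples documents, the same token stream always yields the same rows. The cumulative offsets mark where each document starts and ends, which is the information needed to keep attention from crossing document boundaries.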
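
## Example: Pipeline Fingerprint

The Determinism Guarantees section lists what goes into a pipeline fingerprint: dataset config values, block names and versions, block parameters, and tokenizer identity. A minimal sketch of that idea follows. It assumes a canonical JSON serialization and uses `hashlib.blake2b` from the standard library as a stand-in so the example runs without extra dependencies; GrandLine itself states it uses BLAKE3. The function name `pipeline_fingerprint` and the `(name, version, params)` tuple layout are hypothetical.

```python
import hashlib
import json


def pipeline_fingerprint(
    config: dict,
    blocks: list[tuple[str, str, dict]],  # hypothetical (name, version, params) layout
    tokenizer: str,
) -> str:
    """Hash a canonical description of the pipeline into a short hex string."""
    payload = {
        "config": config,
        "blocks": [
            {"name": name, "version": version, "params": params}
            for name, version, params in blocks
        ],
        "tokenizer": tokenizer,
    }
    # sort_keys and fixed separators make the serialization independent of
    # dict insertion order; UTF-8 keeps the byte encoding fixed.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    # Stand-in hash: BLAKE2b from the standard library, not BLAKE3.
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=16).hexdigest()
```

Block order is kept as given, since a pipeline that filters before deduplicating is not the same pipeline as one that deduplicates first. Any change to a parameter, block version, or tokenizer name produces a different fingerprint, which is what lets cached outputs be invalidated safely.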
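
## Example: Resumable Shard Selection

The Key Properties and Determinism Guarantees sections say that shards are processed in lexicographic order and that already-processed shards are skipped on restart, with completion markers kept under `state/progress/`. The sketch below illustrates one way this could work. The `.done` marker naming, the `.parquet` suffix filter, and the helper names `pending_shards` and `mark_done` are assumptions, not GrandLine's actual layout.

```python
from pathlib import Path


def _marker(shard: Path, progress_dir: Path) -> Path:
    # Hypothetical convention: one empty "<shard name>.done" file per finished shard.
    return progress_dir / f"{shard.name}.done"


def pending_shards(shard_dir: Path, progress_dir: Path) -> list[Path]:
    """Return unprocessed shards, sorted by file name for a stable order."""
    shards = sorted(
        (p for p in shard_dir.iterdir() if p.suffix == ".parquet"),
        key=lambda p: p.name,
    )
    return [s for s in shards if not _marker(s, progress_dir).exists()]


def mark_done(shard: Path, progress_dir: Path) -> None:
    """Record completion only after the shard's output has been written."""
    progress_dir.mkdir(parents=True, exist_ok=True)
    _marker(shard, progress_dir).touch()
```

A driver loop would call `pending_shards` once at startup, process each shard, and call `mark_done` after its output is safely on disk, so an interrupted run picks up at the first shard without a marker.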