
Component 3: Dataset Pipeline

What This Component Does (Simple English)

  • Downloads the 3 datasets directly from Hugging Face (no manual downloads required).
  • Reads them in streaming mode so RAM usage stays low (see the streaming sketch after this list).
  • Cleans prompt/code text.
  • Removes low-quality and likely auto-generated data.
  • Removes duplicate prompt+code pairs using a disk-backed SQLite index (a dedup sketch follows the streaming one).
  • Detects the language (Python or JavaScript) when it is unclear.
  • Tokenizes all cleaned records using the Component 2 tokenizer.
  • Saves training-ready tokenized JSONL output.
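
The streaming read works roughly as below. This is a minimal sketch assuming the Hugging Face datasets library; the dataset ID is a placeholder, not one of the real three (those are configured in the YAML file).

from datasets import load_dataset

# Placeholder dataset ID; the real IDs come from the pipeline config.
stream = load_dataset("org/example-code-dataset", split="train", streaming=True)

# Records arrive lazily, one at a time, so memory stays flat even for
# corpora far larger than RAM.
for record in stream:
    print(record.keys())  # inspect the first record's fields
    break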

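The duplicate-removal step is easy to picture with SQLite's primary-key constraint. This is a hedged sketch only: the database path, table name, and hashing scheme are assumptions, and the real logic lives in src/dataset_pipeline/hf_dataset_pipeline.py.

import hashlib
import sqlite3

# A disk-backed "seen" index; file and table names are assumptions.
con = sqlite3.connect("dedup_index.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS seen (digest TEXT PRIMARY KEY)")

def is_new(prompt: str, code: str) -> bool:
    # Hash the prompt+code pair so the on-disk index stays small.
    digest = hashlib.sha256((prompt + "\x00" + code).encode("utf-8")).hexdigest()
    try:
        con.execute("INSERT INTO seen (digest) VALUES (?)", (digest,))
        con.commit()
        return True   # first time this pair has been seen
    except sqlite3.IntegrityError:
        return False  # duplicate: already in the index
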
Files Created By This Component

  • configs/component3_dataset_pipeline.yaml (passed to the run script via --config; see the sketch after this list)
  • src/dataset_pipeline/hf_dataset_pipeline.py
  • scripts/run_component3_dataset_pipeline.py
  • scripts/verify_component3_dataset_pipeline.py
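
The run script consumes the YAML file through its --config flag. Below is a sketch of how such a config is typically loaded, assuming PyYAML; the "datasets" key is hypothetical, so check the real file for the actual schema.

import yaml

# Parse the pipeline settings (requires PyYAML).
with open("configs/component3_dataset_pipeline.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# "datasets" is a hypothetical key used for illustration only.
print(config.get("datasets"))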

Required Before Running

  • The Component 2 tokenizer must exist at (a quick load check is sketched after this list):
    • artifacts/tokenizer/code_tokenizer_v1/tokenizer.json
    • artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json
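
To confirm the tokenizer is in place before running, it can be loaded directly with the tokenizers library (Tokenizer.from_file is its standard API). This check is a suggestion, not part of the pipeline itself.

from tokenizers import Tokenizer

# Load the Component 2 tokenizer from its JSON definition.
tok = Tokenizer.from_file("artifacts/tokenizer/code_tokenizer_v1/tokenizer.json")

# Encode a tiny sample to confirm it works end to end.
enc = tok.encode("def add(a, b):\n    return a + b")
print(enc.ids[:10])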

Quick Verification Run (small test)

Run from project root:

.\.venv\Scripts\Activate.ps1
python .\scripts\verify_component3_dataset_pipeline.py

This uses 200 records per dataset for a smoke test.
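
Capping a streamed dataset at a fixed count is usually a one-liner. The following is a sketch of the idea only; the verify script's actual mechanism may differ, and the dataset ID is again a placeholder.

from itertools import islice
from datasets import load_dataset

stream = load_dataset("org/example-code-dataset", split="train", streaming=True)
for record in islice(stream, 200):  # stop after 200 records, as in the smoke test
    pass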

Full Pipeline Run

.\.venv\Scripts\Activate.ps1
python .\scripts\run_component3_dataset_pipeline.py --config .\configs\component3_dataset_pipeline.yaml

Output Files

  • Clean merged dataset:
    • data/interim/combined_clean.jsonl
  • Tokenized training dataset:
    • data/processed/train_tokenized.jsonl
  • Stats summary:
    • data/processed/pipeline_stats.json
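
To sanity-check these outputs, read the JSONL files line by line and the stats file as plain JSON. A small inspection sketch follows; the field names inside each record are not assumed here.

import json

# Peek at the first tokenized record's fields.
with open("data/processed/train_tokenized.jsonl", encoding="utf-8") as f:
    first = json.loads(next(f))
print(sorted(first.keys()))

# Print the pipeline's summary statistics.
with open("data/processed/pipeline_stats.json", encoding="utf-8") as f:
    print(json.dumps(json.load(f), indent=2))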