Training skeleton (Sanchari-S)

How-to (quick):

  1. Prepare data:

    • Create data/all_texts.txt as newline-separated documents or a single large text.
    • Ensure tokenizer/ contains your trained SentencePiece tokenizer files, sanchari_spm.model and sanchari_spm.vocab (or use the placeholders until you generate them).
  2. Install deps (Colab / local): pip install transformers datasets sentencepiece accelerate

  3. Run training (local example): python training/train.py --config training/config_s.json --tokenizer_dir tokenizer --data_file data/all_texts.txt --output_dir outputs/sanchari-s

  4. Inspect outputs:

    • Model + tokenizer saved in outputs/sanchari-s
    • Logs are printed to the console
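
The data-preparation step above can be sketched in a few lines of Python. This is a minimal illustration, not part of training/train.py: the helper names write_corpus and missing_tokenizer_files are hypothetical, and the sketch assumes you want one document per line, so it collapses any internal line breaks within a document.

```python
from pathlib import Path


def write_corpus(documents, out_path):
    """Write one document per line and return how many were written.

    Hypothetical helper: internal whitespace (including newlines) is
    collapsed so that each document occupies exactly one line.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("w", encoding="utf-8") as f:
        for doc in documents:
            line = " ".join(doc.split())
            if line:  # skip documents that are empty after cleanup
                f.write(line + "\n")
                written += 1
    return written


def missing_tokenizer_files(
    tokenizer_dir,
    names=("sanchari_spm.model", "sanchari_spm.vocab"),
):
    """Return the expected tokenizer files that are not present yet."""
    tokenizer_dir = Path(tokenizer_dir)
    return [name for name in names if not (tokenizer_dir / name).is_file()]
```

For example, write_corpus(my_docs, "data/all_texts.txt") followed by missing_tokenizer_files("tokenizer") gives a quick pre-flight check before launching training; an empty list means both tokenizer files are in place.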

Notes:

  • This script is intentionally minimal for seed-stage proof of concept.
  • For larger runs use DeepSpeed/Accelerate and multi-GPU clusters.
  • Before releasing any checkpoint to investors, run safety and PII checks.
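
As a rough starting point for the PII check mentioned above, the sketch below scans text for email addresses and phone-like numbers with regular expressions. It is only an assumption about what a first-pass check could look like: the find_pii helper and both patterns are hypothetical, they will miss many kinds of PII (names, addresses, IDs) and can produce false positives, so they do not replace a proper review.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_pii(text):
    """Return (label, matched_text) pairs for email- and phone-like strings."""
    hits = []
    for label, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits
```

Running find_pii over each line of data/all_texts.txt, or over text sampled from a trained checkpoint, gives a list of candidate strings to review by hand.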