Training skeleton (Sanchari-S)
How-to (quick):
Prepare data:
- Create `data/all_texts.txt` as newline-separated documents, or a single large text.
- Ensure `tokenizer/` has your trained tokenizer files: `sanchari_spm.model`, `sanchari_spm.vocab` (or use the placeholders until you generate them).
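The data file above can be produced with a short script. A minimal sketch, assuming your corpus is already in memory as a list of strings (the documents here are illustrative):

```python
from pathlib import Path

# Illustrative documents -- replace with your real corpus.
documents = [
    "First document text.",
    "Second document text.",
]

out = Path("data/all_texts.txt")
out.parent.mkdir(parents=True, exist_ok=True)
# One document per line, matching the newline-separated format expected above.
# Internal newlines are flattened so each document stays on a single line.
out.write_text(
    "\n".join(doc.replace("\n", " ") for doc in documents),
    encoding="utf-8",
)
```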
Install deps (Colab / local): `pip install transformers datasets sentencepiece accelerate`
Run training (example local): `python training/train.py --config training/config_s.json --tokenizer_dir tokenizer --data_file data/all_texts.txt --output_dir outputs/sanchari-s`
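The `--config` flag points at a JSON file of model and training hyperparameters. The exact keys depend on what `training/train.py` reads, so the fragment below is only a plausible example of what `config_s.json` might contain, not the real file:

```json
{
  "vocab_size": 32000,
  "hidden_size": 512,
  "num_layers": 8,
  "num_heads": 8,
  "max_seq_len": 1024,
  "batch_size": 16,
  "learning_rate": 3e-4,
  "num_epochs": 1
}
```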
Inspect outputs:
- Model + tokenizer saved in `outputs/sanchari-s`
- Logs printed to console
Notes:
- This script is intentionally minimal for seed-stage proof of concept.
- For larger runs use DeepSpeed/Accelerate and multi-GPU clusters.
- Before any investor checkpoint release, run safety & PII checks.
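The PII check in the last note can start as simple regex scanning over the training corpus before a checkpoint goes out. A minimal sketch; the patterns below are illustrative and nowhere near a complete audit:

```python
import re

# Illustrative patterns only -- a real audit needs far broader coverage
# (names, addresses, national IDs, ...) and ideally a dedicated PII library.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scan_for_pii(text: str) -> dict:
    """Return the number of matches per PII category in the given text."""
    return {name: len(pat.findall(text)) for name, pat in PII_PATTERNS.items()}

sample = "Contact alice@example.com or +1 (555) 123-4567."
print(scan_for_pii(sample))
```

Running this over each line of `data/all_texts.txt` and flagging any nonzero counts gives a cheap first-pass filter before the heavier safety review.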