| # Final Project README - MINDI 1.0 420M (Windows, RTX 4060 8GB) |
|
|
| ## What This Project Is |
This is a fully local coding-assistant system built step by step from scratch.
| It supports: |
| - custom tokenizer for code |
| - dataset cleaning + tokenization pipeline |
| - 420M transformer model |
| - memory-optimized training |
| - evaluation + inference improvements |
| - local chat UI |
| - LoRA fine-tuning |
| - INT8 export + portable package |
|
|
After setup, everything runs locally on your machine with no internet connection required.
|
|
| --- |
|
|
| ## What You Built (High Level) |
| 1. **Project setup** with reproducible environment and verification scripts. |
| 2. **Custom code tokenizer** (Python + JavaScript aware). |
| 3. **Dataset pipeline** with cleaning, dedupe, and tokenization. |
| 4. **420M transformer architecture** (modular config). |
| 5. **Training pipeline** (FP16, checkpointing, accumulation, resume, early stopping). |
| 6. **Evaluation system** (val metrics + generation checks). |
| 7. **Inference engine** (greedy mode, stop rules, syntax-aware retry). |
| 8. **Local chat interface** with history, copy button, timing, and mode selector. |
| 9. **LoRA fine-tuning pipeline** for your own examples. |
| 10. **Export/quantization/packaging** with benchmark report and portable launcher. |
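For a sense of where the 420M figure comes from, here is a rough parameter-count estimate for a GPT-style decoder of that scale. The dimensions below are illustrative assumptions only; the actual values live in `configs/component4_model_config.yaml`:

```python
# Rough parameter count for a GPT-style decoder.
# All dimensions here are illustrative assumptions -- the real
# config lives in configs/component4_model_config.yaml.
vocab_size = 50_000   # assumed tokenizer vocab
d_model    = 1024     # hidden size
n_layers   = 25       # transformer blocks
ctx_len    = 2048     # max sequence length

embed  = vocab_size * d_model   # token embedding table
pos    = ctx_len * d_model      # learned positional embeddings
attn   = 4 * d_model ** 2       # Q, K, V, and output projections
mlp    = 8 * d_model ** 2       # 4x expansion: two d_model x 4*d_model matrices
block  = attn + mlp             # per-layer weights (layernorms omitted, they are tiny)
head   = vocab_size * d_model   # untied output projection

total = embed + pos + n_layers * block + head
print(f"{total / 1e6:.0f}M parameters")  # -> 419M parameters
```

The point is only that a model of this shape lands in the 420M range; the real architecture may differ in every dimension.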
|
|
| --- |
|
|
| ## Most Important File Locations |
|
|
| ### Core model and data |
| - Base checkpoint: `checkpoints/component5_420m/step_3200.pt` |
| - Tokenized training data: `data/processed/train_tokenized.jsonl` |
| - Tokenizer: `artifacts/tokenizer/code_tokenizer_v1/` |
|
|
| ### LoRA |
| - Best LoRA adapter: `models/lora/custom_lora_v1/best.pt` |
| - LoRA metadata: `models/lora/custom_lora_v1/adapter_meta.json` |
|
|
| ### Quantized model |
| - INT8 model: `models/quantized/model_step3200_int8_state.pt` |
| - Benchmark report: `artifacts/export/component10_benchmark_report.json` |
|
|
| ### Chat interface |
| - Launcher: `scripts/launch_component8_chat.py` |
| - Chat config: `configs/component8_chat_config.yaml` |
|
|
| ### Portable package |
| - Folder: `release/MINDI_1.0_420M` |
| - Double-click launcher: `release/MINDI_1.0_420M/Start_MINDI.bat` |
|
|
| --- |
|
|
| ## Launch the Main Chat UI |
| From project root (`C:\AI 2`): |
|
|
| ```powershell |
| .\.venv\Scripts\Activate.ps1 |
| python .\scripts\launch_component8_chat.py --config .\configs\component8_chat_config.yaml |
| ``` |
|
|
| Open in browser: |
| - `http://127.0.0.1:7860` |
|
|
| ### Live model selector in UI |
| You can switch without restart: |
| - `base` |
| - `lora` |
| - `int8` |
|
|
| Status box shows: |
| - active mode |
| - mode load time |
| - live VRAM usage |
|
|
| --- |
|
|
| ## How to Add More Training Data (Future Improvement) |
|
|
| ### A) Add more base-training pairs (full training path) |
| 1. Put new JSONL/JSON files in `data/raw/`. |
| 2. Run dataset processing scripts (Component 3 path). |
| 3. Continue/refresh base training with Component 5. |
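The cleaning and dedupe step handled by the Component 3 scripts amounts to something like the following sketch. The function name and the hash-based normalization here are assumptions for illustration, not the actual pipeline code:

```python
import hashlib

def dedupe_rows(rows):
    """Drop rows whose normalized code content has already been seen.

    Illustrative sketch only -- the real logic lives in the
    Component 3 dataset-processing scripts.
    """
    seen = set()
    for row in rows:
        # Collapse whitespace so trivially reformatted duplicates match.
        normalized = " ".join(row["code"].split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield row

rows = [
    {"prompt": "add two numbers", "code": "def add(a, b):\n    return a + b"},
    {"prompt": "add two numbers", "code": "def add(a, b):\n  return a + b"},  # duplicate after normalization
]
unique = list(dedupe_rows(rows))
print(len(unique))  # -> 1
```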
|
|
| ### B) Add targeted improvements quickly (LoRA recommended) |
| 1. Edit `data/raw/custom_finetune_pairs.jsonl` with your new prompt/code pairs. |
| - Required fields per row: `prompt`, `code` |
| - Optional: `language` (`python` or `javascript`) |
| 2. Run LoRA fine-tuning: |
|
|
| ```powershell |
| python .\scripts\run_component9_lora_finetune.py --config .\configs\component9_lora_config.yaml |
| ``` |
|
|
3. Use the updated adapter in chat by selecting `lora` mode.
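A row in `custom_finetune_pairs.jsonl` (step 1 above) looks like the object below, serialized on one line. Building it in Python keeps the quoting and newline escaping correct; the field names match the list above, the content is just an example:

```python
import json

row = {
    "prompt": "Write a function that reverses a string.",
    "code": "def reverse(s):\n    return s[::-1]",
    "language": "python",  # optional field
}
# Each line of the JSONL file is one such object on its own line.
line = json.dumps(row)
print(line)
```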
|
|
| --- |
|
|
| ## Recommended Next Habit |
| When quality is weak on specific tasks: |
| 1. Add 20-200 clean examples of exactly that task style to `custom_finetune_pairs.jsonl`. |
| 2. Re-run LoRA fine-tuning. |
| 3. Test in chat `lora` mode. |
| 4. Repeat in small cycles. |
|
|
| This gives faster improvement than retraining the full base model each time. |
|
|
| --- |
|
|
## Quick Health-Check Commands
|
|
| ```powershell |
| python .\scripts\verify_component1_setup.py |
| python .\scripts\verify_component4_model.py --config .\configs\component4_model_config.yaml --batch_size 1 --seq_len 256 |
| python .\scripts\verify_component9_lora.py |
| ``` |
|
|
| --- |
|
|
| ## Current Status |
The project is complete across Components 1-10 and verified on your hardware.
|
|
|
|