# Final Project README - MINDI 1.0 420M (Windows, RTX 4060 8GB)
## What This Project Is
This is a fully local coding-assistant system, built step-by-step from scratch.
It supports:
- custom tokenizer for code
- dataset cleaning + tokenization pipeline
- 420M transformer model
- memory-optimized training
- evaluation + inference improvements
- local chat UI
- LoRA fine-tuning
- INT8 export + portable package
Everything runs locally on your machine without internet after setup.
---
## What You Built (High Level)
1. **Project setup** with reproducible environment and verification scripts.
2. **Custom code tokenizer** (Python + JavaScript aware).
3. **Dataset pipeline** with cleaning, dedupe, and tokenization.
4. **420M transformer architecture** (modular config).
5. **Training pipeline** (FP16, checkpointing, accumulation, resume, early stopping).
6. **Evaluation system** (val metrics + generation checks).
7. **Inference engine** (greedy mode, stop rules, syntax-aware retry).
8. **Local chat interface** with history, copy button, timing, and mode selector.
9. **LoRA fine-tuning pipeline** for your own examples.
10. **Export/quantization/packaging** with benchmark report and portable launcher.
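
The syntax-aware retry in step 7 can be illustrated with a minimal sketch. The `generate` callable and the retry budget below are hypothetical stand-ins for the project's actual inference engine, not its real API:

```python
import ast

def generate_with_syntax_retry(generate, prompt, max_retries=2):
    """Call a code generator; retry when the output fails to parse as Python.

    `generate` is any callable mapping a prompt string to a code string.
    """
    last = ""
    for _ in range(max_retries + 1):
        last = generate(prompt)
        try:
            ast.parse(last)   # syntax check only; never executes the code
            return last       # first syntactically valid completion wins
        except SyntaxError:
            continue          # regenerate and try again
    return last               # retries exhausted: fall back to the last attempt


# Usage: a toy generator that produces broken code first, valid code second.
attempts = iter(["def broken(:", "def ok():\n    return 1"])
result = generate_with_syntax_retry(lambda p: next(attempts), "write a function")
```

The same pattern works for JavaScript output if you swap `ast.parse` for any external syntax checker.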
---
## Most Important File Locations
### Core model and data
- Base checkpoint: `checkpoints/component5_420m/step_3200.pt`
- Tokenized training data: `data/processed/train_tokenized.jsonl`
- Tokenizer: `artifacts/tokenizer/code_tokenizer_v1/`
### LoRA
- Best LoRA adapter: `models/lora/custom_lora_v1/best.pt`
- LoRA metadata: `models/lora/custom_lora_v1/adapter_meta.json`
### Quantized model
- INT8 model: `models/quantized/model_step3200_int8_state.pt`
- Benchmark report: `artifacts/export/component10_benchmark_report.json`
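
The INT8 export stores weights as 8-bit integers plus a scale factor. A minimal sketch of symmetric per-tensor quantization, in pure Python and independent of the project's actual export script:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= q * scale, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard against all-zero tensors
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# round-trip error is bounded by half a quantization step (scale / 2)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

This is why INT8 roughly quarters the checkpoint size relative to FP32 at a small accuracy cost; the benchmark report above records the measured trade-off for this model.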
### Chat interface
- Launcher: `scripts/launch_component8_chat.py`
- Chat config: `configs/component8_chat_config.yaml`
### Portable package
- Folder: `release/MINDI_1.0_420M`
- Double-click launcher: `release/MINDI_1.0_420M/Start_MINDI.bat`
---
## Launch the Main Chat UI
From project root (`C:\AI 2`):
```powershell
.\.venv\Scripts\Activate.ps1
python .\scripts\launch_component8_chat.py --config .\configs\component8_chat_config.yaml
```
Open in browser:
- `http://127.0.0.1:7860`
### Live model selector in UI
You can switch modes without restarting the app:
- `base`
- `lora`
- `int8`
Status box shows:
- active mode
- mode load time
- live VRAM usage
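
The live selector's behavior can be sketched as a lazy-loading registry. The loader callables and cache policy below are hypothetical, not the chat UI's actual code:

```python
import time

class ModeRegistry:
    """Lazily load one model per mode and cache it, so switching back is cheap."""

    def __init__(self, loaders):
        self.loaders = loaders   # mode name -> zero-arg loader callable
        self.cache = {}          # mode name -> loaded model object
        self.active = None

    def switch(self, mode):
        """Activate a mode; return (model, seconds spent loading)."""
        start = time.perf_counter()
        if mode not in self.cache:
            self.cache[mode] = self.loaders[mode]()  # load on first use only
        self.active = mode
        return self.cache[mode], time.perf_counter() - start


# Usage: string stand-ins for real checkpoint loads.
registry = ModeRegistry({
    "base": lambda: "base-model",
    "lora": lambda: "base+lora",
    "int8": lambda: "int8-model",
})
model, load_time = registry.switch("lora")
```

Caching all three modes is what makes VRAM the limiting factor on an 8 GB card; an alternative design evicts the previous mode before loading the next one.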
---
## How to Add More Training Data (Future Improvement)
### A) Add more base-training pairs (full training path)
1. Put new JSONL/JSON files in `data/raw/`.
2. Run dataset processing scripts (Component 3 path).
3. Continue/refresh base training with Component 5.
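
The cleaning/dedupe step in the Component 3 path can be illustrated with a minimal sketch. The field names and normalization rules here are illustrative, not the pipeline's exact logic:

```python
import hashlib

def dedupe_pairs(rows):
    """Drop exact-duplicate pairs, keyed on a hash of the normalized code field."""
    seen, kept = set(), []
    for row in rows:
        # Normalize: strip trailing whitespace and blank lines before hashing.
        normalized = "\n".join(
            line.rstrip() for line in row["code"].strip().splitlines() if line.strip()
        )
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


# Usage: the second row differs only in trailing whitespace, so it is dropped.
rows = [
    {"prompt": "add", "code": "def add(a, b):\n    return a + b"},
    {"prompt": "add again", "code": "def add(a, b):\n    return a + b\n"},
    {"prompt": "sub", "code": "def sub(a, b):\n    return a - b"},
]
unique = dedupe_pairs(rows)
```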
### B) Add targeted improvements quickly (LoRA recommended)
1. Edit `data/raw/custom_finetune_pairs.jsonl` with your new prompt/code pairs.
- Required fields per row: `prompt`, `code`
- Optional: `language` (`python` or `javascript`)
2. Run LoRA fine-tuning:
```powershell
python .\scripts\run_component9_lora_finetune.py --config .\configs\component9_lora_config.yaml
```
3. Use updated adapter in chat by selecting `lora` mode.
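
A row of `custom_finetune_pairs.jsonl` follows the field rules in step 1. The example content below is illustrative, and the validator is a hedged sketch rather than part of the project's pipeline:

```python
import json

REQUIRED = {"prompt", "code"}
ALLOWED_LANGUAGES = {"python", "javascript"}

# One JSONL row: required `prompt` and `code`, optional `language`.
example_row = json.dumps({
    "prompt": "Write a function that reverses a string.",
    "code": "def reverse(s):\n    return s[::-1]",
    "language": "python",
})

def validate_row(line):
    """Parse one JSONL line; raise ValueError if it violates the field rules."""
    row = json.loads(line)
    missing = REQUIRED - row.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if "language" in row and row["language"] not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {row['language']!r}")
    return row

row = validate_row(example_row)
```

Running a check like this over the whole file before fine-tuning catches malformed rows early, before they waste a training run.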
---
## Recommended Next Habit
When quality is weak on specific tasks:
1. Add 20-200 clean examples of exactly that task style to `custom_finetune_pairs.jsonl`.
2. Re-run LoRA fine-tuning.
3. Test in chat `lora` mode.
4. Repeat in small cycles.
This gives faster improvement than retraining the full base model each time.
---
## Health-Check Commands (one verification script per component)
```powershell
python .\scripts\verify_component1_setup.py
python .\scripts\verify_component4_model.py --config .\configs\component4_model_config.yaml --batch_size 1 --seq_len 256
python .\scripts\verify_component9_lora.py
```
---
## Current Status
Project is complete across Components 1-10 and verified on your hardware.