# Final Project README - MINDI 1.0 420M (Windows, RTX 4060 8GB)
## What This Project Is
This is a fully local coding-assistant system, built step-by-step from scratch.
It supports:
- custom tokenizer for code
- dataset cleaning + tokenization pipeline
- 420M transformer model
- memory-optimized training
- evaluation + inference improvements
- local chat UI
- LoRA fine-tuning
- INT8 export + portable package
Everything runs locally on your machine without internet after setup.
---
## What You Built (High Level)
1. **Project setup** with reproducible environment and verification scripts.
2. **Custom code tokenizer** (Python + JavaScript aware).
3. **Dataset pipeline** with cleaning, dedupe, and tokenization.
4. **420M transformer architecture** (modular config).
5. **Training pipeline** (FP16, checkpointing, accumulation, resume, early stopping).
6. **Evaluation system** (val metrics + generation checks).
7. **Inference engine** (greedy mode, stop rules, syntax-aware retry).
8. **Local chat interface** with history, copy button, timing, and mode selector.
9. **LoRA fine-tuning pipeline** for your own examples.
10. **Export/quantization/packaging** with benchmark report and portable launcher.
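
The syntax-aware retry in step 7 can be illustrated with a minimal sketch. The `generate` callable and the retry budget below are hypothetical stand-ins for the project's actual inference engine, not its real API:

```python
import ast

def generate_with_syntax_retry(generate, prompt, max_retries=2):
    """Call a code generator; retry when the output fails to parse as Python.

    `generate` is any callable mapping a prompt string to a code string.
    """
    last = ""
    for _ in range(max_retries + 1):
        last = generate(prompt)
        try:
            ast.parse(last)   # syntax check only; never executes the code
            return last       # first syntactically valid completion wins
        except SyntaxError:
            continue          # regenerate and try again
    return last               # retries exhausted: fall back to the last attempt


# Usage: a toy generator that produces broken code first, valid code second.
attempts = iter(["def broken(:", "def ok():\n    return 1"])
result = generate_with_syntax_retry(lambda p: next(attempts), "write a function")
```

The same pattern works for JavaScript output if you swap `ast.parse` for any external syntax checker.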
---
## Most Important File Locations
### Core model and data
- Base checkpoint: `checkpoints/component5_420m/step_3200.pt`
- Tokenized training data: `data/processed/train_tokenized.jsonl`
- Tokenizer: `artifacts/tokenizer/code_tokenizer_v1/`
### LoRA
- Best LoRA adapter: `models/lora/custom_lora_v1/best.pt`
- LoRA metadata: `models/lora/custom_lora_v1/adapter_meta.json`
### Quantized model
- INT8 model: `models/quantized/model_step3200_int8_state.pt`
- Benchmark report: `artifacts/export/component10_benchmark_report.json`
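
The INT8 export stores weights as 8-bit integers plus a scale factor. A minimal sketch of symmetric per-tensor quantization, in pure Python and independent of the project's actual export script:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= q * scale, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard against all-zero tensors
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# round-trip error is bounded by half a quantization step (scale / 2)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

This is why INT8 roughly quarters the checkpoint size relative to FP32 at a small accuracy cost; the benchmark report above records the measured trade-off for this model.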
### Chat interface
- Launcher: `scripts/launch_component8_chat.py`
- Chat config: `configs/component8_chat_config.yaml`
### Portable package
- Folder: `release/MINDI_1.0_420M`
- Double-click launcher: `release/MINDI_1.0_420M/Start_MINDI.bat`
---
## Launch the Main Chat UI
From project root (`C:\AI 2`):
```powershell
.\.venv\Scripts\Activate.ps1
python .\scripts\launch_component8_chat.py --config .\configs\component8_chat_config.yaml
```
Open in browser:
- `http://127.0.0.1:7860`
### Live model selector in UI
You can switch modes without restarting the app:
- `base`
- `lora`
- `int8`
Status box shows:
- active mode
- mode load time
- live VRAM usage
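
The live selector's behavior can be sketched as a lazy-loading registry. The loader callables and cache policy below are hypothetical, not the chat UI's actual code:

```python
import time

class ModeRegistry:
    """Lazily load one model per mode and cache it, so switching back is cheap."""

    def __init__(self, loaders):
        self.loaders = loaders   # mode name -> zero-arg loader callable
        self.cache = {}          # mode name -> loaded model object
        self.active = None

    def switch(self, mode):
        """Activate a mode; return (model, seconds spent loading)."""
        start = time.perf_counter()
        if mode not in self.cache:
            self.cache[mode] = self.loaders[mode]()  # load on first use only
        self.active = mode
        return self.cache[mode], time.perf_counter() - start


# Usage: string stand-ins for real checkpoint loads.
registry = ModeRegistry({
    "base": lambda: "base-model",
    "lora": lambda: "base+lora",
    "int8": lambda: "int8-model",
})
model, load_time = registry.switch("lora")
```

Caching all three modes is what makes VRAM the limiting factor on an 8 GB card; an alternative design evicts the previous mode before loading the next one.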
---
## How to Add More Training Data (Future Improvement)
### A) Add more base-training pairs (full training path)
1. Put new JSONL/JSON files in `data/raw/`.
2. Run dataset processing scripts (Component 3 path).
3. Continue/refresh base training with Component 5.
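
The cleaning/dedupe step in the Component 3 path can be illustrated with a minimal sketch. The field names and normalization rules here are illustrative, not the pipeline's exact logic:

```python
import hashlib

def dedupe_pairs(rows):
    """Drop exact-duplicate pairs, keyed on a hash of the normalized code field."""
    seen, kept = set(), []
    for row in rows:
        # Normalize: strip trailing whitespace and blank lines before hashing.
        normalized = "\n".join(
            line.rstrip() for line in row["code"].strip().splitlines() if line.strip()
        )
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


# Usage: the second row differs only in trailing whitespace, so it is dropped.
rows = [
    {"prompt": "add", "code": "def add(a, b):\n    return a + b"},
    {"prompt": "add again", "code": "def add(a, b):\n    return a + b\n"},
    {"prompt": "sub", "code": "def sub(a, b):\n    return a - b"},
]
unique = dedupe_pairs(rows)
```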
### B) Add targeted improvements quickly (LoRA recommended)
1. Edit `data/raw/custom_finetune_pairs.jsonl` with your new prompt/code pairs.
- Required fields per row: `prompt`, `code`
- Optional: `language` (`python` or `javascript`)
2. Run LoRA fine-tuning:
```powershell
python .\scripts\run_component9_lora_finetune.py --config .\configs\component9_lora_config.yaml
```
3. Use updated adapter in chat by selecting `lora` mode.
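
A row of `custom_finetune_pairs.jsonl` follows the field rules in step 1. The example content below is illustrative, and the validator is a hedged sketch rather than part of the project's pipeline:

```python
import json

REQUIRED = {"prompt", "code"}
ALLOWED_LANGUAGES = {"python", "javascript"}

# One JSONL row: required `prompt` and `code`, optional `language`.
example_row = json.dumps({
    "prompt": "Write a function that reverses a string.",
    "code": "def reverse(s):\n    return s[::-1]",
    "language": "python",
})

def validate_row(line):
    """Parse one JSONL line; raise ValueError if it violates the field rules."""
    row = json.loads(line)
    missing = REQUIRED - row.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if "language" in row and row["language"] not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {row['language']!r}")
    return row

row = validate_row(example_row)
```

Running a check like this over the whole file before fine-tuning catches malformed rows early, before they waste a training run.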
---
## Recommended Next Habit
When quality is weak on specific tasks:
1. Add 20-200 clean examples of exactly that task style to `custom_finetune_pairs.jsonl`.
2. Re-run LoRA fine-tuning.
3. Test in chat `lora` mode.
4. Repeat in small cycles.
This gives faster improvement than retraining the full base model each time.
---
## Health-Check Commands (one verification script per component)
```powershell
python .\scripts\verify_component1_setup.py
python .\scripts\verify_component4_model.py --config .\configs\component4_model_config.yaml --batch_size 1 --seq_len 256
python .\scripts\verify_component9_lora.py
```
---
## Current Status
Project is complete across Components 1-10 and verified on your hardware.