# Project Specification

## 1. Project Name

Local Advanced Fine-Tuning Pipeline for Coding LLM

## 2. Purpose

Provide a fully local, modular workflow to fine-tune a compact coding LLM for:

- code fixing
- debugging
- code explanation
- response confidence and relevancy signals

## 3. Functional Requirements

### FR-1 Dataset Generation

- System must generate a JSON dataset with fields:
  - `instruction`
  - `input`
  - `output`
  - `explanation`
  - `confidence`
  - `relevancy`
- Dataset size must be constrained to 5000-10000 samples.
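
An illustrative record with made-up values, showing one plausible shape for these fields:

```json
{
  "instruction": "Fix the bug in this function.",
  "input": "def add(a, b):\n    return a - b",
  "output": "def add(a, b):\n    return a + b",
  "explanation": "The function subtracted instead of adding; the operator is corrected.",
  "confidence": 0.95,
  "relevancy": 0.9
}
```
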
### FR-2 Model Fine-Tuning

- System must support LoRA fine-tuning on:
  - `Qwen/Qwen2.5-Coder-0.5B-Instruct` (default)
- Training inputs must be tokenized and formatted from dataset records.
- Training output must be stored in a configurable output directory.
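
A minimal sketch of this step, assuming the Hugging Face `transformers`/`peft`/`datasets` stack; the LoRA target module names and the prompt formatting are illustrative assumptions, not the project's fixed implementation:

```python
import json

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "Qwen/Qwen2.5-Coder-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Attach LoRA adapters; the target projection names are an assumption that
# matches common Qwen2-family attention layers.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

def to_features(rec):
    # Fold one dataset record into a single fixed-length training example.
    text = f"{rec['instruction']}\n{rec['input']}\n{rec['output']}"
    enc = tokenizer(text, max_length=512, truncation=True, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()
    return enc

with open("dataset.json") as f:
    records = json.load(f)
train_ds = Dataset.from_list(records).map(
    to_features, remove_columns=list(records[0].keys()))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=1e-4),
    train_dataset=train_ds,
).train()
model.save_pretrained("out")  # writes the LoRA adapter weights
```
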
### FR-3 Pipeline Orchestration

- System must provide a one-command execution script for:
  - dataset generation
  - training
  - optional uploading
- Pipeline must support skipping individual stages.
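
A minimal driver sketch; the stage script names (`generate_dataset.py`, `train.py`, `upload.py`) and skip flags are assumptions for illustration:

```python
import argparse
import subprocess
import sys

parser = argparse.ArgumentParser(description="One-command pipeline driver")
parser.add_argument("--skip-dataset", action="store_true")
parser.add_argument("--skip-train", action="store_true")
parser.add_argument("--upload", action="store_true", help="upload stage is opt-in")
args = parser.parse_args()

def run(stage, cmd):
    # check=True aborts the pipeline with the failing stage's exit code.
    print(f"[pipeline] running stage: {stage}")
    subprocess.run(cmd, check=True)

if not args.skip_dataset:
    run("dataset", [sys.executable, "generate_dataset.py",
                    "--size", "8000", "--out", "dataset.json"])
if not args.skip_train:
    run("train", [sys.executable, "train.py", "--data", "dataset.json"])
if args.upload:
    run("upload", [sys.executable, "upload.py",
                   "--model-dir", "out", "--repo-id", "your-username/coder-lora"])
```
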
### FR-4 Local Inference

- System must generate outputs from a local model folder.
- Inference module must support:
  - LoRA adapter outputs
  - full model outputs
- Inference output must be valid JSON containing:
  - `code`
  - `explanation`
  - `confidence`
  - `important_tokens`
  - `relevancy_score`
  - `hallucination`
  - `hallucination_check_reason`
  - `latency_ms`
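
An illustrative payload (all values made up) satisfying this contract:

```json
{
  "code": "def add(a, b):\n    return a + b",
  "explanation": "The function subtracted its arguments; the operator is corrected to +.",
  "confidence": 0.93,
  "important_tokens": ["return", "a + b"],
  "relevancy_score": 0.88,
  "hallucination": false,
  "hallucination_check_reason": "Output only references symbols present in the prompt.",
  "latency_ms": 412
}
```
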
### FR-5 HF Upload

- System must upload trained model artifacts to a user-specified HF repo.
- Upload should be optional and independently executable.
- System must support updating an existing HF model repo by uploading to the same `repo_id`.
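
A minimal sketch using `huggingface_hub`; the repo id is a placeholder, and re-running with the same `repo_id` updates the existing repo:

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN
api.create_repo("your-username/coder-lora", exist_ok=True)  # no-op if it already exists
api.upload_folder(folder_path="out", repo_id="your-username/coder-lora")
```
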
## 4. Non-Functional Requirements

### NFR-1 Reliability

- Scripts must fail with clear error messages for missing files/directories.
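
One way scripts might satisfy this (a sketch; the helper name is hypothetical):

```python
import sys
from pathlib import Path

def require_path(path: str, what: str) -> Path:
    # Exit with a message that names exactly which artifact is missing.
    p = Path(path)
    if not p.exists():
        sys.exit(f"error: {what} not found at {p.resolve()}")
    return p

dataset_file = require_path("dataset.json", "dataset file")
```
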
### NFR-2 Configurability

- Hyperparameters and paths must be configurable via CLI.
- Pipeline defaults should be read from `training_config.json`.

### NFR-3 Performance

- Must support a limited-sample smoke run for CPU environments.
- Tokenization must use deterministic fixed-length padding for stable LoRA training labels.
- Inference should run in deterministic mode by default for stable outputs.
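
A sketch of both NFR-3 behaviors under the assumed `transformers` stack: fixed-length padding on the training side and greedy (non-sampled) decoding on the inference side:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)

# Training side: every sample is padded/truncated to exactly 512 tokens,
# so label tensors have a stable shape across the whole LoRA run.
enc = tok("fix: def add(a, b): return a - b",
          max_length=512, padding="max_length", truncation=True)
assert len(enc["input_ids"]) == 512

# Inference side: greedy decoding (no sampling) makes outputs repeatable.
model = AutoModelForCausalLM.from_pretrained(MODEL)
inputs = tok("fix: def add(a, b): return a - b", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```
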
### NFR-4 Maintainability

- Modules must remain decoupled and single-purpose where possible.
- Documentation must include setup and run commands.

## 5. Input/Output Contracts

### Dataset Generator

- Input:
  - `--size` (int, 5000-10000)
  - `--out` (path)
- Output:
  - JSON training file at `--out`

### Trainer

- Input:
  - dataset file path
  - model name
  - hyperparameters
- Output:
  - trained model artifacts in `output_dir`

### Inference

- Input:
  - local model path
  - prompt
  - max new tokens
- Output:
  - structured JSON to stdout
- Contract:
  - required keys: `code`, `explanation`, `confidence`, `important_tokens`, `relevancy_score`, `hallucination`, `hallucination_check_reason`, `latency_ms`

### Upload

- Input:
  - model directory path
  - HF repo id
- Output:
  - model artifacts uploaded to HF repo
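
Hypothetical invocations matching the contracts above; the script names, and every flag except the generator's `--size`/`--out`, are illustrative assumptions:

```bash
python generate_dataset.py --size 8000 --out dataset.json
python train.py --data dataset.json --model Qwen/Qwen2.5-Coder-0.5B-Instruct --output-dir out
python infer.py --model-path out --prompt "fix: def add(a, b): return a - b" --max-new-tokens 128
python upload.py --model-dir out --repo-id your-username/coder-lora
```
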
## 6. Default Configuration

- Model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Dataset size: `8000`
- Epochs: `3`
- Batch size: `2`
- Learning rate: `1e-4`
- Max length: `512`
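
A `training_config.json` sketch carrying these defaults (the key names are assumptions; only the values come from this section):

```json
{
  "model_name": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
  "dataset_size": 8000,
  "epochs": 3,
  "batch_size": 2,
  "learning_rate": 1e-4,
  "max_length": 512
}
```
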
## 7. Validation Criteria

Project is considered runnable when:

- all scripts compile without errors
- dataset generation succeeds
- a smoke training run completes
- inference returns a valid JSON payload with the required keys
- the upload script accepts a valid model directory and repo id

## 8. Known Constraints

- CPU training is slow for full dataset runs.
- HF login/token is required for upload.
- Output quality depends heavily on dataset diversity and quality.