# Project Specification

## 1. Project Name

Local Advanced Fine-Tuning Pipeline for Coding LLM

## 2. Purpose

Provide a fully local, modular workflow to fine-tune a compact coding LLM for:

- code fixing
- debugging
- code explanation
- response confidence and relevancy signals

## 3. Functional Requirements

### FR-1 Dataset Generation

- The system must generate a JSON dataset with the fields:
  - `instruction`
  - `input`
  - `output`
  - `explanation`
  - `confidence`
  - `relevancy`
- Dataset size must be constrained to 5000-10000 samples.

A non-normative example record is shown in Section 9.

### FR-2 Model Fine-Tuning

- The system must support LoRA fine-tuning on:
  - `Qwen/Qwen2.5-Coder-0.5B-Instruct` (default)
- Training inputs must be tokenized and formatted from dataset records.
- Training output must be stored in a configurable output directory.

### FR-3 Pipeline Orchestration

- The system must provide a one-command execution script covering:
  - dataset generation
  - training
  - optional uploading
- The pipeline must support skipping individual stages.

### FR-4 Local Inference

- The system must generate outputs from a local model folder.
- The inference module must support:
  - LoRA adapter outputs
  - full model outputs
- Inference output must be valid JSON containing:
  - `code`
  - `explanation`
  - `confidence`
  - `important_tokens`
  - `relevancy_score`
  - `hallucination`
  - `hallucination_check_reason`
  - `latency_ms`

### FR-5 HF Upload

- The system must upload trained model artifacts to a user-specified HF repo.
- Upload should be optional and independently executable.
- The system must support updating an existing HF model repo by uploading to the same `repo_id`.

## 4. Non-Functional Requirements

### NFR-1 Reliability

- Scripts must fail with clear error messages when files or directories are missing.

### NFR-2 Configurability

- Hyperparameters and paths must be configurable via the CLI.
- Pipeline defaults should be read from `training_config.json`.

### NFR-3 Performance

- Must support a limited-sample smoke run for CPU environments.
- Tokenization must use deterministic fixed-length padding for stable LoRA training labels.
- Inference should default to deterministic decoding for stable outputs.

### NFR-4 Maintainability

- Modules must remain decoupled and single-purpose where possible.
- Documentation must include setup and run commands.

## 5. Input/Output Contracts

### Dataset Generator

- Input:
  - `--size` (int, 5000-10000)
  - `--out` (path)
- Output:
  - JSON training file at `--out`

### Trainer

- Input:
  - dataset file path
  - model name
  - hyperparameters
- Output:
  - trained model artifacts in `output_dir`

### Inference

- Input:
  - local model path
  - prompt
  - max new tokens
- Output:
  - structured JSON to stdout
- Contract:
  - required keys: `code`, `explanation`, `confidence`, `important_tokens`, `relevancy_score`, `hallucination`, `hallucination_check_reason`, `latency_ms`

### Upload

- Input:
  - model directory path
  - HF repo id
- Output:
  - model artifacts uploaded to HF repo

Non-normative sketches of the trainer, orchestrator, inference, and upload flows appear in Section 9.

## 6. Default Configuration

- Model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Dataset size: `8000`
- Epochs: `3`
- Batch size: `2`
- Learning rate: `1e-4`
- Max length: `512`

## 7. Validation Criteria

The project is considered runnable when:

- all scripts compile
- dataset generation succeeds
- a smoke training run completes
- inference returns a valid JSON payload with the required keys
- the upload script accepts a valid model directory and repo id

## 8. Known Constraints

- CPU training is slow for full dataset runs.
- HF login/token is required for upload.
- Output quality depends heavily on dataset diversity and quality.
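## 9. Illustrative Examples (Non-Normative)

The examples in this section are sketches that illustrate the contracts above; every concrete value, flag name, path, and helper in them is an assumption, not part of the specification.

### Dataset Record (FR-1)

A single record showing the FR-1 field set. The content values are invented placeholders, and treating `confidence` and `relevancy` as floats in `[0, 1]` is an assumption; the spec does not fix their type.

```json
{
  "instruction": "Fix the bug in the following function.",
  "input": "def add(a, b):\n    return a - b",
  "output": "def add(a, b):\n    return a + b",
  "explanation": "The function subtracted instead of adding; the operator was corrected.",
  "confidence": 0.92,
  "relevancy": 0.95
}
```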
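### Trainer Sketch (FR-2, Section 6 defaults)

A minimal sketch of the FR-2 LoRA setup, assuming the `transformers`, `peft`, and `datasets` packages. The prompt template, LoRA rank, and `target_modules` are assumptions and must be checked against the actual model; hyperparameters mirror the Section 6 defaults.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

MODEL_NAME = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
MAX_LENGTH = 512  # fixed-length padding for stable labels (NFR-3)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Attach LoRA adapters; r, alpha, and target_modules are assumptions.
lora = LoraConfig(r=8, lora_alpha=16,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

dataset = load_dataset("json", data_files="dataset.json", split="train")

def tokenize(record):
    # Fold a record into one training string; the real template is
    # unspecified, and only three of the six fields are used here.
    text = f"{record['instruction']}\n{record['input']}\n{record['output']}"
    enc = tokenizer(text, max_length=MAX_LENGTH,
                    padding="max_length", truncation=True)
    # A production trainer would mask pad positions in labels (-100).
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="out",               # configurable output dir (FR-2)
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
)
Trainer(model=model, args=args, train_dataset=tokenized).train()
```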
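### Orchestrator Sketch (FR-3)

A sketch of the one-command orchestrator with per-stage skip flags. The flag names and the stage script names (`generate_dataset.py`, `train.py`, `upload.py`) are assumptions; the real CLI may differ.

```python
import argparse
import json
import subprocess
import sys

parser = argparse.ArgumentParser(description="Run the full pipeline")
parser.add_argument("--skip-dataset", action="store_true")
parser.add_argument("--skip-training", action="store_true")
parser.add_argument("--skip-upload", action="store_true")
args = parser.parse_args()

# Pipeline defaults come from training_config.json (NFR-2).
with open("training_config.json") as f:
    cfg = json.load(f)

def run(script, *extra):
    # Fail fast with a clear error if a stage exits non-zero (NFR-1).
    subprocess.run([sys.executable, script, *extra], check=True)

if not args.skip_dataset:
    run("generate_dataset.py",
        "--size", str(cfg.get("dataset_size", 8000)),
        "--out", cfg.get("dataset_path", "dataset.json"))
if not args.skip_training:
    run("train.py")
if not args.skip_upload:
    run("upload.py")
```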
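### Inference Payload (FR-4)

One payload satisfying the FR-4 key contract. All values are invented placeholders; the types of the scoring fields are assumptions.

```json
{
  "code": "def add(a, b):\n    return a + b",
  "explanation": "Corrected the operator from '-' to '+'.",
  "confidence": 0.9,
  "important_tokens": ["return", "+"],
  "relevancy_score": 0.93,
  "hallucination": false,
  "hallucination_check_reason": "Output references only symbols present in the prompt.",
  "latency_ms": 412
}
```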
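### Deterministic Inference Sketch (FR-4, NFR-3)

A sketch of loading from a local model folder and decoding deterministically while measuring `latency_ms`. The model path and prompt are placeholders, and the scoring fields of the contract are deliberately omitted here.

```python
import json
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("out")  # local model folder
model = AutoModelForCausalLM.from_pretrained("out")

prompt = "Fix the bug: def add(a, b): return a - b"
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
# do_sample=False gives greedy, deterministic decoding (NFR-3).
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
latency_ms = int((time.perf_counter() - start) * 1000)

text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True)
# The remaining contract keys (confidence, relevancy_score, etc.)
# would be filled by scoring logic this sketch does not implement.
print(json.dumps({"code": text, "latency_ms": latency_ms}))
```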
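### Upload Sketch (FR-5)

A sketch of the upload step, assuming the `huggingface_hub` package and an existing HF login/token (Section 8). The repo id and folder path are placeholders.

```python
from huggingface_hub import HfApi

api = HfApi()
# exist_ok=True makes the call idempotent for an existing repo.
api.create_repo("your-user/your-model", exist_ok=True)
# Re-running with the same repo_id updates the existing repo (FR-5).
api.upload_folder(folder_path="out", repo_id="your-user/your-model")
```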