# Project Specification
## 1. Project Name
Local Advanced Fine-Tuning Pipeline for Coding LLM
## 2. Purpose
Provide a fully local, modular workflow to fine-tune a compact coding LLM for:
- code fixing
- debugging
- code explanation
- response confidence and relevancy signals
## 3. Functional Requirements
### FR-1 Dataset Generation
- System must generate a JSON dataset with fields:
- `instruction`
- `input`
- `output`
- `explanation`
- `confidence`
- `relevancy`
- Dataset size must be constrained to 5000-10000 samples.
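A generated record might look like the following sketch (all values are illustrative; only the field names are part of the contract):

```json
{
  "instruction": "Fix the bug in this function.",
  "input": "def add(a, b):\n    return a - b",
  "output": "def add(a, b):\n    return a + b",
  "explanation": "The function subtracted instead of adding; the operator is corrected.",
  "confidence": 0.92,
  "relevancy": 0.97
}
```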
### FR-2 Model Fine-Tuning
- System must support LoRA fine-tuning on:
- `Qwen/Qwen2.5-Coder-0.5B-Instruct` (default)
- Training inputs must be tokenized and formatted from dataset records.
- Training output must be stored in a configurable output directory.
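A minimal fine-tuning sketch, assuming the `transformers`, `peft`, and `datasets` libraries are installed; the LoRA rank, target modules, and file paths are illustrative choices, and the hyperparameters mirror the defaults in Section 6:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # make sure padding is defined

model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL),
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

def tokenize(record):
    # Fixed-length padding (NFR-3) keeps label tensors identical in shape.
    text = f"{record['instruction']}\n{record['input']}\n{record['output']}"
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    enc["labels"] = enc["input_ids"].copy()
    return enc

data = load_dataset("json", data_files="dataset.json", split="train")
data = data.map(tokenize, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=1e-4),
    train_dataset=data,
).train()
```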
### FR-3 Pipeline Orchestration
- System must provide a one-command execution script for:
- dataset generation
- training
- optional uploading
- Pipeline must support skipping individual stages.
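One way to satisfy the one-command and stage-skipping requirements is a thin driver like the sketch below; the stage script names and flag names are hypothetical:

```python
import argparse
import subprocess
import sys

parser = argparse.ArgumentParser(description="Run the full pipeline")
parser.add_argument("--skip-dataset", action="store_true")
parser.add_argument("--skip-train", action="store_true")
parser.add_argument("--skip-upload", action="store_true")
args = parser.parse_args()

stages = [
    (args.skip_dataset, [sys.executable, "generate_dataset.py",
                         "--size", "8000", "--out", "dataset.json"]),
    (args.skip_train,   [sys.executable, "train.py", "--dataset", "dataset.json"]),
    (args.skip_upload,  [sys.executable, "upload.py", "--model-dir", "out"]),
]
for skipped, cmd in stages:
    if not skipped:
        # check=True aborts the pipeline with a clear error on failure (NFR-1).
        subprocess.run(cmd, check=True)
```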
### FR-4 Local Inference
- System must generate outputs from a local model folder.
- Inference module must support:
- LoRA adapter outputs
- full model outputs
- Inference output must be valid JSON containing:
- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`
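An example payload satisfying this contract (values are illustrative):

```json
{
  "code": "def add(a, b):\n    return a + b",
  "explanation": "Corrected the operator so the function adds its arguments.",
  "confidence": 0.9,
  "important_tokens": ["return", "+"],
  "relevancy_score": 0.95,
  "hallucination": false,
  "hallucination_check_reason": "Output matches the requested fix and introduces no new APIs.",
  "latency_ms": 412
}
```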
### FR-5 HF Upload
- System must upload trained model artifacts to a user-specified HF repo.
- Upload should be optional and independently executable.
- System must support updating an existing HF model repo by uploading to the same `repo_id`.
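A minimal upload sketch using `huggingface_hub`; the repo id and folder path are placeholders, and `exist_ok=True` covers the re-upload-to-the-same-`repo_id` case:

```python
# Assumes a prior `huggingface-cli login` or an HF_TOKEN environment variable.
from huggingface_hub import HfApi

api = HfApi()
# exist_ok=True lets the same repo_id be reused when updating an existing repo.
api.create_repo("your-user/coding-llm-lora", exist_ok=True)
api.upload_folder(folder_path="out", repo_id="your-user/coding-llm-lora")
```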
## 4. Non-Functional Requirements
### NFR-1 Reliability
- Scripts must fail with clear error messages for missing files/directories.
### NFR-2 Configurability
- Hyperparameters and paths must be configurable via CLI.
- Pipeline defaults should be read from `training_config.json`.
### NFR-3 Performance
- Must support a limited-sample smoke run for CPU environments.
- Tokenization must use deterministic fixed-length padding for stable LoRA training labels.
- Inference should support deterministic mode by default for stable outputs.
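A deterministic-inference sketch that also covers the LoRA-adapter path from FR-4; folder names are placeholders, and `do_sample=False` is the standard `transformers` switch for greedy (repeatable) decoding:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE),
    "out",  # local LoRA adapter directory (placeholder)
)

inputs = tokenizer("Fix this function: ...", return_tensors="pt")
# do_sample=False selects greedy decoding, so repeated runs give stable outputs.
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```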
### NFR-4 Maintainability
- Modules must remain decoupled and single-purpose where possible.
- Documentation must include setup and run commands.
## 5. Input/Output Contracts
### Dataset Generator
- Input:
- `--size` (int, 5000-10000)
- `--out` (path)
- Output:
- JSON training file at `--out`
### Trainer
- Input:
- dataset file path
- model name
- hyperparameters
- Output:
- trained model artifacts in `output_dir`
### Inference
- Input:
- local model path
- prompt
- max new tokens
- Output:
- structured JSON to stdout
- Contract:
- required keys: `code`, `explanation`, `confidence`, `important_tokens`, `relevancy_score`, `hallucination`, `hallucination_check_reason`, `latency_ms`
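The contract can be checked mechanically; a sketch that validates a payload read from stdin (the invocation style is an assumption):

```python
# Contract check: fail if any required key is missing from the payload.
import json
import sys

REQUIRED = {"code", "explanation", "confidence", "important_tokens",
            "relevancy_score", "hallucination", "hallucination_check_reason",
            "latency_ms"}

payload = json.loads(sys.stdin.read())
missing = REQUIRED - payload.keys()
if missing:
    sys.exit(f"missing keys: {sorted(missing)}")
print("payload OK")
```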
### Upload
- Input:
- model directory path
- HF repo id
- Output:
- model artifacts uploaded to HF repo
## 6. Default Configuration
- Model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Dataset size: `8000`
- Epochs: `3`
- Batch size: `2`
- Learning rate: `1e-4`
- Max length: `512`
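Expressed as a `training_config.json` (the key names are assumptions; the values are the defaults above):

```json
{
  "model_name": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
  "dataset_size": 8000,
  "epochs": 3,
  "batch_size": 2,
  "learning_rate": 1e-4,
  "max_length": 512
}
```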
## 7. Validation Criteria
The project is considered runnable when:
- all scripts compile
- dataset generation succeeds
- a smoke training run completes
- inference returns valid JSON payload with required keys
- upload script accepts valid model dir and repo id
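The first criterion can be scripted with the standard library alone, for example:

```python
# Smoke check for "all scripts compile": byte-compile every file in the tree.
import compileall
import sys

# compile_dir returns a truthy value only if every file byte-compiles cleanly.
ok = compileall.compile_dir(".", quiet=1)
sys.exit(0 if ok else 1)
```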
## 8. Known Constraints
- CPU training is slow for full dataset runs.
- HF login/token is required for upload.
- Output quality depends heavily on dataset diversity and quality.