Project Specification
1. Project Name
Local Advanced Fine-Tuning Pipeline for Coding LLM
2. Purpose
Provide a fully local, modular workflow to fine-tune a compact coding LLM for:
- code fixing
- debugging
- code explanation
- response confidence and relevancy signals
3. Functional Requirements
FR-1 Dataset Generation
- System must generate a JSON dataset with fields:
  instruction, input, output, explanation, confidence, relevancy
- Dataset size must be constrained to 5000-10000 samples.
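
For illustration, a single record under this schema might look like the sketch below (all field values are hypothetical):

```python
import json

# Hypothetical example of one dataset record (illustrative values only).
record = {
    "instruction": "Fix the bug in this function.",
    "input": "def add(a, b):\n    return a - b",
    "output": "def add(a, b):\n    return a + b",
    "explanation": "The function subtracted instead of adding; the operator is corrected.",
    "confidence": 0.95,
    "relevancy": 0.9,
}

print(json.dumps(record, indent=2))
```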
FR-2 Model Fine-Tuning
- System must support LoRA fine-tuning on:
  Qwen/Qwen2.5-Coder-0.5B-Instruct (default)
- Training inputs must be tokenized and formatted from dataset records.
- Training output must be stored in a configurable output directory.
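
A minimal sketch of what FR-2 implies, assuming the transformers, peft, and datasets libraries; the dataset path, prompt format, and LoRA target modules are assumptions, and the hyperparameters mirror the Section 6 defaults:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # default model (Section 6)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Attach LoRA adapters. Target module names are an assumption and must match
# the attention projection names of the chosen base model.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

def tokenize(record):
    # Format one dataset record into a single training string, then pad to a
    # fixed length so label tensors are deterministic (see NFR-3).
    text = f"{record['instruction']}\n{record['input']}\n{record['output']}"
    enc = tokenizer(text, padding="max_length", truncation=True, max_length=512)
    # Simple label copy; pad positions could additionally be masked
    # (see the NFR-3 sketch later in this document).
    enc["labels"] = enc["input_ids"].copy()
    return enc

raw = load_dataset("json", data_files="dataset.json")["train"]
train_dataset = raw.map(tokenize, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out/lora", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=1e-4),
    train_dataset=train_dataset,
)
trainer.train()
model.save_pretrained("out/lora")  # LoRA adapter weights only
```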
FR-3 Pipeline Orchestration
- System must provide a one-command execution script for:
- dataset generation
- training
- optional uploading
- Pipeline must support skipping individual stages.
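
One way FR-3 could look in practice; the script names and flag spellings below are assumptions, not part of the spec:

```python
import argparse
import subprocess
import sys

def run(cmd):
    # Fail fast with a clear message if any stage fails (see NFR-1).
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main():
    parser = argparse.ArgumentParser(description="Run the full pipeline.")
    parser.add_argument("--skip-dataset", action="store_true")
    parser.add_argument("--skip-train", action="store_true")
    parser.add_argument("--upload", action="store_true",
                        help="Uploading is optional (FR-5).")
    args = parser.parse_args()

    if not args.skip_dataset:
        run([sys.executable, "generate_dataset.py",
             "--size", "8000", "--out", "dataset.json"])
    if not args.skip_train:
        run([sys.executable, "train.py", "--dataset", "dataset.json"])
    if args.upload:
        run([sys.executable, "upload.py", "--model-dir", "out/lora"])

if __name__ == "__main__":
    main()
```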
FR-4 Local Inference
- System must generate outputs from a local model folder.
- Inference module must support:
- LoRA adapter outputs
- full model outputs
- Inference output must be valid JSON containing:
  code, explanation, confidence, important_tokens, relevancy_score, hallucination, hallucination_check_reason, latency_ms
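
A sketch of how this payload might be assembled, assuming a merged full-model directory (an adapter-only directory would instead be loaded via peft's AutoPeftModelForCausalLM); the scoring fields are left as placeholders since their logic is project-specific:

```python
import json
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "out/merged"  # hypothetical local model folder

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)

prompt = "Fix this function:\ndef add(a, b):\n    return a - b"
start = time.perf_counter()
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=False is the deterministic default mode required by NFR-3.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)

# Placeholder values: the real confidence/relevancy/hallucination logic is
# project-specific and not defined by this sketch.
payload = {
    "code": completion,
    "explanation": "",
    "confidence": 0.0,
    "important_tokens": [],
    "relevancy_score": 0.0,
    "hallucination": False,
    "hallucination_check_reason": "",
    "latency_ms": int((time.perf_counter() - start) * 1000),
}
print(json.dumps(payload))
```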
FR-5 HF Upload
- System must upload trained model artifacts to a user-specified HF repo.
- Upload should be optional and independently executable.
- System must support updating an existing HF model repo by uploading to the same repo_id.
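
A sketch of FR-5 using the huggingface_hub client; the wrapper function and paths are assumptions, while the library calls themselves are standard:

```python
from huggingface_hub import HfApi

def upload_model(model_dir: str, repo_id: str) -> None:
    """Upload trained artifacts to a user-specified HF repo (FR-5)."""
    api = HfApi()
    # exist_ok=True lets the same repo_id be updated on later runs.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=model_dir, repo_id=repo_id, repo_type="model")

# Example with a hypothetical repo id; requires a prior HF login/token.
# upload_model("out/lora", "your-username/coder-lora")
```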
4. Non-Functional Requirements
NFR-1 Reliability
- Scripts must fail with clear error messages for missing files/directories.
NFR-2 Configurability
- Hyperparameters and paths must be configurable via CLI.
- Pipeline defaults should be read from training_config.json.
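
A sketch of how NFR-2 might be satisfied: defaults come from training_config.json and each CLI flag overrides them (flag and key names are assumptions):

```python
import argparse
import json
from pathlib import Path

def load_args():
    # Defaults come from training_config.json; any CLI flag overrides them.
    defaults = json.loads(Path("training_config.json").read_text())
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=defaults.get("epochs", 3))
    parser.add_argument("--batch-size", type=int,
                        default=defaults.get("batch_size", 2))
    parser.add_argument("--learning-rate", type=float,
                        default=defaults.get("learning_rate", 1e-4))
    parser.add_argument("--output-dir",
                        default=defaults.get("output_dir", "out/lora"))
    return parser.parse_args()
```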
NFR-3 Performance
- Must support a limited-sample smoke run for CPU environments.
- Tokenization must use deterministic fixed-length padding for stable LoRA training labels.
- Inference should support deterministic mode by default for stable outputs.
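
The fixed-length padding requirement could be implemented as below, assuming a Hugging Face tokenizer; masking pad positions with -100 keeps them out of the loss:

```python
def tokenize_fixed(tokenizer, text: str, max_length: int = 512):
    # padding="max_length" yields the same tensor shape for every record,
    # which keeps LoRA training labels stable across runs (NFR-3).
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_length)
    # Mask pad positions so they do not contribute to the loss.
    enc["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(enc["input_ids"], enc["attention_mask"])
    ]
    return enc
```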
NFR-4 Maintainability
- Modules must remain decoupled and single-purpose where possible.
- Documentation must include setup and run commands.
5. Input/Output Contracts
Dataset Generator
- Input:
  - --size (int, 5000-10000)
  - --out (path)
- Output:
  - JSON training file at --out
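
The size bound can be enforced at parse time; a sketch with argparse, using the flag names from this contract:

```python
import argparse

def parse_size(value: str) -> int:
    # Enforce the 5000-10000 bound from FR-1 at argument-parse time.
    size = int(value)
    if not 5000 <= size <= 10000:
        raise argparse.ArgumentTypeError("--size must be between 5000 and 10000")
    return size

parser = argparse.ArgumentParser()
parser.add_argument("--size", type=parse_size, default=8000)
parser.add_argument("--out", default="dataset.json")
```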
Trainer
- Input:
- dataset file path
- model name
- hyperparameters
- Output:
  - trained model artifacts in output_dir
Inference
- Input:
- local model path
- prompt
- max new tokens
- Output:
- structured JSON to stdout
- Contract:
  - required keys: code, explanation, confidence, important_tokens, relevancy_score, hallucination, hallucination_check_reason, latency_ms
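
A minimal contract check a caller might apply to the inference output:

```python
REQUIRED_KEYS = {
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason",
    "latency_ms",
}

def validate_payload(payload: dict) -> None:
    # Reject any inference output that is missing contract keys.
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"inference payload missing keys: {sorted(missing)}")
```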
Upload
- Input:
- model directory path
- HF repo id
- Output:
- model artifacts uploaded to HF repo
6. Default Configuration
- Model: Qwen/Qwen2.5-Coder-0.5B-Instruct
- Dataset size: 8000
- Epochs: 3
- Batch size: 2
- Learning rate: 1e-4
- Max length: 512
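
Serialized into training_config.json (see NFR-2), these defaults might look like the following; the key names are assumptions:

```python
import json
from pathlib import Path

# Section 6 defaults, written as training_config.json (key names assumed).
DEFAULTS = {
    "model_name": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
    "dataset_size": 8000,
    "epochs": 3,
    "batch_size": 2,
    "learning_rate": 1e-4,
    "max_length": 512,
}

Path("training_config.json").write_text(json.dumps(DEFAULTS, indent=2))
```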
7. Validation Criteria
Project is considered runnable when:
- all scripts compile
- dataset generation succeeds
- a smoke training run completes
- inference returns valid JSON payload with required keys
- upload script accepts valid model dir and repo id
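
The "all scripts compile" criterion can be checked mechanically; a sketch assuming hypothetical script names:

```python
import py_compile

# Minimal smoke check: every pipeline script must at least byte-compile.
# Script names are assumptions; adjust to the actual module list.
for script in ["generate_dataset.py", "train.py", "infer.py", "upload.py"]:
    py_compile.compile(script, doraise=True)
    print(f"OK: {script}")
```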
8. Known Constraints
- CPU training is slow for full dataset runs.
- HF login/token is required for upload.
- Output quality depends heavily on dataset diversity and quality.