# Project Specification

## 1. Project Name

Local Advanced Fine-Tuning Pipeline for Coding LLM

## 2. Purpose

Provide a fully local, modular workflow to fine-tune a compact coding LLM for:
- code fixing
- debugging
- code explanation
- response confidence and relevancy signals

## 3. Functional Requirements

### FR-1 Dataset Generation
- System must generate a JSON dataset with fields:
  - `instruction`
  - `input`
  - `output`
  - `explanation`
  - `confidence`
  - `relevancy`
- Dataset size must be constrained to 5000-10000 samples.
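
A record in the generated dataset might look like the following minimal sketch; the field values are purely illustrative and not part of the contract.

```python
# Illustrative only: one record of the generated JSON dataset (values are hypothetical).
import json

record = {
    "instruction": "Fix the bug in the following Python function.",
    "input": "def add(a, b):\n    return a - b",
    "output": "def add(a, b):\n    return a + b",
    "explanation": "The function subtracted its arguments; the operator was corrected to addition.",
    "confidence": 0.92,
    "relevancy": 0.95,
}

print(json.dumps(record, indent=2))
```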

### FR-2 Model Fine-Tuning
- System must support LoRA fine-tuning on:
  - `Qwen/Qwen2.5-Coder-0.5B-Instruct` (default)
- Training inputs must be tokenized and formatted from dataset records.
- Training output must be stored in a configurable output directory.
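
A minimal training sketch, assuming the Hugging Face `transformers`, `peft`, and `datasets` stack. The LoRA rank, target module names, and the in-memory stand-in dataset are assumptions; the hyperparameters mirror the defaults in Section 6.

```python
# Sketch only: LoRA fine-tuning of the default base model.
# The library stack and LoRA target modules are assumptions, not spec requirements.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with LoRA adapters; target module names are assumed
# to match Qwen2-style attention projections.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Tiny in-memory stand-in for the generated JSON dataset (FR-1).
records = [{"instruction": "Fix the bug.", "input": "return a - b", "output": "return a + b"}]

def tokenize(rec):
    text = f"{rec['instruction']}\n{rec['input']}\n{rec['output']}"
    enc = tokenizer(text, max_length=512, padding="max_length", truncation=True)
    # Padding positions left unmasked here for brevity; see the NFR-3 sketch for label masking.
    enc["labels"] = enc["input_ids"].copy()
    return enc

train_ds = Dataset.from_list(records).map(tokenize, remove_columns=["instruction", "input", "output"])

args = TrainingArguments(output_dir="outputs/lora",   # configurable output directory (FR-2)
                         num_train_epochs=3, per_device_train_batch_size=2,
                         learning_rate=1e-4, logging_steps=10)

Trainer(model=model, args=args, train_dataset=train_ds).train()
model.save_pretrained("outputs/lora")
```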

### FR-3 Pipeline Orchestration
- System must provide a one-command execution script for:
  - dataset generation
  - training
  - optional uploading
- Pipeline must support skipping individual stages.
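
A minimal orchestration sketch showing how stage skipping could be exposed from a single entry point; the script names and flag names (`--skip-dataset`, `--skip-train`, `--skip-upload`) are hypothetical.

```python
# Sketch only: one-command pipeline runner with optional stage skipping.
# Script names and flag names below are hypothetical, not part of the spec.
import argparse
import subprocess
import sys

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main():
    p = argparse.ArgumentParser(description="Run the full fine-tuning pipeline.")
    p.add_argument("--skip-dataset", action="store_true")
    p.add_argument("--skip-train", action="store_true")
    p.add_argument("--skip-upload", action="store_true")
    args = p.parse_args()

    if not args.skip_dataset:
        run([sys.executable, "generate_dataset.py", "--size", "8000", "--out", "data/train.json"])
    if not args.skip_train:
        run([sys.executable, "train_lora.py", "--dataset", "data/train.json"])
    if not args.skip_upload:
        run([sys.executable, "upload_hf.py", "--model-dir", "outputs/lora"])

if __name__ == "__main__":
    main()
```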

### FR-4 Local Inference
- System must generate outputs from a local model folder.
- Inference module must support:
  - LoRA adapter outputs
  - full model outputs
- Inference output must be valid JSON containing:
  - `code`
  - `explanation`
  - `confidence`
  - `important_tokens`
  - `relevancy_score`
  - `hallucination`
  - `hallucination_check_reason`
  - `latency_ms`
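
An illustrative payload containing the required keys; the values shown are hypothetical.

```python
# Illustrative only: the shape of one inference response (values are hypothetical).
import json

response = {
    "code": "def add(a, b):\n    return a + b",
    "explanation": "Replaced the subtraction with addition to match the function's intent.",
    "confidence": 0.91,
    "important_tokens": ["return", "+"],
    "relevancy_score": 0.95,
    "hallucination": False,
    "hallucination_check_reason": "Output only references symbols present in the prompt.",
    "latency_ms": 412,
}

print(json.dumps(response, indent=2))
```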

### FR-5 HF Upload
- System must upload trained model artifacts to a user-specified HF repo.
- Upload should be optional and independently executable.
- System must support updating an existing HF model repo by uploading to the same `repo_id`.
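
A minimal upload sketch, assuming the `huggingface_hub` client; the repo id is a placeholder, and an HF token/login is required (see Section 8). Reusing the same `repo_id` updates the existing repo.

```python
# Sketch only: push trained artifacts to a user-specified HF repo.
# Using huggingface_hub this way is an assumption about how FR-5 could be met.
from huggingface_hub import HfApi

def upload_model(model_dir: str, repo_id: str) -> None:
    api = HfApi()
    # Reuses an existing repo if present, so re-uploading to the same repo_id updates it.
    api.create_repo(repo_id=repo_id, exist_ok=True)
    api.upload_folder(folder_path=model_dir, repo_id=repo_id)

upload_model("outputs/lora", "your-username/coder-lora")  # repo id is a placeholder
```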

## 4. Non-Functional Requirements

### NFR-1 Reliability
- Scripts must fail with clear error messages for missing files/directories.

### NFR-2 Configurability
- Hyperparameters and paths must be configurable via CLI.
- Pipeline defaults should be read from `training_config.json`.
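
A minimal sketch of how CLI arguments could override the `training_config.json` defaults; the key names mirror Section 6 but are otherwise assumptions.

```python
# Sketch only: merge defaults from training_config.json with CLI overrides.
# Config key names are assumptions drawn from Section 6.
import argparse
import json

def load_config(path="training_config.json"):
    with open(path) as f:
        return json.load(f)

def parse_args():
    defaults = load_config()
    p = argparse.ArgumentParser()
    p.add_argument("--model-name", default=defaults.get("model_name", "Qwen/Qwen2.5-Coder-0.5B-Instruct"))
    p.add_argument("--epochs", type=int, default=defaults.get("epochs", 3))
    p.add_argument("--batch-size", type=int, default=defaults.get("batch_size", 2))
    p.add_argument("--learning-rate", type=float, default=defaults.get("learning_rate", 1e-4))
    p.add_argument("--max-length", type=int, default=defaults.get("max_length", 512))
    p.add_argument("--output-dir", default=defaults.get("output_dir", "outputs/lora"))
    return p.parse_args()

args = parse_args()  # any flag passed on the CLI takes precedence over the JSON default
```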

### NFR-3 Performance
- Must support a limited-sample smoke run for CPU environments.
- Tokenization must use deterministic fixed-length padding for stable LoRA training labels.
- Inference should support deterministic mode by default for stable outputs.
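
A minimal sketch of deterministic fixed-length tokenization; masking padded positions to `-100` (so they are ignored by the loss) is an assumption about how the labels are built.

```python
# Sketch only: deterministic fixed-length tokenization for LoRA training labels.
# Masking padded positions to -100 is an assumption, not a spec requirement.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B-Instruct")

def encode(text: str, max_length: int = 512) -> dict:
    enc = tokenizer(text, max_length=max_length, padding="max_length", truncation=True)
    # Copy input ids as labels, then ignore the padded tail so it does not contribute to the loss.
    enc["labels"] = [tok if mask == 1 else -100
                     for tok, mask in zip(enc["input_ids"], enc["attention_mask"])]
    return enc

sample = encode("Fix the bug:\ndef add(a, b):\n    return a - b")
assert len(sample["input_ids"]) == 512  # same length for every record -> stable batch shapes
```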

### NFR-4 Maintainability
- Modules must remain decoupled and single-purpose where possible.
- Documentation must include setup and run commands.

## 5. Input/Output Contracts

### Dataset Generator
- Input:
  - `--size` (int, 5000-10000)
  - `--out` (path)
- Output:
  - JSON training file at `--out`
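
A minimal argparse sketch enforcing this contract; the range-check helper and defaults are illustrative.

```python
# Sketch only: CLI contract for the dataset generator with the 5000-10000 bound enforced.
import argparse

def bounded_size(value: str) -> int:
    size = int(value)
    if not 5000 <= size <= 10000:
        raise argparse.ArgumentTypeError("--size must be between 5000 and 10000")
    return size

parser = argparse.ArgumentParser(description="Generate the JSON training dataset.")
parser.add_argument("--size", type=bounded_size, default=8000)
parser.add_argument("--out", default="data/train.json")
args = parser.parse_args()
```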

### Trainer
- Input:
  - dataset file path
  - model name
  - hyperparameters
- Output:
  - trained model artifacts in `output_dir`

### Inference
- Input:
  - local model path
  - prompt
  - max new tokens
- Output:
  - structured JSON to stdout
- Contract:
  - required keys: `code`, `explanation`, `confidence`, `important_tokens`, `relevancy_score`, `hallucination`, `hallucination_check_reason`, `latency_ms`
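
A small sketch of a contract check for the inference payload; the helper name is illustrative.

```python
# Sketch only: verify that an inference payload satisfies the required-key contract.
import json

REQUIRED_KEYS = {
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason", "latency_ms",
}

def validate_payload(raw: str) -> dict:
    payload = json.loads(raw)                 # must parse as valid JSON
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"inference output missing keys: {sorted(missing)}")
    return payload
```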

### Upload
- Input:
  - model directory path
  - HF repo id
- Output:
  - model artifacts uploaded to HF repo

## 6. Default Configuration

- Model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Dataset size: `8000`
- Epochs: `3`
- Batch size: `2`
- Learning rate: `1e-4`
- Max length: `512`

## 7. Validation Criteria

Project is considered runnable when:
- all scripts compile
- dataset generation succeeds
- a smoke training run completes
- inference returns valid JSON payload with required keys
- upload script accepts valid model dir and repo id

## 8. Known Constraints

- CPU training is slow for full dataset runs.
- HF login/token is required for upload.
- Output quality depends heavily on dataset diversity and quality.