girish00 committed on
Commit 3e6e808 · verified · 1 Parent(s): a201d36

update endpoint helper files

  1. SPECIFICATION.md +134 -0
SPECIFICATION.md ADDED
@@ -0,0 +1,134 @@
+ # Project Specification
+
+ ## 1. Project Name
+
+ Local Advanced Fine-Tuning Pipeline for Coding LLM
+
+ ## 2. Purpose
+
+ Provide a fully local, modular workflow to fine-tune a compact coding LLM for:
+ - code fixing
+ - debugging
+ - code explanation
+ - response confidence and relevancy signals
+
+ ## 3. Functional Requirements
+
+ ### FR-1 Dataset Generation
+ - System must generate a JSON dataset with fields:
+   - `instruction`
+   - `input`
+   - `output`
+   - `explanation`
+   - `confidence`
+   - `relevancy`
+ - Dataset size must be constrained to 5000-10000 samples.
+
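A minimal sketch of how the FR-1 constraints could be checked before training. The field names and size bounds come from the requirement above; the function name and the shape of the error report are assumptions made for illustration.

```python
# Field names and size bounds are taken from FR-1; everything else is illustrative.
REQUIRED_FIELDS = ("instruction", "input", "output", "explanation", "confidence", "relevancy")
MIN_SAMPLES, MAX_SAMPLES = 5000, 10000


def validate_dataset(records):
    """Return a list of readable problems; an empty list means the dataset passes."""
    problems = []
    if not MIN_SAMPLES <= len(records) <= MAX_SAMPLES:
        problems.append(
            f"dataset has {len(records)} samples, expected {MIN_SAMPLES}-{MAX_SAMPLES}"
        )
    for index, record in enumerate(records):
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append(f"record {index} is missing fields: {', '.join(missing)}")
    return problems
```

Returning every problem at once, rather than raising on the first, makes it easier to repair a generated dataset in a single pass.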
+ ### FR-2 Model Fine-Tuning
+ - System must support LoRA fine-tuning on:
+   - `Qwen/Qwen2.5-Coder-0.5B-Instruct` (default)
+ - Training inputs must be formatted from dataset records and then tokenized.
+ - Training output must be stored in a configurable output directory.
+
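The formatting step in FR-2 might look roughly like the sketch below. The specification does not define a prompt template, so the section markers and the `format_record` name are hypothetical; only the record field names come from FR-1.

```python
def format_record(record):
    """Render one dataset record as a single training string (hypothetical template)."""
    parts = [
        f"### Instruction:\n{record['instruction']}",
        f"### Input:\n{record['input']}",
        f"### Response:\n{record['output']}",
        f"### Explanation:\n{record['explanation']}",
    ]
    # Blank lines between sections keep the boundaries visible to the tokenizer.
    return "\n\n".join(parts)
```

The resulting string is what would be passed to the tokenizer; a real trainer may prefer the model's own chat template instead.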
+ ### FR-3 Pipeline Orchestration
+ - System must provide a one-command execution script for:
+   - dataset generation
+   - training
+   - optional uploading
+ - Pipeline must support skipping individual stages.
+
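One way to express stage skipping is a set of boolean CLI flags, sketched below. The flag names (`--skip-dataset`, `--skip-train`, `--upload`) are assumptions; the specification only requires that stages can be skipped and that upload is optional.

```python
import argparse


def build_parser():
    # Flag names are hypothetical; the spec does not name them.
    parser = argparse.ArgumentParser(description="Run the fine-tuning pipeline.")
    parser.add_argument("--skip-dataset", action="store_true", help="skip dataset generation")
    parser.add_argument("--skip-train", action="store_true", help="skip training")
    parser.add_argument("--upload", action="store_true", help="run the optional upload stage")
    return parser


def planned_stages(args):
    """Return the names of the stages that would run, in execution order."""
    stages = []
    if not args.skip_dataset:
        stages.append("dataset")
    if not args.skip_train:
        stages.append("train")
    if args.upload:
        stages.append("upload")
    return stages
```

Computing the stage list separately from running it keeps the orchestration logic testable without touching the model or the network.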
+ ### FR-4 Local Inference
+ - System must generate outputs from a local model folder.
+ - Inference module must support:
+   - LoRA adapter outputs
+   - full model outputs
+ - Inference output must be valid JSON containing:
+   - `code`
+   - `explanation`
+   - `confidence`
+   - `important_tokens`
+   - `relevancy_score`
+   - `hallucination`
+   - `hallucination_check_reason`
+   - `latency_ms`
+
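The output contract in FR-4 can be enforced with a small check like the sketch below. The key names are taken from the list above; the function name and error wording are illustrative, and value types are not checked because the specification does not state them.

```python
import json

# Required keys as listed in FR-4.
REQUIRED_KEYS = {
    "code",
    "explanation",
    "confidence",
    "important_tokens",
    "relevancy_score",
    "hallucination",
    "hallucination_check_reason",
    "latency_ms",
}


def parse_inference_output(raw):
    """Parse raw inference output and verify that every required key is present."""
    payload = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(payload, dict):
        raise ValueError("inference output must be a JSON object")
    missing = sorted(REQUIRED_KEYS - payload.keys())
    if missing:
        raise ValueError(f"inference output is missing keys: {', '.join(missing)}")
    return payload
```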
+ ### FR-5 HF Upload
+ - System must upload trained model artifacts to a user-specified Hugging Face (HF) repo.
+ - Upload should be optional and independently executable.
+ - System must support updating an existing HF model repo by uploading to the same `repo_id`.
+
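A sketch of the upload step, assuming the `huggingface_hub` package is used (the specification does not name the client library). `create_repo(..., exist_ok=True)` leaves an existing repo in place, so uploading again to the same `repo_id` updates it. The `upload_model` name is hypothetical.

```python
from pathlib import Path


def upload_model(model_dir, repo_id):
    """Upload a local model folder to a Hugging Face model repo."""
    folder = Path(model_dir)
    if not folder.is_dir():
        # Fail early with a clear message, before any network call is made.
        raise FileNotFoundError(f"model directory not found: {folder}")

    # Imported lazily so the rest of the pipeline runs without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # exist_ok=True makes this a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="model")
```

The call requires a valid HF token in the environment or from a prior login.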
+ ## 4. Non-Functional Requirements
+
+ ### NFR-1 Reliability
+ - Scripts must fail with clear error messages for missing files/directories.
+
+ ### NFR-2 Configurability
+ - Hyperparameters and paths must be configurable via CLI.
+ - Pipeline defaults should be read from `training_config.json`.
+
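The layering described in NFR-2 (file defaults, overridden by CLI flags) could be sketched as follows. The JSON key names and flag names are assumptions, since the specification does not show the layout of `training_config.json`; the fallback values mirror the defaults listed in section 6.

```python
import argparse
import json
from pathlib import Path

# Fallback values mirror section 6 of the spec; the key names are hypothetical.
FALLBACKS = {
    "model": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
    "epochs": 3,
    "batch_size": 2,
    "learning_rate": 1e-4,
    "max_length": 512,
}


def load_defaults(config_path="training_config.json"):
    """Merge values from the config file, if present, over the built-in fallbacks."""
    defaults = dict(FALLBACKS)
    path = Path(config_path)
    if path.is_file():
        defaults.update(json.loads(path.read_text(encoding="utf-8")))
    return defaults


def parse_args(argv=None, config_path="training_config.json"):
    """CLI flags take precedence over the config file, which overrides the fallbacks."""
    defaults = load_defaults(config_path)
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default=defaults["model"])
    parser.add_argument("--epochs", type=int, default=defaults["epochs"])
    parser.add_argument("--batch-size", type=int, default=defaults["batch_size"])
    parser.add_argument("--learning-rate", type=float, default=defaults["learning_rate"])
    parser.add_argument("--max-length", type=int, default=defaults["max_length"])
    return parser.parse_args(argv)
```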
+ ### NFR-3 Performance
+ - Must support a limited-sample smoke run for CPU environments.
+ - Tokenization must use deterministic fixed-length padding for stable LoRA training labels.
+ - Inference should support deterministic mode by default for stable outputs.
+
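Fixed-length padding with masked labels can be illustrated on plain token-id lists, as in the sketch below. It assumes right-padding and a `pad_token_id` of `0`, neither of which the specification states; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_to_fixed_length(token_ids, max_length=512, pad_token_id=0):
    """Truncate or right-pad token ids to exactly max_length and build matching labels."""
    ids = list(token_ids)[:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
        # Padding positions are excluded from the loss.
        "labels": ids + [IGNORE_INDEX] * padding,
    }
```

Because every example ends up with the same length, batches have a stable shape regardless of how samples are grouped.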
+ ### NFR-4 Maintainability
+ - Modules must remain decoupled and single-purpose where possible.
+ - Documentation must include setup and run commands.
+
+ ## 5. Input/Output Contracts
+
+ ### Dataset Generator
+ - Input:
+   - `--size` (int, 5000-10000)
+   - `--out` (path)
+ - Output:
+   - JSON training file at `--out`
+
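The dataset generator's input contract maps naturally onto `argparse` with a range-checked type, as sketched below. The `--size` and `--out` flags come from the contract above; the default of `8000` is the value from section 6, making `--out` required is an assumption, and the helper names are illustrative.

```python
import argparse
from pathlib import Path


def sample_count(text):
    """argparse type: an integer within the range the spec allows for --size."""
    value = int(text)
    if not 5000 <= value <= 10000:
        raise argparse.ArgumentTypeError(f"size must be between 5000 and 10000, got {value}")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Generate the training dataset.")
    parser.add_argument("--size", type=sample_count, default=8000, help="number of samples")
    parser.add_argument("--out", type=Path, required=True, help="path of the JSON file to write")
    return parser
```

With this setup an out-of-range `--size` makes `argparse` print the message and exit with status 2 before any generation work starts.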
+ ### Trainer
+ - Input:
+   - dataset file path
+   - model name
+   - hyperparameters
+ - Output:
+   - trained model artifacts in `output_dir`
+
+ ### Inference
+ - Input:
+   - local model path
+   - prompt
+   - max new tokens
+ - Output:
+   - structured JSON to stdout
+ - Contract:
+   - required keys: `code`, `explanation`, `confidence`, `important_tokens`, `relevancy_score`, `hallucination`, `hallucination_check_reason`, `latency_ms`
+
+ ### Upload
+ - Input:
+   - model directory path
+   - HF repo id
+ - Output:
+   - model artifacts uploaded to HF repo
+
+ ## 6. Default Configuration
+
+ - Model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
+ - Dataset size: `8000`
+ - Epochs: `3`
+ - Batch size: `2`
+ - Learning rate: `1e-4`
+ - Max length: `512`
+
+ ## 7. Validation Criteria
+
+ Project is considered runnable when:
+ - all scripts compile
+ - dataset generation succeeds
+ - a smoke training run completes
+ - inference returns a valid JSON payload with the required keys
+ - upload script accepts a valid model directory and repo id
+
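The first criterion, "all scripts compile", can be checked with the standard-library `py_compile` module. The sketch below assumes the scripts are `.py` files under a single root directory; the function name is illustrative.

```python
import py_compile
from pathlib import Path


def find_compile_errors(root="."):
    """Byte-compile every .py file under root and return one message per failure."""
    errors = []
    for script in sorted(Path(root).rglob("*.py")):
        try:
            py_compile.compile(str(script), doraise=True)
        except py_compile.PyCompileError as exc:
            errors.append(f"{script}: {exc.msg}")
    return errors
```

An empty result means every script parsed successfully; it does not show that the scripts run correctly, which the remaining criteria cover.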
+ ## 8. Known Constraints
+
+ - CPU training is slow for full dataset runs.
+ - HF login/token is required for upload.
+ - Output quality depends heavily on dataset diversity and quality.