# Project Specification
## 1. Project Name
Local Advanced Fine-Tuning Pipeline for Coding LLM
## 2. Purpose
Provide a fully local, modular workflow to fine-tune a compact coding LLM for:
- code fixing
- debugging
- code explanation
- response confidence and relevancy signals
## 3. Functional Requirements
### FR-1 Dataset Generation
- System must generate a JSON dataset with fields:
- `instruction`
- `input`
- `output`
- `explanation`
- `confidence`
- `relevancy`
- Dataset size must be constrained to 5000-10000 samples.
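
For illustration, a single record might look like the following; the values shown are hypothetical, and the 0-1 float scale for `confidence` and `relevancy` is an assumption rather than part of this spec:

```json
{
  "instruction": "Fix the bug in the following function.",
  "input": "def add(a, b):\n    return a - b",
  "output": "def add(a, b):\n    return a + b",
  "explanation": "The function subtracted instead of adding; the operator is corrected to +.",
  "confidence": 0.92,
  "relevancy": 0.95
}
```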
### FR-2 Model Fine-Tuning
- System must support LoRA fine-tuning on:
- `Qwen/Qwen2.5-Coder-0.5B-Instruct` (default)
- Training inputs must be tokenized and formatted from dataset records.
- Training output must be stored in a configurable output directory.
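
A minimal sketch of how LoRA fine-tuning might be wired up with Hugging Face `peft`; the adapter rank, alpha, dropout, and target modules below are illustrative choices, not values fixed by this specification:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # default per FR-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # common choice for Qwen-style attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only adapter weights are trainable
```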
### FR-3 Pipeline Orchestration
- System must provide a one-command execution script for:
- dataset generation
- training
- optional uploading
- Pipeline must support skipping individual stages.
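
A minimal orchestration sketch satisfying the stage-skipping requirement above; `generate_dataset`, `train`, and `upload` are hypothetical stand-ins for the real stage modules, and the flag names are illustrative:

```python
import argparse

def generate_dataset() -> None: ...  # stands in for the dataset stage
def train() -> None: ...             # stands in for the training stage
def upload() -> None: ...            # stands in for the optional upload stage

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the full pipeline.")
    parser.add_argument("--skip-dataset", action="store_true")
    parser.add_argument("--skip-train", action="store_true")
    parser.add_argument("--skip-upload", action="store_true")
    args = parser.parse_args()

    if not args.skip_dataset:
        generate_dataset()
    if not args.skip_train:
        train()
    if not args.skip_upload:  # upload stays optional per FR-5
        upload()

if __name__ == "__main__":
    main()
```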
### FR-4 Local Inference
- System must generate outputs from a local model folder.
- Inference module must support:
- LoRA adapter outputs
- full model outputs
- Inference output must be valid JSON containing:
- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`
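
For illustration, a conforming payload might look like this; only the key names above are mandated by the contract, so the values and their exact types here are assumptions:

```json
{
  "code": "def add(a, b):\n    return a + b",
  "explanation": "Corrected the operator from - to +.",
  "confidence": 0.91,
  "important_tokens": ["return", "+"],
  "relevancy_score": 0.95,
  "hallucination": false,
  "hallucination_check_reason": "Output references only identifiers present in the prompt.",
  "latency_ms": 842
}
```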
### FR-5 HF Upload
- System must upload trained model artifacts to a user-specified HF repo.
- Upload should be optional and independently executable.
- System must support updating an existing HF model repo by uploading to the same `repo_id`.
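
A minimal upload sketch using `huggingface_hub`; it requires a prior `huggingface-cli login` or an HF token in the environment, and the folder path and repo id below are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="your-username/your-model", exist_ok=True)  # no-op if it exists
api.upload_folder(
    folder_path="outputs/model",         # trained artifacts from FR-2 (placeholder path)
    repo_id="your-username/your-model",  # re-using the same repo_id updates the repo
    repo_type="model",
)
```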
## 4. Non-Functional Requirements
### NFR-1 Reliability
- Scripts must fail with clear error messages for missing files/directories.
### NFR-2 Configurability
- Hyperparameters and paths must be configurable via CLI.
- Pipeline defaults should be read from `training_config.json`.
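
A minimal sketch of the precedence NFR-2 implies (built-in defaults, overridden by `training_config.json` when present, overridden in turn by CLI flags); the key and flag names are assumptions:

```python
import argparse
import json
from pathlib import Path

def load_config(path: str = "training_config.json") -> dict:
    config = {"epochs": 3, "learning_rate": 1e-4}  # section 6 defaults
    if Path(path).exists():
        config.update(json.loads(Path(path).read_text()))  # file overrides defaults
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--learning-rate", type=float)
    args = parser.parse_args()
    for key, value in vars(args).items():  # CLI flags override everything
        if value is not None:
            config[key] = value
    return config
```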
### NFR-3 Performance
- Must support a limited-sample smoke run for CPU environments.
- Tokenization must use deterministic fixed-length padding for stable LoRA training labels.
- Inference should default to deterministic decoding so repeated runs produce stable outputs.
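
A minimal sketch of the fixed-length tokenization NFR-3 calls for, masking pad positions out of the loss with the conventional `-100` label; `max_length` follows the section 6 default:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B-Instruct")
if tokenizer.pad_token is None:          # defensive: fall back to EOS for padding
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(text: str, max_length: int = 512) -> dict:
    enc = tokenizer(
        text,
        padding="max_length",  # fixed-length padding -> stable label shapes
        truncation=True,
        max_length=max_length,
    )
    # Copy input_ids to labels, ignoring padded positions in the loss.
    enc["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(enc["input_ids"], enc["attention_mask"])
    ]
    return enc
```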
### NFR-4 Maintainability
- Modules must remain decoupled and single-purpose where possible.
- Documentation must include setup and run commands.
## 5. Input/Output Contracts
### Dataset Generator
- Input:
- `--size` (int, 5000-10000)
- `--out` (path)
- Output:
- JSON training file at `--out`
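
A small sketch of enforcing the `--size` contract; the function name is illustrative:

```python
def validate_size(size: int) -> int:
    # Reject sizes outside the range the contract allows.
    if not 5000 <= size <= 10000:
        raise ValueError(f"--size must be in [5000, 10000], got {size}")
    return size
```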
### Trainer
- Input:
- dataset file path
- model name
- hyperparameters
- Output:
- trained model artifacts in `output_dir`
### Inference
- Input:
- local model path
- prompt
- max new tokens
- Output:
- structured JSON to stdout
- Contract:
- required keys: `code`, `explanation`, `confidence`, `important_tokens`, `relevancy_score`, `hallucination`, `hallucination_check_reason`, `latency_ms`
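
A minimal sketch of validating an inference payload against the key contract above; the function name is illustrative:

```python
import json

REQUIRED_KEYS = {
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason",
    "latency_ms",
}

def validate_payload(raw: str) -> dict:
    payload = json.loads(raw)  # must be valid JSON to begin with
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise KeyError(f"missing required keys: {sorted(missing)}")
    return payload
```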
### Upload
- Input:
- model directory path
- HF repo id
- Output:
- model artifacts uploaded to HF repo
## 6. Default Configuration
- Model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Dataset size: `8000`
- Epochs: `3`
- Batch size: `2`
- Learning rate: `1e-4`
- Max length: `512`
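
Given NFR-2, a `training_config.json` holding these defaults might look roughly as follows; the key names are assumptions, while the values are the defaults listed above:

```json
{
  "model_name": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
  "dataset_size": 8000,
  "epochs": 3,
  "batch_size": 2,
  "learning_rate": 0.0001,
  "max_length": 512
}
```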
## 7. Validation Criteria
Project is considered runnable when:
- all scripts compile
- dataset generation succeeds
- a smoke training run completes
- inference returns valid JSON payload with required keys
- upload script accepts valid model dir and repo id
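
The "all scripts compile" criterion could be checked mechanically with the standard library, as in this sketch; the directory name is a placeholder:

```python
import compileall
import sys

# Byte-compiles every .py file under the directory; returns falsy on failure.
ok = compileall.compile_dir("scripts", quiet=1)
sys.exit(0 if ok else 1)
```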
## 8. Known Constraints
- CPU training is slow for full dataset runs.
- HF login/token is required for upload.
- Output quality depends heavily on dataset diversity and quality.