
Project Specification

1. Project Name

Local Advanced Fine-Tuning Pipeline for Coding LLM

2. Purpose

Provide a fully local, modular workflow to fine-tune a compact coding LLM for:

  • code fixing
  • debugging
  • code explanation
  • response confidence and relevancy signals

3. Functional Requirements

FR-1 Dataset Generation

  • System must generate a JSON dataset with fields:
    • instruction
    • input
    • output
    • explanation
    • confidence
    • relevancy
  • Dataset size must be constrained to 5000-10000 samples (a sample record is sketched after this list).
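
A minimal sketch of one record, assuming confidence and relevancy are 0-1 floats (the spec does not pin their scale or the exact field semantics):

```python
# Hypothetical sample record; values are illustrative only.
sample_record = {
    "instruction": "Fix the bug in the following Python function.",
    "input": "def add(a, b):\n    return a - b",
    "output": "def add(a, b):\n    return a + b",
    "explanation": "The function subtracted instead of adding; the operator is corrected.",
    "confidence": 0.92,   # assumed 0-1 float
    "relevancy": 0.95,    # assumed 0-1 float
}
```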

FR-2 Model Fine-Tuning

  • System must support LoRA fine-tuning on:
    • Qwen/Qwen2.5-Coder-0.5B-Instruct (default)
  • Training inputs must be tokenized and formatted from dataset records.
  • Training output must be stored in a configurable output directory.
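
A minimal sketch of the LoRA setup named in FR-2, using transformers and peft; the rank, alpha, and target modules below are assumptions, not values taken from the actual trainer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # FR-2 default

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical LoRA hyperparameters; the real training config may differ.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # common choice for Qwen-style attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights are trainable
```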

FR-3 Pipeline Orchestration

  • System must provide a one-command execution script for:
    • dataset generation
    • training
    • optional uploading
  • Pipeline must support skipping individual stages (one possible driver is sketched after this list).
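
One possible shape for the one-command driver; the stage script names and --skip-* flags are hypothetical:

```python
import argparse
import subprocess
import sys

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the full pipeline.")
    parser.add_argument("--skip-dataset", action="store_true")
    parser.add_argument("--skip-train", action="store_true")
    parser.add_argument("--skip-upload", action="store_true")
    args = parser.parse_args()

    # Each stage is an independent script, so any stage can be skipped (FR-3).
    if not args.skip_dataset:
        subprocess.run([sys.executable, "generate_dataset.py"], check=True)
    if not args.skip_train:
        subprocess.run([sys.executable, "train.py"], check=True)
    if not args.skip_upload:
        subprocess.run([sys.executable, "upload.py"], check=True)

if __name__ == "__main__":
    main()
```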

FR-4 Local Inference

  • System must generate outputs from a local model folder (a minimal inference sketch follows this list).
  • Inference module must support:
    • LoRA adapter outputs
    • full model outputs
  • Inference output must be valid JSON containing:
    • code
    • explanation
    • confidence
    • important_tokens
    • relevancy_score
    • hallucination
    • hallucination_check_reason
    • latency_ms
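
A sketch of the inference contract; the generate() call is standard transformers usage, while the placeholder values only show the required shape of the payload. Loading an adapter-only folder would need peft's AutoPeftModelForCausalLM instead:

```python
import json
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

def run_inference(model_path: str, prompt: str, max_new_tokens: int = 256) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)  # full/merged model

    start = time.perf_counter()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # deterministic mode by default (NFR-3)
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    latency_ms = int((time.perf_counter() - start) * 1000)

    # How the pipeline actually derives these fields is not specified here;
    # the placeholders only demonstrate the required keys.
    return {
        "code": text,
        "explanation": "",
        "confidence": 0.0,
        "important_tokens": [],
        "relevancy_score": 0.0,
        "hallucination": False,
        "hallucination_check_reason": "",
        "latency_ms": latency_ms,
    }

if __name__ == "__main__":
    # "./output_model" is a hypothetical local model folder.
    print(json.dumps(run_inference("./output_model", "Fix: def add(a, b): return a - b")))
```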

FR-5 HF Upload

  • System must upload trained model artifacts to a user-specified HF repo.
  • Upload should be optional and independently executable.
  • System must support updating an existing HF model repo by uploading to the same repo_id (see the sketch below).
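
FR-5 maps naturally onto huggingface_hub; the repo id and folder path below are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()  # requires a prior `huggingface-cli login` or an HF_TOKEN env var

# exist_ok=True lets the same repo_id be reused, which covers repo updates.
api.create_repo(repo_id="your-username/your-model", exist_ok=True)
api.upload_folder(
    folder_path="./output_model",        # hypothetical trained-model directory
    repo_id="your-username/your-model",  # user-specified HF repo
    repo_type="model",
)
```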

4. Non-Functional Requirements

NFR-1 Reliability

  • Scripts must fail with clear error messages for missing files/directories.

NFR-2 Configurability

  • Hyperparameters and paths must be configurable via CLI.
  • Pipeline defaults should be read from training_config.json (an example appears under Default Configuration).

NFR-3 Performance

  • Must support a limited-sample smoke run for CPU environments.
  • Tokenization must use deterministic fixed-length padding for stable LoRA training labels (see the sketch after this list).
  • Inference should run in deterministic mode by default for stable outputs.
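
The fixed-length padding requirement corresponds to a tokenizer call like the one below; masking pad positions out of the loss with label -100 is an assumed (though conventional) detail:

```python
def tokenize_record(tokenizer, text: str, max_length: int = 512) -> dict:
    # Fixed-length padding plus truncation gives every sample the same shape,
    # keeping LoRA training labels stable across batches (NFR-3).
    encoded = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )
    # Conventional causal-LM labeling: ignore pad positions via -100.
    encoded["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(encoded["input_ids"], encoded["attention_mask"])
    ]
    return encoded
```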

NFR-4 Maintainability

  • Modules must remain decoupled and single-purpose where possible.
  • Documentation must include setup and run commands.

5. Input/Output Contracts

Dataset Generator

  • Input:
    • --size (int, 5000-10000)
    • --out (path)
  • Output:
    • JSON training file at --out
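
This contract maps directly onto argparse; the sketch shows the interface only, not the generation logic:

```python
import argparse

parser = argparse.ArgumentParser(description="Generate the JSON training dataset.")
parser.add_argument("--size", type=int, default=8000,
                    help="number of samples (spec range: 5000-10000)")
parser.add_argument("--out", type=str, required=True,
                    help="path of the JSON output file")
args = parser.parse_args()

# Enforce the size constraint from FR-1.
if not 5000 <= args.size <= 10000:
    parser.error("--size must be in [5000, 10000]")
```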

Trainer

  • Input:
    • dataset file path
    • model name
    • hyperparameters
  • Output:
    • trained model artifacts in output_dir

Inference

  • Input:
    • local model path
    • prompt
    • max new tokens
  • Output:
    • structured JSON to stdout
  • Contract:
    • required keys: code, explanation, confidence, important_tokens, relevancy_score, hallucination, hallucination_check_reason, latency_ms

Upload

  • Input:
    • model directory path
    • HF repo id
  • Output:
    • model artifacts uploaded to HF repo

6. Default Configuration

  • Model: Qwen/Qwen2.5-Coder-0.5B-Instruct
  • Dataset size: 8000
  • Epochs: 3
  • Batch size: 2
  • Learning rate: 1e-4
  • Max length: 512
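
If training_config.json mirrors these defaults (see NFR-2), it could be produced like this; the key names are assumptions:

```python
import json

# Hypothetical key names; the values come from this section's defaults.
DEFAULTS = {
    "model_name": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
    "dataset_size": 8000,
    "epochs": 3,
    "batch_size": 2,
    "learning_rate": 1e-4,
    "max_length": 512,
}

with open("training_config.json", "w") as f:
    json.dump(DEFAULTS, f, indent=2)
```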

7. Validation Criteria

The project is considered runnable when:

  • all scripts compile (e.g., pass python -m py_compile)
  • dataset generation succeeds
  • a smoke training run completes
  • inference returns a valid JSON payload with the required keys (a minimal key check follows this list)
  • the upload script accepts a valid model directory and repo id
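
The JSON-payload criterion can be smoke-tested with a simple key check; this is an illustrative helper, not part of the project:

```python
import json

REQUIRED_KEYS = {
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason",
    "latency_ms",
}

def validate_payload(raw: str) -> None:
    payload = json.loads(raw)  # raises if stdout was not valid JSON
    missing = REQUIRED_KEYS - payload.keys()
    assert not missing, f"missing keys: {sorted(missing)}"
```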

8. Known Constraints

  • CPU training is slow for full dataset runs.
  • HF login/token is required for upload.
  • Output quality depends heavily on dataset diversity and quality.