Implementation Guide

Goal

Build and run a local fine-tuning pipeline for a coding assistant model with:

  • dataset generation
  • LoRA fine-tuning
  • local inference
  • optional Hugging Face upload

Project Modules

  • generate_dataset.py
    • Generates training samples into JSON.
    • Supports dataset sizes from 5000 to 10000 samples.
  • finetune_coding_llm_colab.py
    • Main training module (local usage).
    • Supports dataset generation, training, and optional HF upload.
  • run_pipeline.py
    • Orchestrates generate -> train -> upload in one command.
    • Reads defaults from training_config.json.
  • infer_local.py
    • Runs inference from local trained output.
    • Handles both LoRA adapter output and full model output.
    • Returns structured JSON fields including code, explanation, confidence, relevancy, hallucination check, and latency.
  • infer_cloud.py
    • Runs inference through the Hugging Face API using an HF token.
    • Reuses the local structured-output parser and repair checks so API output matches the local JSON contract.
    • Falls back to the local model/ folder when Hugging Face does not serve the custom repo through an inference provider.
  • handler.py
    • Custom Hugging Face Dedicated Inference Endpoint handler.
    • Loads the LoRA adapter/full model and returns the same structured JSON contract directly from the hosted endpoint.
  • evaluate_model.py
    • Runs a multi-prompt evaluation and reports pass rate (accuracy) for schema + quality checks.
  • upload_to_hf.py
    • Uploads local model folder to Hugging Face model repo.

Environment Setup

  1. Use Python 3.10+ (recommended).
  2. Install dependencies:
    • pip install -r requirements.txt
  3. (Optional) Log in to Hugging Face before uploading:
    • huggingface-cli login

Standard Execution Flow

  1. Generate dataset:
    • python generate_dataset.py --size 8000 --out train.json
  2. Train model:
    • python finetune_coding_llm_colab.py --dataset-size 8000 --train-file train.json --output-dir model --skip-dataset-gen
  3. Test inference:
    • python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"
    • Add --allow-downloads on a fresh machine if the base model is not cached locally.
  4. Evaluate quality:
    • python evaluate_model.py --model-path model
  5. Upload (optional):
    • python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name
  6. Test cloud inference (optional):
    • PowerShell: $env:HF_TOKEN="your_huggingface_token"
    • python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"
    • If you already logged in with hf auth login, the saved token can be used without setting HF_TOKEN.
    • Add --no-local-fallback if you want the command to fail when HF cloud serving is unavailable.
    • Add --allow-downloads if local fallback needs to download missing base-model files.
    • For true cloud execution, deploy a Hugging Face Dedicated Inference Endpoint and call:
      • python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback
    • Each user should set their own token (PowerShell: $env:HF_TOKEN="their_huggingface_token") before calling the endpoint.
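
As a sketch of what the endpoint call does under the hood, using only the standard Dedicated Inference Endpoint request contract (a JSON body with an "inputs" field plus a bearer token); the function names here are illustrative, not part of infer_cloud.py:

```python
import json
import urllib.request

def build_endpoint_request(endpoint_url: str, prompt: str, token: str) -> urllib.request.Request:
    """Build the POST request a Dedicated Inference Endpoint expects.

    Assumes the standard endpoint contract: JSON body {"inputs": ...},
    authenticated with a bearer token (e.g. taken from HF_TOKEN).
    """
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def query_endpoint(endpoint_url: str, prompt: str, token: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the structured JSON response."""
    req = build_endpoint_request(endpoint_url, prompt, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the auth header and payload shape easy to inspect and test without a live endpoint.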

One-Command Execution

  • Run full pipeline without upload:
    • python run_pipeline.py --dataset-size 8000 --skip-upload
  • Run with upload:
    • python run_pipeline.py --dataset-size 8000 --hf-repo your-username/your-model-name

Performance Recommendations

  • CPU quick validation:
    • python run_pipeline.py --dataset-size 5000 --max-train-samples 20 --epochs 0.1 --skip-upload
  • Full quality run:
    • python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --use-4bit --skip-upload

Error Handling Rules

  • If the dataset file is missing, run generate_dataset.py.
  • If the model folder is missing, run training first.
  • If HF upload fails, verify:
    • huggingface-cli whoami
    • repo permission and repo id format (username/repo)
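
The first two checks can be automated with a small preflight helper; the sketch below is illustrative, not part of the project:

```python
from pathlib import Path

def preflight(train_file: str = "train.json", model_dir: str = "model") -> list[str]:
    """Return remediation hints for any missing pipeline artifacts."""
    hints = []
    if not Path(train_file).is_file():
        hints.append(f"{train_file} missing: run generate_dataset.py first")
    if not Path(model_dir).is_dir():
        hints.append(f"{model_dir}/ missing: run training before inference or upload")
    return hints  # empty list means both artifacts are in place
```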

Integration Notes

  • run_pipeline.py is the recommended entrypoint for regular usage.
  • training_config.json provides default values and can be overridden by CLI flags.
  • Inference works with LoRA adapters and full models automatically.
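
The defaults-plus-override behavior can be sketched with argparse, where values from training_config.json seed the flag defaults so any flag passed on the command line wins (the key names here are assumptions about the config schema, not its verified contents):

```python
import argparse
import json
from pathlib import Path

def load_config(path: str = "training_config.json") -> dict:
    """Defaults come from training_config.json; a missing file means no defaults."""
    p = Path(path)
    return json.loads(p.read_text()) if p.is_file() else {}

def parse_args(argv: list[str], defaults: dict) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # JSON values become flag defaults; explicit CLI flags override them.
    parser.add_argument("--dataset-size", type=int,
                        default=defaults.get("dataset_size", 8000))
    parser.add_argument("--epochs", type=float,
                        default=defaults.get("epochs", 3))
    return parser.parse_args(argv)
```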

Updating an Existing Hugging Face Model

To update an already published Hugging Face model with current project behavior:

  1. Retrain with latest code:
    • python run_pipeline.py --dataset-size 8000 --skip-upload
  2. Validate local inference + evaluation:
    • python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"
    • python evaluate_model.py --model-path model
  3. Upload to same repo id:
    • python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name

Optional safer rollout:

  • Upload to a revision branch first and test before merging to main.

Current Output Contract

infer_local.py returns JSON with:

  • code
  • explanation
  • confidence
  • important_tokens
  • relevancy_score
  • hallucination
  • hallucination_check_reason
  • latency_ms
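
A small validator can assert this contract on any response, local or cloud (an illustrative helper, not part of infer_local.py):

```python
import json

CONTRACT_KEYS = (
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason",
    "latency_ms",
)

def validate_contract(raw_json: str) -> dict:
    """Parse an inference response and verify every contract field is present."""
    result = json.loads(raw_json)
    missing = [k for k in CONTRACT_KEYS if k not in result]
    if missing:
        raise ValueError(f"response missing contract fields: {missing}")
    return result
```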

infer_cloud.py returns the same JSON keys through the Hugging Face API, or through local fallback if HF cannot serve the custom repo. Cloud responses may not include token-level probabilities, so important_tokens can be empty and confidence can be 0.0 unless the serving endpoint exposes token details.

For users calling the hosted model with their own token/API key, deploy the repository as a Hugging Face Dedicated Inference Endpoint. The included handler.py ensures endpoint responses follow the same eight-field JSON contract listed above, returned directly from the hosted endpoint.

Hugging Face serverless inference is not guaranteed to serve custom LoRA repos. A Dedicated Inference Endpoint or a cloud VM is required for true cloud execution.
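
For reference, a Dedicated Inference Endpoint loads an EndpointHandler class from a top-level handler.py. The skeleton below shows that interface shape with model loading and generation stubbed out; the project's actual handler.py loads the LoRA adapter or full model at this point:

```python
import time

class EndpointHandler:
    """Interface expected by Hugging Face Dedicated Inference Endpoints.

    The real handler.py loads the LoRA adapter or full model in __init__;
    this skeleton stubs generation out to show the request/response shape.
    """

    def __init__(self, path: str = ""):
        self.model_path = path  # the endpoint passes the repo checkout directory

    def __call__(self, data: dict) -> dict:
        prompt = data.get("inputs", "")
        start = time.perf_counter()
        generated = f"# stand-in for model output for: {prompt}"  # stubbed generation
        return {
            "code": generated,
            "explanation": "",
            "confidence": 0.0,
            "important_tokens": [],
            "relevancy_score": 0.0,
            "hallucination": False,
            "hallucination_check_reason": "",
            "latency_ms": (time.perf_counter() - start) * 1000.0,
        }
```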