# Implementation Guide

## Goal

Build and run a local fine-tuning pipeline for a coding assistant model with:

- dataset generation
- LoRA fine-tuning
- local inference
- optional Hugging Face upload

## Project Modules

- `generate_dataset.py`
  - Generates training samples into JSON.
  - Supports dataset sizes from 5000 to 10000.
- `finetune_coding_llm_colab.py`
  - Main training module (local usage).
  - Supports dataset generation, training, and optional HF upload.
- `run_pipeline.py`
  - Orchestrates generate -> train -> upload in one command.
  - Reads defaults from `training_config.json`.
- `infer_local.py`
  - Runs inference from the local trained output.
  - Handles both LoRA adapter output and full model output.
  - Returns structured JSON fields including code, explanation, confidence, relevancy, hallucination check, and latency.
- `infer_cloud.py`
  - Runs inference through the Hugging Face API using an HF token.
  - Reuses the local structured-output parser and repair checks so API output matches the local JSON contract.
  - Falls back to the local `model/` folder when Hugging Face does not serve the custom repo through an inference provider.
- `handler.py`
  - Custom Hugging Face Dedicated Inference Endpoint handler.
  - Loads the LoRA adapter/full model and returns the same structured JSON contract directly from the hosted endpoint.
- `evaluate_model.py`
  - Runs a multi-prompt evaluation and reports pass rate (accuracy) for schema + quality checks.
- `upload_to_hf.py`
  - Uploads the local model folder to a Hugging Face model repo.

## Environment Setup

1. Use Python 3.10+ (recommended).
2. Install dependencies:
   - `pip install -r requirements.txt`
3. (Optional) Log in to Hugging Face before upload:
   - `huggingface-cli login`

## Standard Execution Flow

1. Generate dataset:
   - `python generate_dataset.py --size 8000 --out train.json`
2. Train model:
   - `python finetune_coding_llm_colab.py --dataset-size 8000 --train-file train.json --output-dir model --skip-dataset-gen`
3. Test inference:
   - `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
   - Add `--allow-downloads` on a fresh machine if the base model is not cached locally.
4. Evaluate quality:
   - `python evaluate_model.py --model-path model`
5. Upload (optional):
   - `python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name`
6. Test cloud inference (optional):
   - PowerShell: `$env:HF_TOKEN="your_huggingface_token"`
   - `python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"`
   - If you have already logged in with `hf auth login`, the saved token can be used without setting `HF_TOKEN`.
   - Add `--no-local-fallback` if you want the command to fail when HF cloud serving is unavailable.
   - Add `--allow-downloads` if the local fallback needs to download missing base-model files.
   - For true cloud execution, deploy a Hugging Face Dedicated Inference Endpoint and call:
     - `python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback`
   - Users should set their own token with `$env:HF_TOKEN="their_huggingface_token"` before calling the endpoint.

## One-Command Execution

- Run the full pipeline without upload:
  - `python run_pipeline.py --dataset-size 8000 --skip-upload`
- Run with upload:
  - `python run_pipeline.py --dataset-size 8000 --hf-repo your-username/your-model-name`

## Performance Recommendations

- CPU quick validation:
  - `python run_pipeline.py --dataset-size 5000 --max-train-samples 20 --epochs 0.1 --skip-upload`
- Full quality run:
  - `python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --use-4bit --skip-upload`

## Error Handling Rules

- If the dataset file is missing, run `generate_dataset.py`.
- If the model folder is missing, run training first.
- If HF upload fails, verify:
  - `huggingface-cli whoami`
  - repo permission and repo id format (`username/repo`)

## Integration Notes

- `run_pipeline.py` is the recommended entrypoint for regular usage.
- `training_config.json` provides default values that can be overridden by CLI flags.
- Inference works with LoRA adapters and full models automatically.

## Hugging Face Existing Model Update

To update an already published Hugging Face model with current project behavior:

1. Retrain with the latest code:
   - `python run_pipeline.py --dataset-size 8000 --skip-upload`
2. Validate local inference + evaluation:
   - `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
   - `python evaluate_model.py --model-path model`
3. Upload to the same repo id:
   - `python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name`

Optional safer rollout:

- Upload to a revision branch first and test before merging to main.

## Current Output Contract

`infer_local.py` returns JSON with:

- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`

`infer_cloud.py` returns the same JSON keys through the Hugging Face API, or through the local fallback if HF cannot serve the custom repo. Cloud responses may not include token-level probabilities, so `important_tokens` can be empty and `confidence` can be `0.0` unless the serving endpoint exposes token details.

For users calling the hosted model with their own token/API key, deploy the repository as a Hugging Face Dedicated Inference Endpoint. The included `handler.py` makes endpoint responses follow the same JSON pattern, with the same keys listed above.

Direct Hugging Face serverless calls to the model repo are not guaranteed to run custom LoRA repos.
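Because the contract fixes an exact key set, a client can validate an endpoint response before trusting it. The sketch below is illustrative rather than project code: the helper names are hypothetical, and the endpoint URL and token are placeholders you supply. The request shape (`{"inputs": ...}` posted with a bearer token) follows standard Hugging Face Inference Endpoint usage.

```python
import json
import urllib.request

# Keys every response must carry, per the output contract above.
CONTRACT_KEYS = {
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason",
    "latency_ms",
}


def build_request(endpoint_url: str, prompt: str, token: str) -> urllib.request.Request:
    """Build the POST request a Dedicated Inference Endpoint expects."""
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def validate_contract(response: dict) -> list[str]:
    """Return the contract keys missing from a parsed JSON response."""
    return sorted(CONTRACT_KEYS - response.keys())


# Offline demo with a canned response; a real call would pass
# build_request(...) to urllib.request.urlopen and json-decode the body.
canned = {key: None for key in CONTRACT_KEYS}
print(validate_contract(canned))           # [] -> response satisfies the contract
print(validate_contract({"code": "..."}))  # lists the missing keys
```

Failing fast on missing keys keeps the cloud path honest with the local JSON contract instead of silently degrading.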
Dedicated endpoints or a cloud VM are required for true cloud execution.
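For orientation, a custom endpoint handler follows the `EndpointHandler` pattern that Hugging Face Inference Endpoints load from `handler.py`. The sketch below is structural only, not this project's actual handler: model loading is stubbed (a real version would load the tokenizer and LoRA adapter/full model from `path` in `__init__`, e.g. via `transformers` + `peft`), but the `__call__` shape and the returned keys follow the contract above.

```python
import time


class EndpointHandler:
    """Structural sketch of a custom Inference Endpoint handler.

    A real implementation would load the model from `path` in __init__;
    here generation is stubbed so the response shape stays visible.
    """

    def __init__(self, path: str = ""):
        # On a deployed endpoint, the model repo contents are mounted at `path`.
        self.path = path

    def _generate(self, prompt: str) -> str:
        # Stub: a real handler would run tokenizer + model.generate() here.
        return f"# completion for: {prompt!r}"

    def __call__(self, data: dict) -> dict:
        start = time.perf_counter()
        prompt = data.get("inputs", "")
        completion = self._generate(prompt)
        # Return the same structured JSON contract as infer_local.py.
        return {
            "code": completion,
            "explanation": "stubbed generation",
            "confidence": 0.0,  # no token-level probabilities in this sketch
            "important_tokens": [],
            "relevancy_score": 0.0,
            "hallucination": False,
            "hallucination_check_reason": "not evaluated in this sketch",
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }
```

Keeping the contract assembly inside `__call__` means local inference, cloud fallback, and the hosted endpoint can all be checked against one schema.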