# Implementation Guide

## Goal

Build and run a local fine-tuning pipeline for a coding assistant model with:

- dataset generation
- LoRA fine-tuning
- local inference
- optional Hugging Face upload

## Project Modules

- `generate_dataset.py`
  - Generates training samples into JSON.
  - Supports dataset sizes from 5000 to 10000.
- `finetune_coding_llm_colab.py`
  - Main training module (local usage).
  - Supports dataset generation, training, and optional HF upload.
- `run_pipeline.py`
  - Orchestrates generate -> train -> upload in one command.
  - Reads defaults from `training_config.json`.
- `infer_local.py`
  - Runs inference from the local trained output.
  - Handles both LoRA adapter output and full model output.
  - Returns structured JSON fields including code, explanation, confidence, relevancy, hallucination check, and latency.
- `infer_cloud.py`
  - Runs inference through the Hugging Face API using an HF token.
  - Reuses the local structured-output parser and repair checks so API output matches the local JSON contract.
  - Falls back to the local `model/` folder when Hugging Face does not serve the custom repo through an inference provider.
- `handler.py`
  - Custom Hugging Face Dedicated Inference Endpoint handler.
  - Loads the LoRA adapter/full model and returns the same structured JSON contract directly from the hosted endpoint.
- `evaluate_model.py`
  - Runs a multi-prompt evaluation and reports pass rate (accuracy) for schema + quality checks.
- `upload_to_hf.py`
  - Uploads the local model folder to a Hugging Face model repo.

## Environment Setup

1. Use Python 3.10+ (recommended).
2. Install dependencies:
   - `pip install -r requirements.txt`
3. (Optional) Log in to Hugging Face before upload:
   - `huggingface-cli login`

## Standard Execution Flow

1. Generate dataset:
   - `python generate_dataset.py --size 8000 --out train.json`
2. Train model:
   - `python finetune_coding_llm_colab.py --dataset-size 8000 --train-file train.json --output-dir model --skip-dataset-gen`
3. Test inference:
   - `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
   - Add `--allow-downloads` on a fresh machine if the base model is not cached locally.
4. Evaluate quality:
   - `python evaluate_model.py --model-path model`
5. Upload (optional):
   - `python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name`
6. Test cloud inference (optional):
   - PowerShell: `$env:HF_TOKEN="your_huggingface_token"`
   - `python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"`
   - If you have already logged in with `hf auth login`, the saved token can be used without setting `HF_TOKEN`.
   - Add `--no-local-fallback` if you want the command to fail when HF cloud serving is unavailable.
   - Add `--allow-downloads` if the local fallback needs to download missing base-model files.
   - For true cloud execution, deploy a Hugging Face Dedicated Inference Endpoint and call:
     - `python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback`
   - Users should set their own token with `$env:HF_TOKEN="their_huggingface_token"` before calling the endpoint.

## One-Command Execution

- Run the full pipeline without upload:
  - `python run_pipeline.py --dataset-size 8000 --skip-upload`
- Run with upload:
  - `python run_pipeline.py --dataset-size 8000 --hf-repo your-username/your-model-name`

## Performance Recommendations

- CPU quick validation:
  - `python run_pipeline.py --dataset-size 5000 --max-train-samples 20 --epochs 0.1 --skip-upload`
- Full quality run:
  - `python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --use-4bit --skip-upload`

## Error Handling Rules

- If the dataset file is missing, run `generate_dataset.py`.
- If the model folder is missing, run training first.
- If HF upload fails, verify:
  - `huggingface-cli whoami`
  - repo permission and repo id format (`username/repo`)

## Integration Notes

- `run_pipeline.py` is the recommended entrypoint for regular usage.
- `training_config.json` provides default values that can be overridden by CLI flags.
- Inference works with LoRA adapters and full models automatically.

## Hugging Face Existing Model Update

To update an already published Hugging Face model with current project behavior:

1. Retrain with the latest code:
   - `python run_pipeline.py --dataset-size 8000 --skip-upload`
2. Validate local inference + evaluation:
   - `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
   - `python evaluate_model.py --model-path model`
3. Upload to the same repo id:
   - `python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name`

Optional safer rollout:

- Upload to a revision branch first and test before merging to main.

## Current Output Contract

`infer_local.py` returns JSON with:

- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`

`infer_cloud.py` returns the same JSON keys through the Hugging Face API, or through the local fallback if HF cannot serve the custom repo. Cloud responses may not include token-level probabilities, so `important_tokens` can be empty and `confidence` can be `0.0` unless the serving endpoint exposes token details.

For users calling the hosted model with their own token/API key, deploy the repository as a Hugging Face Dedicated Inference Endpoint. The included `handler.py` makes endpoint responses follow the same JSON pattern, with the same keys listed above.

Direct Hugging Face serverless calls to the model repo are not guaranteed to run custom LoRA repos.
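Because the contract fixes an exact key set, a client can validate an endpoint response before trusting it. The sketch below is illustrative rather than project code: the helper names are hypothetical, and the endpoint URL and token are placeholders you supply. The request shape (`{"inputs": ...}` posted with a bearer token) follows standard Hugging Face Inference Endpoint usage.

```python
import json
import urllib.request

# Keys every response must carry, per the output contract above.
CONTRACT_KEYS = {
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason",
    "latency_ms",
}


def build_request(endpoint_url: str, prompt: str, token: str) -> urllib.request.Request:
    """Build the POST request a Dedicated Inference Endpoint expects."""
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def validate_contract(response: dict) -> list[str]:
    """Return the contract keys missing from a parsed JSON response."""
    return sorted(CONTRACT_KEYS - response.keys())


# Offline demo with a canned response; a real call would pass
# build_request(...) to urllib.request.urlopen and json-decode the body.
canned = {key: None for key in CONTRACT_KEYS}
print(validate_contract(canned))           # [] -> response satisfies the contract
print(validate_contract({"code": "..."}))  # lists the missing keys
```

Failing fast on missing keys keeps the cloud path honest with the local JSON contract instead of silently degrading.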
Dedicated endpoints or a cloud VM are required for true cloud execution.
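For orientation, a custom endpoint handler follows the `EndpointHandler` pattern that Hugging Face Inference Endpoints load from `handler.py`. The sketch below is structural only, not this project's actual handler: model loading is stubbed (a real version would load the tokenizer and LoRA adapter/full model from `path` in `__init__`, e.g. via `transformers` + `peft`), but the `__call__` shape and the returned keys follow the contract above.

```python
import time


class EndpointHandler:
    """Structural sketch of a custom Inference Endpoint handler.

    A real implementation would load the model from `path` in __init__;
    here generation is stubbed so the response shape stays visible.
    """

    def __init__(self, path: str = ""):
        # On a deployed endpoint, the model repo contents are mounted at `path`.
        self.path = path

    def _generate(self, prompt: str) -> str:
        # Stub: a real handler would run tokenizer + model.generate() here.
        return f"# completion for: {prompt!r}"

    def __call__(self, data: dict) -> dict:
        start = time.perf_counter()
        prompt = data.get("inputs", "")
        completion = self._generate(prompt)
        # Return the same structured JSON contract as infer_local.py.
        return {
            "code": completion,
            "explanation": "stubbed generation",
            "confidence": 0.0,  # no token-level probabilities in this sketch
            "important_tokens": [],
            "relevancy_score": 0.0,
            "hallucination": False,
            "hallucination_check_reason": "not evaluated in this sketch",
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }
```

Keeping the contract assembly inside `__call__` means local inference, cloud fallback, and the hosted endpoint can all be checked against one schema.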