# Implementation Guide
## Goal
Build and run a local fine-tuning pipeline for a coding assistant model with:
- dataset generation
- LoRA fine-tuning
- local inference
- optional Hugging Face upload
## Project Modules
- `generate_dataset.py`
- Generates training samples into JSON.
- Supports dataset size range: 5000 to 10000.
- `finetune_coding_llm_colab.py`
- Main training module (local usage).
- Supports dataset generation, training, and optional HF upload.
- `run_pipeline.py`
- Orchestrates generate -> train -> upload in one command.
- Reads defaults from `training_config.json`.
- `infer_local.py`
- Runs inference from local trained output.
- Handles both LoRA adapter output and full model output.
- Returns structured JSON fields including code, explanation, confidence, relevancy, hallucination check, and latency.
- `infer_cloud.py`
- Runs inference through the Hugging Face API using an HF token.
- Reuses the local structured-output parser and repair checks so API output matches the local JSON contract.
- Falls back to the local `model/` folder when Hugging Face does not serve the custom repo through an inference provider.
- `handler.py`
- Custom Hugging Face Dedicated Inference Endpoint handler.
- Loads the LoRA adapter/full model and returns the same structured JSON contract directly from the hosted endpoint.
- `evaluate_model.py`
- Runs a multi-prompt evaluation and reports pass rate (accuracy) for schema + quality checks.
- `upload_to_hf.py`
- Uploads local model folder to Hugging Face model repo.
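The inference modules above all emit one structured JSON contract. A minimal sketch of a conformance check for that contract (the helper name `validate_inference_output` and the type choices are illustrative, not part of the repo):

```python
# Expected keys of the structured inference output, with permissive numeric
# types (confidence/latency may arrive as int or float). Illustrative only;
# the actual modules may validate differently.
REQUIRED_KEYS = {
    "code": str,
    "explanation": str,
    "confidence": (int, float),
    "important_tokens": list,
    "relevancy_score": (int, float),
    "hallucination": bool,
    "hallucination_check_reason": str,
    "latency_ms": (int, float),
}

def validate_inference_output(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload conforms."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected):
            problems.append(f"wrong type for {key}")
    return problems
```

This is useful when wiring the modules together, since `infer_local.py`, `infer_cloud.py`, and the endpoint handler are all supposed to agree on these keys.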
## Environment Setup
1. Use Python 3.10+ (recommended).
2. Install dependencies:
- `pip install -r requirements.txt`
3. (Optional) Login to Hugging Face before upload:
- `huggingface-cli login`
## Standard Execution Flow
1. Generate dataset:
- `python generate_dataset.py --size 8000 --out train.json`
2. Train model:
- `python finetune_coding_llm_colab.py --dataset-size 8000 --train-file train.json --output-dir model --skip-dataset-gen`
3. Test inference:
- `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
- Add `--allow-downloads` on a fresh machine if the base model is not cached locally.
4. Evaluate quality:
- `python evaluate_model.py --model-path model`
5. Upload (optional):
- `python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name`
6. Test cloud inference (optional):
- PowerShell: `$env:HF_TOKEN="your_huggingface_token"`
- `python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"`
- If you already logged in with `hf auth login`, the saved token can be used without setting `HF_TOKEN`.
- Add `--no-local-fallback` if you want the command to fail when HF cloud serving is unavailable.
- Add `--allow-downloads` if local fallback needs to download missing base-model files.
- For true cloud execution, deploy a Hugging Face Dedicated Inference Endpoint and call:
- `python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback`
- Users should set their own token with `$env:HF_TOKEN="their_huggingface_token"` before calling the endpoint.
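The dedicated-endpoint call above follows the standard Hugging Face Inference Endpoint convention: Bearer-token auth and a JSON body with an `inputs` field, which `handler.py` reads server-side. A sketch of building that request with only the standard library (the helper name is illustrative):

```python
import json
import urllib.request

def build_endpoint_request(endpoint_url: str, prompt: str, token: str) -> urllib.request.Request:
    """Build an HTTP request for a Hugging Face Dedicated Inference Endpoint.

    Assumes the standard endpoint convention: Bearer auth and a JSON body
    with an "inputs" field.
    """
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually call the endpoint (network access and a valid token required):
# req = build_endpoint_request(url, "Fix this code: ...", os.environ["HF_TOKEN"])
# with urllib.request.urlopen(req) as resp:
#     result = json.loads(resp.read())
```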
## One-Command Execution
- Run full pipeline without upload:
- `python run_pipeline.py --dataset-size 8000 --skip-upload`
- Run with upload:
- `python run_pipeline.py --dataset-size 8000 --hf-repo your-username/your-model-name`
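A wrapper script can assemble either of the commands above programmatically. A sketch, mirroring the documented flags (the helper itself is illustrative and not part of the repo):

```python
from typing import List, Optional

def build_pipeline_cmd(dataset_size: int, hf_repo: Optional[str] = None) -> List[str]:
    """Assemble the run_pipeline.py command line: upload when a repo id is
    given, otherwise pass --skip-upload."""
    cmd = ["python", "run_pipeline.py", "--dataset-size", str(dataset_size)]
    if hf_repo:
        cmd += ["--hf-repo", hf_repo]
    else:
        cmd.append("--skip-upload")
    return cmd

# subprocess.run(build_pipeline_cmd(8000), check=True) runs the no-upload variant.
```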
## Performance Recommendations
- CPU quick validation:
- `python run_pipeline.py --dataset-size 5000 --max-train-samples 20 --epochs 0.1 --skip-upload`
- Full quality run:
- `python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --use-4bit --skip-upload`
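The gap between the two runs above is easiest to see as optimizer-step counts. A back-of-the-envelope estimate, assuming no gradient accumulation (a simplification; the actual trainer configuration may differ):

```python
import math

def estimate_steps(num_samples: int, epochs: float, batch_size: int) -> int:
    """Rough optimizer-step estimate, assuming no gradient accumulation."""
    return math.ceil(num_samples / batch_size * epochs)

# Quick CPU validation caps training at 20 samples for 0.1 epochs:
# estimate_steps(20, 0.1, 1) -> 2 steps, versus
# estimate_steps(8000, 3, 2) -> 12000 steps for the full quality run.
```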
## Error Handling Rules
- If the dataset file is missing, generate it with `generate_dataset.py`.
- If the model folder is missing, run training first.
- If HF upload fails, verify:
- `huggingface-cli whoami`
- repo permission and repo id format (`username/repo`)
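The first two rules above are simple filesystem preconditions, so a wrapper can check them before launching anything. A sketch (the function name and hint wording are illustrative):

```python
from pathlib import Path

def preflight(train_file: str = "train.json", model_dir: str = "model") -> list:
    """Return remediation hints for the missing-file cases listed above;
    an empty list means both preconditions hold."""
    hints = []
    if not Path(train_file).exists():
        hints.append(
            f"dataset missing: run python generate_dataset.py --size 8000 --out {train_file}"
        )
    if not Path(model_dir).is_dir():
        hints.append(f"model folder missing: run training first (expected {model_dir}/)")
    return hints
```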
## Integration Notes
- `run_pipeline.py` is the recommended entrypoint for regular usage.
- `training_config.json` provides default values and can be overridden by CLI flags.
- Inference works with LoRA adapters and full models automatically.
## Hugging Face Existing Model Update
To update an already published Hugging Face model with current project behavior:
1. Retrain with latest code:
- `python run_pipeline.py --dataset-size 8000 --skip-upload`
2. Validate local inference + evaluation:
- `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
- `python evaluate_model.py --model-path model`
3. Upload to same repo id:
- `python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name`
Optional safer rollout:
- Upload to a revision branch first and test before merging to main.
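The staged rollout above maps onto the `huggingface_hub` branch APIs. A sketch that only assembles the keyword arguments (assumes `huggingface_hub` is installed for the commented calls; the branch name `staging` is illustrative):

```python
def staged_upload_kwargs(model_dir: str, repo_id: str, branch: str = "staging") -> dict:
    """Keyword arguments for huggingface_hub.upload_folder targeting a
    revision branch instead of main."""
    return {
        "folder_path": model_dir,
        "repo_id": repo_id,
        "repo_type": "model",
        "revision": branch,
        "commit_message": f"staged upload to {branch}",
    }

# from huggingface_hub import HfApi
# api = HfApi()
# api.create_branch(repo_id, branch="staging", exist_ok=True)  # branch must exist
# api.upload_folder(**staged_upload_kwargs("model", repo_id))
# After testing against revision="staging", merge or re-upload to main.
```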
## Current Output Contract
`infer_local.py` returns JSON with:
- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`
`infer_cloud.py` returns the same JSON keys through the Hugging Face API, or through local fallback if HF cannot serve the custom repo. Cloud responses may not include token-level probabilities, so `important_tokens` can be empty and `confidence` can be `0.0` unless the serving endpoint exposes token details.
For users calling the hosted model with their own token/API key, deploy the repository as a Hugging Face Dedicated Inference Endpoint. The included `handler.py` ensures endpoint responses follow the same JSON contract:
- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`
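Hugging Face custom handlers expose a class named `EndpointHandler` with `__init__(path)` and `__call__(data)`. The sketch below follows that interface but replaces real model loading with an injectable `generate` function so it stays self-contained; the repo's actual `handler.py` loads the LoRA adapter/full model instead:

```python
import time

class EndpointHandler:
    """Sketch of a custom Hugging Face endpoint handler returning the
    structured contract above. The placeholder generate function and the
    zeroed confidence/relevancy values are illustrative."""

    def __init__(self, path: str = "", generate=None):
        # The real handler loads the tokenizer and model from `path` here.
        self.generate = generate or (lambda prompt: f"# TODO: fix\n{prompt}")

    def __call__(self, data: dict) -> dict:
        prompt = data.get("inputs", "")
        start = time.perf_counter()
        code = self.generate(prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        # Same keys as infer_local.py / infer_cloud.py output.
        return {
            "code": code,
            "explanation": "",
            "confidence": 0.0,
            "important_tokens": [],
            "relevancy_score": 0.0,
            "hallucination": False,
            "hallucination_check_reason": "",
            "latency_ms": latency_ms,
        }
```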
Serverless Hugging Face inference calls are not guaranteed to serve custom LoRA repos. A Dedicated Inference Endpoint or a cloud VM is required for true cloud execution.