# Implementation Guide

## Goal

Build and run a local fine-tuning pipeline for a coding assistant model with:
- dataset generation
- LoRA fine-tuning
- local inference
- optional Hugging Face upload

## Project Modules

- `generate_dataset.py`
  - Generates training samples into JSON.
  - Supports dataset sizes from 5,000 to 10,000 samples.
- `finetune_coding_llm_colab.py`
  - Main training module (local usage).
  - Supports dataset generation, training, and optional HF upload.
- `run_pipeline.py`
  - Orchestrates generate -> train -> upload in one command.
  - Reads defaults from `training_config.json`.
- `infer_local.py`
  - Runs inference from local trained output.
  - Handles both LoRA adapter output and full model output.
  - Returns structured JSON fields including code, explanation, confidence, relevancy, hallucination check, and latency.
- `infer_cloud.py`
  - Runs inference through the Hugging Face API using an HF token.
  - Reuses the local structured-output parser and repair checks so API output matches the local JSON contract.
  - Falls back to the local `model/` folder when Hugging Face does not serve the custom repo through an inference provider.
- `handler.py`
  - Custom handler for Hugging Face Dedicated Inference Endpoints (a sketch of the handler contract follows this list).
  - Loads the LoRA adapter/full model and returns the same structured JSON contract directly from the hosted endpoint.
- `evaluate_model.py`
  - Runs a multi-prompt evaluation and reports pass rate (accuracy) for schema + quality checks.
- `upload_to_hf.py`
  - Uploads local model folder to Hugging Face model repo.
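
For reference, a Dedicated Inference Endpoint discovers a custom handler through an `EndpointHandler` class with an `__init__(path)` / `__call__(data)` shape. The sketch below illustrates that contract only; the model loading and generation details in the repo's `handler.py` will differ, and the returned values here are placeholders.

```python
# Illustrative sketch of the EndpointHandler contract, not the repo's handler.py.
import time
from typing import Any, Dict


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` points at the model repository checked out by the endpoint.
        # The real handler loads the LoRA adapter or full model from here.
        self.path = path

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        prompt = data.get("inputs", "")
        start = time.perf_counter()
        # ... generate a completion for `prompt` with the loaded model ...
        latency_ms = (time.perf_counter() - start) * 1000
        # Return the same structured contract that infer_local.py produces.
        return {
            "code": "",
            "explanation": "",
            "confidence": 0.0,
            "important_tokens": [],
            "relevancy_score": 0.0,
            "hallucination": False,
            "hallucination_check_reason": "",
            "latency_ms": latency_ms,
        }
```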

## Environment Setup

1. Use Python 3.10+ (recommended).
2. Install dependencies:
   - `pip install -r requirements.txt`
3. (Optional) Login to Hugging Face before upload:
   - `huggingface-cli login`
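
As a quick post-install sanity check, confirm the core libraries import. This assumes the stack includes `torch`, `transformers`, and `peft` (implied by LoRA fine-tuning; `requirements.txt` is the authoritative list):

```python
# Optional sanity check, not part of the repo.
import torch
import transformers
import peft

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}, peft {peft.__version__}")
```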

## Standard Execution Flow

1. Generate dataset:
   - `python generate_dataset.py --size 8000 --out train.json`
2. Train model:
   - `python finetune_coding_llm_colab.py --dataset-size 8000 --train-file train.json --output-dir model --skip-dataset-gen`
3. Test inference:
   - `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
   - Add `--allow-downloads` on a fresh machine if the base model is not cached locally.
4. Evaluate quality:
   - `python evaluate_model.py --model-path model`
5. Upload (optional):
   - `python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name`
6. Test cloud inference (optional):
   - PowerShell: `$env:HF_TOKEN="your_huggingface_token"`
   - `python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"`
   - If you already logged in with `huggingface-cli login` (or the newer `hf auth login`), the saved token can be used without setting `HF_TOKEN`.
   - Add `--no-local-fallback` if you want the command to fail when HF cloud serving is unavailable.
   - Add `--allow-downloads` if local fallback needs to download missing base-model files.
   - For true cloud execution, deploy a Hugging Face Dedicated Inference Endpoint and call:
     - `python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback`
   - End users calling the endpoint should set their own token first (PowerShell: `$env:HF_TOKEN="their_huggingface_token"`); a minimal direct-call sketch follows this list.
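
For end users who only have the endpoint URL and a token, a direct HTTP call works as well. A minimal sketch using the standard endpoint request format (the URL is a placeholder and the token is read from `HF_TOKEN`):

```python
# Minimal direct call to a Dedicated Inference Endpoint (sketch).
import json
import os

import requests

ENDPOINT_URL = "https://your-endpoint-url.endpoints.huggingface.cloud"  # placeholder

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Fix this code: def add(a,b) return a+b"},
    timeout=120,
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```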

## One-Command Execution

- Run full pipeline without upload:
  - `python run_pipeline.py --dataset-size 8000 --skip-upload`

- Run with upload:
  - `python run_pipeline.py --dataset-size 8000 --hf-repo your-username/your-model-name`

## Performance Recommendations

- CPU quick validation:
  - `python run_pipeline.py --dataset-size 5000 --max-train-samples 20 --epochs 0.1 --skip-upload`
- Full quality run:
  - `python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --use-4bit --skip-upload`

## Error Handling Rules

- If the dataset file is missing, run `generate_dataset.py`.
- If the model folder is missing, run training first.
- If HF upload fails, verify:
  - `huggingface-cli whoami`
  - repo permission and repo id format (`username/repo`)
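
These rules can be checked before a run. A hypothetical preflight helper (not part of the repo) that mirrors them, using the file and folder names from the commands above:

```python
# preflight.py - hypothetical helper mirroring the error-handling rules.
import sys
from pathlib import Path


def preflight(train_file: str = "train.json", model_dir: str = "model") -> None:
    if not Path(train_file).is_file():
        sys.exit(f"{train_file} missing: run generate_dataset.py first")
    if not Path(model_dir).is_dir():
        sys.exit(f"{model_dir}/ missing: run training before inference or upload")


if __name__ == "__main__":
    preflight()
```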

## Integration Notes

- `run_pipeline.py` is the recommended entrypoint for regular usage.
- `training_config.json` provides default values that CLI flags can override (a plausible shape is sketched after this list).
- Inference works with LoRA adapters and full models automatically.
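
A plausible shape for `training_config.json`, using the CLI flag names shown above as keys; the actual keys and defaults in the repo may differ:

```json
{
  "dataset_size": 8000,
  "epochs": 3,
  "batch_size": 2,
  "learning_rate": 1e-4,
  "max_length": 512,
  "use_4bit": true,
  "output_dir": "model"
}
```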

## Hugging Face Existing Model Update

To update an already published Hugging Face model with current project behavior:

1. Retrain with latest code:
   - `python run_pipeline.py --dataset-size 8000 --skip-upload`
2. Validate local inference + evaluation:
   - `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
   - `python evaluate_model.py --model-path model`
3. Upload to same repo id:
   - `python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name`

Optional safer rollout:
- Upload to a revision branch first and test before merging to main.
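
One way to do the branch-first rollout with `huggingface_hub` directly; the branch name `v2-candidate` is a placeholder, and whether `upload_to_hf.py` exposes an equivalent flag is not assumed:

```python
# Sketch: push the model folder to a test branch before touching main.
from huggingface_hub import HfApi

api = HfApi()  # uses the token saved by `huggingface-cli login`
repo_id = "your-username/your-existing-model-name"

api.create_branch(repo_id=repo_id, branch="v2-candidate", exist_ok=True)
api.upload_folder(
    folder_path="model",
    repo_id=repo_id,
    revision="v2-candidate",  # upload to the branch, not main
)
```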

## Current Output Contract

`infer_local.py` returns JSON with the following keys (an illustrative response follows the list):
- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`
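
An illustrative response (all values are placeholders, not real model output):

```json
{
  "code": "def add(a, b):\n    return a + b",
  "explanation": "Added the missing colon in the function definition.",
  "confidence": 0.91,
  "important_tokens": ["def", "return"],
  "relevancy_score": 0.95,
  "hallucination": false,
  "hallucination_check_reason": "Output stays within the prompt's scope.",
  "latency_ms": 842
}
```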

`infer_cloud.py` returns the same JSON keys through the Hugging Face API, or through local fallback if HF cannot serve the custom repo. Cloud responses may not include token-level probabilities, so `important_tokens` can be empty and `confidence` can be `0.0` unless the serving endpoint exposes token details.

For users calling the hosted model with their own token/API key, deploy the repository as a Hugging Face Dedicated Inference Endpoint. The included `handler.py` makes endpoint responses follow the same JSON contract listed above.

Direct serverless Inference API calls to the model repo are not guaranteed to run custom LoRA repos; a Dedicated Inference Endpoint or a cloud VM is required for true cloud execution.