Implementation Guide

Goal

Build and run a local fine-tuning pipeline for a coding assistant model with:

  • dataset generation
  • LoRA fine-tuning
  • local inference
  • optional Hugging Face upload

Project Modules

  • generate_dataset.py
    • Generates training samples into JSON.
    • Supports dataset sizes from 5000 to 10000 samples.
  • finetune_coding_llm_colab.py
    • Main training module (local usage).
    • Supports dataset generation, training, and optional HF upload.
  • run_pipeline.py
    • Orchestrates generate -> train -> upload in one command.
    • Reads defaults from training_config.json.
  • infer_local.py
    • Runs inference from local trained output.
    • Handles both LoRA adapter output and full model output.
    • Returns structured JSON fields including code, explanation, confidence, relevancy, hallucination check, and latency.
  • infer_cloud.py
    • Runs inference through the Hugging Face API using an HF token.
    • Reuses the local structured-output parser and repair checks so API output matches the local JSON contract.
    • Falls back to the local model/ folder when Hugging Face does not serve the custom repo through an inference provider.
  • handler.py
    • Custom Hugging Face Dedicated Inference Endpoint handler.
    • Loads the LoRA adapter/full model and returns the same structured JSON contract directly from the hosted endpoint.
  • evaluate_model.py
    • Runs a multi-prompt evaluation and reports pass rate (accuracy) for schema + quality checks.
  • upload_to_hf.py
    • Uploads local model folder to Hugging Face model repo.

Environment Setup

  1. Use Python 3.10+ (recommended).
  2. Install dependencies:
    • pip install -r requirements.txt
  3. (Optional) Log in to Hugging Face before uploading:
    • huggingface-cli login

Standard Execution Flow

  1. Generate dataset:
    • python generate_dataset.py --size 8000 --out train.json
  2. Train model:
    • python finetune_coding_llm_colab.py --dataset-size 8000 --train-file train.json --output-dir model --skip-dataset-gen
  3. Test inference:
    • python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"
    • Add --allow-downloads on a fresh machine if the base model is not cached locally.
  4. Evaluate quality:
    • python evaluate_model.py --model-path model
  5. Upload (optional):
    • python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name
  6. Test cloud inference (optional):
    • PowerShell: $env:HF_TOKEN="your_huggingface_token"
    • python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"
    • If you already logged in with hf auth login, the saved token can be used without setting HF_TOKEN.
    • Add --no-local-fallback if you want the command to fail when HF cloud serving is unavailable.
    • Add --allow-downloads if local fallback needs to download missing base-model files.
    • For true cloud execution, deploy a Hugging Face Dedicated Inference Endpoint and call:
      • python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback
    • Each user should set their own token (PowerShell: $env:HF_TOKEN="their_huggingface_token") before calling the endpoint.
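
As a sketch of what the endpoint call does under the hood, using only the standard Dedicated Inference Endpoint request contract (a JSON body with an "inputs" field plus a bearer token); the function names here are illustrative, not part of infer_cloud.py:

```python
import json
import urllib.request

def build_endpoint_request(endpoint_url: str, prompt: str, token: str) -> urllib.request.Request:
    """Build the POST request a Dedicated Inference Endpoint expects.

    Assumes the standard endpoint contract: JSON body {"inputs": ...},
    authenticated with a bearer token (e.g. taken from HF_TOKEN).
    """
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def query_endpoint(endpoint_url: str, prompt: str, token: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the structured JSON response."""
    req = build_endpoint_request(endpoint_url, prompt, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the auth header and payload shape easy to inspect and test without a live endpoint.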

One-Command Execution

  • Run full pipeline without upload:
    • python run_pipeline.py --dataset-size 8000 --skip-upload
  • Run with upload:
    • python run_pipeline.py --dataset-size 8000 --hf-repo your-username/your-model-name

Performance Recommendations

  • CPU quick validation:
    • python run_pipeline.py --dataset-size 5000 --max-train-samples 20 --epochs 0.1 --skip-upload
  • Full quality run:
    • python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --use-4bit --skip-upload

Error Handling Rules

  • If the dataset file is missing, run generate_dataset.py.
  • If the model folder is missing, run training first.
  • If HF upload fails, verify:
    • huggingface-cli whoami
    • repo permission and repo id format (username/repo)
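
The first two checks can be automated with a small preflight helper; the sketch below is illustrative, not part of the project:

```python
from pathlib import Path

def preflight(train_file: str = "train.json", model_dir: str = "model") -> list[str]:
    """Return remediation hints for any missing pipeline artifacts."""
    hints = []
    if not Path(train_file).is_file():
        hints.append(f"{train_file} missing: run generate_dataset.py first")
    if not Path(model_dir).is_dir():
        hints.append(f"{model_dir}/ missing: run training before inference or upload")
    return hints  # empty list means both artifacts are in place
```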

Integration Notes

  • run_pipeline.py is the recommended entrypoint for regular usage.
  • training_config.json provides default values and can be overridden by CLI flags.
  • Inference works with LoRA adapters and full models automatically.
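
The defaults-plus-override behavior can be sketched with argparse, where values from training_config.json seed the flag defaults so any flag passed on the command line wins (the key names here are assumptions about the config schema, not its verified contents):

```python
import argparse
import json
from pathlib import Path

def load_config(path: str = "training_config.json") -> dict:
    """Defaults come from training_config.json; a missing file means no defaults."""
    p = Path(path)
    return json.loads(p.read_text()) if p.is_file() else {}

def parse_args(argv: list[str], defaults: dict) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # JSON values become flag defaults; explicit CLI flags override them.
    parser.add_argument("--dataset-size", type=int,
                        default=defaults.get("dataset_size", 8000))
    parser.add_argument("--epochs", type=float,
                        default=defaults.get("epochs", 3))
    return parser.parse_args(argv)
```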

Updating an Existing Hugging Face Model

To update an already published Hugging Face model with current project behavior:

  1. Retrain with latest code:
    • python run_pipeline.py --dataset-size 8000 --skip-upload
  2. Validate local inference + evaluation:
    • python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"
    • python evaluate_model.py --model-path model
  3. Upload to same repo id:
    • python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name

Optional safer rollout:

  • Upload to a revision branch first and test before merging to main.

Current Output Contract

infer_local.py returns JSON with:

  • code
  • explanation
  • confidence
  • important_tokens
  • relevancy_score
  • hallucination
  • hallucination_check_reason
  • latency_ms
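
A small validator can assert this contract on any response, local or cloud (an illustrative helper, not part of infer_local.py):

```python
import json

CONTRACT_KEYS = (
    "code", "explanation", "confidence", "important_tokens",
    "relevancy_score", "hallucination", "hallucination_check_reason",
    "latency_ms",
)

def validate_contract(raw_json: str) -> dict:
    """Parse an inference response and verify every contract field is present."""
    result = json.loads(raw_json)
    missing = [k for k in CONTRACT_KEYS if k not in result]
    if missing:
        raise ValueError(f"response missing contract fields: {missing}")
    return result
```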

infer_cloud.py returns the same JSON keys through the Hugging Face API, or through local fallback if HF cannot serve the custom repo. Cloud responses may not include token-level probabilities, so important_tokens can be empty and confidence can be 0.0 unless the serving endpoint exposes token details.

For users calling the hosted model with their own token/API key, deploy the repository as a Hugging Face Dedicated Inference Endpoint. The included handler.py ensures endpoint responses follow the same eight-field JSON contract listed above, returned directly from the hosted endpoint.

Hugging Face serverless inference is not guaranteed to serve custom LoRA repos. A Dedicated Inference Endpoint or a cloud VM is required for true cloud execution.
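
For reference, a Dedicated Inference Endpoint loads an EndpointHandler class from a top-level handler.py. The skeleton below shows that interface shape with model loading and generation stubbed out; the project's actual handler.py loads the LoRA adapter or full model at this point:

```python
import time

class EndpointHandler:
    """Interface expected by Hugging Face Dedicated Inference Endpoints.

    The real handler.py loads the LoRA adapter or full model in __init__;
    this skeleton stubs generation out to show the request/response shape.
    """

    def __init__(self, path: str = ""):
        self.model_path = path  # the endpoint passes the repo checkout directory

    def __call__(self, data: dict) -> dict:
        prompt = data.get("inputs", "")
        start = time.perf_counter()
        generated = f"# stand-in for model output for: {prompt}"  # stubbed generation
        return {
            "code": generated,
            "explanation": "",
            "confidence": 0.0,
            "important_tokens": [],
            "relevancy_score": 0.0,
            "hallucination": False,
            "hallucination_check_reason": "",
            "latency_ms": (time.perf_counter() - start) * 1000.0,
        }
```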