# Implementation Guide
## Goal
Build and run a local fine-tuning pipeline for a coding assistant model with:
- dataset generation
- LoRA fine-tuning
- local inference
- optional Hugging Face upload
## Project Modules
- `generate_dataset.py`
- Generates training samples into JSON.
- Supports dataset size range: 5000 to 10000.
- `finetune_coding_llm_colab.py`
- Main training module (local usage).
- Supports dataset generation, training, and optional HF upload.
- `run_pipeline.py`
- Orchestrates generate -> train -> upload in one command.
- Reads defaults from `training_config.json`.
- `infer_local.py`
- Runs inference from local trained output.
- Handles both LoRA adapter output and full model output.
- Returns structured JSON fields including code, explanation, confidence, relevancy, hallucination check, and latency.
- `infer_cloud.py`
- Runs inference through the Hugging Face API using an HF token.
- Reuses the local structured-output parser and repair checks so API output matches the local JSON contract.
- Falls back to the local `model/` folder when Hugging Face does not serve the custom repo through an inference provider.
- `handler.py`
- Custom Hugging Face Dedicated Inference Endpoint handler.
- Loads the LoRA adapter/full model and returns the same structured JSON contract directly from the hosted endpoint.
- `evaluate_model.py`
- Runs a multi-prompt evaluation and reports pass rate (accuracy) for schema + quality checks.
- `upload_to_hf.py`
- Uploads local model folder to Hugging Face model repo.
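The inference modules above all emit one structured JSON contract. A minimal sketch of a conformance check for that contract (the helper name `validate_inference_output` and the type choices are illustrative, not part of the repo):

```python
# Expected keys of the structured inference output, with permissive numeric
# types (confidence/latency may arrive as int or float). Illustrative only;
# the actual modules may validate differently.
REQUIRED_KEYS = {
    "code": str,
    "explanation": str,
    "confidence": (int, float),
    "important_tokens": list,
    "relevancy_score": (int, float),
    "hallucination": bool,
    "hallucination_check_reason": str,
    "latency_ms": (int, float),
}

def validate_inference_output(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload conforms."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected):
            problems.append(f"wrong type for {key}")
    return problems
```

This is useful when wiring the modules together, since `infer_local.py`, `infer_cloud.py`, and the endpoint handler are all supposed to agree on these keys.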
## Environment Setup
1. Use Python 3.10+ (recommended).
2. Install dependencies:
- `pip install -r requirements.txt`
3. (Optional) Login to Hugging Face before upload:
- `huggingface-cli login`
## Standard Execution Flow
1. Generate dataset:
- `python generate_dataset.py --size 8000 --out train.json`
2. Train model:
- `python finetune_coding_llm_colab.py --dataset-size 8000 --train-file train.json --output-dir model --skip-dataset-gen`
3. Test inference:
- `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
- Add `--allow-downloads` on a fresh machine if the base model is not cached locally.
4. Evaluate quality:
- `python evaluate_model.py --model-path model`
5. Upload (optional):
- `python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name`
6. Test cloud inference (optional):
- PowerShell: `$env:HF_TOKEN="your_huggingface_token"`
- `python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"`
- If you already logged in with `hf auth login`, the saved token can be used without setting `HF_TOKEN`.
- Add `--no-local-fallback` if you want the command to fail when HF cloud serving is unavailable.
- Add `--allow-downloads` if local fallback needs to download missing base-model files.
- For true cloud execution, deploy a Hugging Face Dedicated Inference Endpoint and call:
- `python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback`
- Users should set their own token with `$env:HF_TOKEN="their_huggingface_token"` before calling the endpoint.
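The dedicated-endpoint call above follows the standard Hugging Face Inference Endpoint convention: Bearer-token auth and a JSON body with an `inputs` field, which `handler.py` reads server-side. A sketch of building that request with only the standard library (the helper name is illustrative):

```python
import json
import urllib.request

def build_endpoint_request(endpoint_url: str, prompt: str, token: str) -> urllib.request.Request:
    """Build an HTTP request for a Hugging Face Dedicated Inference Endpoint.

    Assumes the standard endpoint convention: Bearer auth and a JSON body
    with an "inputs" field.
    """
    body = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To actually call the endpoint (network access and a valid token required):
# req = build_endpoint_request(url, "Fix this code: ...", os.environ["HF_TOKEN"])
# with urllib.request.urlopen(req) as resp:
#     result = json.loads(resp.read())
```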
## One-Command Execution
- Run full pipeline without upload:
- `python run_pipeline.py --dataset-size 8000 --skip-upload`
- Run with upload:
- `python run_pipeline.py --dataset-size 8000 --hf-repo your-username/your-model-name`
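A wrapper script can assemble either of the commands above programmatically. A sketch, mirroring the documented flags (the helper itself is illustrative and not part of the repo):

```python
from typing import List, Optional

def build_pipeline_cmd(dataset_size: int, hf_repo: Optional[str] = None) -> List[str]:
    """Assemble the run_pipeline.py command line: upload when a repo id is
    given, otherwise pass --skip-upload."""
    cmd = ["python", "run_pipeline.py", "--dataset-size", str(dataset_size)]
    if hf_repo:
        cmd += ["--hf-repo", hf_repo]
    else:
        cmd.append("--skip-upload")
    return cmd

# subprocess.run(build_pipeline_cmd(8000), check=True) runs the no-upload variant.
```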
## Performance Recommendations
- CPU quick validation:
- `python run_pipeline.py --dataset-size 5000 --max-train-samples 20 --epochs 0.1 --skip-upload`
- Full quality run:
- `python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --use-4bit --skip-upload`
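The gap between the two runs above is easiest to see as optimizer-step counts. A back-of-the-envelope estimate, assuming no gradient accumulation (a simplification; the actual trainer configuration may differ):

```python
import math

def estimate_steps(num_samples: int, epochs: float, batch_size: int) -> int:
    """Rough optimizer-step estimate, assuming no gradient accumulation."""
    return math.ceil(num_samples / batch_size * epochs)

# Quick CPU validation caps training at 20 samples for 0.1 epochs:
# estimate_steps(20, 0.1, 1) -> 2 steps, versus
# estimate_steps(8000, 3, 2) -> 12000 steps for the full quality run.
```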
## Error Handling Rules
- If the dataset file is missing, generate it with `generate_dataset.py`.
- If the model folder is missing, run training first.
- If HF upload fails, verify:
- `huggingface-cli whoami`
- repo permission and repo id format (`username/repo`)
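The first two rules above are simple filesystem preconditions, so a wrapper can check them before launching anything. A sketch (the function name and hint wording are illustrative):

```python
from pathlib import Path

def preflight(train_file: str = "train.json", model_dir: str = "model") -> list:
    """Return remediation hints for the missing-file cases listed above;
    an empty list means both preconditions hold."""
    hints = []
    if not Path(train_file).exists():
        hints.append(
            f"dataset missing: run python generate_dataset.py --size 8000 --out {train_file}"
        )
    if not Path(model_dir).is_dir():
        hints.append(f"model folder missing: run training first (expected {model_dir}/)")
    return hints
```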
## Integration Notes
- `run_pipeline.py` is the recommended entrypoint for regular usage.
- `training_config.json` provides default values and can be overridden by CLI flags.
- Inference works with LoRA adapters and full models automatically.
## Hugging Face Existing Model Update
To update an already published Hugging Face model with current project behavior:
1. Retrain with latest code:
- `python run_pipeline.py --dataset-size 8000 --skip-upload`
2. Validate local inference + evaluation:
- `python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"`
- `python evaluate_model.py --model-path model`
3. Upload to same repo id:
- `python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name`
Optional safer rollout:
- Upload to a revision branch first and test before merging to main.
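The staged rollout above maps onto the `huggingface_hub` branch APIs. A sketch that only assembles the keyword arguments (assumes `huggingface_hub` is installed for the commented calls; the branch name `staging` is illustrative):

```python
def staged_upload_kwargs(model_dir: str, repo_id: str, branch: str = "staging") -> dict:
    """Keyword arguments for huggingface_hub.upload_folder targeting a
    revision branch instead of main."""
    return {
        "folder_path": model_dir,
        "repo_id": repo_id,
        "repo_type": "model",
        "revision": branch,
        "commit_message": f"staged upload to {branch}",
    }

# from huggingface_hub import HfApi
# api = HfApi()
# api.create_branch(repo_id, branch="staging", exist_ok=True)  # branch must exist
# api.upload_folder(**staged_upload_kwargs("model", repo_id))
# After testing against revision="staging", merge or re-upload to main.
```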
## Current Output Contract
`infer_local.py` returns JSON with:
- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`
`infer_cloud.py` returns the same JSON keys through the Hugging Face API, or through local fallback if HF cannot serve the custom repo. Cloud responses may not include token-level probabilities, so `important_tokens` can be empty and `confidence` can be `0.0` unless the serving endpoint exposes token details.
For users calling the hosted model with their own token/API key, deploy the repository as a Hugging Face Dedicated Inference Endpoint. The included `handler.py` ensures endpoint responses follow the same JSON contract:
- `code`
- `explanation`
- `confidence`
- `important_tokens`
- `relevancy_score`
- `hallucination`
- `hallucination_check_reason`
- `latency_ms`
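Hugging Face custom handlers expose a class named `EndpointHandler` with `__init__(path)` and `__call__(data)`. The sketch below follows that interface but replaces real model loading with an injectable `generate` function so it stays self-contained; the repo's actual `handler.py` loads the LoRA adapter/full model instead:

```python
import time

class EndpointHandler:
    """Sketch of a custom Hugging Face endpoint handler returning the
    structured contract above. The placeholder generate function and the
    zeroed confidence/relevancy values are illustrative."""

    def __init__(self, path: str = "", generate=None):
        # The real handler loads the tokenizer and model from `path` here.
        self.generate = generate or (lambda prompt: f"# TODO: fix\n{prompt}")

    def __call__(self, data: dict) -> dict:
        prompt = data.get("inputs", "")
        start = time.perf_counter()
        code = self.generate(prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        # Same keys as infer_local.py / infer_cloud.py output.
        return {
            "code": code,
            "explanation": "",
            "confidence": 0.0,
            "important_tokens": [],
            "relevancy_score": 0.0,
            "hallucination": False,
            "hallucination_check_reason": "",
            "latency_ms": latency_ms,
        }
```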
Serverless Hugging Face inference calls are not guaranteed to serve custom LoRA repos. A Dedicated Inference Endpoint or a cloud VM is required for true cloud execution.