girish00 committed on Apr 22

Commit

ba8e702

1 Parent(s): 47eeb2f

update readme file from readme_22_04_26.md

Files changed (1)

README.md +270 -230

@@ -1,263 +1,303 @@
 ---
-license: mit
-base_model: Qwen/Qwen2.5-Coder-0.5B-Instruct
 library_name: peft
 pipeline_tag: text-generation
 tags:
-  - code
   - lora
-  - structured-output
 ---
-# Advanced Fine-Tune Coding Model (Local + Hugging Face)
-This project fine-tunes `Qwen/Qwen2.5-Coder-0.5B-Instruct` using LoRA for:
-- code fixing
-- debugging
-- explanation
-- confidence and relevancy-aware outputs
-## Files
-- `generate_dataset.py`: creates training dataset (5k-10k)
-- `finetune_coding_llm_colab.py`: local training script (LoRA) + optional upload
-- `infer_local.py`: test local trained model with structured JSON output
-- `infer_cloud.py`: run Hugging Face API inference and force the same structured JSON output
-- `handler.py`: custom Hugging Face Inference Endpoint handler that returns the same JSON contract from the hosted endpoint
-- `evaluate_model.py`: run multi-prompt quality checks and report accuracy
-- `upload_to_hf.py`: upload local model folder to HF
-- `run_pipeline.py`: one command for generate + train (+ optional upload)
-- `requirements.txt`: Python dependencies
-- `training_config.json`: default values automatically used by `run_pipeline.py`
-## Local Setup (No Colab)
-Install dependencies:
-```bash
-pip install -r requirements.txt
-```
-Generate dataset (example: 8000 samples):
-```bash
-python generate_dataset.py --size 8000 --out train.json
-```
-Train locally:
-```bash
-python finetune_coding_llm_colab.py --dataset-size 8000
-```
-Enable 4-bit quantized loading (GPU):
-```bash
-python finetune_coding_llm_colab.py --dataset-size 8000 --use-4bit
-```
-Fast CPU smoke run:
-```bash
-python finetune_coding_llm_colab.py --dataset-size 5000 --max-train-samples 200 --epochs 0.1
-```
-Single command pipeline (no upload):
-```bash
-python run_pipeline.py --dataset-size 8000 --skip-upload
-```
-If `training_config.json` exists, `run_pipeline.py` reads it automatically for defaults.
-Use existing dataset without regenerating:
-```bash
-python run_pipeline.py --dataset-size 8000 --train-file train.json --skip-generate --skip-upload
-```
-Tunable training knobs:
-```bash
-python run_pipeline.py --dataset-size 8000 --epochs 3 --batch-size 2 --learning-rate 1e-4 --max-length 512 --max-train-samples 0 --use-4bit --skip-upload
-```
-## Configure 5k-10k samples
-```python
---dataset-size 5000
---dataset-size 8000
---dataset-size 10000
-```
-Recommended values:
-- 5000 for fast iteration
-- 8000 as balanced
-- 10000 for stronger adaptation (slower)
-## Hugging Face Deployment
-Upload is optional and can be done after training:
-```bash
-python upload_to_hf.py --model-dir model --repo-id your-username/your-model-name
-```
-### Update Existing HF Model Repo
-To update your already-created Hugging Face model with this new JSON-output behavior:
-1. Retrain locally with latest code:
-```bash
-python run_pipeline.py --dataset-size 8000 --skip-upload
-```
-2. Login to Hugging Face:
-```bash
-huggingface-cli login
-```
-3. Upload to the same repo ID (this updates existing files):
-```bash
-python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name
-```
-Optional safer rollout using a new revision/branch:
-```bash
-python -c "from huggingface_hub import upload_folder; upload_folder(folder_path='model', repo_id='your-username/your-existing-model-name', repo_type='model', revision='v2-json-output')"
-```
-You can also trigger upload from trainer:
-```bash
-python finetune_coding_llm_colab.py --skip-dataset-gen --skip-train --upload --hf-repo your-username/your-model-name
-```
-## Quick Inference Test (Structured JSON)
-After local training, inference returns JSON with:
-- `code`
-- `explanation`
-- `confidence`
-- `important_tokens`
-- `relevancy_score`
-- `hallucination`
-- `hallucination_check_reason`
-- `latency_ms`
-```python
-python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"
-```
-Local inference uses cached model files by default to avoid slow network checks. If the base model is not already cached on a new machine, run once with:
-```bash
-python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b" --allow-downloads
 ```
-Run the same structured-output wrapper through the Hugging Face API:
-```bash
-set HF_TOKEN=your_huggingface_token
-python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"
 ```
-If `infer_cloud.py` falls back to local inference on a new machine that has not cached the base model yet, add `--allow-downloads`.
-PowerShell:
-```powershell
-$env:HF_TOKEN="your_huggingface_token"
-python infer_cloud.py --repo-id your-username/your-model-name --prompt "Fix this code: def add(a,b) return a+b"
-```
-If you already ran `hf auth login` or `huggingface-cli login`, you can omit `HF_TOKEN`; the saved token will be used automatically.
-For true cloud execution, deploy the model as a Hugging Face Dedicated Inference Endpoint and pass the endpoint URL:
-```powershell
-$env:HF_TOKEN="your_huggingface_token"
-python infer_cloud.py --endpoint-url "https://your-endpoint-url.endpoints.huggingface.cloud" --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback
 ```
-You can also use environment variables:
-```powershell
-$env:HF_TOKEN="your_huggingface_token"
-$env:HF_ENDPOINT_URL="https://your-endpoint-url.endpoints.huggingface.cloud"
-python infer_cloud.py --prompt "Fix this code: def add(a,b) return a+b" --no-local-fallback
 ```
-`infer_cloud.py` applies the same JSON parsing, Python syntax check, relevancy score, hallucination flag, and auto-repair fallback as `infer_local.py`. If Hugging Face cannot serve your custom model repo through an inference provider, the script automatically falls back to the local `model/` folder so the command still returns the local-style JSON. Use `--no-local-fallback` if you want cloud-only failure behavior.
-Hosted Hugging Face API calls usually do not return token logits, so `important_tokens` may be empty and `confidence` may be `0.0` unless your endpoint returns token-level details. When the local fallback runs, those fields are computed the same way as `infer_local.py`.
-### Cloud Output Guarantee
-To make other users receive this JSON pattern with their own token, deploy this repository as a Hugging Face Dedicated Inference Endpoint. The included `handler.py` is loaded by the endpoint and returns:
-```json
-{
-  "code": "string",
-  "explanation": "string",
-  "confidence": 0.0,
-  "important_tokens": [],
-  "relevancy_score": 0.0,
-  "hallucination": false,
-  "hallucination_check_reason": "string",
-  "latency_ms": 0
 }
 ```
-Endpoint request example:
-```powershell
-$env:HF_TOKEN="their_huggingface_token"
-Invoke-RestMethod `
-  -Uri "https://your-endpoint-url.endpoints.huggingface.cloud" `
-  -Method Post `
-  -Headers @{ Authorization = "Bearer $env:HF_TOKEN" } `
-  -ContentType "application/json" `
-  -Body '{"inputs":"Fix this code: def add(a,b) return a+b","parameters":{"max_new_tokens":320}}'
-```
-Calling the model repository directly through Hugging Face serverless inference is not enough if Hugging Face has no provider serving the custom repo. Use a Dedicated Inference Endpoint or your own cloud VM for true cloud execution.
-Explicit base model for LoRA adapter loading:
-```python
-python infer_local.py --model-path model --base-model Qwen/Qwen2.5-Coder-0.5B-Instruct --prompt "Fix this code: def add(a,b) return a+b"
-```
-`infer_local.py` automatically handles both:
-- LoRA adapter output folders
-- Fully merged/full-model output folders
-## Accuracy Evaluation
-Run default evaluation prompts:
-```bash
-python evaluate_model.py --model-path model
-```
-Run with custom prompts:
-```bash
-python evaluate_model.py --model-path model --prompt "Fix this code: if x = 5: print(x)" --prompt "Write python code for linear regression and explain it"
-```
-For higher quality output:
-- use dataset size `8000` or `10000`
-- use `epochs >= 3`
-- prefer `--use-4bit` when GPU is available
-- keep prompts specific and task-focused
-## Recommended Run Order
-```bash
-python run_pipeline.py --dataset-size 8000 --skip-upload
-python infer_local.py --model-path model --prompt "Fix this code: def add(a,b) return a+b"
-python evaluate_model.py --model-path model
-python upload_to_hf.py --model-dir model --repo-id your-username/your-existing-model-name
-```

 ---
+license: apache-2.0
+base_model: "Qwen/Qwen2.5-Coder-0.5B-Instruct"
 library_name: peft
 pipeline_tag: text-generation
 tags:
   - lora
+  - transformers
+  - coding
+  - code-generation
+  - peft
 ---
+# ConicAI Coding LLM
+## Model Details
+### Model Description
+ConicAI LLM Model is a coding assistant fine-tuned with LoRA, a parameter-efficient method, on top of Qwen/Qwen2.5-Coder-0.5B-Instruct. It is designed to generate, debug, and explain code and to return structured outputs.
+* **Developed by:** GIRISH KUMAR DEWANGAN
+* **Model type:** Causal Language Model (Code LLM)
+* **Language(s):** Python, general programming
+* **Used for:** Code generation, debugging, and error fixing, with evaluation, hallucination, and relevancy scores reported alongside the output
+* **License:** Apache 2.0
+* **Finetuned from model:** Qwen/Qwen2.5-Coder-0.5B-Instruct
+---
+## Model Sources
+* **Repository:** https://huggingface.co/girish00/ConicAI_LLM_model
+* **Paper:** Available in repo ("ConicAI_paper.md")
+---
+## Uses
+### Direct Use
+* Code generation
+* Debugging
+* Code explanation
+* Learning programming
+---
+### Downstream Use
+* Coding assistants
+* AI-based education tools
+* Developer productivity tools
+---
+### Out-of-Scope Use
+* Security-critical systems
+* Autonomous production systems
+* High-risk environments
+---
+## Bias, Risks, and Limitations
+* May generate incorrect logic
+* Confidence scores are heuristic
+* Output depends on prompt quality
+* Limited dataset generalization
+---
+## Recommendations
+* Always validate generated code
+* Use structured prompts
+* Avoid ambiguous instructions
+---
+## Structured Output Framework
+The model produces outputs in structured JSON format:
 ```
+{
+  "code": "...",
+  "explanation": "...",
+  "confidence": 0.84,
+  "relevancy_score": 0.82,
+  "hallucination": false
+}
+```
+```text
+This enables:
+- Easy API integration
+- Automated evaluation
+- Better interpretability
 ```
+---
+## 🚀 How to Get Started with the Model
+```python
+# Colab notebook cell: lines starting with "!" run as shell commands.
+!pip -q install -U transformers peft accelerate huggingface_hub safetensors
+!pip install --upgrade torchao
+import json, sys, time
+import torch
+from google.colab import userdata
+from huggingface_hub import login, snapshot_download
+from peft import PeftConfig, PeftModel
+from transformers import AutoTokenizer, AutoModelForCausalLM
+# Read the Hugging Face token from Colab secrets and authenticate.
+HF_TOKEN = userdata.get('HF_TOKEN')
+login(token=HF_TOKEN)
+model = "girish00/ConicAI_LLM_model"
+prompt = input("Enter your prompt: ")
+# Download the repo and make its helper module (infer_local.py) importable.
+repo = snapshot_download(model, token=HF_TOKEN)
+sys.path.append(repo)
+from infer_local import build_instruction_prompt, build_structured_result
+# Load the base model named in the adapter config, then attach the LoRA adapter.
+cfg = PeftConfig.from_pretrained(repo)
+base = cfg.base_model_name_or_path
+tokenizer = AutoTokenizer.from_pretrained(base)
+base_model = AutoModelForCausalLM.from_pretrained(
+    base,
+    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+    device_map="auto"
+)
+llm = PeftModel.from_pretrained(base_model, repo)
+llm.eval()
+inputs = tokenizer(build_instruction_prompt(prompt), return_tensors="pt").to(llm.device)
+# Greedy decoding; keep the per-step scores so token confidences can be computed.
+start = time.perf_counter()
+with torch.no_grad():
+    out = llm.generate(
+        **inputs,
+        max_new_tokens=320,
+        output_scores=True,
+        return_dict_in_generate=True,
+        do_sample=False,
+        pad_token_id=tokenizer.eos_token_id
+    )
+latency = int((time.perf_counter() - start) * 1000)  # milliseconds
+# Drop the prompt tokens and decode only the newly generated ones.
+gen_ids = out.sequences[0][inputs["input_ids"].shape[1]:].tolist()
+text = tokenizer.decode(gen_ids, skip_special_tokens=True)
+# Confidence of each generated token: the softmax probability it received at its step.
+conf = []
+for tid, score in zip(gen_ids, out.scores):
+    probs = torch.softmax(score[0], dim=-1)
+    conf.append(float(probs[tid].item()))
+# Assemble and print the structured JSON result.
+print(json.dumps(
+    build_structured_result(
+        prompt,
+        text,
+        latency,
+        tokenizer=tokenizer,
+        generated_ids=gen_ids,
+        token_confidences=conf
+    ),
+    indent=2
+))
 ```
+---
+## 📊 Benchmark Results
+![Benchmark](./benchmark.png)
+---
+## Training Details
+### Dataset
+* Size: ~5K samples
+* Instruction-based coding dataset
+### Training Procedure
+* Method: LoRA fine-tuning
+* Framework: Transformers + PEFT
+* Precision: FP16 / Mixed
+### Training Hyperparameters
+| Parameter           | Value |
+| ------------------- | ----- |
+| Epochs              | 1–3   |
+| Batch Size          | 2     |
+| Learning Rate       | 2e-4  |
+| Max Sequence Length | 512   |
+| LoRA Rank (r)       | 8     |
+| LoRA Alpha          | 16    |
+| LoRA Dropout        | 0.05  |
+---
+## Inference Configuration
+```text
+max_new_tokens = 200
+temperature = 0.2
+top_p = 0.9
+do_sample = True
 ```
+---
+## Evaluation
+### Metrics
+* Code correctness
+* Syntax validity
+* Relevancy score
+* Hallucination rate
+* Confidence score
+* Latency
+---
+### Results Summary
+* Higher code correctness than the base model
+* Lower hallucination rate
+* Better structured outputs
+---
+## Technical Specifications
+### Architecture
+* Transformer-based causal LM
+* LoRA adaptation
+---
+### Hardware
+* GPU recommended but not required
+* CPU inference supported
+---
+### Software
+* Transformers
+* PEFT
+* PyTorch
+---
+## Environmental Impact
+* Low compute due to LoRA
+* Efficient fine-tuning
+---
+## Citation
+**BibTeX:**
+```bibtex
+@misc{conicai_llm,
+  author = {Girish},
+  title = {ConicAI Coding LLM},
+  year = {2026},
+  publisher = {Hugging Face}
 }
 ```
+---
+## Model Card Authors
+GIRISH KUMAR DEWANGAN
+---
+### Framework versions
+* PEFT 0.19.0
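
Note on the structured output added in this commit: the new README documents a JSON result with the fields `code`, `explanation`, `confidence`, `relevancy_score`, and `hallucination`. A consumer may want to check that shape before using a result. The following is a minimal sketch of such a check, not part of the repository: `validate_structured_output` is a hypothetical helper, and the requirement that both scores lie in [0, 1] is an assumption based on the example values rather than something the README states.

```python
import json

# Field names and types follow the README's "Structured Output Framework" example.
# The [0, 1] range check on the scores is an assumption, not a documented guarantee.
EXPECTED_FIELDS = {
    "code": str,
    "explanation": str,
    "confidence": float,
    "relevancy_score": float,
    "hallucination": bool,
}
SCORE_FIELDS = ("confidence", "relevancy_score")


def validate_structured_output(raw: str) -> dict:
    """Parse a JSON string and check it against the expected fields (hypothetical helper)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        if expected_type is float:
            # JSON has one number type, so accept ints too; bool is excluded explicitly.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected_type)
        if not ok:
            raise ValueError(f"field {name!r} has unexpected type {type(value).__name__}")
    for name in SCORE_FIELDS:
        if not 0.0 <= data[name] <= 1.0:
            raise ValueError(f"field {name!r} is outside [0, 1]")
    return data


# Sample values copied from the README example.
sample = (
    '{"code": "...", "explanation": "...", "confidence": 0.84, '
    '"relevancy_score": 0.82, "hallucination": false}'
)
result = validate_structured_output(sample)
print(result["confidence"], result["hallucination"])
```

Raising `ValueError` on the first problem keeps the sketch short; a caller that needs every problem reported at once could collect the messages in a list instead.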