Advanced Coding LLM - Technical Specification

1) Objective

Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:

Code generation
Debugging/fixing buggy code
Code explanation
Instruction following
Explainability signals
Relevancy scoring
Hallucination checks
Optional RAG

2) Core Functional Requirements

2.1 Model

Primary model: Qwen/Qwen2.5-Coder-1.5B-Instruct
Fallback model: Qwen/Qwen2.5-Coder-0.5B-Instruct
Emergency fallback mode: a mock response path is available when no model can be loaded
Architecture compatible with future LoRA integration (src/lora_prepare.py)
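
The primary/fallback/mock ordering above could be sketched as follows. This is a minimal illustration, not the contents of src/model_loader.py: `load_with_fallback` and the `loader` callable are hypothetical names, and the real loader would wrap whatever model-loading call the project uses.

```python
from typing import Callable, Optional, Tuple

PRIMARY_MODEL = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
FALLBACK_MODEL = "Qwen/Qwen2.5-Coder-0.5B-Instruct"


def load_with_fallback(
    loader: Callable[[str], object],
) -> Tuple[Optional[object], str]:
    """Try the primary model, then the fallback; return (model, mode).

    `loader` is a hypothetical callable that takes a model ID and either
    returns a loaded model or raises. `mode` is the ID of the model that
    loaded, or "mock" when every attempt failed.
    """
    for model_id in (PRIMARY_MODEL, FALLBACK_MODEL):
        try:
            return loader(model_id), model_id
        except Exception:
            # Any load failure (download, memory, missing dependency)
            # moves on to the next candidate.
            continue
    return None, "mock"
```

Passing the loader in as an argument keeps the ordering logic testable without downloading any weights.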

2.2 API

Framework: FastAPI
Endpoint: POST /generate
Health: GET /health
Input schema:
- instruction: str
- input: str
Output schema:
- code: str
- explanation: str
- confidence: float
- important_tokens: list[str]
- relevancy_score: float
- hallucination: bool
- latency_ms: int
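
As a rough illustration of the schemas above, the sketch below models them with standard-library dataclasses. In a FastAPI app these would normally be Pydantic models; the class names `GenerateRequest` and `GenerateResponse` are assumptions, and the field values are placeholders rather than real model output.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class GenerateRequest:
    instruction: str
    input: str


@dataclass
class GenerateResponse:
    code: str
    explanation: str
    confidence: float
    important_tokens: List[str]
    relevancy_score: float
    hallucination: bool
    latency_ms: int


# Placeholder values, chosen only to show the shape of each payload.
example_request = GenerateRequest(
    instruction="Write a function that reverses a string",
    input="",
)
example_response = GenerateResponse(
    code="def reverse(s):\n    return s[::-1]",
    explanation="Slices the string with a step of -1.",
    confidence=0.5,
    important_tokens=["reverse"],
    relevancy_score=0.5,
    hallucination=False,
    latency_ms=0,
)

# The JSON body a client would send to POST /generate.
request_body = json.dumps(asdict(example_request))
```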

2.3 Explainability

Confidence from token probabilities over generated tokens
Important tokens extracted from low-probability tokens
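
One way these two signals might be computed is sketched below. The specification does not state how token probabilities are aggregated, so the geometric mean, the 0.5 threshold, and the limit of five tokens are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def confidence_from_probs(probs: Sequence[float]) -> float:
    """Geometric mean of per-token probabilities (0.0 for empty output)."""
    if not probs:
        return 0.0
    # Average the log-probabilities, then exponentiate.
    # The clamp avoids log(0) for tokens reported with zero probability.
    mean_logp = sum(math.log(max(p, 1e-12)) for p in probs) / len(probs)
    return math.exp(mean_logp)


def important_tokens(
    tokens: Sequence[str],
    probs: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off for "low probability"
    limit: int = 5,          # assumed maximum number of tokens returned
) -> List[str]:
    """Return up to `limit` tokens below `threshold`, least likely first."""
    low = [(p, t) for t, p in zip(tokens, probs) if p < threshold]
    low.sort(key=lambda pair: pair[0])
    return [t for _, t in low[:limit]]
```

A geometric mean penalizes a single very unlikely token more than an arithmetic mean would, which may or may not match the behavior wanted in src/generator.py.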

2.4 Relevancy

Query-to-output similarity score using TF-IDF vectors and cosine similarity (a lexical-overlap measure, not an embedding-based one)
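
The scoring idea can be sketched without external dependencies. This is a pure-Python stand-in, assuming a simple alphanumeric tokenizer and a smoothed IDF fitted on just the two texts; the function name `relevancy_score` is hypothetical and the project's src/relevancy.py may use a library vectorizer instead.

```python
import math
import re
from collections import Counter
from typing import List


def _tokens(text: str) -> List[str]:
    # Assumed tokenizer: lowercase runs of letters, digits, and underscores.
    return re.findall(r"[a-z0-9_]+", text.lower())


def relevancy_score(query: str, output: str) -> float:
    """Cosine similarity between TF-IDF vectors of the query and the output."""
    docs = [Counter(_tokens(query)), Counter(_tokens(output))]
    vocab = set(docs[0]) | set(docs[1])
    n = len(docs)
    # Smoothed inverse document frequency over the two texts.
    idf = {
        t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1.0
        for t in vocab
    }
    vecs = [{t: count * idf[t] for t, count in d.items()} for d in docs]
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    if norms[0] == 0.0 or norms[1] == 0.0:
        return 0.0  # one side has no tokens, so no overlap is measurable
    return dot / (norms[0] * norms[1])
```

Because the score depends only on shared tokens, a correct answer that uses different vocabulary from the query will score low.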

2.5 Hallucination Checks

Python syntax validation (ast.parse)
Runtime smoke execution for Python-like outputs
Skip runtime execution for non-Python-like outputs
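
The three checks above might be combined as in the sketch below. The `looks_like_python` heuristic, its hint list, and the 5-second default timeout are assumptions; the specification does not say how "Python-like" is decided. Running generated code in a subprocess bounds its run time but does not sandbox it.

```python
import ast
import subprocess
import sys

# Assumed hints for deciding whether output is worth checking as Python.
PYTHON_HINTS = ("def ", "import ", "class ", "print(", "return ")


def looks_like_python(code: str) -> bool:
    return any(hint in code for hint in PYTHON_HINTS)


def check_hallucination(code: str, timeout_s: float = 5.0) -> bool:
    """Return True when a check fails, False when all pass or are skipped."""
    if not looks_like_python(code):
        return False  # non-Python-like output: checks are skipped

    # 1) Syntax validation.
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return True

    # 2) Runtime smoke execution in a separate interpreter, with a timeout.
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0
```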

2.6 RAG

Basic retrieval from local snippets dataset
FAISS index over normalized TF-IDF vectors
Inject top-k snippets into prompt context
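
The retrieval flow above is sketched below with FAISS swapped out for a brute-force inner-product search in pure Python, so the example runs without extra packages. On unit-length vectors the inner product equals cosine similarity, which is the ranking a flat inner-product FAISS index would also produce. `SnippetIndex`, `build_prompt`, and the prompt layout are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


class SnippetIndex:
    """Normalized TF-IDF vectors with brute-force top-k search."""

    def __init__(self, snippets: List[str]) -> None:
        self.snippets = snippets
        counts = [Counter(_tokens(s)) for s in snippets]
        n = len(snippets)
        # Document frequency: number of snippets containing each term.
        df = Counter(t for c in counts for t in c)
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
        self.vectors = [self._vectorize(c) for c in counts]

    def _vectorize(self, counts: Counter) -> Dict[str, float]:
        # Terms unseen at index time are dropped from the query vector.
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query: str, k: int = 2) -> List[str]:
        q = self._vectorize(Counter(_tokens(query)))
        scored = [
            (sum(w * v.get(t, 0.0) for t, w in q.items()), i)
            for i, v in enumerate(self.vectors)
        ]
        scored.sort(key=lambda pair: -pair[0])
        # Snippets with no term overlap are left out of the results.
        return [self.snippets[i] for score, i in scored[:k] if score > 0.0]


def build_prompt(instruction: str, snippets: List[str]) -> str:
    """Prepend retrieved snippets to the instruction (assumed layout)."""
    if not snippets:
        return instruction
    context = "\n\n".join(snippets)
    return f"Reference snippets:\n{context}\n\nInstruction:\n{instruction}"
```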

3) Non-Functional Requirements

Runnable on local workstation
Supports no-training initial deployment
Lazy-load model to reduce startup failures
Graceful fallback response when model unavailable
Windows-compatible developer workflow

4) Security Requirements

API key authentication via the x-api-key header (enforced only when a key is configured)
Per-IP in-memory rate limiting
No secrets committed to repository (.env ignored)
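
The first two requirements could look roughly like the sketch below. It is not the contents of api/security.py: the names `api_key_ok` and `RateLimiter`, the sliding-window strategy, and the default of 60 requests per 60 seconds are assumptions.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional


def api_key_ok(provided: Optional[str], configured: Optional[str]) -> bool:
    """Accept every request when no key is configured; otherwise compare keys."""
    if not configured:
        return True
    if provided is None:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(provided.encode(), configured.encode())


class RateLimiter:
    """Per-IP sliding-window limiter held in process memory."""

    def __init__(
        self,
        max_requests: int = 60,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Since the counters live in one process, running several workers gives each worker its own independent limit.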

5) Performance Requirements

Lazy model initialization
Runtime checks bounded by timeout
Optional mock mode (FORCE_MOCK_MODE=true) for fast operational checks
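
Lazy initialization and the FORCE_MOCK_MODE switch might be wired together as in this sketch. `LazyModel` and `mock_mode_enabled` are hypothetical names, and treating only the literal value "true" (case-insensitive) as enabled is an assumption about how src/config.py parses the variable.

```python
import os
import threading
from typing import Callable, Optional


def mock_mode_enabled() -> bool:
    # Assumed parsing: only "true" (any case) turns mock mode on.
    return os.environ.get("FORCE_MOCK_MODE", "false").strip().lower() == "true"


class LazyModel:
    """Defer the expensive load until the first request that needs it."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader  # hypothetical zero-argument loading function
        self._model: Optional[object] = None
        self._lock = threading.Lock()

    def get(self) -> Optional[object]:
        """Return the model, or None when mock mode is forced."""
        if mock_mode_enabled():
            return None  # caller is expected to return the mock response
        with self._lock:
            # The lock keeps concurrent first requests from loading twice.
            if self._model is None:
                self._model = self._loader()
            return self._model
```

With this shape the server can start and answer GET /health before any model weights are touched.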

6) Deployment Requirements

Local

python tasks.py install
python tasks.py run

Docker

docker compose up --build -d

Hugging Face Space

python tasks.py hf-upload --repo-id <id> --token <token>
Gradio entrypoint in app.py

7) Project Structure

coding-llm/
├── data/
├── src/
├── api/
├── requirements.txt
├── README.md
├── instruction.md
└── specification_file.md

8) Module Responsibilities

api/main.py: API routes and response wiring
api/security.py: API key + rate limiting
src/config.py: environment-driven settings
src/model_loader.py: model/fallback loading
src/generator.py: generation + confidence extraction
src/pipeline.py: orchestration layer
src/rag.py: snippet retrieval
src/relevancy.py: relevancy score computation
src/hallucination.py: syntax/runtime checks
src/lora_prepare.py: LoRA adapter hook
app.py: Gradio UI for HF Spaces
upload_to_hf.py: HF deployment uploader
tasks.py: command runner
smoke_test.py: runtime integration validation

9) Operational Modes

Real Model Mode
- FORCE_MOCK_MODE=false
- Uses HF model loading and generation

Mock Mode
- FORCE_MOCK_MODE=true
- Returns deterministic fallback output for reliability testing

10) Validation and QA

Static compile check with python -m compileall
Lint diagnostics via editor/tooling
Smoke checks:
- health endpoint reachable
- generate endpoint returns full schema

11) Known Constraints

First generation may be slow due to model download/warmup
Quality depends on available model and decoding configuration
In-memory rate limiter is single-process only

12) Future Enhancements

Redis-backed distributed rate limiting
Better language-aware hallucination tests
Prompt templates per task type
Streaming token responses
Persistent vector store (Chroma/FAISS on-disk)
CI/CD workflow for automated deploy/test