# Advanced Coding LLM - Technical Specification

## 1) Objective

Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:

- Code generation
- Debugging/fixing buggy code
- Code explanation
- Instruction following
- Explainability signals
- Relevancy scoring
- Hallucination checks
- Optional RAG

## 2) Core Functional Requirements

### 2.1 Model

- Primary model: `Qwen/Qwen2.5-Coder-1.5B-Instruct`
- Fallback model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Emergency fallback mode supported (mock path available)
- Architecture compatible with future LoRA integration (`src/lora_prepare.py`)

### 2.2 API

- Framework: FastAPI
- Endpoint: `POST /generate`
- Health: `GET /health`
- Input schema:
  - `instruction: str`
  - `input: str`
- Output schema:
  - `code: str`
  - `explanation: str`
  - `confidence: float`
  - `important_tokens: list[str]`
  - `relevancy_score: float`
  - `hallucination: bool`
  - `latency_ms: int`

### 2.3 Explainability

- Confidence from token probabilities over generated tokens
- Important tokens extracted from low-probability tokens

### 2.4 Relevancy

- Query-to-output semantic score using TF-IDF + cosine similarity

### 2.5 Hallucination Checks

- Python syntax validation (`ast.parse`)
- Runtime smoke execution for Python-like outputs
- Skip runtime execution for non-Python-like outputs

### 2.6 RAG

- Basic retrieval from local snippets dataset
- FAISS index over normalized TF-IDF vectors
- Inject top-k snippets into prompt context

## 3) Non-Functional Requirements

- Runnable on local workstation
- Supports no-training initial deployment
- Lazy-load model to reduce startup failures
- Graceful fallback response when model unavailable
- Windows-compatible developer workflow

## 4) Security Requirements

- API key auth via `x-api-key` (if configured)
- Per-IP in-memory rate limiting
- No secrets committed to repository (`.env` ignored)

## 5) Performance Requirements

- Lazy model initialization
- Runtime checks bounded by timeout
- Optional mock mode (`FORCE_MOCK_MODE=true`) for fast operational checks

## 6) Deployment Requirements

### Local

- `python tasks.py install`
- `python tasks.py run`

### Docker

- `docker compose up --build -d`

### Hugging Face Space

- `python tasks.py hf-upload --repo-id --token`
- Gradio entrypoint in `app.py`

## 7) Project Structure

```text
coding-llm/
│── data/
│── src/
│── api/
│── requirements.txt
│── README.md
│── instruction.md
│── specification_file.md
```

## 8) Module Responsibilities

- `api/main.py`: API routes and response wiring
- `api/security.py`: API key + rate limiting
- `src/config.py`: environment-driven settings
- `src/model_loader.py`: model/fallback loading
- `src/generator.py`: generation + confidence extraction
- `src/pipeline.py`: orchestration layer
- `src/rag.py`: snippet retrieval
- `src/relevancy.py`: relevancy score computation
- `src/hallucination.py`: syntax/runtime checks
- `src/lora_prepare.py`: LoRA adapter hook
- `app.py`: Gradio UI for HF Spaces
- `upload_to_hf.py`: HF deployment uploader
- `tasks.py`: command runner
- `smoke_test.py`: runtime integration validation

## 9) Operational Modes

- **Real Model Mode**
  - `FORCE_MOCK_MODE=false`
  - Uses HF model loading and generation
- **Mock Mode**
  - `FORCE_MOCK_MODE=true`
  - Returns deterministic fallback output for reliability testing

## 10) Validation and QA

- Static compile check with `python -m compileall`
- Lint diagnostics via editor/tooling
- Smoke checks:
  - health endpoint reachable
  - generate endpoint returns full schema

## 11) Known Constraints

- First generation may be slow due to model download/warmup
- Quality depends on available model and decoding configuration
- In-memory rate limiter is single-process only

## 12) Future Enhancements

- Redis-backed distributed rate limiting
- Better language-aware hallucination tests
- Prompt templates per task type
- Streaming token responses
- Persistent vector store (Chroma/FAISS on-disk)
- CI/CD workflow for automated deploy/test
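
As an illustration of the explainability signals in section 2.3, the sketch below turns per-token probabilities into a confidence value and a list of important tokens. The function name `confidence_and_important_tokens`, the use of an arithmetic mean, and the `top_k` cutoff are assumptions made for this example. The specification does not fix the aggregation or the threshold, so `src/generator.py` may well do this differently.

```python
from typing import List, Tuple


def confidence_and_important_tokens(
    token_probs: List[Tuple[str, float]], top_k: int = 3
) -> Tuple[float, List[str]]:
    """Summarize per-token probabilities of a generated sequence.

    Assumptions (not from the specification): confidence is the arithmetic
    mean of the token probabilities, and the "important" tokens are the
    top_k tokens with the lowest probability.
    """
    if not token_probs:
        return 0.0, []
    confidence = sum(prob for _, prob in token_probs) / len(token_probs)
    # sorted() is stable, so equally probable tokens keep generation order.
    lowest = sorted(token_probs, key=lambda pair: pair[1])[:top_k]
    return confidence, [token for token, _ in lowest]


# Hypothetical (token, probability) pairs for a short generation.
sample = [("def", 0.9), ("add", 0.5), ("(", 1.0), ("a", 0.6)]
confidence, important = confidence_and_important_tokens(sample, top_k=2)
```

A geometric mean (the exponential of the mean log-probability) would penalize a single very unlikely token more strongly. Either choice fits the `confidence: float` field of the output schema.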
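
The relevancy score in section 2.4 (TF-IDF plus cosine similarity) can be sketched with the standard library alone. The helper names below are hypothetical, and the smoothed IDF formula is an assumption chosen so that terms shared by both texts keep a non-zero weight. `src/relevancy.py` may instead rely on a library vectorizer.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) for each document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumption of this sketch, not taken from the spec).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_score(query: str, output: str) -> float:
    """Score how closely the generated output matches the query wording."""
    query_vec, output_vec = tfidf_vectors([query, output])
    return cosine(query_vec, output_vec)


partial = relevancy_score("sort a list of numbers", "def sort_numbers(numbers): ...")
```

Because the score is lexical, an output that solves the task with different vocabulary can still score low, so it works best as one signal among several.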
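
The hallucination checks in section 2.5, together with the timeout bound from section 5, might look roughly like the following. The marker list used to decide whether output is "Python-like", the function names, and the 5-second default are all assumptions of this sketch. The specification names only `ast.parse`, a runtime smoke execution, and the skip rule.

```python
import ast
import subprocess
import sys

# Hypothetical heuristic: the specification does not define "Python-like".
PYTHON_MARKERS = ("def ", "import ", "print(", "class ", "return ")


def looks_like_python(code: str) -> bool:
    """Cheap guess at whether the generated text is Python source."""
    return any(marker in code for marker in PYTHON_MARKERS)


def check_hallucination(code: str, timeout_s: float = 5.0) -> bool:
    """Return True when the code fails the syntax or runtime smoke check."""
    if not looks_like_python(code):
        # Non-Python-like output: skip the checks (assumed behaviour).
        return False
    try:
        ast.parse(code)
    except SyntaxError:
        return True
    try:
        # Run in a separate interpreter so a crash or hang cannot take
        # down the API process; the timeout bounds the check.
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0


flagged = check_hallucination("def broken(:\n    pass")
```

A subprocess isolates crashes but is not a sandbox, so generated code still runs with the privileges of the API process.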
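
For the per-IP in-memory rate limiting listed in section 4, one possible shape is a sliding-window counter keyed by client IP. The class name `InMemoryRateLimiter`, its parameters, and the injectable clock are assumptions for illustration, not the contents of `api/security.py`.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class InMemoryRateLimiter:
    """Sliding-window limiter keyed by client IP (single process only)."""

    def __init__(
        self,
        max_requests: int,
        window_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        """Record a request and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Usage with a fake clock: two requests allowed per 60-second window.
fake_now = [0.0]
limiter = InMemoryRateLimiter(max_requests=2, window_s=60.0, clock=lambda: fake_now[0])
results = [limiter.allow("10.0.0.1") for _ in range(3)]
fake_now[0] = 61.0
after_window = limiter.allow("10.0.0.1")
```

The state lives in one process, which matches the constraint in section 11: several workers would each keep their own counters, which is what the Redis-backed enhancement in section 12 would address.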