# Advanced Coding LLM - Technical Specification
## 1) Objective
Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:
- Code generation
- Debugging/fixing buggy code
- Code explanation
- Instruction following
- Explainability signals
- Relevancy scoring
- Hallucination checks
- Optional RAG
## 2) Core Functional Requirements
### 2.1 Model
- Primary model: `Qwen/Qwen2.5-Coder-1.5B-Instruct`
- Fallback model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Emergency fallback mode supported (mock path available)
- Architecture compatible with future LoRA integration (`src/lora_prepare.py`)
### 2.2 API
- Framework: FastAPI
- Endpoint: `POST /generate`
- Health: `GET /health`
- Input schema:
  - `instruction: str`
  - `input: str`
- Output schema:
  - `code: str`
  - `explanation: str`
  - `confidence: float`
  - `important_tokens: list[str]`
  - `relevancy_score: float`
  - `hallucination: bool`
  - `latency_ms: int`
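The request and response shapes above could be mirrored in code roughly as follows. This is a minimal sketch using standard-library dataclasses so that it runs without dependencies; in a FastAPI service these would more typically be Pydantic models. The class names `GenerateRequest` and `GenerateResponse` are assumptions, while the field names and types are taken from the schema lists.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerateRequest:
    """Body of POST /generate (class name is hypothetical)."""
    instruction: str
    input: str


@dataclass
class GenerateResponse:
    """Response of POST /generate (class name is hypothetical)."""
    code: str
    explanation: str
    confidence: float
    important_tokens: list[str]
    relevancy_score: float
    hallucination: bool
    latency_ms: int


# Illustrative values only; not output from a real model.
example = GenerateResponse(
    code="def add(a, b):\n    return a + b",
    explanation="Adds two numbers.",
    confidence=0.87,
    important_tokens=["add"],
    relevancy_score=0.62,
    hallucination=False,
    latency_ms=140,
)
payload = asdict(example)  # plain dict, ready for JSON serialization
```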
### 2.3 Explainability
- Confidence from token probabilities over generated tokens
- Important tokens extracted from low-probability tokens
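One way these two signals might be computed is sketched below. The specification does not say how per-token probabilities are aggregated, so the geometric mean used here is an assumption, as are the function names, the `threshold` of 0.5, and the `limit` of 5.

```python
import math


def confidence_from_probs(token_probs):
    """Geometric mean of per-token probabilities; 0.0 for empty input.

    The aggregation rule is an assumption, not taken from the spec.
    """
    if not token_probs:
        return 0.0
    # Clamp to a tiny positive value so log() never sees zero.
    log_sum = sum(math.log(max(p, 1e-12)) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


def important_tokens(tokens, token_probs, threshold=0.5, limit=5):
    """Return up to `limit` tokens with probability below `threshold`,
    least likely first."""
    low = [(p, t) for t, p in zip(tokens, token_probs) if p < threshold]
    low.sort(key=lambda pair: pair[0])
    return [t for _, t in low[:limit]]
```

Low-probability tokens are the points where the model was least certain, which is why they are surfaced as the tokens worth inspecting.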
### 2.4 Relevancy
- Query-to-output semantic score using TF-IDF + cosine similarity
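The scoring idea can be illustrated with a dependency-free sketch. It treats the query and the output as a two-document corpus and uses a smoothed IDF; both choices, along with the tokenizer regex and the function name `relevancy_score`, are assumptions. A real implementation would more likely rely on a library vectorizer.

```python
import math
import re
from collections import Counter


def _tokenize(text):
    # Lowercased word-like tokens; the pattern is an illustrative choice.
    return re.findall(r"[a-z0-9_]+", text.lower())


def relevancy_score(query, output):
    """Cosine similarity between TF-IDF vectors of query and output."""
    docs = [Counter(_tokenize(query)), Counter(_tokenize(output))]
    if not docs[0] or not docs[1]:
        return 0.0
    n = len(docs)
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so shared terms keep weight.
    idf = {
        t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1
        for t in vocab
    }
    vecs = [{t: count * idf[t] for t, count in d.items()} for d in docs]
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    return dot / (norms[0] * norms[1])
```

Because TF-IDF weights are non-negative, the score falls between 0 (no shared terms) and 1 (identical term distribution).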
### 2.5 Hallucination Checks
- Python syntax validation (`ast.parse`)
- Runtime smoke execution for Python-like outputs
- Skip runtime execution for non-Python-like outputs
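The checks above might be combined as in the following sketch. The `looks_like_python` heuristic, the 3-second timeout, and the decision to skip the syntax check as well for non-Python-like output are all assumptions. Running generated code in a subprocess bounds execution time but is not a sandbox.

```python
import ast
import subprocess
import sys


def looks_like_python(code):
    """Crude heuristic (an assumption): look for common Python markers."""
    markers = ("def ", "import ", "print(", "class ", "for ", "while ")
    return any(marker in code for marker in markers)


def check_hallucination(code, timeout_s=3.0):
    """Return True when Python-like code fails the syntax or smoke check."""
    if not looks_like_python(code):
        return False  # checks skipped for non-Python-like output
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return True
    try:
        # Run in a separate interpreter so a crash or hang stays contained.
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0
```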
### 2.6 RAG
- Basic retrieval from local snippets dataset
- FAISS index over normalized TF-IDF vectors
- Inject top-k snippets into prompt context
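The retrieve-then-inject flow can be sketched without external libraries. Note the substitutions: a brute-force dot product over L2-normalized term-count vectors stands in for the FAISS index over TF-IDF vectors, and the prompt layout, the function names, and `k=2` are assumptions.

```python
import math
import re
from collections import Counter


def _vectorize(text):
    """L2-normalized term-count vector stored as a dict."""
    counts = Counter(re.findall(r"[a-z0-9_]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def retrieve(query, snippets, k=2):
    """Top-k snippets by inner product, dropping zero-score matches."""
    q = _vectorize(query)
    scored = []
    for snippet in snippets:
        v = _vectorize(snippet)
        score = sum(w * v.get(t, 0.0) for t, w in q.items())
        scored.append((score, snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:k] if score > 0]


def build_prompt(instruction, snippets, k=2):
    """Prepend retrieved snippets to the instruction (layout is illustrative)."""
    context = retrieve(instruction, snippets, k)
    if not context:
        return instruction
    joined = "\n\n".join(context)
    return f"Reference snippets:\n{joined}\n\nTask:\n{instruction}"
```

With unit-length vectors the inner product equals cosine similarity, which is why normalization happens before indexing.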
## 3) Non-Functional Requirements
- Runnable on a local workstation
- Initial deployment requires no model training
- Model is lazy-loaded to reduce startup failures
- Graceful fallback response when the model is unavailable
- Windows-compatible developer workflow
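Lazy loading with graceful fallback could look like the sketch below. The `LazyModel` class and its injected `loader` callable are hypothetical; the model IDs and the `FORCE_MOCK_MODE` variable come from this specification. Returning `None` signals the caller to use the mock path.

```python
import os

PRIMARY = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
FALLBACK = "Qwen/Qwen2.5-Coder-0.5B-Instruct"


class LazyModel:
    """Defers loading until first use and walks a fallback chain."""

    def __init__(self, loader, model_ids=(PRIMARY, FALLBACK), force_mock=None):
        self._loader = loader  # callable(model_id) -> model; may raise
        self._model_ids = model_ids
        if force_mock is None:
            force_mock = os.getenv("FORCE_MOCK_MODE", "false").lower() == "true"
        self._force_mock = force_mock
        self._model = None
        self._resolved = False
        self.active_id = None

    def get(self):
        """Return the loaded model, or None if only the mock path remains."""
        if self._resolved:
            return self._model
        self._resolved = True
        if self._force_mock:
            return None
        for model_id in self._model_ids:
            try:
                self._model = self._loader(model_id)
                self.active_id = model_id
                break
            except Exception:
                continue  # try the next candidate in the chain
        return self._model
```

Because nothing is loaded in the constructor, the API process can start and answer health checks even when the model download would fail.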
## 4) Security Requirements
- API key auth via `x-api-key` (if configured)
- Per-IP in-memory rate limiting
- No secrets committed to repository (`.env` ignored)
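The first two requirements can be sketched framework-independently as follows. The sliding-window strategy, the default limits, and the names `api_key_ok` and `RateLimiter` are assumptions; the specification only states that auth is optional and that limiting is per-IP and in-memory.

```python
import hmac
import time
from collections import defaultdict, deque


def api_key_ok(provided, configured):
    """Accept every request when no key is configured; otherwise compare
    the `x-api-key` header value in constant time."""
    if not configured:
        return True
    return hmac.compare_digest((provided or "").encode(), configured.encode())


class RateLimiter:
    """Sliding-window limiter keyed by client IP (single process only)."""

    def __init__(self, max_requests=30, window_s=60.0, clock=time.monotonic):
        self._max = max_requests
        self._window_s = window_s
        self._clock = clock  # injectable for testing
        self._hits = defaultdict(deque)

    def allow(self, ip):
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self._window_s:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

In a route handler, a failed key check would typically map to HTTP 401 and a rejected `allow()` call to HTTP 429.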
## 5) Performance Requirements
- Lazy model initialization
- Runtime checks bounded by timeout
- Optional mock mode (`FORCE_MOCK_MODE=true`) for fast operational checks
## 6) Deployment Requirements
### Local
- `python tasks.py install`
- `python tasks.py run`
### Docker
- `docker compose up --build -d`
### Hugging Face Space
- `python tasks.py hf-upload --repo-id <id> --token <token>`
- Gradio entrypoint in `app.py`
## 7) Project Structure
```text
coding-llm/
├── data/
├── src/
├── api/
├── requirements.txt
├── README.md
├── instruction.md
└── specification_file.md
```
## 8) Module Responsibilities
- `api/main.py`: API routes and response wiring
- `api/security.py`: API key + rate limiting
- `src/config.py`: environment-driven settings
- `src/model_loader.py`: model/fallback loading
- `src/generator.py`: generation + confidence extraction
- `src/pipeline.py`: orchestration layer
- `src/rag.py`: snippet retrieval
- `src/relevancy.py`: relevancy score computation
- `src/hallucination.py`: syntax/runtime checks
- `src/lora_prepare.py`: LoRA adapter hook
- `app.py`: Gradio UI for HF Spaces
- `upload_to_hf.py`: HF deployment uploader
- `tasks.py`: command runner
- `smoke_test.py`: runtime integration validation
## 9) Operational Modes
- **Real Model Mode**
  - `FORCE_MOCK_MODE=false`
  - Uses HF model loading and generation
- **Mock Mode**
  - `FORCE_MOCK_MODE=true`
  - Returns deterministic fallback output for reliability testing
## 10) Validation and QA
- Static compile check with `python -m compileall`
- Lint diagnostics via editor/tooling
- Smoke checks:
  - health endpoint reachable
  - generate endpoint returns full schema
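The "full schema" check could be expressed as a small validator over the decoded JSON body. This is a sketch: `schema_problems` and `EXPECTED_FIELDS` are hypothetical names, and accepting JSON integers where a float is expected is an assumption. The field list mirrors the output schema in section 2.2.

```python
EXPECTED_FIELDS = {
    "code": str,
    "explanation": str,
    "confidence": float,
    "important_tokens": list,
    "relevancy_score": float,
    "hallucination": bool,
    "latency_ms": int,
}


def _matches(value, expected):
    """Type check that tolerates JSON quirks (ints for floats, no bools as numbers)."""
    if expected is float:
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if expected is int:
        return isinstance(value, int) and not isinstance(value, bool)
    if expected is list:
        return isinstance(value, list) and all(isinstance(v, str) for v in value)
    return isinstance(value, expected)


def schema_problems(payload):
    """Return a list of human-readable problems; empty means the schema is complete."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not _matches(payload[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems
```

A smoke test would decode the `POST /generate` response with `json.loads` and assert that `schema_problems(...)` returns an empty list.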
## 11) Known Constraints
- First generation may be slow due to model download/warmup
- Quality depends on available model and decoding configuration
- In-memory rate limiter is single-process only
## 12) Future Enhancements
- Redis-backed distributed rate limiting
- Better language-aware hallucination tests
- Prompt templates per task type
- Streaming token responses
- Persistent vector store (Chroma/FAISS on-disk)
- CI/CD workflow for automated deploy/test