Advanced Coding LLM - Technical Specification
1) Objective
Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:
- Code generation
- Debugging/fixing buggy code
- Code explanation
- Instruction following
- Explainability signals
- Relevancy scoring
- Hallucination checks
- Optional RAG
2) Core Functional Requirements
2.1 Model
- Primary model: Qwen/Qwen2.5-Coder-1.5B-Instruct
- Fallback model: Qwen/Qwen2.5-Coder-0.5B-Instruct
- Emergency fallback mode supported (mock path available)
- Architecture compatible with future LoRA integration (src/lora_prepare.py)
2.2 API
- Framework: FastAPI
- Endpoint: POST /generate
- Health: GET /health
- Input schema:
  - instruction: str
  - input: str
- Output schema:
  - code: str
  - explanation: str
  - confidence: float
  - important_tokens: list[str]
  - relevancy_score: float
  - hallucination: bool
  - latency_ms: int
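The request and response shapes listed above can be sketched as typed records. This is a minimal sketch using stdlib dataclasses as a stand-in; in a FastAPI app these would typically be pydantic models, and the class names GenerateRequest and GenerateResponse are hypothetical, not taken from the project.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerateRequest:
    """Body of POST /generate (class name is hypothetical)."""
    instruction: str
    input: str


@dataclass
class GenerateResponse:
    """Response of POST /generate, one attribute per field in the spec."""
    code: str
    explanation: str
    confidence: float
    important_tokens: list[str]
    relevancy_score: float
    hallucination: bool
    latency_ms: int


# Illustrative values only; they do not come from a real model run.
example = GenerateResponse(
    code="print('hi')",
    explanation="Prints a greeting.",
    confidence=0.9,
    important_tokens=["print"],
    relevancy_score=0.5,
    hallucination=False,
    latency_ms=12,
)
payload = asdict(example)  # plain dict, ready for JSON serialization
```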
2.3 Explainability
- Confidence from token probabilities over generated tokens
- Important tokens extracted from low-probability tokens
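One way to derive both signals from per-token log-probabilities is sketched below. The aggregation (geometric mean of token probabilities) and the top_k cutoff are assumptions for illustration; the spec does not state which formula src/generator.py uses, and the function name is hypothetical.

```python
import math


def confidence_and_important_tokens(
    tokens: list[str], logprobs: list[float], top_k: int = 3
) -> tuple[float, list[str]]:
    """Return (confidence, important_tokens) for one generated sequence.

    Assumption: confidence is the geometric mean of the token probabilities,
    i.e. exp(mean log-probability). Important tokens are the top_k tokens
    with the lowest probability, least likely first.
    """
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    if not tokens:
        return 0.0, []
    confidence = math.exp(sum(logprobs) / len(logprobs))
    # Rank token positions by ascending log-probability.
    ranked = sorted(range(len(tokens)), key=lambda i: logprobs[i])
    return confidence, [tokens[i] for i in ranked[:top_k]]
```

With a Hugging Face model, the log-probabilities would come from the generation scores of the sampled tokens; that extraction step is omitted here.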
2.4 Relevancy
- Query-to-output semantic score using TF-IDF + cosine similarity
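The scoring step can be illustrated with a small pure-Python TF-IDF and cosine similarity. This is a sketch, not the project's implementation: the tokenizer and the smoothed IDF formula are assumptions, and src/relevancy.py may rely on a library vectorizer instead.

```python
import math
import re
from collections import Counter


def _tokenize(text: str) -> list[str]:
    # Assumption: lowercase word-like tokens are a good enough vocabulary.
    return re.findall(r"[a-z0-9_]+", text.lower())


def _tfidf(docs: list[str]) -> list[dict[str, float]]:
    counts = [Counter(_tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
        for c in counts
    ]


def _cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_score(query: str, output: str) -> float:
    """Cosine similarity between the TF-IDF vectors of query and output."""
    query_vec, output_vec = _tfidf([query, output])
    return _cosine(query_vec, output_vec)
```

Because TF-IDF works on shared terms, the score reflects lexical overlap between the query and the output: texts with no terms in common score 0.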
2.5 Hallucination Checks
- Python syntax validation (ast.parse)
- Runtime smoke execution for Python-like outputs
- Skip runtime execution for non-Python-like outputs
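The three checks above could be combined roughly as follows. This is a sketch under stated assumptions: the keyword heuristic for "Python-like", the 5-second timeout, the choice to skip all checks for non-Python-like text, and the function names are illustrative and not taken from src/hallucination.py.

```python
import ast
import subprocess
import sys

# Assumption: a few keywords are enough to guess that text is Python.
PYTHON_HINTS = ("def ", "import ", "class ", "print(", "return ")


def looks_like_python(code: str) -> bool:
    return any(hint in code for hint in PYTHON_HINTS)


def is_hallucinated(code: str, timeout_s: float = 5.0) -> bool:
    """Return True when Python-like output fails a syntax or runtime check."""
    if not looks_like_python(code):
        return False  # non-Python-like output: checks are skipped
    try:
        ast.parse(code)
    except SyntaxError:
        return True
    try:
        # Runs in a separate interpreter process; this is not a sandbox.
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0
```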
2.6 RAG
- Basic retrieval from local snippets dataset
- FAISS index over normalized TF-IDF vectors
- Inject top-k snippets into prompt context
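The retrieve-then-inject flow can be sketched without external dependencies. The brute-force inner-product search below is a stand-in for the FAISS index (an inner product over L2-normalized vectors equals cosine similarity); the vectors are assumed to be precomputed TF-IDF vectors, and the prompt layout and function names are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_snippets(
    query_vec: list[float],
    snippet_vecs: list[list[float]],
    snippets: list[str],
    k: int = 2,
) -> list[str]:
    """Return the k snippets with the highest inner product to the query."""
    q = l2_normalize(query_vec)
    scored = []
    for text, vec in zip(snippets, snippet_vecs):
        score = sum(a * b for a, b in zip(q, l2_normalize(vec)))
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(instruction: str, user_input: str, retrieved: list[str]) -> str:
    """Place retrieved snippets ahead of the instruction (layout is assumed)."""
    parts = []
    if retrieved:
        context = "\n\n".join(
            f"# Snippet {i}\n{s}" for i, s in enumerate(retrieved, start=1)
        )
        parts.append("Reference snippets:\n" + context)
    parts.append("Instruction: " + instruction)
    if user_input:
        parts.append("Input:\n" + user_input)
    return "\n\n".join(parts)
```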
3) Non-Functional Requirements
- Runnable on local workstation
- Supports no-training initial deployment
- Lazy-load model to reduce startup failures
- Graceful fallback response when model unavailable
- Windows-compatible developer workflow
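Lazy loading with a graceful fallback chain might look like the sketch below. The LazyModel class and the injected load_fn callable are hypothetical; the real loading code lives in src/model_loader.py and is not shown in this specification.

```python
from typing import Callable, Optional

PRIMARY = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
FALLBACK = "Qwen/Qwen2.5-Coder-0.5B-Instruct"


class LazyModel:
    """Defer model loading until first use; fall back instead of crashing."""

    def __init__(self, load_fn: Callable[[str], object], force_mock: bool = False):
        self._load_fn = load_fn        # hypothetical loader: model name -> model
        self._force_mock = force_mock
        self._model: Optional[object] = None
        self.active_name: Optional[str] = None  # None until the first get()

    def get(self) -> Optional[object]:
        """Return the loaded model, or None when running in mock mode."""
        if self.active_name is not None:
            return self._model         # already resolved; do not reload
        if not self._force_mock:
            for name in (PRIMARY, FALLBACK):
                try:
                    self._model = self._load_fn(name)
                    self.active_name = name
                    return self._model
                except Exception:
                    continue           # try the next candidate
        self.active_name = "mock"      # emergency fallback: no model available
        return None
```

Nothing is loaded at construction time, so the API process can start and answer health checks even if the model download later fails.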
4) Security Requirements
- API key auth via x-api-key (if configured)
- Per-IP in-memory rate limiting
- No secrets committed to repository (.env ignored)
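The first two requirements can be sketched with stdlib code only. This is a sketch under assumptions: x-api-key is treated as a request header value, the limiter uses a sliding window, and the default limits and names are illustrative rather than taken from api/security.py.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Optional


def api_key_ok(configured_key: Optional[str], provided_key: Optional[str]) -> bool:
    """Accept every request when no key is configured; otherwise require a match."""
    if not configured_key:
        return True
    if provided_key is None:
        return False
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest(configured_key.encode(), provided_key.encode())


class InMemoryRateLimiter:
    """Sliding-window limiter keyed by client IP, held in process memory."""

    def __init__(self, max_requests: int = 30, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Because the counters live in one process, separate worker processes would each keep their own counts.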
5) Performance Requirements
- Lazy model initialization
- Runtime checks bounded by timeout
- Optional mock mode (FORCE_MOCK_MODE=true) for fast operational checks
6) Deployment Requirements
Local
- python tasks.py install
- python tasks.py run
Docker
- docker compose up --build -d
Hugging Face Space
- python tasks.py hf-upload --repo-id <id> --token <token>
- Gradio entrypoint in app.py
7) Project Structure
coding-llm/
├── data/
├── src/
├── api/
├── requirements.txt
├── README.md
├── instruction.md
└── specification_file.md
8) Module Responsibilities
- api/main.py: API routes and response wiring
- api/security.py: API key + rate limiting
- src/config.py: environment-driven settings
- src/model_loader.py: model/fallback loading
- src/generator.py: generation + confidence extraction
- src/pipeline.py: orchestration layer
- src/rag.py: snippet retrieval
- src/relevancy.py: relevancy score computation
- src/hallucination.py: syntax/runtime checks
- src/lora_prepare.py: LoRA adapter hook
- app.py: Gradio UI for HF Spaces
- upload_to_hf.py: HF deployment uploader
- tasks.py: command runner
- smoke_test.py: runtime integration validation
9) Operational Modes
- Real Model Mode
  - FORCE_MOCK_MODE=false
  - Uses HF model loading and generation
- Mock Mode
  - FORCE_MOCK_MODE=true
  - Returns deterministic fallback output for reliability testing
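Switching between the two modes might be wired as in the sketch below. The set of accepted truthy strings and the contents of the mock payload are assumptions for illustration; only the FORCE_MOCK_MODE variable name and the response field names come from this specification.

```python
import os
from typing import Mapping, Optional


def force_mock_mode(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read FORCE_MOCK_MODE; assumption: '1', 'true' and 'yes' enable it."""
    env = os.environ if env is None else env
    return env.get("FORCE_MOCK_MODE", "false").strip().lower() in {"1", "true", "yes"}


def mock_response(instruction: str) -> dict:
    """Deterministic placeholder output: same instruction, same payload."""
    return {
        "code": "# mock mode: no model was loaded\n",
        "explanation": f"Mock response for instruction: {instruction}",
        "confidence": 0.0,
        "important_tokens": [],
        "relevancy_score": 0.0,
        "hallucination": False,
        "latency_ms": 0,
    }
```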
10) Validation and QA
- Static compile check with python -m compileall
- Lint diagnostics via editor/tooling
- Smoke checks:
  - health endpoint reachable
  - generate endpoint returns full schema
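The "full schema" check can be expressed as a small validator over the decoded JSON body of a /generate response. This sketch covers only the validation step; the HTTP call is omitted, and the function names are hypothetical rather than taken from smoke_test.py.

```python
EXPECTED_FIELDS = {
    "code": str,
    "explanation": str,
    "confidence": float,
    "important_tokens": list,
    "relevancy_score": float,
    "hallucination": bool,
    "latency_ms": int,
}


def _matches(value: object, expected: type) -> bool:
    # bool is a subclass of int in Python, so exclude it from numeric fields.
    if expected is float:
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if expected is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, expected)


def schema_problems(payload: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the schema is full."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not _matches(payload[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems
```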
11) Known Constraints
- First generation may be slow due to model download/warmup
- Quality depends on available model and decoding configuration
- In-memory rate limiter is single-process only
12) Future Enhancements
- Redis-backed distributed rate limiting
- Better language-aware hallucination tests
- Prompt templates per task type
- Streaming token responses
- Persistent vector store (Chroma/FAISS on-disk)
- CI/CD workflow for automated deploy/test