# Advanced Coding LLM - Technical Specification

## 1) Objective

Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:

- Code generation
- Debugging/fixing buggy code
- Code explanation
- Instruction following
- Explainability signals
- Relevancy scoring
- Hallucination checks
- Optional RAG

## 2) Core Functional Requirements

### 2.1 Model

- Primary model: `Qwen/Qwen2.5-Coder-1.5B-Instruct`
- Fallback model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Emergency fallback: a mock response path is used when no model can be loaded
- Architecture compatible with future LoRA integration (`src/lora_prepare.py`)

### 2.2 API

- Framework: FastAPI
- Endpoint: `POST /generate`
- Health: `GET /health`
- Input schema:
  - `instruction: str`
  - `input: str`
- Output schema:
  - `code: str`
  - `explanation: str`
  - `confidence: float`
  - `important_tokens: list[str]`
  - `relevancy_score: float`
  - `hallucination: bool`
  - `latency_ms: int`
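
The schemas above can be sketched as plain Python types. This is an illustration only: in a FastAPI app these would more likely be Pydantic `BaseModel` classes, and the stdlib dataclasses here are used so the sketch runs without dependencies. The value ranges in the comments and the example values are assumptions, not part of the spec.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerateRequest:
    instruction: str  # what the assistant should do
    input: str        # code or context to operate on (may be empty)


@dataclass
class GenerateResponse:
    code: str
    explanation: str
    confidence: float            # assumed range: 0.0 to 1.0
    important_tokens: list[str]
    relevancy_score: float       # assumed range: 0.0 to 1.0
    hallucination: bool          # assumed: True when a check failed
    latency_ms: int


# Illustrative values only; not real model output.
example = GenerateResponse(
    code="def add(a, b):\n    return a + b",
    explanation="Adds two numbers.",
    confidence=0.91,
    important_tokens=["add"],
    relevancy_score=0.62,
    hallucination=False,
    latency_ms=840,
)
payload = asdict(example)  # dict with exactly the seven output fields
```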

### 2.3 Explainability

- Confidence: derived from the model's probabilities for the generated tokens
- Important tokens: the generated tokens with the lowest probabilities
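
A minimal sketch of how both signals could be computed from per-token log-probabilities. The aggregation (arithmetic mean of probabilities), the `top_n` cutoff, and the function name are assumptions; the spec does not fix them.

```python
import math


def confidence_and_important_tokens(tokens, logprobs, top_n=3):
    """Return (confidence, important_tokens) from per-token log-probabilities."""
    if not tokens:
        return 0.0, []
    probs = [math.exp(lp) for lp in logprobs]
    # Assumed aggregation: arithmetic mean of the token probabilities.
    confidence = sum(probs) / len(probs)
    # Rank tokens by ascending probability; the least certain come first.
    ranked = sorted(zip(probs, tokens), key=lambda pair: pair[0])
    important = [tok for _, tok in ranked[:top_n]]
    return confidence, important
```

A geometric mean (the exponential of the mean log-probability) is a common alternative that penalizes a single very unlikely token more heavily.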

### 2.4 Relevancy

- Query-to-output similarity score using TF-IDF + cosine similarity (measures term overlap rather than embedding-based semantics)
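
The scoring idea can be illustrated with a dependency-free sketch. The project may well use a library vectorizer instead; the tokenizer, the smoothed IDF formula, and the function names below are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric/underscore terms."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for counts in tokenized for term in counts)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in tokenized]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_score(query, output):
    """Score in [0, 1]: cosine similarity of the two TF-IDF vectors."""
    q_vec, o_vec = tfidf_vectors([query, output])
    return cosine(q_vec, o_vec)
```

Because the score depends on shared terms, an output that solves the task with different vocabulary from the query can score low.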

### 2.5 Hallucination Checks

- Python syntax validation (`ast.parse`)
- Runtime smoke execution for Python-like outputs
- Skip runtime execution for non-Python-like outputs
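
A minimal sketch of the checks above, using only the standard library. The `looks_like_python` heuristic, the function names, and the default timeout are assumptions; the sketch also skips the syntax check for non-Python-like output, whereas the spec only states that runtime execution is skipped.

```python
import ast
import subprocess
import sys


def looks_like_python(code):
    """Crude heuristic (an assumption): presence of common Python markers."""
    markers = ("def ", "import ", "class ", "print(")
    return any(m in code for m in markers)


def check_hallucination(code, timeout_s=5.0):
    """Return True when the code fails the syntax or runtime smoke check."""
    if not looks_like_python(code):
        return False  # non-Python-like output: checks are skipped
    try:
        ast.parse(code)
    except SyntaxError:
        return True
    try:
        # Run in a separate interpreter, isolated mode, bounded by a timeout.
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0
```

A subprocess with a timeout bounds execution time but is not a sandbox: generated code still runs with the service's file and network permissions.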

### 2.6 RAG

- Basic retrieval from local snippets dataset
- FAISS index over normalized TF-IDF vectors
- Inject top-k snippets into prompt context
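
The retrieval step can be sketched without FAISS. The brute-force search below stands in for an exact inner-product index (such as FAISS `IndexFlatIP`): on L2-normalized vectors the inner product equals cosine similarity. The vectors are taken as given, and the prompt layout and function names are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, snippet_vecs, k=2):
    """Indices of the k snippets with the highest inner product."""
    q = l2_normalize(query_vec)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(v))) for v in snippet_vecs
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def build_prompt(instruction, user_input, snippets):
    """Place retrieved snippets ahead of the task (assumed layout)."""
    context = "\n\n".join(
        f"# Snippet {i + 1}\n{s}" for i, s in enumerate(snippets)
    )
    return (
        f"Reference snippets:\n{context}\n\n"
        f"Instruction: {instruction}\n"
        f"Input: {user_input}\n"
    )
```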

## 3) Non-Functional Requirements

- Runnable on local workstation
- Initial deployment requires no training or fine-tuning
- Lazy-load model to reduce startup failures
- Graceful fallback response when model unavailable
- Windows-compatible developer workflow
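
Lazy loading with a fallback chain could look like the sketch below. `LazyModel` and the `loader` callable are hypothetical names for illustration; the real loading logic lives in `src/model_loader.py` and may differ.

```python
import threading


class LazyModel:
    """Load a model on first use, trying each candidate ID in order."""

    def __init__(self, loader, model_ids):
        self._loader = loader            # callable: model_id -> model object
        self._model_ids = list(model_ids)
        self._model = None
        self._lock = threading.Lock()    # guard against concurrent first loads
        self.loaded_id = None

    def get(self):
        """Return the loaded model, or None if every candidate failed."""
        with self._lock:
            if self._model is None:
                for model_id in self._model_ids:
                    try:
                        self._model = self._loader(model_id)
                        self.loaded_id = model_id
                        break
                    except Exception:
                        continue  # try the next candidate
            return self._model
```

Nothing is loaded at import or startup time, so the API process can start even when the model is unavailable. A `None` result is the caller's cue to return the graceful fallback response. In this sketch a failed load is retried on the next call.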

## 4) Security Requirements

- API key authentication via the `x-api-key` header, enforced only when a key is configured
- Per-IP in-memory rate limiting
- No secrets committed to repository (`.env` ignored)
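
A minimal sketch of the two request guards, independent of FastAPI. The sliding-window strategy, the class and function names, and the injectable `clock` parameter are assumptions; `api/security.py` may implement these differently.

```python
import hmac
import time
from collections import defaultdict, deque


def api_key_ok(provided, configured):
    """Auth passes when no key is configured or the keys match."""
    if not configured:
        return True
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest((provided or "").encode(), configured.encode())


class RateLimiter:
    """Sliding-window limiter keyed by client IP, held in process memory."""

    def __init__(self, max_requests, window_s, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        """Record the request and return True if it is within the limit."""
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Because the state lives in one process, running several workers gives each worker its own counters.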

## 5) Performance Requirements

- Lazy model initialization
- Runtime checks bounded by timeout
- Optional mock mode (`FORCE_MOCK_MODE=true`) for fast operational checks

## 6) Deployment Requirements

### Local

- `python tasks.py install`
- `python tasks.py run`

### Docker

- `docker compose up --build -d`

### Hugging Face Space

- `python tasks.py hf-upload --repo-id <id> --token <token>`
- Gradio entrypoint in `app.py`

## 7) Project Structure

```text
coding-llm/
├── data/
├── src/
├── api/
├── requirements.txt
├── README.md
├── instruction.md
└── specification_file.md
```

## 8) Module Responsibilities

- `api/main.py`: API routes and response wiring
- `api/security.py`: API key + rate limiting
- `src/config.py`: environment-driven settings
- `src/model_loader.py`: model/fallback loading
- `src/generator.py`: generation + confidence extraction
- `src/pipeline.py`: orchestration layer
- `src/rag.py`: snippet retrieval
- `src/relevancy.py`: relevancy score computation
- `src/hallucination.py`: syntax/runtime checks
- `src/lora_prepare.py`: LoRA adapter hook
- `app.py`: Gradio UI for HF Spaces
- `upload_to_hf.py`: HF deployment uploader
- `tasks.py`: command runner
- `smoke_test.py`: runtime integration validation

## 9) Operational Modes

- **Real Model Mode**
  - `FORCE_MOCK_MODE=false`
  - Uses HF model loading and generation
- **Mock Mode**
  - `FORCE_MOCK_MODE=true`
  - Returns deterministic fallback output for reliability testing
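
The mode switch could be read from the environment as sketched below. The accepted truthy strings, the helper names, and the placeholder values in the mock payload are assumptions; only the `FORCE_MOCK_MODE` variable and the field names come from this spec.

```python
import os


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def mock_response():
    """Fixed payload with the full output schema (values are placeholders)."""
    return {
        "code": "# mock mode: no model loaded",
        "explanation": "Deterministic fallback output.",
        "confidence": 0.0,
        "important_tokens": [],
        "relevancy_score": 0.0,
        "hallucination": False,
        "latency_ms": 0,
    }


# Decide the mode once, for example at settings load time.
use_mock = env_flag("FORCE_MOCK_MODE")
```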

## 10) Validation and QA

- Static compile check with `python -m compileall`
- Lint diagnostics via editor/tooling
- Smoke checks:
  - health endpoint reachable
  - generate endpoint returns full schema
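
The smoke checks above might be scripted as follows. This is a sketch under assumptions: the helper names, the sample request body, and the timeout are illustrative, and `smoke_test.py` may be organized differently. It needs a running server, so `run_smoke` is defined but not called here.

```python
import json
import urllib.request

# Field names from the output schema, mapped to the expected JSON types.
EXPECTED_TYPES = {
    "code": str,
    "explanation": str,
    "confidence": (int, float),
    "important_tokens": list,
    "relevancy_score": (int, float),
    "hallucination": bool,
    "latency_ms": int,
}


def schema_problems(payload):
    """List missing or mistyped fields in a /generate response."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in payload:
            problems.append(f"missing: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(f"wrong type: {name}")
    return problems


def run_smoke(base_url, api_key=None, timeout_s=120.0):
    """Hit /health and /generate; an empty list means both checks passed."""
    # urlopen raises on connection failure and on non-2xx responses.
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
        if resp.status != 200:
            return [f"health returned {resp.status}"]
    body = json.dumps({"instruction": "Write hello world", "input": ""}).encode()
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key
    req = urllib.request.Request(
        f"{base_url}/generate", data=body, headers=headers, method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return schema_problems(json.load(resp))
```

The generous default timeout allows for a slow first generation while the model downloads and warms up.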

## 11) Known Constraints

- First generation may be slow due to model download/warmup
- Quality depends on available model and decoding configuration
- In-memory rate limiter is single-process only

## 12) Future Enhancements

- Redis-backed distributed rate limiting
- Better language-aware hallucination tests
- Prompt templates per task type
- Streaming token responses
- Persistent vector store (Chroma/FAISS on-disk)
- CI/CD workflow for automated deploy/test