# Advanced Coding LLM - Technical Specification

## 1) Objective

Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:

- Code generation
- Debugging/fixing buggy code
- Code explanation
- Instruction following
- Explainability signals
- Relevancy scoring
- Hallucination checks
- Optional RAG

## 2) Core Functional Requirements

### 2.1 Model

- Primary model: `Qwen/Qwen2.5-Coder-1.5B-Instruct`
- Fallback model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Emergency fallback mode supported (mock path available)
- Architecture compatible with future LoRA integration (`src/lora_prepare.py`)

### 2.2 API

- Framework: FastAPI
- Endpoint: `POST /generate`
- Health: `GET /health`
- Input schema:
  - `instruction: str`
  - `input: str`
- Output schema:
  - `code: str`
  - `explanation: str`
  - `confidence: float`
  - `important_tokens: list[str]`
  - `relevancy_score: float`
  - `hallucination: bool`
  - `latency_ms: int`
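
The schema above can be sketched as two plain data classes. This is only an illustration: the class names `GenerateRequest` and `GenerateResponse` are hypothetical, and standard-library dataclasses are used to keep the sketch dependency-free, whereas a FastAPI service would more typically declare these as Pydantic models.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerateRequest:
    """Body of POST /generate (class name is an assumption)."""
    instruction: str
    input: str


@dataclass
class GenerateResponse:
    """Response of POST /generate (class name is an assumption)."""
    code: str
    explanation: str
    confidence: float
    important_tokens: list[str]
    relevancy_score: float
    hallucination: bool
    latency_ms: int


# Example payload; the values are made up for illustration only.
example = GenerateResponse(
    code="def add(a, b):\n    return a + b",
    explanation="Adds two numbers.",
    confidence=0.87,
    important_tokens=["add"],
    relevancy_score=0.42,
    hallucination=False,
    latency_ms=120,
)
payload = asdict(example)  # plain dict, ready for JSON serialization
```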

### 2.3 Explainability

- Confidence from token probabilities over generated tokens
- Important tokens extracted from low-probability tokens
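
A minimal sketch of both signals follows. The spec does not say how probabilities are aggregated, so the geometric mean, the `threshold` of 0.5, and the `max_tokens` cap are all assumptions, as is the function name.

```python
import math


def confidence_and_important_tokens(tokens, probs, threshold=0.5, max_tokens=5):
    """Return (confidence, important_tokens) for one generation.

    tokens: generated token strings.
    probs:  probability the model assigned to each generated token.
    Aggregation rule, threshold, and cap are assumptions of this sketch.
    """
    if not tokens or len(tokens) != len(probs):
        return 0.0, []
    # Geometric mean of token probabilities = exp(mean log-probability).
    mean_logp = sum(math.log(max(p, 1e-12)) for p in probs) / len(probs)
    confidence = math.exp(mean_logp)
    # Tokens the model was least sure about, lowest probability first.
    low = sorted((p, i) for i, p in enumerate(probs) if p < threshold)
    important = [tokens[i] for _, i in low[:max_tokens]]
    return confidence, important


conf, important = confidence_and_important_tokens(
    ["def", "foo", "(", ")"], [0.9, 0.2, 0.8, 0.4]
)
```

The geometric mean penalizes a single very uncertain token more than an arithmetic mean would, which is one reason it might be preferred here.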

### 2.4 Relevancy

- Query-to-output similarity score using TF-IDF + cosine similarity (measures lexical overlap rather than deep semantics)
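
The scoring idea can be illustrated without external dependencies. This sketch assumes a simple identifier tokenizer and a smoothed IDF computed over just the two texts; the function names are hypothetical, and a real implementation may rely on a library vectorizer instead.

```python
import math
import re
from collections import Counter


def _tokenize(text):
    # Lowercased identifier-like tokens; the tokenizer is an assumption.
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def relevancy_score(query, output):
    """Cosine similarity between TF-IDF vectors of query and output."""
    docs = [Counter(_tokenize(query)), Counter(_tokenize(output))]
    if not docs[0] or not docs[1]:
        return 0.0
    n = len(docs)
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed inverse document frequency over the two texts.
    idf = {
        t: math.log((1 + n) / (1 + sum(1 for d in docs if t in d))) + 1.0
        for t in vocab
    }
    vecs = [{t: count * idf[t] for t, count in d.items()} for d in docs]
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    norm_q = math.sqrt(sum(w * w for w in vecs[0].values()))
    norm_o = math.sqrt(sum(w * w for w in vecs[1].values()))
    return dot / (norm_q * norm_o)


score = relevancy_score(
    "sort a list in python",
    "def sort_list(items): return sorted(items)  # sort list",
)
```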

### 2.5 Hallucination Checks

- Python syntax validation (`ast.parse`)
- Runtime smoke execution for Python-like outputs
- Skip runtime execution for non-Python-like outputs
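
The three checks above might be combined roughly as follows. The keyword heuristic for "Python-like", the 3-second default timeout, and the function names are assumptions; a subprocess is also not a sandbox, so untrusted output would need stronger isolation in practice.

```python
import ast
import subprocess
import sys


def looks_like_python(code):
    # Crude keyword heuristic; the marker list is an assumption.
    markers = ("def ", "import ", "class ", "print(", "return ")
    return any(m in code for m in markers)


def check_hallucination(code, timeout_s=3.0):
    """Return True when the generated code fails a check."""
    if not looks_like_python(code):
        return False  # non-Python-like output: checks skipped, not flagged
    try:
        ast.parse(code)  # syntax validation only, nothing is executed
    except (SyntaxError, ValueError):
        return True
    try:
        # Smoke run in a separate interpreter, bounded by a timeout.
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0
```

Running the snippet in a child interpreter keeps a crash or an infinite loop in generated code from taking down the API process.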

### 2.6 RAG

- Basic retrieval from local snippets dataset
- FAISS index over normalized TF-IDF vectors
- Inject top-k snippets into prompt context
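
The retrieval-and-inject flow can be sketched in plain Python. To stay dependency-free, this example uses raw term counts and a brute-force inner product in place of TF-IDF weights and a FAISS index; because the vectors are unit-normalized, the inner product equals cosine similarity, which is also what a flat inner-product index would rank by. The function names and the prompt layout are hypothetical.

```python
import math
import re
from collections import Counter


def _unit_vector(text):
    # L2-normalized term-count vector (stand-in for a TF-IDF vector).
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def retrieve(query, snippets, k=2):
    """Return up to k snippets with the highest non-zero similarity."""
    q = _unit_vector(query)
    scored = []
    for snippet in snippets:
        v = _unit_vector(snippet)
        scored.append((sum(w * v.get(t, 0.0) for t, w in q.items()), snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:k] if score > 0]


def build_prompt(instruction, user_input, context_snippets):
    """Place retrieved snippets ahead of the task (layout is an assumption)."""
    parts = []
    if context_snippets:
        parts.append("Relevant snippets:\n" + "\n\n".join(context_snippets))
    parts.append("Instruction: " + instruction)
    if user_input:
        parts.append("Input:\n" + user_input)
    return "\n\n".join(parts)


snippets = [
    "def read_file(path): return open(path).read()",
    "def add(a, b): return a + b",
    "for item in items: print(item)",
]
top = retrieve("read a file from path", snippets, k=1)
prompt = build_prompt("Read a file", "", top)
```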

## 3) Non-Functional Requirements

- Runnable on local workstation
- Initial deployment works without any training or fine-tuning
- Lazy-load model to reduce startup failures
- Graceful fallback response when model unavailable
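
Lazy loading with a fallback chain might look like the sketch below. The `LazyModel` class and its loader callables are hypothetical; in the real service the loaders would wrap the primary and fallback model downloads, and this sketch assumes a failed load is not retried.

```python
import threading


class LazyModel:
    """Load a model on first use, trying loaders in order (sketch)."""

    def __init__(self, loaders):
        self._loaders = loaders  # ordered list of (name, zero-arg callable)
        self._lock = threading.Lock()
        self._tried = False
        self._name = None
        self._model = None

    def get(self):
        """Return (name, model); (None, None) if every loader failed."""
        with self._lock:
            if not self._tried:
                self._tried = True  # assumption: no retry after a failure
                for name, load in self._loaders:
                    try:
                        self._model = load()
                        self._name = name
                        break
                    except Exception:
                        continue  # fall through to the next loader
            return self._name, self._model
```

When `get()` returns `(None, None)`, the caller can serve the graceful fallback response instead of failing the request. Nothing is loaded at import time, so the API process can start even if the model is unavailable.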
- Windows-compatible developer workflow

## 4) Security Requirements

- API key auth via `x-api-key` (if configured)
- Per-IP in-memory rate limiting
- No secrets committed to repository (`.env` ignored)
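
The first two requirements can be illustrated with the standard library alone. The sliding-window policy, the default limits, and the names `api_key_ok` and `RateLimiter` are assumptions of this sketch, not a description of `api/security.py`.

```python
import hmac
import time
from collections import defaultdict, deque


def api_key_ok(provided, configured):
    """Check the x-api-key header; auth is skipped when no key is configured."""
    if not configured:
        return True
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest((provided or "").encode(), configured.encode())


class RateLimiter:
    """Per-IP sliding-window limiter held in process memory (sketch)."""

    def __init__(self, max_requests=60, window_s=60.0, clock=time.monotonic):
        self._max = max_requests
        self._window_s = window_s
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self._window_s:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True
```

Because the counters live in one process, running several workers would give each worker its own independent limit.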

## 5) Performance Requirements

- Lazy model initialization
- Runtime checks bounded by timeout
- Optional mock mode (`FORCE_MOCK_MODE=true`) for fast operational checks

## 6) Deployment Requirements

### Local

- `python tasks.py install`
- `python tasks.py run`

### Docker

- `docker compose up --build -d`

### Hugging Face Space

- `python tasks.py hf-upload --repo-id <id> --token <token>`
- Gradio entrypoint in `app.py`

## 7) Project Structure

```text
coding-llm/
├── data/
├── src/
├── api/
├── requirements.txt
├── README.md
├── instruction.md
└── specification_file.md
```

## 8) Module Responsibilities

- `api/main.py`: API routes and response wiring
- `api/security.py`: API key + rate limiting
- `src/config.py`: environment-driven settings
- `src/model_loader.py`: model/fallback loading
- `src/generator.py`: generation + confidence extraction
- `src/pipeline.py`: orchestration layer
- `src/rag.py`: snippet retrieval
- `src/relevancy.py`: relevancy score computation
- `src/hallucination.py`: syntax/runtime checks
- `src/lora_prepare.py`: LoRA adapter hook
- `app.py`: Gradio UI for HF Spaces
- `upload_to_hf.py`: HF deployment uploader
- `tasks.py`: command runner
- `smoke_test.py`: runtime integration validation

## 9) Operational Modes

- **Real Model Mode**
  - `FORCE_MOCK_MODE=false`
  - Uses HF model loading and generation
- **Mock Mode**
  - `FORCE_MOCK_MODE=true`
  - Returns deterministic fallback output for reliability testing
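
Mode selection could be read from the environment along these lines. Only `FORCE_MOCK_MODE=true` and `FORCE_MOCK_MODE=false` come from this spec; the extra accepted spellings, the default of `false`, and the function names are assumptions.

```python
import os


def force_mock_mode(env=None):
    """Return True when FORCE_MOCK_MODE requests mock mode."""
    env = os.environ if env is None else env
    value = env.get("FORCE_MOCK_MODE", "false")
    # Accepting "1", "yes", and "on" is an assumption of this sketch.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def select_mode(env=None):
    """Map the flag to a mode label (labels are hypothetical)."""
    return "mock" if force_mock_mode(env) else "real"
```

Passing a dictionary as `env` makes the switch easy to exercise in tests without touching the real process environment.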

## 10) Validation and QA

- Static compile check with `python -m compileall`
- Lint diagnostics via editor/tooling
- Smoke checks:
  - health endpoint reachable
  - generate endpoint returns full schema

## 11) Known Constraints

- First generation may be slow due to model download/warmup
- Quality depends on available model and decoding configuration
- In-memory rate limiter is single-process only

## 12) Future Enhancements

- Redis-backed distributed rate limiting
- Better language-aware hallucination tests
- Prompt templates per task type
- Streaming token responses
- Persistent vector store (Chroma/FAISS on-disk)
- CI/CD workflow for automated deploy/test