# Advanced Coding LLM - Technical Specification
## 1) Objective
Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:
- Code generation
- Debugging/fixing buggy code
- Code explanation
- Instruction following
- Explainability signals
- Relevancy scoring
- Hallucination checks
- Optional RAG
## 2) Core Functional Requirements
### 2.1 Model
- Primary model: `Qwen/Qwen2.5-Coder-1.5B-Instruct`
- Fallback model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
- Emergency fallback mode supported (mock path available)
- Architecture compatible with future LoRA integration (`src/lora_prepare.py`)
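
The primary/fallback/mock ordering above can be sketched as a simple loop over candidate model IDs. This is a minimal illustration, not the contents of `src/model_loader.py`: the `load_with_fallback` name and the injected `loader` callable are hypothetical, and returning `(None, None)` to signal the mock path is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

PRIMARY = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
FALLBACK = "Qwen/Qwen2.5-Coder-0.5B-Instruct"


def load_with_fallback(
    loader: Callable[[str], object],
    model_ids: Sequence[str] = (PRIMARY, FALLBACK),
) -> Tuple[Optional[str], Optional[object]]:
    """Try each model id in order; (None, None) signals the mock path."""
    for model_id in model_ids:
        try:
            return model_id, loader(model_id)
        except Exception:
            # Any load failure (download, memory, ...) moves on to the next id.
            continue
    return None, None
```

Passing the loader in as a callable keeps the ordering logic testable without downloading any weights.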
### 2.2 API
- Framework: FastAPI
- Endpoint: `POST /generate`
- Health: `GET /health`
- Input schema:
  - `instruction: str`
  - `input: str`
- Output schema:
  - `code: str`
  - `explanation: str`
  - `confidence: float`
  - `important_tokens: list[str]`
  - `relevancy_score: float`
  - `hallucination: bool`
  - `latency_ms: int`
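
To make the request and response shapes concrete, here is a sketch using standard-library dataclasses. In a FastAPI service these would more typically be Pydantic models; dataclasses are used here only to keep the example dependency-free. The class names `GenerateRequest` and `GenerateResponse` and all sample values are illustrative assumptions.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class GenerateRequest:
    instruction: str
    input: str


@dataclass
class GenerateResponse:
    code: str
    explanation: str
    confidence: float
    important_tokens: List[str]
    relevancy_score: float
    hallucination: bool
    latency_ms: int


# Illustrative values only; real values come from the pipeline.
example = GenerateResponse(
    code="print('hi')",
    explanation="Prints a greeting.",
    confidence=0.91,
    important_tokens=["print"],
    relevancy_score=0.42,
    hallucination=False,
    latency_ms=120,
)
payload = asdict(example)  # plain dict, ready for JSON serialization
```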
### 2.3 Explainability
- Confidence from token probabilities over generated tokens
- Important tokens extracted from low-probability tokens
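
One possible reading of these two signals is sketched below. It assumes the generator exposes a `(token, log-probability)` pair per generated token. The arithmetic mean as the aggregate, the `0.5` threshold, and the `top_n` cap are assumptions chosen for illustration, not values taken from `src/generator.py`.

```python
import math
from typing import List, Tuple


def confidence_and_important_tokens(
    token_logprobs: List[Tuple[str, float]],
    threshold: float = 0.5,  # assumed cutoff for "low probability"
    top_n: int = 5,          # assumed cap on reported tokens
) -> Tuple[float, List[str]]:
    """Return (mean token probability, lowest-probability tokens)."""
    if not token_logprobs:
        return 0.0, []
    probs = [(tok, math.exp(lp)) for tok, lp in token_logprobs]
    confidence = sum(p for _, p in probs) / len(probs)
    # Tokens the model was least sure about, least probable first.
    low = sorted((item for item in probs if item[1] < threshold),
                 key=lambda item: item[1])
    return confidence, [tok for tok, _ in low[:top_n]]
```

A geometric mean (the exponential of the average log-probability) is a common alternative that penalizes a single very unlikely token more strongly.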
### 2.4 Relevancy
- Query-to-output semantic score using TF-IDF + cosine similarity
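
The scoring idea can be illustrated without third-party dependencies. The sketch below fits TF-IDF on just the query and the output, then takes the cosine of the two vectors. The tokenizer regex and the smoothed IDF formula are assumptions, and `src/relevancy.py` may well use a library vectorizer instead.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors with a smoothed IDF (an assumed variant)."""
    counts = [Counter(_tokens(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def relevancy_score(query: str, output: str) -> float:
    q, o = tfidf_vectors([query, output])
    return cosine(q, o)
```

Because the score depends only on shared terms, a correct answer that uses different vocabulary from the query can still score low.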
### 2.5 Hallucination Checks
- Python syntax validation (`ast.parse`)
- Runtime smoke execution for Python-like outputs
- Skip runtime execution for non-Python-like outputs
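
A minimal sketch of the three checks follows. The `looks_like_python` heuristic, the 5-second timeout, and skipping the syntax check for non-Python-like output are all assumptions, and `src/hallucination.py` may decide these differently. Note that a subprocess is not a sandbox, so generated code still runs with the service's privileges.

```python
import ast
import subprocess
import sys


def looks_like_python(code: str) -> bool:
    """Crude heuristic (an assumption): look for common Python markers."""
    markers = ("def ", "import ", "print(", "class ", "return ")
    return any(m in code for m in markers)


def check_hallucination(code: str, timeout_s: float = 5.0) -> bool:
    """Return True when the output fails the syntax or runtime check."""
    if not looks_like_python(code):
        return False  # checks skipped for non-Python-like output
    try:
        ast.parse(code)
    except SyntaxError:
        return True
    try:
        # Smoke-run in a separate interpreter, bounded by a timeout.
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0
```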
### 2.6 RAG
- Basic retrieval from local snippets dataset
- FAISS index over normalized TF-IDF vectors
- Inject top-k snippets into prompt context
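
The retrieval flow can be sketched in plain Python. A brute-force inner product over L2-normalized TF-IDF vectors stands in for the FAISS index here, since on normalized vectors the inner product equals cosine similarity. The `SNIPPETS` data, the function names, and the prompt layout are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

SNIPPETS = [  # hypothetical local snippets dataset
    "def read_file(path): return open(path).read()",
    "def reverse_string(s): return s[::-1]",
    "def add(a, b): return a + b",
]


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def _normalize(vec: Dict[str, float]) -> Dict[str, float]:
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(snippets: List[str]) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    counts = [Counter(_tokens(s)) for s in snippets]
    n = len(snippets)
    df = Counter(t for c in counts for t in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [_normalize({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return idf, vectors


def retrieve(query: str, snippets: List[str], idf: Dict[str, float],
             vectors: List[Dict[str, float]], k: int = 2) -> List[str]:
    # Query terms unseen in the snippet vocabulary are dropped.
    q = _normalize({t: tf * idf[t]
                    for t, tf in Counter(_tokens(query)).items() if t in idf})
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    ranked = sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)
    return [snippets[i] for i in ranked[:k] if scores[i] > 0.0]


def build_prompt(instruction: str, context_snippets: List[str]) -> str:
    context = "\n".join(f"# Reference snippet:\n{s}" for s in context_snippets)
    task = f"# Task:\n{instruction}"
    return f"{context}\n\n{task}" if context else task
```

Snippets with a zero score are dropped rather than padded to `k`, so an unrelated query leaves the prompt free of irrelevant context.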
## 3) Non-Functional Requirements
- Runnable on local workstation
- Deployable initially without any training or fine-tuning
- Lazy-load model to reduce startup failures
- Graceful fallback response when model unavailable
- Windows-compatible developer workflow
## 4) Security Requirements
- API key auth via `x-api-key` (if configured)
- Per-IP in-memory rate limiting
- No secrets committed to repository (`.env` ignored)
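
The first two requirements might look roughly like the sketch below. The sliding-window algorithm, the limits, and the names `api_key_ok` and `SlidingWindowLimiter` are assumptions for illustration. The spec only states that auth applies when a key is configured and that limiting is per IP and in memory.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


def api_key_ok(provided: Optional[str], configured: Optional[str]) -> bool:
    """Auth is enforced only when a key is configured."""
    if not configured:
        return True
    if provided is None:
        return False
    # Constant-time comparison avoids leaking key prefixes via timing.
    return hmac.compare_digest(provided.encode(), configured.encode())


class SlidingWindowLimiter:
    """Per-IP limiter held in process memory (single-process only)."""

    def __init__(self, max_requests: int = 5, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The optional `now` argument exists only so the limiter can be exercised deterministically in tests.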
## 5) Performance Requirements
- Lazy model initialization
- Runtime checks bounded by timeout
- Optional mock mode (`FORCE_MOCK_MODE=true`) for fast operational checks
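
Lazy initialization can be sketched as a small wrapper that defers the expensive load until the first request. The `LazyModel` class and its `load_count` counter are hypothetical, and the double-checked lock is one reasonable choice rather than a description of the project's loader.

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Defers loading until first use; safe under concurrent requests."""

    def __init__(self, loader: Callable[[], object]):
        self._loader = loader
        self._model: Optional[object] = None
        self._lock = threading.Lock()
        self.load_count = 0  # exposed only to make the behavior observable

    def get(self) -> object:
        if self._model is None:
            with self._lock:
                # Re-check inside the lock so only one thread loads.
                if self._model is None:
                    self._model = self._loader()
                    self.load_count += 1
        return self._model
```

With this shape the server process starts immediately, and a slow or failing model load surfaces on the first `/generate` call instead of at startup.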
## 6) Deployment Requirements
### Local
- `python tasks.py install`
- `python tasks.py run`
### Docker
- `docker compose up --build -d`
### Hugging Face Space
- `python tasks.py hf-upload --repo-id <id> --token <token>`
- Gradio entrypoint in `app.py`
## 7) Project Structure
```text
coding-llm/
├── data/
├── src/
├── api/
├── requirements.txt
├── README.md
├── instruction.md
└── specification_file.md
```
## 8) Module Responsibilities
- `api/main.py`: API routes and response wiring
- `api/security.py`: API key + rate limiting
- `src/config.py`: environment-driven settings
- `src/model_loader.py`: model/fallback loading
- `src/generator.py`: generation + confidence extraction
- `src/pipeline.py`: orchestration layer
- `src/rag.py`: snippet retrieval
- `src/relevancy.py`: relevancy score computation
- `src/hallucination.py`: syntax/runtime checks
- `src/lora_prepare.py`: LoRA adapter hook
- `app.py`: Gradio UI for HF Spaces
- `upload_to_hf.py`: HF deployment uploader
- `tasks.py`: command runner
- `smoke_test.py`: runtime integration validation
## 9) Operational Modes
- **Real Model Mode**
  - `FORCE_MOCK_MODE=false`
  - Uses HF model loading and generation
- **Mock Mode**
  - `FORCE_MOCK_MODE=true`
  - Returns deterministic fallback output for reliability testing
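
The mode switch could be read from the environment along these lines. Accepting `1` and `yes` in addition to `true`, the function names, and the contents of the mock payload are assumptions. Only the `FORCE_MOCK_MODE` variable name comes from this specification.

```python
import os
from typing import Mapping, Optional


def mock_mode_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Parse FORCE_MOCK_MODE; unset or unrecognized values mean real mode."""
    env = os.environ if env is None else env
    value = str(env.get("FORCE_MOCK_MODE", "false")).strip().lower()
    return value in {"1", "true", "yes"}


def mock_response() -> dict:
    """Deterministic payload with the full output schema (values assumed)."""
    return {
        "code": "# mock output",
        "explanation": "Mock mode is enabled; no model was invoked.",
        "confidence": 0.0,
        "important_tokens": [],
        "relevancy_score": 0.0,
        "hallucination": False,
        "latency_ms": 0,
    }
```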
## 10) Validation and QA
- Static compile check with `python -m compileall`
- Lint diagnostics via editor/tooling
- Smoke checks:
  - health endpoint reachable
  - generate endpoint returns full schema
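
The "returns full schema" check can be expressed as a small validator over the decoded JSON response. This is a sketch of what such a check might do, not the contents of `smoke_test.py`. The `schema_problems` helper is hypothetical.

```python
from typing import List

EXPECTED_FIELDS = {
    "code": str,
    "explanation": str,
    "confidence": float,
    "important_tokens": list,
    "relevancy_score": float,
    "hallucination": bool,
    "latency_ms": int,
}


def schema_problems(payload: dict) -> List[str]:
    """Return a list of human-readable problems; empty means the schema holds."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
            continue
        value = payload[name]
        if expected is float:
            # JSON has a single number type, so ints are accepted for floats.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif expected is int:
            ok = isinstance(value, int) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems
```

In a smoke test, the payload would come from a `POST /generate` call against a running instance, and a non-empty problem list would fail the check.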
## 11) Known Constraints
- First generation may be slow due to model download/warmup
- Quality depends on available model and decoding configuration
- In-memory rate limiter is single-process only
## 12) Future Enhancements
- Redis-backed distributed rate limiting
- Better language-aware hallucination tests
- Prompt templates per task type
- Streaming token responses
- Persistent vector store (Chroma/FAISS on-disk)
- CI/CD workflow for automated deploy/test