	# Advanced Coding LLM - Technical Specification

	## 1) Objective

	Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:

	- Code generation
	- Debugging/fixing buggy code
	- Code explanation
	- Instruction following
	- Explainability signals
	- Relevancy scoring
	- Hallucination checks
	- Optional RAG

	## 2) Core Functional Requirements

	### 2.1 Model

	- Primary model: `Qwen/Qwen2.5-Coder-1.5B-Instruct`
	- Fallback model: `Qwen/Qwen2.5-Coder-0.5B-Instruct`
	- Emergency fallback: a mock path returns a placeholder response when neither model can be loaded
	- Architecture compatible with future LoRA integration (`src/lora_prepare.py`)

	### 2.2 API

	- Framework: FastAPI
	- Endpoint: `POST /generate`
	- Health: `GET /health`
	- Input schema:
	  - `instruction: str`
	  - `input: str`
	- Output schema:
	  - `code: str`
	  - `explanation: str`
	  - `confidence: float`
	  - `important_tokens: list[str]`
	  - `relevancy_score: float`
	  - `hallucination: bool`
	  - `latency_ms: int`
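The request and response shapes above can be sketched with standard-library dataclasses. This is only an illustration of the schema: a FastAPI service would more typically declare these as Pydantic models, and the class names `GenerateRequest` and `GenerateResponse` as well as all example values are assumptions, not taken from the project.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class GenerateRequest:
    instruction: str  # what the assistant should do
    input: str        # code or context to operate on (may be empty)


@dataclass
class GenerateResponse:
    code: str
    explanation: str
    confidence: float
    important_tokens: list[str] = field(default_factory=list)
    relevancy_score: float = 0.0
    hallucination: bool = False
    latency_ms: int = 0


# Illustrative values only; they are not real model output.
example = GenerateResponse(
    code="def add(a, b):\n    return a + b",
    explanation="Adds two numbers.",
    confidence=0.91,
    important_tokens=["add"],
    relevancy_score=0.42,
    hallucination=False,
    latency_ms=120,
)
payload = asdict(example)  # plain dict, ready for JSON serialization
```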

	### 2.3 Explainability

	- Confidence from token probabilities over generated tokens
	- Important tokens: the generated tokens with the lowest probabilities
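One way these two signals could be computed from per-token probabilities is sketched below. The specification does not say how the probabilities are aggregated, so the geometric mean, the cutoff `k`, and both function names are assumptions made for illustration.

```python
import math


def confidence_from_probs(token_probs: list[float]) -> float:
    """Geometric mean of per-token probabilities (0.0 for an empty sequence)."""
    if not token_probs:
        return 0.0
    # Clamp to avoid log(0) on degenerate inputs.
    log_sum = sum(math.log(max(p, 1e-12)) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


def important_tokens(tokens: list[str], token_probs: list[float], k: int = 3) -> list[str]:
    """Return up to k tokens with the lowest probability, least confident first."""
    ranked = sorted(zip(token_probs, tokens), key=lambda pair: pair[0])
    return [tok for _, tok in ranked[:k]]


tokens = ["def", "add", "(", "a"]
probs = [0.9, 0.4, 0.95, 0.6]
conf = confidence_from_probs(probs)
flagged = important_tokens(tokens, probs, k=2)  # ["add", "a"]
```

A geometric mean penalizes a single very uncertain token more than an arithmetic mean would, which may or may not match the behaviour wanted from `confidence`.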

	### 2.4 Relevancy

	- Query-to-output similarity score using TF-IDF vectors + cosine similarity
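A dependency-free sketch of a TF-IDF + cosine score follows. The project may well use a library vectorizer instead; the tokenizer, the smoothed IDF formula, and the function names here are assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric runs; underscores split identifiers into words.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so terms present in every document keep a non-zero weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_score(query: str, output: str) -> float:
    q_vec, o_vec = tfidf_vectors([query, output])
    return cosine(q_vec, o_vec)
```

Because the score is based on shared terms, an output that answers the query in different words scores low even when it is correct.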

	### 2.5 Hallucination Checks

	- Python syntax validation (`ast.parse`)
	- Runtime smoke execution for Python-like outputs
	- Skip runtime execution for non-Python-like outputs
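The three checks above might be combined as in the sketch below. The `looks_like_python` heuristic, the 5-second default timeout, and the choice to skip both checks for non-Python-like output are assumptions; the specification only names `ast.parse` and a runtime smoke execution.

```python
import ast
import subprocess
import sys


def looks_like_python(code: str) -> bool:
    """Crude heuristic (an assumption): a line starts with a common Python construct."""
    markers = ("def ", "import ", "from ", "class ", "print(")
    return any(line.lstrip().startswith(markers) for line in code.splitlines())


def check_hallucination(code: str, timeout_s: float = 5.0) -> bool:
    """Return True when Python-like output fails the syntax or runtime check."""
    if not looks_like_python(code):
        return False  # checks are skipped for non-Python-like output
    try:
        ast.parse(code)
    except SyntaxError:
        return True
    try:
        # Runs in a child interpreter in isolated mode; this is NOT a sandbox.
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0
```

Executing generated code in a subprocess bounds runtime but does not restrict file or network access, so a production deployment would likely want stronger isolation.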

	### 2.6 RAG

	- Basic retrieval from local snippets dataset
	- FAISS index over normalized TF-IDF vectors
	- Inject top-k snippets into prompt context
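The retrieval and injection steps can be illustrated without FAISS. On L2-normalized vectors the inner product equals cosine similarity, so a brute-force scan returns the same ranking an exact inner-product index would. Term-count vectors stand in for TF-IDF here, and the prompt layout and all function names are assumptions.

```python
import math
import re
from collections import Counter


def unit_vector(text: str) -> dict[str, float]:
    """L2-normalized term-count vector (a stand-in for the TF-IDF weighting)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def top_k_snippets(query: str, snippets: list[str], k: int = 2) -> list[str]:
    q = unit_vector(query)
    scored = []
    for snippet in snippets:
        s = unit_vector(snippet)
        scored.append((sum(w * s.get(t, 0.0) for t, w in q.items()), snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop snippets that share no terms with the query.
    return [snippet for score, snippet in scored[:k] if score > 0.0]


def build_prompt(instruction: str, user_input: str, snippets: list[str], k: int = 2) -> str:
    context = top_k_snippets(f"{instruction} {user_input}", snippets, k)
    parts = []
    if context:
        parts.append("Reference snippets:\n" + "\n---\n".join(context))
    parts.append(f"Instruction: {instruction}")
    if user_input:
        parts.append(f"Input: {user_input}")
    return "\n\n".join(parts)


snippets = [
    "def reverse_string(s): return s[::-1]",
    "def read_file(path): return open(path).read()",
    "SELECT 1",
]
prompt = build_prompt("reverse a string", "", snippets, k=1)
```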

	## 3) Non-Functional Requirements

	- Runnable on local workstation
	- Supports no-training initial deployment
	- Lazy-load model to reduce startup failures
	- Graceful fallback response when model unavailable
	- Windows-compatible developer workflow

	## 4) Security Requirements

	- API key auth via the `x-api-key` header (enforced only if a key is configured)
	- Per-IP in-memory rate limiting
	- No secrets committed to repository (`.env` ignored)
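A minimal sketch of the first two requirements, independent of any web framework, is shown below. The sliding-window strategy, the default limits, and the names `api_key_ok` and `RateLimiter` are assumptions; the specification states only that auth is optional and that limiting is per IP and in memory.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Optional


def api_key_ok(provided: Optional[str], configured: Optional[str]) -> bool:
    """Auth is enforced only when a key is configured."""
    if not configured:
        return True
    if provided is None:
        return False
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(provided.encode(), configured.encode())


class RateLimiter:
    """Sliding-window limiter keyed by client IP; state lives in this process only."""

    def __init__(self, max_requests: int = 60, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

The sketch is not thread-safe and keeps one deque per IP indefinitely; both would need attention in a long-running service.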

	## 5) Performance Requirements

	- Lazy model initialization
	- Runtime checks bounded by timeout
	- Optional mock mode (`FORCE_MOCK_MODE=true`) for fast operational checks
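Lazy initialization with a fallback chain could look roughly like this. The loaders are hypothetical callables standing in for whatever loads the primary and fallback models; caching a failed attempt instead of retrying is also an assumption of this sketch.

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Defers loading until first use and tries each loader in order."""

    def __init__(self, loaders: list[Callable[[], object]]):
        self._loaders = loaders
        self._model: Optional[object] = None
        self._attempted = False
        self._lock = threading.Lock()

    def get(self) -> Optional[object]:
        """Return the loaded model, or None when every loader failed."""
        with self._lock:
            if not self._attempted:
                self._attempted = True
                for load in self._loaders:
                    try:
                        self._model = load()
                        break
                    except Exception:
                        continue  # fall through to the next loader
            return self._model
```

A caller that receives `None` would return the graceful fallback response rather than failing the request. Nothing is loaded at import time, so the service can start and answer health checks even when the model is unavailable.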

	## 6) Deployment Requirements

	### Local

	- `python tasks.py install`
	- `python tasks.py run`

	### Docker

	- `docker compose up --build -d`

	### Hugging Face Space

	- `python tasks.py hf-upload --repo-id <id> --token <token>`
	- Gradio entrypoint in `app.py`

	## 7) Project Structure

	```text
	coding-llm/
	├── data/
	├── src/
	├── api/
	├── requirements.txt
	├── README.md
	├── instruction.md
	└── specification_file.md
	```

	## 8) Module Responsibilities

	- `api/main.py`: API routes and response wiring
	- `api/security.py`: API key + rate limiting
	- `src/config.py`: environment-driven settings
	- `src/model_loader.py`: model/fallback loading
	- `src/generator.py`: generation + confidence extraction
	- `src/pipeline.py`: orchestration layer
	- `src/rag.py`: snippet retrieval
	- `src/relevancy.py`: relevancy score computation
	- `src/hallucination.py`: syntax/runtime checks
	- `src/lora_prepare.py`: LoRA adapter hook
	- `app.py`: Gradio UI for HF Spaces
	- `upload_to_hf.py`: HF deployment uploader
	- `tasks.py`: command runner
	- `smoke_test.py`: runtime integration validation

	## 9) Operational Modes

	- Real Model Mode
	  - `FORCE_MOCK_MODE=false`
	  - Uses HF model loading and generation
	- Mock Mode
	  - `FORCE_MOCK_MODE=true`
	  - Returns deterministic fallback output for reliability testing
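The mode switch might be read from the environment as sketched here. Only the variable name `FORCE_MOCK_MODE` comes from this specification; the accepted truthy spellings, the default of `false`, and the contents of the mock payload are assumptions.

```python
import os
from typing import Mapping


def force_mock_mode(env: Mapping[str, str] = os.environ) -> bool:
    """True when FORCE_MOCK_MODE is set to a truthy value (default: false)."""
    return env.get("FORCE_MOCK_MODE", "false").strip().lower() in {"1", "true", "yes"}


def mock_response(instruction: str) -> dict:
    """Deterministic placeholder: the same instruction always yields the same payload."""
    return {
        "code": "# mock mode: no model was loaded",
        "explanation": f"Mock response for: {instruction}",
        "confidence": 0.0,
        "important_tokens": [],
        "relevancy_score": 0.0,
        "hallucination": False,
        "latency_ms": 0,
    }
```

Keeping the mock payload on the full output schema lets the same smoke checks run in both modes.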

	## 10) Validation and QA

	- Static compile check with `python -m compileall` over the project source directories
	- Lint diagnostics via editor/tooling
	- Smoke checks:
	  - health endpoint reachable
	  - generate endpoint returns full schema
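The "full schema" check can be made concrete with a small validator, sketched below using only the standard library. It is not the contents of `smoke_test.py`; the function names, the base URL passed by the caller, and the timeout are assumptions, and a deployment with an API key configured would also need to send the `x-api-key` header.

```python
import json
import urllib.request

# Field names and types follow the output schema in section 2.2.
EXPECTED_FIELDS = {
    "code": str,
    "explanation": str,
    "confidence": (int, float),
    "important_tokens": list,
    "relevancy_score": (int, float),
    "hallucination": bool,
    "latency_ms": int,
}


def schema_problems(payload: dict) -> list[str]:
    """List missing or mistyped fields; an empty list means the schema is complete."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(f"wrong type: {name}")
    return problems


def post_generate(base_url: str, instruction: str, user_input: str, timeout_s: float = 120.0) -> dict:
    """POST to /generate on a running server and return the decoded JSON body."""
    body = json.dumps({"instruction": instruction, "input": user_input}).encode()
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

A smoke run would then assert that `schema_problems(post_generate(base_url, ...))` is empty, with a generous timeout because the first generation may include model download and warmup.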

	## 11) Known Constraints

	- First generation may be slow due to model download/warmup
	- Quality depends on available model and decoding configuration
	- In-memory rate limiter is single-process only

	## 12) Future Enhancements

	- Redis-backed distributed rate limiting
	- Better language-aware hallucination tests
	- Prompt templates per task type
	- Streaming token responses
	- Persistent vector store (Chroma/FAISS on-disk)
	- CI/CD workflow for automated deploy/test