Advanced Coding LLM - Technical Specification

1) Objective

Build a production-ready coding assistant API deployable locally and on Hugging Face, supporting:

  • Code generation
  • Debugging/fixing buggy code
  • Code explanation
  • Instruction following
  • Explainability signals
  • Relevancy scoring
  • Hallucination checks
  • Optional RAG

2) Core Functional Requirements

2.1 Model

  • Primary model: Qwen/Qwen2.5-Coder-1.5B-Instruct
  • Fallback model: Qwen/Qwen2.5-Coder-0.5B-Instruct
  • Emergency fallback mode supported (mock path available)
  • Architecture compatible with future LoRA integration (src/lora_prepare.py)

2.2 API

  • Framework: FastAPI
  • Endpoint: POST /generate
  • Health: GET /health
  • Input schema:
    • instruction: str
    • input: str
  • Output schema:
    • code: str
    • explanation: str
    • confidence: float
    • important_tokens: list[str]
    • relevancy_score: float
    • hallucination: bool
    • latency_ms: int
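
The request and response shapes above can be sketched with standard-library dataclasses. This is only an illustration: the class names `GenerateRequest` and `GenerateResponse` and all example values are hypothetical, and a FastAPI service would more typically declare these as Pydantic models.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerateRequest:
    """Body of POST /generate (class name is hypothetical)."""
    instruction: str
    input: str


@dataclass
class GenerateResponse:
    """Response of POST /generate, mirroring the output schema field by field."""
    code: str
    explanation: str
    confidence: float
    important_tokens: list[str]
    relevancy_score: float
    hallucination: bool
    latency_ms: int


# Illustrative values only; they are not measurements from the project.
request = GenerateRequest(instruction="Write an add function", input="")
response = GenerateResponse(
    code="def add(a, b):\n    return a + b",
    explanation="Adds two numbers.",
    confidence=0.91,
    important_tokens=["add"],
    relevancy_score=0.42,
    hallucination=False,
    latency_ms=850,
)
payload = asdict(response)  # plain dict, ready for JSON serialization
```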

2.3 Explainability

  • Confidence from token probabilities over generated tokens
  • Important tokens extracted from low-probability tokens
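
A minimal sketch of how both signals could be derived from per-token log-probabilities. The spec does not say how probabilities are aggregated or what counts as "low", so the arithmetic mean and the `threshold` value below are assumptions, and the function name is hypothetical.

```python
import math


def confidence_and_important_tokens(tokens, logprobs, threshold=0.5):
    """Return (confidence, important_tokens) for one generated sequence.

    Assumptions (not stated in the spec): confidence is the arithmetic mean
    of the per-token probabilities, and a token is "important" when its
    probability falls below `threshold`.
    """
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    if not tokens:
        return 0.0, []
    probs = [math.exp(lp) for lp in logprobs]
    confidence = sum(probs) / len(probs)
    important = [tok for tok, p in zip(tokens, probs) if p < threshold]
    return confidence, important


# Made-up probabilities for five generated tokens.
tokens = ["def", " add", "(", "a", ")"]
logprobs = [math.log(p) for p in (0.9, 0.3, 0.95, 0.4, 0.99)]
conf, important = confidence_and_important_tokens(tokens, logprobs)
```

A geometric mean (the exponential of the mean log-probability) would be an equally plausible aggregate; it penalizes a single very unlikely token more strongly than the arithmetic mean does.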

2.4 Relevancy

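  • Query-to-output similarity score: cosine similarity between TF-IDF vectors of the query and the generated output (a lexical-overlap measure rather than an embedding-based one)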
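
The score can be illustrated with a small standard-library sketch. The tokenizer, the function name `tfidf_cosine`, and the choice of fitting TF-IDF on just the two texts are assumptions; the smoothed IDF formula matches the default used by scikit-learn's `TfidfVectorizer`, which a real implementation would more likely call directly.

```python
import math
import re
from collections import Counter


def _tokenize(text):
    # Assumption: lowercase alphanumeric/underscore runs are the terms.
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_cosine(query, output):
    """Cosine similarity between TF-IDF vectors of two texts, in [0, 1]."""
    docs = [Counter(_tokenize(query)), Counter(_tokenize(output))]
    if not docs[0] or not docs[1]:
        return 0.0
    n = len(docs)
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {
        t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1
        for t in vocab
    }
    vecs = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    return dot / (norms[0] * norms[1])


score = tfidf_cosine(
    "sort a list in python",
    "sorted_list = sorted(my_list)  # sort the list",
)
```

Because the measure is purely lexical, an output that answers the query with different vocabulary scores low even when it is correct.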

2.5 Hallucination Checks

  • Python syntax validation (ast.parse)
  • Runtime smoke execution for Python-like outputs
  • Skip runtime execution for non-Python-like outputs
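
The three checks above might be combined as in the sketch below. The `looks_like_python` heuristic, the function names, and the 5-second default timeout are hypothetical, and the sketch assumes that non-Python-like output skips the syntax check as well as the runtime check. Running generated code in a subprocess bounds its run time but does not sandbox it.

```python
import ast
import subprocess
import sys


def looks_like_python(code):
    """Crude heuristic (hypothetical): does the text resemble Python source?"""
    markers = ("def ", "import ", "class ", "print(", "return ")
    return any(marker in code for marker in markers)


def check_hallucination(code, timeout_s=5.0):
    """Return True when generated code fails the syntax or smoke check."""
    if not looks_like_python(code):
        return False  # assumption: other languages are not checked at all
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return True
    try:
        # Not a sandbox: the child process has the same privileges as the API.
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return result.returncode != 0


flagged = check_hallucination("def f(:\n    pass")  # invalid syntax
```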

2.6 RAG

  • Basic retrieval from local snippets dataset
  • FAISS index over normalized TF-IDF vectors
  • Inject top-k snippets into prompt context
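
The retrieval flow can be sketched without external dependencies. In this sketch FAISS is replaced by a brute-force inner-product search in pure Python; on L2-normalized vectors the inner product equals cosine similarity, which is the quantity an exact inner-product index would rank by. The function names, the sample snippets, and the prompt layout are all hypothetical.

```python
import math
import re
from collections import Counter


def _tokens(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def _normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(snippets):
    """Return (idf, vectors): one L2-normalized TF-IDF vector per snippet."""
    counts = [Counter(_tokens(s)) for s in snippets]
    n = len(counts)
    df = Counter(t for c in counts for t in c)
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = [_normalize({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return idf, vectors


def retrieve(query, snippets, idf, vectors, k=2):
    """Return up to k snippets with the highest positive inner product."""
    q_counts = Counter(_tokens(query))
    q = _normalize({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
    scores = [sum(w * v.get(t, 0.0) for t, w in q.items()) for v in vectors]
    ranked = sorted(range(len(snippets)), key=lambda i: scores[i], reverse=True)
    return [snippets[i] for i in ranked[:k] if scores[i] > 0]


def build_prompt(instruction, retrieved):
    """Prepend retrieved snippets to the instruction (layout is hypothetical)."""
    if not retrieved:
        return instruction
    context = "\n\n".join(retrieved)
    return f"Reference snippets:\n{context}\n\nTask:\n{instruction}"


snippets = [
    "def read_json(path): return json.load(open(path))",
    "def fibonacci(n): return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)",
    "def reverse_string(s): return s[::-1]",
]
idf, vectors = build_index(snippets)
top = retrieve("write a fibonacci function", snippets, idf, vectors, k=1)
prompt = build_prompt("write a fibonacci function", top)
```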

3) Non-Functional Requirements

  • Runnable on local workstation
  • Initial deployment requires no training or fine-tuning
  • Lazy-load model to reduce startup failures
  • Graceful fallback response when model unavailable
  • Windows-compatible developer workflow

4) Security Requirements

  • API key authentication via the x-api-key header (enforced only when a key is configured)
  • Per-IP in-memory rate limiting
  • No secrets committed to repository (.env ignored)
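
A sketch of the two request guards, using only the standard library. The sliding-window algorithm, the default limits, and the names `api_key_ok` and `RateLimiter` are assumptions; the spec only states that the limiter is per-IP and held in memory.

```python
import hmac
import time
from collections import defaultdict, deque


def api_key_ok(provided, expected):
    """Allow all requests when no key is configured; else compare in constant time."""
    if not expected:
        return True
    return hmac.compare_digest((provided or "").encode(), expected.encode())


class RateLimiter:
    """Sliding-window limiter keyed by client IP, held in process memory."""

    def __init__(self, max_requests=60, window_s=60.0, clock=time.monotonic):
        self.max_requests = max_requests  # hypothetical default
        self.window_s = window_s          # hypothetical default
        self._clock = clock               # injectable for testing
        self._hits = defaultdict(deque)

    def allow(self, ip):
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Deterministic demonstration with a fake clock.
fake_now = [0.0]
limiter = RateLimiter(max_requests=2, window_s=10.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("10.0.0.1") for _ in range(3)]
```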

5) Performance Requirements

  • Lazy model initialization
  • Runtime checks bounded by timeout
  • Optional mock mode (FORCE_MOCK_MODE=true) for fast operational checks

6) Deployment Requirements

Local

  • python tasks.py install
  • python tasks.py run

Docker

  • docker compose up --build -d

Hugging Face Space

  • python tasks.py hf-upload --repo-id <id> --token <token>
  • Gradio entrypoint in app.py

7) Project Structure

coding-llm/
├── data/
├── src/
├── api/
├── requirements.txt
├── README.md
├── instruction.md
└── specification_file.md

8) Module Responsibilities

  • api/main.py: API routes and response wiring
  • api/security.py: API key + rate limiting
  • src/config.py: environment-driven settings
  • src/model_loader.py: model/fallback loading
  • src/generator.py: generation + confidence extraction
  • src/pipeline.py: orchestration layer
  • src/rag.py: snippet retrieval
  • src/relevancy.py: relevancy score computation
  • src/hallucination.py: syntax/runtime checks
  • src/lora_prepare.py: LoRA adapter hook
  • app.py: Gradio UI for HF Spaces
  • upload_to_hf.py: HF deployment uploader
  • tasks.py: command runner
  • smoke_test.py: runtime integration validation

9) Operational Modes

  • Real Model Mode
    • FORCE_MOCK_MODE=false
    • Uses HF model loading and generation
  • Mock Mode
    • FORCE_MOCK_MODE=true
    • Returns deterministic fallback output for reliability testing
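
Mode selection from the environment might be implemented along these lines. The helper names are hypothetical, and accepting values other than `true`/`false` as well as defaulting to real-model mode when the variable is unset are assumptions.

```python
import os


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    # Assumption: a few common truthy spellings are accepted besides "true".
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def select_mode(environ=None):
    """Return "mock" when FORCE_MOCK_MODE is set to a truthy value, else "real"."""
    return "mock" if env_flag("FORCE_MOCK_MODE", environ=environ) else "real"


mode = select_mode({"FORCE_MOCK_MODE": "true"})
```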

10) Validation and QA

  • Static compile check with python -m compileall
  • Lint diagnostics via editor/tooling
  • Smoke checks:
    • health endpoint reachable
    • generate endpoint returns full schema

11) Known Constraints

  • First generation may be slow due to model download/warmup
  • Quality depends on available model and decoding configuration
  • In-memory rate limiter is single-process only; limits are not shared across multiple workers or replicas

12) Future Enhancements

  • Redis-backed distributed rate limiting
  • Better language-aware hallucination tests
  • Prompt templates per task type
  • Streaming token responses
  • Persistent vector store (Chroma/FAISS on-disk)
  • CI/CD workflow for automated deploy/test