# Product Requirements Document (PRD)
# Project: NCOS_S1 (Large Compliance LLM Pipeline)

## 1. Project Overview
Deploy a large compliance LLM (ACATECH/ncos, Llama-2-70B) on Hugging Face Spaces, with a Next.js frontend (Vercel), Supabase for test cases, and Redis for queueing. The backend is a FastAPI app running in a Docker container for full control (CUDA, dependencies, etc.).

---

## 2. Current State Analysis
- **Backend:**
  - FastAPI app in Hugging Face Space, Dockerized.
  - CUDA and torch set up for GPU inference.
  - Permissions and cache issues resolved.
  - Requirements are mostly correct and reproducible.
- **Frontend:**
  - Next.js app on Vercel (not tightly integrated yet).
- **Test/Queue:**
  - Supabase for test cases.
  - Redis for queueing (not fully integrated).
- **Issues:**
  - Dependency hell (CUDA, torch, flash-attn, numpy, etc.).
  - File permission and cache issues.
  - Model/tokenizer loading errors (corrupt/incompatible files).
  - Manual syncing of requirements and Dockerfile.
- No robust, end-to-end pipeline from test case → queue → model → result → storage.
  - No clear API contract between frontend, backend, and test/queue system.
  - No health checks, monitoring, or error reporting.
  - No automated deployment or CI/CD for the Space.
  - Monolithic codebase, hard to debug.

---

## 3. Goals
- Modular, robust, and reproducible pipeline for LLM compliance testing.
- Clean separation of backend, frontend, and queue/storage.
- Automated, reliable deployment and monitoring.
- Clear API contract and documentation.

---

## 4. Recommended Architecture
### A. Modular Structure
- **Backend (Hugging Face Space):**
  - FastAPI app, Dockerized, REST API for inference.
  - Handles model loading, inference, health checks.
  - Connects to Redis for job queueing.
  - Optionally connects to Supabase for test/result storage.
- **Frontend (Vercel/Next.js):**
  - Calls backend API for inference.
  - Displays results, test case status, health info.
- **Queue/Storage:**
  - Redis for job queueing (decouples frontend/backend).
  - Supabase for storing test cases/results.

### B. Key Features
- Robust error handling and logging in backend.
- Health check endpoints (`/healthz`, `/readyz`).
- Clear API contract (OpenAPI/Swagger for FastAPI).
- Automated Docker build and deployment (version pinning).
- CI/CD pipeline for backend and frontend.
- Documentation for setup, usage, troubleshooting.

---

## 5. Action Plan
### Step 1: Design the API Contract
- Define endpoints for:
  - `/infer` (POST): Accepts input, returns model output.
  - `/healthz` (GET): Returns service health.
  - `/queue` (POST/GET): For job submission/status (if using Redis).
- Use FastAPI's OpenAPI docs for clarity.
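One way to pin the contract down early is to write the request/response schemas as Pydantic models, which FastAPI turns into OpenAPI docs automatically. This is a minimal sketch; the field names and defaults are illustrative, not decided yet:

```python
# Hypothetical request/response schemas for the /infer endpoint.
# Field names and defaults are placeholders to be refined with the team.
from typing import Optional

from pydantic import BaseModel


class InferRequest(BaseModel):
    prompt: str                   # input text for the compliance model
    max_new_tokens: int = 256     # generation budget
    temperature: float = 0.7      # sampling temperature


class InferResponse(BaseModel):
    output: str                           # model completion
    model_name: str = "ACATECH/ncos"      # which checkpoint produced it
    latency_ms: Optional[float] = None    # filled in by the backend
```

Keeping the schemas in one module lets the backend, the queue worker, and API docs all validate against the same definitions.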

### Step 2: Clean Backend Implementation
- Start a new repo or clean branch.
- Write a minimal FastAPI app:
  - Loads model/tokenizer (with robust error handling).
  - Exposes `/infer` and `/healthz`.
  - Logs errors and requests.
- Add Redis integration for queueing (optional, but recommended for scale).
- Add Supabase integration for test/result storage (optional, can be added after core works).

### Step 3: Dockerize the Backend
- Use a clean, minimal Dockerfile:
  - Start from `nvidia/cuda:12.1.0-devel-ubuntu22.04`.
  - Install Python, torch, dependencies in correct order.
  - Set up cache and permissions.
  - Pin all versions in `requirements.txt`.
  - Add a health check in Dockerfile (`HEALTHCHECK`).
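The steps above might look like the following. Package choices, the port (7860 is the Hugging Face Docker Spaces default), and the non-root user are illustrative; pin your own versions in `requirements.txt`:

```dockerfile
# Illustrative Dockerfile for Step 3 -- versions and paths are examples.
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# System Python, curl (for the health check), and git
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip curl git \
    && rm -rf /var/lib/apt/lists/*

# Non-root user and writable HF cache (avoids the permission issues seen earlier)
RUN useradd -m -u 1000 appuser
ENV HF_HOME=/home/appuser/.cache/huggingface
USER appuser
WORKDIR /app

# Install pinned dependencies before copying source, so layers cache well
COPY --chown=appuser requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY --chown=appuser . .

# Liveness probe against the FastAPI health endpoint
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD curl -fsS http://localhost:7860/healthz || exit 1

CMD ["python3", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```

Installing requirements before `COPY . .` means code changes don't invalidate the (slow) dependency layer.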

### Step 4: Model/Tokenizer Management
- Ensure model/tokenizer files are valid and compatible.
- Test loading locally before pushing to Hugging Face.
- Document the process for updating model files.
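One lightweight way to catch corrupt or incompatible files before they reach the Space is a SHA-256 manifest checked at build or startup time. A sketch using only the standard library (file names in the example are hypothetical):

```python
# Verify model/tokenizer files against a SHA-256 manifest before deploying.
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so 70B-scale shards never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(model_dir: Path, manifest: dict) -> list:
    """Return the names of files whose hash does not match the manifest."""
    return [
        name for name, expected in manifest.items()
        if sha256_of(model_dir / name) != expected
    ]
```

Generate the manifest once from a known-good local copy, commit it next to the Dockerfile, and fail the startup if `verify_manifest` returns anything.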

### Step 5: Frontend Integration
- Update Next.js frontend to call the new backend API.
- Show job status, results, and health info.
- Add error handling and user feedback.

### Step 6: Queue and Storage Integration
- Set up Redis for job queueing.
- Set up Supabase for test case/result storage.
- Ensure backend can pull jobs from Redis, process, and store results in Supabase.
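The job lifecycle above can be sketched end to end. To keep the sketch runnable without live services, `queue.Queue` stands in for the Redis list (`RPUSH`/`BLPOP`) and a plain dict for the Supabase table; all names are illustrative:

```python
# Job lifecycle sketch for Step 6: pull job -> run model -> store result.
# queue.Queue stands in for Redis and the results dict for Supabase, so the
# control flow can be read and tested without either service running.
import json
import queue

jobs: "queue.Queue[str]" = queue.Queue()   # stand-in for a Redis list
results: dict = {}                          # stand-in for a Supabase table


def run_model(prompt: str) -> str:
    # Placeholder for the real /infer call or in-process generation.
    return f"[output for] {prompt}"


def worker_step() -> bool:
    """Process one job; return False when the queue is drained."""
    try:
        raw = jobs.get_nowait()             # Redis: BLPOP with a timeout
    except queue.Empty:
        return False
    job = json.loads(raw)
    try:
        output = run_model(job["prompt"])
        results[job["id"]] = {"status": "done", "output": output}
    except Exception as exc:                # store failures, don't drop jobs
        results[job["id"]] = {"status": "error", "error": str(exc)}
    return True


# Frontend side: RPUSH a JSON-encoded job onto the list.
jobs.put(json.dumps({"id": "t-1", "prompt": "Check clause 4.2"}))
while worker_step():
    pass
```

The important property to preserve when swapping in Redis/Supabase is that every job ends in a terminal state (`done` or `error`) that the frontend can poll.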

### Step 7: Monitoring and Health
- Add logging and error reporting (e.g., to stdout, or a logging service).
- Implement `/healthz` and `/readyz` endpoints.
- Optionally, add Prometheus/Grafana metrics.
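For the stdout option, emitting one JSON object per log line keeps the Space logs machine-parseable by whatever collector sits downstream. A stdlib-only sketch; the field names are a common convention, not prescribed:

```python
# Structured (JSON-lines) logging to stdout for Step 7.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:  # include tracebacks for error reporting
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("ncos-backend")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("model loaded")  # one JSON object per line on stdout
```

Because each line is valid JSON, the same output works for plain `docker logs` reading and for a structured logging service later.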

### Step 8: CI/CD and Documentation
- Add GitHub Actions or similar for automated build/test/deploy.
- Write clear README and API docs.
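A deploy workflow for the Space can be as small as a git push from CI, following the pattern Hugging Face documents for syncing a GitHub repo to a Space. The secret name, username, and Space path below are placeholders:

```yaml
# Illustrative GitHub Actions workflow for Step 8. HF_TOKEN, hf-user, and
# the Space path are placeholders -- substitute your own values.
name: deploy-space
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # Spaces need the full history for git push
      - name: Push to Hugging Face Space
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: |
          git push https://hf-user:$HF_TOKEN@huggingface.co/spaces/hf-user/ncos-backend main
```

A separate job (or Vercel's own git integration) handles the frontend; keeping them independent means a frontend typo can't block a model redeploy.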

---

## 6. Success Criteria
- End-to-end pipeline works: test case → queue → model → result → storage.
- Robust error handling and health checks in place.
- Automated, reproducible builds and deployments.
- Clear, up-to-date documentation for all components.