lamionx committed on
Commit
8a464c3
·
1 Parent(s): 6a0df05

Deploy AI API with FastAPI + Mistral-7B: streaming, authentication, multi-language support, caching

Files changed (4)
  1. Dockerfile +77 -0
  2. README.md +45 -8
  3. app.py +734 -0
  4. requirements.txt +18 -0
Dockerfile ADDED
@@ -0,0 +1,77 @@
+ # ============================================================================
+ # AI API Dockerfile - Optimized for Hugging Face Spaces
+ # ============================================================================
+ # This Dockerfile is configured to run on Hugging Face Spaces with:
+ # - Port 7860 (HF Spaces default)
+ # - Non-root user for security
+ # - CUDA support for GPU acceleration
+ # - 4-bit quantization for efficient memory usage
+ # ============================================================================
+
+ FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
+
+ # Set environment variables
+ ENV DEBIAN_FRONTEND=noninteractive \
+     PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1 \
+     PIP_NO_CACHE_DIR=1 \
+     PIP_DISABLE_PIP_VERSION_CHECK=1 \
+     CUDA_HOME=/usr/local/cuda \
+     PATH=/usr/local/cuda/bin:${PATH} \
+     LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     python3.10 \
+     python3-pip \
+     python3-dev \
+     build-essential \
+     git \
+     wget \
+     curl \
+     ca-certificates \
+     libgomp1 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Create non-root user for security
+ RUN useradd -m -u 1000 -s /bin/bash appuser
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements first (for better caching)
+ COPY requirements.txt .
+
+ # Upgrade pip and install Python dependencies
+ RUN pip3 install --upgrade pip setuptools wheel && \
+     pip3 install --no-cache-dir -r requirements.txt
+
+ # Copy application code (the entrypoint added in this commit is app.py)
+ COPY app.py .
+
+ # Create necessary directories
+ RUN mkdir -p /app/logs && \
+     chown -R appuser:appuser /app
+
+ # Switch to non-root user
+ USER appuser
+
+ # Set default environment variables (can be overridden)
+ ENV MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.2" \
+     API_KEY="your-secret-api-key-here" \
+     MAX_LENGTH="2048" \
+     TEMPERATURE="0.7" \
+     TOP_P="0.95" \
+     CACHE_SIZE="100" \
+     PORT="7860" \
+     HOST="0.0.0.0"
+
+ # Expose port 7860 (Hugging Face Spaces default)
+ EXPOSE 7860
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+     CMD curl -f http://localhost:7860/health || exit 1
+
+ # Run the application
+ CMD ["python3", "app.py"]
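
For local testing, the image can be built and run roughly as below. This is a sketch: the image tag is a placeholder, and a GPU plus the NVIDIA container runtime are assumed, since the app loads the model with 4-bit CUDA quantization.

```shell
# Build the image (run from the repo root, next to Dockerfile and app.py)
docker build -t ai-api-mistral .

# Run with GPU access, overriding the default API key
docker run --gpus all -p 7860:7860 \
  -e API_KEY="change-me" \
  ai-api-mistral

# Once the model has loaded, check liveness:
curl http://localhost:7860/health
```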
README.md CHANGED
@@ -1,13 +1,50 @@
  ---
- title: Laminou
- emoji: 🏢
- colorFrom: gray
- colorTo: blue
- sdk: gradio
- sdk_version: 6.3.0
- app_file: app.py
+ title: AI API - Mistral 7B
+ emoji: 🤖
+ colorFrom: indigo
+ colorTo: purple
+ sdk: docker
+ app_port: 7860
  pinned: false
  license: apache-2.0
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AI API - Mistral 7B
+
+ Production-ready AI API powered by Mistral-7B-Instruct-v0.2 with:
+
+ - ⚡ FastAPI with streaming responses
+ - 🔐 API Key authentication
+ - 🌍 Multi-language support (8 languages)
+ - 💾 Smart response caching
+ - 🎨 Built-in web interface
+ - 🚀 Optimized 4-bit quantization
+
+ ## Usage
+
+ ### Web Interface
+
+ Open the Space URL in your browser to access the interactive chat interface.
+
+ ### API Endpoint
+
+ ```bash
+ curl -X POST "https://lamionx-laminou.hf.space/api/chat" \
+   -H "X-API-Key: your-secret-api-key-here" \
+   -H "Content-Type: application/json" \
+   -d '{
+     "prompt": "Explain quantum computing",
+     "language": "en",
+     "stream": false
+   }'
+ ```
+
+ ### Configuration
+
+ Set your API key as a Space secret:
+ - Go to Settings → Repository secrets
+ - Add: `API_KEY` = `your-secret-api-key-here`
+
+ ## Documentation
+
+ For full documentation, visit the [GitHub repository](https://github.com/your-repo).
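
The curl example above can also be consumed from Python. The sketch below assumes the streaming endpoint emits SSE frames of the form `data: {"token": "..."}` terminated by `data: [DONE]` (the format served by app.py in this commit); the base URL and API key are placeholders, and `stream_chat` needs the third-party `requests` package.

```python
"""Minimal Python client sketch for the /api/chat streaming endpoint."""
import json


def parse_sse_line(line: str):
    """Return the token carried by one `data:` SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    try:
        return json.loads(payload).get("token")
    except json.JSONDecodeError:
        return None


def stream_chat(prompt, api_key, base_url="https://lamionx-laminou.hf.space"):
    """Yield response tokens as they arrive (needs `pip install requests`)."""
    import requests  # third-party; imported lazily so the parser works without it
    resp = requests.post(
        f"{base_url}/api/chat",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={"prompt": prompt, "language": "en", "stream": True},
        stream=True,
    )
    resp.raise_for_status()
    for raw in resp.iter_lines(decode_unicode=True):
        token = parse_sse_line(raw or "")
        if token is not None:
            yield token
```

Usage would look like `for tok in stream_chat("Hello", "your-secret-api-key-here"): print(tok, end="")`.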
app.py ADDED
@@ -0,0 +1,734 @@
+ """
+ AI API Server - FastAPI + Mistral-7B
+ Production-ready API with streaming, authentication, and caching
+ """
+
+ import os
+ import logging
+ import asyncio
+ from typing import Optional, Dict, Any, AsyncGenerator
+ from datetime import datetime
+ from functools import lru_cache
+ import hashlib
+
+ from fastapi import FastAPI, HTTPException, Header, Request, status
+ from fastapi.responses import StreamingResponse, HTMLResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel, Field, validator
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+ import uvicorn
+
+ # ============================================================================
+ # LOGGING CONFIGURATION
+ # ============================================================================
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+     handlers=[
+         logging.StreamHandler(),
+         logging.FileHandler("api.log")
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+ # ============================================================================
+ # CONFIGURATION
+ # ============================================================================
+
+ class Config:
+     """Application configuration"""
+     MODEL_NAME = os.getenv("MODEL_NAME", "mistralai/Mistral-7B-Instruct-v0.2")
+     API_KEY = os.getenv("API_KEY", "your-secret-api-key-here")
+     MAX_LENGTH = int(os.getenv("MAX_LENGTH", "2048"))
+     TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
+     TOP_P = float(os.getenv("TOP_P", "0.95"))
+     CACHE_SIZE = int(os.getenv("CACHE_SIZE", "100"))
+     PORT = int(os.getenv("PORT", "7860"))
+     HOST = os.getenv("HOST", "0.0.0.0")
+
+     # Quantization config for 4-bit loading (optimized for free hardware)
+     QUANTIZATION_CONFIG = BitsAndBytesConfig(
+         load_in_4bit=True,
+         bnb_4bit_compute_dtype=torch.float16,
+         bnb_4bit_use_double_quant=True,
+         bnb_4bit_quant_type="nf4"
+     )
+
+ config = Config()
+
+ # ============================================================================
+ # PYDANTIC MODELS
+ # ============================================================================
+
+ class ChatRequest(BaseModel):
+     """Request model for chat endpoint"""
+     prompt: str = Field(..., min_length=1, max_length=4000, description="User prompt")
+     language: str = Field(default="en", description="Response language (en, pt, es)")
+     temperature: Optional[float] = Field(default=None, ge=0.0, le=2.0)
+     max_tokens: Optional[int] = Field(default=None, ge=1, le=4096)
+     stream: bool = Field(default=True, description="Enable streaming response")
+
+     @validator("language")
+     def validate_language(cls, v):
+         allowed = ["en", "pt", "es", "fr", "de", "it", "ja", "zh"]
+         if v not in allowed:
+             raise ValueError(f"Language must be one of {allowed}")
+         return v
+
+ class ChatResponse(BaseModel):
+     """Response model for chat endpoint"""
+     response: str
+     language: str
+     model: str
+     timestamp: str
+     cached: bool = False
+
+ class HealthResponse(BaseModel):
+     """Health check response"""
+     status: str
+     model_loaded: bool
+     timestamp: str
+
+ # ============================================================================
+ # SYSTEM PROMPTS (MULTI-LANGUAGE)
+ # ============================================================================
+
+ SYSTEM_PROMPTS = {
+     "en": "You are a helpful, respectful and honest AI assistant. Always answer as helpfully as possible, while being safe. If you don't know the answer, say so instead of making up information.",
+     "pt": "Você é um assistente de IA útil, respeitoso e honesto. Sempre responda da forma mais útil possível, mantendo a segurança. Se não souber a resposta, diga isso ao invés de inventar informações.",
+     "es": "Eres un asistente de IA útil, respetuoso y honesto. Siempre responde de la manera más útil posible, manteniendo la seguridad. Si no sabes la respuesta, dilo en lugar de inventar información.",
+     "fr": "Vous êtes un assistant IA utile, respectueux et honnête. Répondez toujours de la manière la plus utile possible, tout en restant sûr. Si vous ne connaissez pas la réponse, dites-le au lieu d'inventer des informations.",
+     "de": "Sie sind ein hilfreicher, respektvoller und ehrlicher KI-Assistent. Antworten Sie immer so hilfreich wie möglich und bleiben Sie dabei sicher. Wenn Sie die Antwort nicht wissen, sagen Sie es, anstatt Informationen zu erfinden.",
+     "it": "Sei un assistente AI utile, rispettoso e onesto. Rispondi sempre nel modo più utile possibile, mantenendo la sicurezza. Se non conosci la risposta, dillo invece di inventare informazioni.",
+     "ja": "あなたは親切で、礼儀正しく、正直なAIアシスタントです。常に安全を保ちながら、できるだけ役立つように答えてください。答えがわからない場合は、情報を作り上げるのではなく、そう言ってください。",
+     "zh": "你是一个乐于助人、尊重他人且诚实的AI助手。在保持安全的同时,始终尽可能有帮助地回答。如果你不知道答案,请说出来,而不是编造信息。"
+ }
+
+ # ============================================================================
+ # SIMPLE CACHE IMPLEMENTATION
+ # ============================================================================
+
+ class ResponseCache:
+     """Simple in-memory cache for responses"""
+
+     def __init__(self, max_size: int = 100):
+         self.cache: Dict[str, tuple[str, datetime]] = {}
+         self.max_size = max_size
+         logger.info(f"Initialized cache with max size: {max_size}")
+
+     def _generate_key(self, prompt: str, language: str, temperature: float) -> str:
+         """Generate cache key from parameters"""
+         content = f"{prompt}:{language}:{temperature}"
+         return hashlib.md5(content.encode()).hexdigest()
+
+     def get(self, prompt: str, language: str, temperature: float) -> Optional[str]:
+         """Retrieve cached response"""
+         key = self._generate_key(prompt, language, temperature)
+         if key in self.cache:
+             response, timestamp = self.cache[key]
+             logger.info(f"Cache HIT for key: {key[:8]}...")
+             return response
+         logger.info(f"Cache MISS for key: {key[:8]}...")
+         return None
+
+     def set(self, prompt: str, language: str, temperature: float, response: str):
+         """Store response in cache"""
+         if len(self.cache) >= self.max_size:
+             # Remove oldest entry
+             oldest_key = min(self.cache.keys(), key=lambda k: self.cache[k][1])
+             del self.cache[oldest_key]
+             logger.info(f"Cache full, removed oldest entry: {oldest_key[:8]}...")
+
+         key = self._generate_key(prompt, language, temperature)
+         self.cache[key] = (response, datetime.now())
+         logger.info(f"Cached response for key: {key[:8]}...")
+
+ # ============================================================================
149
+ # MODEL LOADING
150
+ # ============================================================================
151
+
152
+ class ModelManager:
153
+ """Manages model loading and inference"""
154
+
155
+ def __init__(self):
156
+ self.model = None
157
+ self.tokenizer = None
158
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
159
+ logger.info(f"Device: {self.device}")
160
+
161
+ async def load_model(self):
162
+ """Load model with quantization"""
163
+ try:
164
+ logger.info(f"Loading model: {config.MODEL_NAME}")
165
+
166
+ # Load tokenizer
167
+ self.tokenizer = AutoTokenizer.from_pretrained(
168
+ config.MODEL_NAME,
169
+ trust_remote_code=True
170
+ )
171
+
172
+ # Load model with 4-bit quantization
173
+ self.model = AutoModelForCausalLM.from_pretrained(
174
+ config.MODEL_NAME,
175
+ quantization_config=config.QUANTIZATION_CONFIG,
176
+ device_map="auto",
177
+ trust_remote_code=True,
178
+ low_cpu_mem_usage=True
179
+ )
180
+
181
+ logger.info("Model loaded successfully!")
182
+ return True
183
+
184
+ except Exception as e:
185
+ logger.error(f"Failed to load model: {str(e)}", exc_info=True)
186
+ return False
187
+
188
+ def format_prompt(self, prompt: str, language: str) -> str:
189
+ """Format prompt with system message"""
190
+ system_prompt = SYSTEM_PROMPTS.get(language, SYSTEM_PROMPTS["en"])
191
+ return f"<s>[INST] {system_prompt}\n\nUser: {prompt} [/INST]"
192
+
+     async def generate_stream(
+         self,
+         prompt: str,
+         language: str,
+         temperature: float,
+         max_tokens: int
+     ) -> AsyncGenerator[str, None]:
+         """Generate response with streaming"""
+         # Local imports keep this method self-contained
+         from threading import Thread
+         from transformers import TextIteratorStreamer
+
+         try:
+             formatted_prompt = self.format_prompt(prompt, language)
+
+             # Tokenize input
+             inputs = self.tokenizer(
+                 formatted_prompt,
+                 return_tensors="pt",
+                 truncation=True,
+                 max_length=config.MAX_LENGTH
+             ).to(self.device)
+
+             # TextIteratorStreamer yields decoded text as generate() runs,
+             # instead of re-running a full generate() pass per token
+             streamer = TextIteratorStreamer(
+                 self.tokenizer,
+                 skip_prompt=True,
+                 skip_special_tokens=True
+             )
+             generation_kwargs = dict(
+                 **inputs,
+                 max_new_tokens=max_tokens,
+                 temperature=temperature,
+                 top_p=config.TOP_P,
+                 do_sample=True,
+                 pad_token_id=self.tokenizer.eos_token_id,
+                 streamer=streamer
+             )
+
+             # generate() blocks, so run it in a background thread and pull
+             # decoded text off the streamer without blocking the event loop
+             thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
+             thread.start()
+
+             iterator = iter(streamer)
+             while True:
+                 token = await asyncio.to_thread(next, iterator, None)
+                 if token is None:
+                     break
+                 yield token
+
+             thread.join()
+
+         except Exception as e:
+             logger.error(f"Generation error: {str(e)}", exc_info=True)
+             yield f"\n\n[Error: {str(e)}]"
+
+     async def generate(
+         self,
+         prompt: str,
+         language: str,
+         temperature: float,
+         max_tokens: int
+     ) -> str:
+         """Generate complete response (non-streaming)"""
+         try:
+             formatted_prompt = self.format_prompt(prompt, language)
+
+             inputs = self.tokenizer(
+                 formatted_prompt,
+                 return_tensors="pt",
+                 truncation=True,
+                 max_length=config.MAX_LENGTH
+             ).to(self.device)
+
+             with torch.no_grad():
+                 outputs = self.model.generate(
+                     **inputs,
+                     max_new_tokens=max_tokens,
+                     temperature=temperature,
+                     top_p=config.TOP_P,
+                     do_sample=True,
+                     pad_token_id=self.tokenizer.eos_token_id
+                 )
+
+             response = self.tokenizer.decode(
+                 outputs[0][inputs["input_ids"].shape[1]:],
+                 skip_special_tokens=True
+             )
+
+             return response.strip()
+
+         except Exception as e:
+             logger.error(f"Generation error: {str(e)}", exc_info=True)
+             raise HTTPException(
+                 status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+                 detail=f"Generation failed: {str(e)}"
+             )
+
+ # ============================================================================
+ # FASTAPI APPLICATION
+ # ============================================================================
+
+ app = FastAPI(
+     title="AI API - Mistral 7B",
+     description="Production-ready AI API with streaming, authentication, and caching",
+     version="1.0.0",
+     docs_url="/docs",
+     redoc_url="/redoc"
+ )
+
+ # CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Global instances
+ model_manager = ModelManager()
+ cache = ResponseCache(max_size=config.CACHE_SIZE)
+
+ # ============================================================================
+ # AUTHENTICATION
+ # ============================================================================
+
+ async def verify_api_key(x_api_key: str = Header(..., alias="X-API-Key")):
+     """Verify API key from header"""
+     if x_api_key != config.API_KEY:
+         logger.warning(f"Invalid API key attempt: {x_api_key[:8]}...")
+         raise HTTPException(
+             status_code=status.HTTP_401_UNAUTHORIZED,
+             detail="Invalid API key"
+         )
+     return x_api_key
+
+ # ============================================================================
+ # STARTUP/SHUTDOWN EVENTS
+ # ============================================================================
+
+ @app.on_event("startup")
+ async def startup_event():
+     """Load model on startup"""
+     logger.info("Starting AI API server...")
+     success = await model_manager.load_model()
+     if not success:
+         logger.error("Failed to load model, server may not function correctly")
+
+ @app.on_event("shutdown")
+ async def shutdown_event():
+     """Cleanup on shutdown"""
+     logger.info("Shutting down AI API server...")
+     # Clear cache
+     cache.cache.clear()
+
+ # ============================================================================
+ # ROUTES
+ # ============================================================================
+
+ @app.get("/", response_class=HTMLResponse)
+ async def root():
+     """Serve frontend HTML"""
+     html_content = """
+     <!DOCTYPE html>
+     <html lang="en">
+     <head>
+         <meta charset="UTF-8">
+         <meta name="viewport" content="width=device-width, initial-scale=1.0">
+         <title>AI API - Mistral 7B</title>
+         <script src="https://cdn.tailwindcss.com"></script>
+         <style>
+             @keyframes fadeIn {
+                 from { opacity: 0; transform: translateY(10px); }
+                 to { opacity: 1; transform: translateY(0); }
+             }
+             .message {
+                 animation: fadeIn 0.3s ease-out;
+             }
+             .typing-indicator {
+                 display: inline-block;
+             }
+             .typing-indicator span {
+                 display: inline-block;
+                 width: 8px;
+                 height: 8px;
+                 border-radius: 50%;
+                 background-color: #6366F1;
+                 margin: 0 2px;
+                 animation: typing 1.4s infinite;
+             }
+             .typing-indicator span:nth-child(2) {
+                 animation-delay: 0.2s;
+             }
+             .typing-indicator span:nth-child(3) {
+                 animation-delay: 0.4s;
+             }
+             @keyframes typing {
+                 0%, 60%, 100% { transform: translateY(0); }
+                 30% { transform: translateY(-10px); }
+             }
+         </style>
+     </head>
+     <body class="bg-gradient-to-br from-slate-50 to-slate-100 min-h-screen">
+         <div class="container mx-auto px-4 py-8 max-w-4xl">
+             <!-- Header -->
+             <div class="bg-white rounded-2xl shadow-lg p-6 mb-6">
+                 <h1 class="text-3xl font-bold text-slate-800 mb-2">🤖 AI API - Mistral 7B</h1>
+                 <p class="text-slate-600">Production-ready AI API with streaming responses</p>
+             </div>
+
+             <!-- API Key Section -->
+             <div class="bg-white rounded-2xl shadow-lg p-6 mb-6">
+                 <label class="block text-sm font-semibold text-slate-700 mb-2">API Key</label>
+                 <input
+                     type="password"
+                     id="apiKey"
+                     placeholder="Enter your API key"
+                     class="w-full px-4 py-3 border border-slate-300 rounded-lg focus:ring-2 focus:ring-indigo-500 focus:border-transparent outline-none"
+                 />
+             </div>
+
+             <!-- Chat Interface -->
+             <div class="bg-white rounded-2xl shadow-lg p-6 mb-6">
+                 <div id="messages" class="space-y-4 mb-6 max-h-96 overflow-y-auto">
+                     <div class="text-center text-slate-400 py-8">
+                         Start a conversation by typing a message below
+                     </div>
+                 </div>
+
+                 <!-- Input Area -->
+                 <div class="flex gap-3">
+                     <select id="language" class="px-4 py-3 border border-slate-300 rounded-lg focus:ring-2 focus:ring-indigo-500 outline-none">
+                         <option value="en">English</option>
+                         <option value="pt">Português</option>
+                         <option value="es">Español</option>
+                         <option value="fr">Français</option>
+                         <option value="de">Deutsch</option>
+                         <option value="it">Italiano</option>
+                         <option value="ja">日本語</option>
+                         <option value="zh">中文</option>
+                     </select>
+                     <input
+                         type="text"
+                         id="prompt"
+                         placeholder="Type your message..."
+                         class="flex-1 px-4 py-3 border border-slate-300 rounded-lg focus:ring-2 focus:ring-indigo-500 focus:border-transparent outline-none"
+                     />
+                     <button
+                         onclick="sendMessage()"
+                         id="sendBtn"
+                         class="px-6 py-3 bg-indigo-600 text-white font-semibold rounded-lg hover:bg-indigo-700 transition-colors disabled:bg-slate-300 disabled:cursor-not-allowed"
+                     >
+                         Send
+                     </button>
+                 </div>
+             </div>
+
+             <!-- Documentation -->
+             <div class="bg-white rounded-2xl shadow-lg p-6">
+                 <h2 class="text-xl font-bold text-slate-800 mb-4">📚 API Documentation</h2>
+                 <div class="space-y-4">
+                     <div>
+                         <h3 class="font-semibold text-slate-700 mb-2">Endpoint</h3>
+                         <code class="block bg-slate-100 p-3 rounded-lg text-sm">POST /api/chat</code>
+                     </div>
+                     <div>
+                         <h3 class="font-semibold text-slate-700 mb-2">Headers</h3>
+                         <code class="block bg-slate-100 p-3 rounded-lg text-sm">X-API-Key: your-api-key</code>
+                     </div>
+                     <div>
+                         <h3 class="font-semibold text-slate-700 mb-2">Example (curl)</h3>
+                         <pre class="bg-slate-100 p-3 rounded-lg text-sm overflow-x-auto"><code>curl -X POST "http://localhost:7860/api/chat" \\
+     -H "X-API-Key: your-secret-api-key-here" \\
+     -H "Content-Type: application/json" \\
+     -d '{
+       "prompt": "Explain quantum computing",
+       "language": "en",
+       "stream": false
+     }'</code></pre>
+                     </div>
+                 </div>
+             </div>
+         </div>
+
+         <script>
+             const messagesDiv = document.getElementById('messages');
+             const promptInput = document.getElementById('prompt');
+             const apiKeyInput = document.getElementById('apiKey');
+             const languageSelect = document.getElementById('language');
+             const sendBtn = document.getElementById('sendBtn');
+
+             // Load API key from localStorage
+             const savedApiKey = localStorage.getItem('apiKey');
+             if (savedApiKey) {
+                 apiKeyInput.value = savedApiKey;
+             }
+
+             // Save API key on change
+             apiKeyInput.addEventListener('change', () => {
+                 localStorage.setItem('apiKey', apiKeyInput.value);
+             });
+
+             // Send on Enter
+             promptInput.addEventListener('keypress', (e) => {
+                 if (e.key === 'Enter' && !e.shiftKey) {
+                     e.preventDefault();
+                     sendMessage();
+                 }
+             });
+
+             function addMessage(content, isUser = false) {
+                 if (messagesDiv.children[0]?.textContent.includes('Start a conversation')) {
+                     messagesDiv.innerHTML = '';
+                 }
+
+                 const messageDiv = document.createElement('div');
+                 messageDiv.className = `message flex ${isUser ? 'justify-end' : 'justify-start'}`;
+
+                 const bubble = document.createElement('div');
+                 bubble.className = `max-w-[70%] px-4 py-3 rounded-2xl ${
+                     isUser
+                         ? 'bg-indigo-600 text-white'
+                         : 'bg-slate-100 text-slate-800'
+                 }`;
+                 bubble.textContent = content;
+
+                 messageDiv.appendChild(bubble);
+                 messagesDiv.appendChild(messageDiv);
+                 messagesDiv.scrollTop = messagesDiv.scrollHeight;
+
+                 return bubble;
+             }
+
+             function addTypingIndicator() {
+                 const messageDiv = document.createElement('div');
+                 messageDiv.className = 'message flex justify-start';
+                 messageDiv.id = 'typing-indicator';
+
+                 const bubble = document.createElement('div');
+                 bubble.className = 'max-w-[70%] px-4 py-3 rounded-2xl bg-slate-100';
+                 bubble.innerHTML = '<div class="typing-indicator"><span></span><span></span><span></span></div>';
+
+                 messageDiv.appendChild(bubble);
+                 messagesDiv.appendChild(messageDiv);
+                 messagesDiv.scrollTop = messagesDiv.scrollHeight;
+             }
+
+             function removeTypingIndicator() {
+                 const indicator = document.getElementById('typing-indicator');
+                 if (indicator) {
+                     indicator.remove();
+                 }
+             }
+
+             async function sendMessage() {
+                 const prompt = promptInput.value.trim();
+                 const apiKey = apiKeyInput.value.trim();
+                 const language = languageSelect.value;
+
+                 if (!prompt) return;
+                 if (!apiKey) {
+                     alert('Please enter your API key');
+                     return;
+                 }
+
+                 // Add user message
+                 addMessage(prompt, true);
+                 promptInput.value = '';
+
+                 // Disable send button
+                 sendBtn.disabled = true;
+                 addTypingIndicator();
+
+                 try {
+                     const response = await fetch('/api/chat', {
+                         method: 'POST',
+                         headers: {
+                             'Content-Type': 'application/json',
+                             'X-API-Key': apiKey
+                         },
+                         body: JSON.stringify({
+                             prompt: prompt,
+                             language: language,
+                             stream: true
+                         })
+                     });
+
+                     if (!response.ok) {
+                         throw new Error(`HTTP ${response.status}: ${await response.text()}`);
+                     }
+
+                     removeTypingIndicator();
+                     const bubble = addMessage('', false);
+
+                     // Read stream
+                     const reader = response.body.getReader();
+                     const decoder = new TextDecoder();
+                     let fullResponse = '';
+
+                     while (true) {
+                         const { done, value } = await reader.read();
+                         if (done) break;
+
+                         const chunk = decoder.decode(value);
+                         const lines = chunk.split('\\n');
+
+                         for (const line of lines) {
+                             if (line.startsWith('data: ')) {
+                                 const data = line.slice(6);
+                                 if (data === '[DONE]') break;
+
+                                 try {
+                                     const json = JSON.parse(data);
+                                     if (json.token) {
+                                         fullResponse += json.token;
+                                         bubble.textContent = fullResponse;
+                                         messagesDiv.scrollTop = messagesDiv.scrollHeight;
+                                     }
+                                 } catch (e) {
+                                     console.error('Parse error:', e);
+                                 }
+                             }
+                         }
+                     }
+
+                 } catch (error) {
+                     removeTypingIndicator();
+                     addMessage(`Error: ${error.message}`, false);
+                 } finally {
+                     sendBtn.disabled = false;
+                     promptInput.focus();
+                 }
+             }
+         </script>
+     </body>
+     </html>
+     """
+     return HTMLResponse(content=html_content)
+
+ @app.get("/health", response_model=HealthResponse)
+ async def health_check():
+     """Health check endpoint"""
+     return HealthResponse(
+         status="healthy",
+         model_loaded=model_manager.model is not None,
+         timestamp=datetime.now().isoformat()
+     )
+
+ @app.post("/api/chat")
+ async def chat(
+     request: ChatRequest,
+     api_key: str = Header(..., alias="X-API-Key")
+ ):
+     """
+     Chat endpoint with streaming support
+
+     Requires X-API-Key header for authentication
+     """
+     # Verify API key
+     await verify_api_key(api_key)
+
+     # Check if model is loaded
+     if model_manager.model is None:
+         raise HTTPException(
+             status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
+             detail="Model not loaded yet, please try again later"
+         )
+
+     # Get parameters (explicit None checks so temperature=0.0 is honored)
+     temperature = request.temperature if request.temperature is not None else config.TEMPERATURE
+     max_tokens = request.max_tokens if request.max_tokens is not None else 512
+
+     try:
+         # Check cache for non-streaming requests
+         if not request.stream:
+             cached_response = cache.get(request.prompt, request.language, temperature)
+             if cached_response:
+                 return ChatResponse(
+                     response=cached_response,
+                     language=request.language,
+                     model=config.MODEL_NAME,
+                     timestamp=datetime.now().isoformat(),
+                     cached=True
+                 )
+
+         # Streaming response
+         if request.stream:
+             async def generate():
+                 import json  # local import keeps the SSE framing self-contained
+                 full_response = ""
+                 async for token in model_manager.generate_stream(
+                     request.prompt,
+                     request.language,
+                     temperature,
+                     max_tokens
+                 ):
+                     full_response += token
+                     # Emit valid JSON so the frontend's JSON.parse succeeds;
+                     # json.dumps also escapes quotes and newlines in the token
+                     yield f"data: {json.dumps({'token': token})}\n\n"
+
+                 # Cache complete response
+                 cache.set(request.prompt, request.language, temperature, full_response)
+                 yield "data: [DONE]\n\n"
+
+             return StreamingResponse(
+                 generate(),
+                 media_type="text/event-stream"
+             )
+
+         # Non-streaming response
+         else:
+             response = await model_manager.generate(
+                 request.prompt,
+                 request.language,
+                 temperature,
+                 max_tokens
+             )
+
+             # Cache response
+             cache.set(request.prompt, request.language, temperature, response)
+
+             return ChatResponse(
+                 response=response,
+                 language=request.language,
+                 model=config.MODEL_NAME,
+                 timestamp=datetime.now().isoformat(),
+                 cached=False
+             )
+
+     except Exception as e:
+         logger.error(f"Chat endpoint error: {str(e)}", exc_info=True)
+         raise HTTPException(
+             status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+             detail=str(e)
+         )
+
+ # ============================================================================
+ # MAIN
+ # ============================================================================
+
+ if __name__ == "__main__":
+     logger.info(f"Starting server on {config.HOST}:{config.PORT}")
+     uvicorn.run(
+         app,
+         host=config.HOST,
+         port=config.PORT,
+         log_level="info"
+     )
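
The `ResponseCache` above bounds memory by evicting the oldest entry when full. That policy can be exercised standalone; the following is a minimal sketch mirroring the class in app.py, with the logging trimmed:

```python
import hashlib
from datetime import datetime
from typing import Dict, Optional, Tuple


class ResponseCache:
    """Size-bounded cache keyed on (prompt, language, temperature);
    evicts the oldest entry once max_size is reached."""

    def __init__(self, max_size: int = 100):
        self.cache: Dict[str, Tuple[str, datetime]] = {}
        self.max_size = max_size

    def _key(self, prompt: str, language: str, temperature: float) -> str:
        # MD5 is used as a compact cache key, not for security
        return hashlib.md5(f"{prompt}:{language}:{temperature}".encode()).hexdigest()

    def get(self, prompt: str, language: str, temperature: float) -> Optional[str]:
        entry = self.cache.get(self._key(prompt, language, temperature))
        return entry[0] if entry else None

    def set(self, prompt: str, language: str, temperature: float, response: str):
        if len(self.cache) >= self.max_size:
            # Evict the entry with the earliest insertion timestamp
            oldest = min(self.cache, key=lambda k: self.cache[k][1])
            del self.cache[oldest]
        self.cache[self._key(prompt, language, temperature)] = (response, datetime.now())


cache = ResponseCache(max_size=2)
cache.set("a", "en", 0.7, "A")
cache.set("b", "en", 0.7, "B")
cache.set("c", "en", 0.7, "C")  # cache full: the oldest entry ("a") is evicted
```

Note this is insertion-time eviction (FIFO), not true LRU: `get` does not refresh an entry's timestamp.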
requirements.txt ADDED
@@ -0,0 +1,18 @@
+ # FastAPI and Server
+ fastapi==0.115.0
+ uvicorn[standard]==0.32.0
+ python-multipart==0.0.12
+ pydantic==2.10.0
+ pydantic-settings==2.6.0
+
+ # AI/ML Libraries
+ torch==2.5.0
+ transformers==4.46.0
+ accelerate==1.1.0
+ bitsandbytes==0.44.0
+ sentencepiece==0.2.0
+ protobuf==5.28.0
+
+ # Utilities
+ python-dotenv==1.0.1
+ aiofiles==24.1.0