# System Patterns

## Architecture Overview

```
┌─────────────────────────────────────────┐
│               FastAPI App               │
├─────────────────────────────────────────┤
│ Routes:                                 │
│ • GET /             (Welcome)           │
│ • POST /download    (Model Download)    │
│ • POST /v1/chat/completions (Chat)      │
├─────────────────────────────────────────┤
│ Global State:                           │
│ • pipe (Pipeline)                       │
│ • tokenizer (Tokenizer)                 │
│ • model_name (Current Model)            │
├─────────────────────────────────────────┤
│ Startup Event:                          │
│ • Load .env                             │
│ • Initialize default model              │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│              Utils Modules              │
├─────────────────────────────────────────┤
│ utils/model.py:                         │
│ • check_model() - Verify model exists   │
│ • download_model() - Download model     │
│ • initialize_pipeline() - Setup model   │
│ • DownloadRequest - Pydantic model      │
├─────────────────────────────────────────┤
│ utils/chat_request.py:                  │
│ • ChatRequest - Request validation      │
├─────────────────────────────────────────┤
│ utils/chat_response.py:                 │
│ • create_chat_response() - Generate     │
│ • convert_json_format() - Parse output  │
│ • ChatResponse/ChatChoice/ChatUsage     │
└─────────────────────────────────────────┘
```

## Data Flow Patterns

### 1. Application Startup

```
.env → load_dotenv() → os.getenv("DEFAULT_MODEL_NAME")
          ↓
initialize_pipeline(model_name)
          ↓
check_model() → verify cache exists
          ↓
AutoTokenizer + AutoModelForCausalLM
          ↓
pipeline("text-generation")
          ↓
Global: pipe, tokenizer, model_name
```

### 2. Chat Request Flow

```
POST /v1/chat/completions
          ↓
ChatRequest (validation)
          ↓
Check model_name match
          ↓
create_chat_response(request, pipe, tokenizer)
          ↓
pipe(messages, max_new_tokens)
          ↓
convert_json_format() → clean output
          ↓
Calculate tokens (tokenizer.encode)
          ↓
ChatResponse (Pydantic)
```
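The tail of the chat flow — assembling the final response and computing token usage — can be sketched as below. The field layout follows the OpenAI-style chat-completion shape that the `/v1/chat/completions` route implies; the exact signature of the real `create_chat_response()` is an assumption, and the `encode` callable stands in for `tokenizer.encode`.

```python
import time
import uuid

def create_chat_response(messages, completion_text, model, encode):
    """Assemble an OpenAI-style chat completion dict (hedged sketch).

    `encode` stands in for tokenizer.encode; in the real app the token
    counts come from the actual tokenizer, not an approximation.
    """
    prompt_text = " ".join(m["content"] for m in messages)
    prompt_tokens = len(encode(prompt_text))
    completion_tokens = len(encode(completion_text))
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": completion_text},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

In the actual service these dicts are modeled as the `ChatResponse`/`ChatChoice`/`ChatUsage` Pydantic classes, which gives the same structure plus validation for free.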
### 3. Download Flow

```
POST /download
          ↓
download_model(model_name)
          ↓
AutoTokenizer.from_pretrained(cache_dir)
AutoModelForCausalLM.from_pretrained(cache_dir)
          ↓
initialize_pipeline(model_name)
          ↓
Update global: pipe, tokenizer, model_name
          ↓
Return success + loaded status
```

## Key Design Decisions

### 1. Global State Management

- **Why**: HTTP requests are stateless, but models are far too expensive to load per request
- **Solution**: Module-level globals for pipe/tokenizer/model_name
- **Trade-off**: Only one model loaded at a time, but memory- and time-efficient

### 2. Lazy Initialization with Fallback

- **Why**: The default model might not be cached on startup
- **Solution**: The startup event tries to load it but does not fail the app if it is missing
- **Trade-off**: Graceful degradation vs. guaranteed availability

### 3. Model Switching

- **Why**: Users may want different models
- **Solution**: Compare request.model against the current model_name and re-initialize on mismatch
- **Trade-off**: Re-initialization overhead vs. flexibility

### 4. Error Handling

- **Why**: Model operations can fail in multiple ways
- **Solution**: HTTPException for client errors, try/except for internal failures
- **Trade-off**: Clear API errors vs. implementation complexity

### 5. Environment Configuration

- **Why**: Different deployments need different defaults
- **Solution**: .env file with a hardcoded fallback
- **Trade-off**: External config vs. hardcoded values

## Security Considerations

- ✅ No hardcoded credentials in code
- ✅ HUGGINGFACE_TOKEN read from the environment
- ✅ Input validation via Pydantic
- ✅ No arbitrary code execution from user input

## Performance Patterns

- ✅ Model loaded once at startup
- ✅ Tokenizer reused across requests
- ✅ Token counting with the actual tokenizer
- ✅ Async route handlers for concurrency