# System Patterns
## Architecture Overview
```
┌─────────────────────────────────────────┐
│               FastAPI App               │
├─────────────────────────────────────────┤
│ Routes:                                 │
│ • GET / (Welcome)                       │
│ • POST /download (Model Download)       │
│ • POST /v1/chat/completions (Chat)      │
├─────────────────────────────────────────┤
│ Global State:                           │
│ • pipe (Pipeline)                       │
│ • tokenizer (Tokenizer)                 │
│ • model_name (Current Model)            │
├─────────────────────────────────────────┤
│ Startup Event:                          │
│ • Load .env                             │
│ • Initialize default model              │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│              Utils Modules              │
├─────────────────────────────────────────┤
│ utils/model.py:                         │
│ • check_model() - Verify model exists   │
│ • download_model() - Download model     │
│ • initialize_pipeline() - Setup model   │
│ • DownloadRequest - Pydantic model      │
├─────────────────────────────────────────┤
│ utils/chat_request.py:                  │
│ • ChatRequest - Request validation      │
├─────────────────────────────────────────┤
│ utils/chat_response.py:                 │
│ • create_chat_response() - Generate     │
│ • convert_json_format() - Parse output  │
│ • ChatResponse/ChatChoice/ChatUsage     │
└─────────────────────────────────────────┘
```
## Data Flow Patterns
### 1. Application Startup
```
.env → load_dotenv() → os.getenv("DEFAULT_MODEL_NAME")
    ↓
initialize_pipeline(model_name)
    ↓
check_model() → verify cache exists
    ↓
AutoTokenizer + AutoModelForCausalLM
    ↓
pipeline("text-generation")
    ↓
Global: pipe, tokenizer, model_name
```
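The `check_model()` step in the startup flow verifies that a cached copy exists before anything is loaded. A minimal sketch of such a check, assuming the cache directory follows the Hugging Face hub layout (`models--<org>--<name>/snapshots/<revision>/`); the `cache_dir` parameter and the function body are illustrative, and the project's actual `check_model()` may differ:

```python
from pathlib import Path


def check_model(model_name: str, cache_dir: str) -> bool:
    """Return True if a cached snapshot of model_name appears to exist.

    Assumption: cache_dir uses the Hugging Face hub layout, where
    "org/name" is stored under "models--org--name/snapshots/<revision>/".
    """
    repo_dir = Path(cache_dir) / ("models--" + model_name.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return False
    # At least one non-empty snapshot directory counts as "cached".
    return any(p.is_dir() and any(p.iterdir()) for p in snapshots.iterdir())
```

Checking the directory first keeps the startup path cheap: the expensive `from_pretrained` calls only run when there is something to load.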
### 2. Chat Request Flow
```
POST /v1/chat/completions
    ↓
ChatRequest (validation)
    ↓
Check model_name match
    ↓
create_chat_response(request, pipe, tokenizer)
    ↓
pipe(messages, max_new_tokens)
    ↓
convert_json_format() → clean output
    ↓
Calculate tokens (tokenizer.encode)
    ↓
ChatResponse (Pydantic)
```
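The last three steps of the chat flow (clean the output, count tokens, build the response) can be sketched as below. This is an illustration under stated assumptions, not the project's code: `extract_reply` and `count_usage` are hypothetical names, and the pipeline output is assumed to be the chat-style shape in which `generated_text` holds the full message list with the new assistant turn last.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def extract_reply(pipe_output: List[dict]) -> str:
    """Pull the assistant text out of a text-generation pipeline result.

    Assumption: the pipeline was called with chat messages, so
    generated_text is the message list with the assistant turn appended.
    """
    messages = pipe_output[0]["generated_text"]
    return messages[-1]["content"].strip()


def count_usage(prompt: List[Message], reply: str,
                encode: Callable[[str], List[int]]) -> Dict[str, int]:
    """Count tokens with the model's own tokenizer, passed in as encode."""
    prompt_tokens = sum(len(encode(m["content"])) for m in prompt)
    completion_tokens = len(encode(reply))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```

With a loaded tokenizer, `encode` would be `tokenizer.encode`; passing it in as a callable keeps the counting logic testable without loading a model.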
### 3. Download Flow
```
POST /download
    ↓
download_model(model_name)
    ↓
AutoTokenizer.from_pretrained(cache_dir)
AutoModelForCausalLM.from_pretrained(cache_dir)
    ↓
initialize_pipeline(model_name)
    ↓
Update global: pipe, tokenizer, model_name
    ↓
Return success + loaded status
```
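The download flow reports two things: whether the download succeeded and whether the model is now loaded. A rough sketch of that split follows. The `download` and `initialize` callables stand in for `download_model()` and `initialize_pipeline()`; the `(pipe, tokenizer)` return shape, the `state` dict, and the response fields are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Stand-in for the three globals named in the flow above.
state: Dict[str, object] = {"pipe": None, "tokenizer": None, "model_name": None}


def download_and_activate(model_name: str,
                          download: Callable[[str], None],
                          initialize: Callable[[str], Tuple[object, object]]
                          ) -> Dict[str, object]:
    """Download a model, then try to make it the active one."""
    download(model_name)  # an exception here propagates: nothing was fetched
    try:
        pipe, tokenizer = initialize(model_name)
    except Exception as exc:
        # Downloaded but not loadable: keep the previous model active.
        return {"status": "success", "model": model_name,
                "loaded": False, "detail": str(exc)}
    state.update(pipe=pipe, tokenizer=tokenizer, model_name=model_name)
    return {"status": "success", "model": model_name, "loaded": True}
```

Updating `state` only after `initialize` returns means a failed load never leaves the globals half-replaced.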
## Key Design Decisions
### 1. Global State Management
- **Why**: FastAPI handlers keep no state between requests, but models are expensive to load
- **Solution**: Module-level global variables for pipe/tokenizer/model_name
- **Trade-off**: Only one model is loaded at a time, but requests reuse it without reloading
### 2. Lazy Initialization with Fallback
- **Why**: The default model might not be cached at startup
- **Solution**: The startup event attempts to load the model, but a failed load does not stop the app from starting
- **Trade-off**: Graceful degradation vs. guaranteed availability
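A minimal sketch of the "try to load, but don't fail" startup behaviour. The function names `on_startup` and `is_ready` are hypothetical, and `initialize` stands in for `initialize_pipeline()`, assumed here to return a `(pipe, tokenizer)` pair:

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("startup")

pipe = None
tokenizer = None
model_name: Optional[str] = None


def on_startup(default_model: str,
               initialize: Callable[[str], Tuple[object, object]]) -> bool:
    """Try to load the default model; never raise, so the app still starts."""
    global pipe, tokenizer, model_name
    try:
        pipe, tokenizer = initialize(default_model)
        model_name = default_model
        return True
    except Exception:
        # Log the traceback and continue with no model loaded.
        logger.exception("Could not load %s; the app starts without a model",
                         default_model)
        return False


def is_ready() -> bool:
    """True once a pipeline is available for chat requests."""
    return pipe is not None
```

A chat handler can call `is_ready()` first and return a clear error while no model is loaded, which is the degraded mode the trade-off refers to.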
### 3. Model Switching
- **Why**: Users may want different models
- **Solution**: Compare request.model with the current model_name and re-initialize on a mismatch
- **Trade-off**: Re-initialization overhead vs. flexibility
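The comparison between the requested model and the current one can be sketched as follows. To stay self-contained, the sketch wraps the three globals in a hypothetical `ModelHolder` class rather than using module-level variables; `initialize` again stands in for `initialize_pipeline()`.

```python
from typing import Callable, Optional, Tuple


class ModelHolder:
    """Holds the single active model and swaps it on demand."""

    def __init__(self, initialize: Callable[[str], Tuple[object, object]]):
        self._initialize = initialize
        self.model_name: Optional[str] = None
        self.pipe = None
        self.tokenizer = None
        self.reloads = 0  # counts re-initializations to make the overhead visible

    def ensure(self, requested: str):
        """Return the pipeline for `requested`, reloading only on a mismatch."""
        if requested != self.model_name:
            self.pipe, self.tokenizer = self._initialize(requested)
            self.model_name = requested
            self.reloads += 1
        return self.pipe
```

Repeated requests for the same model hit the fast path; only a change of model pays the re-initialization cost.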
### 4. Error Handling
- **Why**: Model operations can fail in multiple ways
- **Solution**: HTTPException for client errors, try/except for internal failures
- **Trade-off**: Clear API vs. implementation complexity
### 5. Environment Configuration
- **Why**: Different deployments need different defaults
- **Solution**: .env file, with a fallback when a value is not set
- **Trade-off**: External config vs. hardcoded values
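A small sketch of reading the default model name with a fallback. The fallback value `example-org/example-model` and the helper name `resolve_default_model` are placeholders, not values from the project:

```python
import os

# Hypothetical fallback used when DEFAULT_MODEL_NAME is not set anywhere.
FALLBACK_MODEL_NAME = "example-org/example-model"


def resolve_default_model(env=os.environ) -> str:
    """Return DEFAULT_MODEL_NAME from the environment, or the fallback."""
    value = env.get("DEFAULT_MODEL_NAME", "").strip()
    return value or FALLBACK_MODEL_NAME
```

In the app, `load_dotenv()` would run first so that entries from `.env` are present in `os.environ` before this lookup. Treating a blank value the same as a missing one avoids trying to load a model with an empty name.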
## Security Considerations
- ✅ No hardcoded credentials in code
- ✅ HUGGINGFACE_TOKEN from environment
- ✅ Input validation via Pydantic
- ✅ No arbitrary code execution from user input
## Performance Patterns
- ✅ Model loaded once at startup, not per request
- ✅ Tokenizer reused across requests
- ✅ Token counting with the model's actual tokenizer
- ✅ Async route handlers for concurrency