System Patterns
Architecture Overview
┌───────────────────────────────────────────┐
│ FastAPI App                               │
├───────────────────────────────────────────┤
│ Routes:                                   │
│  • GET / (Welcome)                        │
│  • POST /download (Model Download)        │
│  • POST /v1/chat/completions (Chat)       │
├───────────────────────────────────────────┤
│ Global State:                             │
│  • pipe (Pipeline)                        │
│  • tokenizer (Tokenizer)                  │
│  • model_name (Current Model)             │
├───────────────────────────────────────────┤
│ Startup Event:                            │
│  • Load .env                              │
│  • Initialize default model               │
└───────────────────────────────────────────┘
                      │
                      ▼
┌───────────────────────────────────────────┐
│ Utils Modules                             │
├───────────────────────────────────────────┤
│ utils/model.py:                           │
│  • check_model() - Verify model exists    │
│  • download_model() - Download model      │
│  • initialize_pipeline() - Setup model    │
│  • DownloadRequest - Pydantic model       │
├───────────────────────────────────────────┤
│ utils/chat_request.py:                    │
│  • ChatRequest - Request validation       │
├───────────────────────────────────────────┤
│ utils/chat_response.py:                   │
│  • create_chat_response() - Generate      │
│  • convert_json_format() - Parse output   │
│  • ChatResponse/ChatChoice/ChatUsage      │
└───────────────────────────────────────────┘
Data Flow Patterns
1. Application Startup
.env → load_dotenv() → os.getenv("DEFAULT_MODEL_NAME")
        ↓
initialize_pipeline(model_name)
        ↓
check_model() → verify cache exists
        ↓
AutoTokenizer + AutoModelForCausalLM
        ↓
pipeline("text-generation")
        ↓
Global: pipe, tokenizer, model_name
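The startup sequence above can be expressed as a small function. This is a sketch with the transformers calls abstracted behind injected callables standing in for `AutoTokenizer.from_pretrained`, `AutoModelForCausalLM.from_pretrained`, and `pipeline("text-generation")`; the fallback name is hypothetical.

```python
import os


def startup(load_tokenizer, load_model, make_pipeline, fallback="default-model"):
    """Mirror the startup flow: env var -> tokenizer/model load -> pipeline.

    The three callables stand in for the transformers loaders so the
    sequencing is visible without the heavy dependencies.
    """
    model_name = os.getenv("DEFAULT_MODEL_NAME", fallback)
    tokenizer = load_tokenizer(model_name)
    model = load_model(model_name)
    pipe = make_pipeline(model, tokenizer)
    # The real app assigns these to module-level globals.
    return pipe, tokenizer, model_name
```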
2. Chat Request Flow
POST /v1/chat/completions
        ↓
ChatRequest (validation)
        ↓
Check model_name match
        ↓
create_chat_response(request, pipe, tokenizer)
        ↓
pipe(messages, max_new_tokens)
        ↓
convert_json_format() → clean output
        ↓
Calculate tokens (tokenizer.encode)
        ↓
ChatResponse (Pydantic)
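The generation-and-counting steps can be sketched as follows. The response shape mirrors the OpenAI-style chat schema, but the exact field names here are assumptions; `pipe` and `tokenizer` are stand-ins for the real pipeline and tokenizer objects.

```python
def build_chat_response(messages, pipe, tokenizer, max_new_tokens=256):
    """Generate a reply and count tokens with the same tokenizer the model uses."""
    completion = pipe(messages, max_new_tokens=max_new_tokens)
    # Token usage is computed from real encodings, not a character heuristic.
    prompt_tokens = sum(len(tokenizer.encode(m["content"])) for m in messages)
    completion_tokens = len(tokenizer.encode(completion))
    return {
        "choices": [
            {"index": 0,
             "message": {"role": "assistant", "content": completion}}
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```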
3. Download Flow
POST /download
        ↓
download_model(model_name)
        ↓
AutoTokenizer.from_pretrained(cache_dir)
AutoModelForCausalLM.from_pretrained(cache_dir)
        ↓
initialize_pipeline(model_name)
        ↓
Update global: pipe, tokenizer, model_name
        ↓
Return success + loaded status
Key Design Decisions
1. Global State Management
- Why: FastAPI is stateless, but models are expensive to load
- Solution: Global variables for pipe/tokenizer/model_name
- Trade-off: Single model at a time, but efficient
2. Lazy Initialization with Fallback
- Why: Model might not exist on startup
- Solution: Startup event tries to load, but doesn't fail
- Trade-off: Graceful degradation vs. guaranteed availability
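The graceful-degradation pattern is just a guarded load. A minimal sketch, with `initialize` standing in for the real pipeline setup:

```python
def safe_startup(initialize, name):
    """Attempt the default-model load; on failure keep serving with no
    model loaded instead of crashing the app at startup."""
    try:
        return initialize(name)
    except Exception as exc:
        # Log and continue: /download can still load a model later.
        print(f"startup load of {name!r} failed: {exc}; continuing without a model")
        return None
```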
3. Model Switching
- Why: Users may want different models
- Solution: Check request.model vs. current model_name
- Trade-off: Re-initialization overhead vs. flexibility
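The switching check can be sketched as a compare-then-reinitialize helper; `initialize` is a stand-in for the real pipeline setup, and the state dict mirrors the module globals.

```python
def ensure_model(state, requested, initialize):
    """Reinitialize only when the requested model differs from the loaded one."""
    if state["model_name"] != requested:
        # Pay the reload cost once; repeat requests for the same model are free.
        state["pipe"], state["tokenizer"] = initialize(requested)
        state["model_name"] = requested
    return state
```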
4. Error Handling
- Why: Model operations can fail in multiple ways
- Solution: HTTPException for client errors, try/except for internal
- Trade-off: Clear API vs. implementation complexity
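The two-tier error pattern looks roughly like this. `HTTPError` below is a minimal stand-in for FastAPI's `HTTPException` (status code plus detail); the status codes and messages are illustrative.

```python
class HTTPError(Exception):
    """Minimal stand-in for FastAPI's HTTPException."""
    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def generate(pipe, prompt):
    if pipe is None:
        # Client-facing error: no model has been loaded yet.
        raise HTTPError(503, "no model loaded; POST /download first")
    try:
        return pipe(prompt)
    except Exception as exc:
        # Internal failure surfaced as a 500 without leaking internals.
        raise HTTPError(500, "generation failed") from exc
```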
5. Environment Configuration
- Why: Different deployments need different defaults
- Solution: .env file with fallback
- Trade-off: External config vs. hardcoded values
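The .env-with-fallback pattern can be sketched with a tiny stdlib-only reader (python-dotenv's `load_dotenv` does the same job more robustly); the `"fallback-model"` default is hypothetical.

```python
import os


def load_env_file(path=".env"):
    """Tiny .env reader. Existing environment variables win over file values."""
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no file: fall back to the process environment / defaults


def default_model():
    # Hardcoded fallback used only when nothing else is configured.
    return os.getenv("DEFAULT_MODEL_NAME", "fallback-model")
```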
Security Considerations
- ✅ No hardcoded credentials in code
- ✅ HUGGINGFACE_TOKEN from environment
- ✅ Input validation via Pydantic
- ✅ No arbitrary code execution from user input
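A hedged sketch of what the Pydantic validation layer looks like; the field names are assumptions modeled on the OpenAI-style chat schema, not the project's exact models.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str
    messages: List[Message]
    max_tokens: int = Field(default=256, ge=1)  # reject non-positive budgets
```

Because validation happens at the model boundary, malformed payloads are rejected before any model code runs; FastAPI converts the `ValidationError` into a 422 response automatically.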
Performance Patterns
- ✅ Model loaded once at startup
- ✅ Tokenizer reused across requests
- ✅ Token counting with actual tokenizer
- ✅ Async route handlers for concurrency
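The load-once behavior can be mimicked with a one-slot cache. A sketch only: the real app keeps module globals rather than `lru_cache`, and the loader body here is a placeholder for the expensive pipeline construction.

```python
from functools import lru_cache

LOAD_COUNT = {"n": 0}  # instrumentation to show the load happens once


@lru_cache(maxsize=1)
def get_pipeline(name):
    """Stand-in for pipeline construction: runs once per cached model name."""
    LOAD_COUNT["n"] += 1
    return f"pipeline:{name}"
```

`maxsize=1` matches the single-model trade-off above: requesting a different model evicts the old one rather than holding two in memory.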