# System Patterns
## Architecture Overview
```
┌─────────────────────────────────────────┐
│               FastAPI App               │
├─────────────────────────────────────────┤
│ Routes:                                 │
│  • GET / (Welcome)                      │
│  • POST /download (Model Download)      │
│  • POST /v1/chat/completions (Chat)     │
├─────────────────────────────────────────┤
│ Global State:                           │
│  • pipe (Pipeline)                      │
│  • tokenizer (Tokenizer)                │
│  • model_name (Current Model)           │
├─────────────────────────────────────────┤
│ Startup Event:                          │
│  • Load .env                            │
│  • Initialize default model             │
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│              Utils Modules              │
├─────────────────────────────────────────┤
│ utils/model.py:                         │
│  • check_model() - Verify model exists  │
│  • download_model() - Download model    │
│  • initialize_pipeline() - Setup model  │
│  • DownloadRequest - Pydantic model     │
├─────────────────────────────────────────┤
│ utils/chat_request.py:                  │
│  • ChatRequest - Request validation     │
├─────────────────────────────────────────┤
│ utils/chat_response.py:                 │
│  • create_chat_response() - Generate    │
│  • convert_json_format() - Parse output │
│  • ChatResponse/ChatChoice/ChatUsage    │
└─────────────────────────────────────────┘
```
## Data Flow Patterns
### 1. Application Startup
```
.env → load_dotenv() → os.getenv("DEFAULT_MODEL_NAME")
    ↓
initialize_pipeline(model_name)
    ↓
check_model() → verify cache exists
    ↓
AutoTokenizer + AutoModelForCausalLM
    ↓
pipeline("text-generation")
    ↓
Global: pipe, tokenizer, model_name
```
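The startup sequence above can be sketched in plain Python. This is not the app's actual code: `initialize_pipeline` is stubbed (the real app builds a transformers `pipeline`), and the `"gpt2"` fallback default is purely illustrative.

```python
import os

# Module-level globals, as in the real app (utils/model.py populates these).
pipe = None
tokenizer = None
model_name = None


def initialize_pipeline(name):
    """Stub: the real code loads AutoTokenizer/AutoModelForCausalLM
    and wraps them in pipeline("text-generation")."""
    global pipe, tokenizer, model_name
    tokenizer = f"tokenizer:{name}"  # stand-in object
    pipe = f"pipeline:{name}"        # stand-in object
    model_name = name


def startup():
    """Mirror of the startup event: read the .env default, try to load,
    but never crash the app if the model is unavailable."""
    default = os.getenv("DEFAULT_MODEL_NAME", "gpt2")  # illustrative fallback
    try:
        initialize_pipeline(default)
    except Exception as exc:
        # Graceful degradation (see Key Design Decisions #2).
        print(f"Startup model load failed: {exc}")


startup()
```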
### 2. Chat Request Flow
```
POST /v1/chat/completions
    ↓
ChatRequest (validation)
    ↓
Check model_name match
    ↓
create_chat_response(request, pipe, tokenizer)
    ↓
pipe(messages, max_new_tokens)
    ↓
convert_json_format() → clean output
    ↓
Calculate tokens (tokenizer.encode)
    ↓
ChatResponse (Pydantic)
```
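A hedged sketch of this flow, with the pipeline and tokenizer replaced by trivial stubs: whitespace splitting stands in for `tokenizer.encode`, and a canned string stands in for `pipe(...)`. The real app validates with Pydantic models (`ChatRequest`/`ChatResponse`); here the response is a plain dict in the same shape.

```python
import time
import uuid


def stub_pipe(prompt, max_new_tokens=16):
    """Stand-in for pipe(messages, max_new_tokens)."""
    return "Hello! How can I help?"


def encode(text):
    """Stand-in for tokenizer.encode: whitespace tokens only."""
    return text.split()


def create_chat_response(prompt):
    completion = stub_pipe(prompt)
    # Token usage is computed with the (stub) tokenizer, as in the real flow.
    prompt_tokens = len(encode(prompt))
    completion_tokens = len(encode(completion))
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:8]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": completion},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


resp = create_chat_response("Say hello")
```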
### 3. Download Flow
```
POST /download
    ↓
download_model(model_name)
    ↓
AutoTokenizer.from_pretrained(cache_dir)
AutoModelForCausalLM.from_pretrained(cache_dir)
    ↓
initialize_pipeline(model_name)
    ↓
Update global: pipe, tokenizer, model_name
    ↓
Return success + loaded status
```
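The download flow reduces to the following shape. `download_model` is a no-op stub here (the real one calls `from_pretrained` with a cache directory), and the model name is a hypothetical placeholder.

```python
# Module-level globals updated by the download flow.
pipe = None
model_name = None


def download_model(name):
    """Stub: the real code runs AutoTokenizer.from_pretrained and
    AutoModelForCausalLM.from_pretrained against the cache dir."""
    pass


def initialize_pipeline(name):
    global pipe, model_name
    pipe = f"pipeline:{name}"  # stand-in for the real pipeline object
    model_name = name


def download_endpoint(name):
    """Mirror of POST /download: fetch, re-initialize, report status."""
    download_model(name)
    initialize_pipeline(name)
    return {"status": "success", "model": model_name, "loaded": pipe is not None}


result = download_endpoint("some-org/some-model")  # hypothetical model name
```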
## Key Design Decisions
### 1. Global State Management
- **Why**: HTTP request handling is stateless, but models are expensive to load
- **Solution**: Global variables for pipe/tokenizer/model_name
- **Trade-off**: Single model at a time, but efficient
### 2. Lazy Initialization with Fallback
- **Why**: Model might not exist on startup
- **Solution**: Startup event tries to load, but doesn't fail
- **Trade-off**: Graceful degradation vs. guaranteed availability
### 3. Model Switching
- **Why**: Users may want different models
- **Solution**: Check request.model vs. current model_name
- **Trade-off**: Re-initialization overhead vs. flexibility
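The switch check itself is a one-line comparison. This sketch stubs `initialize_pipeline` to record reloads so the cost of a mismatch is visible; function and variable names follow the document, the rest is illustrative.

```python
model_name = "model-a"  # hypothetical currently loaded model
reloads = []            # records each (costly) re-initialization


def initialize_pipeline(name):
    """Stub: the real code rebuilds tokenizer + pipeline from scratch."""
    global model_name
    reloads.append(name)
    model_name = name


def ensure_model(requested):
    """Re-initialize only when request.model differs from model_name.
    Returns True if a reload was needed."""
    if requested != model_name:
        initialize_pipeline(requested)
        return True
    return False


ensure_model("model-a")  # no-op: already loaded
ensure_model("model-b")  # triggers re-initialization
```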
### 4. Error Handling
- **Why**: Model operations can fail in multiple ways
- **Solution**: HTTPException for client errors, try/except for internal
- **Trade-off**: Clear API vs. implementation complexity
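A stdlib sketch of this split, with a hypothetical `HTTPError` class standing in for FastAPI's `HTTPException`: client-visible problems get an explicit status code, while internal failures are caught and wrapped as server errors.

```python
class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException: status code + detail."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def chat_endpoint(requested_model, loaded_model):
    # Client error: no model loaded yet, surfaced with an explicit status.
    if loaded_model is None:
        raise HTTPError(503, "No model loaded")
    # Internal errors: caught and re-raised as a server-side HTTP error.
    try:
        return {"model": requested_model, "ok": True}  # stub for generation
    except Exception as exc:
        raise HTTPError(500, f"Generation failed: {exc}")
```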
### 5. Environment Configuration
- **Why**: Different deployments need different defaults
- **Solution**: .env file with fallback
- **Trade-off**: External config vs. hardcoded values
## Security Considerations
- ✅ No hardcoded credentials in code
- ✅ HUGGINGFACE_TOKEN from environment
- ✅ Input validation via Pydantic
- ✅ No arbitrary code execution from user input
## Performance Patterns
- ✅ Model loaded once at startup
- ✅ Tokenizer reused across requests
- ✅ Token counting with actual tokenizer
- ✅ Async route handlers for concurrency