# System Patterns

## Architecture Overview

```
┌─────────────────────────────────────────┐
│               FastAPI App               │
├─────────────────────────────────────────┤
│ Routes:                                 │
│ • GET /             (Welcome)           │
│ • POST /download    (Model Download)    │
│ • POST /v1/chat/completions (Chat)      │
├─────────────────────────────────────────┤
│ Global State:                           │
│ • pipe (Pipeline)                       │
│ • tokenizer (Tokenizer)                 │
│ • model_name (Current Model)            │
├─────────────────────────────────────────┤
│ Startup Event:                          │
│ • Load .env                             │
│ • Initialize default model              │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│              Utils Modules              │
├─────────────────────────────────────────┤
│ utils/model.py:                         │
│ • check_model() - Verify model exists   │
│ • download_model() - Download model     │
│ • initialize_pipeline() - Setup model   │
│ • DownloadRequest - Pydantic model      │
├─────────────────────────────────────────┤
│ utils/chat_request.py:                  │
│ • ChatRequest - Request validation      │
├─────────────────────────────────────────┤
│ utils/chat_response.py:                 │
│ • create_chat_response() - Generate     │
│ • convert_json_format() - Parse output  │
│ • ChatResponse/ChatChoice/ChatUsage     │
└─────────────────────────────────────────┘
```

## Data Flow Patterns

### 1. Application Startup

```
.env → load_dotenv() → os.getenv("DEFAULT_MODEL_NAME")
          ↓
initialize_pipeline(model_name)
          ↓
check_model() → verify cache exists
          ↓
AutoTokenizer + AutoModelForCausalLM
          ↓
pipeline("text-generation")
          ↓
Global: pipe, tokenizer, model_name
```

### 2. Chat Request Flow

```
POST /v1/chat/completions
          ↓
ChatRequest (validation)
          ↓
Check model_name match
          ↓
create_chat_response(request, pipe, tokenizer)
          ↓
pipe(messages, max_new_tokens)
          ↓
convert_json_format() → clean output
          ↓
Calculate tokens (tokenizer.encode)
          ↓
ChatResponse (Pydantic)
```
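The tail of the chat flow — assembling the final response and computing token usage — can be sketched as below. The field layout follows the OpenAI-style chat-completion shape that the `/v1/chat/completions` route implies; the exact signature of the real `create_chat_response()` is an assumption, and the `encode` callable stands in for `tokenizer.encode`.

```python
import time
import uuid

def create_chat_response(messages, completion_text, model, encode):
    """Assemble an OpenAI-style chat completion dict (hedged sketch).

    `encode` stands in for tokenizer.encode; in the real app the token
    counts come from the actual tokenizer, not an approximation.
    """
    prompt_text = " ".join(m["content"] for m in messages)
    prompt_tokens = len(encode(prompt_text))
    completion_tokens = len(encode(completion_text))
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": completion_text},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

In the actual service these dicts are modeled as the `ChatResponse`/`ChatChoice`/`ChatUsage` Pydantic classes, which gives the same structure plus validation for free.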
### 3. Download Flow

```
POST /download
          ↓
download_model(model_name)
          ↓
AutoTokenizer.from_pretrained(cache_dir)
AutoModelForCausalLM.from_pretrained(cache_dir)
          ↓
initialize_pipeline(model_name)
          ↓
Update global: pipe, tokenizer, model_name
          ↓
Return success + loaded status
```

## Key Design Decisions

### 1. Global State Management

- **Why**: HTTP requests are stateless, but models are far too expensive to load per request
- **Solution**: Module-level globals for pipe/tokenizer/model_name
- **Trade-off**: Only one model loaded at a time, but memory- and time-efficient

### 2. Lazy Initialization with Fallback

- **Why**: The default model might not be cached on startup
- **Solution**: The startup event tries to load it but does not fail the app if it is missing
- **Trade-off**: Graceful degradation vs. guaranteed availability

### 3. Model Switching

- **Why**: Users may want different models
- **Solution**: Compare request.model against the current model_name and re-initialize on mismatch
- **Trade-off**: Re-initialization overhead vs. flexibility

### 4. Error Handling

- **Why**: Model operations can fail in multiple ways
- **Solution**: HTTPException for client errors, try/except for internal failures
- **Trade-off**: Clear API errors vs. implementation complexity

### 5. Environment Configuration

- **Why**: Different deployments need different defaults
- **Solution**: .env file with a hardcoded fallback
- **Trade-off**: External config vs. hardcoded values

## Security Considerations

- ✅ No hardcoded credentials in code
- ✅ HUGGINGFACE_TOKEN read from the environment
- ✅ Input validation via Pydantic
- ✅ No arbitrary code execution from user input

## Performance Patterns

- ✅ Model loaded once at startup
- ✅ Tokenizer reused across requests
- ✅ Token counting with the actual tokenizer
- ✅ Async route handlers for concurrency