
Tech Context

Technology Stack

Core Framework

  • FastAPI: Modern, high-performance web framework
  • Uvicorn: ASGI server for running FastAPI
  • Python 3.8+: Required for type hints and async features

AI/ML Libraries

  • Transformers: Hugging Face library for model loading
  • PyTorch: Backend for transformers
  • Accelerate: Model optimization and distribution
  • HuggingFace Hub: Model downloading and authentication

Utilities

  • Pydantic: Data validation and settings management
  • python-dotenv: Environment variable management
  • python-multipart: Form data handling

Dependencies (requirements.txt)

fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv

Configuration

Environment Variables

# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx"  # Optional, for gated models

Model Cache

  • Location: ./my_model_cache
  • Structure: Hugging Face cache format
  • Management: Automatic via transformers library
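The transformers cache stores each model under a directory named `models--{org}--{name}` (e.g. `models--TinyLlama--TinyLlama-1.1B-Chat-v1.0`). As an illustration of that layout, a small stdlib-only sketch (the function name `list_cached_models` is hypothetical, and it assumes the Hub's naming rules keep the `--` separator unambiguous) that lists which models are already cached:

```python
from pathlib import Path
from typing import List

def list_cached_models(cache_dir: str = "./my_model_cache") -> List[str]:
    """Return repo ids found in a Hugging Face-style cache directory.

    The transformers cache names each model directory
    'models--{org}--{name}'; this sketch inverts that naming.
    """
    cache = Path(cache_dir)
    if not cache.is_dir():
        return []
    models = []
    for entry in cache.iterdir():
        if entry.is_dir() and entry.name.startswith("models--"):
            # Strip the 'models--' prefix and restore the 'org/name'
            # separator (only the first '--' is the org/name boundary).
            models.append(entry.name[len("models--"):].replace("--", "/", 1))
    return sorted(models)
```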

API Endpoints

1. GET /

Purpose: Health check and welcome message

Response:

{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}

2. POST /download

Purpose: Download and initialize a model

Request:

{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}

Response:

{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}

3. POST /v1/chat/completions

Purpose: OpenAI-compatible chat completion

Request:

{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}

Response:

{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
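Because the payload follows the OpenAI shape, clients can pull the assistant text out the same way they would with the OpenAI API. A minimal helper (the name `extract_reply` is illustrative, not part of this project):

```python
from typing import Any, Dict

def extract_reply(completion: Dict[str, Any]) -> str:
    """Return the assistant message content from the first choice of a
    chat.completion payload, raising if the shape is unexpected."""
    choices = completion.get("choices")
    if not choices:
        raise ValueError("completion has no choices")
    return choices[0]["message"]["content"]
```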

Module Structure

app.py (Main Application)

# Global state
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # Initialize pipeline

# Routes
# GET /, POST /download, POST /v1/chat/completions

utils/model.py (Model Management)

class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple
def download_model(model_name) -> tuple
def initialize_pipeline(model_name) -> tuple
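The three helpers share a `(success, message)` tuple convention. A sketch of what `check_model` might do before any network call (the validation logic here is illustrative; the real implementation presumably also asks the Hugging Face Hub whether the repository exists):

```python
import re
from typing import Tuple

# Hugging Face repo ids look like "org/name" with a restricted character set.
_REPO_ID = re.compile(r"^[\w.-]+/[\w.-]+$")

def check_model(model_name: str) -> Tuple[bool, str]:
    """Validate a repo id before attempting a download.

    Illustrative only: a production check_model would also verify
    that the repository actually exists on the Hub.
    """
    if not model_name or not _REPO_ID.match(model_name):
        return False, f"Invalid model id: {model_name!r}"
    return True, f"Model id {model_name} looks valid"
```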

utils/chat_request.py (Request Validation)

class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields

utils/chat_response.py (Response Generation)

class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...

def convert_json_format(input_data) -> dict
def create_chat_response(request, pipe, tokenizer) -> ChatResponse
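`create_chat_response` assembles the OpenAI-style envelope shown in the endpoint section above. A self-contained sketch of that assembly (the function name `build_chat_response` is hypothetical and the generation step is stubbed out; field names follow the documented response):

```python
import time
import uuid
from typing import Any, Dict

def build_chat_response(model: str, reply: str,
                        prompt_tokens: int,
                        completion_tokens: int) -> Dict[str, Any]:
    """Wrap generated text in a chat.completion envelope."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```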

Deployment

Hugging Face Spaces

  • SDK: Docker
  • Port: 7860 (standard for HF Spaces)
  • Requirements: All dependencies in requirements.txt
  • Environment: .env file for configuration

Local Development

# Install dependencies
pip install -r requirements.txt

# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Access
http://localhost:7860
http://localhost:7860/docs

Error Handling

Common Errors

  1. Model Not Found: HTTP 404 from check_model()
  2. Download Failed: HTTP 500 with error message
  3. Initialization Failed: HTTP 500 detail
  4. Pipeline Error: Exception in create_chat_response()
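The failure modes above can be summarised as a mapping from error kind to HTTP response, in the `(status_code, {"detail": ...})` shape FastAPI's `HTTPException` produces. This mapping is an illustration of the list, not code from the project:

```python
from typing import Dict, Tuple

# Illustrative mapping of the failure modes above to status codes.
ERROR_STATUS = {
    "model_not_found": 404,
    "download_failed": 500,
    "initialization_failed": 500,
    "pipeline_error": 500,
}

def to_http_error(kind: str, detail: str) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, body), defaulting unknown kinds to 500."""
    return ERROR_STATUS.get(kind, 500), {"detail": detail}
```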

Logging

  • Startup: Model initialization status
  • Download: Progress and success/failure
  • Chat: Token counts and errors

Performance Considerations

Memory

  • Single model loaded at a time
  • Tokenizer cached
  • Pipeline reused across requests
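The single-model constraint amounts to a swap of module-level state: loading a new model replaces the old pipeline so at most one set of weights stays resident. A sketch mirroring app.py's globals (the `loader` callable stands in for `initialize_pipeline`; `swap_model` is a hypothetical name):

```python
from typing import Any, Callable, Optional

# Module-level state, mirroring app.py's globals.
current_model: Optional[str] = None
pipe: Optional[Any] = None

def swap_model(model_name: str, loader: Callable[[str], Any]) -> Any:
    """Load model_name and drop the previous pipeline, keeping at most
    one model resident. Repeated requests for the current model reuse
    the existing pipeline instead of reloading."""
    global current_model, pipe
    if model_name != current_model:
        pipe = loader(model_name)  # old pipeline becomes garbage-collectable
        current_model = model_name
    return pipe
```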

Latency

  • Startup: One-time initialization cost
  • Chat: Inference time (depends on model size)
  • Download: Network + disk I/O

Scalability

  • Single model per instance
  • Stateless API routes
  • Async handlers for concurrency
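Async handlers only buy concurrency if the blocking inference call is pushed off the event loop. A sketch using Python 3.9+'s `asyncio.to_thread` (the app's actual handlers may differ; `generate_blocking` is a stand-in for the pipeline call):

```python
import asyncio

def generate_blocking(prompt: str) -> str:
    """Stand-in for a blocking pipeline() inference call."""
    return f"echo: {prompt}"

async def chat_handler(prompt: str) -> str:
    # Run the blocking inference in a worker thread so the event
    # loop stays free to accept other requests meanwhile.
    return await asyncio.to_thread(generate_blocking, prompt)

async def main() -> None:
    # Two requests handled concurrently; gather preserves order.
    replies = await asyncio.gather(chat_handler("a"), chat_handler("b"))
    print(replies)

if __name__ == "__main__":
    asyncio.run(main())
```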