# Tech Context
## Technology Stack
### Core Framework
- **FastAPI**: Modern, high-performance web framework
- **Uvicorn**: ASGI server for running FastAPI
- **Python 3.8+**: Required for type hints and async features
### AI/ML Libraries
- **Transformers**: Hugging Face library for model loading
- **PyTorch**: Backend for transformers
- **Accelerate**: Model optimization and distribution
- **HuggingFace Hub**: Model downloading and authentication
### Utilities
- **Pydantic**: Data validation and settings management
- **python-dotenv**: Environment variable management
- **python-multipart**: Form data handling
## Dependencies (requirements.txt)
```
fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv
```
## Configuration
### Environment Variables
```bash
# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx" # Optional, for gated models
```
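At startup these variables are read via `python-dotenv` and then accessed like ordinary environment variables. A minimal sketch (the fallback model name here simply mirrors the `.env` example above):

```python
import os

# In app.py, python-dotenv's load_dotenv() is called first so the .env file
# is merged into the process environment; after that, plain os.getenv works.
model_name = os.getenv("DEFAULT_MODEL_NAME", "unsloth/functiongemma-270m-it")
hf_token = os.getenv("HUGGINGFACE_TOKEN")  # None unless set (only needed for gated models)
```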
### Model Cache
- **Location**: `./my_model_cache`
- **Structure**: Hugging Face cache format
- **Management**: Automatic via transformers library
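The Hugging Face cache format stores each repo under a `models--{org}--{name}/snapshots/` directory. A hedged helper for checking whether a model is already on disk (`is_cached` is illustrative, not part of this codebase):

```python
from pathlib import Path

def is_cached(model_id: str, cache_dir: str = "./my_model_cache") -> bool:
    # HF cache layout: <cache_dir>/models--{org}--{name}/snapshots/<revision>/
    snapshots = Path(cache_dir) / f"models--{model_id.replace('/', '--')}" / "snapshots"
    # True only if at least one snapshot revision exists
    return snapshots.is_dir() and any(snapshots.iterdir())
```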
## API Endpoints
### 1. GET /
**Purpose**: Health check and welcome message
**Response**:
```json
{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}
```
### 2. POST /download
**Purpose**: Download and initialize a model
**Request**:
```json
{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
```
**Response**:
```json
{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
### 3. POST /v1/chat/completions
**Purpose**: OpenAI-compatible chat completion
**Request**:
```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}
```
**Response**:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
```
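The envelope above can be assembled from plain dictionaries. A minimal sketch (the helper name `make_chat_response` is illustrative; the real code builds Pydantic `ChatResponse` models instead):

```python
import time

def make_chat_response(content: str, prompt_tokens: int,
                       completion_tokens: int, model: str) -> dict:
    # Field names and shape mirror the OpenAI-compatible response example above
    now = int(time.time())
    return {
        "id": f"chatcmpl-{now}",
        "object": "chat.completion",
        "created": now,
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```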
## Module Structure
### app.py (Main Application)
```python
# Global state
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # Initialize pipeline

# Routes: GET /, POST /download, POST /v1/chat/completions
```
### utils/model.py (Model Management)
```python
class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple: ...
def download_model(model_name) -> tuple: ...
def initialize_pipeline(model_name) -> tuple: ...
```
### utils/chat_request.py (Request Validation)
```python
class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields
```
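A runnable version of this model might look as follows; the defaults of 500 tokens and temperature 1.0 are assumptions taken from the request example above, not confirmed project defaults:

```python
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class ChatRequest(BaseModel):
    model: Optional[str] = None        # falls back to the currently loaded model
    messages: List[Dict[str, Any]]     # OpenAI-style role/content pairs
    max_tokens: Optional[int] = 500    # assumed default
    temperature: Optional[float] = 1.0 # assumed default

# Only messages is required; everything else has a default
req = ChatRequest(messages=[{"role": "user", "content": "Hello"}])
```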
### utils/chat_response.py (Response Generation)
```python
class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...

def convert_json_format(input_data) -> dict: ...
def create_chat_response(request, pipe, tokenizer) -> ChatResponse: ...
```
## Deployment
### Hugging Face Spaces
- **SDK**: Docker
- **Port**: 7860 (standard for HF Spaces)
- **Requirements**: All dependencies in requirements.txt
- **Environment**: .env file for configuration
### Local Development
```bash
# Install dependencies
pip install -r requirements.txt
# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Access
http://localhost:7860
http://localhost:7860/docs
```
## Error Handling
### Common Errors
1. **Model Not Found**: HTTP 404 from check_model()
2. **Download Failed**: HTTP 500 with error message
3. **Initialization Failed**: HTTP 500 detail
4. **Pipeline Error**: Exception in create_chat_response()
### Logging
- Startup: Model initialization status
- Download: Progress and success/failure
- Chat: Token counts and errors
## Performance Considerations
### Memory
- Single model loaded at a time
- Tokenizer cached
- Pipeline reused across requests
### Latency
- Startup: One-time initialization cost
- Chat: Inference time (depends on model size)
- Download: Network + disk I/O
### Scalability
- Single model per instance
- Stateless API routes
- Async handlers for concurrency