khushalcodiste committed
Commit 6c84960 · 1 Parent(s): 6bfa874

Add application file
Files changed (13)
  1. .dockerignore +26 -0
  2. .env +9 -0
  3. .env.example +9 -0
  4. .gitignore +10 -0
  5. .python-version +1 -0
  6. Dockerfile +33 -0
  7. README.md +213 -10
  8. docker-compose.yml +24 -0
  9. main.py +163 -0
  10. pyproject.toml +7 -0
  11. requirements.txt +6 -0
  12. setup.sh +51 -0
  13. uv.lock +8 -0
.dockerignore ADDED
@@ -0,0 +1,26 @@
+ __pycache__
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ env
+ venv
+ .venv
+ pip-log.txt
+ pip-delete-this-directory.txt
+ .tox
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.log
+ .git
+ .mypy_cache
+ .pytest_cache
+ .hypothesis
+ .env
+ .env.local
+ .env.production
+ .env.staging
.env ADDED
@@ -0,0 +1,9 @@
+ # HuggingFace model configuration
+ MODEL_NAME=google/gemma-4-E2B-it
+
+ # Application configuration
+ APP_HOST=0.0.0.0
+ APP_PORT=8001
+
+ # Logging
+ LOG_LEVEL=INFO
.env.example ADDED
@@ -0,0 +1,9 @@
+ # HuggingFace model configuration
+ MODEL_NAME=google/gemma-4-E2B-it
+
+ # Application configuration
+ APP_HOST=0.0.0.0
+ APP_PORT=8001
+
+ # Logging
+ LOG_LEVEL=INFO
.gitignore ADDED
@@ -0,0 +1,10 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
.python-version ADDED
@@ -0,0 +1 @@
+ 3.11
Dockerfile ADDED
@@ -0,0 +1,33 @@
+ # Base image (3.11 to match .python-version and pyproject.toml)
+ FROM python:3.11-slim
+
+ # Install system dependencies (including curl for the healthcheck)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     curl \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Create a non-root user (HF Spaces requirement)
+ RUN useradd -m -u 1000 user
+
+ # Set working directory
+ WORKDIR /home/user/app
+
+ # Copy requirements first (for layer caching)
+ COPY --chown=user requirements.txt .
+
+ # Install dependencies
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir -r requirements.txt
+
+ # Copy the application
+ COPY --chown=user . .
+
+ # Switch to the non-root user
+ USER user
+
+ # Expose port (default 8001, configurable via APP_PORT)
+ EXPOSE 8001
+
+ # Run FastAPI; the shell form lets ${APP_PORT:-8001} expand at runtime
+ CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${APP_PORT:-8001}"]
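The `CMD` uses the shell form on purpose: `${APP_PORT:-8001}` is expanded by `sh` at container start, falling back to the default when the variable is unset or empty. A quick sketch of that expansion rule:

```shell
# ${VAR:-default} expands to $VAR when set and non-empty, else to the default
unset APP_PORT
echo "${APP_PORT:-8001}"   # 8001 while APP_PORT is unset

APP_PORT=9000
echo "${APP_PORT:-8001}"   # 9000 once APP_PORT is set
```

The exec-form `CMD ["uvicorn", …]` would not perform this expansion, which is why the port could not be made configurable without the `sh -c` wrapper.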
README.md CHANGED
@@ -1,10 +1,213 @@
- ---
- title: Gemme4
- emoji: 🐢
- colorFrom: indigo
- colorTo: blue
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Gemma4 FastAPI Application
+
+ A FastAPI application that integrates with HuggingFace to serve the Gemma-4-E2B model via REST API endpoints.
+
+ ## Features
+
+ - **Text Generation**: Generate text using Gemma-4's advanced reasoning capabilities
+ - **Chat Interface**: Interactive chat with conversation memory
+ - **Thinking Mode**: Enable Gemma-4's internal reasoning process
+ - **Streaming Support**: Real-time streaming responses
+ - **Health Monitoring**: Service health checks and model status
+ - **Docker Containerization**: Easy deployment with Docker Compose
+ - **GPU Support**: Automatic GPU detection and optimization
+ - **Local Execution**: No cloud dependencies; runs entirely on your hardware
+
+ ## Prerequisites
+
+ - Docker and Docker Compose
+ - At least 8GB RAM (16GB recommended for optimal performance)
+ - NVIDIA GPU with CUDA support (optional; CPU mode available)
+ - HuggingFace account (optional, for faster downloads)
+
+ ## Quick Start
+
+ 1. **Clone the repository**
+    ```bash
+    git clone <repository-url>
+    cd gemma4-fastapi
+    ```
+
+ 2. **Configure environment** (optional)
+    ```bash
+    cp .env.example .env
+    # Edit .env with your preferred settings if desired
+    ```
+
+ 3. **Run the setup script**
+    ```bash
+    chmod +x setup.sh
+    ./setup.sh
+    ```
+
+    Or manually:
+    ```bash
+    # Build and start the application
+    docker compose up --build -d
+
+    # Wait for the application to be ready
+    # The first startup may take several minutes as the model downloads
+    sleep 120
+    curl http://localhost:8001/api/health
+    ```
+
+ 4. **Test the API**
+    ```bash
+    curl http://localhost:8001/api/health
+    ```
+
+ ## API Endpoints
+
+ ### Health Check
+ - `GET /api/health` - Check service and model status
+
+ ### Text Generation
+ - `POST /api/generate` - Generate text from a prompt
+
+ ### Chat
+ - `POST /api/chat` - Chat with the model
+
+ ## API Usage Examples
+
+ ### Text Generation
+ ```bash
+ curl -X POST "http://localhost:8001/api/generate" \
+   -H "Content-Type: application/json" \
+   -d '{
+     "prompt": "Explain quantum computing in simple terms",
+     "think": false,
+     "stream": false
+   }'
+ ```
+
+ ### Chat
+ ```bash
+ curl -X POST "http://localhost:8001/api/chat" \
+   -H "Content-Type: application/json" \
+   -d '{
+     "messages": [
+       {"role": "user", "content": "Hello, how are you?"}
+     ],
+     "think": false,
+     "stream": false
+   }'
+ ```
+
+ ### Streaming Response
+ ```bash
+ curl -X POST "http://localhost:8001/api/generate" \
+   -H "Content-Type: application/json" \
+   -d '{
+     "prompt": "Write a short story",
+     "stream": true
+   }'
+ ```
+
+ ## Configuration
+
+ Environment variables in `.env`:
+
+ - `MODEL_NAME`: HuggingFace model to use (default: `google/gemma-4-E2B-it`)
+ - `APP_HOST`: FastAPI host (default: 0.0.0.0)
+ - `APP_PORT`: FastAPI port (default: 8001)
+ - `LOG_LEVEL`: Logging level (default: INFO)
+
+ ## Available Models
+
+ The application works with any causal language model from HuggingFace. Some recommended options:
+
+ - `google/gemma-4-E2B-it` - Efficient 2B instruction-tuned model (default)
+ - `google/gemma-2-2b-it` - Gemma 2 2B instruction-tuned
+ - `google/gemma-2-9b` - Gemma 2 9B for better quality
+ - `meta-llama/Llama-2-7b` - Llama 2 7B
+ - Any other causal language model from HuggingFace
+
+ ## Development
+
+ ### Local Development (without Docker)
+
+ 1. **Create a virtual environment**
+    ```bash
+    python -m venv venv
+    source venv/bin/activate  # On Windows: venv\Scripts\activate
+    ```
+
+ 2. **Install dependencies**
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 3. **Run the application**
+    ```bash
+    uvicorn main:app --reload --port 8001
+    ```
+
+ ### Running Tests
+
+ ```bash
+ pytest
+ ```
+
+ ## Docker Commands
+
+ ```bash
+ # Build the image
+ docker compose build
+
+ # Start services
+ docker compose up
+
+ # Start in background
+ docker compose up -d
+
+ # View logs
+ docker compose logs -f
+
+ # Stop services
+ docker compose down
+
+ # Rebuild and restart
+ docker compose up --build --force-recreate
+ ```
+
+ ## Troubleshooting
+
+ ### Model Download Issues
+ If the model is taking a long time to download on first startup:
+ - The model is being downloaded from HuggingFace (this can take 10+ minutes depending on your connection)
+ - You can monitor progress in the logs: `docker compose logs -f gemma4-app`
+ - The model cache is stored in a Docker volume, so subsequent startups are much faster
+
+ ### Memory Issues
+ If you encounter out-of-memory errors:
+ - The model weights are large; the E2B variant has 2B parameters (~5-6GB download)
+ - Ensure you have at least 16GB total RAM available
+ - For CPU-only mode, consider using a smaller model variant
+
+ ### Connection Issues
+ - Verify the API is running: `curl http://localhost:8001/api/health`
+ - Check Docker network: `docker compose ps`
+ - View logs: `docker compose logs gemma4-app`
+
+ ### GPU Not Being Used
+ - Check that the NVIDIA Docker runtime is installed: `docker run --rm --runtime=nvidia nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi`
+ - Verify the container has GPU access: `docker compose logs gemma4-app` (should show "Using device: cuda")
+
+ ## API Documentation
+
+ Once running, visit `http://localhost:8001/docs` for interactive API documentation (Swagger UI).
+
+ ## Performance Tips
+
+ 1. **GPU Usage**: If you have an NVIDIA GPU with CUDA, the app will automatically use it for faster inference
+ 2. **Model Caching**: The model is cached in a Docker volume after the first download
+ 3. **Streaming**: For long generations, use streaming mode so tokens arrive as they are produced
+ 4. **Memory Management**: Keep the container memory settings high enough for smooth operation
+
+ ## License
+
+ [Add your license here]
+
+ ## Contributing
+
+ [Add contribution guidelines here]
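Beyond curl, the service can be called from Python with only the standard library. This is a minimal sketch: the route and field names (`/generate`, `prompt`, `max_tokens`, `temperature`, `response`) follow `main.py` in this commit rather than the `/api/...` paths documented above, so adjust `API_URL` to whichever routes your deployment actually exposes.

```python
import json
import urllib.request

# Route per main.py in this commit; change to /api/generate if your
# deployment uses the README's route names instead.
API_URL = "http://localhost:8001/generate"


def build_payload(prompt: str, max_tokens: int = 100, temperature: float = 0.7) -> dict:
    """Mirror the GenerateRequest fields from main.py."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}


def generate(prompt: str, **kwargs) -> str:
    """POST a prompt to the running service and return the generated text."""
    data = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["response"]


# Usage (with the stack running):
#   generate("Explain quantum computing in simple terms", max_tokens=200)
```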
docker-compose.yml ADDED
@@ -0,0 +1,24 @@
+ version: '3.8'
+
+ services:
+   gemma4-app:
+     build: .
+     container_name: gemma4-api
+     ports:
+       - "${APP_PORT:-8001}:${APP_PORT:-8001}"
+     environment:
+       - MODEL_NAME=${MODEL_NAME:-google/gemma-4-E2B-it}
+       - APP_PORT=${APP_PORT:-8001}
+       - LOG_LEVEL=${LOG_LEVEL:-INFO}
+       - HF_HOME=/home/user/.cache/huggingface
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:${APP_PORT:-8001}/"]
+       interval: 30s
+       timeout: 10s
+       retries: 3
+       start_period: 120s
+     volumes:
+       - model_cache:/home/user/.cache/huggingface
+
+ volumes:
+   model_cache:
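Because every setting in the compose file uses `${VAR:-default}` interpolation, values can be overridden from the shell or a `.env` file without editing the YAML. A one-off override might look like this (the model name and port here are illustrative, not recommendations):

```shell
# Override the model and port for a single run (illustrative values)
MODEL_NAME=google/gemma-2-2b-it APP_PORT=9000 docker compose up -d --build
```

Compose reads `.env` in the project directory automatically, so persistent overrides belong there rather than on the command line.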
main.py ADDED
@@ -0,0 +1,163 @@
+ import time
+ import logging
+ import os
+ from typing import Optional
+
+ from fastapi import FastAPI, Request, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from fastapi.responses import JSONResponse
+ from pydantic import BaseModel
+
+ from transformers import pipeline
+
+ # =========================
+ # 🔥 LOGGING CONFIG
+ # =========================
+ logging.basicConfig(
+     level=os.getenv("LOG_LEVEL", "INFO"),
+     format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+ )
+
+ logger = logging.getLogger("gemma-api")
+
+ # =========================
+ # 🚀 APP INIT
+ # =========================
+ app = FastAPI(
+     title="Gemma 4 API",
+     version="1.0.0",
+ )
+
+ # =========================
+ # 🌐 CORS CONFIG
+ # =========================
+ origins = [
+     "*",  # ⚠️ change in production
+ ]
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=origins,
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # =========================
+ # ⏱️ REQUEST LOGGING MIDDLEWARE
+ # =========================
+ @app.middleware("http")
+ async def log_requests(request: Request, call_next):
+     start_time = time.time()
+
+     logger.info(f"➡️ Incoming request: {request.method} {request.url}")
+
+     try:
+         response = await call_next(request)
+     except Exception as e:
+         logger.exception(f"❌ Unhandled error: {str(e)}")
+         raise
+
+     process_time = time.time() - start_time
+     logger.info(f"⬅️ Completed in {process_time:.4f}s | Status: {response.status_code}")
+
+     response.headers["X-Process-Time"] = str(process_time)
+     return response
+
+ # =========================
+ # 📦 MODEL LOADING
+ # =========================
+ MODEL_NAME = os.getenv("MODEL_NAME", "google/gemma-4-E2B-it")
+
+ pipe = None
+
+
+ @app.on_event("startup")
+ def load_model():
+     global pipe
+     try:
+         logger.info(f"🔄 Loading model {MODEL_NAME}...")
+
+         pipe = pipeline(
+             "text-generation",
+             model=MODEL_NAME,
+             device_map="auto",
+             torch_dtype="auto",
+         )
+
+         logger.info("✅ Model loaded successfully")
+
+     except Exception:
+         logger.exception("❌ Failed to load model")
+         raise
+
+
+ # =========================
+ # 📥 REQUEST MODEL
+ # =========================
+ class GenerateRequest(BaseModel):
+     prompt: str
+     max_tokens: Optional[int] = 100
+     temperature: Optional[float] = 0.7
+
+
+ # =========================
+ # 📤 GENERATION ENDPOINT
+ # =========================
+ @app.post("/generate")
+ async def generate(req: GenerateRequest):
+     if pipe is None:
+         raise HTTPException(status_code=500, detail="Model not loaded")
+
+     try:
+         logger.info(f"🧠 Generating for prompt: {req.prompt[:50]}...")
+
+         output = pipe(
+             req.prompt,
+             max_new_tokens=req.max_tokens,
+             temperature=req.temperature,
+             do_sample=True,
+         )
+
+         result = output[0]["generated_text"]
+
+         logger.info("✅ Generation successful")
+
+         return {
+             "success": True,
+             "response": result
+         }
+
+     except Exception as e:
+         logger.exception("❌ Generation failed")
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ # =========================
+ # ❤️ HEALTH CHECK
+ # =========================
+ @app.get("/")
+ @app.get("/api/health")
+ async def health():
+     return {
+         "status": "ok",
+         "model_loaded": pipe is not None
+     }
+
+
+ # =========================
+ # ❗ GLOBAL ERROR HANDLER
+ # =========================
+ @app.exception_handler(Exception)
+ async def global_exception_handler(request: Request, exc: Exception):
+     logger.exception(f"🔥 Global error: {str(exc)}")
+
+     return JSONResponse(
+         status_code=500,
+         content={
+             "success": False,
+             "error": str(exc)
+         },
+     )
pyproject.toml ADDED
@@ -0,0 +1,7 @@
+ [project]
+ name = "gemma4"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.11"
+ dependencies = []
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ fastapi
+ uvicorn[standard]
+ torch
+ accelerate
+ sentencepiece
+ transformers>=4.42.0
setup.sh ADDED
@@ -0,0 +1,51 @@
+ #!/bin/bash
+
+ # Setup script for the Gemma4 FastAPI application
+
+ echo "Setting up Gemma4 FastAPI application..."
+
+ # Check if Docker is installed
+ if ! command -v docker &> /dev/null; then
+     echo "Error: Docker is not installed. Please install Docker first."
+     exit 1
+ fi
+
+ # Check if Docker Compose is available
+ if ! docker compose version &> /dev/null; then
+     echo "Error: Docker Compose plugin is not available. Please install Docker Compose or enable the Docker Compose plugin."
+     exit 1
+ fi
+
+ # Load .env values if present
+ if [ -f .env ]; then
+     set -o allexport
+     . .env
+     set +o allexport
+ fi
+
+ MODEL_NAME=${MODEL_NAME:-google/gemma-4-E2B-it}
+ APP_PORT=${APP_PORT:-8001}
+
+ echo "Building Docker images..."
+ docker compose build
+
+ echo "Starting FastAPI application..."
+ docker compose up -d
+
+ echo "Waiting for application to be ready..."
+ until curl -sSf http://localhost:${APP_PORT}/ >/dev/null 2>&1; do
+     printf '.'
+     sleep 3
+ done
+
+ echo ""
+ echo "Setup complete!"
+ echo ""
+ echo "API will be available at: http://localhost:${APP_PORT}"
+ echo "API documentation at: http://localhost:${APP_PORT}/docs"
+ echo ""
+ echo "To check health: curl http://localhost:${APP_PORT}/"
+ echo "To check logs: docker compose logs -f"
+ echo "To stop services: docker compose down"
+ echo ""
+ echo "Note: First-time startup may take several minutes while the model downloads."
uv.lock ADDED
@@ -0,0 +1,8 @@
+ version = 1
+ revision = 3
+ requires-python = ">=3.11"
+
+ [[package]]
+ name = "gemma4"
+ version = "0.1.0"
+ source = { virtual = "." }