Commit 3a5ea54 by khushalcodiste · Parent(s): 6c84960

Add application file

Files changed (3):
  1. .gitattributes +9 -33
  2. .gitignore +1 -0
  3. README.md +69 -45
.gitattributes CHANGED

@@ -1,35 +1,11 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
+*.py text eol=lf
+*.sh text eol=lf
+*.yml text eol=lf
+*.yaml text eol=lf
+*.json text eol=lf
+*.md text eol=lf
+*.txt text eol=lf
+
+# HuggingFace model cache should not be committed
 *.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
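The diff above drops the broad LFS patterns in favor of LF line-ending normalization for source files, while `*.bin` and `*.safetensors` stay under Git LFS. A quick sanity check of the new rules can be run in a throwaway repo (a sketch; `main.py` and `model.bin` are illustrative filenames, and `git` is assumed to be installed):

```shell
# Sketch: confirm the new .gitattributes rules in a throwaway repo.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
printf '%s\n' '*.py text eol=lf' '*.bin filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Source files are treated as text and normalized to LF:
git check-attr text eol -- main.py
# main.py: text: set
# main.py: eol: lf

# Model weights still route through the LFS filter:
git check-attr filter -- model.bin
# model.bin: filter: lfs
```

`git check-attr` resolves attributes without needing any commits, so this works immediately after `git init`.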
.gitignore CHANGED

@@ -8,3 +8,4 @@ wheels/
 
 # Virtual environments
 .venv
+Auth.txt
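The new `Auth.txt` entry keeps a local credentials file out of version control. Whether an ignore rule actually matches can be verified with `git check-ignore` (a sketch in a throwaway repo; `git` assumed installed):

```shell
# Sketch: verify Auth.txt is now ignored.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
printf '%s\n' '.venv' 'Auth.txt' > .gitignore
touch Auth.txt

# -v reports which .gitignore line matched:
git check-ignore -v Auth.txt
# .gitignore:2:Auth.txt	Auth.txt
```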
README.md CHANGED

@@ -1,24 +1,35 @@
+---
+title: Gemma4 FastAPI API
+emoji: 🚀
+colorFrom: purple
+colorTo: blue
+sdk: docker
+sdk_version: "latest"
+app_file: main.py
+pinned: false
+short_description: Text generation with Gemma-4-E2B via FastAPI
+---
+
 # Gemma4 FastAPI Application
 
-A FastAPI application that integrates with HuggingFace to serve the Gemma-4-E2B model via REST API endpoints.
+A FastAPI application that serves the Gemma-4-E2B model via REST API endpoints with enterprise-grade reliability and monitoring.
 
 ## Features
 
 - **Text Generation**: Generate text using Gemma-4's advanced reasoning capabilities
-- **Chat Interface**: Interactive chat with conversation memory
-- **Thinking Mode**: Enable Gemma-4's internal reasoning process
-- **Streaming Support**: Real-time streaming responses
 - **Health Monitoring**: Service health checks and model status
 - **Docker Containerization**: Easy deployment with Docker Compose
-- **GPU Support**: Automatic GPU detection and optimization
-- **Local Execution**: No cloud dependencies, runs entirely on your hardware
+- **CPU/GPU Support**: Automatic device detection
+- **Local Execution**: No cloud dependencies, runs on your hardware
+- **FastAPI**: Interactive API documentation at `/docs` and `/redoc`
 
 ## Prerequisites
 
 - Docker and Docker Compose
-- At least 8GB RAM (16GB recommended for optimal performance)
-- NVIDIA GPU with CUDA support (optional, CPU mode available)
-- HuggingFace account (optional, for faster downloads)
+- At least 4GB RAM (8GB+ recommended)
+- CPU: Works on any modern CPU
+- Optional: NVIDIA GPU for faster inference (can be enabled via docker-compose config)
+- HuggingFace account (optional, for faster model downloads)
 
 ## Quick Start
 
@@ -48,58 +59,57 @@ A FastAPI application that integrates with HuggingFace to serve the Gemma-4-E2B
    # Wait for the application to be ready
   # The first startup may take several minutes as the model downloads
   sleep 120
-  curl http://localhost:8001/api/health
+  curl http://localhost:8001/
   ```
 
4. **Test the API**
   ```bash
-  curl http://localhost:8001/api/health
+  curl http://localhost:8001/
   ```
 
 ## API Endpoints
 
 ### Health Check
-- `GET /api/health` - Check service and model status
+- `GET /` - Check service and model status
+- `GET /api/health` - Alias for health check
 
 ### Text Generation
-- `POST /api/generate` - Generate text from a prompt
-
-### Chat
-- `POST /api/chat` - Chat with the model
+- `POST /generate` - Generate text from a prompt
 
 ## API Usage Examples
 
+### Health Check
+```bash
+curl http://localhost:8001/
+```
+
 ### Text Generation
 ```bash
-curl -X POST "http://localhost:8001/api/generate" \
+curl -X POST "http://localhost:8001/generate" \
   -H "Content-Type: application/json" \
   -d '{
     "prompt": "Explain quantum computing in simple terms",
-    "think": false,
-    "stream": false
+    "max_tokens": 200,
+    "temperature": 0.7
   }'
 ```
 
-### Chat
-```bash
-curl -X POST "http://localhost:8001/api/chat" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "messages": [
-      {"role": "user", "content": "Hello, how are you?"}
-    ],
-    "think": false,
-    "stream": false
-  }'
+Response:
+```json
+{
+  "success": true,
+  "response": "Quantum computing is..."
+}
 ```
 
-### Streaming Response
+### Generate with Different Parameters
 ```bash
-curl -X POST "http://localhost:8001/api/generate" \
+curl -X POST "http://localhost:8001/generate" \
   -H "Content-Type: application/json" \
   -d '{
-    "prompt": "Write a short story",
-    "stream": true
+    "prompt": "Write a poem about AI",
+    "max_tokens": 150,
+    "temperature": 0.9
   }'
 ```
 
@@ -114,13 +124,14 @@ Environment variables in `.env`:
 
 ## Available Models
 
-The application works with any causal language model from HuggingFace. Some recommended options:
+The application works with any causal language model from HuggingFace. Default and recommended options:
 
-- `google/gemma-4-E2B` - Efficient 2B model (default)
+- `google/gemma-4-E2B` - Efficient 2B model (default, lightweight)
 - `google/gemma-2-2b-it` - Gemma 2 2B instruction-tuned
-- `google/gemma-2-9b` - Gemma 2 9B for better quality
+- `google/gemma-2-9b` - Gemma 2 9B (better quality, needs more RAM)
 - `meta-llama/Llama-2-7b` - Llama 2 7B
-- Any other causal language model from HuggingFace
+
+To change models, update the `MODEL_NAME` environment variable in `.env`.
 
 ## Development
 
@@ -189,9 +200,22 @@ If you encounter out-of-memory errors:
 - Check Docker network: `docker compose ps`
 - View logs: `docker compose logs gemma4-app`
 
-### GPU Not Being Used
-- Check that NVIDIA Docker runtime is installed: `docker run --rm --runtime=nvidia nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi`
-- Verify the container has GPU access: `docker compose logs gemma4-app` (should show "Using device: cuda")
+### Application Won't Start
+- Check system resources: `docker stats`
+- Ensure port 8001 is available: `lsof -i :8001` (or `netstat -ano | findstr :8001` on Windows)
+- Increase Docker memory if needed via Docker Desktop settings
+
+## HuggingFace Spaces Deployment
+
+This repository is configured for deployment on [HuggingFace Spaces](https://huggingface.co/spaces):
+
+1. Fork or clone this repository
+2. Create a new Space on HuggingFace (Docker SDK)
+3. Connect your repository
+4. The app will deploy automatically
+5. Access via the Space URL
+
+Note: First startup may take 5-10 minutes as the model downloads.
 
 ## API Documentation
 
@@ -199,10 +223,10 @@ Once running, visit `http://localhost:8001/docs` for interactive API documentation
 
 ## Performance Tips
 
-1. **GPU Usage**: If you have an NVIDIA GPU with CUDA, the app will automatically use it for faster inference
-2. **Model Caching**: The model is cached in Docker after first download
-3. **Batch Processing**: For best performance with multiple requests, use streaming mode
-4. **Memory Management**: Keep the container memory settings high enough for smooth operation
+1. **CPU-based**: Current setup uses CPU (optimized for resource efficiency)
+2. **Model Caching**: The model is cached after first download
+3. **Memory**: 4GB minimum, 8GB+ recommended for smooth operation
+4. **Inference Speed**: Depends on hardware; typically 10-30 tokens/second on modern CPUs
 
 ## License
 
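The new Quick Start waits a fixed `sleep 120` before hitting the health check, but the first model download can take anywhere from a couple of minutes to ten. A small polling helper is more robust: it is a sketch assuming the health endpoint from this README at `http://localhost:8001/` and that `curl` is available.

```shell
# Sketch: poll the health endpoint instead of sleeping a fixed 120 seconds.
# Assumes the service described in the README on localhost:8001.
wait_for_api() {
  url=${1:-http://localhost:8001/}
  tries=${2:-60}        # 60 attempts x 5s = wait up to 5 minutes
  i=1
  while :; do
    # -f makes curl fail on HTTP errors, so a 5xx during startup keeps polling
    if curl -fsS "$url" > /dev/null 2>&1; then
      echo "API is up"
      return 0
    fi
    [ "$i" -ge "$tries" ] && break
    i=$((i + 1))
    sleep 5             # first startup is slow while the model downloads
  done
  echo "API did not come up after $tries attempts" >&2
  return 1
}

# Usage: wait_for_api && curl -X POST http://localhost:8001/generate ...
```

With the defaults this gives roughly the same upper bound as the README's `sleep 120` plus headroom, but returns as soon as the service answers.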