Commit a4e7832
Parent(s): f6fdf6a

Add comprehensive performance and compatibility test suite

- Inference speed tests (latency, throughput, TTFT)
- OpenAI API compatibility tests
- Concurrent load testing
- Comprehensive benchmark script
- Test documentation and guides

Files changed:
- DEPLOYMENT.md +104 -0
- TESTING.md +223 -0
- requirements-dev.txt +11 -0
- tests/performance/README.md +271 -0
- tests/performance/__init__.py +2 -0
- tests/performance/benchmark.py +344 -0
- tests/performance/test_inference_speed.py +242 -0
- tests/performance/test_openai_compatibility.py +345 -0
DEPLOYMENT.md ADDED
@@ -0,0 +1,104 @@
# PRIIPs LLM Service - Deployment Configuration

## Overview
This service uses vLLM on an NVIDIA L40 GPU to serve the DragonLLM/LLM-Pro-Finance-Small model.

## Configuration

### Docker Setup
- **Base Image**: `nvidia/cuda:12.1.0-runtime-ubuntu22.04`
- **Python Version**: 3.11
- **vLLM Version**: >=0.6.0

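The Dockerfile itself is not reproduced here, but for local testing against the same stack the image can be built and run roughly as follows. This is a sketch only: the image tag and port mapping are assumptions, and on Hugging Face Spaces the build and GPU attachment are handled by the platform.

```bash
# Illustrative local build/run only; Hugging Face Spaces builds and runs the image itself.
# The image tag "priips-llm-service" and port 7860 are assumptions, not repo tooling.
docker build -t priips-llm-service .
docker run --gpus all \
  -e HF_TOKEN_LC=<your_hugging_face_token> \
  -p 7860:7860 \
  priips-llm-service
```
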
### Model Configuration
- **Model**: `DragonLLM/LLM-Pro-Finance-Small`
- **Backend**: vLLM (optimized for L40 GPU)
- **Authentication**: HF_TOKEN_LC environment variable
- **GPU Utilization**: 90% of available memory
- **Tensor Parallel Size**: 1 (single L40 GPU)
- **Max Model Length**: 4096 tokens
- **Dtype**: float16

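These settings correspond directly to vLLM engine arguments. A minimal sketch of how they map onto the `vllm` Python API is shown below; this is illustrative only and is not the service's actual startup code, which is not part of this commit. Authentication for the gated model is handled separately via the HF_TOKEN_LC secret described below.

```python
# Minimal sketch: how the configuration above maps to vLLM engine arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="DragonLLM/LLM-Pro-Finance-Small",
    dtype="float16",                # Dtype
    gpu_memory_utilization=0.9,     # 90% of available memory
    tensor_parallel_size=1,         # single L40 GPU
    max_model_len=4096,             # Max Model Length
)

# Generate a short completion to confirm the engine is up.
outputs = llm.generate(
    ["What is a PRIIPs KID?"],
    SamplingParams(max_tokens=100, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```
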
### vLLM Advantages
1. **High Throughput**: PagedAttention for efficient memory management
2. **GPU Optimization**: Specifically optimized for NVIDIA GPUs like L40
3. **Fast Inference**: Up to 24x faster than standard Transformers
4. **Batching**: Automatic continuous batching for multiple requests
5. **OpenAI Compatible**: Drop-in replacement for OpenAI API

### Hardware
- **GPU**: NVIDIA L40S
- **VRAM**: 48GB
- **Platform**: Hugging Face Spaces

### Environment Variables Required
```bash
HF_TOKEN_LC=<your_hugging_face_token>  # For accessing Dragon LLM models
SERVICE_API_KEY=<optional>             # For API authentication
```

### API Endpoints
- `GET /` - Service info
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (OpenAI compatible)
- `POST /extract-priips` - PRIIPs document extraction

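The endpoints above can be exercised directly with `curl`. For example, a quick smoke test of the health check and the OpenAI-compatible chat endpoint looks like this (the `/extract-priips` request schema is not documented here, so it is omitted):

```bash
# Health check
curl https://jeanbaptdzd-priips-llm-service.hf.space/health

# OpenAI-compatible chat completion
curl https://jeanbaptdzd-priips-llm-service.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DragonLLM/LLM-Pro-Finance-Small",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50
      }'
```
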
### Model Loading
- Model loads on first API request (lazy loading)
- Downloads from Hugging Face using HF_TOKEN_LC
- Cached in `/tmp/huggingface` directory
- Automatic GPU detection and optimization

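How the cache directory and token are wired is not shown in this commit; one plausible setup, using the standard Hugging Face environment variables, would be:

```bash
# Assumed wiring only; the service's actual startup code is not part of this commit.
export HF_HOME=/tmp/huggingface              # Hugging Face cache location
export HUGGING_FACE_HUB_TOKEN=$HF_TOKEN_LC   # token used for gated model downloads
```
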
### Performance
- **Latency**: ~100-200ms per request (depends on prompt length)
- **Throughput**: High with vLLM's continuous batching
- **Memory**: Efficient PagedAttention reduces memory fragmentation

## Integration

### PydanticAI
```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel

model = OpenAIModel(
    "DragonLLM/LLM-Pro-Finance-Small",
    base_url="https://jeanbaptdzd-priips-llm-service.hf.space/v1"
)
agent = Agent(model=model)
```

### DSPy
```python
import dspy

lm = dspy.OpenAI(
    model="DragonLLM/LLM-Pro-Finance-Small",
    api_base="https://jeanbaptdzd-priips-llm-service.hf.space/v1"
)
dspy.settings.configure(lm=lm)
```

## Troubleshooting

### Build Errors
- Check that CUDA base image is compatible
- Verify vLLM installation with GPU support
- Ensure HF_TOKEN_LC is set in Space secrets

### Runtime Errors
- Check GPU availability: `torch.cuda.is_available()`
- Verify model access with HF token
- Check logs for OOM (out of memory) errors

### Performance Issues
- Increase `gpu_memory_utilization` if the GPU is underutilized
- Adjust `max_model_len` based on use case
- Enable tensor parallelism for multi-GPU setups

## Monitoring
- Check Space status via Hugging Face dashboard
- Monitor GPU utilization and memory usage
- Review application logs for errors
TESTING.md ADDED
@@ -0,0 +1,223 @@
# Testing Guide

## Quick Start

Once your Hugging Face Space is deployed and running, you can run comprehensive performance tests:

```bash
# Install test dependencies
pip install -r requirements-dev.txt

# Run the comprehensive benchmark (recommended)
python tests/performance/benchmark.py

# Or run individual test suites
pytest tests/performance/test_inference_speed.py -v -s
pytest tests/performance/test_openai_compatibility.py -v -s
```

## What Gets Tested

### Performance Metrics
- **Latency**: End-to-end response time
- **Token Throughput**: Tokens generated per second
- **Concurrent Handling**: Multiple simultaneous requests
- **Time to First Token (TTFT)**: Latency to start streaming

### OpenAI API Compatibility
- Endpoint compatibility (`/v1/models`, `/v1/chat/completions`)
- Message formats (system, user, assistant, multi-turn)
- Parameters (temperature, max_tokens, top_p, stream)
- Official OpenAI client library compatibility
- Response schema validation

### Load Testing
- Single request performance
- Concurrent request handling (5-10 requests)
- Different prompt lengths
- Different output lengths (50-500 tokens)

## Expected Results (L40 GPU with vLLM)

### Good Performance:
```
✓ Average latency: 1-2 seconds (100 tokens)
✓ Token throughput: 50-100 tokens/second
✓ TTFT: < 500ms
✓ Concurrent capacity: 5-10 req/sec
✓ OpenAI compatibility: 100%
```

### Performance Indicators:

| Metric | Excellent | Good | Needs Improvement |
|--------|-----------|------|-------------------|
| Latency (100 tokens) | < 1s | 1-3s | > 3s |
| Token throughput | > 80 tok/s | 40-80 tok/s | < 40 tok/s |
| TTFT | < 300ms | 300-700ms | > 700ms |
| Concurrent (5 req) | < 4s | 4-8s | > 8s |

## Test Output Example

```bash
$ python tests/performance/benchmark.py

############################################################
PRIIPs LLM Service - Comprehensive Benchmark Suite
Service: https://jeanbaptdzd-priips-llm-service.hf.space
############################################################

Checking service health...
✓ Service is healthy

============================================================
BENCHMARK: Single Request Latency
============================================================
Run 1/5: 1.45s, 61.38 tokens/sec
Run 2/5: 1.52s, 58.92 tokens/sec
Run 3/5: 1.48s, 60.14 tokens/sec
Run 4/5: 1.51s, 59.21 tokens/sec
Run 5/5: 1.46s, 61.01 tokens/sec

Results:
 Average latency: 1.48s (±0.03s)
 Min/Max latency: 1.45s / 1.52s
 Average throughput: 60.13 tokens/sec
 Max throughput: 61.38 tokens/sec

============================================================
BENCHMARK: Concurrent Load (5 requests)
============================================================

Results:
 Total time: 3.21s
 Successful: 5/5
 Average latency: 2.15s
 Requests/sec: 1.56

============================================================
BENCHMARK: OpenAI API Compatibility
============================================================
✓ List models endpoint
✓ Chat completions endpoint
✓ System message support
✓ Conversation history
✓ Temperature parameter
✓ Max tokens parameter

Compatibility Score: 6/6 (100%)

############################################################
SUMMARY
############################################################

Performance:
 Average latency: 1.48s
 Token throughput: 60.13 tokens/sec
 Concurrent capacity: 1.56 req/sec

OpenAI Compatibility: 6/6

Full results saved to benchmark_results.json
```

## Running Specific Tests

### Test Inference Speed Only:
```bash
pytest tests/performance/test_inference_speed.py::test_single_request_latency -v -s
```

### Test OpenAI Compatibility Only:
```bash
pytest tests/performance/test_openai_compatibility.py::TestOpenAIClientLibrary -v -s
```

### Test Streaming:
```bash
pytest tests/performance/test_openai_compatibility.py::TestOpenAIClientLibrary::test_streaming_with_openai_client -v -s
```

## Troubleshooting

### Service Not Available
```bash
# Check health endpoint
curl https://jeanbaptdzd-priips-llm-service.hf.space/health

# Check if Space is running on HF dashboard
```

### Slow Performance
- Check GPU utilization in HF Spaces logs
- Verify model is loaded (first request is slower)
- Check if using correct hardware (L40 GPU)

### OpenAI Client Errors
```bash
# Install latest OpenAI client
pip install --upgrade openai
```

## Integration Examples

### Use with PydanticAI:
```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel

model = OpenAIModel(
    "DragonLLM/LLM-Pro-Finance-Small",
    base_url="https://jeanbaptdzd-priips-llm-service.hf.space/v1"
)
agent = Agent(model=model)
result = agent.run_sync("What is machine learning?")
```

### Use with DSPy:
```python
import dspy

lm = dspy.OpenAI(
    model="DragonLLM/LLM-Pro-Finance-Small",
    api_base="https://jeanbaptdzd-priips-llm-service.hf.space/v1"
)
dspy.settings.configure(lm=lm)
```

### Direct OpenAI Client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://jeanbaptdzd-priips-llm-service.hf.space/v1",
    api_key="dummy"  # Not required if no auth
)

response = client.chat.completions.create(
    model="DragonLLM/LLM-Pro-Finance-Small",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

## Continuous Monitoring

Set up automated performance monitoring:

```bash
# Run benchmarks hourly
0 * * * * cd /path/to/repo && python tests/performance/benchmark.py

# Compare results over time
python scripts/compare_benchmarks.py benchmark_results_*.json
```

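`scripts/compare_benchmarks.py` is referenced above but is not included in this commit. A minimal sketch of such a script, assuming the JSON structure written by `tests/performance/benchmark.py`, could look like:

```python
#!/usr/bin/env python3
"""Sketch of a benchmark comparison script (hypothetical; not part of this commit).

Assumes the JSON structure written by tests/performance/benchmark.py.
Usage: python scripts/compare_benchmarks.py benchmark_results_*.json
"""
import json
import sys

for path in sys.argv[1:]:
    with open(path) as f:
        results = json.load(f)
    single = results.get("single_request", {})
    concurrent = results.get("concurrent_load", {})
    # One line per results file, so runs can be eyeballed or diffed over time.
    print(
        f"{path}: "
        f"avg_latency={single.get('avg_latency', float('nan')):.2f}s, "
        f"throughput={single.get('avg_tokens_per_sec', float('nan')):.2f} tok/s, "
        f"req/s={concurrent.get('requests_per_sec', float('nan')):.2f}"
    )
```
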
## Next Steps

1. Run initial benchmark to establish baseline
2. Monitor performance over time
3. Optimize based on bottlenecks found
4. Test with production workloads
5. Set up alerts for performance degradation
requirements-dev.txt ADDED
@@ -0,0 +1,11 @@
# Development and testing dependencies
-r requirements.txt

# Testing
pytest>=7.4.0
pytest-asyncio>=0.21.0
openai>=1.0.0

# Performance testing
httpx>=0.27.0
tests/performance/README.md ADDED
@@ -0,0 +1,271 @@
# Performance Test Suite

Comprehensive performance and compatibility tests for the PRIIPs LLM Service.

## Quick Start

```bash
# Install additional test dependencies
pip install pytest pytest-asyncio openai

# Run all performance tests
pytest tests/performance/ -v -s

# Run specific test suites
pytest tests/performance/test_inference_speed.py -v -s
pytest tests/performance/test_openai_compatibility.py -v -s

# Run comprehensive benchmark
python tests/performance/benchmark.py
```

## Test Suites

### 1. Inference Speed Tests (`test_inference_speed.py`)

Tests various performance metrics:

- **Single Request Latency**: Measures end-to-end latency for individual requests
- **Token Throughput**: Measures tokens generated per second at different lengths
- **Concurrent Requests**: Tests performance under concurrent load
- **Time to First Token (TTFT)**: Measures latency to first generated token
- **Prompt Processing Speed**: Tests how quickly different prompt lengths are processed
- **Temperature Variance**: Tests response generation with different temperatures

#### Key Metrics:
- Latency (seconds)
- Tokens per second
- Concurrent request handling
- TTFT (Time to First Token)

### 2. OpenAI Compatibility Tests (`test_openai_compatibility.py`)

Validates OpenAI API compatibility:

**Endpoint Compatibility:**
- `GET /v1/models` - Model listing
- `POST /v1/chat/completions` - Chat completions

**Message Format Tests:**
- System messages
- Conversation history
- Multi-turn conversations

**Parameter Tests:**
- `temperature`
- `max_tokens`
- `top_p`
- `stream`

**Client Library Tests:**
- Official OpenAI Python client compatibility
- Streaming support

**Error Handling:**
- Invalid models
- Missing required fields
- Empty messages

**Response Schema:**
- Full OpenAI response format validation
- Proper usage statistics
- Correct finish reasons

### 3. Comprehensive Benchmark (`benchmark.py`)

All-in-one benchmark script that:
- Runs all performance tests
- Validates OpenAI compatibility
- Generates detailed report
- Saves results to JSON

## Configuration

### Change Target URL

Edit the `BASE_URL` in each test file:

```python
# For production
BASE_URL = "https://jeanbaptdzd-priips-llm-service.hf.space"

# For local testing
BASE_URL = "http://localhost:7860"
```

### Adjust Test Parameters

Modify test parameters in each test:

```python
# Number of concurrent requests
num_concurrent = 10

# Number of test runs
num_runs = 10

# Max tokens for generation
max_tokens = 100
```

## Expected Results

### Good Performance Metrics (on L40 GPU):

- **Latency**: < 2 seconds for 100 tokens
- **Token Throughput**: > 50 tokens/second
- **TTFT**: < 500ms
- **Concurrent Handling**: > 5 requests/second

### OpenAI Compatibility:

Should pass all compatibility tests (100% score)

## Test Output Examples

### Inference Speed Test Output:
```
=== Single Request Performance ===
Latency: 1.45s
Prompt tokens: 12
Completion tokens: 89
Total tokens: 101
Tokens per second: 61.38
Response: Artificial intelligence (AI) refers to...
```

### Concurrent Load Test Output:
```
=== Concurrent Requests Test (10 requests) ===
Total time: 3.21s
Successful requests: 10/10
Average latency: 2.15s
Requests per second: 3.12
```

### OpenAI Compatibility Output:
```
=== OpenAI API Compatibility ===
✓ List models endpoint
✓ Chat completions endpoint
✓ System message support
✓ Conversation history
✓ Temperature parameter
✓ Max tokens parameter

Compatibility Score: 6/7 (86%)
```

## Troubleshooting

### Tests Timeout
- Increase timeout in `httpx.AsyncClient(timeout=120.0)`
- Check if service is running with health check

### Connection Errors
- Verify BASE_URL is correct
- Check network connectivity
- Ensure service is deployed and running

### Performance Lower Than Expected
- Check GPU utilization on server
- Verify vLLM configuration
- Look for model loading issues in logs

## Integration with CI/CD

Add to your CI pipeline:

```yaml
# .github/workflows/performance.yml
name: Performance Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.11
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-asyncio openai
      - name: Run performance tests
        run: pytest tests/performance/ -v
```

## Benchmark Results

Results are saved to `benchmark_results.json` with structure:

```json
{
  "single_request": {
    "avg_latency": 1.45,
    "avg_tokens_per_sec": 61.38
  },
  "concurrent_load": {
    "requests_per_sec": 3.12,
    "successful": 10
  },
  "openai_compatibility": {
    "score": "6/7"
  }
}
```

## Advanced Usage

### Custom Test Scenarios

Create custom test scenarios:

```python
@pytest.mark.asyncio
async def test_custom_scenario(client):
    # Your custom test here
    payload = {
        "model": "DragonLLM/LLM-Pro-Finance-Small",
        "messages": [{"role": "user", "content": "Custom prompt"}],
        "max_tokens": 200
    }
    response = await client.post(f"{BASE_URL}/v1/chat/completions", json=payload)
    assert response.status_code == 200
```

### Stress Testing

For stress testing, increase concurrent requests:

```python
await benchmark_concurrent_load(num_concurrent=50)
```

## Monitoring

Metrics to monitor during tests:

- **Server-side**:
  - GPU utilization
  - Memory usage
  - Request queue length
  - Model loading time

- **Client-side**:
  - Response times
  - Error rates
  - Token throughput
  - Network latency

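For the server-side numbers, one common approach is to poll `nvidia-smi` on the GPU host; on Hugging Face Spaces shell access may not be available, in which case the Space's built-in metrics and logs serve the same purpose.

```bash
# Poll GPU utilization and memory once per second (requires shell access to the GPU host).
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```
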
## Support

For issues or questions:
- Check service logs at Hugging Face Spaces dashboard
- Review DEPLOYMENT.md for configuration details
- Verify vLLM is properly initialized with model
tests/performance/__init__.py ADDED
@@ -0,0 +1,2 @@
# Performance test suite
tests/performance/benchmark.py ADDED
@@ -0,0 +1,344 @@
#!/usr/bin/env python3
"""
Comprehensive benchmark suite for PRIIPs LLM Service
Run with: python tests/performance/benchmark.py
"""
import asyncio
import httpx
import time
import statistics
from typing import List, Dict
import json

# Configuration
BASE_URL = "https://jeanbaptdzd-priips-llm-service.hf.space"
# BASE_URL = "http://localhost:7860"  # For local testing


class Benchmark:
    def __init__(self, base_url: str = BASE_URL):
        self.base_url = base_url
        self.client = httpx.AsyncClient(timeout=120.0)
        self.results = {}

    async def health_check(self) -> bool:
        """Check if service is available"""
        try:
            response = await self.client.get(f"{self.base_url}/health")
            return response.status_code == 200
        except Exception:
            return False

    async def benchmark_single_request(self, num_runs: int = 10) -> Dict:
        """Benchmark single request latency"""
        print(f"\n{'='*60}")
        print("BENCHMARK: Single Request Latency")
        print(f"{'='*60}")

        latencies = []
        tokens_per_sec = []

        payload = {
            "model": "DragonLLM/LLM-Pro-Finance-Small",
            "messages": [
                {"role": "user", "content": "What is artificial intelligence?"}
            ],
            "max_tokens": 100,
            "temperature": 0.7
        }

        for i in range(num_runs):
            start = time.time()
            response = await self.client.post(
                f"{self.base_url}/v1/chat/completions",
                json=payload
            )
            end = time.time()

            if response.status_code == 200:
                data = response.json()
                latency = end - start
                completion_tokens = data["usage"]["completion_tokens"]
                tps = completion_tokens / latency if latency > 0 else 0

                latencies.append(latency)
                tokens_per_sec.append(tps)

                print(f"Run {i+1}/{num_runs}: {latency:.2f}s, {tps:.2f} tokens/sec")

        results = {
            "avg_latency": statistics.mean(latencies),
            "min_latency": min(latencies),
            "max_latency": max(latencies),
            "std_latency": statistics.stdev(latencies) if len(latencies) > 1 else 0,
            "avg_tokens_per_sec": statistics.mean(tokens_per_sec),
            "max_tokens_per_sec": max(tokens_per_sec),
        }

        print(f"\nResults:")
        print(f" Average latency: {results['avg_latency']:.2f}s (±{results['std_latency']:.2f}s)")
        print(f" Min/Max latency: {results['min_latency']:.2f}s / {results['max_latency']:.2f}s")
        print(f" Average throughput: {results['avg_tokens_per_sec']:.2f} tokens/sec")
        print(f" Max throughput: {results['max_tokens_per_sec']:.2f} tokens/sec")

        return results

    async def benchmark_concurrent_load(self, num_concurrent: int = 10) -> Dict:
        """Benchmark concurrent request handling"""
        print(f"\n{'='*60}")
        print(f"BENCHMARK: Concurrent Load ({num_concurrent} requests)")
        print(f"{'='*60}")

        async def make_request(request_id: int):
            payload = {
                "model": "DragonLLM/LLM-Pro-Finance-Small",
                "messages": [
                    {"role": "user", "content": f"Request {request_id}: Explain machine learning."}
                ],
                "max_tokens": 50,
                "temperature": 0.7
            }

            start = time.time()
            response = await self.client.post(
                f"{self.base_url}/v1/chat/completions",
                json=payload
            )
            end = time.time()

            return {
                "request_id": request_id,
                "latency": end - start,
                "status": response.status_code,
                "data": response.json() if response.status_code == 200 else None
            }

        start_time = time.time()
        results = await asyncio.gather(*[make_request(i) for i in range(num_concurrent)])
        end_time = time.time()

        total_time = end_time - start_time
        successful = [r for r in results if r["status"] == 200]
        latencies = [r["latency"] for r in successful]

        benchmark_results = {
            "total_time": total_time,
            "num_requests": num_concurrent,
            "successful": len(successful),
            "failed": num_concurrent - len(successful),
            "avg_latency": statistics.mean(latencies) if latencies else 0,
            "requests_per_sec": num_concurrent / total_time,
        }

        print(f"\nResults:")
        print(f" Total time: {total_time:.2f}s")
        print(f" Successful: {len(successful)}/{num_concurrent}")
        print(f" Average latency: {benchmark_results['avg_latency']:.2f}s")
        print(f" Requests/sec: {benchmark_results['requests_per_sec']:.2f}")

        return benchmark_results

    async def benchmark_different_lengths(self) -> Dict:
        """Benchmark with different output lengths"""
        print(f"\n{'='*60}")
        print("BENCHMARK: Different Output Lengths")
        print(f"{'='*60}")

        test_cases = [
            {"name": "Short (50 tokens)", "max_tokens": 50},
            {"name": "Medium (100 tokens)", "max_tokens": 100},
            {"name": "Long (200 tokens)", "max_tokens": 200},
            {"name": "Very Long (500 tokens)", "max_tokens": 500},
        ]

        results_by_length = {}

        for test_case in test_cases:
            payload = {
                "model": "DragonLLM/LLM-Pro-Finance-Small",
                "messages": [
                    {"role": "user", "content": "Write about the history of computing."}
                ],
                "max_tokens": test_case["max_tokens"],
                "temperature": 0.7
            }

            start = time.time()
            response = await self.client.post(
                f"{self.base_url}/v1/chat/completions",
                json=payload
            )
            end = time.time()

            if response.status_code == 200:
                data = response.json()
                latency = end - start
                completion_tokens = data["usage"]["completion_tokens"]
                tps = completion_tokens / latency if latency > 0 else 0

                results_by_length[test_case["name"]] = {
                    "latency": latency,
                    "tokens": completion_tokens,
                    "tokens_per_sec": tps
                }

                print(f"\n{test_case['name']}:")
                print(f" Generated: {completion_tokens} tokens")
                print(f" Time: {latency:.2f}s")
                print(f" Throughput: {tps:.2f} tokens/sec")

        return results_by_length

    async def benchmark_openai_compatibility(self) -> Dict:
        """Test OpenAI API compatibility"""
        print(f"\n{'='*60}")
        print("BENCHMARK: OpenAI API Compatibility")
        print(f"{'='*60}")

        tests = {
            "list_models": False,
            "chat_completions": False,
            "system_message": False,
            "conversation_history": False,
            "streaming": False,
            "temperature_param": False,
            "max_tokens_param": False,
        }

        # Test 1: List models
        try:
            response = await self.client.get(f"{self.base_url}/v1/models")
            if response.status_code == 200:
                data = response.json()
                if "data" in data and len(data["data"]) > 0:
                    tests["list_models"] = True
                    print("✓ List models endpoint")
        except Exception:
            pass

        # Test 2: Chat completions
        try:
            payload = {"model": "DragonLLM/LLM-Pro-Finance-Small", "messages": [{"role": "user", "content": "Hi"}]}
            response = await self.client.post(f"{self.base_url}/v1/chat/completions", json=payload)
            if response.status_code == 200:
                data = response.json()
                if "choices" in data and "usage" in data:
                    tests["chat_completions"] = True
                    print("✓ Chat completions endpoint")
        except Exception:
            pass

        # Test 3: System message
        try:
            payload = {
                "model": "DragonLLM/LLM-Pro-Finance-Small",
                "messages": [
                    {"role": "system", "content": "Be helpful."},
                    {"role": "user", "content": "Hi"}
                ]
            }
            response = await self.client.post(f"{self.base_url}/v1/chat/completions", json=payload)
            if response.status_code == 200:
                tests["system_message"] = True
                print("✓ System message support")
        except Exception:
            pass

        # Test 4: Conversation history
        try:
            payload = {
                "model": "DragonLLM/LLM-Pro-Finance-Small",
                "messages": [
                    {"role": "user", "content": "My name is Alice"},
                    {"role": "assistant", "content": "Hello Alice"},
                    {"role": "user", "content": "What's my name?"}
                ]
            }
            response = await self.client.post(f"{self.base_url}/v1/chat/completions", json=payload)
            if response.status_code == 200:
                tests["conversation_history"] = True
                print("✓ Conversation history")
        except Exception:
            pass

        # Test 5: Temperature parameter
        try:
            payload = {
                "model": "DragonLLM/LLM-Pro-Finance-Small",
                "messages": [{"role": "user", "content": "Hi"}],
                "temperature": 0.5
            }
            response = await self.client.post(f"{self.base_url}/v1/chat/completions", json=payload)
            if response.status_code == 200:
                tests["temperature_param"] = True
                print("✓ Temperature parameter")
        except Exception:
            pass

        # Test 6: Max tokens parameter
        try:
            payload = {
                "model": "DragonLLM/LLM-Pro-Finance-Small",
                "messages": [{"role": "user", "content": "Hi"}],
                "max_tokens": 10
            }
            response = await self.client.post(f"{self.base_url}/v1/chat/completions", json=payload)
            if response.status_code == 200:
                tests["max_tokens_param"] = True
                print("✓ Max tokens parameter")
        except Exception:
            pass

        passed = sum(1 for v in tests.values() if v)
        total = len(tests)

        print(f"\nCompatibility Score: {passed}/{total} ({100*passed/total:.0f}%)")

        return {"tests": tests, "score": f"{passed}/{total}"}

    async def run_all_benchmarks(self):
        """Run all benchmarks"""
        print(f"\n{'#'*60}")
        print("PRIIPs LLM Service - Comprehensive Benchmark Suite")
        print(f"Service: {self.base_url}")
        print(f"{'#'*60}")

        # Health check
        print("\nChecking service health...")
        if not await self.health_check():
            print("✗ Service is not available!")
            return
        print("✓ Service is healthy")

        # Run benchmarks
        self.results["single_request"] = await self.benchmark_single_request(num_runs=5)
        self.results["concurrent_load"] = await self.benchmark_concurrent_load(num_concurrent=5)
        self.results["different_lengths"] = await self.benchmark_different_lengths()
        self.results["openai_compatibility"] = await self.benchmark_openai_compatibility()

        # Summary
        print(f"\n{'#'*60}")
        print("SUMMARY")
        print(f"{'#'*60}")
        print(f"\nPerformance:")
        print(f" Average latency: {self.results['single_request']['avg_latency']:.2f}s")
        print(f" Token throughput: {self.results['single_request']['avg_tokens_per_sec']:.2f} tokens/sec")
        print(f" Concurrent capacity: {self.results['concurrent_load']['requests_per_sec']:.2f} req/sec")
        print(f"\nOpenAI Compatibility: {self.results['openai_compatibility']['score']}")

        # Save results
        with open("benchmark_results.json", "w") as f:
            json.dump(self.results, f, indent=2)
        print(f"\nFull results saved to benchmark_results.json")

        await self.client.aclose()


async def main():
    benchmark = Benchmark()
    await benchmark.run_all_benchmarks()


if __name__ == "__main__":
    asyncio.run(main())
tests/performance/test_inference_speed.py ADDED
@@ -0,0 +1,242 @@
"""
Performance tests for inference speed and token throughput
Run with: pytest tests/performance/test_inference_speed.py -v -s
"""
import pytest
import httpx
import time
import asyncio
from typing import List, Dict

# Test configuration
BASE_URL = "https://jeanbaptdzd-priips-llm-service.hf.space"
# BASE_URL = "http://localhost:7860"  # For local testing

@pytest.fixture
def client():
    return httpx.AsyncClient(timeout=120.0)

@pytest.mark.asyncio
async def test_single_request_latency(client):
    """Test latency for a single chat completion request"""
    payload = {
        "model": "DragonLLM/LLM-Pro-Finance-Small",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 50,
        "temperature": 0.7
    }

    start_time = time.time()
    response = await client.post(f"{BASE_URL}/v1/chat/completions", json=payload)
    end_time = time.time()

    assert response.status_code == 200
    data = response.json()

    latency = end_time - start_time
    prompt_tokens = data["usage"]["prompt_tokens"]
    completion_tokens = data["usage"]["completion_tokens"]
    total_tokens = data["usage"]["total_tokens"]

    print(f"\n=== Single Request Performance ===")
    print(f"Latency: {latency:.2f}s")
    print(f"Prompt tokens: {prompt_tokens}")
    print(f"Completion tokens: {completion_tokens}")
    print(f"Total tokens: {total_tokens}")
    print(f"Tokens per second: {completion_tokens / latency:.2f}")
    print(f"Response: {data['choices'][0]['message']['content'][:100]}...")

    assert latency < 10.0, f"Latency too high: {latency:.2f}s"
    assert completion_tokens > 0, "No tokens generated"


@pytest.mark.asyncio
async def test_token_throughput_various_lengths(client):
    """Test token generation speed with various output lengths"""
    test_cases = [
        {"max_tokens": 50, "prompt": "Explain photosynthesis in one sentence."},
        {"max_tokens": 100, "prompt": "Explain photosynthesis in a short paragraph."},
        {"max_tokens": 200, "prompt": "Explain photosynthesis in detail."},
        {"max_tokens": 500, "prompt": "Write a detailed essay about photosynthesis."},
    ]

    print(f"\n=== Token Throughput Test ===")

    for test_case in test_cases:
        payload = {
            "model": "DragonLLM/LLM-Pro-Finance-Small",
            "messages": [{"role": "user", "content": test_case["prompt"]}],
            "max_tokens": test_case["max_tokens"],
            "temperature": 0.7
        }

        start_time = time.time()
        response = await client.post(f"{BASE_URL}/v1/chat/completions", json=payload)
        end_time = time.time()

        assert response.status_code == 200
        data = response.json()

        latency = end_time - start_time
        completion_tokens = data["usage"]["completion_tokens"]
        tokens_per_sec = completion_tokens / latency if latency > 0 else 0

        print(f"\nMax tokens: {test_case['max_tokens']}")
        print(f" Generated: {completion_tokens} tokens")
        print(f" Time: {latency:.2f}s")
        print(f" Throughput: {tokens_per_sec:.2f} tokens/sec")

        assert completion_tokens > 0


@pytest.mark.asyncio
async def test_concurrent_requests(client):
    """Test performance with concurrent requests"""
    num_requests = 5

    async def make_request(request_id: int):
        payload = {
            "model": "DragonLLM/LLM-Pro-Finance-Small",
            "messages": [
                {"role": "user", "content": f"Request {request_id}: What is 2+2?"}
            ],
            "max_tokens": 50,
            "temperature": 0.7
        }

        start_time = time.time()
        response = await client.post(f"{BASE_URL}/v1/chat/completions", json=payload)
        end_time = time.time()

        return {
            "request_id": request_id,
            "status": response.status_code,
            "latency": end_time - start_time,
            "response": response.json() if response.status_code == 200 else None
        }

    print(f"\n=== Concurrent Requests Test ({num_requests} requests) ===")

    start_time = time.time()
    results = await asyncio.gather(*[make_request(i) for i in range(num_requests)])
    end_time = time.time()

    total_time = end_time - start_time
    successful = sum(1 for r in results if r["status"] == 200)
    avg_latency = sum(r["latency"] for r in results) / len(results)

    print(f"Total time: {total_time:.2f}s")
    print(f"Successful requests: {successful}/{num_requests}")
    print(f"Average latency: {avg_latency:.2f}s")
    print(f"Requests per second: {num_requests / total_time:.2f}")

    for result in results:
        print(f" Request {result['request_id']}: {result['latency']:.2f}s - {result['status']}")

    assert successful == num_requests


@pytest.mark.asyncio
async def test_time_to_first_token(client):
    """Test time to first token (TTFT) using streaming"""
    payload = {
        "model": "DragonLLM/LLM-Pro-Finance-Small",
        "messages": [
            {"role": "user", "content": "Count from 1 to 10."}
        ],
        "max_tokens": 100,
        "temperature": 0.7,
        "stream": True
    }

    start_time = time.time()
    first_token_time = None
    token_count = 0

    async with client.stream("POST", f"{BASE_URL}/v1/chat/completions", json=payload) as response:
        async for line in response.aiter_lines():
            if line.startswith("data: ") and line.strip() != "data: [DONE]":
                if first_token_time is None:
                    first_token_time = time.time()
                token_count += 1

    end_time = time.time()

    if first_token_time:
        ttft = first_token_time - start_time
        total_time = end_time - start_time

        print(f"\n=== Time to First Token ===")
        print(f"TTFT: {ttft:.3f}s")
        print(f"Total time: {total_time:.2f}s")
        print(f"Chunks received: {token_count}")

        assert ttft < 5.0, f"TTFT too high: {ttft:.3f}s"


@pytest.mark.asyncio
async def test_prompt_processing_speed(client):
    """Test speed with different prompt lengths"""
    prompts = [
        "Hi",  # Very short
        "What is artificial intelligence?" * 5,  # Short
        "Explain quantum computing. " * 20,  # Medium
        "Write a detailed explanation of machine learning. " * 50,  # Long
    ]

    print(f"\n=== Prompt Processing Speed ===")

    for i, prompt in enumerate(prompts):
        payload = {
            "model": "DragonLLM/LLM-Pro-Finance-Small",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 50,
            "temperature": 0.7
        }

        start_time = time.time()
        response = await client.post(f"{BASE_URL}/v1/chat/completions", json=payload)
        end_time = time.time()

        if response.status_code == 200:
            data = response.json()
            latency = end_time - start_time
            prompt_tokens = data["usage"]["prompt_tokens"]

            print(f"\nPrompt {i+1} (length ~{len(prompt)} chars):")
            print(f" Prompt tokens: {prompt_tokens}")
            print(f" Latency: {latency:.2f}s")
            print(f" Tokens/sec: {prompt_tokens / latency:.2f}")


@pytest.mark.asyncio
async def test_temperature_variance(client):
    """Test response variance with different temperatures"""
    temperatures = [0.0, 0.5, 1.0, 1.5]
    prompt = "The future of artificial intelligence is"

    print(f"\n=== Temperature Variance Test ===")

    for temp in temperatures:
        payload = {
            "model": "DragonLLM/LLM-Pro-Finance-Small",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 50,
            "temperature": temp
        }

        response = await client.post(f"{BASE_URL}/v1/chat/completions", json=payload)
        assert response.status_code == 200

        data = response.json()
        content = data['choices'][0]['message']['content']

        print(f"\nTemperature: {temp}")
        print(f"Response: {content[:100]}...")


if __name__ == "__main__":
    pytest.main([__file__, "-v", "-s"])
tests/performance/test_openai_compatibility.py ADDED
@@ -0,0 +1,345 @@
"""
OpenAI API compatibility tests
Run with: pytest tests/performance/test_openai_compatibility.py -v -s
"""
import pytest
import httpx
from openai import OpenAI
import os

# Test configuration
BASE_URL = "https://jeanbaptdzd-priips-llm-service.hf.space"
# BASE_URL = "http://localhost:7860"  # For local testing

@pytest.fixture
def httpx_client():
    return httpx.AsyncClient(timeout=60.0)

@pytest.fixture
def openai_client():
    """Test using official OpenAI client library"""
    return OpenAI(
        base_url=f"{BASE_URL}/v1",
        api_key="dummy-key"  # Service may not require auth
    )


class TestEndpointCompatibility:
    """Test that all OpenAI endpoints are available and compatible"""

    @pytest.mark.asyncio
    async def test_list_models_endpoint(self, httpx_client):
        """Test GET /v1/models endpoint"""
        response = await httpx_client.get(f"{BASE_URL}/v1/models")

        assert response.status_code == 200
        data = response.json()

        print(f"\n=== Models Endpoint ===")
        print(f"Response structure: {data.keys()}")

        # Check OpenAI-compatible structure
        assert "object" in data
        assert data["object"] == "list"
        assert "data" in data
        assert isinstance(data["data"], list)
        assert len(data["data"]) > 0

        # Check model object structure
        model = data["data"][0]
        assert "id" in model
        assert "object" in model
        assert model["object"] == "model"

        print(f"Available models: {[m['id'] for m in data['data']]}")


    @pytest.mark.asyncio
    async def test_chat_completions_endpoint(self, httpx_client):
        """Test POST /v1/chat/completions endpoint"""
        payload = {
            "model": "DragonLLM/LLM-Pro-Finance-Small",
            "messages": [
                {"role": "user", "content": "Say hello"}
            ]
        }

        response = await httpx_client.post(
            f"{BASE_URL}/v1/chat/completions",
            json=payload
        )

        assert response.status_code == 200
        data = response.json()

        print(f"\n=== Chat Completions Endpoint ===")
        print(f"Response structure: {data.keys()}")

        # Check OpenAI-compatible structure
        assert "id" in data
        assert "object" in data
        assert data["object"] == "chat.completion"
        assert "created" in data
        assert "model" in data
        assert "choices" in data
        assert "usage" in data

        # Check choices structure
        assert len(data["choices"]) > 0
        choice = data["choices"][0]
        assert "index" in choice
        assert "message" in choice
        assert "role" in choice["message"]
        assert "content" in choice["message"]
        assert "finish_reason" in choice

        # Check usage structure
        usage = data["usage"]
        assert "prompt_tokens" in usage
        assert "completion_tokens" in usage
        assert "total_tokens" in usage

        print(f"Response: {choice['message']['content'][:100]}...")


class TestOpenAIClientLibrary:
    """Test compatibility with official OpenAI Python client"""

    def test_chat_completion_with_openai_client(self, openai_client):
        """Test chat completion using official OpenAI client"""
        try:
            response = openai_client.chat.completions.create(
                model="DragonLLM/LLM-Pro-Finance-Small",
                messages=[
                    {"role": "user", "content": "What is 2+2?"}
                ],
                max_tokens=50
            )

            print(f"\n=== OpenAI Client Compatibility ===")
            print(f"Response type: {type(response)}")
            print(f"Model: {response.model}")
            print(f"Content: {response.choices[0].message.content}")
            print(f"Usage: {response.usage}")

            assert response.choices[0].message.content is not None
            assert len(response.choices) > 0

        except Exception as e:
            pytest.fail(f"OpenAI client failed: {e}")


    def test_streaming_with_openai_client(self, openai_client):
        """Test streaming with official OpenAI client"""
        try:
            stream = openai_client.chat.completions.create(
                model="DragonLLM/LLM-Pro-Finance-Small",
                messages=[
                    {"role": "user", "content": "Count to 5"}
                ],
                max_tokens=50,
                stream=True
            )

            print(f"\n=== Streaming Compatibility ===")
            chunks = []
            for chunk in stream:
                if chunk.choices[0].delta.content:
                    chunks.append(chunk.choices[0].delta.content)
                    print(chunk.choices[0].delta.content, end="", flush=True)

            print()
            assert len(chunks) > 0, "No chunks received"

        except Exception as e:
            pytest.fail(f"Streaming failed: {e}")


class TestMessageFormats:
    """Test different message formats and parameters"""
|
| 160 |
+
|
| 161 |
+
@pytest.mark.asyncio
|
| 162 |
+
async def test_system_message(self, httpx_client):
|
| 163 |
+
"""Test with system message"""
|
| 164 |
+
payload = {
|
| 165 |
+
"model": "DragonLLM/LLM-Pro-Finance-Small",
|
| 166 |
+
"messages": [
|
| 167 |
+
{"role": "system", "content": "You are a helpful assistant."},
|
| 168 |
+
{"role": "user", "content": "Hello"}
|
| 169 |
+
],
|
| 170 |
+
"max_tokens": 50
|
| 171 |
+
}
|
| 172 |
+
|
| 173 |
+
response = await httpx_client.post(
|
| 174 |
+
f"{BASE_URL}/v1/chat/completions",
|
| 175 |
+
json=payload
|
| 176 |
+
)
|
| 177 |
+
|
| 178 |
+
assert response.status_code == 200
|
| 179 |
+
data = response.json()
|
| 180 |
+
print(f"\n=== System Message Test ===")
|
| 181 |
+
print(f"Response: {data['choices'][0]['message']['content'][:100]}...")
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
@pytest.mark.asyncio
|
| 185 |
+
async def test_conversation_history(self, httpx_client):
|
| 186 |
+
"""Test with conversation history"""
|
| 187 |
+
payload = {
|
| 188 |
+
"model": "DragonLLM/LLM-Pro-Finance-Small",
|
| 189 |
+
"messages": [
|
| 190 |
+
{"role": "user", "content": "My name is Alice."},
|
| 191 |
+
{"role": "assistant", "content": "Hello Alice! Nice to meet you."},
|
| 192 |
+
{"role": "user", "content": "What's my name?"}
|
| 193 |
+
],
|
| 194 |
+
"max_tokens": 50
|
| 195 |
+
}
|
| 196 |
+
|
| 197 |
+
response = await httpx_client.post(
|
| 198 |
+
f"{BASE_URL}/v1/chat/completions",
|
| 199 |
+
json=payload
|
| 200 |
+
)
|
| 201 |
+
|
| 202 |
+
assert response.status_code == 200
|
| 203 |
+
data = response.json()
|
| 204 |
+
print(f"\n=== Conversation History Test ===")
|
| 205 |
+
print(f"Response: {data['choices'][0]['message']['content']}")
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
@pytest.mark.asyncio
|
| 209 |
+
async def test_various_parameters(self, httpx_client):
|
| 210 |
+
"""Test various OpenAI parameters"""
|
| 211 |
+
parameters = [
|
| 212 |
+
{"temperature": 0.0},
|
| 213 |
+
{"temperature": 1.0},
|
| 214 |
+
{"top_p": 0.5},
|
| 215 |
+
{"max_tokens": 10},
|
| 216 |
+
{"max_tokens": 100},
|
| 217 |
+
]
|
| 218 |
+
|
| 219 |
+
print(f"\n=== Parameter Compatibility Test ===")
|
| 220 |
+
|
| 221 |
+
for params in parameters:
|
| 222 |
+
payload = {
|
| 223 |
+
"model": "DragonLLM/LLM-Pro-Finance-Small",
|
| 224 |
+
"messages": [{"role": "user", "content": "Hello"}],
|
| 225 |
+
**params
|
| 226 |
+
}
|
| 227 |
+
|
| 228 |
+
response = await httpx_client.post(
|
| 229 |
+
f"{BASE_URL}/v1/chat/completions",
|
| 230 |
+
json=payload
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
assert response.status_code == 200
|
| 234 |
+
print(f"β Parameters {params} work correctly")
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
class TestErrorHandling:
|
| 238 |
+
"""Test error handling and edge cases"""
|
| 239 |
+
|
| 240 |
+
@pytest.mark.asyncio
|
| 241 |
+
async def test_invalid_model(self, httpx_client):
|
| 242 |
+
"""Test with invalid model name"""
|
| 243 |
+
payload = {
|
| 244 |
+
"model": "invalid-model",
|
| 245 |
+
"messages": [{"role": "user", "content": "Hello"}]
|
| 246 |
+
}
|
| 247 |
+
|
| 248 |
+
response = await httpx_client.post(
|
| 249 |
+
f"{BASE_URL}/v1/chat/completions",
|
| 250 |
+
json=payload
|
| 251 |
+
)
|
| 252 |
+
|
| 253 |
+
print(f"\n=== Invalid Model Test ===")
|
| 254 |
+
print(f"Status: {response.status_code}")
|
| 255 |
+
# Should handle gracefully (either 400 or use default model)
|
| 256 |
+
|
| 257 |
+
|
| 258 |
+
@pytest.mark.asyncio
|
| 259 |
+
async def test_missing_messages(self, httpx_client):
|
| 260 |
+
"""Test with missing messages field"""
|
| 261 |
+
payload = {
|
| 262 |
+
"model": "DragonLLM/LLM-Pro-Finance-Small"
|
| 263 |
+
}
|
| 264 |
+
|
| 265 |
+
response = await httpx_client.post(
|
| 266 |
+
f"{BASE_URL}/v1/chat/completions",
|
| 267 |
+
json=payload
|
| 268 |
+
)
|
| 269 |
+
|
| 270 |
+
print(f"\n=== Missing Messages Test ===")
|
| 271 |
+
print(f"Status: {response.status_code}")
|
| 272 |
+
assert response.status_code in [400, 422], "Should return error for missing messages"
|
| 273 |
+
|
| 274 |
+
|
| 275 |
+
@pytest.mark.asyncio
|
| 276 |
+
async def test_empty_message(self, httpx_client):
|
| 277 |
+
"""Test with empty message content"""
|
| 278 |
+
payload = {
|
| 279 |
+
"model": "DragonLLM/LLM-Pro-Finance-Small",
|
| 280 |
+
"messages": [{"role": "user", "content": ""}],
|
| 281 |
+
"max_tokens": 50
|
| 282 |
+
}
|
| 283 |
+
|
| 284 |
+
response = await httpx_client.post(
|
| 285 |
+
f"{BASE_URL}/v1/chat/completions",
|
| 286 |
+
json=payload
|
| 287 |
+
)
|
| 288 |
+
|
| 289 |
+
print(f"\n=== Empty Message Test ===")
|
| 290 |
+
print(f"Status: {response.status_code}")
|
| 291 |
+
|
| 292 |
+
|
| 293 |
+
class TestResponseFormat:
|
| 294 |
+
"""Test response format compliance"""
|
| 295 |
+
|
| 296 |
+
@pytest.mark.asyncio
|
| 297 |
+
async def test_response_schema(self, httpx_client):
|
| 298 |
+
"""Validate complete response schema"""
|
| 299 |
+
payload = {
|
| 300 |
+
"model": "DragonLLM/LLM-Pro-Finance-Small",
|
| 301 |
+
"messages": [{"role": "user", "content": "Test"}],
|
| 302 |
+
"max_tokens": 50
|
| 303 |
+
}
|
| 304 |
+
|
| 305 |
+
response = await httpx_client.post(
|
| 306 |
+
f"{BASE_URL}/v1/chat/completions",
|
| 307 |
+
json=payload
|
| 308 |
+
)
|
| 309 |
+
|
| 310 |
+
assert response.status_code == 200
|
| 311 |
+
data = response.json()
|
| 312 |
+
|
| 313 |
+
print(f"\n=== Response Schema Validation ===")
|
| 314 |
+
|
| 315 |
+
# Root level fields
|
| 316 |
+
required_fields = ["id", "object", "created", "model", "choices", "usage"]
|
| 317 |
+
for field in required_fields:
|
| 318 |
+
assert field in data, f"Missing required field: {field}"
|
| 319 |
+
print(f"β {field}: {type(data[field]).__name__}")
|
| 320 |
+
|
| 321 |
+
# Choices validation
|
| 322 |
+
choice = data["choices"][0]
|
| 323 |
+
choice_fields = ["index", "message", "finish_reason"]
|
| 324 |
+
for field in choice_fields:
|
| 325 |
+
assert field in choice, f"Missing choice field: {field}"
|
| 326 |
+
|
| 327 |
+
# Message validation
|
| 328 |
+
message = choice["message"]
|
| 329 |
+
message_fields = ["role", "content"]
|
| 330 |
+
for field in message_fields:
|
| 331 |
+
assert field in message, f"Missing message field: {field}"
|
| 332 |
+
|
| 333 |
+
# Usage validation
|
| 334 |
+
usage = data["usage"]
|
| 335 |
+
usage_fields = ["prompt_tokens", "completion_tokens", "total_tokens"]
|
| 336 |
+
for field in usage_fields:
|
| 337 |
+
assert field in usage, f"Missing usage field: {field}"
|
| 338 |
+
assert isinstance(usage[field], int), f"{field} should be int"
|
| 339 |
+
|
| 340 |
+
print("β All schema validations passed")
|
| 341 |
+
|
| 342 |
+
|
| 343 |
+
if __name__ == "__main__":
|
| 344 |
+
pytest.main([__file__, "-v", "-s"])
|
| 345 |
+
|
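Note on running these tests: the `async def` test methods above depend on pytest-asyncio (assumed here to be provided by requirements-dev.txt), and the `httpx_client` fixture returns an `AsyncClient` that is never explicitly closed. A minimal sketch of an alternative fixture that closes the client after each test — an optional variant, not part of this commit — could look like:

```python
# conftest.py (sketch) — assumes pytest-asyncio is installed via requirements-dev.txt.
# The async fixture yields an httpx.AsyncClient and closes it when the test finishes,
# avoiding unclosed-connection warnings from the tests above.
import httpx
import pytest_asyncio


@pytest_asyncio.fixture
async def httpx_client():
    async with httpx.AsyncClient(timeout=60.0) as client:
        yield client
```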