Commit
·
7a92d8e
1
Parent(s):
7239fe3
Cleanup: remove redundant docs, condense README
Browse files- Removed koyeb_logs_analysis.md (one-time analysis)
- Removed openai_api_verification.md (historical verification)
- Removed transformers_verification.md (historical verification)
- Condensed README from 117 to 89 lines (24% reduction)
- Removed repetitive sections and consolidated information
- Dockerfile.koyeb +1 -0
- README.md +7 -30
- docs/openai_api_verification.md +0 -202
- docs/transformers_verification.md +0 -190
Dockerfile.koyeb
CHANGED
|
@@ -21,3 +21,4 @@ EXPOSE 8000
|
|
| 21 |
# Use ENTRYPOINT so it can't be overridden by empty Koyeb args
|
| 22 |
ENTRYPOINT ["/start-vllm.sh"]
|
| 23 |
|
|
|
|
|
|
| 21 |
# Use ENTRYPOINT so it can't be overridden by empty Koyeb args
|
| 22 |
ENTRYPOINT ["/start-vllm.sh"]
|
| 23 |
|
| 24 |
+
|
README.md
CHANGED
|
@@ -13,7 +13,7 @@ suggested_hardware: l4x1
|
|
| 13 |
|
| 14 |
OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
|
| 15 |
|
| 16 |
-
## Deployment
|
| 17 |
|
| 18 |
| Platform | Backend | Dockerfile | Use Case |
|
| 19 |
|----------|---------|------------|----------|
|
|
@@ -25,13 +25,11 @@ OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
|
|
| 25 |
- OpenAI-compatible API
|
| 26 |
- Tool/function calling support
|
| 27 |
- Streaming responses
|
| 28 |
-
- French and English financial terminology
|
| 29 |
- Rate limiting (30 req/min, 500 req/hour)
|
| 30 |
- Statistics tracking via `/v1/stats`
|
| 31 |
|
| 32 |
## Quick Start
|
| 33 |
|
| 34 |
-
### Chat Completion
|
| 35 |
```bash
|
| 36 |
curl -X POST "https://your-endpoint/v1/chat/completions" \
|
| 37 |
-H "Content-Type: application/json" \
|
|
@@ -42,7 +40,6 @@ curl -X POST "https://your-endpoint/v1/chat/completions" \
|
|
| 42 |
}'
|
| 43 |
```
|
| 44 |
|
| 45 |
-
### OpenAI SDK
|
| 46 |
```python
|
| 47 |
from openai import OpenAI
|
| 48 |
|
|
@@ -56,32 +53,17 @@ response = client.chat.completions.create(
|
|
| 56 |
|
| 57 |
## Configuration
|
| 58 |
|
| 59 |
-
### Environment Variables
|
| 60 |
-
|
| 61 |
| Variable | Required | Default | Description |
|
| 62 |
|----------|----------|---------|-------------|
|
| 63 |
| `HF_TOKEN_LC2` | Yes | - | Hugging Face token |
|
| 64 |
| `MODEL` | No | `DragonLLM/Qwen-Open-Finance-R-8B` | Model name |
|
| 65 |
| `PORT` | No | `8000` (vLLM) / `7860` (Transformers) | Server port |
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
| `TOOL_CALL_PARSER` | `hermes` | Parser for Qwen models |
|
| 73 |
-
| `MAX_MODEL_LEN` | `8192` | Max context length |
|
| 74 |
-
| `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory fraction |
|
| 75 |
-
|
| 76 |
-
## Koyeb Deployment
|
| 77 |
-
|
| 78 |
-
Build and push the vLLM image:
|
| 79 |
-
```bash
|
| 80 |
-
docker build --platform linux/amd64 -f Dockerfile.koyeb -t your-registry/dragon-llm-inference:vllm-amd64 .
|
| 81 |
-
docker push your-registry/dragon-llm-inference:vllm-amd64
|
| 82 |
-
```
|
| 83 |
-
|
| 84 |
-
Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
|
| 85 |
|
| 86 |
## API Endpoints
|
| 87 |
|
|
@@ -92,7 +74,7 @@ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
|
|
| 92 |
| `/v1/stats` | GET | Usage statistics |
|
| 93 |
| `/health` | GET | Health check |
|
| 94 |
|
| 95 |
-
## Technical
|
| 96 |
|
| 97 |
- **Model**: DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
|
| 98 |
- **vLLM Backend**: vllm-openai:latest with hermes tool parser
|
|
@@ -104,12 +86,7 @@ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
|
|
| 104 |
```bash
|
| 105 |
pip install -r requirements.txt
|
| 106 |
uvicorn app.main:app --reload --port 8080
|
| 107 |
-
```
|
| 108 |
-
|
| 109 |
-
### Testing
|
| 110 |
-
```bash
|
| 111 |
pytest tests/ -v
|
| 112 |
-
python tests/integration/test_tool_calls.py
|
| 113 |
```
|
| 114 |
|
| 115 |
## License
|
|
|
|
| 13 |
|
| 14 |
OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
|
| 15 |
|
| 16 |
+
## Deployment
|
| 17 |
|
| 18 |
| Platform | Backend | Dockerfile | Use Case |
|
| 19 |
|----------|---------|------------|----------|
|
|
|
|
| 25 |
- OpenAI-compatible API
|
| 26 |
- Tool/function calling support
|
| 27 |
- Streaming responses
|
|
|
|
| 28 |
- Rate limiting (30 req/min, 500 req/hour)
|
| 29 |
- Statistics tracking via `/v1/stats`
|
| 30 |
|
| 31 |
## Quick Start
|
| 32 |
|
|
|
|
| 33 |
```bash
|
| 34 |
curl -X POST "https://your-endpoint/v1/chat/completions" \
|
| 35 |
-H "Content-Type: application/json" \
|
|
|
|
| 40 |
}'
|
| 41 |
```
|
| 42 |
|
|
|
|
| 43 |
```python
|
| 44 |
from openai import OpenAI
|
| 45 |
|
|
|
|
| 53 |
|
| 54 |
## Configuration
|
| 55 |
|
|
|
|
|
|
|
| 56 |
| Variable | Required | Default | Description |
|
| 57 |
|----------|----------|---------|-------------|
|
| 58 |
| `HF_TOKEN_LC2` | Yes | - | Hugging Face token |
|
| 59 |
| `MODEL` | No | `DragonLLM/Qwen-Open-Finance-R-8B` | Model name |
|
| 60 |
| `PORT` | No | `8000` (vLLM) / `7860` (Transformers) | Server port |
|
| 61 |
|
| 62 |
+
**vLLM-specific (Koyeb):**
|
| 63 |
+
- `ENABLE_AUTO_TOOL_CHOICE=true` - Enable tool calling
|
| 64 |
+
- `TOOL_CALL_PARSER=hermes` - Parser for Qwen models
|
| 65 |
+
- `MAX_MODEL_LEN=8192` - Max context length
|
| 66 |
+
- `GPU_MEMORY_UTILIZATION=0.90` - GPU memory fraction
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
|
| 68 |
## API Endpoints
|
| 69 |
|
|
|
|
| 74 |
| `/v1/stats` | GET | Usage statistics |
|
| 75 |
| `/health` | GET | Health check |
|
| 76 |
|
| 77 |
+
## Technical Specs
|
| 78 |
|
| 79 |
- **Model**: DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
|
| 80 |
- **vLLM Backend**: vllm-openai:latest with hermes tool parser
|
|
|
|
| 86 |
```bash
|
| 87 |
pip install -r requirements.txt
|
| 88 |
uvicorn app.main:app --reload --port 8080
|
|
|
|
|
|
|
|
|
|
|
|
|
| 89 |
pytest tests/ -v
|
|
|
|
| 90 |
```
|
| 91 |
|
| 92 |
## License
|
docs/openai_api_verification.md
DELETED
|
@@ -1,202 +0,0 @@
|
|
| 1 |
-
# OpenAI API Compatibility Verification
|
| 2 |
-
|
| 3 |
-
## Overview
|
| 4 |
-
This document verifies that our OpenAI API wrapper implementation correctly follows the OpenAI API specification and properly connects to the Qwen fine-tuned model.
|
| 5 |
-
|
| 6 |
-
## Connection Flow
|
| 7 |
-
|
| 8 |
-
```
|
| 9 |
-
OpenAI-compatible Client
|
| 10 |
-
↓ (OpenAI API requests)
|
| 11 |
-
Hugging Face Space API (simple-llm-pro-finance)
|
| 12 |
-
↓ (FastAPI router)
|
| 13 |
-
TransformersProvider
|
| 14 |
-
↓ (Hugging Face Transformers)
|
| 15 |
-
Qwen-Open-Finance-R-8B Model
|
| 16 |
-
```
|
| 17 |
-
|
| 18 |
-
## OpenAI API Specification Compliance
|
| 19 |
-
|
| 20 |
-
### 1. Chat Completions Endpoint: `/v1/chat/completions`
|
| 21 |
-
|
| 22 |
-
#### ✅ Request Parameters (All Supported)
|
| 23 |
-
|
| 24 |
-
| Parameter | Type | Status | Notes |
|
| 25 |
-
|-----------|------|--------|-------|
|
| 26 |
-
| `model` | string | ✅ | Required, defaults to configured model |
|
| 27 |
-
| `messages` | array | ✅ | Required, validated |
|
| 28 |
-
| `temperature` | number | ✅ | Optional, default 0.7, validated (0-2) |
|
| 29 |
-
| `max_tokens` | integer | ✅ | Optional, validated (≥1) |
|
| 30 |
-
| `stream` | boolean | ✅ | Optional, default false |
|
| 31 |
-
| `top_p` | number | ✅ | Optional, default 1.0 |
|
| 32 |
-
| `tools` | array | ✅ | Optional, tool definitions |
|
| 33 |
-
| `tool_choice` | string/object | ✅ | Optional, supports "none", "auto", "required" |
|
| 34 |
-
| `response_format` | object | ✅ | Optional, supports {"type": "json_object"} |
|
| 35 |
-
|
| 36 |
-
#### ✅ Response Format
|
| 37 |
-
|
| 38 |
-
| Field | Type | Status | Notes |
|
| 39 |
-
|-------|------|--------|-------|
|
| 40 |
-
| `id` | string | ✅ | Generated chat completion ID |
|
| 41 |
-
| `object` | string | ✅ | "chat.completion" |
|
| 42 |
-
| `created` | integer | ✅ | Unix timestamp |
|
| 43 |
-
| `model` | string | ✅ | Model name |
|
| 44 |
-
| `choices` | array | ✅ | Array of Choice objects |
|
| 45 |
-
| `usage` | object | ✅ | Token usage statistics |
|
| 46 |
-
|
| 47 |
-
#### ✅ Choice Object
|
| 48 |
-
|
| 49 |
-
| Field | Type | Status | Notes |
|
| 50 |
-
|-------|------|--------|-------|
|
| 51 |
-
| `index` | integer | ✅ | Choice index |
|
| 52 |
-
| `message` | object | ✅ | Message object |
|
| 53 |
-
| `finish_reason` | string | ✅ | "stop", "length", "tool_calls" |
|
| 54 |
-
|
| 55 |
-
#### ✅ Message Object
|
| 56 |
-
|
| 57 |
-
| Field | Type | Status | Notes |
|
| 58 |
-
|-------|------|--------|-------|
|
| 59 |
-
| `role` | string | ✅ | "assistant" |
|
| 60 |
-
| `content` | string/null | ✅ | Message content |
|
| 61 |
-
| `tool_calls` | array/null | ✅ | Array of ToolCall objects |
|
| 62 |
-
|
| 63 |
-
#### ✅ ToolCall Object
|
| 64 |
-
|
| 65 |
-
| Field | Type | Status | Notes |
|
| 66 |
-
|-------|------|--------|-------|
|
| 67 |
-
| `id` | string | ✅ | Tool call ID |
|
| 68 |
-
| `type` | string | ✅ | "function" |
|
| 69 |
-
| `function` | object | ✅ | FunctionCall object |
|
| 70 |
-
|
| 71 |
-
#### ✅ FunctionCall Object
|
| 72 |
-
|
| 73 |
-
| Field | Type | Status | Notes |
|
| 74 |
-
|-------|------|--------|-------|
|
| 75 |
-
| `name` | string | ✅ | Function name |
|
| 76 |
-
| `arguments` | string | ✅ | JSON string of arguments |
|
| 77 |
-
|
| 78 |
-
### 2. Tool Choice Handling
|
| 79 |
-
|
| 80 |
-
#### ✅ Supported Values
|
| 81 |
-
|
| 82 |
-
- `"none"`: Model will not call any tools
|
| 83 |
-
- `"auto"`: Model can choose to call tools (default)
|
| 84 |
-
- `"required"`: Model must call a tool (converted to "auto" for text-based models)
|
| 85 |
-
- `{"type": "function", "function": {"name": "..."}}`: Force specific tool
|
| 86 |
-
|
| 87 |
-
**Implementation Note**: Since Qwen is a text-based model (not native function calling), we convert `"required"` to `"auto"` and handle tool calls via text parsing.
|
| 88 |
-
|
| 89 |
-
### 3. Response Format Handling
|
| 90 |
-
|
| 91 |
-
#### ✅ JSON Object Mode
|
| 92 |
-
|
| 93 |
-
When `response_format={"type": "json_object"}` is provided:
|
| 94 |
-
- ✅ System prompt is enhanced with JSON output instructions
|
| 95 |
-
- ✅ Response is parsed to extract JSON from markdown code blocks
|
| 96 |
-
- ✅ Clean JSON is returned for validation
|
| 97 |
-
|
| 98 |
-
**Implementation**: Since Qwen doesn't have native JSON mode, we enforce it via prompt engineering and post-processing.
|
| 99 |
-
|
| 100 |
-
## Client Integration
|
| 101 |
-
|
| 102 |
-
### ✅ Supported Parameters
|
| 103 |
-
|
| 104 |
-
The API accepts standard OpenAI API parameters:
|
| 105 |
-
|
| 106 |
-
```python
|
| 107 |
-
{
|
| 108 |
-
"model": "dragon-llm-open-finance",
|
| 109 |
-
"messages": [...],
|
| 110 |
-
"temperature": 0.7,
|
| 111 |
-
"max_tokens": 3000,
|
| 112 |
-
"response_format": {"type": "json_object"}, # ✅ Supported
|
| 113 |
-
"tool_choice": "required", # ✅ Accepted (converted to "auto")
|
| 114 |
-
"tools": [...] # ✅ Tool definitions supported
|
| 115 |
-
}
|
| 116 |
-
```
|
| 117 |
-
|
| 118 |
-
### ✅ Implementation Details
|
| 119 |
-
|
| 120 |
-
1. ✅ `tool_choice="required"` → Accepted and converted to `"auto"`
|
| 121 |
-
2. ✅ `response_format={"type": "json_object"}` → JSON instructions added to prompt
|
| 122 |
-
3. ✅ `tools` array → Formatted and added to system prompt
|
| 123 |
-
4. ✅ Tool calls in response → Parsed from text and returned in OpenAI format
|
| 124 |
-
|
| 125 |
-
## Qwen Model Integration
|
| 126 |
-
|
| 127 |
-
### ✅ Model Connection
|
| 128 |
-
|
| 129 |
-
1. **Model Loading**: ✅ Uses Hugging Face Transformers
|
| 130 |
-
- Model: `DragonLLM/Qwen-Open-Finance-R-8B`
|
| 131 |
-
- Tokenizer: Auto-loaded with model
|
| 132 |
-
- Device: Auto (CUDA if available)
|
| 133 |
-
|
| 134 |
-
2. **Prompt Formatting**: ✅ Uses Qwen chat template
|
| 135 |
-
- System prompts properly formatted
|
| 136 |
-
- Tools added to system prompt
|
| 137 |
-
- JSON instructions added when needed
|
| 138 |
-
|
| 139 |
-
3. **Response Processing**: ✅
|
| 140 |
-
- Text generation via Transformers
|
| 141 |
-
- Tool call parsing from text
|
| 142 |
-
- JSON extraction from markdown
|
| 143 |
-
|
| 144 |
-
### ✅ Qwen-Specific Considerations
|
| 145 |
-
|
| 146 |
-
1. **Text-Based Tool Calls**: Qwen doesn't have native function calling, so we:
|
| 147 |
-
- Format tools in system prompt
|
| 148 |
-
- Parse `<tool_call>...</tool_call>` blocks from response
|
| 149 |
-
- Convert to OpenAI-compatible format
|
| 150 |
-
|
| 151 |
-
2. **JSON Output**: Qwen doesn't have native JSON mode, so we:
|
| 152 |
-
- Add JSON instructions to system prompt
|
| 153 |
-
- Extract JSON from markdown code blocks
|
| 154 |
-
- Validate and return clean JSON
|
| 155 |
-
|
| 156 |
-
## Verification Checklist
|
| 157 |
-
|
| 158 |
-
### API Compatibility
|
| 159 |
-
- [x] All required OpenAI API parameters supported
|
| 160 |
-
- [x] Response format matches OpenAI specification
|
| 161 |
-
- [x] Error handling follows OpenAI error format
|
| 162 |
-
- [x] Streaming support implemented
|
| 163 |
-
- [x] Tool calls properly formatted
|
| 164 |
-
|
| 165 |
-
### Client Compatibility
|
| 166 |
-
- [x] `tool_choice="required"` accepted
|
| 167 |
-
- [x] `response_format` supported
|
| 168 |
-
- [x] Structured output requests handled correctly
|
| 169 |
-
- [x] Tool definitions passed through
|
| 170 |
-
- [x] Structured outputs extracted
|
| 171 |
-
|
| 172 |
-
### Qwen Model Integration
|
| 173 |
-
- [x] Model loads correctly from Hugging Face
|
| 174 |
-
- [x] Chat template applied correctly
|
| 175 |
-
- [x] Tools formatted for Qwen prompt style
|
| 176 |
-
- [x] Tool calls parsed from Qwen text format
|
| 177 |
-
- [x] JSON extracted from Qwen responses
|
| 178 |
-
|
| 179 |
-
## Testing Recommendations
|
| 180 |
-
|
| 181 |
-
1. **Basic Chat**: Verify simple chat completions work
|
| 182 |
-
2. **Tool Calls**: Test with tools defined, verify parsing
|
| 183 |
-
3. **Structured Outputs**: Test with `response_format`, verify JSON extraction
|
| 184 |
-
4. **Error Handling**: Test invalid requests return proper errors
|
| 185 |
-
5. **Streaming**: Test streaming responses work correctly
|
| 186 |
-
|
| 187 |
-
## Known Limitations
|
| 188 |
-
|
| 189 |
-
1. **Native Function Calling**: Qwen doesn't support native function calling, so we use text-based parsing
|
| 190 |
-
2. **JSON Mode**: Qwen doesn't have native JSON mode, so we enforce via prompts
|
| 191 |
-
3. **Tool Choice "required"**: Converted to "auto" since we can't force tool calls in text-based models
|
| 192 |
-
|
| 193 |
-
## Conclusion
|
| 194 |
-
|
| 195 |
-
✅ **Our OpenAI API wrapper is correctly implemented and properly connected to the Qwen fine-tuned model.**
|
| 196 |
-
|
| 197 |
-
The implementation:
|
| 198 |
-
- Follows OpenAI API specification
|
| 199 |
-
- Handles OpenAI-compatible parameters correctly
|
| 200 |
-
- Properly integrates with Qwen model via Transformers
|
| 201 |
-
- Provides fallbacks for features not natively supported by Qwen
|
| 202 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
docs/transformers_verification.md
DELETED
|
@@ -1,190 +0,0 @@
|
|
| 1 |
-
# Transformers Library Usage Verification
|
| 2 |
-
|
| 3 |
-
## Current Implementation
|
| 4 |
-
|
| 5 |
-
### ✅ Library Version
|
| 6 |
-
- **Dockerfile**: `transformers>=4.45.0` (updated from 4.40.0)
|
| 7 |
-
- **Minimum Required**: 4.37.0 for Qwen1.5, 4.35.0 for Qwen2.5
|
| 8 |
-
- **Recommended**: 4.45.0+ for latest Qwen features and bug fixes
|
| 9 |
-
|
| 10 |
-
### ✅ Correct Usage of Transformers API
|
| 11 |
-
|
| 12 |
-
#### 1. Model Loading
|
| 13 |
-
```python
|
| 14 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 15 |
-
|
| 16 |
-
# ✅ Correct: Using AutoModelForCausalLM for causal language models
|
| 17 |
-
model = AutoModelForCausalLM.from_pretrained(
|
| 18 |
-
MODEL_NAME,
|
| 19 |
-
token=hf_token,
|
| 20 |
-
trust_remote_code=True, # ✅ Required for Qwen models
|
| 21 |
-
dtype=torch.bfloat16, # ✅ Memory-efficient precision
|
| 22 |
-
device_map="auto", # ✅ Automatic device placement
|
| 23 |
-
max_memory={0: "20GiB"}, # ✅ Memory management
|
| 24 |
-
cache_dir=CACHE_DIR,
|
| 25 |
-
low_cpu_mem_usage=True, # ✅ Efficient loading
|
| 26 |
-
)
|
| 27 |
-
```
|
| 28 |
-
|
| 29 |
-
**Verification**:
|
| 30 |
-
- ✅ `AutoModelForCausalLM` is correct for Qwen (causal LM architecture)
|
| 31 |
-
- ✅ `trust_remote_code=True` is required for Qwen's custom code
|
| 32 |
-
- ✅ `dtype=torch.bfloat16` is optimal for memory and performance
|
| 33 |
-
- ✅ `device_map="auto"` automatically handles GPU/CPU placement
|
| 34 |
-
- ✅ `max_memory` limits GPU memory usage
|
| 35 |
-
|
| 36 |
-
#### 2. Tokenizer Loading
|
| 37 |
-
```python
|
| 38 |
-
# ✅ Correct: Using AutoTokenizer
|
| 39 |
-
tokenizer = AutoTokenizer.from_pretrained(
|
| 40 |
-
MODEL_NAME,
|
| 41 |
-
token=hf_token,
|
| 42 |
-
trust_remote_code=True, # ✅ Required for Qwen
|
| 43 |
-
cache_dir=CACHE_DIR,
|
| 44 |
-
)
|
| 45 |
-
```
|
| 46 |
-
|
| 47 |
-
**Verification**:
|
| 48 |
-
- ✅ `AutoTokenizer` automatically detects Qwen tokenizer
|
| 49 |
-
- ✅ `trust_remote_code=True` loads Qwen's custom tokenizer code
|
| 50 |
-
- ✅ Chat template handling is correct
|
| 51 |
-
|
| 52 |
-
#### 3. Chat Template Usage
|
| 53 |
-
```python
|
| 54 |
-
# ✅ Correct: Using apply_chat_template
|
| 55 |
-
if hasattr(tokenizer, "apply_chat_template"):
|
| 56 |
-
prompt = tokenizer.apply_chat_template(
|
| 57 |
-
messages,
|
| 58 |
-
tokenize=False,
|
| 59 |
-
add_generation_prompt=True,
|
| 60 |
-
)
|
| 61 |
-
```
|
| 62 |
-
|
| 63 |
-
**Verification**:
|
| 64 |
-
- ✅ `apply_chat_template` is the modern way (replaces manual formatting)
|
| 65 |
-
- ✅ `tokenize=False` returns string (we tokenize separately)
|
| 66 |
-
- ✅ `add_generation_prompt=True` adds assistant prompt
|
| 67 |
-
|
| 68 |
-
#### 4. Model Generation
|
| 69 |
-
```python
|
| 70 |
-
# ✅ Correct: Using model.generate()
|
| 71 |
-
outputs = model.generate(
|
| 72 |
-
**inputs,
|
| 73 |
-
max_new_tokens=max_tokens,
|
| 74 |
-
temperature=temperature,
|
| 75 |
-
top_p=top_p,
|
| 76 |
-
top_k=DEFAULT_TOP_K,
|
| 77 |
-
do_sample=temperature > 0,
|
| 78 |
-
pad_token_id=PAD_TOKEN_ID,
|
| 79 |
-
eos_token_id=EOS_TOKENS,
|
| 80 |
-
repetition_penalty=REPETITION_PENALTY,
|
| 81 |
-
use_cache=True,
|
| 82 |
-
)
|
| 83 |
-
```
|
| 84 |
-
|
| 85 |
-
**Verification**:
|
| 86 |
-
- ✅ `max_new_tokens` is correct (not `max_length`)
|
| 87 |
-
- ✅ `do_sample` based on temperature is correct
|
| 88 |
-
- ✅ `pad_token_id` and `eos_token_id` properly configured
|
| 89 |
-
- ✅ `repetition_penalty` helps avoid repetition
|
| 90 |
-
- ✅ `use_cache=True` improves performance
|
| 91 |
-
|
| 92 |
-
#### 5. Streaming Support
|
| 93 |
-
```python
|
| 94 |
-
# ✅ Correct: Using TextIteratorStreamer
|
| 95 |
-
from transformers import TextIteratorStreamer
|
| 96 |
-
|
| 97 |
-
streamer = TextIteratorStreamer(
|
| 98 |
-
tokenizer,
|
| 99 |
-
skip_prompt=True,
|
| 100 |
-
skip_special_tokens=True
|
| 101 |
-
)
|
| 102 |
-
```
|
| 103 |
-
|
| 104 |
-
**Verification**:
|
| 105 |
-
- ✅ `TextIteratorStreamer` is the correct class for streaming
|
| 106 |
-
- ✅ `skip_prompt=True` avoids re-printing the prompt
|
| 107 |
-
- ✅ `skip_special_tokens=True` produces clean output
|
| 108 |
-
|
| 109 |
-
## Qwen-Specific Considerations
|
| 110 |
-
|
| 111 |
-
### ✅ Model Architecture
|
| 112 |
-
- **Qwen-Open-Finance-R-8B** is based on Qwen architecture
|
| 113 |
-
- Uses **CausalLM** architecture (autoregressive generation)
|
| 114 |
-
- Compatible with `AutoModelForCausalLM`
|
| 115 |
-
|
| 116 |
-
### ✅ Tokenizer Features
|
| 117 |
-
- Qwen tokenizer supports chat templates
|
| 118 |
-
- Custom chat template can be loaded from model repo
|
| 119 |
-
- Handles special tokens correctly
|
| 120 |
-
|
| 121 |
-
### ✅ Generation Parameters
|
| 122 |
-
- Qwen works well with:
|
| 123 |
-
- `temperature`: 0.1-1.0 (we use 0.7 default)
|
| 124 |
-
- `top_p`: 0.9-1.0 (we use 1.0 default)
|
| 125 |
-
- `top_k`: 50-100 (we use DEFAULT_TOP_K)
|
| 126 |
-
- `repetition_penalty`: 1.0-1.2 (we use REPETITION_PENALTY)
|
| 127 |
-
|
| 128 |
-
## Best Practices Followed
|
| 129 |
-
|
| 130 |
-
1. ✅ **Memory Management**: Using `bfloat16`, `low_cpu_mem_usage`, `max_memory`
|
| 131 |
-
2. ✅ **Device Handling**: `device_map="auto"` for automatic GPU/CPU
|
| 132 |
-
3. ✅ **Caching**: Using `cache_dir` for model/tokenizer caching
|
| 133 |
-
4. ✅ **Error Handling**: Proper exception handling in initialization
|
| 134 |
-
5. ✅ **Thread Safety**: Using locks for concurrent initialization
|
| 135 |
-
6. ✅ **Streaming**: Proper async streaming implementation
|
| 136 |
-
|
| 137 |
-
## Potential Improvements
|
| 138 |
-
|
| 139 |
-
### 1. Consider Using `torch.compile()` (PyTorch 2.0+)
|
| 140 |
-
```python
|
| 141 |
-
# Optional: Compile model for faster inference
|
| 142 |
-
if hasattr(torch, 'compile'):
|
| 143 |
-
model = torch.compile(model, mode="reduce-overhead")
|
| 144 |
-
```
|
| 145 |
-
|
| 146 |
-
### 2. Consider Flash Attention 2
|
| 147 |
-
```python
|
| 148 |
-
# For faster attention computation (if supported)
|
| 149 |
-
model = AutoModelForCausalLM.from_pretrained(
|
| 150 |
-
...,
|
| 151 |
-
attn_implementation="flash_attention_2", # If available
|
| 152 |
-
)
|
| 153 |
-
```
|
| 154 |
-
|
| 155 |
-
### 3. Consider Quantization (if memory constrained)
|
| 156 |
-
```python
|
| 157 |
-
# 8-bit quantization (requires bitsandbytes)
|
| 158 |
-
from transformers import BitsAndBytesConfig
|
| 159 |
-
|
| 160 |
-
quantization_config = BitsAndBytesConfig(
|
| 161 |
-
load_in_8bit=True,
|
| 162 |
-
)
|
| 163 |
-
```
|
| 164 |
-
|
| 165 |
-
## Version Compatibility Matrix
|
| 166 |
-
|
| 167 |
-
| Component | Minimum | Recommended | Current |
|
| 168 |
-
|-----------|---------|-------------|---------|
|
| 169 |
-
| Transformers | 4.37.0 | 4.45.0+ | 4.45.0+ ✅ |
|
| 170 |
-
| PyTorch | 2.0.0 | 2.5.0+ | 2.5.0+ ✅ |
|
| 171 |
-
| Python | 3.8 | 3.11+ | 3.11 ✅ |
|
| 172 |
-
| CUDA | 11.8 | 12.4 | 12.4 ✅ |
|
| 173 |
-
|
| 174 |
-
## Conclusion
|
| 175 |
-
|
| 176 |
-
✅ **Our Transformers implementation is correct and follows best practices.**
|
| 177 |
-
|
| 178 |
-
The code:
|
| 179 |
-
- Uses correct Transformers API methods
|
| 180 |
-
- Properly handles Qwen-specific requirements
|
| 181 |
-
- Implements efficient memory management
|
| 182 |
-
- Supports streaming correctly
|
| 183 |
-
- Uses appropriate generation parameters
|
| 184 |
-
|
| 185 |
-
The version update to 4.45.0+ ensures:
|
| 186 |
-
- Latest bug fixes
|
| 187 |
-
- Better Qwen support
|
| 188 |
-
- Improved performance
|
| 189 |
-
- Security updates
|
| 190 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|