jeanbaptdzd committed
Commit 7a92d8e
1 Parent(s): 7239fe3

Cleanup: remove redundant docs, condense README

- Removed koyeb_logs_analysis.md (one-time analysis)
- Removed openai_api_verification.md (historical verification)
- Removed transformers_verification.md (historical verification)
- Condensed README from 117 to 89 lines (24% reduction)
- Removed repetitive sections and consolidated information

Dockerfile.koyeb CHANGED
@@ -21,3 +21,4 @@ EXPOSE 8000
 # Use ENTRYPOINT so it can't be overridden by empty Koyeb args
 ENTRYPOINT ["/start-vllm.sh"]
 
+
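
The ENTRYPOINT comment above encodes a real Docker behavior: runtime args (even an empty args array supplied by the platform) replace a `CMD` entirely, whereas with an exec-form `ENTRYPOINT` they are appended as arguments instead. A minimal contrast, as a sketch:

```dockerfile
# CMD is replaced by whatever args the platform passes at runtime -- an empty
# args array can leave the container with no start command:
# CMD ["/start-vllm.sh"]

# ENTRYPOINT always runs; runtime args are appended rather than substituted:
ENTRYPOINT ["/start-vllm.sh"]
```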
README.md CHANGED
@@ -13,7 +13,7 @@ suggested_hardware: l4x1
 
 OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
 
-## Deployment Options
+## Deployment
 
 | Platform | Backend | Dockerfile | Use Case |
 |----------|---------|------------|----------|
@@ -25,13 +25,11 @@ OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
 - OpenAI-compatible API
 - Tool/function calling support
 - Streaming responses
-- French and English financial terminology
 - Rate limiting (30 req/min, 500 req/hour)
 - Statistics tracking via `/v1/stats`
 
 ## Quick Start
 
-### Chat Completion
 ```bash
 curl -X POST "https://your-endpoint/v1/chat/completions" \
   -H "Content-Type: application/json" \
@@ -42,7 +40,6 @@ curl -X POST "https://your-endpoint/v1/chat/completions" \
   }'
 ```
 
-### OpenAI SDK
 ```python
 from openai import OpenAI
 
@@ -56,32 +53,17 @@ response = client.chat.completions.create(
 
 ## Configuration
 
-### Environment Variables
-
 | Variable | Required | Default | Description |
 |----------|----------|---------|-------------|
 | `HF_TOKEN_LC2` | Yes | - | Hugging Face token |
 | `MODEL` | No | `DragonLLM/Qwen-Open-Finance-R-8B` | Model name |
 | `PORT` | No | `8000` (vLLM) / `7860` (Transformers) | Server port |
 
-### vLLM-specific (Koyeb)
-
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `ENABLE_AUTO_TOOL_CHOICE` | `true` | Enable tool calling |
-| `TOOL_CALL_PARSER` | `hermes` | Parser for Qwen models |
-| `MAX_MODEL_LEN` | `8192` | Max context length |
-| `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory fraction |
-
-## Koyeb Deployment
-
-Build and push the vLLM image:
-```bash
-docker build --platform linux/amd64 -f Dockerfile.koyeb -t your-registry/dragon-llm-inference:vllm-amd64 .
-docker push your-registry/dragon-llm-inference:vllm-amd64
-```
-
-Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
+**vLLM-specific (Koyeb):**
+- `ENABLE_AUTO_TOOL_CHOICE=true` - Enable tool calling
+- `TOOL_CALL_PARSER=hermes` - Parser for Qwen models
+- `MAX_MODEL_LEN=8192` - Max context length
+- `GPU_MEMORY_UTILIZATION=0.90` - GPU memory fraction
 
 ## API Endpoints
 
@@ -92,7 +74,7 @@ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
 | `/v1/stats` | GET | Usage statistics |
 | `/health` | GET | Health check |
 
-## Technical Specifications
+## Technical Specs
 
 - **Model**: DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
 - **vLLM Backend**: vllm-openai:latest with hermes tool parser
@@ -104,12 +86,7 @@ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
 ```bash
 pip install -r requirements.txt
 uvicorn app.main:app --reload --port 8080
-```
-
-### Testing
-```bash
 pytest tests/ -v
-python tests/integration/test_tool_calls.py
 ```
 
 ## License
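
For reference on the condensed Configuration section: the vLLM-specific variables that replace the removed Koyeb table are presumably consumed at startup by the container's `/start-vllm.sh` entrypoint. A minimal local-run sketch, reusing the image tag from the removed "Koyeb Deployment" section and a placeholder token value:

```bash
# Sketch only: run the vLLM image locally with the documented variables.
# The image tag comes from the removed Koyeb section; the HF token is a
# placeholder.
docker run --gpus all -p 8000:8000 \
  -e HF_TOKEN_LC2="hf_xxx" \
  -e ENABLE_AUTO_TOOL_CHOICE=true \
  -e TOOL_CALL_PARSER=hermes \
  -e MAX_MODEL_LEN=8192 \
  -e GPU_MEMORY_UTILIZATION=0.90 \
  your-registry/dragon-llm-inference:vllm-amd64
```
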
docs/openai_api_verification.md DELETED
@@ -1,202 +0,0 @@
-# OpenAI API Compatibility Verification
-
-## Overview
-This document verifies that our OpenAI API wrapper implementation correctly follows the OpenAI API specification and properly connects to the Qwen fine-tuned model.
-
-## Connection Flow
-
-```
-OpenAI-compatible Client
-  ↓ (OpenAI API requests)
-Hugging Face Space API (simple-llm-pro-finance)
-  ↓ (FastAPI router)
-TransformersProvider
-  ↓ (Hugging Face Transformers)
-Qwen-Open-Finance-R-8B Model
-```
-
-## OpenAI API Specification Compliance
-
-### 1. Chat Completions Endpoint: `/v1/chat/completions`
-
-#### ✅ Request Parameters (All Supported)
-
-| Parameter | Type | Status | Notes |
-|-----------|------|--------|-------|
-| `model` | string | ✅ | Required, defaults to configured model |
-| `messages` | array | ✅ | Required, validated |
-| `temperature` | number | ✅ | Optional, default 0.7, validated (0-2) |
-| `max_tokens` | integer | ✅ | Optional, validated (≥1) |
-| `stream` | boolean | ✅ | Optional, default false |
-| `top_p` | number | ✅ | Optional, default 1.0 |
-| `tools` | array | ✅ | Optional, tool definitions |
-| `tool_choice` | string/object | ✅ | Optional, supports "none", "auto", "required" |
-| `response_format` | object | ✅ | Optional, supports {"type": "json_object"} |
-
-#### ✅ Response Format
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `id` | string | ✅ | Generated chat completion ID |
-| `object` | string | ✅ | "chat.completion" |
-| `created` | integer | ✅ | Unix timestamp |
-| `model` | string | ✅ | Model name |
-| `choices` | array | ✅ | Array of Choice objects |
-| `usage` | object | ✅ | Token usage statistics |
-
-#### ✅ Choice Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `index` | integer | ✅ | Choice index |
-| `message` | object | ✅ | Message object |
-| `finish_reason` | string | ✅ | "stop", "length", "tool_calls" |
-
-#### ✅ Message Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `role` | string | ✅ | "assistant" |
-| `content` | string/null | ✅ | Message content |
-| `tool_calls` | array/null | ✅ | Array of ToolCall objects |
-
-#### ✅ ToolCall Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `id` | string | ✅ | Tool call ID |
-| `type` | string | ✅ | "function" |
-| `function` | object | ✅ | FunctionCall object |
-
-#### ✅ FunctionCall Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `name` | string | ✅ | Function name |
-| `arguments` | string | ✅ | JSON string of arguments |
-
-### 2. Tool Choice Handling
-
-#### ✅ Supported Values
-
-- `"none"`: Model will not call any tools
-- `"auto"`: Model can choose to call tools (default)
-- `"required"`: Model must call a tool (converted to "auto" for text-based models)
-- `{"type": "function", "function": {"name": "..."}}`: Force specific tool
-
-**Implementation Note**: Since Qwen is a text-based model (not native function calling), we convert `"required"` to `"auto"` and handle tool calls via text parsing.
-
-### 3. Response Format Handling
-
-#### ✅ JSON Object Mode
-
-When `response_format={"type": "json_object"}` is provided:
-- ✅ System prompt is enhanced with JSON output instructions
-- ✅ Response is parsed to extract JSON from markdown code blocks
-- ✅ Clean JSON is returned for validation
-
-**Implementation**: Since Qwen doesn't have native JSON mode, we enforce it via prompt engineering and post-processing.
-
-## Client Integration
-
-### ✅ Supported Parameters
-
-The API accepts standard OpenAI API parameters:
-
-```python
-{
-    "model": "dragon-llm-open-finance",
-    "messages": [...],
-    "temperature": 0.7,
-    "max_tokens": 3000,
-    "response_format": {"type": "json_object"},  # ✅ Supported
-    "tool_choice": "required",  # ✅ Accepted (converted to "auto")
-    "tools": [...]  # ✅ Tool definitions supported
-}
-```
-
-### ✅ Implementation Details
-
-1. ✅ `tool_choice="required"` → Accepted and converted to `"auto"`
-2. ✅ `response_format={"type": "json_object"}` → JSON instructions added to prompt
-3. ✅ `tools` array → Formatted and added to system prompt
-4. ✅ Tool calls in response → Parsed from text and returned in OpenAI format
-
-## Qwen Model Integration
-
-### ✅ Model Connection
-
-1. **Model Loading**: ✅ Uses Hugging Face Transformers
-   - Model: `DragonLLM/Qwen-Open-Finance-R-8B`
-   - Tokenizer: Auto-loaded with model
-   - Device: Auto (CUDA if available)
-
-2. **Prompt Formatting**: ✅ Uses Qwen chat template
-   - System prompts properly formatted
-   - Tools added to system prompt
-   - JSON instructions added when needed
-
-3. **Response Processing**: ✅
-   - Text generation via Transformers
-   - Tool call parsing from text
-   - JSON extraction from markdown
-
-### ✅ Qwen-Specific Considerations
-
-1. **Text-Based Tool Calls**: Qwen doesn't have native function calling, so we:
-   - Format tools in system prompt
-   - Parse `<tool_call>...</tool_call>` blocks from response
-   - Convert to OpenAI-compatible format
-
-2. **JSON Output**: Qwen doesn't have native JSON mode, so we:
-   - Add JSON instructions to system prompt
-   - Extract JSON from markdown code blocks
-   - Validate and return clean JSON
-
-## Verification Checklist
-
-### API Compatibility
-- [x] All required OpenAI API parameters supported
-- [x] Response format matches OpenAI specification
-- [x] Error handling follows OpenAI error format
-- [x] Streaming support implemented
-- [x] Tool calls properly formatted
-
-### Client Compatibility
-- [x] `tool_choice="required"` accepted
-- [x] `response_format` supported
-- [x] Structured output requests handled correctly
-- [x] Tool definitions passed through
-- [x] Structured outputs extracted
-
-### Qwen Model Integration
-- [x] Model loads correctly from Hugging Face
-- [x] Chat template applied correctly
-- [x] Tools formatted for Qwen prompt style
-- [x] Tool calls parsed from Qwen text format
-- [x] JSON extracted from Qwen responses
-
-## Testing Recommendations
-
-1. **Basic Chat**: Verify simple chat completions work
-2. **Tool Calls**: Test with tools defined, verify parsing
-3. **Structured Outputs**: Test with `response_format`, verify JSON extraction
-4. **Error Handling**: Test invalid requests return proper errors
-5. **Streaming**: Test streaming responses work correctly
-
-## Known Limitations
-
-1. **Native Function Calling**: Qwen doesn't support native function calling, so we use text-based parsing
-2. **JSON Mode**: Qwen doesn't have native JSON mode, so we enforce via prompts
-3. **Tool Choice "required"**: Converted to "auto" since we can't force tool calls in text-based models
-
-## Conclusion
-
-✅ **Our OpenAI API wrapper is correctly implemented and properly connected to the Qwen fine-tuned model.**
-
-The implementation:
-- Follows OpenAI API specification
-- Handles OpenAI-compatible parameters correctly
-- Properly integrates with Qwen model via Transformers
-- Provides fallbacks for features not natively supported by Qwen
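
The removed verification doc describes, but never shows, the two text-parsing fallbacks it relies on (tool-call extraction and JSON extraction from markdown). A minimal sketch of both, assuming each `<tool_call>` block wraps a JSON object with `name` and `arguments` keys; the regexes and helper names are illustrative, not the repo's actual code:

```python
import json
import re
import uuid

# Assumed Qwen output shape: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
# JSON replies are assumed to arrive inside a markdown fence, optionally tagged "json"
JSON_FENCE_RE = re.compile(r"```(?:json)?\s*(\{.*?\})\s*```", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Convert Qwen's text-based tool calls to OpenAI-format ToolCall dicts."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the request
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:24]}",
            "type": "function",
            "function": {
                "name": payload.get("name", ""),
                # The OpenAI spec wants arguments as a JSON *string*, not an object
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls


def extract_json(text: str) -> str:
    """Pull a JSON object out of a markdown fence, falling back to the raw text."""
    match = JSON_FENCE_RE.search(text)
    return match.group(1) if match else text.strip()
```
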
docs/transformers_verification.md DELETED
@@ -1,190 +0,0 @@
-# Transformers Library Usage Verification
-
-## Current Implementation
-
-### ✅ Library Version
-- **Dockerfile**: `transformers>=4.45.0` (updated from 4.40.0)
-- **Minimum Required**: 4.37.0 for Qwen1.5, 4.35.0 for Qwen2.5
-- **Recommended**: 4.45.0+ for latest Qwen features and bug fixes
-
-### ✅ Correct Usage of Transformers API
-
-#### 1. Model Loading
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# ✅ Correct: Using AutoModelForCausalLM for causal language models
-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_NAME,
-    token=hf_token,
-    trust_remote_code=True,   # ✅ Required for Qwen models
-    dtype=torch.bfloat16,     # ✅ Memory-efficient precision
-    device_map="auto",        # ✅ Automatic device placement
-    max_memory={0: "20GiB"},  # ✅ Memory management
-    cache_dir=CACHE_DIR,
-    low_cpu_mem_usage=True,   # ✅ Efficient loading
-)
-```
-
-**Verification**:
-- ✅ `AutoModelForCausalLM` is correct for Qwen (causal LM architecture)
-- ✅ `trust_remote_code=True` is required for Qwen's custom code
-- ✅ `dtype=torch.bfloat16` is optimal for memory and performance
-- ✅ `device_map="auto"` automatically handles GPU/CPU placement
-- ✅ `max_memory` limits GPU memory usage
-
-#### 2. Tokenizer Loading
-```python
-# ✅ Correct: Using AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained(
-    MODEL_NAME,
-    token=hf_token,
-    trust_remote_code=True,  # ✅ Required for Qwen
-    cache_dir=CACHE_DIR,
-)
-```
-
-**Verification**:
-- ✅ `AutoTokenizer` automatically detects Qwen tokenizer
-- ✅ `trust_remote_code=True` loads Qwen's custom tokenizer code
-- ✅ Chat template handling is correct
-
-#### 3. Chat Template Usage
-```python
-# ✅ Correct: Using apply_chat_template
-if hasattr(tokenizer, "apply_chat_template"):
-    prompt = tokenizer.apply_chat_template(
-        messages,
-        tokenize=False,
-        add_generation_prompt=True,
-    )
-```
-
-**Verification**:
-- ✅ `apply_chat_template` is the modern way (replaces manual formatting)
-- ✅ `tokenize=False` returns string (we tokenize separately)
-- ✅ `add_generation_prompt=True` adds assistant prompt
-
-#### 4. Model Generation
-```python
-# ✅ Correct: Using model.generate()
-outputs = model.generate(
-    **inputs,
-    max_new_tokens=max_tokens,
-    temperature=temperature,
-    top_p=top_p,
-    top_k=DEFAULT_TOP_K,
-    do_sample=temperature > 0,
-    pad_token_id=PAD_TOKEN_ID,
-    eos_token_id=EOS_TOKENS,
-    repetition_penalty=REPETITION_PENALTY,
-    use_cache=True,
-)
-```
-
-**Verification**:
-- ✅ `max_new_tokens` is correct (not `max_length`)
-- ✅ `do_sample` based on temperature is correct
-- ✅ `pad_token_id` and `eos_token_id` properly configured
-- ✅ `repetition_penalty` helps avoid repetition
-- ✅ `use_cache=True` improves performance
-
-#### 5. Streaming Support
-```python
-# ✅ Correct: Using TextIteratorStreamer
-from transformers import TextIteratorStreamer
-
-streamer = TextIteratorStreamer(
-    tokenizer,
-    skip_prompt=True,
-    skip_special_tokens=True,
-)
-```
-
-**Verification**:
-- ✅ `TextIteratorStreamer` is the correct class for streaming
-- ✅ `skip_prompt=True` avoids re-printing the prompt
-- ✅ `skip_special_tokens=True` produces clean output
-
-## Qwen-Specific Considerations
-
-### ✅ Model Architecture
-- **Qwen-Open-Finance-R-8B** is based on the Qwen architecture
-- Uses **CausalLM** architecture (autoregressive generation)
-- Compatible with `AutoModelForCausalLM`
-
-### ✅ Tokenizer Features
-- Qwen tokenizer supports chat templates
-- Custom chat template can be loaded from the model repo
-- Handles special tokens correctly
-
-### ✅ Generation Parameters
-- Qwen works well with:
-  - `temperature`: 0.1-1.0 (we use 0.7 default)
-  - `top_p`: 0.9-1.0 (we use 1.0 default)
-  - `top_k`: 50-100 (we use DEFAULT_TOP_K)
-  - `repetition_penalty`: 1.0-1.2 (we use REPETITION_PENALTY)
-
-## Best Practices Followed
-
-1. ✅ **Memory Management**: Using `bfloat16`, `low_cpu_mem_usage`, `max_memory`
-2. ✅ **Device Handling**: `device_map="auto"` for automatic GPU/CPU
-3. ✅ **Caching**: Using `cache_dir` for model/tokenizer caching
-4. ✅ **Error Handling**: Proper exception handling in initialization
-5. ✅ **Thread Safety**: Using locks for concurrent initialization
-6. ✅ **Streaming**: Proper async streaming implementation
-
-## Potential Improvements
-
-### 1. Consider Using `torch.compile()` (PyTorch 2.0+)
-```python
-# Optional: Compile model for faster inference
-if hasattr(torch, 'compile'):
-    model = torch.compile(model, mode="reduce-overhead")
-```
-
-### 2. Consider Flash Attention 2
-```python
-# For faster attention computation (if supported)
-model = AutoModelForCausalLM.from_pretrained(
-    ...,
-    attn_implementation="flash_attention_2",  # If available
-)
-```
-
-### 3. Consider Quantization (if memory constrained)
-```python
-# 8-bit quantization (requires bitsandbytes)
-from transformers import BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(
-    load_in_8bit=True,
-)
-```
-
-## Version Compatibility Matrix
-
-| Component | Minimum | Recommended | Current |
-|-----------|---------|-------------|---------|
-| Transformers | 4.37.0 | 4.45.0+ | 4.45.0+ ✅ |
-| PyTorch | 2.0.0 | 2.5.0+ | 2.5.0+ ✅ |
-| Python | 3.8 | 3.11+ | 3.11 ✅ |
-| CUDA | 11.8 | 12.4 | 12.4 ✅ |
-
-## Conclusion
-
-✅ **Our Transformers implementation is correct and follows best practices.**
-
-The code:
-- Uses correct Transformers API methods
-- Properly handles Qwen-specific requirements
-- Implements efficient memory management
-- Supports streaming correctly
-- Uses appropriate generation parameters
-
-The version update to 4.45.0+ ensures:
-- Latest bug fixes
-- Better Qwen support
-- Improved performance
-- Security updates
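
One wiring detail the removed doc's streaming snippet leaves implicit: `model.generate()` blocks until generation finishes, so the streamer must be drained while `generate()` runs on a worker thread. A minimal sketch of that standard pattern (the function name and default are illustrative, not the repo's actual code):

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_completion(model, tokenizer, prompt: str, max_new_tokens: int = 512):
    """Yield decoded text chunks as the model produces them."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    # generate() blocks, so run it on a worker thread and consume the
    # streamer iterator from the caller's side.
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    thread.start()
    for chunk in streamer:
        yield chunk
    thread.join()
```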