jeanbaptdzd committed
Commit 7a92d8e
1 Parent(s): 7239fe3

Cleanup: remove redundant docs, condense README

- Removed koyeb_logs_analysis.md (one-time analysis)
- Removed openai_api_verification.md (historical verification)
- Removed transformers_verification.md (historical verification)
- Condensed README from 117 to 89 lines (24% reduction)
- Removed repetitive sections and consolidated information

Dockerfile.koyeb CHANGED
@@ -21,3 +21,4 @@ EXPOSE 8000
 # Use ENTRYPOINT so it can't be overridden by empty Koyeb args
 ENTRYPOINT ["/start-vllm.sh"]
 
+
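
The ENTRYPOINT comment above encodes a real Docker behavior: runtime args (even an empty args array supplied by the platform) replace a `CMD` entirely, whereas with an exec-form `ENTRYPOINT` they are appended as arguments instead. A minimal contrast, as a sketch:

```dockerfile
# CMD is replaced by whatever args the platform passes at runtime -- an empty
# args array can leave the container with no start command:
# CMD ["/start-vllm.sh"]

# ENTRYPOINT always runs; runtime args are appended rather than substituted:
ENTRYPOINT ["/start-vllm.sh"]
```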
README.md CHANGED
@@ -13,7 +13,7 @@ suggested_hardware: l4x1
 
 OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
 
-## Deployment Options
+## Deployment
 
 | Platform | Backend | Dockerfile | Use Case |
 |----------|---------|------------|----------|
@@ -25,13 +25,11 @@ OpenAI-compatible API powered by DragonLLM/Qwen-Open-Finance-R-8B.
 - OpenAI-compatible API
 - Tool/function calling support
 - Streaming responses
-- French and English financial terminology
 - Rate limiting (30 req/min, 500 req/hour)
 - Statistics tracking via `/v1/stats`
 
 ## Quick Start
 
-### Chat Completion
 ```bash
 curl -X POST "https://your-endpoint/v1/chat/completions" \
   -H "Content-Type: application/json" \
@@ -42,7 +40,6 @@ curl -X POST "https://your-endpoint/v1/chat/completions" \
   }'
 ```
 
-### OpenAI SDK
 ```python
 from openai import OpenAI
 
@@ -56,32 +53,17 @@ response = client.chat.completions.create(
 
 ## Configuration
 
-### Environment Variables
-
 | Variable | Required | Default | Description |
 |----------|----------|---------|-------------|
 | `HF_TOKEN_LC2` | Yes | - | Hugging Face token |
 | `MODEL` | No | `DragonLLM/Qwen-Open-Finance-R-8B` | Model name |
 | `PORT` | No | `8000` (vLLM) / `7860` (Transformers) | Server port |
 
-### vLLM-specific (Koyeb)
-
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `ENABLE_AUTO_TOOL_CHOICE` | `true` | Enable tool calling |
-| `TOOL_CALL_PARSER` | `hermes` | Parser for Qwen models |
-| `MAX_MODEL_LEN` | `8192` | Max context length |
-| `GPU_MEMORY_UTILIZATION` | `0.90` | GPU memory fraction |
-
-## Koyeb Deployment
-
-Build and push the vLLM image:
-```bash
-docker build --platform linux/amd64 -f Dockerfile.koyeb -t your-registry/dragon-llm-inference:vllm-amd64 .
-docker push your-registry/dragon-llm-inference:vllm-amd64
-```
-
-Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
+**vLLM-specific (Koyeb):**
+- `ENABLE_AUTO_TOOL_CHOICE=true` - Enable tool calling
+- `TOOL_CALL_PARSER=hermes` - Parser for Qwen models
+- `MAX_MODEL_LEN=8192` - Max context length
+- `GPU_MEMORY_UTILIZATION=0.90` - GPU memory fraction
 
 ## API Endpoints
 
@@ -92,7 +74,7 @@ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
 | `/v1/stats` | GET | Usage statistics |
 | `/health` | GET | Health check |
 
-## Technical Specifications
+## Technical Specs
 
 - **Model**: DragonLLM/Qwen-Open-Finance-R-8B (8B parameters)
 - **vLLM Backend**: vllm-openai:latest with hermes tool parser
@@ -104,12 +86,7 @@ Recommended instance: `gpu-nvidia-l40s` (48GB VRAM)
 ```bash
 pip install -r requirements.txt
 uvicorn app.main:app --reload --port 8080
-```
-
-### Testing
-```bash
 pytest tests/ -v
-python tests/integration/test_tool_calls.py
 ```
 
 ## License
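
For reference on the condensed Configuration section: the vLLM-specific variables that replace the removed Koyeb table are presumably consumed at startup by the container's `/start-vllm.sh` entrypoint. A minimal local-run sketch, reusing the image tag from the removed "Koyeb Deployment" section and a placeholder token value:

```bash
# Sketch only: run the vLLM image locally with the documented variables.
# The image tag comes from the removed Koyeb section; the HF token is a
# placeholder.
docker run --gpus all -p 8000:8000 \
  -e HF_TOKEN_LC2="hf_xxx" \
  -e ENABLE_AUTO_TOOL_CHOICE=true \
  -e TOOL_CALL_PARSER=hermes \
  -e MAX_MODEL_LEN=8192 \
  -e GPU_MEMORY_UTILIZATION=0.90 \
  your-registry/dragon-llm-inference:vllm-amd64
```
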
docs/openai_api_verification.md DELETED
@@ -1,202 +0,0 @@
-# OpenAI API Compatibility Verification
-
-## Overview
-This document verifies that our OpenAI API wrapper implementation correctly follows the OpenAI API specification and properly connects to the Qwen fine-tuned model.
-
-## Connection Flow
-
-```
-OpenAI-compatible Client
-  ↓ (OpenAI API requests)
-Hugging Face Space API (simple-llm-pro-finance)
-  ↓ (FastAPI router)
-TransformersProvider
-  ↓ (Hugging Face Transformers)
-Qwen-Open-Finance-R-8B Model
-```
-
-## OpenAI API Specification Compliance
-
-### 1. Chat Completions Endpoint: `/v1/chat/completions`
-
-#### ✅ Request Parameters (All Supported)
-
-| Parameter | Type | Status | Notes |
-|-----------|------|--------|-------|
-| `model` | string | ✅ | Required, defaults to configured model |
-| `messages` | array | ✅ | Required, validated |
-| `temperature` | number | ✅ | Optional, default 0.7, validated (0-2) |
-| `max_tokens` | integer | ✅ | Optional, validated (≥1) |
-| `stream` | boolean | ✅ | Optional, default false |
-| `top_p` | number | ✅ | Optional, default 1.0 |
-| `tools` | array | ✅ | Optional, tool definitions |
-| `tool_choice` | string/object | ✅ | Optional, supports "none", "auto", "required" |
-| `response_format` | object | ✅ | Optional, supports {"type": "json_object"} |
-
-#### ✅ Response Format
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `id` | string | ✅ | Generated chat completion ID |
-| `object` | string | ✅ | "chat.completion" |
-| `created` | integer | ✅ | Unix timestamp |
-| `model` | string | ✅ | Model name |
-| `choices` | array | ✅ | Array of Choice objects |
-| `usage` | object | ✅ | Token usage statistics |
-
-#### ✅ Choice Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `index` | integer | ✅ | Choice index |
-| `message` | object | ✅ | Message object |
-| `finish_reason` | string | ✅ | "stop", "length", "tool_calls" |
-
-#### ✅ Message Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `role` | string | ✅ | "assistant" |
-| `content` | string/null | ✅ | Message content |
-| `tool_calls` | array/null | ✅ | Array of ToolCall objects |
-
-#### ✅ ToolCall Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `id` | string | ✅ | Tool call ID |
-| `type` | string | ✅ | "function" |
-| `function` | object | ✅ | FunctionCall object |
-
-#### ✅ FunctionCall Object
-
-| Field | Type | Status | Notes |
-|-------|------|--------|-------|
-| `name` | string | ✅ | Function name |
-| `arguments` | string | ✅ | JSON string of arguments |
-
-### 2. Tool Choice Handling
-
-#### ✅ Supported Values
-
-- `"none"`: Model will not call any tools
-- `"auto"`: Model can choose to call tools (default)
-- `"required"`: Model must call a tool (converted to "auto" for text-based models)
-- `{"type": "function", "function": {"name": "..."}}`: Force specific tool
-
-**Implementation Note**: Since Qwen is a text-based model (not native function calling), we convert `"required"` to `"auto"` and handle tool calls via text parsing.
-
-### 3. Response Format Handling
-
-#### ✅ JSON Object Mode
-
-When `response_format={"type": "json_object"}` is provided:
-- ✅ System prompt is enhanced with JSON output instructions
-- ✅ Response is parsed to extract JSON from markdown code blocks
-- ✅ Clean JSON is returned for validation
-
-**Implementation**: Since Qwen doesn't have native JSON mode, we enforce it via prompt engineering and post-processing.
-
-## Client Integration
-
-### ✅ Supported Parameters
-
-The API accepts standard OpenAI API parameters:
-
-```python
-{
-    "model": "dragon-llm-open-finance",
-    "messages": [...],
-    "temperature": 0.7,
-    "max_tokens": 3000,
-    "response_format": {"type": "json_object"},  # ✅ Supported
-    "tool_choice": "required",  # ✅ Accepted (converted to "auto")
-    "tools": [...]  # ✅ Tool definitions supported
-}
-```
-
-### ✅ Implementation Details
-
-1. ✅ `tool_choice="required"` → Accepted and converted to `"auto"`
-2. ✅ `response_format={"type": "json_object"}` → JSON instructions added to prompt
-3. ✅ `tools` array → Formatted and added to system prompt
-4. ✅ Tool calls in response → Parsed from text and returned in OpenAI format
-
-## Qwen Model Integration
-
-### ✅ Model Connection
-
-1. **Model Loading**: ✅ Uses Hugging Face Transformers
-   - Model: `DragonLLM/Qwen-Open-Finance-R-8B`
-   - Tokenizer: Auto-loaded with model
-   - Device: Auto (CUDA if available)
-
-2. **Prompt Formatting**: ✅ Uses Qwen chat template
-   - System prompts properly formatted
-   - Tools added to system prompt
-   - JSON instructions added when needed
-
-3. **Response Processing**: ✅
-   - Text generation via Transformers
-   - Tool call parsing from text
-   - JSON extraction from markdown
-
-### ✅ Qwen-Specific Considerations
-
-1. **Text-Based Tool Calls**: Qwen doesn't have native function calling, so we:
-   - Format tools in system prompt
-   - Parse `<tool_call>...</tool_call>` blocks from response
-   - Convert to OpenAI-compatible format
-
-2. **JSON Output**: Qwen doesn't have native JSON mode, so we:
-   - Add JSON instructions to system prompt
-   - Extract JSON from markdown code blocks
-   - Validate and return clean JSON
-
-## Verification Checklist
-
-### API Compatibility
-- [x] All required OpenAI API parameters supported
-- [x] Response format matches OpenAI specification
-- [x] Error handling follows OpenAI error format
-- [x] Streaming support implemented
-- [x] Tool calls properly formatted
-
-### Client Compatibility
-- [x] `tool_choice="required"` accepted
-- [x] `response_format` supported
-- [x] Structured output requests handled correctly
-- [x] Tool definitions passed through
-- [x] Structured outputs extracted
-
-### Qwen Model Integration
-- [x] Model loads correctly from Hugging Face
-- [x] Chat template applied correctly
-- [x] Tools formatted for Qwen prompt style
-- [x] Tool calls parsed from Qwen text format
-- [x] JSON extracted from Qwen responses
-
-## Testing Recommendations
-
-1. **Basic Chat**: Verify simple chat completions work
-2. **Tool Calls**: Test with tools defined, verify parsing
-3. **Structured Outputs**: Test with `response_format`, verify JSON extraction
-4. **Error Handling**: Test invalid requests return proper errors
-5. **Streaming**: Test streaming responses work correctly
-
-## Known Limitations
-
-1. **Native Function Calling**: Qwen doesn't support native function calling, so we use text-based parsing
-2. **JSON Mode**: Qwen doesn't have native JSON mode, so we enforce via prompts
-3. **Tool Choice "required"**: Converted to "auto" since we can't force tool calls in text-based models
-
-## Conclusion
-
-✅ **Our OpenAI API wrapper is correctly implemented and properly connected to the Qwen fine-tuned model.**
-
-The implementation:
-- Follows OpenAI API specification
-- Handles OpenAI-compatible parameters correctly
-- Properly integrates with Qwen model via Transformers
-- Provides fallbacks for features not natively supported by Qwen
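
The removed verification doc describes, but never shows, the two text-parsing fallbacks it relies on (tool-call extraction and JSON extraction from markdown). A minimal sketch of both, assuming each `<tool_call>` block wraps a JSON object with `name` and `arguments` keys; the regexes and helper names are illustrative, not the repo's actual code:

```python
import json
import re
import uuid

# Assumed Qwen output shape: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
# JSON replies are assumed to arrive inside a markdown fence, optionally tagged "json"
JSON_FENCE_RE = re.compile(r"```(?:json)?\s*(\{.*?\})\s*```", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Convert Qwen's text-based tool calls to OpenAI-format ToolCall dicts."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the request
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:24]}",
            "type": "function",
            "function": {
                "name": payload.get("name", ""),
                # The OpenAI spec wants arguments as a JSON *string*, not an object
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls


def extract_json(text: str) -> str:
    """Pull a JSON object out of a markdown fence, falling back to the raw text."""
    match = JSON_FENCE_RE.search(text)
    return match.group(1) if match else text.strip()
```
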
docs/transformers_verification.md DELETED
@@ -1,190 +0,0 @@
-# Transformers Library Usage Verification
-
-## Current Implementation
-
-### ✅ Library Version
-- **Dockerfile**: `transformers>=4.45.0` (updated from 4.40.0)
-- **Minimum Required**: 4.37.0 for Qwen1.5, 4.35.0 for Qwen2.5
-- **Recommended**: 4.45.0+ for latest Qwen features and bug fixes
-
-### ✅ Correct Usage of Transformers API
-
-#### 1. Model Loading
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# ✅ Correct: Using AutoModelForCausalLM for causal language models
-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_NAME,
-    token=hf_token,
-    trust_remote_code=True,   # ✅ Required for Qwen models
-    dtype=torch.bfloat16,     # ✅ Memory-efficient precision
-    device_map="auto",        # ✅ Automatic device placement
-    max_memory={0: "20GiB"},  # ✅ Memory management
-    cache_dir=CACHE_DIR,
-    low_cpu_mem_usage=True,   # ✅ Efficient loading
-)
-```
-
-**Verification**:
-- ✅ `AutoModelForCausalLM` is correct for Qwen (causal LM architecture)
-- ✅ `trust_remote_code=True` is required for Qwen's custom code
-- ✅ `dtype=torch.bfloat16` is optimal for memory and performance
-- ✅ `device_map="auto"` automatically handles GPU/CPU placement
-- ✅ `max_memory` limits GPU memory usage
-
-#### 2. Tokenizer Loading
-```python
-# ✅ Correct: Using AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained(
-    MODEL_NAME,
-    token=hf_token,
-    trust_remote_code=True,  # ✅ Required for Qwen
-    cache_dir=CACHE_DIR,
-)
-```
-
-**Verification**:
-- ✅ `AutoTokenizer` automatically detects Qwen tokenizer
-- ✅ `trust_remote_code=True` loads Qwen's custom tokenizer code
-- ✅ Chat template handling is correct
-
-#### 3. Chat Template Usage
-```python
-# ✅ Correct: Using apply_chat_template
-if hasattr(tokenizer, "apply_chat_template"):
-    prompt = tokenizer.apply_chat_template(
-        messages,
-        tokenize=False,
-        add_generation_prompt=True,
-    )
-```
-
-**Verification**:
-- ✅ `apply_chat_template` is the modern way (replaces manual formatting)
-- ✅ `tokenize=False` returns string (we tokenize separately)
-- ✅ `add_generation_prompt=True` adds assistant prompt
-
-#### 4. Model Generation
-```python
-# ✅ Correct: Using model.generate()
-outputs = model.generate(
-    **inputs,
-    max_new_tokens=max_tokens,
-    temperature=temperature,
-    top_p=top_p,
-    top_k=DEFAULT_TOP_K,
-    do_sample=temperature > 0,
-    pad_token_id=PAD_TOKEN_ID,
-    eos_token_id=EOS_TOKENS,
-    repetition_penalty=REPETITION_PENALTY,
-    use_cache=True,
-)
-```
-
-**Verification**:
-- ✅ `max_new_tokens` is correct (not `max_length`)
-- ✅ `do_sample` based on temperature is correct
-- ✅ `pad_token_id` and `eos_token_id` properly configured
-- ✅ `repetition_penalty` helps avoid repetition
-- ✅ `use_cache=True` improves performance
-
-#### 5. Streaming Support
-```python
-# ✅ Correct: Using TextIteratorStreamer
-from transformers import TextIteratorStreamer
-
-streamer = TextIteratorStreamer(
-    tokenizer,
-    skip_prompt=True,
-    skip_special_tokens=True,
-)
-```
-
-**Verification**:
-- ✅ `TextIteratorStreamer` is the correct class for streaming
-- ✅ `skip_prompt=True` avoids re-printing the prompt
-- ✅ `skip_special_tokens=True` produces clean output
-
-## Qwen-Specific Considerations
-
-### ✅ Model Architecture
-- **Qwen-Open-Finance-R-8B** is based on the Qwen architecture
-- Uses **CausalLM** architecture (autoregressive generation)
-- Compatible with `AutoModelForCausalLM`
-
-### ✅ Tokenizer Features
-- Qwen tokenizer supports chat templates
-- Custom chat template can be loaded from the model repo
-- Handles special tokens correctly
-
-### ✅ Generation Parameters
-- Qwen works well with:
-  - `temperature`: 0.1-1.0 (we use 0.7 default)
-  - `top_p`: 0.9-1.0 (we use 1.0 default)
-  - `top_k`: 50-100 (we use DEFAULT_TOP_K)
-  - `repetition_penalty`: 1.0-1.2 (we use REPETITION_PENALTY)
-
-## Best Practices Followed
-
-1. ✅ **Memory Management**: Using `bfloat16`, `low_cpu_mem_usage`, `max_memory`
-2. ✅ **Device Handling**: `device_map="auto"` for automatic GPU/CPU
-3. ✅ **Caching**: Using `cache_dir` for model/tokenizer caching
-4. ✅ **Error Handling**: Proper exception handling in initialization
-5. ✅ **Thread Safety**: Using locks for concurrent initialization
-6. ✅ **Streaming**: Proper async streaming implementation
-
-## Potential Improvements
-
-### 1. Consider Using `torch.compile()` (PyTorch 2.0+)
-```python
-# Optional: Compile model for faster inference
-if hasattr(torch, 'compile'):
-    model = torch.compile(model, mode="reduce-overhead")
-```
-
-### 2. Consider Flash Attention 2
-```python
-# For faster attention computation (if supported)
-model = AutoModelForCausalLM.from_pretrained(
-    ...,
-    attn_implementation="flash_attention_2",  # If available
-)
-```
-
-### 3. Consider Quantization (if memory constrained)
-```python
-# 8-bit quantization (requires bitsandbytes)
-from transformers import BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(
-    load_in_8bit=True,
-)
-```
-
-## Version Compatibility Matrix
-
-| Component | Minimum | Recommended | Current |
-|-----------|---------|-------------|---------|
-| Transformers | 4.37.0 | 4.45.0+ | 4.45.0+ ✅ |
-| PyTorch | 2.0.0 | 2.5.0+ | 2.5.0+ ✅ |
-| Python | 3.8 | 3.11+ | 3.11 ✅ |
-| CUDA | 11.8 | 12.4 | 12.4 ✅ |
-
-## Conclusion
-
-✅ **Our Transformers implementation is correct and follows best practices.**
-
-The code:
-- Uses correct Transformers API methods
-- Properly handles Qwen-specific requirements
-- Implements efficient memory management
-- Supports streaming correctly
-- Uses appropriate generation parameters
-
-The version update to 4.45.0+ ensures:
-- Latest bug fixes
-- Better Qwen support
-- Improved performance
-- Security updates
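
One wiring detail the removed doc's streaming snippet leaves implicit: `model.generate()` blocks until generation finishes, so the streamer must be drained while `generate()` runs on a worker thread. A minimal sketch of that standard pattern (the function name and default are illustrative, not the repo's actual code):

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_completion(model, tokenizer, prompt: str, max_new_tokens: int = 512):
    """Yield decoded text chunks as the model produces them."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    # generate() blocks, so run it on a worker thread and consume the
    # streamer iterator from the caller's side.
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    thread.start()
    for chunk in streamer:
        yield chunk
    thread.join()
```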