jeanbaptdzd committed
Commit 9c71bb7
Parent: dc80161

Migrate from vLLM to Transformers library


- Removed vLLM dependency (doesn't support Qwen3ForCausalLM yet)
- Switched to Transformers library with native Qwen3 support
- Updated Dockerfile: removed vLLM, added transformers + accelerate
- Rewrote app/providers/vllm.py to use Transformers
- Implemented streaming with TextIteratorStreamer
- Updated all documentation and configuration
- Removed vllm_base_url from config
- Updated tests to match new config structure

This provides better compatibility with Qwen3 models while we wait for vLLM support.
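
The core of the change is the streaming path: the synchronous vLLM `LLM.generate()` call could only fake streaming by chunking a finished response, while `TextIteratorStreamer` yields text as it is produced. A condensed sketch of the pattern the new `app/providers/vllm.py` uses (full diff below; the example prompt is illustrative):

```python
# Condensed sketch of the TextIteratorStreamer pattern adopted in this commit;
# see the app/providers/vllm.py diff below for the real implementation.
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "DragonLLM/qwen3-8b-fin-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Build the prompt from chat messages via the tokenizer's chat template
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize Basel III in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# generate() blocks, so it runs in a worker thread while we consume the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs={**inputs, "streamer": streamer, "max_new_tokens": 128}).start()

for text in streamer:
    print(text, end="", flush=True)  # arrives incrementally as tokens decode
```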

Files changed (7)
  1. Dockerfile +11 -11
  2. README.md +4 -8
  3. app/config.py +0 -1
  4. app/main.py +4 -4
  5. app/providers/vllm.py +131 -154
  6. requirements.txt +1 -3
  7. tests/test_config.py +1 -4
Dockerfile CHANGED
@@ -24,15 +24,18 @@ RUN python3 -m pip install --upgrade pip
 # Set working directory
 WORKDIR /app
 
-# Install PyTorch with CUDA 12.4 support FIRST (critical for vLLM compatibility)
-# Updated to PyTorch 2.5+ for better vLLM 0.9.x compatibility
+# Install PyTorch with CUDA 12.4 support
 RUN pip install --no-cache-dir \
     torch>=2.5.0 \
+    torchvision \
+    torchaudio \
     --index-url https://download.pytorch.org/whl/cu124
 
-# Install vLLM 0.11.0 (latest, supports Qwen3ForCausalLM - requires 0.8.4+)
-# vLLM 0.11.0 - includes Qwen3 support and latest optimizations
-RUN pip install --no-cache-dir vllm==0.11.0
+# Install Transformers and accelerate for optimized inference
+RUN pip install --no-cache-dir \
+    transformers>=4.40.0 \
+    accelerate>=0.30.0 \
+    bitsandbytes  # Optional: for quantization support
 
 # Install application dependencies
 RUN pip install --no-cache-dir \
@@ -56,17 +59,14 @@ RUN useradd -m -u 1000 user && \
 
 USER user
 
-# Set environment variables for optimal vLLM performance
+# Set environment variables for optimal Transformers performance
 ENV HF_HOME=/tmp/huggingface
 ENV TORCHINDUCTOR_CACHE_DIR=/tmp/torch/inductor
-ENV TRITON_CACHE_DIR=/tmp/triton
-ENV TORCH_COMPILE_DEBUG=0
 ENV CUDA_VISIBLE_DEVICES=0
 # Optimize CUDA memory allocation
 ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-# vLLM 0.9.x uses v1 engine by default (more efficient)
-# VLLM_USE_V1=0 can be set if needed for compatibility, but v1 is recommended
-# ENV VLLM_USE_V1=0  # Commented out - v1 engine is default and preferred in 0.9.x
+# Enable Transformers optimizations
+ENV TRANSFORMERS_CACHE=/tmp/huggingface
 
 # Expose port
 EXPOSE 7860
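
A quick way to confirm the rebuilt image has a CUDA-enabled PyTorch and a Transformers version matching the new pins is a smoke test inside the container (a hypothetical check, not part of this commit):

```python
# Hypothetical smoke test for the rebuilt image; run inside the container.
import torch
import transformers
from packaging.version import Version  # packaging ships as a transformers dependency

print("torch:", torch.__version__)                   # expect a 2.5+ build with +cu124 tag
print("cuda available:", torch.cuda.is_available())  # expect True on the L4 Space
print("transformers:", transformers.__version__)

# Mirrors the >=4.40.0 pin the Dockerfile uses for Qwen3 support
assert Version(transformers.__version__) >= Version("4.40.0")
```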
README.md CHANGED
@@ -11,7 +11,7 @@ suggested_hardware: l4x1
 
 # Open Finance LLM 8B
 
-OpenAI-compatible API powered by `DragonLLM/qwen3-8b-fin-v1.0` via vLLM.
+OpenAI-compatible API powered by `DragonLLM/qwen3-8b-fin-v1.0` via Transformers.
 
 ## 🚀 Quick Start
 
@@ -63,14 +63,9 @@ The service uses these environment variables:
 - **Important**: You must accept the model's terms at https://huggingface.co/DragonLLM/qwen3-8b-fin-v1.0 before the token will work
 
 ### Optional Configuration
-- `VLLM_BASE_URL`: vLLM server endpoint (default: `http://localhost:8000/v1`)
 - `MODEL`: Model name (default: `DragonLLM/qwen3-8b-fin-v1.0`)
 - `SERVICE_API_KEY`: Optional API key for authentication (set via `x-api-key` header)
 - `LOG_LEVEL`: Logging level (default: `info`)
-- `VLLM_USE_EAGER`: Control optimization mode (default: `auto`)
-  - `auto`: Try optimized mode (CUDA graphs), fallback to eager if needed (recommended)
-  - `false`: Force optimized mode (CUDA graphs enabled, may fail if unsupported)
-  - `true`: Force eager mode (slower but more stable)
 
 ### Setting Up HF_TOKEN_LC2 in Hugging Face Spaces
 
@@ -145,9 +140,10 @@ MIT License - see LICENSE file for details.
 
 ---
 
-**Note**: This service runs vLLM 0.11.0 (latest stable) with `DragonLLM/qwen3-8b-fin-v1.0` model. The service initializes the model automatically on startup. For production use, ensure proper GPU resources (L4 or better) are available.
+**Note**: This service runs with `DragonLLM/qwen3-8b-fin-v1.0` using the Transformers library. The service initializes the model automatically on startup. For production use, ensure proper GPU resources (L4 or better) are available.
 
 ### Version Information
-- **vLLM:** 0.11.0 (supports Qwen3ForCausalLM - requires 0.8.4+)
+- **Transformers:** 4.40.0+ (supports Qwen3ForCausalLM)
 - **PyTorch:** 2.5.0+ (CUDA 12.4)
 - **CUDA:** 12.4
+- **Accelerate:** 0.30.0+ (for optimized inference)
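
The README's client instructions are unaffected by the backend swap, since the HTTP surface stays OpenAI-compatible. A minimal call sketch, assuming the service is reachable on the Dockerfile's exposed port 7860 and returns the standard OpenAI response shape:

```python
# Minimal client sketch; localhost:7860 and the x-api-key value are assumptions --
# adjust to your Space URL and configured SERVICE_API_KEY.
import requests

resp = requests.post(
    "http://localhost:7860/v1/chat/completions",
    headers={"x-api-key": "your-service-api-key"},  # omit if SERVICE_API_KEY is unset
    json={
        "model": "DragonLLM/qwen3-8b-fin-v1.0",
        "messages": [{"role": "user", "content": "What is Tier 1 capital?"}],
        "max_tokens": 200,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```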
app/config.py CHANGED
@@ -2,7 +2,6 @@ from pydantic_settings import BaseSettings
 
 
 class Settings(BaseSettings):
-    vllm_base_url: str = "http://localhost:8000/v1"
     model: str = "DragonLLM/qwen3-8b-fin-v1.0"
     service_api_key: str | None = None
     log_level: str = "info"
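
Dropping the `vllm_base_url` field is all it takes to retire `VLLM_BASE_URL`: pydantic-settings maps each remaining field to the environment variable of the same name, case-insensitively. For illustration:

```python
# Illustration of the pydantic-settings env mapping used by Settings.
import os

from app.config import Settings

os.environ["MODEL"] = "custom-model"  # field name "model", matched case-insensitively
print(Settings().model)               # -> "custom-model"

del os.environ["MODEL"]
print(Settings().model)               # -> "DragonLLM/qwen3-8b-fin-v1.0" (the default)
```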
app/main.py CHANGED
@@ -7,7 +7,7 @@ import logging
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
 
-app = FastAPI(title="LLM Pro Finance API (vLLM)")
+app = FastAPI(title="LLM Pro Finance API (Transformers)")
 
 # Mount routers
 app.include_router(openai_api.router, prefix="/v1")
@@ -23,8 +23,8 @@ async def startup_event():
     logger.info("Initializing model in background thread...")
 
     def load_model():
-        from app.providers.vllm import initialize_vllm
-        initialize_vllm()
+        from app.providers.vllm import initialize_model
+        initialize_model()
 
     # Start model loading in background thread
     thread = threading.Thread(target=load_model, daemon=True)
@@ -38,7 +38,7 @@ async def root():
         "service": "Qwen Open Finance R 8B Inference",
         "version": "1.0.0",
         "model": "DragonLLM/qwen3-8b-fin-v1.0",
-        "backend": "vLLM"
+        "backend": "Transformers"
     }
 
 @app.get("/health")
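
Because startup only launches a daemon thread, the server answers requests before the weights finish loading; an early chat request simply triggers the provider's lazy `initialize_model()` and waits. A client can instead poll readiness, assuming `/health` (whose handler body is not shown in this diff) returns HTTP 200 once the service is up:

```python
# Hypothetical readiness poll; assumes /health returns HTTP 200 when ready,
# which this diff does not show explicitly.
import time

import requests

def wait_until_ready(base_url: str = "http://localhost:7860", timeout: float = 600.0) -> None:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return  # service (and, ideally, the model) is up
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(5)
    raise TimeoutError("service did not become healthy in time")
```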
app/providers/vllm.py CHANGED
@@ -1,28 +1,32 @@
 import os
 import time
+import torch
 from typing import Dict, Any, AsyncIterator, Union
-from vllm import LLM, SamplingParams
 import asyncio
 from huggingface_hub import login
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
+from threading import Thread
 
-# Model configuration - back to working DragonLLM model
+# Model configuration
 model_name = "DragonLLM/qwen3-8b-fin-v1.0"
-llm_engine = None
+model = None
+tokenizer = None
+device = "cuda" if torch.cuda.is_available() else "cpu"
 
-def initialize_vllm():
-    """Initialize vLLM engine with the model
+def initialize_model():
+    """Initialize Transformers model with Qwen3
 
     Handles authentication with Hugging Face Hub for accessing DragonLLM models.
     Prioritizes HF_TOKEN_LC2 (DragonLLM access) over HF_TOKEN_LC.
     """
-    global llm_engine
+    global model, tokenizer
 
-    if llm_engine is None:
+    if model is None:
         import logging
         logger = logging.getLogger(__name__)
 
-        logger.info(f"Initializing vLLM with model: {model_name}")
-        print(f"Initializing vLLM with model: {model_name}")
+        logger.info(f"Initializing Transformers with model: {model_name}")
+        print(f"Initializing Transformers with model: {model_name}")
 
         # Get HF token from environment (Hugging Face Space secret)
         # Priority: HF_TOKEN_LC2 (for DragonLLM access) > HF_TOKEN_LC > HF_TOKEN
@@ -56,99 +60,55 @@ def initialize_vllm():
             logger.warning(f"⚠️ Warning: Failed to authenticate with HF Hub: {e}")
             print(f"⚠️ Warning: Failed to authenticate with HF Hub: {e}")
 
-        # Set all possible environment variables that vLLM/huggingface_hub might check
-        # This ensures compatibility across different versions
+        # Set all possible environment variables
         os.environ["HF_TOKEN"] = hf_token
         os.environ["HUGGING_FACE_HUB_TOKEN"] = hf_token
-        # Some tools check for these variants too
        os.environ["HF_API_TOKEN"] = hf_token
 
        logger.info("✅ Hugging Face token environment variables set")
    else:
        logger.warning("⚠️ WARNING: No HF token found in environment!")
-        logger.warning(f"   Checked: HF_TOKEN_LC2, HF_TOKEN_LC, HF_TOKEN, HUGGING_FACE_HUB_TOKEN")
-        logger.warning(f"   Available env vars: {[k for k in os.environ.keys() if 'TOKEN' in k or 'HF' in k]}")
        print("⚠️ WARNING: No HF token found in environment!")
        print(f"   Checked: HF_TOKEN_LC2, HF_TOKEN_LC, HF_TOKEN, HUGGING_FACE_HUB_TOKEN")
-        print(f"   Available env vars with 'TOKEN' or 'HF': {[k for k in os.environ.keys() if 'TOKEN' in k or 'HF' in k]}")
        print("   ⚠️ Model download may fail if DragonLLM/qwen3-8b-fin-v1.0 is gated!")
 
    try:
-        # Initialize vLLM engine
-        # Note: vLLM 0.11.0 supports Qwen3ForCausalLM (requires 0.8.4+)
-        logger.info(f"Attempting to load model: {model_name}")
-        print(f"Attempting to load model: {model_name}")
-        print(f"Model type: DragonLLM Qwen3 8B (bfloat16)")
-        print(f"vLLM version: 0.11.0 (Qwen3ForCausalLM support)")
-        print(f"Download directory: /tmp/huggingface")
+        logger.info(f"Loading model: {model_name}")
+        print(f"Loading model: {model_name}")
+        print(f"Model type: DragonLLM Qwen3 8B")
+        print(f"Device: {device}")
        print(f"Trust remote code: True")
-        print(f"L4 GPU: 24GB VRAM available")
 
-        # Try optimized mode first (CUDA graphs enabled)
-        # Falls back to eager mode if CUDA graphs fail
-        use_optimized = os.getenv("VLLM_USE_EAGER", "auto").lower()
-        if use_optimized == "true":
-            enforce_eager = True
-            mode_desc = "Eager mode (forced)"
-        elif use_optimized == "false":
-            enforce_eager = False
-            mode_desc = "Optimized mode (CUDA graphs enabled)"
-        else:  # "auto" - try optimized, fallback to eager
-            enforce_eager = False
-            mode_desc = "Optimized mode (auto, fallback to eager if needed)"
+        # Load tokenizer
+        print("📥 Loading tokenizer...")
+        tokenizer = AutoTokenizer.from_pretrained(
+            model_name,
+            token=hf_token,
+            trust_remote_code=True,
+            cache_dir="/tmp/huggingface"
+        )
+        logger.info("✅ Tokenizer loaded")
+        print("✅ Tokenizer loaded")
 
-        print(f"Mode: {mode_desc}")
-        print(f"GPU memory utilization: 0.85")
-        print(f"vLLM: v0.9.2 (Latest stable, improved Qwen3 support)")
-        print(f"PyTorch: 2.5.0+ (CUDA 12.4 binary)")
+        # Load model with optimizations
+        print("📥 Loading model (this may take a few minutes)...")
+        model = AutoModelForCausalLM.from_pretrained(
+            model_name,
+            token=hf_token,
+            trust_remote_code=True,
+            torch_dtype=torch.bfloat16,
+            device_map="auto",
+            cache_dir="/tmp/huggingface"
+        )
 
-        # Common initialization parameters
-        init_params = {
-            "model": model_name,
-            "trust_remote_code": True,
-            "dtype": "bfloat16",  # Use bfloat16 for Qwen3 (required)
-            "max_model_len": 4096,  # Reduced for L4 KV cache constraints
-            "gpu_memory_utilization": 0.85,  # Can use more with stable v0 engine
-            "tensor_parallel_size": 1,  # Single L4 GPU
-            "download_dir": "/tmp/huggingface",
-            "tokenizer_mode": "auto",
-            "disable_log_stats": False,  # Enable logging for debugging
-        }
+        # Set to eval mode for inference
+        model.eval()
 
-        # Try optimized mode first (unless explicitly disabled)
-        if use_optimized == "auto" or use_optimized == "false":
-            try:
-                print(f"🚀 Attempting optimized mode with CUDA graphs...")
-                logger.info("Attempting optimized mode (enforce_eager=False)")
-                init_params["enforce_eager"] = False
-                llm_engine = LLM(**init_params)
-                print(f"✅ vLLM engine initialized successfully in OPTIMIZED mode!")
-                logger.info("✅ vLLM engine initialized in optimized mode (CUDA graphs enabled)")
-            except Exception as opt_error:
-                error_msg = str(opt_error).lower()
-                # Check if error is CUDA graph related
-                if "cuda graph" in error_msg or "graph" in error_msg or use_optimized == "auto":
-                    logger.warning(f"⚠️ Optimized mode failed, falling back to eager mode: {opt_error}")
-                    print(f"⚠️ Optimized mode failed: {opt_error}")
-                    print(f"🔄 Falling back to eager mode for stability...")
-                    init_params["enforce_eager"] = True
-                    llm_engine = LLM(**init_params)
-                    print(f"✅ vLLM engine initialized successfully in EAGER mode (fallback)")
-                    logger.info("✅ vLLM engine initialized in eager mode (fallback after optimized mode failure)")
-                else:
-                    # Re-raise if it's not a CUDA graph issue or if optimized is forced
-                    raise
-        else:
-            # Eager mode explicitly requested
-            print(f"⚙️ Using eager mode (explicitly requested)")
-            logger.info("Using eager mode (VLLM_USE_EAGER=true)")
-            init_params["enforce_eager"] = True
-            llm_engine = LLM(**init_params)
-            print(f"✅ vLLM engine initialized successfully in EAGER mode!")
-            logger.info("✅ vLLM engine initialized in eager mode")
+        print(f"✅ Model loaded successfully!")
+        logger.info("✅ Model initialized successfully")
 
    except Exception as e:
-        error_msg = f"❌ Error initializing vLLM: {e}"
+        error_msg = f"❌ Error initializing model: {e}"
        logger.error(error_msg, exc_info=True)
        print(error_msg)
 
@@ -167,7 +127,7 @@ def initialize_vllm():
            raise
 
 
-class VLLMProvider:
+class TransformersProvider:
    def __init__(self):
        # Don't initialize at import time
        pass
@@ -193,44 +153,61 @@ class VLLMProvider:
        logger = logging.getLogger(__name__)
 
        try:
-            # Initialize vLLM on first use
-            if llm_engine is None:
-                logger.info("vLLM engine not initialized, initializing now...")
-                initialize_vllm()
-                logger.info("vLLM engine initialized successfully")
+            # Initialize model on first use
+            if model is None:
+                logger.info("Model not initialized, initializing now...")
+                initialize_model()
+                logger.info("Model initialized successfully")
 
            messages = payload.get("messages", [])
            temperature = payload.get("temperature", 0.7)
            max_tokens = payload.get("max_tokens", 1000)
            top_p = payload.get("top_p", 1.0)
 
-            # Convert messages to prompt
-            prompt = self._messages_to_prompt(messages)
+            # Convert messages to prompt using tokenizer's chat template
+            if hasattr(tokenizer, "apply_chat_template"):
+                prompt = tokenizer.apply_chat_template(
+                    messages,
+                    tokenize=False,
+                    add_generation_prompt=True
+                )
+            else:
+                # Fallback to simple prompt format
+                prompt = self._messages_to_prompt(messages)
+
            logger.info(f"Generating response for prompt: {prompt[:100]}...")
 
-            # Set up sampling parameters
-            sampling_params = SamplingParams(
-                temperature=temperature,
-                top_p=top_p,
-                max_tokens=max_tokens,
-            )
+            # Tokenize
+            inputs = tokenizer(prompt, return_tensors="pt").to(device)
 
            # Handle streaming vs non-streaming
            if stream:
-                return self._chat_stream(prompt, sampling_params, payload.get("model", model_name))
+                return self._chat_stream(inputs, temperature, top_p, max_tokens, payload.get("model", model_name))
 
-            # Generate response using vLLM (non-streaming)
-            outputs = llm_engine.generate([prompt], sampling_params)
+            # Generate response (non-streaming)
+            with torch.no_grad():
+                outputs = model.generate(
+                    **inputs,
+                    max_new_tokens=max_tokens,
+                    temperature=temperature,
+                    top_p=top_p,
+                    do_sample=temperature > 0,
+                    pad_token_id=tokenizer.eos_token_id
+                )
+
+            # Decode response
+            generated_ids = outputs[0][inputs.input_ids.shape[1]:]
+            generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
 
-            # Extract the generated text
-            generated_text = outputs[0].outputs[0].text
            logger.info(f"Generated text: {generated_text[:100]}...")
 
+            # Calculate tokens (approximate)
+            prompt_tokens = inputs.input_ids.shape[1]
+            completion_tokens = len(generated_ids)
+
            # Build OpenAI-compatible response
            completion_id = f"chatcmpl-{os.urandom(12).hex()}"
            created = int(time.time())
-            prompt_tokens = len(outputs[0].prompt_token_ids)
-            completion_tokens = len(outputs[0].outputs[0].token_ids)
 
            return {
                "id": completion_id,
@@ -257,72 +234,72 @@ class VLLMProvider:
            logger.error(f"Error in chat completion: {str(e)}", exc_info=True)
            raise
 
-    async def _chat_stream(self, prompt: str, sampling_params: SamplingParams, model: str) -> AsyncIterator[str]:
-        """Stream chat completions using vLLM
-
-        Note: vLLM 0.6.5 with synchronous LLM doesn't support true streaming.
-        This implementation generates the full response and yields it in chunks
-        for OpenAI API compatibility. For true streaming, use AsyncLLMEngine.
-        """
+    async def _chat_stream(self, inputs, temperature: float, top_p: float, max_tokens: int, model_id: str) -> AsyncIterator[str]:
+        """Stream chat completions using Transformers TextIteratorStreamer"""
        import logging
        logger = logging.getLogger(__name__)
 
        completion_id = f"chatcmpl-{os.urandom(12).hex()}"
        created = int(time.time())
 
-        # Generate response (non-streaming backend, but we'll chunk it)
-        # Run in thread pool to avoid blocking
-        loop = asyncio.get_event_loop()
-        outputs = await loop.run_in_executor(
-            None,
-            lambda: llm_engine.generate([prompt], sampling_params)
-        )
+        # Create streamer
+        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
 
-        generated_text = outputs[0].outputs[0].text
-        finish_reason = outputs[0].outputs[0].finish_reason or "stop"
+        # Generation parameters
+        generation_kwargs = {
+            "max_new_tokens": max_tokens,
+            "temperature": temperature,
+            "top_p": top_p,
+            "do_sample": temperature > 0,
+            "pad_token_id": tokenizer.eos_token_id,
+            "streamer": streamer
+        }
 
-        # Yield text in chunks (simulate streaming)
-        # Split into reasonable chunks (words or characters)
-        chunk_size = 10  # words per chunk
-        words = generated_text.split()
+        # Run generation in a separate thread
+        def generate():
+            with torch.no_grad():
+                model.generate(**inputs, **generation_kwargs)
 
-        for i in range(0, len(words), chunk_size):
-            chunk_words = words[i:i + chunk_size]
-            delta_text = " ".join(chunk_words)
-            if i + chunk_size < len(words):
-                delta_text += " "
-
-            # Format as OpenAI SSE stream chunk
-            chunk = {
-                "id": completion_id,
-                "object": "chat.completion.chunk",
-                "created": created,
-                "model": model,
-                "choices": [
-                    {
-                        "index": 0,
-                        "delta": {
-                            "content": delta_text
-                        },
-                        "finish_reason": None
-                    }
-                ]
-            }
-
-            yield f"data: {self._json_dumps(chunk)}\n\n"
-            await asyncio.sleep(0)  # Yield control
+        generation_thread = Thread(target=generate)
+        generation_thread.start()
+
+        # Stream tokens as they're generated
+        try:
+            for token in streamer:
+                # Yield chunks
+                chunk = {
+                    "id": completion_id,
+                    "object": "chat.completion.chunk",
+                    "created": created,
+                    "model": model_id,
+                    "choices": [
+                        {
+                            "index": 0,
+                            "delta": {
+                                "content": token
+                            },
+                            "finish_reason": None
+                        }
+                    ]
+                }
+
+                yield f"data: {self._json_dumps(chunk)}\n\n"
+                await asyncio.sleep(0)  # Yield control
+        finally:
+            # Wait for generation to complete
+            generation_thread.join()
 
-        # Send final chunk with finish_reason
+        # Send final chunk
        final_chunk = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
-            "model": model,
+            "model": model_id,
            "choices": [
                {
                    "index": 0,
                    "delta": {},
-                    "finish_reason": finish_reason
+                    "finish_reason": "stop"
                }
            ]
        }
@@ -335,7 +312,7 @@ class VLLMProvider:
        return json.dumps(obj, ensure_ascii=False)
 
    def _messages_to_prompt(self, messages: list) -> str:
-        """Convert OpenAI messages format to prompt"""
+        """Convert OpenAI messages format to prompt (fallback)"""
        prompt = ""
        for message in messages:
            role = message["role"]
@@ -351,7 +328,7 @@ class VLLMProvider:
 
 
 # Module-level provider instance for backward compatibility
-_provider = VLLMProvider()
+_provider = TransformersProvider()
 
 
 # Module-level functions for direct import
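
On the wire, `_chat_stream` emits standard OpenAI-style SSE chunks, so existing streaming clients keep working. A sketch of consuming the stream (same localhost:7860 assumption as above; whether a terminating `[DONE]` sentinel is sent is not visible in this diff):

```python
# Sketch of a client consuming the SSE chunks emitted by _chat_stream.
import json

import requests

with requests.post(
    "http://localhost:7860/v1/chat/completions",
    json={
        "model": "DragonLLM/qwen3-8b-fin-v1.0",
        "messages": [{"role": "user", "content": "Define duration risk."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":  # defensive: some OpenAI-style servers send this
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
```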
requirements.txt CHANGED
@@ -1,7 +1,5 @@
 # Core dependencies for OpenAI-compatible API service
-# Note: vLLM and PyTorch are installed separately in Dockerfile for CUDA support
-# vllm==0.6.5  # Installed in Dockerfile
-# torch==2.4.0  # Installed in Dockerfile
+# Note: PyTorch and Transformers are installed separately in Dockerfile for CUDA support
 
 fastapi>=0.115.0
 uvicorn[standard]>=0.30.0
tests/test_config.py CHANGED
@@ -9,7 +9,6 @@ from app.config import Settings
 def test_settings_defaults():
     """Test that settings have correct default values."""
     settings = Settings()
-    assert settings.vllm_base_url == "http://localhost:8000/v1"
     assert settings.model == "DragonLLM/qwen3-8b-fin-v1.0"
     assert settings.service_api_key is None
     assert settings.log_level == "info"
@@ -18,13 +17,11 @@ def test_settings_defaults():
 def test_settings_from_env():
     """Test that settings can be loaded from environment variables."""
     with patch.dict(os.environ, {
-        "VLLM_BASE_URL": "http://remote:8000/v1",
         "MODEL": "custom-model",
         "SERVICE_API_KEY": "secret-key",
         "LOG_LEVEL": "debug"
     }):
         settings = Settings()
-        assert settings.vllm_base_url == "http://remote:8000/v1"
         assert settings.model == "custom-model"
         assert settings.service_api_key == "secret-key"
         assert settings.log_level == "debug"
@@ -36,4 +33,4 @@ def test_settings_env_file():
     # In practice, you'd create a test .env file or mock the file reading
     settings = Settings()
     # Verify that the settings object can be instantiated
-    assert isinstance(settings.vllm_base_url, str)
+    assert isinstance(settings.model, str)