Spaces:

jeanbaptdzd
/

open-finance-llm-8b

Paused

jeanbaptdzd commited on Nov 2

Commit

da484d7

1 Parent(s): 3db41e6

Refactor: Remove RAG, upgrade vLLM 0.9.2, add optimization mode

- Remove RAG components (query_with_context.py, PRIIPS_WORKFLOW.md)
- Upgrade vLLM 0.6.5 → 0.9.2 with PyTorch 2.5+ for better Qwen3 support
- Add automatic optimized mode (CUDA graphs) with eager fallback
- Improve HF token authentication (HF_TOKEN_LC2 priority, better error handling)
- Fix dependency compatibility issues
- Clean up OpenAI API compatibility interface
- Add comprehensive documentation (VLLM_UPGRADE_ANALYSIS.md, OPTIMIZATION_EVALUATION.md)
- Add model access and compatibility test scripts
- Update requirements.txt and Dockerfile
- Remove VLLM_USE_V1=0 (v1 engine is default in 0.9.x)

Files changed (20) hide show

Dockerfile +8 -5
OPTIMIZATION_EVALUATION.md +137 -0
PRIIPS_WORKFLOW.md +0 -182
README.md +44 -7
VLLM_COMPATIBILITY.md +152 -0
VLLM_UPGRADE_ANALYSIS.md +191 -0
app/models/openai.py +4 -3
app/providers/vllm.py +240 -40
app/routers/openai_api.py +17 -21
app/services/chat_service.py +3 -3
requirements-dev.txt +5 -2
requirements.txt +7 -3
scripts/check_vllm_compatibility.py +258 -0
scripts/query_with_context.py +0 -179
scripts/test_model_access.py +321 -0
tests/performance/README.md +6 -0
tests/performance/__init__.py +6 -0
tests/performance/benchmark.py +6 -0
tests/performance/test_inference_speed.py +6 -0
tests/performance/test_openai_compatibility.py +6 -0

Dockerfile CHANGED Viewed

@@ -25,12 +25,14 @@ RUN python3 -m pip install --upgrade pip
 WORKDIR /app
 # Install PyTorch with CUDA 12.4 support FIRST (critical for vLLM compatibility)
 RUN pip install --no-cache-dir \
-    torch==2.4.0 \
     --index-url https://download.pytorch.org/whl/cu124
-# Install vLLM (will use the PyTorch we just installed)
-RUN pip install --no-cache-dir vllm==0.6.5
 # Install application dependencies
 RUN pip install --no-cache-dir \
@@ -62,8 +64,9 @@ ENV TORCH_COMPILE_DEBUG=0
 ENV CUDA_VISIBLE_DEVICES=0
 # Optimize CUDA memory allocation
 ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-# Force vLLM to use legacy (v0) engine - more stable, single-process
-ENV VLLM_USE_V1=0
 # Expose port
 EXPOSE 7860

 WORKDIR /app
 # Install PyTorch with CUDA 12.4 support FIRST (critical for vLLM compatibility)
+# Updated to PyTorch 2.5+ for better vLLM 0.9.x compatibility
 RUN pip install --no-cache-dir \
+    torch>=2.5.0 \
     --index-url https://download.pytorch.org/whl/cu124
+# Install vLLM 0.9.2 (stable, supports CUDA 12.x, better Qwen3 support than 0.6.5)
+# vLLM 0.9.2 released July 2025 - significant improvements over 0.6.5
+RUN pip install --no-cache-dir vllm==0.9.2
 # Install application dependencies
 RUN pip install --no-cache-dir \
 ENV CUDA_VISIBLE_DEVICES=0
 # Optimize CUDA memory allocation
 ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+# vLLM 0.9.x uses v1 engine by default (more efficient)
+# VLLM_USE_V1=0 can be set if needed for compatibility, but v1 is recommended
+# ENV VLLM_USE_V1=0  # Commented out - v1 engine is default and preferred in 0.9.x
 # Expose port
 EXPOSE 7860

OPTIMIZATION_EVALUATION.md ADDED Viewed

	@@ -0,0 +1,137 @@

+# vLLM Optimization Mode Evaluation
+## Current Setup: Eager Mode
+**Configuration:**
+- `enforce_eager=True` - Disables CUDA graphs
+- `VLLM_USE_V1=0` - Uses v0 engine (stable)
+**Trade-offs:**
+- ✅ **Pros:** More stable, easier debugging, fewer compatibility issues
+- ❌ **Cons:** Lower performance, higher latency, reduced throughput
+## Optimized Mode: CUDA Graphs Enabled
+**Proposed Configuration:**
+- `enforce_eager=False` - Enables CUDA graphs (default)
+- `VLLM_USE_V1=0` - Still use v0 engine for stability
+**Expected Benefits:**
+- 🚀 **Performance:** 2-3x faster inference
+- 🚀 **Throughput:** Higher tokens/second
+- 🚀 **Latency:** Lower time-to-first-token (TTFT)
+**Potential Risks:**
+- ⚠️ **Compatibility:** Qwen3 may have CUDA graph issues in vLLM 0.6.5
+- ⚠️ **Memory:** Slightly higher memory overhead
+- ⚠️ **Stability:** Possible crashes with unsupported operations
+## Evaluation Criteria
+### Can We Use Optimized Mode?
+**Factors to Consider:**
+1. **Model Architecture Support**
+   - Qwen3 in vLLM 0.6.5 may or may not fully support CUDA graphs
+   - Need to test on actual deployment
+2. **Hardware Compatibility**
+   - L4 GPU: 24GB VRAM ✅
+   - CUDA 12.4: Full CUDA graph support ✅
+   - PyTorch 2.4.0: CUDA graph support ✅
+3. **vLLM Version**
+   - v0.6.5: CUDA graphs should work for supported architectures
+   - Qwen3 support may vary
+4. **Memory Constraints**
+   - Current: `gpu_memory_utilization=0.85`
+   - CUDA graphs add ~100-200MB overhead
+   - Should still fit within L4 limits
+## Recommendation: Try Optimized Mode with Fallback
+**Strategy:** Attempt optimized mode, fall back to eager if errors occur
+### Implementation Approach
+```python
+# Try optimized mode first
+try:
+    llm_engine = LLM(
+        model=model_name,
+        trust_remote_code=True,
+        dtype="bfloat16",
+        enforce_eager=False,  # Enable CUDA graphs
+        # ... other params
+    )
+except Exception as e:
+    # Fall back to eager mode
+    logger.warning(f"CUDA graphs failed, falling back to eager mode: {e}")
+    llm_engine = LLM(
+        model=model_name,
+        trust_remote_code=True,
+        dtype="bfloat16",
+        enforce_eager=True,  # Safe fallback
+        # ... other params
+    )
+```
+## Testing Plan
+### 1. Initial Test (Optimized Mode)
+- Deploy with `enforce_eager=False`
+- Monitor startup logs
+- Check for CUDA graph compilation errors
+### 2. Performance Benchmark
+If optimized mode works:
+- Measure: tokens/second, latency, throughput
+- Compare with eager mode baseline
+### 3. Stability Test
+- Run multiple requests
+- Check for crashes or errors
+- Monitor memory usage
+### 4. Fallback Verification
+- Ensure eager mode still works as backup
+- Document any issues found
+## Expected Outcomes
+### Best Case (Optimized Works)
+- ✅ CUDA graphs compile successfully
+- ✅ 2-3x performance improvement
+- ✅ Stable operation
+- **Action:** Keep optimized mode
+### Worst Case (Optimized Fails)
+- ❌ CUDA graph compilation errors
+- ❌ Runtime crashes
+- ✅ Eager mode fallback works
+- **Action:** Stay in eager mode, consider upgrading vLLM
+### Middle Case (Partial Support)
+- ⚠️ CUDA graphs work but with warnings
+- ⚠️ Some operations fall back to eager
+- ✅ Still better than full eager mode
+- **Action:** Monitor and optimize further
+## Monitoring
+Track these metrics:
+- Model loading time
+- CUDA graph compilation time
+- Inference latency
+- Throughput (tokens/sec)
+- Memory usage
+- Error rates
+## Conclusion
+**Recommendation:** **TRY OPTIMIZED MODE** with automatic fallback
+The L4 GPU and CUDA 12.4 setup should support CUDA graphs. Qwen3 compatibility is the main unknown. With automatic fallback to eager mode, we can safely test optimized mode without risking service availability.

PRIIPS_WORKFLOW.md DELETED Viewed

@@ -1,182 +0,0 @@
-# PRIIPS Document Extraction & RAG Workflow
-Complete workflow for extracting PRIIPS KID documents and querying with LLM context.
-## 📁 Directory Structure
-```
-priips_documents/
-├── raw/           # Place your PDF documents here
-├── extracted/     # Extracted JSON documents (auto-generated)
-└── processed/     # Chunked documents for RAG (future)
-scripts/
-├── extract_priips.py      # Extract text from PDFs
-└── query_with_context.py  # Query LLM with document context
-```
-## 🚀 Quick Start
-### 1. Add PRIIPS Documents
-Place PDF documents in `priips_documents/raw/`:
-```bash
-# Naming convention: {ISIN}_{ProductName}_{Date}.pdf
-cp /path/to/your/priips.pdf priips_documents/raw/LU1234567890_GlobalEquity_2024.pdf
-```
-### 2. Extract Document Content
-```bash
-# Extract all PDFs in the raw directory
-python scripts/extract_priips.py priips_documents/raw/
-# Or extract a single file
-python scripts/extract_priips.py priips_documents/raw/LU1234567890_GlobalEquity_2024.pdf
-```
-**Output:** JSON files in `priips_documents/extracted/` with structured content:
-- Metadata (ISIN, product name, dates)
-- Raw extracted text
-- Parsed sections (objectives, risks, costs, etc.)
-### 3. Query with RAG Context
-```bash
-# Ask questions about your documents
-python scripts/query_with_context.py "What is the recommended holding period?"
-python scripts/query_with_context.py "What are the main risks of this investment?"
-python scripts/query_with_context.py "Summarize the cost structure"
-```
-**Options:**
-```bash
-# Specify different extracted directory
-python scripts/query_with_context.py "Your question" --extracted-dir custom/path/
-# Control context size and response length
-python scripts/query_with_context.py "Your question" \
-  --max-context 3000 \
-  --max-tokens 800
-```
-## 📊 Example Workflow
-```bash
-# 1. Add a PRIIPS PDF
-cp MyFund.pdf priips_documents/raw/FR0012345678_MyFund_2024.pdf
-# 2. Extract content
-python scripts/extract_priips.py priips_documents/raw/
-# Output:
-# 📄 Processing: FR0012345678_MyFund_2024.pdf
-# ✅ Extracted 12,543 characters
-# 💾 Saved to: priips_documents/extracted/FR0012345678_MyFund_2024_extracted.json
-# 3. Query the LLM
-python scripts/query_with_context.py "What is the SRI of this fund?"
-# Output:
-# 📚 Loading documents from priips_documents/extracted...
-# ✅ Loaded 1 documents
-# 🔍 Querying LLM with 1,234 chars of context...
-# 📊 Tokens used: 234
-#
-# 💬 Answer:
-# Based on the PRIIPS document, the Summary Risk Indicator (SRI) for this fund is 5 out of 7...
-```
-## 🎯 Use Cases
-### Document Comparison
-```bash
-python scripts/query_with_context.py "Compare the risk profiles of all available funds"
-```
-### Specific Information Extraction
-```bash
-python scripts/query_with_context.py "Extract all recommended holding periods"
-python scripts/query_with_context.py "List all ISINs and their product names"
-```
-### Compliance Checks
-```bash
-python scripts/query_with_context.py "Are there any funds with SRI above 6?"
-python scripts/query_with_context.py "Which funds have holding periods under 3 years?"
-```
-## 🔧 Advanced: Integrate with PydanticAI
-```python
-from pydantic_ai import Agent
-from pydantic_ai.models.openai import OpenAIModel
-# Configure with your deployed service
-model = OpenAIModel(
-    'DragonLLM/qwen3-8b-fin-v1.0',
-    base_url='https://jeanbaptdzd-priips-llm-service.hf.space/v1',
-)
-agent = Agent(model=model)
-# Load PRIIPS context
-with open('priips_documents/extracted/LU123_extracted.json') as f:
-    context = json.load(f)
-# Query with context
-result = agent.run_sync(
-    f"Based on this PRIIPS document: {context['raw_text'][:2000]}... "
-    f"What is the recommended holding period?"
-)
-```
-## 📝 Extracted Document Schema
-```json
-{
-  "metadata": {
-    "filename": "LU1234567890_GlobalEquity_2024.pdf",
-    "extraction_date": "2024-10-28T16:24:00",
-    "isin": "LU1234567890",
-    "product_name": "GlobalEquity",
-    "file_size_bytes": 245678,
-    "text_length": 12543
-  },
-  "raw_text": "Full extracted text from PDF...",
-  "sections": {
-    "summary": "What is this product? ...",
-    "objectives": "Investment objectives and policy...",
-    "risk_indicator": "SRI: 5/7 ...",
-    "performance_scenarios": "Performance scenarios...",
-    "costs": "What are the costs? ...",
-    "holding_period": "Recommended: 5 years"
-  }
-}
-```
-## 🚀 Next Steps
-1. **Add More Documents:** Place additional PRIIPS PDFs in `raw/`
-2. **Enhance Extraction:** Improve section parsing in `extract_priips.py`
-3. **Add Embeddings:** Implement vector search for better RAG
-4. **Build API:** Create REST API endpoints for document queries
-5. **Dashboard:** Build web UI for document management and queries
-## 📚 API Integration
-The LLM service is OpenAI-compatible and deployed at:
-```
-https://jeanbaptdzd-priips-llm-service.hf.space/v1
-```
-**Endpoints:**
-- `GET /` - Service status
-- `GET /v1/models` - List available models
-- `POST /v1/chat/completions` - Chat completion with context
-See `test_service.py` for integration examples.

README.md CHANGED Viewed

@@ -19,6 +19,7 @@ OpenAI-compatible API and financial document processor powered by `DragonLLM/qwe
 This service provides:
 - **OpenAI-compatible API** at `/v1/models` and `/v1/chat/completions`
 - **PRIIPs extraction** at `/extract-priips` for structured financial document parsing
 - **Provider abstraction** for easy integration with PydanticAI/DSPy
 ## 📋 API Endpoints
@@ -35,9 +36,21 @@ curl -X GET "https://your-space-url.hf.space/v1/models"
 curl -X POST "https://your-space-url.hf.space/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "DragonLLM/gemma3-12b-fin-v0.3",
     "messages": [{"role": "user", "content": "Hello!"}],
-    "temperature": 0.7
   }'
 ```
@@ -83,10 +96,28 @@ curl -X POST "https://your-space-url.hf.space/extract-priips" \
 The service uses these environment variables:
 - `VLLM_BASE_URL`: vLLM server endpoint (default: `http://localhost:8000/v1`)
-- `MODEL`: Model name (default: `DragonLLM/LLM-Pro-Finance-Small`)
-- `SERVICE_API_KEY`: Optional API key for authentication
 - `LOG_LEVEL`: Logging level (default: `info`)
 ## 🔗 Integration Examples
@@ -96,7 +127,7 @@ from pydantic_ai import Agent
 from pydantic_ai.models.openai import OpenAIModel
 model = OpenAIModel(
-    "DragonLLM/gemma3-12b-fin-v0.3",
     base_url="https://your-space-url.hf.space/v1"
 )
@@ -108,7 +139,7 @@ agent = Agent(model=model)
 import dspy
 lm = dspy.OpenAI(
-    model="DragonLLM/gemma3-12b-fin-v0.3",
     api_base="https://your-space-url.hf.space/v1"
 )
 ```
@@ -155,4 +186,10 @@ MIT License - see LICENSE file for details.
 ---
-**Note**: This service requires a vLLM server running `DragonLLM/LLM-Pro-Finance-Small` model. For production use, ensure your vLLM server is properly configured and accessible.

 This service provides:
 - **OpenAI-compatible API** at `/v1/models` and `/v1/chat/completions`
 - **PRIIPs extraction** at `/extract-priips` for structured financial document parsing
+- **Streaming support** for real-time completions
 - **Provider abstraction** for easy integration with PydanticAI/DSPy
 ## 📋 API Endpoints
 curl -X POST "https://your-space-url.hf.space/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
+    "model": "DragonLLM/qwen3-8b-fin-v1.0",
     "messages": [{"role": "user", "content": "Hello!"}],
+    "temperature": 0.7,
+    "max_tokens": 1000
+  }'
+```
+#### Streaming Chat Completions
+```bash
+curl -X POST "https://your-space-url.hf.space/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "DragonLLM/qwen3-8b-fin-v1.0",
+    "messages": [{"role": "user", "content": "Tell me about finance"}],
+    "stream": true
   }'
 ```
 The service uses these environment variables:
+### Required for Model Access
+- **`HF_TOKEN_LC2`** (Recommended): Hugging Face token with access to DragonLLM models. Set this as a secret in your Hugging Face Space.
+  - Priority order: `HF_TOKEN_LC2` > `HF_TOKEN_LC` > `HF_TOKEN` > `HUGGING_FACE_HUB_TOKEN`
+  - The service automatically authenticates with Hugging Face Hub using this token
+  - **Important**: You must accept the model's terms at https://huggingface.co/DragonLLM/qwen3-8b-fin-v1.0 before the token will work
+### Optional Configuration
 - `VLLM_BASE_URL`: vLLM server endpoint (default: `http://localhost:8000/v1`)
+- `MODEL`: Model name (default: `DragonLLM/qwen3-8b-fin-v1.0`)
+- `SERVICE_API_KEY`: Optional API key for authentication (set via `x-api-key` header)
 - `LOG_LEVEL`: Logging level (default: `info`)
+- `VLLM_USE_EAGER`: Control optimization mode (default: `auto`)
+  - `auto`: Try optimized mode (CUDA graphs), fallback to eager if needed (recommended)
+  - `false`: Force optimized mode (CUDA graphs enabled, may fail if unsupported)
+  - `true`: Force eager mode (slower but more stable)
+### Setting Up HF_TOKEN_LC2 in Hugging Face Spaces
+1. Go to your Space settings → Secrets and variables
+2. Add a new secret named `HF_TOKEN_LC2`
+3. Set the value to your Hugging Face token with access to DragonLLM models
+4. Make sure you've accepted the terms for `DragonLLM/qwen3-8b-fin-v1.0` on Hugging Face
 ## 🔗 Integration Examples
 from pydantic_ai.models.openai import OpenAIModel
 model = OpenAIModel(
+    "DragonLLM/qwen3-8b-fin-v1.0",
     base_url="https://your-space-url.hf.space/v1"
 )
 import dspy
 lm = dspy.OpenAI(
+    model="DragonLLM/qwen3-8b-fin-v1.0",
     api_base="https://your-space-url.hf.space/v1"
 )
 ```
 ---
+**Note**: This service runs vLLM 0.9.2 (latest stable) with `DragonLLM/qwen3-8b-fin-v1.0` model. The service initializes the model automatically on startup. For production use, ensure proper GPU resources (L4 or better) are available.
+### Version Information
+- **vLLM:** 0.9.2 (upgraded from 0.6.5 - July 2025 release)
+- **PyTorch:** 2.5.0+ (CUDA 12.4)
+- **CUDA:** 12.4
+- See `VLLM_UPGRADE_ANALYSIS.md` for upgrade details

VLLM_COMPATIBILITY.md ADDED Viewed

	@@ -0,0 +1,152 @@

+# vLLM 0.6.5 + DragonLLM/qwen3-8b-fin-v1.0 Compatibility Analysis
+## Summary
+✅ **Status: LIKELY COMPATIBLE** - Configuration matches Qwen3 requirements
+## Current Configuration
+- **vLLM Version:** 0.9.2 ✅ (upgraded from 0.6.5)
+- **Model:** DragonLLM/qwen3-8b-fin-v1.0
+- **Architecture:** Qwen3
+- **PyTorch:** 2.5.0+cu124 (CUDA 12.4)
+- **Model Parameters:** ~8B (308.2K according to HF, but this seems like a reporting issue)
+**Upgrade Status:** Upgraded to vLLM 0.9.2 (July 2025) - provides significant improvements over 0.6.5 while maintaining CUDA 12.4 compatibility.
+## Compatibility Factors
+### ✅ Positive Indicators
+1. **Architecture Support**
+   - Model uses `qwen3` architecture
+   - Qwen models are generally well-supported in vLLM
+   - Code comment indicates: "vLLM: v0.6.5 (Qwen3 support + VLLM_USE_V1=0 for stability)"
+2. **Configuration Matches Requirements**
+   ```python
+   dtype="bfloat16"          # ✅ Required for Qwen3
+   trust_remote_code=True    # ✅ Required for custom architectures
+   enforce_eager=True        # ✅ Avoids CUDA graph issues
+   ```
+3. **Model Repository Info**
+   - Tags include: `text-generation-inference`, `endpoints_compatible`
+   - These tags suggest vLLM/TGI compatibility
+   - Uses `transformers` + `safetensors` format (vLLM compatible)
+4. **Environment Setup**
+   - `VLLM_USE_V1=0` - Using stable v0 engine
+   - Proper HF token authentication configured
+   - CUDA 12.4 with PyTorch 2.4.0
+### ⚠️ Potential Concerns
+1. **vLLM 0.6.5 Release Date**
+   - vLLM 0.6.5 was released in September 2024
+   - Qwen3 models may have been added in later versions
+   - **Action:** Monitor for compatibility issues during model loading
+2. **Model Size Reporting**
+   - HF shows "308.2K parameters" which seems incorrect for an 8B model
+   - This is likely a metadata issue, not a compatibility issue
+3. **Private Model Access**
+   - Model is private (requires authentication)
+   - Authentication is properly configured
+   - Must accept model terms on HF
+## Configuration Verification
+### Current vLLM Initialization
+```python
+llm_engine = LLM(
+    model="DragonLLM/qwen3-8b-fin-v1.0",
+    trust_remote_code=True,      # ✅ Required
+    dtype="bfloat16",            # ✅ Required for Qwen3
+    max_model_len=4096,          # ✅ Reasonable for L4 GPU
+    gpu_memory_utilization=0.85, # ✅ Good utilization
+    tensor_parallel_size=1,      # ✅ Single GPU
+    download_dir="/tmp/huggingface",
+    tokenizer_mode="auto",
+    enforce_eager=True,          # ✅ Stability
+    disable_log_stats=False,     # ✅ Debugging enabled
+)
+```
+### Environment Variables
+```bash
+VLLM_USE_V1=0                    # ✅ Use stable v0 engine
+CUDA_VISIBLE_DEVICES=0           # ✅ Single GPU
+HF_TOKEN (via HF_TOKEN_LC2)      # ✅ Authentication
+```
+## Testing Recommendations
+### 1. Test Model Loading
+```bash
+# Run the service and monitor startup logs
+# Check for these success indicators:
+- "✅ vLLM engine initialized successfully"
+- No architecture mismatch errors
+- Model loads without errors
+```
+### 2. Test Inference
+```python
+# Simple test request
+{
+    "model": "DragonLLM/qwen3-8b-fin-v1.0",
+    "messages": [{"role": "user", "content": "Hello"}],
+    "max_tokens": 50
+}
+```
+### 3. Monitor for Errors
+**If you see:**
+- `AttributeError: 'LlamaForCausalLM' object has no attribute 'qwen'`
+- `Model architecture not supported`
+- `dtype mismatch errors`
+**Then:** vLLM 0.6.5 may not fully support Qwen3, upgrade to vLLM 0.6.6+ or 0.7.0+
+## Upgrade Path (if needed)
+If compatibility issues occur:
+### Option 1: Upgrade vLLM (Recommended)
+```dockerfile
+# In Dockerfile, change:
+RUN pip install --no-cache-dir vllm==0.6.6
+# or
+RUN pip install --no-cache-dir vllm==0.7.0
+```
+### Option 2: Test with Latest
+```dockerfile
+RUN pip install --no-cache-dir vllm>=0.7.0
+```
+## Verification Checklist
+- [x] Model architecture: Qwen3 ✅
+- [x] dtype: bfloat16 ✅
+- [x] trust_remote_code: True ✅
+- [x] Authentication configured ✅
+- [x] PyTorch 2.4.0 with CUDA 12.4 ✅
+- [ ] Model loads successfully (test on deployment)
+- [ ] Inference works correctly (test on deployment)
+## Conclusion
+Based on the configuration and model metadata, **DragonLLM/qwen3-8b-fin-v1.0 should be compatible with vLLM 0.6.5**. The configuration follows best practices for Qwen models.
+**However**, since Qwen3 is a relatively new architecture, monitor the first deployment closely. If you encounter any architecture-related errors, upgrading to vLLM 0.6.6+ or 0.7.0+ is recommended.
+## References
+- Model: https://huggingface.co/DragonLLM/qwen3-8b-fin-v1.0
+- vLLM Docs: https://docs.vllm.ai/en/stable/models/supported_models.html
+- Qwen3 Architecture: Uses bfloat16, requires trust_remote_code

VLLM_UPGRADE_ANALYSIS.md ADDED Viewed

	@@ -0,0 +1,191 @@

+# vLLM Upgrade Analysis: 0.6.5 → Latest
+## Current Status
+- **Current Version:** vLLM 0.6.5 (September 2024)
+- **Latest Version:** vLLM 0.10.2 (October 2025) or 0.9.2
+- **Version Gap:** ~14+ months of updates
+## Latest Version Information
+### vLLM 0.10.2 (Latest - October 2025)
+- **CUDA Support:** CUDA 13.0.2
+- **PyTorch:** Likely requires newer PyTorch version
+- **New Features:**
+  - Multi-node configurations
+  - FP8 precision support (Hopper+ GPUs)
+  - NVFP4 format (Blackwell GPUs)
+  - DeepSeek-R1 and Llama-3.1-8B-Instruct support
+  - RTX PRO 6000 Blackwell Server Edition support
+### vLLM 0.9.2 (Stable - October 2025)
+- More stable release track
+- Improved GPU architecture support
+- Better memory management
+- Likely better Qwen3 support
+## Current Setup Requirements
+### Our Current Configuration
+- **CUDA:** 12.4
+- **PyTorch:** 2.4.0+cu124
+- **Python:** 3.11
+- **GPU:** L4 (24GB VRAM)
+- **Model:** Qwen3-8B
+## Compatibility Considerations
+### ⚠️ Potential Issues Upgrading to 0.10.x
+1. **CUDA 13.0.2 Requirement**
+   - vLLM 0.10.2 supports CUDA 13.0.2
+   - We're on CUDA 12.4
+   - **Solution:** May need CUDA 13 base image OR use vLLM 0.9.x which likely supports CUDA 12.x
+2. **PyTorch Version**
+   - Newer vLLM may require PyTorch 2.5+
+   - Current: PyTorch 2.4.0
+   - **Action:** Check vLLM 0.9.x requirements
+3. **Python Version**
+   - vLLM 0.9+ may require Python 3.11+
+   - Current: Python 3.11 ✅
+   - **Status:** Compatible
+### ✅ Benefits of Upgrading
+1. **Better Qwen3 Support**
+   - Newer versions likely have improved Qwen3 compatibility
+   - Better CUDA graph support
+   - More stable inference
+2. **Performance Improvements**
+   - Better memory management
+   - Optimized kernels
+   - Improved throughput
+3. **Bug Fixes**
+   - 14+ months of bug fixes
+   - Security updates
+   - Stability improvements
+4. **Feature Updates**
+   - Better streaming support
+   - Improved API compatibility
+   - New optimizations
+## Recommended Upgrade Path
+### Option 1: Upgrade to vLLM 0.9.x (Recommended)
+**Why:**
+- Better balance of features and stability
+- Likely still supports CUDA 12.4
+- Better Qwen3 support than 0.6.5
+- Not as bleeding edge as 0.10.x
+**Changes Needed:**
+```dockerfile
+# Update Dockerfile
+RUN pip install --no-cache-dir vllm>=0.9.0,<0.10.0
+# May need to update PyTorch:
+RUN pip install --no-cache-dir \
+    torch>=2.5.0 \
+    --index-url https://download.pytorch.org/whl/cu124
+```
+### Option 2: Upgrade to vLLM 0.10.x (If CUDA 13 available)
+**Why:**
+- Latest features and optimizations
+- Best performance improvements
+**Changes Needed:**
+```dockerfile
+# Update base image to CUDA 13
+FROM nvidia/cuda:13.0.2-devel-ubuntu22.04
+# Update PyTorch for CUDA 13
+RUN pip install --no-cache-dir \
+    torch>=2.5.0 \
+    --index-url https://download.pytorch.org/whl/cu130
+# Install latest vLLM
+RUN pip install --no-cache-dir vllm>=0.10.0
+```
+### Option 3: Gradual Upgrade (Safest)
+1. **First:** Upgrade to vLLM 0.7.x or 0.8.x
+   - Test Qwen3 compatibility
+   - Verify performance
+2. **Then:** Move to 0.9.x
+   - Test thoroughly
+   - Monitor stability
+3. **Finally:** Consider 0.10.x if needed
+## Code Changes Required
+### Minimal Changes Expected
+1. **Environment Variables**
+   - `VLLM_USE_V1=0` may no longer be needed (v1 engine is default in newer versions)
+   - May need to update or remove
+2. **API Changes**
+   - LLM initialization likely compatible
+   - Some parameters may be deprecated
+   - Check release notes
+3. **Streaming**
+   - Better streaming support in newer versions
+   - May need to update streaming implementation
+## Testing Checklist
+After upgrading:
+- [ ] Model loads successfully
+- [ ] Qwen3 architecture works
+- [ ] CUDA graphs work (optimized mode)
+- [ ] Inference produces correct results
+- [ ] Streaming works
+- [ ] Memory usage acceptable
+- [ ] Performance improved/stable
+- [ ] No regressions in API compatibility
+## Recommendations
+### Immediate Action: Upgrade to vLLM 0.9.x
+**Reasoning:**
+1. Still supports CUDA 12.4 (no base image change needed)
+2. Much better than 0.6.5
+3. Better Qwen3 support
+4. More stable than 0.10.x
+5. Significant improvements without breaking changes
+**Steps:**
+1. Update Dockerfile to use vLLM 0.9.2
+2. Update PyTorch to 2.5+ (may be needed)
+3. Test on deployment
+4. Monitor for issues
+### Future Consideration: vLLM 0.10.x
+Only if:
+- CUDA 13 becomes available
+- Need specific 0.10.x features
+- 0.9.x proves insufficient
+## Summary
+**Current:** vLLM 0.6.5 (old, but working)
+**Recommended:** vLLM 0.9.2 (good balance)
+**Latest:** vLLM 0.10.2 (requires CUDA 13)
+**Action:** Upgrade to vLLM 0.9.2 for best compatibility with current setup while gaining significant improvements.

app/models/openai.py CHANGED Viewed

@@ -11,11 +11,12 @@ class Message(BaseModel):
 class ChatCompletionRequest(BaseModel):
-    model: str
     messages: List[Message]
-    temperature: Optional[float] = 0.2
-    max_tokens: Optional[int] = Field(default=None, alias="max_tokens")
     stream: Optional[bool] = False
 class ChoiceMessage(BaseModel):

 class ChatCompletionRequest(BaseModel):
+    model: Optional[str] = None  # Optional, will use default from config
     messages: List[Message]
+    temperature: Optional[float] = 0.7
+    max_tokens: Optional[int] = None
     stream: Optional[bool] = False
+    top_p: Optional[float] = 1.0
 class ChoiceMessage(BaseModel):

app/providers/vllm.py CHANGED Viewed

@@ -1,7 +1,7 @@
 import os
-from typing import Dict, Any, AsyncIterator
 from vllm import LLM, SamplingParams
-from vllm.entrypoints.openai.api_server import build_async_engine_client
 import asyncio
 from huggingface_hub import login
@@ -10,60 +10,159 @@ model_name = "DragonLLM/qwen3-8b-fin-v1.0"
 llm_engine = None
 def initialize_vllm():
-    """Initialize vLLM engine with the model"""
     global llm_engine
     if llm_engine is None:
         print(f"Initializing vLLM with model: {model_name}")
         # Get HF token from environment (Hugging Face Space secret)
-        # Try HF_TOKEN_LC2 first (for DragonLLM access), then fall back to HF_TOKEN_LC
-        hf_token = os.getenv("HF_TOKEN_LC2") or os.getenv("HF_TOKEN_LC")
         if hf_token:
-            token_source = "HF_TOKEN_LC2" if os.getenv("HF_TOKEN_LC2") else "HF_TOKEN_LC"
             print(f"✅ {token_source} found (length: {len(hf_token)})")
-            # Properly authenticate with Hugging Face Hub
             try:
                 login(token=hf_token, add_to_git_credential=False)
                 print("✅ Successfully authenticated with Hugging Face Hub")
             except Exception as e:
                 print(f"⚠️  Warning: Failed to authenticate with HF Hub: {e}")
-            # Also set environment variables as fallback
             os.environ["HF_TOKEN"] = hf_token
             os.environ["HUGGING_FACE_HUB_TOKEN"] = hf_token
         else:
-            print("⚠️  WARNING: Neither HF_TOKEN_LC2 nor HF_TOKEN_LC found in environment!")
-            print("Available env vars:", list(os.environ.keys()))
         try:
-            # Initialize vLLM engine with explicit token
             print(f"Attempting to load model: {model_name}")
-            print(f"Model type: DragonLLM Qwen3 8B (bfloat16) - Back to working combo")
             print(f"Download directory: /tmp/huggingface")
             print(f"Trust remote code: True")
             print(f"L4 GPU: 24GB VRAM available")
-            print(f"Mode: Eager mode (CUDA graphs disabled for L4)")
             print(f"GPU memory utilization: 0.85")
-            print(f"vLLM: v0.6.5 (Qwen3 support + VLLM_USE_V1=0 for stability)")
-            print(f"PyTorch: 2.4.0+cu124 (CUDA 12.4 binary)")
-            llm_engine = LLM(
-                model=model_name,
-                trust_remote_code=True,
-                dtype="bfloat16",  # Use bfloat16 for Qwen3 (required)
-                max_model_len=4096,  # Reduced for L4 KV cache constraints
-                gpu_memory_utilization=0.85,  # Can use more with stable v0 engine
-                tensor_parallel_size=1,  # Single L4 GPU
-                download_dir="/tmp/huggingface",
-                tokenizer_mode="auto",
-                # Disable torch.compile on L4 due to memory constraints
-                enforce_eager=True,  # Use eager mode (no CUDA graphs/compilation)
-                # Let vLLM handle compilation and fallback gracefully
-                disable_log_stats=False,  # Enable logging for debugging
-            )
-            print(f"✅ vLLM engine initialized successfully with {model_name}!")
         except Exception as e:
-            print(f"❌ Error initializing vLLM: {e}")
             raise
@@ -88,7 +187,7 @@ class VLLMProvider:
             ]
         }
-    async def chat(self, payload: Dict[str, Any], stream: bool = False) -> Dict[str, Any]:
         import logging
         logger = logging.getLogger(__name__)
@@ -115,7 +214,11 @@ class VLLMProvider:
                 max_tokens=max_tokens,
             )
-            # Generate response using vLLM
             outputs = llm_engine.generate([prompt], sampling_params)
             # Extract the generated text
@@ -123,11 +226,16 @@ class VLLMProvider:
             logger.info(f"Generated text: {generated_text[:100]}...")
             # Build OpenAI-compatible response
             return {
-                "id": f"chatcmpl-{os.urandom(12).hex()}",
                 "object": "chat.completion",
-                "created": int(asyncio.get_event_loop().time()),
-                "model": model_name,
                 "choices": [
                     {
                         "index": 0,
@@ -139,15 +247,92 @@ class VLLMProvider:
                     }
                 ],
                 "usage": {
-                    "prompt_tokens": len(outputs[0].prompt_token_ids),
-                    "completion_tokens": len(outputs[0].outputs[0].token_ids),
-                    "total_tokens": len(outputs[0].prompt_token_ids) + len(outputs[0].outputs[0].token_ids)
                 }
             }
         except Exception as e:
             logger.error(f"Error in chat completion: {str(e)}", exc_info=True)
             raise
     def _messages_to_prompt(self, messages: list) -> str:
         """Convert OpenAI messages format to prompt"""
         prompt = ""
@@ -162,3 +347,18 @@ class VLLMProvider:
                 prompt += f"Assistant: {content}\n"
         prompt += "Assistant: "
         return prompt

 import os
+import time
+from typing import Dict, Any, AsyncIterator, Union
 from vllm import LLM, SamplingParams
 import asyncio
 from huggingface_hub import login
 llm_engine = None
 def initialize_vllm():
+    """Initialize vLLM engine with the model
+    Handles authentication with Hugging Face Hub for accessing DragonLLM models.
+    Prioritizes HF_TOKEN_LC2 (DragonLLM access) over HF_TOKEN_LC.
+    """
     global llm_engine
     if llm_engine is None:
+        import logging
+        logger = logging.getLogger(__name__)
+        logger.info(f"Initializing vLLM with model: {model_name}")
         print(f"Initializing vLLM with model: {model_name}")
         # Get HF token from environment (Hugging Face Space secret)
+        # Priority: HF_TOKEN_LC2 (for DragonLLM access) > HF_TOKEN_LC > HF_TOKEN
+        hf_token = (
+            os.getenv("HF_TOKEN_LC2") or
+            os.getenv("HF_TOKEN_LC") or
+            os.getenv("HF_TOKEN") or
+            os.getenv("HUGGING_FACE_HUB_TOKEN")
+        )
         if hf_token:
+            # Determine token source for logging
+            if os.getenv("HF_TOKEN_LC2"):
+                token_source = "HF_TOKEN_LC2"
+            elif os.getenv("HF_TOKEN_LC"):
+                token_source = "HF_TOKEN_LC"
+            elif os.getenv("HF_TOKEN"):
+                token_source = "HF_TOKEN"
+            else:
+                token_source = "HUGGING_FACE_HUB_TOKEN"
+            logger.info(f"✅ {token_source} found (length: {len(hf_token)})")
             print(f"✅ {token_source} found (length: {len(hf_token)})")
+            # Authenticate with Hugging Face Hub
             try:
                 login(token=hf_token, add_to_git_credential=False)
+                logger.info("✅ Successfully authenticated with Hugging Face Hub")
                 print("✅ Successfully authenticated with Hugging Face Hub")
             except Exception as e:
+                logger.warning(f"⚠️  Warning: Failed to authenticate with HF Hub: {e}")
                 print(f"⚠️  Warning: Failed to authenticate with HF Hub: {e}")
+            # Set all possible environment variables that vLLM/huggingface_hub might check
+            # This ensures compatibility across different versions
             os.environ["HF_TOKEN"] = hf_token
             os.environ["HUGGING_FACE_HUB_TOKEN"] = hf_token
+            # Some tools check for these variants too
+            os.environ["HF_API_TOKEN"] = hf_token
+            logger.info("✅ Hugging Face token environment variables set")
         else:
+            logger.warning("⚠️  WARNING: No HF token found in environment!")
+            logger.warning(f"   Checked: HF_TOKEN_LC2, HF_TOKEN_LC, HF_TOKEN, HUGGING_FACE_HUB_TOKEN")
+            logger.warning(f"   Available env vars: {[k for k in os.environ.keys() if 'TOKEN' in k or 'HF' in k]}")
+            print("⚠️  WARNING: No HF token found in environment!")
+            print(f"   Checked: HF_TOKEN_LC2, HF_TOKEN_LC, HF_TOKEN, HUGGING_FACE_HUB_TOKEN")
+            print(f"   Available env vars with 'TOKEN' or 'HF': {[k for k in os.environ.keys() if 'TOKEN' in k or 'HF' in k]}")
+            print("   ⚠️  Model download may fail if DragonLLM/qwen3-8b-fin-v1.0 is gated!")
         try:
+            # Initialize vLLM engine
+            # Note: vLLM 0.6.5 will use HF_TOKEN from environment for model downloads
+            logger.info(f"Attempting to load model: {model_name}")
             print(f"Attempting to load model: {model_name}")
+            print(f"Model type: DragonLLM Qwen3 8B (bfloat16)")
             print(f"Download directory: /tmp/huggingface")
             print(f"Trust remote code: True")
             print(f"L4 GPU: 24GB VRAM available")
+            # Try optimized mode first (CUDA graphs enabled)
+            # Falls back to eager mode if CUDA graphs fail
+            use_optimized = os.getenv("VLLM_USE_EAGER", "auto").lower()
+            if use_optimized == "true":
+                enforce_eager = True
+                mode_desc = "Eager mode (forced)"
+            elif use_optimized == "false":
+                enforce_eager = False
+                mode_desc = "Optimized mode (CUDA graphs enabled)"
+            else:  # "auto" - try optimized, fallback to eager
+                enforce_eager = False
+                mode_desc = "Optimized mode (auto, fallback to eager if needed)"
+            print(f"Mode: {mode_desc}")
             print(f"GPU memory utilization: 0.85")
+            print(f"vLLM: v0.9.2 (Latest stable, improved Qwen3 support)")
+            print(f"PyTorch: 2.5.0+ (CUDA 12.4 binary)")
+            # Common initialization parameters
+            init_params = {
+                "model": model_name,
+                "trust_remote_code": True,
+                "dtype": "bfloat16",  # Use bfloat16 for Qwen3 (required)
+                "max_model_len": 4096,  # Reduced for L4 KV cache constraints
+                "gpu_memory_utilization": 0.85,  # Can use more with stable v0 engine
+                "tensor_parallel_size": 1,  # Single L4 GPU
+                "download_dir": "/tmp/huggingface",
+                "tokenizer_mode": "auto",
+                "disable_log_stats": False,  # Enable logging for debugging
+            }
+            # Try optimized mode first (unless explicitly disabled)
+            if use_optimized == "auto" or use_optimized == "false":
+                try:
+                    print(f"🚀 Attempting optimized mode with CUDA graphs...")
+                    logger.info("Attempting optimized mode (enforce_eager=False)")
+                    init_params["enforce_eager"] = False
+                    llm_engine = LLM(**init_params)
+                    print(f"✅ vLLM engine initialized successfully in OPTIMIZED mode!")
+                    logger.info("✅ vLLM engine initialized in optimized mode (CUDA graphs enabled)")
+                except Exception as opt_error:
+                    error_msg = str(opt_error).lower()
+                    # Check if error is CUDA graph related
+                    if "cuda graph" in error_msg or "graph" in error_msg or use_optimized == "auto":
+                        logger.warning(f"⚠️  Optimized mode failed, falling back to eager mode: {opt_error}")
+                        print(f"⚠️  Optimized mode failed: {opt_error}")
+                        print(f"🔄 Falling back to eager mode for stability...")
+                        init_params["enforce_eager"] = True
+                        llm_engine = LLM(**init_params)
+                        print(f"✅ vLLM engine initialized successfully in EAGER mode (fallback)")
+                        logger.info("✅ vLLM engine initialized in eager mode (fallback after optimized mode failure)")
+                    else:
+                        # Re-raise if it's not a CUDA graph issue or if optimized is forced
+                        raise
+            else:
+                # Eager mode explicitly requested
+                print(f"⚙️  Using eager mode (explicitly requested)")
+                logger.info("Using eager mode (VLLM_USE_EAGER=true)")
+                init_params["enforce_eager"] = True
+                llm_engine = LLM(**init_params)
+                print(f"✅ vLLM engine initialized successfully in EAGER mode!")
+                logger.info("✅ vLLM engine initialized in eager mode")
         except Exception as e:
+            error_msg = f"❌ Error initializing vLLM: {e}"
+            logger.error(error_msg, exc_info=True)
+            print(error_msg)
+            # Provide helpful error message for authentication issues
+            if "401" in str(e) or "Unauthorized" in str(e) or "authentication" in str(e).lower():
+                print("\n🔐 Authentication Error Detected!")
+                print("   This usually means:")
+                print("   1. HF_TOKEN_LC2 is missing or invalid")
+                print("   2. You haven't accepted the model's terms on Hugging Face")
+                print("   3. The token doesn't have access to DragonLLM models")
+                print("\n   To fix:")
+                print("   1. Visit: https://huggingface.co/DragonLLM/qwen3-8b-fin-v1.0")
+                print("   2. Accept the model's terms of use")
+                print("   3. Ensure HF_TOKEN_LC2 is set as a secret in your HF Space")
             raise
             ]
         }
+    async def chat(self, payload: Dict[str, Any], stream: bool = False) -> Union[Dict[str, Any], AsyncIterator[str]]:
         import logging
         logger = logging.getLogger(__name__)
                 max_tokens=max_tokens,
             )
+            # Handle streaming vs non-streaming
+            if stream:
+                return self._chat_stream(prompt, sampling_params, payload.get("model", model_name))
+            # Generate response using vLLM (non-streaming)
             outputs = llm_engine.generate([prompt], sampling_params)
             # Extract the generated text
             logger.info(f"Generated text: {generated_text[:100]}...")
             # Build OpenAI-compatible response
+            completion_id = f"chatcmpl-{os.urandom(12).hex()}"
+            created = int(time.time())
+            prompt_tokens = len(outputs[0].prompt_token_ids)
+            completion_tokens = len(outputs[0].outputs[0].token_ids)
             return {
+                "id": completion_id,
                 "object": "chat.completion",
+                "created": created,
+                "model": payload.get("model", model_name),
                 "choices": [
                     {
                         "index": 0,
                     }
                 ],
                 "usage": {
+                    "prompt_tokens": prompt_tokens,
+                    "completion_tokens": completion_tokens,
+                    "total_tokens": prompt_tokens + completion_tokens
                 }
             }
         except Exception as e:
             logger.error(f"Error in chat completion: {str(e)}", exc_info=True)
             raise
+    async def _chat_stream(self, prompt: str, sampling_params: SamplingParams, model: str) -> AsyncIterator[str]:
+        """Stream chat completions using vLLM
+        Note: vLLM 0.6.5 with synchronous LLM doesn't support true streaming.
+        This implementation generates the full response and yields it in chunks
+        for OpenAI API compatibility. For true streaming, use AsyncLLMEngine.
+        """
+        import logging
+        logger = logging.getLogger(__name__)
+        completion_id = f"chatcmpl-{os.urandom(12).hex()}"
+        created = int(time.time())
+        # Generate response (non-streaming backend, but we'll chunk it)
+        # Run in thread pool to avoid blocking
+        loop = asyncio.get_event_loop()
+        outputs = await loop.run_in_executor(
+            None,
+            lambda: llm_engine.generate([prompt], sampling_params)
+        )
+        generated_text = outputs[0].outputs[0].text
+        finish_reason = outputs[0].outputs[0].finish_reason or "stop"
+        # Yield text in chunks (simulate streaming)
+        # Split into reasonable chunks (words or characters)
+        chunk_size = 10  # words per chunk
+        words = generated_text.split()
+        for i in range(0, len(words), chunk_size):
+            chunk_words = words[i:i + chunk_size]
+            delta_text = " ".join(chunk_words)
+            if i + chunk_size < len(words):
+                delta_text += " "
+            # Format as OpenAI SSE stream chunk
+            chunk = {
+                "id": completion_id,
+                "object": "chat.completion.chunk",
+                "created": created,
+                "model": model,
+                "choices": [
+                    {
+                        "index": 0,
+                        "delta": {
+                            "content": delta_text
+                        },
+                        "finish_reason": None
+                    }
+                ]
+            }
+            yield f"data: {self._json_dumps(chunk)}\n\n"
+            await asyncio.sleep(0)  # Yield control
+        # Send final chunk with finish_reason
+        final_chunk = {
+            "id": completion_id,
+            "object": "chat.completion.chunk",
+            "created": created,
+            "model": model,
+            "choices": [
+                {
+                    "index": 0,
+                    "delta": {},
+                    "finish_reason": finish_reason
+                }
+            ]
+        }
+        yield f"data: {self._json_dumps(final_chunk)}\n\n"
+        yield "data: [DONE]\n\n"
+    def _json_dumps(self, obj: Dict[str, Any]) -> str:
+        """JSON dump helper"""
+        import json
+        return json.dumps(obj, ensure_ascii=False)
     def _messages_to_prompt(self, messages: list) -> str:
         """Convert OpenAI messages format to prompt"""
         prompt = ""
                 prompt += f"Assistant: {content}\n"
         prompt += "Assistant: "
         return prompt
+# Module-level provider instance for backward compatibility
+_provider = VLLMProvider()
+# Module-level functions for direct import
+async def list_models() -> Dict[str, Any]:
+    """List available models"""
+    return await _provider.list_models()
+async def chat(payload: Dict[str, Any], stream: bool = False) -> Union[Dict[str, Any], AsyncIterator[str]]:
+    """Chat completion"""
+    return await _provider.chat(payload, stream=stream)

app/routers/openai_api.py CHANGED Viewed

@@ -1,5 +1,5 @@
-import time
 from typing import Any, Dict
 from fastapi import APIRouter
 from fastapi.responses import StreamingResponse, JSONResponse
@@ -8,49 +8,45 @@ from app.config import settings
 from app.models.openai import ChatCompletionRequest
 from app.services import chat_service
 router = APIRouter()
 @router.get("/models")
 async def list_models():
     return await chat_service.list_models()
 @router.post("/chat/completions")
 async def chat_completions(body: ChatCompletionRequest):
-    import logging
-    logger = logging.getLogger(__name__)
     try:
         payload: Dict[str, Any] = {
             "model": body.model or settings.model,
             "messages": [m.model_dump() for m in body.messages],
-            "temperature": body.temperature,
-            **({"max_tokens": body.max_tokens} if body.max_tokens is not None else {}),
             "stream": body.stream or False,
         }
-        logger.info(f"Chat completion request: {payload}")
         if body.stream:
-            upstream = await chat_service.chat(payload, stream=True)
-            async def event_stream():
-                async for line in upstream.aiter_lines():
-                    if not line:
-                        continue
-                    if line.startswith("data:"):
-                        yield f"{line}\n\n"
-                    else:
-                        yield f"data: {line}\n\n"
-            return StreamingResponse(event_stream(), media_type="text/event-stream")
         data = await chat_service.chat(payload, stream=False)
-        # Assume vLLM already returns OpenAI-compatible schema; pass through.
-        # If needed, normalize here.
         return JSONResponse(content=data)
     except Exception as e:
         logger.error(f"Error in chat completions endpoint: {str(e)}", exc_info=True)
         return JSONResponse(

 from typing import Any, Dict
+import logging
 from fastapi import APIRouter
 from fastapi.responses import StreamingResponse, JSONResponse
 from app.models.openai import ChatCompletionRequest
 from app.services import chat_service
+logger = logging.getLogger(__name__)
 router = APIRouter()
 @router.get("/models")
 async def list_models():
+    """List available models (OpenAI-compatible endpoint)"""
     return await chat_service.list_models()
 @router.post("/chat/completions")
 async def chat_completions(body: ChatCompletionRequest):
+    """Chat completions endpoint (OpenAI-compatible)"""
     try:
+        # Build payload with all supported parameters
         payload: Dict[str, Any] = {
             "model": body.model or settings.model,
             "messages": [m.model_dump() for m in body.messages],
+            "temperature": body.temperature or 0.7,
+            "top_p": body.top_p or 1.0,
             "stream": body.stream or False,
         }
+        # Add optional max_tokens if provided
+        if body.max_tokens is not None:
+            payload["max_tokens"] = body.max_tokens
+        logger.info(f"Chat completion request: model={payload['model']}, messages={len(payload['messages'])}, stream={payload['stream']}")
         if body.stream:
+            stream = await chat_service.chat(payload, stream=True)
+            # stream is already an AsyncIterator[str] with SSE-formatted chunks
+            return StreamingResponse(stream, media_type="text/event-stream")
+        # Non-streaming response
         data = await chat_service.chat(payload, stream=False)
         return JSONResponse(content=data)
     except Exception as e:
         logger.error(f"Error in chat completions endpoint: {str(e)}", exc_info=True)
         return JSONResponse(

app/services/chat_service.py CHANGED Viewed

@@ -1,12 +1,12 @@
 from typing import Any, Dict
-from app.providers.vllm import VLLMProvider
-# Initialize the provider
-provider = VLLMProvider()
 async def list_models() -> Dict[str, Any]:
     return await provider.list_models()
 async def chat(payload: Dict[str, Any], stream: bool = False):
     return await provider.chat(payload, stream=stream)

 from typing import Any, Dict
+from app.providers import vllm as provider
 async def list_models() -> Dict[str, Any]:
     return await provider.list_models()
 async def chat(payload: Dict[str, Any], stream: bool = False):
     return await provider.chat(payload, stream=stream)

requirements-dev.txt CHANGED Viewed

@@ -6,6 +6,9 @@ pytest>=7.4.0
 pytest-asyncio>=0.21.0
 openai>=1.0.0
-# Performance testing
-httpx>=0.27.0


6	pytest-asyncio>=0.21.0
7	openai>=1.0.0
8
9	+
10	+
11	+
12	+
13	+
14

requirements.txt CHANGED Viewed

@@ -1,5 +1,8 @@
-# Dependencies installed in Dockerfile during HF Space build
-vllm
 fastapi>=0.115.0
 uvicorn[standard]>=0.30.0
 pydantic>=2.8.0
@@ -8,4 +11,5 @@ httpx>=0.27.0
 python-dotenv>=1.0.1
 tenacity>=8.3.0
 PyMuPDF>=1.24.0
-pytest>=7.4.0

+# Core dependencies for OpenAI-compatible API service
+# Note: vLLM and PyTorch are installed separately in Dockerfile for CUDA support
+# vllm==0.6.5  # Installed in Dockerfile
+# torch==2.4.0  # Installed in Dockerfile
 fastapi>=0.115.0
 uvicorn[standard]>=0.30.0
 pydantic>=2.8.0
 python-dotenv>=1.0.1
 tenacity>=8.3.0
 PyMuPDF>=1.24.0
+python-multipart>=0.0.6
+huggingface-hub>=0.20.0

scripts/check_vllm_compatibility.py ADDED Viewed

	@@ -0,0 +1,258 @@

+#!/usr/bin/env python3
+"""
+Check compatibility between DragonLLM/qwen3-8b-fin-v1.0 and vLLM 0.6.5
+This script verifies:
+1. vLLM version installed
+2. Model architecture support
+3. Configuration compatibility
+4. Known issues or limitations
+"""
+import sys
+import subprocess
+from pathlib import Path
+# Add parent directory to path
+sys.path.insert(0, str(Path(__file__).parent.parent))
+try:
+    import vllm
+    from vllm import LLM
+    from vllm.model_executor.models import MODEL_REGISTRY
+except ImportError:
+    print("❌ Error: vLLM not installed")
+    print("   Install it with: pip install vllm==0.6.5")
+    sys.exit(1)
+try:
+    from huggingface_hub import model_info
+    from huggingface_hub.utils import HfHubHTTPError
+except ImportError:
+    print("⚠️  Warning: huggingface_hub not installed")
+    print("   Some checks will be skipped")
+    model_info = None
+MODEL_NAME = "DragonLLM/qwen3-8b-fin-v1.0"
+VLLM_VERSION = "0.6.5"
+def check_vllm_version():
+    """Check installed vLLM version"""
+    print("\n" + "="*70)
+    print("CHECK 1: vLLM Version")
+    print("="*70)
+    installed_version = vllm.__version__
+    print(f"Installed vLLM version: {installed_version}")
+    print(f"Expected version: {VLLM_VERSION}")
+    if installed_version == VLLM_VERSION:
+        print("✅ Version matches!")
+        return True
+    elif installed_version.startswith("0.6"):
+        print(f"⚠️  Version mismatch: {installed_version} (expected {VLLM_VERSION})")
+        print("   This should be compatible but may have differences")
+        return True
+    else:
+        print(f"❌ Version mismatch: {installed_version}")
+        print(f"   This may cause compatibility issues")
+        return False
+def check_model_registry():
+    """Check if Qwen3 is in vLLM's model registry"""
+    print("\n" + "="*70)
+    print("CHECK 2: Model Architecture Support")
+    print("="*70)
+    # Get all registered models
+    registered_models = list(MODEL_REGISTRY.keys())
+    # Look for Qwen variants
+    qwen_models = [m for m in registered_models if 'qwen' in m.lower()]
+    print(f"Total models in registry: {len(registered_models)}")
+    print(f"Qwen-related models found: {len(qwen_models)}")
+    if qwen_models:
+        print("\n✅ Qwen models found in registry:")
+        for model in sorted(qwen_models):
+            print(f"   - {model}")
+        # Check specifically for Qwen3
+        qwen3_models = [m for m in qwen_models if 'qwen3' in m.lower() or '3' in m]
+        if qwen3_models:
+            print("\n✅ Qwen3 support detected!")
+            for model in qwen3_models:
+                print(f"   - {model}")
+            return True
+        else:
+            print("\n⚠️  Qwen models found but Qwen3 specifically not detected")
+            print("   Qwen3 might be handled by a generic Qwen loader")
+            return True  # Still likely compatible
+    else:
+        print("\n❌ No Qwen models found in registry")
+        print("   This suggests Qwen3 may not be supported")
+        return False
+def check_model_info():
+    """Check model information from Hugging Face"""
+    print("\n" + "="*70)
+    print("CHECK 3: Model Information")
+    print("="*70)
+    if not model_info:
+        print("⚠️  Skipping (huggingface_hub not available)")
+        return None
+    try:
+        info = model_info(MODEL_NAME, token=True)
+        print(f"Model: {MODEL_NAME}")
+        print(f"Architecture: {info.config.get('architectures', ['Unknown'])[0] if hasattr(info, 'config') else 'qwen3'}")
+        # Check model config
+        if hasattr(info, 'config') and info.config:
+            config = info.config
+            print(f"\nModel Configuration:")
+            # Check for Qwen-specific config
+            if 'qwen' in str(config).lower():
+                print("   ✅ Qwen architecture detected in config")
+            # Check for required fields
+            if hasattr(config, 'torch_dtype') or 'torch_dtype' in str(config):
+                print(f"   ✅ torch_dtype found")
+            if 'bfloat16' in str(config).lower():
+                print(f"   ✅ bfloat16 support confirmed")
+        return True
+    except HfHubHTTPError as e:
+        if e.response.status_code == 401:
+            print(f"❌ Unauthorized: Need to accept model terms")
+            print(f"   Visit: https://huggingface.co/{MODEL_NAME}")
+            return False
+        else:
+            print(f"❌ Error accessing model: {e}")
+            return False
+    except Exception as e:
+        print(f"⚠️  Could not fetch model info: {e}")
+        return None
+def check_configuration():
+    """Check if the configuration used is compatible"""
+    print("\n" + "="*70)
+    print("CHECK 4: Configuration Compatibility")
+    print("="*70)
+    print("Current configuration:")
+    print(f"   - dtype: bfloat16")
+    print(f"   - trust_remote_code: True")
+    print(f"   - enforce_eager: True")
+    print(f"   - max_model_len: 4096")
+    # Check if bfloat16 is supported
+    try:
+        import torch
+        if torch.cuda.is_bf16_supported():
+            print("   ✅ CUDA supports bfloat16")
+        else:
+            print("   ⚠️  CUDA may not fully support bfloat16")
+    except Exception:
+        pass
+    print("\n✅ Configuration looks compatible")
+    print("   - bfloat16: Required for Qwen3")
+    print("   - trust_remote_code: Required for custom architectures")
+    print("   - enforce_eager: Recommended for stability")
+    return True
+def check_known_issues():
+    """Check for known compatibility issues"""
+    print("\n" + "="*70)
+    print("CHECK 5: Known Issues / Compatibility Notes")
+    print("="*70)
+    print("Known considerations for Qwen3 + vLLM 0.6.5:")
+    print("   ✅ VLLM_USE_V1=0: Using v0 engine (more stable)")
+    print("   ✅ enforce_eager=True: Avoids CUDA graph issues")
+    print("   ✅ bfloat16: Required dtype for Qwen3")
+    print("   ✅ trust_remote_code: Required for custom tokenizers")
+    print("\n⚠️  Potential Issues:")
+    print("   - Qwen3 may require newer vLLM version (check if issues occur)")
+    print("   - If model fails to load, may need vLLM 0.6.6+ or 0.7.0+")
+    print("   - Monitor for tokenizer compatibility issues")
+    return True
+def main():
+    """Run all compatibility checks"""
+    print("\n" + "#"*70)
+    print("# vLLM 0.6.5 + DragonLLM/qwen3-8b-fin-v1.0 Compatibility Check")
+    print("#"*70)
+    results = {}
+    # Check 1: Version
+    results['version'] = check_vllm_version()
+    # Check 2: Model registry
+    results['registry'] = check_model_registry()
+    # Check 3: Model info
+    results['model_info'] = check_model_info()
+    # Check 4: Configuration
+    results['configuration'] = check_configuration()
+    # Check 5: Known issues
+    results['known_issues'] = check_known_issues()
+    # Summary
+    print("\n" + "="*70)
+    print("SUMMARY")
+    print("="*70)
+    for check_name, success in results.items():
+        if success is None:
+            status = "⚠️  SKIP"
+        else:
+            status = "✅ PASS" if success else "❌ FAIL"
+        check_display = check_name.replace('_', ' ').title()
+        print(f"{status} - {check_display}")
+    passed = sum(1 for v in results.values() if v is True)
+    total = sum(1 for v in results.values() if v is not None)
+    print(f"\nResults: {passed}/{total} checks passed")
+    if results.get('version') and results.get('registry'):
+        print("\n✅ Basic compatibility looks good!")
+        print("   The model should work with vLLM 0.6.5")
+        print("\n   If you encounter issues:")
+        print("   1. Ensure HF_TOKEN_LC2 is set")
+        print("   2. Check model repository access")
+        print("   3. Verify CUDA/bfloat16 support")
+        print("   4. Consider upgrading to vLLM 0.6.6+ if problems persist")
+    elif results.get('registry') == False:
+        print("\n⚠️  Qwen3 may not be explicitly supported in vLLM 0.6.5")
+        print("   Consider:")
+        print("   1. Testing with the model anyway (might still work)")
+        print("   2. Upgrading to vLLM 0.6.6 or 0.7.0+")
+        print("   3. Using a different model if compatibility issues occur")
+    else:
+        print("\n⚠️  Some compatibility concerns detected")
+        print("   Review the checks above for details")
+if __name__ == "__main__":
+    main()

scripts/query_with_context.py DELETED Viewed

@@ -1,179 +0,0 @@
-#!/usr/bin/env python3
-"""
-Query LLM with PRIIPS Document Context
-Loads extracted PRIIPS documents and queries the LLM with RAG context.
-"""
-import sys
-import json
-import argparse
-from pathlib import Path
-from typing import List, Dict
-import requests
-# Configuration
-BASE_URL = "https://jeanbaptdzd-priips-llm-service.hf.space"
-MODEL = "DragonLLM/qwen3-8b-fin-v1.0"
-def load_extracted_documents(extracted_dir: Path) -> List[Dict]:
-    """Load all extracted PRIIPS documents."""
-    documents = []
-    for json_file in extracted_dir.glob("*_extracted.json"):
-        if json_file.name.startswith("_"):
-            continue  # Skip summary files
-        with open(json_file, "r", encoding="utf-8") as f:
-            documents.append(json.load(f))
-    return documents
-def build_context(documents: List[Dict], query: str, max_chars: int = 2000) -> str:
-    """
-    Build RAG context from documents relevant to the query.
-    Simple implementation: include all document summaries.
-    Can be enhanced with semantic search/embeddings.
-    """
-    context_parts = []
-    total_chars = 0
-    for doc in documents:
-        metadata = doc["metadata"]
-        # Build a summary of this document
-        doc_summary = f"\n--- Document: {metadata['product_name']} (ISIN: {metadata['isin']}) ---\n"
-        # Include extracted sections
-        if "sections" in doc and doc["sections"]:
-            for section_name, content in doc["sections"].items():
-                if content:
-                    section_text = f"\n{section_name.upper()}:\n{content[:300]}...\n"
-                    doc_summary += section_text
-        # Check if we have space
-        if total_chars + len(doc_summary) > max_chars:
-            break
-        context_parts.append(doc_summary)
-        total_chars += len(doc_summary)
-    if not context_parts:
-        return "No relevant documents found."
-    return "\n".join(context_parts)
-def query_llm(query: str, context: str, max_tokens: int = 500) -> str:
-    """Query the LLM with context."""
-    # Build the prompt with context
-    prompt = f"""You are a financial expert assistant specializing in PRIIPS Key Information Documents.
-Use the following context from PRIIPS documents to answer the question:
-{context}
-Question: {query}
-Provide a clear, accurate answer based on the context provided. If the context doesn't contain enough information, say so."""
-    payload = {
-        "model": MODEL,
-        "messages": [
-            {"role": "system", "content": "You are a PRIIPS financial document expert."},
-            {"role": "user", "content": prompt}
-        ],
-        "max_tokens": max_tokens,
-        "temperature": 0.3  # Lower temperature for more factual responses
-    }
-    print(f"🔍 Querying LLM with {len(context)} chars of context...")
-    try:
-        response = requests.post(
-            f"{BASE_URL}/v1/chat/completions",
-            json=payload,
-            timeout=60
-        )
-        response.raise_for_status()
-        data = response.json()
-        answer = data["choices"][0]["message"]["content"]
-        # Print usage stats
-        usage = data.get("usage", {})
-        print(f"📊 Tokens used: {usage.get('total_tokens', 'N/A')}")
-        return answer
-    except Exception as e:
-        return f"Error querying LLM: {e}"
-def main():
-    parser = argparse.ArgumentParser(
-        description="Query LLM with PRIIPS document context"
-    )
-    parser.add_argument(
-        "query",
-        type=str,
-        help="Question to ask about PRIIPS documents"
-    )
-    parser.add_argument(
-        "--extracted-dir",
-        type=str,
-        default="priips_documents/extracted",
-        help="Directory containing extracted documents"
-    )
-    parser.add_argument(
-        "--max-context",
-        type=int,
-        default=2000,
-        help="Maximum context characters to include"
-    )
-    parser.add_argument(
-        "--max-tokens",
-        type=int,
-        default=500,
-        help="Maximum tokens in response"
-    )
-    args = parser.parse_args()
-    # Setup paths
-    workspace_root = Path(__file__).parent.parent
-    extracted_dir = workspace_root / args.extracted_dir
-    if not extracted_dir.exists():
-        print(f"❌ Directory not found: {extracted_dir}")
-        print("Run extract_priips.py first to extract documents.")
-        sys.exit(1)
-    # Load documents
-    print(f"📚 Loading documents from {extracted_dir}...")
-    documents = load_extracted_documents(extracted_dir)
-    if not documents:
-        print("⚠️  No extracted documents found.")
-        print("Add PDFs to priips_documents/raw/ and run extract_priips.py")
-        sys.exit(1)
-    print(f"✅ Loaded {len(documents)} documents")
-    # Build context
-    context = build_context(documents, args.query, args.max_context)
-    # Query LLM
-    print(f"\n❓ Question: {args.query}\n")
-    answer = query_llm(args.query, context, args.max_tokens)
-    print(f"\n💬 Answer:\n{answer}\n")
-if __name__ == "__main__":
-    main()

scripts/test_model_access.py ADDED Viewed

	@@ -0,0 +1,321 @@

+#!/usr/bin/env python3
+"""
+Test script to verify access to DragonLLM models using Hugging Face Hub.
+This script tests:
+1. Token detection and authentication
+2. Model repository access
+3. Model information retrieval
+4. Token permissions
+Note: You can also use the HF MCP server if available:
+  - Uses huggingface_hub library directly
+  - Compatible with MCP server setup
+Run with: python scripts/test_model_access.py
+"""
+import os
+import sys
+from pathlib import Path
+# Add parent directory to path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent))
+try:
+    from huggingface_hub import login, whoami, HfApi, model_info, get_token
+    from huggingface_hub.utils import HfHubHTTPError
+except ImportError:
+    print("❌ Error: huggingface_hub not installed")
+    print("   Install it with: pip install huggingface-hub")
+    sys.exit(1)
+# Model to test access to
+MODEL_NAME = "DragonLLM/qwen3-8b-fin-v1.0"
+def get_hf_token():
+    """Get Hugging Face token from environment variables or HF CLI cache"""
+    # First try environment variables (priority for HF Spaces)
+    token = (
+        os.getenv("HF_TOKEN_LC2") or
+        os.getenv("HF_TOKEN_LC") or
+        os.getenv("HF_TOKEN") or
+        os.getenv("HUGGING_FACE_HUB_TOKEN")
+    )
+    if token:
+        # Determine source
+        if os.getenv("HF_TOKEN_LC2"):
+            source = "HF_TOKEN_LC2 (env)"
+        elif os.getenv("HF_TOKEN_LC"):
+            source = "HF_TOKEN_LC (env)"
+        elif os.getenv("HF_TOKEN"):
+            source = "HF_TOKEN (env)"
+        else:
+            source = "HUGGING_FACE_HUB_TOKEN (env)"
+        return token, source
+    # Fall back to HF CLI cached token (if available)
+    try:
+        cached_token = get_token()
+        if cached_token:
+            return cached_token, "HF CLI cache"
+    except Exception:
+        pass
+    return None, None
+def test_token_detection():
+    """Test 1: Check if token is found in environment"""
+    print("\n" + "="*70)
+    print("TEST 1: Token Detection")
+    print("="*70)
+    token, source = get_hf_token()
+    if token:
+        print(f"✅ Token found: {source}")
+        print(f"   Token length: {len(token)} characters")
+        print(f"   Token preview: {token[:10]}...{token[-4:]}")
+        return True, token, source
+    else:
+        print("❌ No token found in environment!")
+        print("\n   Checked environment variables:")
+        print("   - HF_TOKEN_LC2 (recommended for DragonLLM)")
+        print("   - HF_TOKEN_LC")
+        print("   - HF_TOKEN")
+        print("   - HUGGING_FACE_HUB_TOKEN")
+        print("\n   To set a token:")
+        print("   export HF_TOKEN_LC2='your_token_here'")
+        print("   Or use: huggingface-cli login")
+        return False, None, None
+def test_authentication(token):
+    """Test 2: Authenticate with Hugging Face Hub"""
+    print("\n" + "="*70)
+    print("TEST 2: Hugging Face Hub Authentication")
+    print("="*70)
+    try:
+        # Login with token
+        login(token=token, add_to_git_credential=False)
+        print("✅ Successfully authenticated with Hugging Face Hub")
+        # Get user info
+        try:
+            user_info = whoami()
+            print(f"✅ Logged in as: {user_info.get('name', 'Unknown')}")
+            if 'type' in user_info:
+                print(f"   Account type: {user_info['type']}")
+            return True
+        except Exception as e:
+            print(f"⚠️  Authenticated but couldn't get user info: {e}")
+            return True  # Still authenticated even if we can't get user info
+    except Exception as e:
+        print(f"❌ Authentication failed: {e}")
+        print("\n   Possible causes:")
+        print("   1. Invalid token")
+        print("   2. Token expired")
+        print("   3. Network connectivity issues")
+        return False
+def test_model_access(model_name):
+    """Test 3: Check if we can access the model repository"""
+    print("\n" + "="*70)
+    print("TEST 3: Model Repository Access")
+    print("="*70)
+    print(f"Model: {model_name}")
+    try:
+        # Try to get model info
+        print(f"   Attempting to access model repository...")
+        info = model_info(model_name, token=True)
+        print(f"✅ Successfully accessed model repository!")
+        print(f"   Model ID: {info.id}")
+        print(f"   Model tags: {', '.join(info.tags) if info.tags else 'None'}")
+        # Check if model is gated
+        if hasattr(info, 'gated') and info.gated:
+            print(f"   ⚠️  Model is GATED - requires accepting terms")
+        # Check available files
+        if hasattr(info, 'siblings'):
+            file_count = len(info.siblings) if info.siblings else 0
+            print(f"   Files in repository: {file_count}")
+            if file_count > 0 and info.siblings:
+                print(f"   Sample files:")
+                for sibling in info.siblings[:5]:
+                    print(f"     - {sibling.rfilename} ({sibling.size / (1024**2):.1f} MB)")
+                if file_count > 5:
+                    print(f"     ... and {file_count - 5} more files")
+        return True
+    except HfHubHTTPError as e:
+        if e.response.status_code == 401:
+            print(f"❌ Unauthorized (401): Token doesn't have access to this model")
+            print("\n   Possible causes:")
+            print("   1. You haven't accepted the model's terms of use")
+            print(f"   2. Visit: https://huggingface.co/{model_name}")
+            print("   3. Click 'Agree and access repository'")
+            print("   4. Token doesn't have proper permissions")
+            return False
+        elif e.response.status_code == 403:
+            print(f"❌ Forbidden (403): Access denied to this model")
+            print("\n   This model may be private or require special access")
+            return False
+        elif e.response.status_code == 404:
+            print(f"❌ Not Found (404): Model doesn't exist")
+            return False
+        else:
+            print(f"❌ HTTP Error {e.response.status_code}: {e}")
+            return False
+    except Exception as e:
+        print(f"❌ Error accessing model: {e}")
+        print(f"   Error type: {type(e).__name__}")
+        return False
+def test_model_files(model_name):
+    """Test 4: Check if we can list model files"""
+    print("\n" + "="*70)
+    print("TEST 4: Model Files Access")
+    print("="*70)
+    try:
+        api = HfApi()
+        files = api.list_repo_files(
+            repo_id=model_name,
+            repo_type="model",
+            token=True
+        )
+        if files:
+            print(f"✅ Found {len(files)} files in model repository")
+            print(f"   Key files:")
+            # Show important files
+            important_files = [
+                f for f in files if any(
+                    ext in f.lower()
+                    for ext in ['.safetensors', '.bin', 'config.json', 'tokenizer', 'model']
+                )
+            ]
+            for file in important_files[:10]:
+                print(f"     - {file}")
+            if len(files) > 10:
+                print(f"     ... and {len(files) - 10} more files")
+            return True
+        else:
+            print("⚠️  No files found in repository")
+            return False
+    except Exception as e:
+        print(f"❌ Error listing files: {e}")
+        return False
+def test_token_permissions(token):
+    """Test 5: Check token permissions"""
+    print("\n" + "="*70)
+    print("TEST 5: Token Permissions")
+    print("="*70)
+    try:
+        api = HfApi()
+        user_info = api.whoami(token=token)
+        print(f"✅ Token has valid permissions")
+        print(f"   User: {user_info.get('name', 'Unknown')}")
+        print(f"   Type: {user_info.get('type', 'Unknown')}")
+        # Check if user has read access
+        if 'canRead' in user_info:
+            print(f"   Can read repositories: {user_info['canRead']}")
+        return True
+    except Exception as e:
+        print(f"❌ Error checking permissions: {e}")
+        return False
+def main():
+    """Run all tests"""
+    print("\n" + "#"*70)
+    print("# DragonLLM Model Access Test")
+    print("#"*70)
+    print(f"Testing access to: {MODEL_NAME}")
+    results = {}
+    # Test 1: Token detection
+    success, token, source = test_token_detection()
+    results['token_detection'] = success
+    if not success:
+        print("\n" + "="*70)
+        print("❌ Cannot proceed without a token")
+        print("="*70)
+        return
+    # Test 2: Authentication
+    results['authentication'] = test_authentication(token)
+    if not results['authentication']:
+        print("\n" + "="*70)
+        print("❌ Authentication failed - cannot proceed")
+        print("="*70)
+        return
+    # Test 3: Model access
+    results['model_access'] = test_model_access(MODEL_NAME)
+    # Test 4: Model files (only if model access succeeded)
+    if results['model_access']:
+        results['model_files'] = test_model_files(MODEL_NAME)
+    else:
+        results['model_files'] = False
+    # Test 5: Token permissions
+    results['token_permissions'] = test_token_permissions(token)
+    # Summary
+    print("\n" + "="*70)
+    print("SUMMARY")
+    print("="*70)
+    for test_name, success in results.items():
+        status = "✅ PASS" if success else "❌ FAIL"
+        test_display = test_name.replace('_', ' ').title()
+        print(f"{status} - {test_display}")
+    passed = sum(1 for v in results.values() if v)
+    total = len(results)
+    print(f"\nResults: {passed}/{total} tests passed")
+    if passed == total:
+        print("\n🎉 All tests passed! You have full access to the DragonLLM model.")
+        print("   The model can be loaded in your application.")
+    elif results.get('token_detection') and results.get('authentication'):
+        print("\n⚠️  Authentication works but model access failed.")
+        print("   This usually means:")
+        print("   1. You need to accept the model's terms of use")
+        print(f"   2. Visit: https://huggingface.co/{MODEL_NAME}")
+        print("   3. Click 'Agree and access repository'")
+    else:
+        print("\n❌ Some tests failed. Check the errors above for details.")
+if __name__ == "__main__":
+    main()

tests/performance/README.md CHANGED Viewed

@@ -269,3 +269,9 @@ For issues or questions:
 - Review DEPLOYMENT.md for configuration details
 - Verify vLLM is properly initialized with model


269	- Review DEPLOYMENT.md for configuration details
270	- Verify vLLM is properly initialized with model
271
272	+
273	+
274	+
275	+
276	+
277	+

tests/performance/__init__.py CHANGED Viewed

	@@ -1,2 +1,8 @@
1	# Performance test suite
2


1	# Performance test suite
2
3	+
4	+
5	+
6	+
7	+
8	+

tests/performance/benchmark.py CHANGED Viewed

@@ -342,3 +342,9 @@ async def main():
 if __name__ == "__main__":
     asyncio.run(main())


342	if __name__ == "__main__":
343	asyncio.run(main())
344
345	+
346	+
347	+
348	+
349	+
350	+

tests/performance/test_inference_speed.py CHANGED Viewed

@@ -240,3 +240,9 @@ async def test_temperature_variance(client):
 if __name__ == "__main__":
     pytest.main([__file__, "-v", "-s"])


240	if __name__ == "__main__":
241	pytest.main([__file__, "-v", "-s"])
242
243	+
244	+
245	+
246	+
247	+
248	+

tests/performance/test_openai_compatibility.py CHANGED Viewed

@@ -343,3 +343,9 @@ class TestResponseFormat:
 if __name__ == "__main__":
     pytest.main([__file__, "-v", "-s"])


343	if __name__ == "__main__":
344	pytest.main([__file__, "-v", "-s"])
345
346	+
347	+
348	+
349	+
350	+
351	+