Dinuk-Di committed on
Commit · 2e2fd75
1 Parent(s): dbbbfec
Cuda Removed
- Dockerfile +15 -9
- README.md +127 -7
- app/model.py +33 -38
- app/schema.py +4 -2
- requirements.txt +2 -2
Dockerfile
CHANGED
|
@@ -1,27 +1,24 @@
|
|
| 1 |
-
FROM
|
| 2 |
|
| 3 |
ENV DEBIAN_FRONTEND=noninteractive \
|
| 4 |
PYTHONUNBUFFERED=1 \
|
| 5 |
PYTHONDONTWRITEBYTECODE=1 \
|
| 6 |
HF_HOME=/app/.cache/huggingface \
|
|
|
|
| 7 |
PORT=7860
|
| 8 |
|
| 9 |
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 10 |
-
python3.11 python3.11-dev python3-pip \
|
| 11 |
ffmpeg libsndfile1 git curl \
|
| 12 |
&& apt-get clean && rm -rf /var/lib/apt/lists/*
|
| 13 |
|
| 14 |
-
RUN ln -sf /usr/bin/python3.11 /usr/bin/python3 && \
|
| 15 |
-
ln -sf /usr/bin/python3 /usr/bin/python
|
| 16 |
-
|
| 17 |
WORKDIR /app
|
| 18 |
|
|
|
|
| 19 |
RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging
|
| 20 |
|
|
|
|
| 21 |
RUN pip install --no-cache-dir torch torchvision torchaudio \
|
| 22 |
-
--index-url https://download.pytorch.org/whl/
|
| 23 |
-
|
| 24 |
-
RUN pip install --no-cache-dir flash-attn --no-build-isolation
|
| 25 |
|
| 26 |
COPY requirements.txt .
|
| 27 |
RUN pip install --no-cache-dir -r requirements.txt
|
|
@@ -32,4 +29,13 @@ RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
|
|
| 32 |
USER appuser
|
| 33 |
|
| 34 |
EXPOSE 7860
|
| 35 |
-
|
| 1 |
+
FROM python:3.11-slim
|
| 2 |
|
| 3 |
ENV DEBIAN_FRONTEND=noninteractive \
|
| 4 |
PYTHONUNBUFFERED=1 \
|
| 5 |
PYTHONDONTWRITEBYTECODE=1 \
|
| 6 |
HF_HOME=/app/.cache/huggingface \
|
| 7 |
+
TRANSFORMERS_CACHE=/app/.cache/huggingface \
|
| 8 |
PORT=7860
|
| 9 |
|
| 10 |
RUN apt-get update && apt-get install -y --no-install-recommends \
|
|
|
|
| 11 |
ffmpeg libsndfile1 git curl \
|
| 12 |
&& apt-get clean && rm -rf /var/lib/apt/lists/*
|
| 13 |
|
|
|
|
|
|
|
|
|
|
| 14 |
WORKDIR /app
|
| 15 |
|
| 16 |
+
# Upgrade build tools first
|
| 17 |
RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging
|
| 18 |
|
| 19 |
+
# CPU-only PyTorch (no CUDA, no nvcc needed)
|
| 20 |
RUN pip install --no-cache-dir torch torchvision torchaudio \
|
| 21 |
+
--index-url https://download.pytorch.org/whl/cpu
|
|
|
|
|
|
|
| 22 |
|
| 23 |
COPY requirements.txt .
|
| 24 |
RUN pip install --no-cache-dir -r requirements.txt
|
|
|
|
| 29 |
USER appuser
|
| 30 |
|
| 31 |
EXPOSE 7860
|
| 32 |
+
|
| 33 |
+
HEALTHCHECK --interval=60s --timeout=15s --start-period=300s --retries=3 \
|
| 34 |
+
CMD curl -f http://localhost:7860/ || exit 1
|
| 35 |
+
|
| 36 |
+
CMD ["python", "-m", "uvicorn", "main:app", \
|
| 37 |
+
"--host", "0.0.0.0", \
|
| 38 |
+
"--port", "7860", \
|
| 39 |
+
"--workers", "1", \
|
| 40 |
+
"--loop", "uvloop", \
|
| 41 |
+
"--log-level", "info"]
|
README.md
CHANGED
|
@@ -1,10 +1,130 @@
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
| 1 |
+
# Boqapi
|
| 2 |
+
|
| 3 |
+
**Boqapi** is a production-ready Multimodal RAG (Retrieval-Augmented Generation) API. It supports text, image, audio, and video modalities for both ingestion and querying, leveraging the power of [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) for multimodal reasoning and generation, and `all-MiniLM-L6-v2` for generating embeddings.
|
| 4 |
+
|
| 5 |
+
## Features
|
| 6 |
+
- **Multimodal Generation**: Powered by `Qwen2.5-Omni-7B`, the API processes audio, visual, and textual inputs to generate relevant text or audio outputs.
|
| 7 |
+
- **Document Ingestion**: Ingest text, image, audio, or video components. Text is embedded directly; for non-text modalities, a text descriptor is generated and embedded for RAG vector search.
|
| 8 |
+
- **RAG Querying**: Combines in-memory vector similarity search (Cosine Similarity) with the reasoning capabilities of Qwen2.5-Omni-7B.
|
| 9 |
+
- **FastAPI Backend**: Provides high performance and asynchronous request handling with CORS, Rate-Limiting, and global exception management.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## Architecture Details
|
| 14 |
+
|
| 15 |
+
### Ingestion Flow
|
| 16 |
+
When multimodal documents are submitted to the `/ingest/documents` endpoint:
|
| 17 |
+
1. **Text**: Embedded directly using `sentence-transformers/all-MiniLM-L6-v2`.
|
| 18 |
+
2. **Audio/Video/Image**: A modality-specific text descriptor is created and embedded.
|
| 19 |
+
3. **Storage**: Both the document metadata and embeddings are placed into an in-memory vector store for quick retrieval.
|
| 20 |
+
|
| 21 |
+
```mermaid
|
| 22 |
+
flowchart LR
|
| 23 |
+
A[Multimodal Document] --> T{Is Text?}
|
| 24 |
+
T -- Yes --> B[Extract Text]
|
| 25 |
+
T -- No --> C[Create Descriptor]
|
| 26 |
+
B --> E[SentenceTransformer Embedding]
|
| 27 |
+
C --> E
|
| 28 |
+
E --> V[(In-Memory Vector Store)]
|
| 29 |
+
```
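The ingestion steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's code: `toy_embed` is a hypothetical stand-in for `all-MiniLM-L6-v2`, and `make_descriptor`, `InMemoryStore`, the `description` field, and the descriptor wording are all names introduced here.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import Dict, List

DIM = 64  # toy embedding size (assumption; not the real model's dimension)


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in embedding: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def make_descriptor(doc: Dict[str, str]) -> str:
    """Text is used as-is; other modalities get a short text descriptor (wording assumed)."""
    if doc["modality"] == "text":
        return doc["content"]
    return f"{doc['modality']} document: {doc.get('description', 'no description')}"


@dataclass
class InMemoryStore:
    """Keeps document metadata and embeddings side by side, index-aligned."""
    docs: List[Dict[str, str]] = field(default_factory=list)
    vectors: List[List[float]] = field(default_factory=list)

    def ingest(self, doc: Dict[str, str]) -> None:
        self.docs.append(doc)
        self.vectors.append(toy_embed(make_descriptor(doc)))


store = InMemoryStore()
store.ingest({"modality": "text", "content": "The Qwen model supports text, image, audio, and video."})
store.ingest({"modality": "image", "content": "<base64 data>", "description": "a cat on a sofa"})
print(len(store.docs), len(store.vectors[0]))
```

Keeping `docs` and `vectors` index-aligned means a similarity search only needs to return positions, which can then be mapped back to metadata.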
|
| 30 |
+
|
| 31 |
+
### RAG Query Flow
|
| 32 |
+
When a user submits a query along with optional media files to the `/rag/query` endpoint:
|
| 33 |
+
1. **Retrieval**: The textual component of the query is embedded, and cosine similarity is used to find the top-K relevant documents from the vector store.
|
| 34 |
+
2. **Conversation Assembly**: Retrieved contexts and the user's media inputs (images, audio, video) are formatted into a structured prompt.
|
| 35 |
+
3. **Inference**: The prompt is processed by `Qwen2.5-Omni-7B`. If audio output is requested, it synthesizes speech (e.g., using "Chelsie").
|
| 36 |
+
4. **Response**: The API returns the generated text, retrieved documents, token usage, performance metrics, and optionally base64-encoded audio.
|
| 37 |
+
|
| 38 |
+
```mermaid
|
| 39 |
+
flowchart TD
|
| 40 |
+
Q[User Query + Media] --> R[Embed Text Query]
|
| 41 |
+
R --> S[Vector Similarity Search]
|
| 42 |
+
S --> V[(In-Memory Vector Store)]
|
| 43 |
+
V --> C
|
| 44 |
+
S --> |Top-K Docs| C[Assemble Context & Media Prompt]
|
| 45 |
+
C --> I[[Qwen2.5-Omni-7B Inference Engine]]
|
| 46 |
+
I --> O[Generated Text / Audio Response]
|
| 47 |
+
```
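The retrieval and context-assembly steps of the query flow can be illustrated with a small pure-Python sketch. The function names (`cosine_similarity`, `top_k`, `assemble_prompt`) and the prompt layout are assumptions made for this example; the actual service may structure its conversation differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def assemble_prompt(query: str, contexts: List[str]) -> str:
    """Format retrieved contexts and the query (layout is an assumption)."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Context:\n{numbered}\n\nQuestion: {query}"


doc_texts = ["doc about text", "doc about audio", "doc about text and audio"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = top_k([1.0, 0.1], doc_vecs, k=2)
prompt = assemble_prompt("What is supported?", [doc_texts[i] for i, _ in hits])
print(prompt)
```

A linear scan like this is adequate for a small in-memory store; larger collections would usually call for an approximate nearest-neighbor index.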
|
| 48 |
+
|
| 49 |
---
|
| 50 |
+
|
| 51 |
+
## Setup and Execution
|
| 52 |
+
|
| 53 |
+
### Prerequisites
|
| 54 |
+
- Nvidia GPU (CUDA 12.1+ recommended)
|
| 55 |
+
- Docker (for containerized execution)
|
| 56 |
+
- Python 3.11+ (for local execution)
|
| 57 |
+
|
| 58 |
+
### 1. Running with Docker
|
| 59 |
+
|
| 60 |
+
The provided `Dockerfile` builds on `nvidia/cuda:12.1.1` and handles all heavy dependencies (Flash Attention 2, torch, ffmpeg, etc.).
|
| 61 |
+
|
| 62 |
+
```bash
|
| 63 |
+
# Build the image
|
| 64 |
+
docker build -t boqapi .
|
| 65 |
+
|
| 66 |
+
# Run the container with GPU support
|
| 67 |
+
docker run --gpus all -p 7860:7860 boqapi
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
### 2. Running Locally
|
| 71 |
+
|
| 72 |
+
If you prefer running directly on your host machine:
|
| 73 |
+
|
| 74 |
+
```bash
|
| 75 |
+
# Clone the repository and navigate to the project directory
|
| 76 |
+
cd boqapi
|
| 77 |
+
|
| 78 |
+
# Install system dependencies
|
| 79 |
+
sudo apt-get install ffmpeg libsndfile1
|
| 80 |
+
|
| 81 |
+
# Install Python dependencies
|
| 82 |
+
pip install -r requirements.txt
|
| 83 |
+
|
| 84 |
+
# Run the FastAPI application
|
| 85 |
+
python app/main.py
|
| 86 |
+
# OR
|
| 87 |
+
uvicorn main:app --host 0.0.0.0 --port 7860 --workers 1
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
*(Note: Ensure you have Flash Attention 2 compatible hardware if `flash-attn` is installed.)*
|
| 91 |
+
|
| 92 |
---
|
| 93 |
|
| 94 |
+
## Example API Usage
|
| 95 |
+
|
| 96 |
+
The API is secured (by default) with a static API key. Pass `x-api-key: dev-secret` in your headers.
|
| 97 |
+
|
| 98 |
+
### 1. Health Check
|
| 99 |
+
```bash
|
| 100 |
+
curl -X GET "http://localhost:7860/health"
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
### 2. Ingest Documents
|
| 104 |
+
```bash
|
| 105 |
+
curl -X POST "http://localhost:7860/ingest/documents" \
|
| 106 |
+
-H "x-api-key: dev-secret" \
|
| 107 |
+
-H "Content-Type: application/json" \
|
| 108 |
+
-d '{
|
| 109 |
+
"user_id": "user123",
|
| 110 |
+
"documents": [
|
| 111 |
+
{
|
| 112 |
+
"modality": "text",
|
| 113 |
+
"content": "The Qwen model supports text, image, audio, and video."
|
| 114 |
+
}
|
| 115 |
+
]
|
| 116 |
+
}'
|
| 117 |
+
```
|
| 118 |
+
|
| 119 |
+
### 3. RAG Query
|
| 120 |
+
```bash
|
| 121 |
+
curl -X POST "http://localhost:7860/rag/query" \
|
| 122 |
+
-H "x-api-key: dev-secret" \
|
| 123 |
+
-H "Content-Type: application/json" \
|
| 124 |
+
-d '{
|
| 125 |
+
"user_id": "user123",
|
| 126 |
+
"query_text": "What does the Qwen model support?",
|
| 127 |
+
"top_k": 3,
|
| 128 |
+
"return_audio": false
|
| 129 |
+
}'
|
| 130 |
+
```
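The same RAG query can be issued from Python. The sketch below uses only the standard library and reuses the endpoint, header, and payload from the `curl` example; the helper name `build_rag_request` is hypothetical, and the send step is left commented out because it needs a running server.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:7860"  # same host/port as the curl examples
API_KEY = "dev-secret"              # default static key from the README


def build_rag_request(query_text: str,
                      user_id: str = "user123",
                      top_k: int = 3) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /rag/query endpoint."""
    payload: Dict[str, Any] = {
        "user_id": user_id,
        "query_text": query_text,
        "top_k": top_k,
        "return_audio": False,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/rag/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


request = build_rag_request("What does the Qwen model support?")

# To send it (requires the API to be running locally):
# with urllib.request.urlopen(request, timeout=60) as resp:
#     print(json.loads(resp.read()))
```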
|
app/model.py
CHANGED
|
@@ -2,10 +2,10 @@
|
|
| 2 |
import torch
|
| 3 |
import logging
|
| 4 |
import time
|
| 5 |
-
from
|
|
|
|
| 6 |
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
|
| 7 |
from qwen_omni_utils import process_mm_info
|
| 8 |
-
from typing import Optional, Tuple, List, Dict, Any
|
| 9 |
|
| 10 |
logger = logging.getLogger(__name__)
|
| 11 |
|
|
@@ -16,34 +16,35 @@ _processor: Optional[Qwen2_5OmniProcessor] = None
|
|
| 16 |
_model_load_time: float = 0.0
|
| 17 |
|
| 18 |
|
| 19 |
-
def load_model(enable_audio_output: bool = False)
|
| 20 |
-
Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
|
| 21 |
-
]:
|
| 22 |
global _model, _processor, _model_load_time
|
|
|
|
| 23 |
if _model is not None and _processor is not None:
|
| 24 |
return _model, _processor
|
| 25 |
|
| 26 |
logger.info(f"Loading model: {MODEL_ID}")
|
| 27 |
start = time.time()
|
| 28 |
|
|
| 29 |
load_kwargs: Dict[str, Any] = {
|
| 30 |
-
|
| 31 |
-
"
|
|
|
|
|
|
|
| 32 |
}
|
| 33 |
|
| 34 |
-
if torch.cuda.is_available():
|
| 35 |
-
load_kwargs["attn_implementation"] = "flash_attention_2"
|
| 36 |
-
logger.info("Flash Attention 2 enabled.")
|
| 37 |
-
|
| 38 |
_model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, **load_kwargs)
|
| 39 |
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
|
| 44 |
_processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
|
| 45 |
_model_load_time = time.time() - start
|
| 46 |
-
logger.info(f"Model loaded in {_model_load_time:.2f}s")
|
|
|
|
| 47 |
return _model, _processor
|
| 48 |
|
| 49 |
|
|
@@ -63,16 +64,17 @@ def run_inference(
|
|
| 63 |
conversation: List[Dict],
|
| 64 |
return_audio: bool = False,
|
| 65 |
speaker: str = "Chelsie",
|
| 66 |
-
max_new_tokens: int =
|
| 67 |
temperature: float = 0.7,
|
| 68 |
use_audio_in_video: bool = True,
|
| 69 |
) -> Tuple[str, Optional[bytes], int, int]:
|
| 70 |
-
"""
|
| 71 |
-
Returns: (text_output, audio_bytes_or_None, prompt_tokens, completion_tokens)
|
| 72 |
-
"""
|
| 73 |
model = get_model()
|
| 74 |
processor = get_processor()
|
| 75 |
|
|
| 76 |
text_template = processor.apply_chat_template(
|
| 77 |
conversation,
|
| 78 |
add_generation_prompt=True,
|
|
@@ -90,7 +92,12 @@ def run_inference(
|
|
| 90 |
return_tensors="pt",
|
| 91 |
padding=True,
|
| 92 |
use_audio_in_video=use_audio_in_video,
|
| 93 |
-
).to(model.device)
|
| 94 |
|
| 95 |
prompt_tokens = inputs["input_ids"].shape[-1]
|
| 96 |
|
|
@@ -99,28 +106,16 @@ def run_inference(
|
|
| 99 |
"max_new_tokens": max_new_tokens,
|
| 100 |
"temperature": temperature,
|
| 101 |
"do_sample": temperature > 0,
|
| 102 |
-
"return_audio":
|
| 103 |
}
|
| 104 |
|
| 105 |
-
if return_audio:
|
| 106 |
-
generate_kwargs["speaker"] = speaker
|
| 107 |
-
|
| 108 |
with torch.inference_mode():
|
| 109 |
outputs = model.generate(**inputs, **generate_kwargs)
|
| 110 |
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
buf = io.BytesIO()
|
| 116 |
-
sf.write(buf, audio_np, samplerate=24000, format="WAV")
|
| 117 |
-
audio_bytes = buf.getvalue()
|
| 118 |
-
else:
|
| 119 |
-
text_ids = outputs
|
| 120 |
-
audio_bytes = None
|
| 121 |
-
|
| 122 |
-
completion_tokens = text_ids.shape[-1] - prompt_tokens
|
| 123 |
-
decoded = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
|
| 124 |
answer = decoded[0] if decoded else ""
|
| 125 |
|
| 126 |
-
return answer,
|
|
|
|
| 2 |
import torch
|
| 3 |
import logging
|
| 4 |
import time
|
| 5 |
+
from typing import Optional, Tuple, List, Dict, Any
|
| 6 |
+
|
| 7 |
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
|
| 8 |
from qwen_omni_utils import process_mm_info
|
|
|
|
| 9 |
|
| 10 |
logger = logging.getLogger(__name__)
|
| 11 |
|
|
|
|
| 16 |
_model_load_time: float = 0.0
|
| 17 |
|
| 18 |
|
| 19 |
+
def load_model(enable_audio_output: bool = False):
|
|
|
|
|
|
|
| 20 |
global _model, _processor, _model_load_time
|
| 21 |
+
|
| 22 |
if _model is not None and _processor is not None:
|
| 23 |
return _model, _processor
|
| 24 |
|
| 25 |
logger.info(f"Loading model: {MODEL_ID}")
|
| 26 |
start = time.time()
|
| 27 |
|
| 28 |
+
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 29 |
+
logger.info(f"Using device: {device}")
|
| 30 |
+
|
| 31 |
load_kwargs: Dict[str, Any] = {
|
| 32 |
+
# Use float32 on CPU - bfloat16 is poorly supported on CPU
|
| 33 |
+
"torch_dtype": torch.bfloat16 if device == "cuda" else torch.float32,
|
| 34 |
+
"device_map": "auto" if device == "cuda" else "cpu",
|
| 35 |
+
# NO flash_attention_2 - only works with GPU + nvcc
|
| 36 |
}
|
| 37 |
|
|
| 38 |
_model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, **load_kwargs)
|
| 39 |
|
| 40 |
+
# Always disable talker on CPU - saves ~2GB and talker requires GPU
|
| 41 |
+
_model.disable_talker()
|
| 42 |
+
logger.info("Audio talker disabled (CPU mode - saves memory).")
|
| 43 |
|
| 44 |
_processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
|
| 45 |
_model_load_time = time.time() - start
|
| 46 |
+
logger.info(f"Model loaded in {_model_load_time:.2f}s on {device}")
|
| 47 |
+
|
| 48 |
return _model, _processor
|
| 49 |
|
| 50 |
|
|
|
|
| 64 |
conversation: List[Dict],
|
| 65 |
return_audio: bool = False,
|
| 66 |
speaker: str = "Chelsie",
|
| 67 |
+
max_new_tokens: int = 256,
|
| 68 |
temperature: float = 0.7,
|
| 69 |
use_audio_in_video: bool = True,
|
| 70 |
) -> Tuple[str, Optional[bytes], int, int]:
|
| 71 |
model = get_model()
|
| 72 |
processor = get_processor()
|
| 73 |
|
| 74 |
+
# Force return_audio=False on CPU since talker is disabled
|
| 75 |
+
if not torch.cuda.is_available():
|
| 76 |
+
return_audio = False
|
| 77 |
+
|
| 78 |
text_template = processor.apply_chat_template(
|
| 79 |
conversation,
|
| 80 |
add_generation_prompt=True,
|
|
|
|
| 92 |
return_tensors="pt",
|
| 93 |
padding=True,
|
| 94 |
use_audio_in_video=use_audio_in_video,
|
| 95 |
+
).to(model.device)
|
| 96 |
+
|
| 97 |
+
# Match dtype for CPU (float32)
|
| 98 |
+
if not torch.cuda.is_available():
|
| 99 |
+
inputs = {k: v.float() if v.dtype == torch.float16 else v
|
| 100 |
+
for k, v in inputs.items()}
|
| 101 |
|
| 102 |
prompt_tokens = inputs["input_ids"].shape[-1]
|
| 103 |
|
|
|
|
| 106 |
"max_new_tokens": max_new_tokens,
|
| 107 |
"temperature": temperature,
|
| 108 |
"do_sample": temperature > 0,
|
| 109 |
+
"return_audio": False, # Always False - talker disabled on CPU
|
| 110 |
}
|
| 111 |
|
|
|
|
|
|
|
|
|
|
| 112 |
with torch.inference_mode():
|
| 113 |
outputs = model.generate(**inputs, **generate_kwargs)
|
| 114 |
|
| 115 |
+
completion_tokens = outputs.shape[-1] - prompt_tokens
|
| 116 |
+
decoded = processor.batch_decode(
|
| 117 |
+
outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
| 118 |
+
)
|
| 119 |
answer = decoded[0] if decoded else ""
|
| 120 |
|
| 121 |
+
return answer, None, prompt_tokens, completion_tokens
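The device-dependent settings chosen in `load_model` can be isolated into a small pure function, which makes the CPU and GPU branches easy to check without loading the model. This is a sketch only: `select_load_kwargs` is a hypothetical helper, and dtype names are plain strings standing in for the `torch.bfloat16` / `torch.float32` objects used in the diff.

```python
from typing import Any, Dict


def select_load_kwargs(cuda_available: bool) -> Dict[str, Any]:
    """Mirror the dtype/device_map choice from load_model.

    Strings stand in for torch dtypes so the sketch runs without torch installed.
    """
    device = "cuda" if cuda_available else "cpu"
    return {
        # float32 on CPU, bfloat16 on GPU (as in the diff)
        "torch_dtype": "bfloat16" if device == "cuda" else "float32",
        # let the loader place weights automatically on GPU, pin to CPU otherwise
        "device_map": "auto" if device == "cuda" else "cpu",
    }


print(select_load_kwargs(cuda_available=False))
print(select_load_kwargs(cuda_available=True))
```

In the real module the flag would come from `torch.cuda.is_available()`, as it does in the diff above.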
|
app/schema.py
CHANGED
|
@@ -25,9 +25,11 @@ class RAGQueryRequest(BaseModel):
|
|
| 25 |
query_text: Optional[str] = Field(default=None, description="Natural language query")
|
| 26 |
media_inputs: Optional[List[MediaInput]] = Field(default=[], description="List of multimodal inputs")
|
| 27 |
top_k: int = Field(default=5, ge=1, le=20, description="Number of RAG context chunks to retrieve")
|
| 28 |
-
return_audio: bool = Field(
|
|
|
|
|
|
|
| 29 |
speaker: Literal["Chelsie", "Ethan"] = Field(default="Chelsie")
|
| 30 |
-
max_new_tokens: int = Field(default=
|
| 31 |
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
|
| 32 |
|
| 33 |
@validator("media_inputs", always=True)
|
|
|
|
| 25 |
query_text: Optional[str] = Field(default=None, description="Natural language query")
|
| 26 |
media_inputs: Optional[List[MediaInput]] = Field(default=[], description="List of multimodal inputs")
|
| 27 |
top_k: int = Field(default=5, ge=1, le=20, description="Number of RAG context chunks to retrieve")
|
| 28 |
+
return_audio: bool = Field(
|
| 29 |
+
default=False,
|
| 30 |
+
description="Audio output (GPU only - disabled on CPU deployments)")
|
| 31 |
speaker: Literal["Chelsie", "Ethan"] = Field(default="Chelsie")
|
| 32 |
+
max_new_tokens: int = Field(default=256, ge=32, le=512)
|
| 33 |
temperature: float = Field(default=0.7, ge=0.0, le=2.0)
|
| 34 |
|
| 35 |
@validator("media_inputs", always=True)
|
requirements.txt
CHANGED
|
@@ -8,6 +8,6 @@ qwen-omni-utils[decord]
|
|
| 8 |
sentence-transformers>=3.0.0
|
| 9 |
scikit-learn>=1.4.0
|
| 10 |
soundfile>=0.12.1
|
| 11 |
-
torch>=2.3.0
|
| 12 |
-
slowapi>=0.1.9
|
| 13 |
numpy>=1.26.0
|
| 8 |
sentence-transformers>=3.0.0
|
| 9 |
scikit-learn>=1.4.0
|
| 10 |
soundfile>=0.12.1
|
|
|
|
|
|
|
| 11 |
numpy>=1.26.0
|
| 12 |
+
slowapi>=0.1.9
|
| 13 |
+
# flash-attn REMOVED - requires nvcc/GPU to compile
|