Space: dinukpathiraja/boqapi

Dinuk-Di committed on Mar 24

Commit

2e2fd75

1 Parent(s): dbbbfec

Cuda Removed


Files changed (5)

Dockerfile +15 -9
README.md +127 -7
app/model.py +33 -38
app/schema.py +4 -2
requirements.txt +2 -2

Dockerfile CHANGED

@@ -1,27 +1,24 @@
-FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
 ENV DEBIAN_FRONTEND=noninteractive \
     PYTHONUNBUFFERED=1 \
     PYTHONDONTWRITEBYTECODE=1 \
     HF_HOME=/app/.cache/huggingface \
     PORT=7860
 RUN apt-get update && apt-get install -y --no-install-recommends \
-    python3.11 python3.11-dev python3-pip \
     ffmpeg libsndfile1 git curl \
     && apt-get clean && rm -rf /var/lib/apt/lists/*
-RUN ln -sf /usr/bin/python3.11 /usr/bin/python3 && \
-    ln -sf /usr/bin/python3 /usr/bin/python
 WORKDIR /app
 RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging
 RUN pip install --no-cache-dir torch torchvision torchaudio \
-    --index-url https://download.pytorch.org/whl/cu121
-RUN pip install --no-cache-dir flash-attn --no-build-isolation
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
@@ -32,4 +29,13 @@ RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
 USER appuser
 EXPOSE 7860
-CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]

+FROM python:3.11-slim
 ENV DEBIAN_FRONTEND=noninteractive \
     PYTHONUNBUFFERED=1 \
     PYTHONDONTWRITEBYTECODE=1 \
     HF_HOME=/app/.cache/huggingface \
+    TRANSFORMERS_CACHE=/app/.cache/huggingface \
     PORT=7860
 RUN apt-get update && apt-get install -y --no-install-recommends \
     ffmpeg libsndfile1 git curl \
     && apt-get clean && rm -rf /var/lib/apt/lists/*
 WORKDIR /app
+# Upgrade build tools first
 RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging
+# CPU-only PyTorch (no CUDA, no nvcc needed)
 RUN pip install --no-cache-dir torch torchvision torchaudio \
+    --index-url https://download.pytorch.org/whl/cpu
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 USER appuser
 EXPOSE 7860
+HEALTHCHECK --interval=60s --timeout=15s --start-period=300s --retries=3 \
+    CMD curl -f http://localhost:7860/ || exit 1
+CMD ["python", "-m", "uvicorn", "main:app", \
+     "--host", "0.0.0.0", \
+     "--port", "7860", \
+     "--workers", "1", \
+     "--loop", "uvloop", \
+     "--log-level", "info"]
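The `HEALTHCHECK` added above shells out to `curl -f`, which fails on HTTP error statuses. For probing the container from a test script, a rough Python equivalent might look like the sketch below. The name `probe` and its default timeout are illustrative and not part of the commit; unlike `curl -f`, `urlopen` follows redirects, so the two are only approximately equivalent.

```python
import urllib.request


def probe(url: str, timeout: float = 15.0) -> bool:
    """Return True if `url` answers with a 2xx status (roughly `curl -f`)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), timeouts and refused connections
        # all derive from OSError, so any of them counts as unhealthy.
        return False
```

Note that the health check probes `/` while the README example calls `/health`; the probe should target whichever route `main:app` actually serves.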

README.md CHANGED

@@ -1,10 +1,130 @@
 ---
-title: Boqapi
-emoji: 🐨
-colorFrom: purple
-colorTo: yellow
-sdk: docker
-pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Boqapi
+**Boqapi** is a production-ready Multimodal RAG (Retrieval-Augmented Generation) API. It supports text, image, audio, and video modalities for both ingestion and querying, leveraging the power of [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) for multimodal reasoning and generation, and `all-MiniLM-L6-v2` for generating embeddings.
+## Features
+- **Multimodal Generation**: Powered by `Qwen2.5-Omni-7B`, the API processes audio, visual, and textual inputs to generate relevant text or audio outputs.
+- **Document Ingestion**: Seamlessly ingest text, image, audio, or video components. Text gets directly embedded. Non-text modalities generate a text descriptor embedded for RAG vector search.
+- **RAG Querying**: Combines in-memory vector similarity search (Cosine Similarity) with the reasoning capabilities of Qwen2.5-Omni-7B.
+- **FastAPI Backend**: Provides high performance and asynchronous request handling with CORS, Rate-Limiting, and global exception management.
+---
+## Architecture Details
+### Ingestion Flow
+When multimodal documents are submitted to the `/ingest/documents` endpoint:
+1. **Text**: Embedded directly using `sentence-transformers/all-MiniLM-L6-v2`.
+2. **Audio/Video/Image**: A modality-specific text descriptor is created and embedded.
+3. **Storage**: Both the document metadata and embeddings are placed into an in-memory vector store for quick retrieval.
+```mermaid
+flowchart LR
+    A[Multimodal Document] --> T{Is Text?}
+    T -- Yes --> B[Extract Text]
+    T -- No --> C[Create Descriptor]
+    B --> E[SentenceTransformer Embedding]
+    C --> E
+    E --> V[(In-Memory Vector Store)]
+```
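The ingestion flow described in this README section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Space's code: `make_descriptor`, `toy_embed`, and `VectorStore` are hypothetical names, the descriptor format is invented, and `toy_embed` is a hashed bag-of-words stand-in for `all-MiniLM-L6-v2` so the sketch runs without downloading a model.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def make_descriptor(modality: str, content: str) -> str:
    """Text is embedded as-is; other modalities get a short text descriptor."""
    if modality == "text":
        return content
    return f"[{modality}] {content}"  # hypothetical descriptor format


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Stand-in embedding: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


@dataclass
class VectorStore:
    """In-memory store of (metadata, embedding) pairs."""
    items: List[Tuple[Dict[str, str], List[float]]] = field(default_factory=list)

    def ingest(self, user_id: str, modality: str, content: str) -> None:
        text = make_descriptor(modality, content)
        meta = {"user_id": user_id, "modality": modality, "text": text}
        self.items.append((meta, toy_embed(text)))
```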
+### RAG Query Flow
+When a user submits a query along with optional media files to the `/rag/query` endpoint:
+1. **Retrieval**: The textual component of the query is embedded, and cosine similarity is used to find the top-K relevant documents from the vector store.
+2. **Conversation Assembly**: Retrieved contexts and the user's media inputs (images, audio, video) are formatted into a structured prompt.
+3. **Inference**: The prompt is processed by `Qwen2.5-Omni-7B`. If audio output is requested, it synthesizes speech (e.g., using the "Chelsie" voice).
+4. **Response**: The API returns the generated text, retrieved documents, token usage, performance metrics, and optionally base64-encoded audio.
+```mermaid
+flowchart TD
+    Q[User Query + Media] --> R[Embed Text Query]
+    R --> S[Vector Similarity Search]
+    S --> V[(In-Memory Vector Store)]
+    V --> C
+    S --> |Top-K Docs| C[Assemble Context & Media Prompt]
+    C --> I[[Qwen2.5-Omni-7B Inference Engine]]
+    I --> O[Generated Text / Audio Response]
+```
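The retrieval step of the query flow (embed the query, rank stored documents by cosine similarity, keep the top K) might be sketched as follows. The function names are hypothetical and the code uses plain Python lists rather than whatever array library the Space uses internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: Sequence[float],
          docs: List[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[Tuple[str, float]]:
    """Rank (doc_id, embedding) pairs by similarity to `query`, best first."""
    scored = [(doc_id, cosine(query, emb)) for doc_id, emb in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The default `k=5` here simply mirrors the `top_k` default in `RAGQueryRequest`.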
 ---
+## Setup and Execution
+### Prerequisites
+- Nvidia GPU (CUDA 12.1+ recommended)
+- Docker (for containerized execution)
+- Python 3.11+ (for local execution)
+### 1. Running with Docker
+The provided `Dockerfile` builds on `nvidia/cuda:12.1.1` and handles all heavy dependencies (Flash Attention 2, torch, ffmpeg, etc.).
+```bash
+# Build the image
+docker build -t boqapi .
+# Run the container with GPU support
+docker run --gpus all -p 7860:7860 boqapi
+```
+### 2. Running Locally
+If you prefer running directly on your host machine:
+```bash
+# Clone the repository and navigate to the project directory
+cd boqapi
+# Install system dependencies
+sudo apt-get install ffmpeg libsndfile1
+# Install Python dependencies
+pip install -r requirements.txt
+# Run the FastAPI application
+python app/main.py
+# OR
+uvicorn main:app --host 0.0.0.0 --port 7860 --workers 1
+```
+*(Note: Ensure you have Flash Attention 2 compatible hardware if `flash-attn` is installed.)*
 ---
+## Example API Usage
+The API is secured (by default) with a static API key. Pass `x-api-key: dev-secret` in your headers.
+### 1. Health Check
+```bash
+curl -X GET "http://localhost:7860/health"
+```
+### 2. Ingest Documents
+```bash
+curl -X POST "http://localhost:7860/ingest/documents" \
+     -H "x-api-key: dev-secret" \
+     -H "Content-Type: application/json" \
+     -d '{
+           "user_id": "user123",
+           "documents": [
+             {
+               "modality": "text",
+               "content": "The Qwen model supports text, image, audio, and video."
+             }
+           ]
+         }'
+```
+### 3. RAG Query
+```bash
+curl -X POST "http://localhost:7860/rag/query" \
+     -H "x-api-key: dev-secret" \
+     -H "Content-Type: application/json" \
+     -d '{
+           "user_id": "user123",
+           "query_text": "What does the Qwen model support?",
+           "top_k": 3,
+           "return_audio": false
+         }'
+```
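For callers who prefer Python to `curl`, the RAG query request from the README could be assembled with the standard library as sketched below. `build_rag_request` is a hypothetical helper; the endpoint, header name, and payload fields are taken from the README example, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_rag_request(base_url: str, api_key: str, query_text: str,
                      user_id: str = "user123",
                      top_k: int = 3) -> urllib.request.Request:
    """Build (but do not send) a POST to /rag/query mirroring the curl example."""
    payload = {
        "user_id": user_id,
        "query_text": query_text,
        "top_k": top_k,
        "return_audio": False,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rag/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running instance.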

app/model.py CHANGED

@@ -2,10 +2,10 @@
 import torch
 import logging
 import time
-from functools import lru_cache
 from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
 from qwen_omni_utils import process_mm_info
-from typing import Optional, Tuple, List, Dict, Any
 logger = logging.getLogger(__name__)
@@ -16,34 +16,35 @@ _processor: Optional[Qwen2_5OmniProcessor] = None
 _model_load_time: float = 0.0
-def load_model(enable_audio_output: bool = False) -> Tuple[
-    Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
-]:
     global _model, _processor, _model_load_time
     if _model is not None and _processor is not None:
         return _model, _processor
     logger.info(f"Loading model: {MODEL_ID}")
     start = time.time()
     load_kwargs: Dict[str, Any] = {
-        "torch_dtype": torch.bfloat16,
-        "device_map": "auto",
     }
-    if torch.cuda.is_available():
-        load_kwargs["attn_implementation"] = "flash_attention_2"
-        logger.info("Flash Attention 2 enabled.")
     _model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, **load_kwargs)
-    if not enable_audio_output:
-        _model.disable_talker()
-        logger.info("Audio talker disabled — saving ~2GB VRAM.")
     _processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
     _model_load_time = time.time() - start
-    logger.info(f"Model loaded in {_model_load_time:.2f}s")
     return _model, _processor
@@ -63,16 +64,17 @@ def run_inference(
     conversation: List[Dict],
     return_audio: bool = False,
     speaker: str = "Chelsie",
-    max_new_tokens: int = 512,
     temperature: float = 0.7,
     use_audio_in_video: bool = True,
 ) -> Tuple[str, Optional[bytes], int, int]:
-    """
-    Returns: (text_output, audio_bytes_or_None, prompt_tokens, completion_tokens)
-    """
     model = get_model()
     processor = get_processor()
     text_template = processor.apply_chat_template(
         conversation,
         add_generation_prompt=True,
@@ -90,7 +92,12 @@ def run_inference(
         return_tensors="pt",
         padding=True,
         use_audio_in_video=use_audio_in_video,
-    ).to(model.device).to(model.dtype)
     prompt_tokens = inputs["input_ids"].shape[-1]
@@ -99,28 +106,16 @@ def run_inference(
         "max_new_tokens": max_new_tokens,
         "temperature": temperature,
         "do_sample": temperature > 0,
-        "return_audio": return_audio,
     }
-    if return_audio:
-        generate_kwargs["speaker"] = speaker
     with torch.inference_mode():
         outputs = model.generate(**inputs, **generate_kwargs)
-    if return_audio:
-        text_ids, audio_tensor = outputs
-        audio_np = audio_tensor.reshape(-1).detach().cpu().numpy()
-        import io, soundfile as sf
-        buf = io.BytesIO()
-        sf.write(buf, audio_np, samplerate=24000, format="WAV")
-        audio_bytes = buf.getvalue()
-    else:
-        text_ids = outputs
-        audio_bytes = None
-    completion_tokens = text_ids.shape[-1] - prompt_tokens
-    decoded = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
     answer = decoded[0] if decoded else ""
-    return answer, audio_bytes, prompt_tokens, completion_tokens

 import torch
 import logging
 import time
+from typing import Optional, Tuple, List, Dict, Any
 from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
 from qwen_omni_utils import process_mm_info
 logger = logging.getLogger(__name__)
 _model_load_time: float = 0.0
+def load_model(enable_audio_output: bool = False):
     global _model, _processor, _model_load_time
     if _model is not None and _processor is not None:
         return _model, _processor
     logger.info(f"Loading model: {MODEL_ID}")
     start = time.time()
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    logger.info(f"Using device: {device}")
     load_kwargs: Dict[str, Any] = {
+        # Use float32 on CPU — bfloat16 is poorly supported on CPU
+        "torch_dtype": torch.bfloat16 if device == "cuda" else torch.float32,
+        "device_map": "auto" if device == "cuda" else "cpu",
+        # NO flash_attention_2 — only works with GPU + nvcc
     }
     _model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, **load_kwargs)
+    # Always disable talker on CPU — saves ~2GB and talker requires GPU
+    _model.disable_talker()
+    logger.info("Audio talker disabled (CPU mode — saves memory).")
     _processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
     _model_load_time = time.time() - start
+    logger.info(f"Model loaded in {_model_load_time:.2f}s on {device}")
     return _model, _processor
     conversation: List[Dict],
     return_audio: bool = False,
     speaker: str = "Chelsie",
+    max_new_tokens: int = 256,
     temperature: float = 0.7,
     use_audio_in_video: bool = True,
 ) -> Tuple[str, Optional[bytes], int, int]:
     model = get_model()
     processor = get_processor()
+    # Force return_audio=False on CPU since talker is disabled
+    if not torch.cuda.is_available():
+        return_audio = False
     text_template = processor.apply_chat_template(
         conversation,
         add_generation_prompt=True,
         return_tensors="pt",
         padding=True,
         use_audio_in_video=use_audio_in_video,
+    ).to(model.device)
+    # Match dtype for CPU (float32)
+    if not torch.cuda.is_available():
+        inputs = {k: v.float() if v.dtype == torch.float16 else v
+                  for k, v in inputs.items()}
     prompt_tokens = inputs["input_ids"].shape[-1]
         "max_new_tokens": max_new_tokens,
         "temperature": temperature,
         "do_sample": temperature > 0,
+        "return_audio": False,  # Always False — talker disabled on CPU
     }
     with torch.inference_mode():
         outputs = model.generate(**inputs, **generate_kwargs)
+    completion_tokens = outputs.shape[-1] - prompt_tokens
+    decoded = processor.batch_decode(
+        outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
+    )
     answer = decoded[0] if decoded else ""
+    return answer, None, prompt_tokens, completion_tokens
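One detail worth checking in the new `run_inference`: `completion_tokens` subtracts the prompt length, but `batch_decode` is applied to the full `outputs`. For decoder-only `generate` calls the returned sequences typically begin with the prompt tokens, so the decoded answer may include the prompt text. A sketch of slicing the prompt off first is shown below; it uses plain lists so it runs without torch (with tensors the equivalent slice would be `outputs[:, prompt_tokens:]`), and `split_completion` is a hypothetical name.

```python
from typing import List, Tuple


def split_completion(sequences: List[List[int]],
                     prompt_tokens: int) -> Tuple[List[List[int]], int]:
    """Drop the first `prompt_tokens` ids from each sequence.

    Returns the completion-only sequences and the completion length
    (taken from the first sequence, as in the diff's shape arithmetic).
    """
    completions = [seq[prompt_tokens:] for seq in sequences]
    completion_tokens = len(completions[0]) if completions else 0
    return completions, completion_tokens
```

Whether this matters depends on how the caller post-processes `answer`, which this diff does not show.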

app/schema.py CHANGED

@@ -25,9 +25,11 @@ class RAGQueryRequest(BaseModel):
     query_text: Optional[str] = Field(default=None, description="Natural language query")
     media_inputs: Optional[List[MediaInput]] = Field(default=[], description="List of multimodal inputs")
     top_k: int = Field(default=5, ge=1, le=20, description="Number of RAG context chunks to retrieve")
-    return_audio: bool = Field(default=False, description="Whether to return audio response")
     speaker: Literal["Chelsie", "Ethan"] = Field(default="Chelsie")
-    max_new_tokens: int = Field(default=512, ge=64, le=2048)
     temperature: float = Field(default=0.7, ge=0.0, le=2.0)
     @validator("media_inputs", always=True)

     query_text: Optional[str] = Field(default=None, description="Natural language query")
     media_inputs: Optional[List[MediaInput]] = Field(default=[], description="List of multimodal inputs")
     top_k: int = Field(default=5, ge=1, le=20, description="Number of RAG context chunks to retrieve")
+    return_audio: bool = Field(
+        default=False,
+        description="Audio output (GPU only — disabled on CPU deployments)",
+    )
     speaker: Literal["Chelsie", "Ethan"] = Field(default="Chelsie")
+    max_new_tokens: int = Field(default=256, ge=32, le=512)
     temperature: float = Field(default=0.7, ge=0.0, le=2.0)
     @validator("media_inputs", always=True)
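The schema change narrows `max_new_tokens` from 64 to 2048 down to 32 to 512, so a client still sending, say, 1024 would presumably now fail validation. A client-side clamp is one way to stay within the new range; the sketch below is an assumption about how a caller might adapt, with hypothetical names and bounds copied from the diff.

```python
MIN_NEW_TOKENS = 32   # lower bound from the updated schema (ge=32)
MAX_NEW_TOKENS = 512  # upper bound from the updated schema (le=512)


def clamp_max_new_tokens(requested: int) -> int:
    """Clamp a requested token budget into the range the schema accepts."""
    return max(MIN_NEW_TOKENS, min(MAX_NEW_TOKENS, requested))
```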

requirements.txt CHANGED

@@ -8,6 +8,6 @@ qwen-omni-utils[decord]
 sentence-transformers>=3.0.0
 scikit-learn>=1.4.0
 soundfile>=0.12.1
-torch>=2.3.0
-slowapi>=0.1.9
 numpy>=1.26.0

 sentence-transformers>=3.0.0
 scikit-learn>=1.4.0
 soundfile>=0.12.1
 numpy>=1.26.0
+slowapi>=0.1.9
+# flash-attn REMOVED — requires nvcc/GPU to compile