Dinuk-Di committed on
Commit 2e2fd75 · 1 Parent(s): dbbbfec

Cuda Removed

Files changed (5)
  1. Dockerfile +15 -9
  2. README.md +127 -7
  3. app/model.py +33 -38
  4. app/schema.py +4 -2
  5. requirements.txt +2 -2
Dockerfile CHANGED
@@ -1,27 +1,24 @@
1
- FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
2
 
3
  ENV DEBIAN_FRONTEND=noninteractive \
4
  PYTHONUNBUFFERED=1 \
5
  PYTHONDONTWRITEBYTECODE=1 \
6
  HF_HOME=/app/.cache/huggingface \
 
7
  PORT=7860
8
 
9
  RUN apt-get update && apt-get install -y --no-install-recommends \
10
- python3.11 python3.11-dev python3-pip \
11
  ffmpeg libsndfile1 git curl \
12
  && apt-get clean && rm -rf /var/lib/apt/lists/*
13
 
14
- RUN ln -sf /usr/bin/python3.11 /usr/bin/python3 && \
15
- ln -sf /usr/bin/python3 /usr/bin/python
16
-
17
  WORKDIR /app
18
 
 
19
  RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging
20
 
 
21
  RUN pip install --no-cache-dir torch torchvision torchaudio \
22
- --index-url https://download.pytorch.org/whl/cu121
23
-
24
- RUN pip install --no-cache-dir flash-attn --no-build-isolation
25
 
26
  COPY requirements.txt .
27
  RUN pip install --no-cache-dir -r requirements.txt
@@ -32,4 +29,13 @@ RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
32
  USER appuser
33
 
34
  EXPOSE 7860
35
- CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]

1
+ FROM python:3.11-slim
2
 
3
  ENV DEBIAN_FRONTEND=noninteractive \
4
  PYTHONUNBUFFERED=1 \
5
  PYTHONDONTWRITEBYTECODE=1 \
6
  HF_HOME=/app/.cache/huggingface \
7
+ TRANSFORMERS_CACHE=/app/.cache/huggingface \
8
  PORT=7860
9
 
10
  RUN apt-get update && apt-get install -y --no-install-recommends \
 
11
  ffmpeg libsndfile1 git curl \
12
  && apt-get clean && rm -rf /var/lib/apt/lists/*
13
 
 
 
 
14
  WORKDIR /app
15
 
16
+ # Upgrade build tools first
17
  RUN pip install --no-cache-dir --upgrade pip setuptools wheel packaging
18
 
19
+ # CPU-only PyTorch (no CUDA, no nvcc needed)
20
  RUN pip install --no-cache-dir torch torchvision torchaudio \
21
+ --index-url https://download.pytorch.org/whl/cpu
 
 
22
 
23
  COPY requirements.txt .
24
  RUN pip install --no-cache-dir -r requirements.txt
 
29
  USER appuser
30
 
31
  EXPOSE 7860
32
+
33
+ HEALTHCHECK --interval=60s --timeout=15s --start-period=300s --retries=3 \
34
+ CMD curl -f http://localhost:7860/ || exit 1
35
+
36
+ CMD ["python", "-m", "uvicorn", "main:app", \
37
+ "--host", "0.0.0.0", \
38
+ "--port", "7860", \
39
+ "--workers", "1", \
40
+ "--loop", "uvloop", \
41
+ "--log-level", "info"]
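
The `HEALTHCHECK` added above polls `http://localhost:7860/` with `curl -f`, which fails on HTTP status 400 and above. The README in this commit documents a `/health` route, and the diff does not show whether `/` returns a success status, so the probed path is worth confirming. As a rough sketch, the same check can be reproduced from Python with only the standard library; `probe` and its default timeout are hypothetical, not part of the repository.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 15.0) -> bool:
    """Return True if `url` answers with an HTTP status below 400.

    Roughly mirrors `curl -f URL || exit 1`: HTTP errors, refused
    connections, and timeouts all count as unhealthy. Unlike plain curl,
    urlopen follows redirects before the status is inspected.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # HTTPError (raised for 4xx/5xx) is a subclass of URLError.
        return False
```

Calling `probe("http://localhost:7860/")` and `probe("http://localhost:7860/health")` against a running container would show which of the two paths the health check should target.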
README.md CHANGED
@@ -1,10 +1,130 @@
1
  ---
2
- title: Boqapi
3
- emoji: 🐨
4
- colorFrom: purple
5
- colorTo: yellow
6
- sdk: docker
7
- pinned: false
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

1
+ # Boqapi
2
+
3
+ **Boqapi** is a production-ready Multimodal RAG (Retrieval-Augmented Generation) API. It supports text, image, audio, and video for both ingestion and querying, using [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) for multimodal reasoning and generation and `all-MiniLM-L6-v2` for embeddings.
4
+
5
+ ## Features
6
+ - **Multimodal Generation**: Powered by `Qwen2.5-Omni-7B`, the API processes audio, visual, and textual inputs to generate relevant text or audio outputs.
7
+ - **Document Ingestion**: Ingest text, image, audio, or video documents. Text is embedded directly; for non-text modalities, a text descriptor is generated and embedded for RAG vector search.
8
+ - **RAG Querying**: Combines in-memory vector similarity search (Cosine Similarity) with the reasoning capabilities of Qwen2.5-Omni-7B.
9
+ - **FastAPI Backend**: Provides high performance and asynchronous request handling with CORS, Rate-Limiting, and global exception management.
10
+
11
+ ---
12
+
13
+ ## Architecture Details
14
+
15
+ ### Ingestion Flow
16
+ When multimodal documents are submitted to the `/ingest/documents` endpoint:
17
+ 1. **Text**: Embedded directly using `sentence-transformers/all-MiniLM-L6-v2`.
18
+ 2. **Audio/Video/Image**: A modality-specific text descriptor is created and embedded.
19
+ 3. **Storage**: Both the document metadata and embeddings are placed into an in-memory vector store for quick retrieval.
20
+
21
+ ```mermaid
22
+ flowchart LR
23
+ A[Multimodal Document] --> T{Is Text?}
24
+ T -- Yes --> B[Extract Text]
25
+ T -- No --> C[Create Descriptor]
26
+ B --> E[SentenceTransformer Embedding]
27
+ C --> E
28
+ E --> V[(In-Memory Vector Store)]
29
+ ```
30
+
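
The ingestion steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's code: `Document`, `describe`, `toy_embed`, and `VectorStore` are hypothetical names, the descriptor format is invented, and `toy_embed` is a hashed bag-of-words stand-in for `sentence-transformers/all-MiniLM-L6-v2` so that the example runs without downloading a model.

```python
import hashlib
import math
from dataclasses import dataclass, field
from typing import List, Tuple

DIM = 64  # toy dimension; all-MiniLM-L6-v2 produces 384-dimensional vectors


@dataclass
class Document:
    modality: str  # "text", "image", "audio", or "video"
    content: str   # raw text, or a reference to the media item


def describe(doc: Document) -> str:
    """Text is used as-is; other modalities get a text descriptor (assumed format)."""
    if doc.modality == "text":
        return doc.content
    return f"[{doc.modality}] {doc.content}"


def toy_embed(text: str) -> List[float]:
    """Hashed bag-of-words vector, L2-normalised. Stand-in for a real encoder."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


@dataclass
class VectorStore:
    """In-memory store keeping each document next to its embedding."""
    items: List[Tuple[Document, List[float]]] = field(default_factory=list)

    def ingest(self, docs: List[Document]) -> int:
        """Embed each document's text or descriptor; return the store size."""
        for doc in docs:
            self.items.append((doc, toy_embed(describe(doc))))
        return len(self.items)
```

Because both branches end in the same embedding call, retrieval later needs only one vector space, which matches the single `SentenceTransformer Embedding` node in the flowchart.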
31
+ ### RAG Query Flow
32
+ When a user submits a query along with optional media files to the `/rag/query` endpoint:
33
+ 1. **Retrieval**: The textual component of the query is embedded, and cosine similarity is used to find the top-K relevant documents from the vector store.
34
+ 2. **Conversation Assembly**: Retrieved contexts and the user's media inputs (images, audio, video) are formatted into a structured prompt.
35
+ 3. **Inference**: The prompt is processed by `Qwen2.5-Omni-7B`. If audio output is requested, it synthesizes speech (e.g., with the "Chelsie" voice).
36
+ 4. **Response**: The API returns the generated text, retrieved documents, token usage, performance metrics, and optionally base64-encoded audio.
37
+
38
+ ```mermaid
39
+ flowchart TD
40
+ Q[User Query + Media] --> R[Embed Text Query]
41
+ R --> S[Vector Similarity Search]
42
+ S --> V[(In-Memory Vector Store)]
43
+ V --> C
44
+ S --> |Top-K Docs| C[Assemble Context & Media Prompt]
45
+ C --> I[[Qwen2.5-Omni-7B Inference Engine]]
46
+ I --> O[Generated Text / Audio Response]
47
+ ```
48
+
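
The retrieval step (step 1 above) amounts to ranking stored vectors by cosine similarity against the query vector and keeping the best K. A minimal standard-library sketch follows; `cosine` and `top_k` are hypothetical helpers, and the repository may well use scikit-learn or NumPy for this instead (`scikit-learn` appears in `requirements.txt`).

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined here as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: Sequence[float],
          vectors: List[Sequence[float]],
          k: int = 5) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this is adequate for a small in-memory store; the default `k=5` mirrors the `top_k` default in `app/schema.py`.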
49
  ---
50
+
51
+ ## Setup and Execution
52
+
53
+ ### Prerequisites
54
+ - Nvidia GPU (optional; the provided Docker image is CPU-only)
55
+ - Docker (for containerized execution)
56
+ - Python 3.11+ (for local execution)
57
+
58
+ ### 1. Running with Docker
59
+
60
+ The provided `Dockerfile` builds on `python:3.11-slim` and installs CPU-only PyTorch, ffmpeg, and the remaining dependencies.
61
+
62
+ ```bash
63
+ # Build the image
64
+ docker build -t boqapi .
65
+
66
+ # Run the container (the image is CPU-only, so no --gpus flag is needed)
67
+ docker run -p 7860:7860 boqapi
68
+ ```
69
+
70
+ ### 2. Running Locally
71
+
72
+ If you prefer running directly on your host machine:
73
+
74
+ ```bash
75
+ # Clone the repository and navigate to the project directory
76
+ cd boqapi
77
+
78
+ # Install system dependencies
79
+ sudo apt-get install ffmpeg libsndfile1
80
+
81
+ # Install Python dependencies
82
+ pip install -r requirements.txt
83
+
84
+ # Run the FastAPI application
85
+ python app/main.py
86
+ # OR
87
+ uvicorn main:app --host 0.0.0.0 --port 7860 --workers 1
88
+ ```
89
+
90
+ *(Note: Ensure you have Flash Attention 2 compatible hardware if `flash-attn` is installed.)*
91
+
92
  ---
93
 
94
+ ## Example API Usage
95
+
96
+ The API is secured (by default) with a static API key. Pass `x-api-key: dev-secret` in your headers.
97
+
98
+ ### 1. Health Check
99
+ ```bash
100
+ curl -X GET "http://localhost:7860/health"
101
+ ```
102
+
103
+ ### 2. Ingest Documents
104
+ ```bash
105
+ curl -X POST "http://localhost:7860/ingest/documents" \
106
+ -H "x-api-key: dev-secret" \
107
+ -H "Content-Type: application/json" \
108
+ -d '{
109
+ "user_id": "user123",
110
+ "documents": [
111
+ {
112
+ "modality": "text",
113
+ "content": "The Qwen model supports text, image, audio, and video."
114
+ }
115
+ ]
116
+ }'
117
+ ```
118
+
119
+ ### 3. RAG Query
120
+ ```bash
121
+ curl -X POST "http://localhost:7860/rag/query" \
122
+ -H "x-api-key: dev-secret" \
123
+ -H "Content-Type: application/json" \
124
+ -d '{
125
+ "user_id": "user123",
126
+ "query_text": "What does the Qwen model support?",
127
+ "top_k": 3,
128
+ "return_audio": false
129
+ }'
130
+ ```
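
For callers that prefer Python over `curl`, the RAG query above translates to a standard-library request. The URL, header, and JSON fields are taken from the README example; `build_rag_request` and `send` are hypothetical helper names, and the response schema is not shown in this diff, so the sketch simply decodes whatever JSON comes back.

```python
import json
import urllib.request
from typing import Any, Dict


def build_rag_request(base_url: str, api_key: str, query_text: str,
                      user_id: str = "user123",
                      top_k: int = 3) -> urllib.request.Request:
    """Build (but do not send) a POST to /rag/query mirroring the curl example."""
    payload: Dict[str, Any] = {
        "user_id": user_id,
        "query_text": query_text,
        "top_k": top_k,
        "return_audio": False,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rag/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0) -> Dict[str, Any]:
    """Send the request and decode the JSON body; raises on HTTP errors."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The long default timeout is a guess: CPU inference with a 7B model is likely to be slow, so a client-side timeout of a few seconds would probably fire before the server answers.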
app/model.py CHANGED
@@ -2,10 +2,10 @@
2
  import torch
3
  import logging
4
  import time
5
- from functools import lru_cache
 
6
  from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
7
  from qwen_omni_utils import process_mm_info
8
- from typing import Optional, Tuple, List, Dict, Any
9
 
10
  logger = logging.getLogger(__name__)
11
 
@@ -16,34 +16,35 @@ _processor: Optional[Qwen2_5OmniProcessor] = None
16
  _model_load_time: float = 0.0
17
 
18
 
19
- def load_model(enable_audio_output: bool = False) -> Tuple[
20
- Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
21
- ]:
22
  global _model, _processor, _model_load_time
 
23
  if _model is not None and _processor is not None:
24
  return _model, _processor
25
 
26
  logger.info(f"Loading model: {MODEL_ID}")
27
  start = time.time()
28
 
 
 
 
29
  load_kwargs: Dict[str, Any] = {
30
- "torch_dtype": torch.bfloat16,
31
- "device_map": "auto",
 
 
32
  }
33
 
34
- if torch.cuda.is_available():
35
- load_kwargs["attn_implementation"] = "flash_attention_2"
36
- logger.info("Flash Attention 2 enabled.")
37
-
38
  _model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, **load_kwargs)
39
 
40
- if not enable_audio_output:
41
- _model.disable_talker()
42
- logger.info("Audio talker disabled - saving ~2GB VRAM.")
43
 
44
  _processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
45
  _model_load_time = time.time() - start
46
- logger.info(f"Model loaded in {_model_load_time:.2f}s")
 
47
  return _model, _processor
48
 
49
 
@@ -63,16 +64,17 @@ def run_inference(
63
  conversation: List[Dict],
64
  return_audio: bool = False,
65
  speaker: str = "Chelsie",
66
- max_new_tokens: int = 512,
67
  temperature: float = 0.7,
68
  use_audio_in_video: bool = True,
69
  ) -> Tuple[str, Optional[bytes], int, int]:
70
- """
71
- Returns: (text_output, audio_bytes_or_None, prompt_tokens, completion_tokens)
72
- """
73
  model = get_model()
74
  processor = get_processor()
75
 
 
 
 
 
76
  text_template = processor.apply_chat_template(
77
  conversation,
78
  add_generation_prompt=True,
@@ -90,7 +92,12 @@ def run_inference(
90
  return_tensors="pt",
91
  padding=True,
92
  use_audio_in_video=use_audio_in_video,
93
- ).to(model.device).to(model.dtype)
94
 
95
  prompt_tokens = inputs["input_ids"].shape[-1]
96
 
@@ -99,28 +106,16 @@ def run_inference(
99
  "max_new_tokens": max_new_tokens,
100
  "temperature": temperature,
101
  "do_sample": temperature > 0,
102
- "return_audio": return_audio,
103
  }
104
 
105
- if return_audio:
106
- generate_kwargs["speaker"] = speaker
107
-
108
  with torch.inference_mode():
109
  outputs = model.generate(**inputs, **generate_kwargs)
110
 
111
- if return_audio:
112
- text_ids, audio_tensor = outputs
113
- audio_np = audio_tensor.reshape(-1).detach().cpu().numpy()
114
- import io, soundfile as sf
115
- buf = io.BytesIO()
116
- sf.write(buf, audio_np, samplerate=24000, format="WAV")
117
- audio_bytes = buf.getvalue()
118
- else:
119
- text_ids = outputs
120
- audio_bytes = None
121
-
122
- completion_tokens = text_ids.shape[-1] - prompt_tokens
123
- decoded = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
124
  answer = decoded[0] if decoded else ""
125
 
126
- return answer, audio_bytes, prompt_tokens, completion_tokens
 
2
  import torch
3
  import logging
4
  import time
5
+ from typing import Optional, Tuple, List, Dict, Any
6
+
7
  from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
8
  from qwen_omni_utils import process_mm_info
 
9
 
10
  logger = logging.getLogger(__name__)
11
 
 
16
  _model_load_time: float = 0.0
17
 
18
 
19
+ def load_model(enable_audio_output: bool = False):
 
 
20
  global _model, _processor, _model_load_time
21
+
22
  if _model is not None and _processor is not None:
23
  return _model, _processor
24
 
25
  logger.info(f"Loading model: {MODEL_ID}")
26
  start = time.time()
27
 
28
+ device = "cuda" if torch.cuda.is_available() else "cpu"
29
+ logger.info(f"Using device: {device}")
30
+
31
  load_kwargs: Dict[str, Any] = {
32
+ # Use float32 on CPU - bfloat16 is poorly supported on CPU
33
+ "torch_dtype": torch.bfloat16 if device == "cuda" else torch.float32,
34
+ "device_map": "auto" if device == "cuda" else "cpu",
35
+ # NO flash_attention_2 - only works with GPU + nvcc
36
  }
37
 
 
 
 
 
38
  _model = Qwen2_5OmniForConditionalGeneration.from_pretrained(MODEL_ID, **load_kwargs)
39
 
40
+ # Always disable talker on CPU - saves ~2GB and talker requires GPU
41
+ _model.disable_talker()
42
+ logger.info("Audio talker disabled (CPU mode - saves memory).")
43
 
44
  _processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
45
  _model_load_time = time.time() - start
46
+ logger.info(f"Model loaded in {_model_load_time:.2f}s on {device}")
47
+
48
  return _model, _processor
49
 
50
 
 
64
  conversation: List[Dict],
65
  return_audio: bool = False,
66
  speaker: str = "Chelsie",
67
+ max_new_tokens: int = 256,
68
  temperature: float = 0.7,
69
  use_audio_in_video: bool = True,
70
  ) -> Tuple[str, Optional[bytes], int, int]:
 
 
 
71
  model = get_model()
72
  processor = get_processor()
73
 
74
+ # Force return_audio=False on CPU since talker is disabled
75
+ if not torch.cuda.is_available():
76
+ return_audio = False
77
+
78
  text_template = processor.apply_chat_template(
79
  conversation,
80
  add_generation_prompt=True,
 
92
  return_tensors="pt",
93
  padding=True,
94
  use_audio_in_video=use_audio_in_video,
95
+ ).to(model.device)
96
+
97
+ # Match dtype for CPU (float32)
98
+ if not torch.cuda.is_available():
99
+ inputs = {k: v.float() if v.dtype == torch.float16 else v
100
+ for k, v in inputs.items()}
101
 
102
  prompt_tokens = inputs["input_ids"].shape[-1]
103
 
 
106
  "max_new_tokens": max_new_tokens,
107
  "temperature": temperature,
108
  "do_sample": temperature > 0,
109
+ "return_audio": False, # Always False - talker disabled on CPU
110
  }
111
 
 
 
 
112
  with torch.inference_mode():
113
  outputs = model.generate(**inputs, **generate_kwargs)
114
 
115
+ completion_tokens = outputs.shape[-1] - prompt_tokens
116
+ decoded = processor.batch_decode(
117
+ outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
118
+ )
119
  answer = decoded[0] if decoded else ""
120
 
121
+ return answer, None, prompt_tokens, completion_tokens
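
The device-dependent choices made in `load_model` can be isolated into small pure functions, which makes them easy to unit-test without loading the model. The sketch below is an assumption about how such a refactor might look, not code from the commit: `select_load_kwargs` and `effective_return_audio` are hypothetical names, and the dtype is returned as a string (`"bfloat16"` or `"float32"`) that a caller would map to the torch dtype with `getattr(torch, name)`.

```python
from typing import Any, Dict


def select_load_kwargs(cuda_available: bool) -> Dict[str, Any]:
    """Mirror the commit's rule: bfloat16 with device_map="auto" on CUDA,
    float32 with device_map="cpu" otherwise. No attn_implementation is
    set, so the library's default attention is used on both paths."""
    if cuda_available:
        return {"dtype_name": "bfloat16", "device_map": "auto"}
    return {"dtype_name": "float32", "device_map": "cpu"}


def effective_return_audio(requested: bool, talker_enabled: bool) -> bool:
    """Audio can only be returned while the talker is still enabled."""
    return requested and talker_enabled
```

Two points in the committed code seem worth checking. `disable_talker()` is now called unconditionally and `generate` always receives `return_audio=False`, so audio output appears to be off on GPU hosts as well, although the comments describe it as a CPU-only restriction. The `.to(model.dtype)` cast was also dropped, so on the CUDA path floating-point inputs may no longer be converted to bfloat16 before `generate` is called.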
app/schema.py CHANGED
@@ -25,9 +25,11 @@ class RAGQueryRequest(BaseModel):
25
  query_text: Optional[str] = Field(default=None, description="Natural language query")
26
  media_inputs: Optional[List[MediaInput]] = Field(default=[], description="List of multimodal inputs")
27
  top_k: int = Field(default=5, ge=1, le=20, description="Number of RAG context chunks to retrieve")
28
- return_audio: bool = Field(default=False, description="Whether to return audio response")
 
 
29
  speaker: Literal["Chelsie", "Ethan"] = Field(default="Chelsie")
30
- max_new_tokens: int = Field(default=512, ge=64, le=2048)
31
  temperature: float = Field(default=0.7, ge=0.0, le=2.0)
32
 
33
  @validator("media_inputs", always=True)
 
25
  query_text: Optional[str] = Field(default=None, description="Natural language query")
26
  media_inputs: Optional[List[MediaInput]] = Field(default=[], description="List of multimodal inputs")
27
  top_k: int = Field(default=5, ge=1, le=20, description="Number of RAG context chunks to retrieve")
28
+ return_audio: bool = Field(
29
+ default=False,
30
+ description="Audio output (GPU only - disabled on CPU deployments)")
31
  speaker: Literal["Chelsie", "Ethan"] = Field(default="Chelsie")
32
+ max_new_tokens: int = Field(default=256, ge=32, le=512)
33
  temperature: float = Field(default=0.7, ge=0.0, le=2.0)
34
 
35
  @validator("media_inputs", always=True)
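
The tightened bounds on `max_new_tokens` (previously 64 to 2048 with a default of 512, now 32 to 512 with a default of 256) mean that some requests accepted before this commit would now fail validation. The standard-library sketch below imitates that check for illustration only; the real schema relies on Pydantic's `Field(ge=..., le=...)`, and `check_max_new_tokens` is a hypothetical name.

```python
from typing import Tuple

OLD_BOUNDS: Tuple[int, int] = (64, 2048)  # before this commit
NEW_BOUNDS: Tuple[int, int] = (32, 512)   # after this commit


def check_max_new_tokens(value: int, bounds: Tuple[int, int] = NEW_BOUNDS) -> int:
    """Return `value` if it lies within the inclusive bounds, else raise ValueError."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"max_new_tokens must be in [{low}, {high}], got {value}")
    return value
```

For example, a client that sends `max_new_tokens=1024` passes under `OLD_BOUNDS` but raises under `NEW_BOUNDS`, so existing callers may need to lower the value.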
requirements.txt CHANGED
@@ -8,6 +8,6 @@ qwen-omni-utils[decord]
8
  sentence-transformers>=3.0.0
9
  scikit-learn>=1.4.0
10
  soundfile>=0.12.1
11
- torch>=2.3.0
12
- slowapi>=0.1.9
13
  numpy>=1.26.0
 
 
 
8
  sentence-transformers>=3.0.0
9
  scikit-learn>=1.4.0
10
  soundfile>=0.12.1
 
 
11
  numpy>=1.26.0
12
+ slowapi>=0.1.9
13
+ # flash-attn REMOVED - requires nvcc/GPU to compile