ming committed on
Commit 7fff563 · 1 Parent(s): fd2a8c1

Implement Option 3: Use FP16 for 2-3x faster inference


Speed optimization:
- Added v4_use_fp16_for_speed config option
- When enabled, uses FP16 instead of 4-bit quantization
- FP16 is 2-3x faster than 4-bit NF4 quantization
- Enabled by default in Dockerfile for maximum speed

Memory trade-off:
- FP16 uses ~2-3GB GPU memory (vs ~1GB for 4-bit)
- Still fits comfortably on T4 GPU (16GB total)
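The memory figures above can be sanity-checked with bytes-per-parameter arithmetic. A rough sketch, assuming ~1.5B parameters for Qwen2.5-1.5B-Instruct and counting weights only (activations, KV cache, and NF4 quantization constants add overhead on top):

```python
def model_weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate GPU memory for model weights alone."""
    return num_params * bytes_per_param / 1e9

params = 1.5e9  # Qwen2.5-1.5B-Instruct (assumed parameter count)

fp16_gb = model_weight_gb(params, 2.0)  # FP16: 2 bytes/param -> ~3.0 GB
nf4_gb = model_weight_gb(params, 0.5)   # 4-bit NF4: ~0.5 bytes/param -> ~0.75 GB
```

This lines up with the ~2-3GB vs ~1GB figures once runtime overhead is included.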

Expected results:
- Generation time: 24.9s → 8-12s (2-3x speedup)
- Same output quality
- Faster token generation (~20-30 tokens/sec vs ~6 tokens/sec)

This completes the speed optimization plan (Option 1 + 2 + 3)
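The branch precedence added in the diffs below (the FP16 flag short-circuits the 4-bit path) can be sketched as a pure selection function. The function name and return strings here are illustrative, not the repo's API:

```python
def choose_load_mode(use_cuda: bool,
                     use_fp16_for_speed: bool,
                     enable_quantization: bool,
                     has_bitsandbytes: bool) -> str:
    """Mirror the branch order added in structured_summarizer.py."""
    if use_cuda and not use_fp16_for_speed and enable_quantization and has_bitsandbytes:
        return "4-bit NF4 (bitsandbytes, GPU)"
    elif use_cuda and use_fp16_for_speed:
        return "FP16 (GPU, fast)"
    else:
        return "fallback (CPU or unquantized)"
```

Note that without CUDA the FP16 flag is ignored and the existing fallback path runs.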

Dockerfile CHANGED
@@ -12,6 +12,7 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
     ENABLE_V4_WARMUP=true \
     V4_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct \
     V4_ENABLE_QUANTIZATION=true \
+    V4_USE_FP16_FOR_SPEED=true \
    HF_HOME=/tmp/huggingface \
    TRANSFORMERS_NO_TORCHAO=1
app/core/config.py CHANGED
@@ -122,6 +122,11 @@ class Settings(BaseSettings):
         env="V4_ENABLE_QUANTIZATION",
         description="Enable INT8 quantization for V4 model (reduces memory from ~2GB to ~1GB). Quantization takes ~1-2 minutes on startup.",
     )
+    v4_use_fp16_for_speed: bool = Field(
+        default=False,
+        env="V4_USE_FP16_FOR_SPEED",
+        description="Use FP16 instead of 4-bit quantization for 2-3x faster inference (uses ~2-3GB GPU memory instead of ~1GB)",
+    )
 
     @validator("log_level")
     def validate_log_level(cls, v):
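Pydantic-style settings coerce the environment string to a bool before the field is read. A stdlib-only stand-in for that behavior (`env_flag` is a hypothetical helper, not the repo's code, and the truthy set only approximates pydantic's coercion rules):

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}

def env_flag(name: str, default: bool = False) -> bool:
    """Read an env var as a bool, falling back to a default when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY

# With the Dockerfile's ENV line applied:
os.environ["V4_USE_FP16_FOR_SPEED"] = "true"
use_fp16 = env_flag("V4_USE_FP16_FOR_SPEED")  # -> True
```

This matches the field's behavior: `default=False` in code, overridden to `true` by the Dockerfile.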
app/services/structured_summarizer.py CHANGED
@@ -81,10 +81,14 @@ class StructuredSummarizer:
             logger.info("CUDA is NOT available. V4 model will run on CPU.")
 
         # ------------------------------------------------------------------
-        # Preferred path: 4-bit NF4 on GPU via bitsandbytes
+        # Preferred path: 4-bit NF4 on GPU via bitsandbytes (memory efficient)
+        # OR FP16 for speed (2-3x faster, uses more memory)
         # ------------------------------------------------------------------
+        use_fp16_for_speed = getattr(settings, "v4_use_fp16_for_speed", False)
+
         if (
             use_cuda
+            and not use_fp16_for_speed
             and getattr(settings, "v4_enable_quantization", True)
             and HAS_BITSANDBYTES
         ):
@@ -104,6 +108,18 @@ class StructuredSummarizer:
                 trust_remote_code=True,
             )
             quantization_desc = "4-bit NF4 (bitsandbytes, GPU)"
+
+        elif use_cuda and use_fp16_for_speed:
+            # Use FP16 for 2-3x faster inference (uses ~2-3GB GPU memory)
+            logger.info("Loading V4 model in FP16 for maximum speed (2-3x faster than 4-bit)...")
+            self.model = AutoModelForCausalLM.from_pretrained(
+                settings.v4_model_id,
+                torch_dtype=torch.float16,
+                device_map="auto",
+                cache_dir=settings.hf_cache_dir,
+                trust_remote_code=True,
+            )
+            quantization_desc = "FP16 (GPU, fast)"
 
         else:
             # ------------------------------------------------------------------