Spaces:

colin730
/

SummarizerApp

Sleeping

ming commited on Dec 20, 2025

Commit

8ca69c7

1 Parent(s): bd7d2c1

docs: Update README.md with comprehensive V4 API documentation

- Add V4 API endpoints documentation (stream and stream-ndjson)
- Document V4 configuration environment variables
- Add V4 performance metrics (memory, inference speed, GPU support)
- Include comprehensive V4 usage examples (Python, cURL)
- Document three summarization styles (executive, skimmer, eli5)
- Add V4 output schema with 6 structured fields
- Update deployment options for different memory constraints
- Update security section with SSRF protection details
- Correct V2 warmup default to false
- Add troubleshooting section for V4-specific issues

Files changed (1) hide show

README.md +278 -19

README.md CHANGED Viewed

@@ -11,11 +11,15 @@ app_port: 7860
 # Text Summarizer API
-A FastAPI-based text summarization service powered by Ollama and Mistral 7B model.
 ## 🚀 Features
-- **Fast text summarization** using local LLM inference
 - **RESTful API** with FastAPI
 - **Health monitoring** and logging
 - **Docker containerized** for easy deployment
@@ -45,6 +49,12 @@ POST /api/v2/summarize/stream
 POST /api/v3/scrape-and-summarize/stream
 ```
 ## 🌐 Live Deployment
 **✅ Successfully deployed and tested on Hugging Face Spaces!**
@@ -56,14 +66,28 @@ POST /api/v3/scrape-and-summarize/stream
 ### Quick Test
 ```bash
-# Test the live deployment
 curl https://colin730-SummarizerApp.hf.space/health
 curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
   -H "Content-Type: application/json" \
   -d '{"text":"This is a test of the live API.","max_tokens":50}'
 ```
-**Request Format (V1 and V2 compatible):**
 ```json
 {
   "text": "Your long text to summarize here...",
@@ -72,6 +96,33 @@ curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
 }
 ```
 ### API Documentation
 - **Swagger UI**: `/docs`
 - **ReDoc**: `/redoc`
@@ -105,6 +156,15 @@ The service uses the following environment variables:
 - `SCRAPING_UA_ROTATION`: Enable user-agent rotation (default: `true`)
 - `SCRAPING_RATE_LIMIT_PER_MINUTE`: Rate limit per IP (default: `10`)
 ### Server Configuration
 - `SERVER_HOST`: Server host (default: `127.0.0.1`)
 - `SERVER_PORT`: Server port (default: `8000`)
@@ -132,13 +192,29 @@ This app is optimized for deployment on Hugging Face Spaces using Docker SDK.
 - Optimized for free tier resource limits
 **Environment Variables for HF Spaces:**
 ```bash
 ENABLE_V1_WARMUP=false
-ENABLE_V2_WARMUP=true
 HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
 HF_HOME=/tmp/huggingface
 ```
 ## 📊 Performance
 ### V1 (Ollama + Transformers Pipeline)
@@ -160,12 +236,30 @@ HF_HOME=/tmp/huggingface
 - **Total latency**: 2-5 seconds (scrape + summarize)
 - **Success rate**: 95%+ article extraction
 ### Memory Optimization
 - **V1 warmup disabled by default** (`ENABLE_V1_WARMUP=false`)
-- **V2 warmup enabled by default** (`ENABLE_V2_WARMUP=true`)
-- **HuggingFace Spaces**: V2-only deployment (no Ollama)
-- **Local development**: V1 endpoints work if Ollama is running externally
-- **distilbart-cnn-6-6 model**: Optimized for HuggingFace Spaces free tier with CNN/DailyMail fine-tuning
 ## 🛠️ Development
@@ -312,6 +406,142 @@ for line in response.iter_lines():
 - Website blocks scrapers
 - User has already extracted the text manually
 ### Android Client (SSE)
 ```kotlin
 // Android SSE client example
@@ -366,6 +596,16 @@ curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summariz
 curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
   -H "Content-Type: application/json" \
   -d '{"text": "Your article text here (minimum 50 characters)...", "max_tokens": 256}'
 ```
 ### Test Script
@@ -378,7 +618,10 @@ curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summariz
 - Non-root user execution
 - Input validation and sanitization
-- Rate limiting (configurable)
 - API key authentication (optional)
 ## 📈 Monitoring
@@ -393,10 +636,16 @@ The service includes:
 ### Common Issues
-1. **Model not loading**: Check if Ollama is running and model is pulled
-2. **Out of memory**: Ensure sufficient RAM (8GB+) for Mistral 7B
 3. **Slow startup**: Normal on first run due to model download
-4. **API errors**: Check logs via `/docs` endpoint
 ### Logs
 View application logs in the Hugging Face Spaces interface or check the health endpoint for service status.
@@ -420,15 +669,25 @@ MIT License - see LICENSE file for details.
 **Successfully deployed and tested on Hugging Face Spaces!** 🚀
 - ✅ **Proxy-aware FastAPI** with `root_path` support
-- ✅ **All endpoints working** (health, docs, V2 API)
 - ✅ **Real-time streaming** summarization
 - ✅ **No 404 errors** - all paths correctly configured
 - ✅ **Test script included** for easy verification
-### Recent Fixes Applied
-- Added `root_path=os.getenv("HF_SPACE_ROOT_PATH", "")` for HF Spaces proxy awareness
-- Ensured binding to `0.0.0.0:7860` as required by HF Spaces
-- Verified V2 router paths (`/api/v2/summarize/stream`) with no double prefixes
-- Created test script for external endpoint verification
 **Live Space:** https://colin730-SummarizerApp.hf.space 🎯

 # Text Summarizer API
+A FastAPI-based text summarization service with multiple summarization engines: Ollama, HuggingFace Transformers, Web Scraping, and Structured Output with Qwen models.
 ## 🚀 Features
+- **Multiple Summarization Engines**: Ollama, HuggingFace Transformers, and Qwen models
+- **Structured JSON Output**: V4 API returns rich metadata (title, key points, category, sentiment, reading time)
+- **Web Scraping Integration**: V3 and V4 APIs can scrape articles directly from URLs
+- **Real-time Streaming**: All endpoints support Server-Sent Events (SSE) streaming
+- **GPU Acceleration**: V4 supports CUDA, MPS (Apple Silicon), with automatic quantization
 - **RESTful API** with FastAPI
 - **Health monitoring** and logging
 - **Docker containerized** for easy deployment
 POST /api/v3/scrape-and-summarize/stream
 ```
+### V4 API (Structured Output with Qwen)
+```
+POST /api/v4/scrape-and-summarize/stream
+POST /api/v4/scrape-and-summarize/stream-ndjson
+```
 ## 🌐 Live Deployment
 **✅ Successfully deployed and tested on Hugging Face Spaces!**
 ### Quick Test
 ```bash
+# Test the live deployment - health check
 curl https://colin730-SummarizerApp.hf.space/health
+# Test V2 API (lightweight streaming)
 curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
   -H "Content-Type: application/json" \
   -d '{"text":"This is a test of the live API.","max_tokens":50}'
+# Test V3 API (web scraping)
+curl -X POST https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream \
+  -H "Content-Type: application/json" \
+  -d '{"url":"https://example.com/article","max_tokens":128}'
+# Test V4 API (structured output, if enabled)
+curl -X POST https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson \
+  -H "Content-Type: application/json" \
+  -d '{"text":"This is a test article. It contains important information.","style":"executive","max_tokens":256}'
 ```
+**Request Formats by API Version:**
+V1/V2 (Simple text summarization):
 ```json
 {
   "text": "Your long text to summarize here...",
 }
 ```
+V3 (URL scraping or text):
+```json
+{
+  "url": "https://example.com/article",
+  "max_tokens": 256,
+  "include_metadata": true,
+  "use_cache": true
+}
+```
+V4 (Structured output with styles):
+```json
+{
+  "url": "https://example.com/article",
+  "style": "executive",
+  "max_tokens": 512,
+  "include_metadata": true,
+  "use_cache": true
+}
+```
+**Which API to Use?**
+- **V1**: Local deployment with Ollama (requires external service)
+- **V2**: Lightweight cloud deployment, simple text summaries
+- **V3**: When you need to scrape articles from URLs + simple summaries
+- **V4**: When you need rich metadata (category, sentiment, key points) + GPU acceleration
 ### API Documentation
 - **Swagger UI**: `/docs`
 - **ReDoc**: `/redoc`
 - `SCRAPING_UA_ROTATION`: Enable user-agent rotation (default: `true`)
 - `SCRAPING_RATE_LIMIT_PER_MINUTE`: Rate limit per IP (default: `10`)
+### V4 Configuration (Structured Summarization)
+- `ENABLE_V4_STRUCTURED`: Enable V4 API (default: `true`)
+- `ENABLE_V4_WARMUP`: Load model at startup (default: `false` to save memory)
+- `V4_MODEL_ID`: Model to use (default: `Qwen/Qwen2.5-1.5B-Instruct`, alternative: `Qwen/Qwen2.5-3B-Instruct`)
+- `V4_MAX_TOKENS`: Max tokens to generate (default: `256`, range: 128-2048)
+- `V4_TEMPERATURE`: Sampling temperature (default: `0.2` for consistent output)
+- `V4_ENABLE_QUANTIZATION`: Enable INT8 quantization on CPU or 4-bit NF4 on CUDA (default: `true`)
+- `V4_USE_FP16_FOR_SPEED`: Use FP16 precision for 2-3x faster inference on GPU (default: `false`)
 ### Server Configuration
 - `SERVER_HOST`: Server host (default: `127.0.0.1`)
 - `SERVER_PORT`: Server port (default: `8000`)
 - Optimized for free tier resource limits
 **Environment Variables for HF Spaces:**
+For memory-constrained deployments (free tier):
 ```bash
 ENABLE_V1_WARMUP=false
+ENABLE_V2_WARMUP=false
+ENABLE_V3_SCRAPING=true
+ENABLE_V4_STRUCTURED=false
 HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
 HF_HOME=/tmp/huggingface
 ```
+For GPU-enabled deployments (paid tier with 16GB+ RAM):
+```bash
+ENABLE_V1_WARMUP=false
+ENABLE_V2_WARMUP=false
+ENABLE_V3_SCRAPING=true
+ENABLE_V4_STRUCTURED=true
+ENABLE_V4_WARMUP=false
+V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct
+V4_ENABLE_QUANTIZATION=true
+V4_USE_FP16_FOR_SPEED=true
+```
 ## 📊 Performance
 ### V1 (Ollama + Transformers Pipeline)
 - **Total latency**: 2-5 seconds (scrape + summarize)
 - **Success rate**: 95%+ article extraction
+### V4 (Structured Summarization with Qwen)
+- **V4 Models**: Qwen/Qwen2.5-1.5B-Instruct (default) or Qwen/Qwen2.5-3B-Instruct (higher quality)
+- **Memory usage**:
+  - 1.5B model: ~2-3GB RAM (FP16 on GPU), ~1GB (4-bit NF4 on CUDA)
+  - 3B model: ~6-7GB RAM (FP16 on GPU), ~3-4GB (4-bit NF4 on CUDA)
+- **Inference speed**:
+  - 1.5B model: 20-46 seconds per request
+  - 3B model: 40-60 seconds per request
+  - NDJSON streaming: 43% faster time-to-first-token
+- **GPU acceleration**: CUDA > MPS (Apple Silicon) > CPU (4x speed difference)
+- **Output format**: Structured JSON with 6 fields (title, summary, key_points, category, sentiment, read_time_min)
+- **Styles**: executive, skimmer, eli5
 ### Memory Optimization
 - **V1 warmup disabled by default** (`ENABLE_V1_WARMUP=false`)
+- **V2 warmup disabled by default** (`ENABLE_V2_WARMUP=false`)
+- **V4 warmup disabled by default** (`ENABLE_V4_WARMUP=false`) - Saves 2-7GB RAM
+- **HuggingFace Spaces deployment options**:
+  - V2-only: ~500MB (fits free tier)
+  - V2+V3: ~550MB (fits free tier)
+  - V2+V3+V4 (1.5B): ~3GB (requires paid tier)
+  - V2+V3+V4 (3B): ~7GB (requires paid tier)
+- **Local development**: All versions can run simultaneously with 8-10GB RAM
+- **GPU deployment**: V4 benefits significantly from CUDA or MPS acceleration
 ## 🛠️ Development
 - Website blocks scrapers
 - User has already extracted the text manually
+### V4 API (Structured Output with Qwen) - High-Quality Summaries
+**V4 supports two streaming formats and three summarization styles**
+#### Streaming Format 1: Standard JSON Streaming (stream)
+```python
+import requests
+import json
+# V4 scrape article from URL and stream structured JSON
+response = requests.post(
+    "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream",
+    json={
+        "url": "https://example.com/article",
+        "style": "executive",  # Options: "executive", "skimmer", "eli5"
+        "max_tokens": 256,
+        "include_metadata": True,
+        "use_cache": True
+    },
+    stream=True
+)
+for line in response.iter_lines():
+    if line.startswith(b'data: '):
+        data = json.loads(line[6:])
+        # First event: metadata
+        if data.get("type") == "metadata":
+            print(f"Style: {data['data']['style']}")
+            print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")
+        # Content events (streaming JSON tokens)
+        elif "content" in data:
+            print(data["content"], end="")
+            if data["done"]:
+                # Parse final JSON
+                summary = json.loads(accumulated_content)
+                print(f"\n\nTitle: {summary['title']}")
+                print(f"Category: {summary['category']}")
+                print(f"Sentiment: {summary['sentiment']}")
+                print(f"Key Points: {summary['key_points']}")
+                break
+```
+#### Streaming Format 2: NDJSON Patch Streaming (stream-ndjson) - 43% Faster
+```python
+import requests
+import json
+# V4 NDJSON streaming - progressive JSON updates for real-time UI
+response = requests.post(
+    "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson",
+    json={
+        "text": "Your article text here (minimum 50 characters)...",
+        "style": "skimmer",  # Brief, fact-focused summary
+        "max_tokens": 512,
+        "include_metadata": True
+    },
+    stream=True
+)
+summary = {}
+for line in response.iter_lines():
+    if line.startswith(b'data: '):
+        event = json.loads(line[6:])
+        # First event: metadata
+        if event.get("type") == "metadata":
+            print(f"Input: {event['data']['input_type']}")
+            print(f"Style: {event['data']['style']}\n")
+        # NDJSON patch events
+        elif "delta" in event:
+            delta = event["delta"]
+            state = event["state"]
+            if delta and delta.get("op") == "set":
+                # Field set operation
+                field = delta["field"]
+                value = delta["value"]
+                summary[field] = value
+                print(f"{field}: {value}")
+            elif delta and delta.get("op") == "append":
+                # Array append operation
+                field = delta["field"]
+                value = delta["value"]
+                if field not in summary:
+                    summary[field] = []
+                summary[field].append(value)
+                print(f"+ {field}: {value}")
+            elif delta and delta.get("op") == "done":
+                # Final state
+                print(f"\n✅ Complete! Total time: {event.get('latency_ms', 0):.0f}ms")
+                print(f"Tokens used: {event.get('tokens_used', 0)}")
+                break
+```
+#### Summarization Styles
+**Executive Style** (`"executive"`):
+- Target audience: Business professionals, decision makers
+- Format: Concise, action-oriented, business impact focus
+- Example output: Strategic insights, financial implications, market trends
+**Skimmer Style** (`"skimmer"`):
+- Target audience: Busy readers wanting quick facts
+- Format: Bullet-point style, scannable, fact-dense
+- Example output: Core facts, numbers, dates, names
+**ELI5 Style** (`"eli5"`):
+- Target audience: General public, non-technical readers
+- Format: Simple explanations, analogies, relatable examples
+- Example output: What it means, why it matters, real-world impact
+#### V4 Output Schema
+All V4 responses return structured JSON with these 6 fields:
+```json
+{
+  "title": "Click-worthy title (<100 chars)",
+  "main_summary": "2-4 sentence summary (<500 chars)",
+  "key_points": [
+    "Key point 1",
+    "Key point 2",
+    "Key point 3"
+  ],
+  "category": "Technology",
+  "sentiment": "Positive",
+  "read_time_min": 5
+}
+```
 ### Android Client (SSE)
 ```kotlin
 // Android SSE client example
 curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
   -H "Content-Type: application/json" \
   -d '{"text": "Your article text here (minimum 50 characters)...", "max_tokens": 256}'
+# V4 API - Standard JSON streaming (URL mode)
+curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream" \
+  -H "Content-Type: application/json" \
+  -d '{"url": "https://example.com/article", "style": "executive", "max_tokens": 256}'
+# V4 API - NDJSON patch streaming (Text mode) - 43% faster time-to-first-token
+curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson" \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Your article text (minimum 50 chars)...", "style": "skimmer", "max_tokens": 512}'
 ```
 ### Test Script
 - Non-root user execution
 - Input validation and sanitization
+- **SSRF protection**: V3 and V4 APIs block localhost and private IP ranges
+- **Rate limiting**: Configurable per-IP rate limits for scraping endpoints
+- **URL validation**: Strict URL format checking (HTTP/HTTPS only)
+- **Content limits**: Maximum text lengths enforced (50,000 chars for V3/V4)
 - API key authentication (optional)
 ## 📈 Monitoring
 ### Common Issues
+1. **Model not loading**: Check if Ollama is running and model is pulled (V1 only)
+2. **Out of memory**:
+   - V1: Ensure 2-4GB RAM available
+   - V2/V3: Ensure ~500-550MB RAM available
+   - V4 (1.5B): Ensure 2-3GB RAM available
+   - V4 (3B): Ensure 6-7GB RAM available
 3. **Slow startup**: Normal on first run due to model download
+4. **V4 slow inference**: Enable GPU acceleration (CUDA or MPS) and FP16 for 2-4x speedup
+5. **V4 quantization slow**: Quantization takes 1-2 minutes on startup; disable warmup to defer until first request
+6. **API errors**: Check logs via `/docs` endpoint
 ### Logs
 View application logs in the Hugging Face Spaces interface or check the health endpoint for service status.
 **Successfully deployed and tested on Hugging Face Spaces!** 🚀
 - ✅ **Proxy-aware FastAPI** with `root_path` support
+- ✅ **All endpoints working** (health, docs, V1-V4 APIs)
 - ✅ **Real-time streaming** summarization
+- ✅ **Structured JSON output** with V4 API
+- ✅ **GPU acceleration support** (CUDA, MPS, CPU fallback)
 - ✅ **No 404 errors** - all paths correctly configured
 - ✅ **Test script included** for easy verification
+### API Versions Available
+- **V1**: Ollama + Transformers (requires external Ollama service)
+- **V2**: HuggingFace streaming (lightweight, ~500MB)
+- **V3**: Web scraping + Summarization (lightweight, ~550MB)
+- **V4**: Structured output with Qwen (GPU-optimized, 2-7GB depending on model)
+### Recent Features
+- Added V4 structured summarization API with Qwen models
+- NDJSON patch streaming for 43% faster time-to-first-token
+- Three summarization styles: executive, skimmer, eli5
+- GPU optimization with CUDA/MPS/CPU auto-detection
+- Automatic quantization (4-bit NF4, FP16, INT8)
+- Rich metadata output (category, sentiment, reading time)
 **Live Space:** https://colin730-SummarizerApp.hf.space 🎯