| title: Text Summarizer API | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| pinned: false | |
| license: mit | |
| app_port: 7860 | |
| # Text Summarizer API | |
| A FastAPI-based text summarization service with multiple summarization backends (Ollama, HuggingFace Transformers, and Qwen models), plus web scraping and structured JSON output. | |
| ## Features | |
| - **Multiple Summarization Engines**: Ollama, HuggingFace Transformers, and Qwen models | |
| - **Structured JSON Output**: V4 API returns rich metadata (title, key points, category, sentiment, reading time) | |
| - **Web Scraping Integration**: V3 and V4 APIs can scrape articles directly from URLs | |
| - **Real-time Streaming**: Every API version (V1-V4) exposes a Server-Sent Events (SSE) streaming endpoint | |
| - **GPU Acceleration**: V4 supports CUDA and MPS (Apple Silicon), with automatic quantization | |
| - **RESTful API** with FastAPI | |
| - **Health monitoring** and logging | |
| - **Docker containerized** for easy deployment | |
| - **Free deployment** on Hugging Face Spaces | |
| ## API Endpoints | |
| ### Health Check | |
| ``` | |
| GET /health | |
| ``` | |
| ### V1 API (Ollama + Transformers Pipeline) | |
| ``` | |
| POST /api/v1/summarize | |
| POST /api/v1/summarize/stream | |
| POST /api/v1/summarize/pipeline/stream | |
| ``` | |
| ### V2 API (HuggingFace Streaming) | |
| ``` | |
| POST /api/v2/summarize/stream | |
| ``` | |
| ### V3 API (Web Scraping + Summarization) | |
| ``` | |
| POST /api/v3/scrape-and-summarize/stream | |
| ``` | |
| ### V4 API (Structured Output with Qwen) | |
| ``` | |
| POST /api/v4/scrape-and-summarize/stream | |
| POST /api/v4/scrape-and-summarize/stream-ndjson | |
| ``` | |
| ## Live Deployment | |
| **Successfully deployed and tested on Hugging Face Spaces!** | |
| - **Live Space:** https://colin730-SummarizerApp.hf.space | |
| - **API Documentation:** https://colin730-SummarizerApp.hf.space/docs | |
| - **Health Check:** https://colin730-SummarizerApp.hf.space/health | |
| - **V2 Streaming API:** https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream | |
| ### Quick Test | |
| ```bash | |
| # Test the live deployment - health check | |
| curl https://colin730-SummarizerApp.hf.space/health | |
| # Test V2 API (lightweight streaming) | |
| curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text":"This is a test of the live API.","max_tokens":50}' | |
| # Test V3 API (web scraping) | |
| curl -X POST https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"url":"https://example.com/article","max_tokens":128}' | |
| # Test V4 API (structured output, if enabled) | |
| curl -X POST https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text":"This is a test article. It contains important information.","style":"executive","max_tokens":256}' | |
| ``` | |
| **Request Formats by API Version:** | |
| V1/V2 (Simple text summarization): | |
| ```json | |
| { | |
|   "text": "Your long text to summarize here...", | |
|   "max_tokens": 256, | |
|   "prompt": "Summarize the following text concisely:" | |
| } | |
| ``` | |
| V3 (URL scraping or text): | |
| ```json | |
| { | |
|   "url": "https://example.com/article", | |
|   "max_tokens": 256, | |
|   "include_metadata": true, | |
|   "use_cache": true | |
| } | |
| ``` | |
| V4 (Structured output with styles): | |
| ```json | |
| { | |
|   "url": "https://example.com/article", | |
|   "style": "executive", | |
|   "max_tokens": 512, | |
|   "include_metadata": true, | |
|   "use_cache": true | |
| } | |
| ``` | |
| **Which API to Use?** | |
| - **V1**: Local deployment with Ollama (requires external service) | |
| - **V2**: Lightweight cloud deployment, simple text summaries | |
| - **V3**: When you need to scrape articles from URLs + simple summaries | |
| - **V4**: When you need rich metadata (category, sentiment, key points) + GPU acceleration | |
| ### API Documentation | |
| - **Swagger UI**: `/docs` | |
| - **ReDoc**: `/redoc` | |
| ## Configuration | |
| The service uses the following environment variables: | |
| ### V1 Configuration (Ollama) | |
| - `OLLAMA_MODEL`: Model to use (default: `llama3.2:1b`) | |
| - `OLLAMA_HOST`: Ollama service host (default: `http://localhost:11434`) | |
| - `OLLAMA_TIMEOUT`: Request timeout in seconds (default: `60`) | |
| - `ENABLE_V1_WARMUP`: Enable V1 warmup (default: `false`) | |
| ### V2 Configuration (HuggingFace) | |
| - `HF_MODEL_ID`: HuggingFace model ID (default: `sshleifer/distilbart-cnn-6-6`) | |
| - `HF_DEVICE_MAP`: Device mapping (default: `auto`, which uses a GPU when available and falls back to CPU) | |
| - `HF_TORCH_DTYPE`: Torch dtype (default: `auto`) | |
| - `HF_HOME`: HuggingFace cache directory (default: `/tmp/huggingface`) | |
| - `HF_MAX_NEW_TOKENS`: Max new tokens (default: `128`) | |
| - `HF_TEMPERATURE`: Sampling temperature (default: `0.7`) | |
| - `HF_TOP_P`: Nucleus sampling (default: `0.95`) | |
| - `ENABLE_V2_WARMUP`: Enable V2 warmup (default: `true`) | |
| ### V3 Configuration (Web Scraping) | |
| - `ENABLE_V3_SCRAPING`: Enable V3 API (default: `true`) | |
| - `SCRAPING_TIMEOUT`: HTTP timeout for scraping (default: `10` seconds) | |
| - `SCRAPING_MAX_TEXT_LENGTH`: Max text to extract (default: `50000` chars) | |
| - `SCRAPING_CACHE_ENABLED`: Enable caching (default: `true`) | |
| - `SCRAPING_CACHE_TTL`: Cache TTL (default: `3600` seconds / 1 hour) | |
| - `SCRAPING_UA_ROTATION`: Enable user-agent rotation (default: `true`) | |
| - `SCRAPING_RATE_LIMIT_PER_MINUTE`: Rate limit per IP (default: `10`) | |
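The per-IP limit set by `SCRAPING_RATE_LIMIT_PER_MINUTE` can be pictured as a sliding window over recent request timestamps. The following is a minimal sketch of that idea, not the service's actual limiter; the class name `SlidingWindowLimiter` and its injectable `clock` parameter are assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key.

    Hypothetical helper for illustration; the real service may limit differently.
    """

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With the documented default of `10`, an eleventh call to `allow()` from the same IP within one minute would return `False`, while other IPs are unaffected.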
| ### V4 Configuration (Structured Summarization) | |
| - `ENABLE_V4_STRUCTURED`: Enable V4 API (default: `true`) | |
| - `ENABLE_V4_WARMUP`: Load model at startup (default: `false` to save memory) | |
| - `V4_MODEL_ID`: Model to use (default: `Qwen/Qwen2.5-1.5B-Instruct`, alternative: `Qwen/Qwen2.5-3B-Instruct`) | |
| - `V4_MAX_TOKENS`: Max tokens to generate (default: `256`, range: 128-2048) | |
| - `V4_TEMPERATURE`: Sampling temperature (default: `0.2` for consistent output) | |
| - `V4_ENABLE_QUANTIZATION`: Enable INT8 quantization on CPU or 4-bit NF4 on CUDA (default: `true`) | |
| - `V4_USE_FP16_FOR_SPEED`: Use FP16 precision for 2-3x faster inference on GPU (default: `false`) | |
| ### Server Configuration | |
| - `SERVER_HOST`: Server host (default: `127.0.0.1`) | |
| - `SERVER_PORT`: Server port (default: `8000`) | |
| - `LOG_LEVEL`: Logging level (default: `INFO`) | |
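To show how these variables and defaults fit together, here is a small sketch that reads a subset of them from the environment. It is not the application's real settings module: the function name `load_settings`, the returned dictionary keys, and the set of strings treated as "true" are all assumptions.

```python
import os


def _as_bool(value):
    # Assumption: these strings count as "enabled"; the app may parse booleans differently.
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Read a subset of the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "ollama_model": env.get("OLLAMA_MODEL", "llama3.2:1b"),
        "ollama_timeout": int(env.get("OLLAMA_TIMEOUT", "60")),
        "hf_model_id": env.get("HF_MODEL_ID", "sshleifer/distilbart-cnn-6-6"),
        "hf_max_new_tokens": int(env.get("HF_MAX_NEW_TOKENS", "128")),
        "hf_temperature": float(env.get("HF_TEMPERATURE", "0.7")),
        "enable_v3_scraping": _as_bool(env.get("ENABLE_V3_SCRAPING", "true")),
        "scraping_cache_ttl": int(env.get("SCRAPING_CACHE_TTL", "3600")),
        "enable_v4_structured": _as_bool(env.get("ENABLE_V4_STRUCTURED", "true")),
        "v4_model_id": env.get("V4_MODEL_ID", "Qwen/Qwen2.5-1.5B-Instruct"),
        "server_port": int(env.get("SERVER_PORT", "8000")),
    }
```

Calling `load_settings()` with no arguments reads `os.environ`; passing a plain dictionary makes it easy to preview what a given set of overrides would produce.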
| ## Docker Deployment | |
| ### Local Development | |
| ```bash | |
| # Build and run with docker-compose | |
| docker-compose up --build | |
| # Or run directly | |
| docker build -f Dockerfile.hf -t summarizer-app . | |
| docker run -p 7860:7860 summarizer-app | |
| ``` | |
| ### Hugging Face Spaces | |
| This app is optimized for deployment on Hugging Face Spaces using the Docker SDK. | |
| **V2-Only Deployment on HF Spaces:** | |
| - Uses the `sshleifer/distilbart-cnn-6-6` model (~300MB download) for fast startup | |
| - No Ollama dependency (saves memory and disk space) | |
| - With `ENABLE_V2_WARMUP=true`, the model downloads at startup so the first request is served without a loading delay | |
| - Optimized for free tier resource limits | |
| **Environment Variables for HF Spaces:** | |
| For memory-constrained deployments (free tier): | |
| ```bash | |
| ENABLE_V1_WARMUP=false | |
| ENABLE_V2_WARMUP=false | |
| ENABLE_V3_SCRAPING=true | |
| ENABLE_V4_STRUCTURED=false | |
| HF_MODEL_ID=sshleifer/distilbart-cnn-6-6 | |
| HF_HOME=/tmp/huggingface | |
| ``` | |
| For GPU-enabled deployments (paid tier with 16GB+ RAM): | |
| ```bash | |
| ENABLE_V1_WARMUP=false | |
| ENABLE_V2_WARMUP=false | |
| ENABLE_V3_SCRAPING=true | |
| ENABLE_V4_STRUCTURED=true | |
| ENABLE_V4_WARMUP=false | |
| V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct | |
| V4_ENABLE_QUANTIZATION=true | |
| V4_USE_FP16_FOR_SPEED=true | |
| ``` | |
| ## Performance | |
| ### V1 (Ollama + Transformers Pipeline) | |
| - **V1 Models**: llama3.2:1b (Ollama) + distilbart-cnn-6-6 (Transformers) | |
| - **Memory usage**: ~2-4GB RAM (when V1 warmup enabled) | |
| - **Inference speed**: ~2-5 seconds per request | |
| - **Startup time**: ~30-60 seconds (when V1 warmup enabled) | |
| ### V2 (HuggingFace Streaming) - Primary on HF Spaces | |
| - **V2 Model**: sshleifer/distilbart-cnn-6-6 (~300MB download) | |
| - **Memory usage**: ~500MB RAM (when V2 warmup enabled) | |
| - **Inference speed**: Real-time token streaming | |
| - **Startup time**: ~30-60 seconds (includes model download when V2 warmup enabled) | |
| ### V3 (Web Scraping + Summarization) | |
| - **Dependencies**: trafilatura, httpx, lxml (lightweight, no JavaScript rendering) | |
| - **Memory usage**: ~550MB RAM (V2 + scraping: +10-50MB) | |
| - **Scraping speed**: 200-500ms typical, <10ms on cache hit | |
| - **Total latency**: 2-5 seconds (scrape + summarize) | |
| - **Success rate**: 95%+ article extraction | |
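The gap between a 200-500ms scrape and a sub-10ms cache hit comes from reusing previously extracted text. A minimal sketch of a time-based cache follows, assuming an in-memory dictionary keyed by URL and the documented `SCRAPING_CACHE_TTL` default of 3600 seconds; the class name `TTLCache` is hypothetical and the service's real cache may be implemented differently.

```python
import time


class TTLCache:
    """Keep scraped text per URL and expire it after `ttl` seconds (illustrative only)."""

    def __init__(self, ttl=3600.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}  # url -> (stored_at, text)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, text = entry
        if self.clock() - stored_at >= self.ttl:
            # Entry is stale: evict it so the caller scrapes again.
            del self._entries[url]
            return None
        return text

    def put(self, url, text):
        self._entries[url] = (self.clock(), text)
```

A request with `"use_cache": true` would correspond to trying `get()` first and only scraping, then calling `put()`, on a miss.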
| ### V4 (Structured Summarization with Qwen) | |
| - **V4 Models**: Qwen/Qwen2.5-1.5B-Instruct (default) or Qwen/Qwen2.5-3B-Instruct (higher quality) | |
| - **Memory usage**: | |
| - 1.5B model: ~2-3GB RAM (FP16 on GPU), ~1GB (4-bit NF4 on CUDA) | |
| - 3B model: ~6-7GB RAM (FP16 on GPU), ~3-4GB (4-bit NF4 on CUDA) | |
| - **Inference speed**: | |
| - 1.5B model: 20-46 seconds per request | |
| - 3B model: 40-60 seconds per request | |
| - NDJSON streaming: 43% faster time-to-first-token | |
| - **GPU acceleration**: CUDA > MPS (Apple Silicon) > CPU (4x speed difference) | |
| - **Output format**: Structured JSON with 6 fields (title, main_summary, key_points, category, sentiment, read_time_min) | |
| - **Styles**: executive, skimmer, eli5 | |
| ### Memory Optimization | |
| - **V1 warmup disabled by default** (`ENABLE_V1_WARMUP=false`) | |
| - **V2 warmup can be disabled** (`ENABLE_V2_WARMUP=false`) to defer model loading until the first request | |
| - **V4 warmup disabled by default** (`ENABLE_V4_WARMUP=false`) - Saves 2-7GB RAM | |
| - **HuggingFace Spaces deployment options**: | |
| - V2-only: ~500MB (fits free tier) | |
| - V2+V3: ~550MB (fits free tier) | |
| - V2+V3+V4 (1.5B): ~3GB (requires paid tier) | |
| - V2+V3+V4 (3B): ~7GB (requires paid tier) | |
| - **Local development**: All versions can run simultaneously with 8-10GB RAM | |
| - **GPU deployment**: V4 benefits significantly from CUDA or MPS acceleration | |
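The deployment options above amount to a simple lookup: pick the richest combination whose estimated footprint fits the available RAM. The sketch below encodes only the approximate figures from that list; the function name `combos_that_fit` and the optional `headroom_gb` parameter are assumptions for illustration, and real usage will vary with model, quantization, and request load.

```python
# Approximate footprints in GB, copied from the deployment options listed above.
FOOTPRINT_GB = {
    "V2-only": 0.5,
    "V2+V3": 0.55,
    "V2+V3+V4 (1.5B)": 3.0,
    "V2+V3+V4 (3B)": 7.0,
}


def combos_that_fit(available_gb, headroom_gb=0.0):
    """Return the combinations whose estimated footprint fits in the given RAM."""
    budget = available_gb - headroom_gb
    return [name for name, gb in FOOTPRINT_GB.items() if gb <= budget]
```

For example, `combos_that_fit(2.0)` leaves out both V4 combinations, which is a signal to set `ENABLE_V4_STRUCTURED=false` on a machine of that size.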
| ## Development | |
| ### Setup | |
| ```bash | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| # Run locally | |
| uvicorn app.main:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| ### Testing | |
| ```bash | |
| # Run tests | |
| pytest | |
| # Run with coverage | |
| pytest --cov=app | |
| ``` | |
| ## Usage Examples | |
| ### V1 API (Ollama) | |
| ```python | |
| import requests | |
| import json | |
| # V1 streaming summarization | |
| response = requests.post( | |
|     "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream", | |
|     json={ | |
|         "text": "Your long article or text here...", | |
|         "max_tokens": 256 | |
|     }, | |
|     stream=True | |
| ) | |
| for line in response.iter_lines(): | |
|     if line.startswith(b'data: '): | |
|         data = json.loads(line[6:]) | |
|         print(data["content"], end="") | |
|         if data["done"]: | |
|             break | |
| ``` | |
| ### V2 API (HuggingFace Streaming) - Recommended | |
| ```python | |
| import requests | |
| import json | |
| # V2 streaming summarization (same request format as V1) | |
| response = requests.post( | |
|     "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream", | |
|     json={ | |
|         "text": "Your long article or text here...", | |
|         "max_tokens": 128  # V2 uses max_new_tokens | |
|     }, | |
|     stream=True | |
| ) | |
| for line in response.iter_lines(): | |
|     if line.startswith(b'data: '): | |
|         data = json.loads(line[6:]) | |
|         print(data["content"], end="") | |
|         if data["done"]: | |
|             break | |
| ``` | |
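The V1 and V2 examples share the same loop: keep lines that start with `data: `, decode the JSON payload, and stop when `done` is true. That loop can be factored into a small reusable helper. This is a sketch under the assumption that every event carries the `content` and `done` fields shown above; the helper names `iter_sse_events` and `collect_summary` are hypothetical.

```python
import json


def iter_sse_events(lines):
    """Yield decoded JSON payloads from an iterable of raw SSE lines (bytes)."""
    prefix = b"data: "
    for line in lines:
        if not line.startswith(prefix):
            continue  # skip blank keep-alive lines and non-data fields
        try:
            yield json.loads(line[len(prefix):])
        except json.JSONDecodeError:
            continue  # ignore malformed payloads rather than abort the stream


def collect_summary(lines):
    """Concatenate `content` chunks until an event with a true `done` flag arrives."""
    parts = []
    for event in iter_sse_events(lines):
        parts.append(event.get("content", ""))
        if event.get("done"):
            break
    return "".join(parts)
```

With a streaming `requests` response, the call would look like `collect_summary(response.iter_lines())`. Using `.get()` instead of indexing avoids a `KeyError` if an event omits one of the fields.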
| ### V3 API (Web Scraping + Summarization) - Android App Primary Use Case | |
| **V3 supports two modes: URL scraping or direct text summarization** | |
| #### Mode 1: URL Scraping (recommended for articles) | |
| ```python | |
| import requests | |
| import json | |
| # V3 scrape article from URL and stream summarization | |
| response = requests.post( | |
|     "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream", | |
|     json={ | |
|         "url": "https://example.com/article", | |
|         "max_tokens": 256, | |
|         "include_metadata": True,  # Get article title, author, etc. | |
|         "use_cache": True  # Use cached content if available | |
|     }, | |
|     stream=True | |
| ) | |
| for line in response.iter_lines(): | |
|     if line.startswith(b'data: '): | |
|         data = json.loads(line[6:]) | |
|         # First event: metadata | |
|         if data.get("type") == "metadata": | |
|             print(f"Input type: {data['data']['input_type']}")  # 'url' | |
|             print(f"Title: {data['data']['title']}") | |
|             print(f"Author: {data['data']['author']}") | |
|             print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n") | |
|         # Content events | |
|         elif "content" in data: | |
|             print(data["content"], end="") | |
|             if data["done"]: | |
|                 print(f"\n\nTotal time: {data['latency_ms']}ms") | |
|                 break | |
| ``` | |
| #### Mode 2: Direct Text Summarization (fallback when scraping fails) | |
| ```python | |
| import requests | |
| import json | |
| # V3 direct text summarization (no scraping) | |
| response = requests.post( | |
|     "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream", | |
|     json={ | |
|         "text": "Your article text here... (minimum 50 characters)", | |
|         "max_tokens": 256, | |
|         "include_metadata": True | |
|     }, | |
|     stream=True | |
| ) | |
| for line in response.iter_lines(): | |
|     if line.startswith(b'data: '): | |
|         data = json.loads(line[6:]) | |
|         # First event: metadata | |
|         if data.get("type") == "metadata": | |
|             print(f"Input type: {data['data']['input_type']}")  # 'text' | |
|             print(f"Text length: {data['data']['text_length']} chars\n") | |
|         # Content events | |
|         elif "content" in data: | |
|             print(data["content"], end="") | |
|             if data["done"]: | |
|                 break | |
| ``` | |
| **Note:** Provide either `url` OR `text`, not both. Text mode is useful as a fallback when: | |
| - Article is behind a paywall | |
| - Website blocks scrapers | |
| - User has already extracted the text manually | |
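The "either `url` or `text`" rule is easy to enforce on the client before a request is sent. The sketch below builds a V3 request body and rejects invalid combinations early. The function name `build_v3_payload` is hypothetical, the 50-character minimum is taken from the example placeholders above, and the server's own validation may be stricter or report errors differently.

```python
def build_v3_payload(url=None, text=None, max_tokens=256,
                     include_metadata=True, use_cache=True):
    """Build a V3 request body, enforcing exactly one of `url` or `text`."""
    if (url is None) == (text is None):
        raise ValueError("provide exactly one of url or text")
    payload = {"max_tokens": max_tokens, "include_metadata": include_metadata}
    if url is not None:
        if not url.startswith(("http://", "https://")):
            raise ValueError("url must use http or https")
        payload["url"] = url
        payload["use_cache"] = use_cache  # caching only applies to scraped content
    else:
        if len(text) < 50:
            raise ValueError("text must be at least 50 characters")
        payload["text"] = text
    return payload
```

A client could try URL mode first and, if the scrape fails, call `build_v3_payload(text=...)` with manually extracted text as the fallback described above.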
| ### V4 API (Structured Output with Qwen) - High-Quality Summaries | |
| **V4 supports two streaming formats and three summarization styles** | |
| #### Streaming Format 1: Standard JSON Streaming (stream) | |
| ```python | |
| import requests | |
| import json | |
| # V4 scrape article from URL and stream structured JSON | |
| response = requests.post( | |
|     "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream", | |
|     json={ | |
|         "url": "https://example.com/article", | |
|         "style": "executive",  # Options: "executive", "skimmer", "eli5" | |
|         "max_tokens": 256, | |
|         "include_metadata": True, | |
|         "use_cache": True | |
|     }, | |
|     stream=True | |
| ) | |
| accumulated_content = ""  # Collects streamed JSON tokens for parsing at the end | |
| for line in response.iter_lines(): | |
|     if line.startswith(b'data: '): | |
|         data = json.loads(line[6:]) | |
|         # First event: metadata | |
|         if data.get("type") == "metadata": | |
|             print(f"Style: {data['data']['style']}") | |
|             print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n") | |
|         # Content events (streaming JSON tokens) | |
|         elif "content" in data: | |
|             accumulated_content += data["content"] | |
|             print(data["content"], end="") | |
|             if data["done"]: | |
|                 # Parse final JSON | |
|                 summary = json.loads(accumulated_content) | |
|                 print(f"\n\nTitle: {summary['title']}") | |
|                 print(f"Category: {summary['category']}") | |
|                 print(f"Sentiment: {summary['sentiment']}") | |
|                 print(f"Key Points: {summary['key_points']}") | |
|                 break | |
| ``` | |
| #### Streaming Format 2: NDJSON Patch Streaming (stream-ndjson) - 43% Faster | |
| ```python | |
| import requests | |
| import json | |
| # V4 NDJSON streaming - progressive JSON updates for real-time UI | |
| response = requests.post( | |
|     "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson", | |
|     json={ | |
|         "text": "Your article text here (minimum 50 characters)...", | |
|         "style": "skimmer",  # Brief, fact-focused summary | |
|         "max_tokens": 512, | |
|         "include_metadata": True | |
|     }, | |
|     stream=True | |
| ) | |
| summary = {} | |
| for line in response.iter_lines(): | |
|     if line.startswith(b'data: '): | |
|         event = json.loads(line[6:]) | |
|         # First event: metadata | |
|         if event.get("type") == "metadata": | |
|             print(f"Input: {event['data']['input_type']}") | |
|             print(f"Style: {event['data']['style']}\n") | |
|         # NDJSON patch events | |
|         elif "delta" in event: | |
|             delta = event["delta"] | |
|             state = event["state"] | |
|             if delta and delta.get("op") == "set": | |
|                 # Field set operation | |
|                 field = delta["field"] | |
|                 value = delta["value"] | |
|                 summary[field] = value | |
|                 print(f"{field}: {value}") | |
|             elif delta and delta.get("op") == "append": | |
|                 # Array append operation | |
|                 field = delta["field"] | |
|                 value = delta["value"] | |
|                 if field not in summary: | |
|                     summary[field] = [] | |
|                 summary[field].append(value) | |
|                 print(f"+ {field}: {value}") | |
|             elif delta and delta.get("op") == "done": | |
|                 # Final state | |
|                 print(f"\nComplete! Total time: {event.get('latency_ms', 0):.0f}ms") | |
|                 print(f"Tokens used: {event.get('tokens_used', 0)}") | |
|                 break | |
| ``` | |
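The patch handling in the NDJSON example can be isolated into one pure function, which makes it easier to reuse in a UI layer or to test offline. This is a sketch that assumes only the three operations shown above (`set`, `append`, `done`) and the `op`/`field`/`value` keys; the function name `apply_delta` is hypothetical.

```python
def apply_delta(summary, delta):
    """Apply one patch to `summary` in place; return True once the stream is done."""
    if not delta:
        return False  # events may carry an empty or null delta
    op = delta.get("op")
    if op == "set":
        summary[delta["field"]] = delta["value"]
    elif op == "append":
        # Create the list on first use, then add the new item.
        summary.setdefault(delta["field"], []).append(delta["value"])
    elif op == "done":
        return True
    return False  # unknown operations are ignored
```

Inside the streaming loop, `if apply_delta(summary, event["delta"]): break` would replace the chain of `elif` branches while leaving the partially built `summary` usable at every step.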
| #### Summarization Styles | |
| **Executive Style** (`"executive"`): | |
| - Target audience: Business professionals, decision makers | |
| - Format: Concise, action-oriented, business impact focus | |
| - Example output: Strategic insights, financial implications, market trends | |
| **Skimmer Style** (`"skimmer"`): | |
| - Target audience: Busy readers wanting quick facts | |
| - Format: Bullet-point style, scannable, fact-dense | |
| - Example output: Core facts, numbers, dates, names | |
| **ELI5 Style** (`"eli5"`): | |
| - Target audience: General public, non-technical readers | |
| - Format: Simple explanations, analogies, relatable examples | |
| - Example output: What it means, why it matters, real-world impact | |
| #### V4 Output Schema | |
| All V4 responses return structured JSON with these 6 fields: | |
| ```json | |
| { | |
|   "title": "Click-worthy title (<100 chars)", | |
|   "main_summary": "2-4 sentence summary (<500 chars)", | |
|   "key_points": [ | |
|     "Key point 1", | |
|     "Key point 2", | |
|     "Key point 3" | |
|   ], | |
|   "category": "Technology", | |
|   "sentiment": "Positive", | |
|   "read_time_min": 5 | |
| } | |
| ``` | |
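Because the model generates this JSON, a client may want to check a response against the schema before using it. The following sketch mirrors the six fields above in a dataclass and applies the length hints from the placeholder text (title under 100 characters, summary under 500). The names `StructuredSummary` and `parse_summary` are hypothetical, and treating the length hints as hard limits is an assumption.

```python
import json
from dataclasses import dataclass, fields
from typing import List


@dataclass
class StructuredSummary:
    title: str
    main_summary: str
    key_points: List[str]
    category: str
    sentiment: str
    read_time_min: int


def parse_summary(raw_json):
    """Parse a V4 response body and check it against the six-field schema."""
    data = json.loads(raw_json)
    names = [f.name for f in fields(StructuredSummary)]
    missing = [name for name in names if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    summary = StructuredSummary(**{name: data[name] for name in names})
    # Assumption: the "<100" and "<500" hints are enforced as strict limits.
    if len(summary.title) >= 100:
        raise ValueError("title should be under 100 characters")
    if len(summary.main_summary) >= 500:
        raise ValueError("main_summary should be under 500 characters")
    return summary
```

Extra keys in the response are ignored, so the check keeps working if the API later adds fields.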
| ### Android Client (SSE) | |
| ```kotlin | |
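| // Android SSE client example (written against OkHttp 4.x and Gson) | |
| import com.google.gson.Gson | |
| import okhttp3.Call | |
| import okhttp3.Callback | |
| import okhttp3.MediaType.Companion.toMediaType | |
| import okhttp3.OkHttpClient | |
| import okhttp3.Request | |
| import okhttp3.RequestBody.Companion.toRequestBody | |
| import okhttp3.Response | |
| import java.io.IOException | |
| val client = OkHttpClient() | |
| val body = """{"text": "Your text...", "max_tokens": 128}""" | |
|     .toRequestBody("application/json".toMediaType()) | |
| val request = Request.Builder() | |
|     .url("https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream") | |
|     .post(body) | |
|     .build() | |
| client.newCall(request).enqueue(object : Callback { | |
|     override fun onFailure(call: Call, e: IOException) { | |
|         // Report network errors to the UI or logs | |
|     } | |
|     override fun onResponse(call: Call, response: Response) { | |
|         val source = response.body?.source() | |
|         source?.use { bufferedSource -> | |
|             while (true) { | |
|                 // readUtf8Line() returns null at end of stream, which ends the loop | |
|                 val line = bufferedSource.readUtf8Line() ?: break | |
|                 if (line.startsWith("data: ")) { | |
|                     val json = line.substring(6) | |
|                     val data = Gson().fromJson(json, Map::class.java) | |
|                     // Update UI with data["content"] | |
|                     if (data["done"] == true) break | |
|                 } | |
|             } | |
|         } | |
|     } | |
| }) | |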
| ``` | |
| ### cURL Examples | |
| ```bash | |
| # Test live deployment | |
| curl https://colin730-SummarizerApp.hf.space/health | |
| # V1 API (if Ollama is available) | |
| curl -X POST "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text": "Your text...", "max_tokens": 256}' | |
| # V2 API (HuggingFace streaming - recommended) | |
| curl -X POST "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text": "Your text...", "max_tokens": 128}' | |
| # V3 API - URL mode (web scraping + summarization) | |
| curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"url": "https://example.com/article", "max_tokens": 256, "include_metadata": true}' | |
| # V3 API - Text mode (direct summarization, no scraping) | |
| curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text": "Your article text here (minimum 50 characters)...", "max_tokens": 256}' | |
| # V4 API - Standard JSON streaming (URL mode) | |
| curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"url": "https://example.com/article", "style": "executive", "max_tokens": 256}' | |
| # V4 API - NDJSON patch streaming (Text mode) - 43% faster time-to-first-token | |
| curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text": "Your article text (minimum 50 chars)...", "style": "skimmer", "max_tokens": 512}' | |
| ``` | |
| ### Test Script | |
| ```bash | |
| # Use the included test script | |
| ./scripts/test_endpoints.sh https://colin730-SummarizerApp.hf.space | |
| ``` | |
| ## Security | |
| - Non-root user execution | |
| - Input validation and sanitization | |
| - **SSRF protection**: V3 and V4 APIs block localhost and private IP ranges | |
| - **Rate limiting**: Configurable per-IP rate limits for scraping endpoints | |
| - **URL validation**: Strict URL format checking (HTTP/HTTPS only) | |
| - **Content limits**: Maximum text lengths enforced (50,000 chars for V3/V4) | |
| - API key authentication (optional) | |
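To make the URL validation and SSRF bullets concrete, here is a minimal sketch of such a check using only the standard library. It is not the service's actual implementation: the function name `is_url_allowed` is hypothetical, and it only inspects literal IP addresses and the name `localhost`. A production check would also resolve hostnames and test every resolved address, since a public-looking name can point at a private IP.

```python
import ipaddress
from urllib.parse import urlsplit


def is_url_allowed(url):
    """Return False for non-HTTP(S) URLs, localhost, and private or loopback IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host or host.lower() == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: hostnames pass here; resolve them in real code.
        return True
    return not (addr.is_private or addr.is_loopback
                or addr.is_link_local or addr.is_reserved)
```

Running the check before any outbound request means a blocked URL never reaches the scraper.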
| ## Monitoring | |
| The service includes: | |
| - Health check endpoint | |
| - Request logging | |
| - Error tracking | |
| - Performance metrics | |
| ## Troubleshooting | |
| ### Common Issues | |
| 1. **Model not loading**: Check if Ollama is running and model is pulled (V1 only) | |
| 2. **Out of memory**: | |
| - V1: Ensure 2-4GB RAM available | |
| - V2/V3: Ensure ~500-550MB RAM available | |
| - V4 (1.5B): Ensure 2-3GB RAM available | |
| - V4 (3B): Ensure 6-7GB RAM available | |
| 3. **Slow startup**: Normal on first run due to model download | |
| 4. **V4 slow inference**: Enable GPU acceleration (CUDA or MPS) and FP16 for 2-4x speedup | |
| 5. **V4 quantization slow**: Quantization takes 1-2 minutes on startup; disable warmup to defer until first request | |
| 6. **API errors**: Check the request schema in the Swagger UI at `/docs`, then review the application logs | |
| ### Logs | |
| View application logs in the Hugging Face Spaces interface or check the health endpoint for service status. | |
| ## License | |
| MIT License - see LICENSE file for details. | |
| ## Contributing | |
| 1. Fork the repository | |
| 2. Create a feature branch | |
| 3. Make your changes | |
| 4. Add tests | |
| 5. Submit a pull request | |
| --- | |
| ## Deployment Status | |
| **Successfully deployed and tested on Hugging Face Spaces!** | |
| - **Proxy-aware FastAPI** with `root_path` support | |
| - **All endpoints working** (health, docs, V1-V4 APIs) | |
| - **Real-time streaming** summarization | |
| - **Structured JSON output** with V4 API | |
| - **GPU acceleration support** (CUDA, MPS, CPU fallback) | |
| - **No 404 errors** - all paths correctly configured | |
| - **Test script included** for easy verification | |
| ### API Versions Available | |
| - **V1**: Ollama + Transformers (requires external Ollama service) | |
| - **V2**: HuggingFace streaming (lightweight, ~500MB) | |
| - **V3**: Web scraping + Summarization (lightweight, ~550MB) | |
| - **V4**: Structured output with Qwen (GPU-optimized, 2-7GB depending on model) | |
| ### Recent Features | |
| - Added V4 structured summarization API with Qwen models | |
| - NDJSON patch streaming for 43% faster time-to-first-token | |
| - Three summarization styles: executive, skimmer, eli5 | |
| - GPU optimization with CUDA/MPS/CPU auto-detection | |
| - Automatic quantization (4-bit NF4, FP16, INT8) | |
| - Rich metadata output (category, sentiment, reading time) | |
| **Live Space:** https://colin730-SummarizerApp.hf.space | |