title: Text Summarizer API
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
Text Summarizer API
A FastAPI-based text summarization service with multiple backends (Ollama, HuggingFace Transformers, and Qwen models), plus web scraping and structured JSON output.
Features
- Multiple Summarization Engines: Ollama, HuggingFace Transformers, and Qwen models
- Structured JSON Output: V4 API returns rich metadata (title, key points, category, sentiment, reading time)
- Web Scraping Integration: V3 and V4 APIs can scrape articles directly from URLs
- Real-time Streaming: Every API version exposes a Server-Sent Events (SSE) streaming endpoint
- GPU Acceleration: V4 supports CUDA and MPS (Apple Silicon), with automatic quantization
- RESTful API with FastAPI
- Health monitoring and logging
- Docker containerized for easy deployment
- Free deployment on Hugging Face Spaces
API Endpoints
Health Check
GET /health
V1 API (Ollama + Transformers Pipeline)
POST /api/v1/summarize
POST /api/v1/summarize/stream
POST /api/v1/summarize/pipeline/stream
V2 API (HuggingFace Streaming)
POST /api/v2/summarize/stream
V3 API (Web Scraping + Summarization)
POST /api/v3/scrape-and-summarize/stream
V4 API (Structured Output with Qwen)
POST /api/v4/scrape-and-summarize/stream
POST /api/v4/scrape-and-summarize/stream-ndjson
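For client code, the routes above can be collected into one lookup table. This is a minimal sketch: the `ENDPOINTS` keys and the `endpoint_url` helper are hypothetical names chosen for illustration, not part of the service.

```python
# Base URL of the live Space, as given in this README.
BASE_URL = "https://colin730-SummarizerApp.hf.space"

# Route table copied from the endpoint list; the keys are illustrative only.
ENDPOINTS = {
    "health": ("GET", "/health"),
    "v1": ("POST", "/api/v1/summarize"),
    "v1_stream": ("POST", "/api/v1/summarize/stream"),
    "v1_pipeline_stream": ("POST", "/api/v1/summarize/pipeline/stream"),
    "v2_stream": ("POST", "/api/v2/summarize/stream"),
    "v3_stream": ("POST", "/api/v3/scrape-and-summarize/stream"),
    "v4_stream": ("POST", "/api/v4/scrape-and-summarize/stream"),
    "v4_ndjson": ("POST", "/api/v4/scrape-and-summarize/stream-ndjson"),
}


def endpoint_url(name: str, base_url: str = BASE_URL) -> tuple:
    """Return (HTTP method, absolute URL) for a named endpoint."""
    method, path = ENDPOINTS[name]
    return method, base_url.rstrip("/") + path
```

Passing a different `base_url` (for example a local container on port 7860) reuses the same table.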
Live Deployment
Successfully deployed and tested on Hugging Face Spaces!
- Live Space: https://colin730-SummarizerApp.hf.space
- API Documentation: https://colin730-SummarizerApp.hf.space/docs
- Health Check: https://colin730-SummarizerApp.hf.space/health
- V2 Streaming API: https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream
Quick Test
# Test the live deployment - health check
curl https://colin730-SummarizerApp.hf.space/health
# Test V2 API (lightweight streaming)
curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
-H "Content-Type: application/json" \
-d '{"text":"This is a test of the live API.","max_tokens":50}'
# Test V3 API (web scraping)
curl -X POST https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com/article","max_tokens":128}'
# Test V4 API (structured output, if enabled)
curl -X POST https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson \
-H "Content-Type: application/json" \
-d '{"text":"This is a test article. It contains important information.","style":"executive","max_tokens":256}'
Request Formats by API Version:
V1/V2 (Simple text summarization):
{
"text": "Your long text to summarize here...",
"max_tokens": 256,
"prompt": "Summarize the following text concisely:"
}
V3 (URL scraping or text):
{
"url": "https://example.com/article",
"max_tokens": 256,
"include_metadata": true,
"use_cache": true
}
V4 (Structured output with styles):
{
"url": "https://example.com/article",
"style": "executive",
"max_tokens": 512,
"include_metadata": true,
"use_cache": true
}
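The request formats above could be assembled by one helper. This is a sketch under assumptions: `build_payload` is a hypothetical function, and the checks it performs (text required for V1/V2, exactly one of `url` or `text` for V3/V4) mirror what this README describes rather than the server's actual validation.

```python
from typing import Optional

# Style names accepted by V4, as listed in this README.
V4_STYLES = {"executive", "skimmer", "eli5"}


def build_payload(version: int, *, text: Optional[str] = None,
                  url: Optional[str] = None, max_tokens: int = 256,
                  style: str = "executive") -> dict:
    """Build a JSON-serializable request body for the given API version."""
    if version in (1, 2):
        if text is None:
            raise ValueError("V1/V2 require 'text'")
        return {"text": text, "max_tokens": max_tokens}
    if version not in (3, 4):
        raise ValueError(f"unknown API version: {version}")
    # V3/V4 accept either a URL to scrape or raw text, not both.
    if (text is None) == (url is None):
        raise ValueError("provide exactly one of 'url' or 'text'")
    if url is not None:
        payload = {"url": url, "use_cache": True}
    else:
        payload = {"text": text}
    payload["max_tokens"] = max_tokens
    payload["include_metadata"] = True
    if version == 4:
        if style not in V4_STYLES:
            raise ValueError(f"unknown style: {style}")
        payload["style"] = style
    return payload
```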
Which API to Use?
- V1: Local deployment with Ollama (requires external service)
- V2: Lightweight cloud deployment, simple text summaries
- V3: When you need to scrape articles from URLs + simple summaries
- V4: When you need rich metadata (category, sentiment, key points) + GPU acceleration
API Documentation
- Swagger UI: /docs
- ReDoc: /redoc
Configuration
The service uses the following environment variables:
V1 Configuration (Ollama)
- OLLAMA_MODEL: Model to use (default: llama3.2:1b)
- OLLAMA_HOST: Ollama service host (default: http://localhost:11434)
- OLLAMA_TIMEOUT: Request timeout in seconds (default: 60)
- ENABLE_V1_WARMUP: Enable V1 warmup (default: false)
V2 Configuration (HuggingFace)
- HF_MODEL_ID: HuggingFace model ID (default: sshleifer/distilbart-cnn-6-6)
- HF_DEVICE_MAP: Device mapping (default: auto, which uses the GPU when available and falls back to CPU)
- HF_TORCH_DTYPE: Torch dtype (default: auto)
- HF_HOME: HuggingFace cache directory (default: /tmp/huggingface)
- HF_MAX_NEW_TOKENS: Max new tokens (default: 128)
- HF_TEMPERATURE: Sampling temperature (default: 0.7)
- HF_TOP_P: Nucleus sampling (default: 0.95)
- ENABLE_V2_WARMUP: Enable V2 warmup (default: true)
V3 Configuration (Web Scraping)
- ENABLE_V3_SCRAPING: Enable V3 API (default: true)
- SCRAPING_TIMEOUT: HTTP timeout for scraping (default: 10 seconds)
- SCRAPING_MAX_TEXT_LENGTH: Max text to extract (default: 50000 chars)
- SCRAPING_CACHE_ENABLED: Enable caching (default: true)
- SCRAPING_CACHE_TTL: Cache TTL (default: 3600 seconds / 1 hour)
- SCRAPING_UA_ROTATION: Enable user-agent rotation (default: true)
- SCRAPING_RATE_LIMIT_PER_MINUTE: Rate limit per IP (default: 10)
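A per-IP limit like SCRAPING_RATE_LIMIT_PER_MINUTE could be enforced with a sliding window. The sketch below is illustrative only: this README does not show the service's limiter, so the `PerIpRateLimiter` class and its behavior are assumptions.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most `limit_per_minute` hits per IP per window."""

    def __init__(self, limit_per_minute: int = 10, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.limit = limit_per_minute
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        """Record a request from `ip` and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Each IP keeps its own queue of timestamps, so one busy client does not consume another client's allowance.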
V4 Configuration (Structured Summarization)
- ENABLE_V4_STRUCTURED: Enable V4 API (default: true)
- ENABLE_V4_WARMUP: Load model at startup (default: false, to save memory)
- V4_MODEL_ID: Model to use (default: Qwen/Qwen2.5-1.5B-Instruct, alternative: Qwen/Qwen2.5-3B-Instruct)
- V4_MAX_TOKENS: Max tokens to generate (default: 256, range: 128-2048)
- V4_TEMPERATURE: Sampling temperature (default: 0.2, for consistent output)
- V4_ENABLE_QUANTIZATION: Enable INT8 quantization on CPU or 4-bit NF4 on CUDA (default: true)
- V4_USE_FP16_FOR_SPEED: Use FP16 precision for 2-3x faster inference on GPU (default: false)
Server Configuration
- SERVER_HOST: Server host (default: 127.0.0.1)
- SERVER_PORT: Server port (default: 8000)
- LOG_LEVEL: Logging level (default: INFO)
Docker Deployment
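As an illustration of how these variables and defaults fit together, here is a minimal sketch of a settings loader. The `Settings` dataclass and `load_settings` function are hypothetical; the app's real configuration code is not shown in this README, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name: str, default: bool) -> bool:
    """Interpret common truthy strings; fall back to the documented default."""
    return env.get(name, str(default)).strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class Settings:
    ollama_model: str
    ollama_host: str
    ollama_timeout: int
    hf_model_id: str
    enable_v3_scraping: bool
    enable_v4_structured: bool
    v4_model_id: str
    v4_max_tokens: int
    server_host: str
    server_port: int
    log_level: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), using the README defaults."""
    env = os.environ if env is None else env
    return Settings(
        ollama_model=env.get("OLLAMA_MODEL", "llama3.2:1b"),
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        ollama_timeout=int(env.get("OLLAMA_TIMEOUT", "60")),
        hf_model_id=env.get("HF_MODEL_ID", "sshleifer/distilbart-cnn-6-6"),
        enable_v3_scraping=_env_bool(env, "ENABLE_V3_SCRAPING", True),
        enable_v4_structured=_env_bool(env, "ENABLE_V4_STRUCTURED", True),
        v4_model_id=env.get("V4_MODEL_ID", "Qwen/Qwen2.5-1.5B-Instruct"),
        v4_max_tokens=int(env.get("V4_MAX_TOKENS", "256")),
        server_host=env.get("SERVER_HOST", "127.0.0.1"),
        server_port=int(env.get("SERVER_PORT", "8000")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a deployment profile before applying it.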
Local Development
# Build and run with docker-compose
docker-compose up --build
# Or run directly
docker build -f Dockerfile.hf -t summarizer-app .
docker run -p 7860:7860 summarizer-app
Hugging Face Spaces
This app is optimized for deployment on Hugging Face Spaces using Docker SDK.
V2-Only Deployment on HF Spaces:
- Uses t5-small model (~250MB) for fast startup
- No Ollama dependency (saves memory and disk space)
- Model downloads during warmup for instant first request
- Optimized for free tier resource limits
Environment Variables for HF Spaces:
For memory-constrained deployments (free tier):
ENABLE_V1_WARMUP=false
ENABLE_V2_WARMUP=false
ENABLE_V3_SCRAPING=true
ENABLE_V4_STRUCTURED=false
HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
HF_HOME=/tmp/huggingface
For GPU-enabled deployments (paid tier with 16GB+ RAM):
ENABLE_V1_WARMUP=false
ENABLE_V2_WARMUP=false
ENABLE_V3_SCRAPING=true
ENABLE_V4_STRUCTURED=true
ENABLE_V4_WARMUP=false
V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct
V4_ENABLE_QUANTIZATION=true
V4_USE_FP16_FOR_SPEED=true
Performance
V1 (Ollama + Transformers Pipeline)
- V1 Models: llama3.2:1b (Ollama) + distilbart-cnn-6-6 (Transformers)
- Memory usage: ~2-4GB RAM (when V1 warmup enabled)
- Inference speed: ~2-5 seconds per request
- Startup time: ~30-60 seconds (when V1 warmup enabled)
V2 (HuggingFace Streaming) - Primary on HF Spaces
- V2 Model: sshleifer/distilbart-cnn-6-6 (~300MB download)
- Memory usage: ~500MB RAM (when V2 warmup enabled)
- Inference speed: Real-time token streaming
- Startup time: ~30-60 seconds (includes model download when V2 warmup enabled)
V3 (Web Scraping + Summarization)
- Dependencies: trafilatura, httpx, lxml (lightweight, no JavaScript rendering)
- Memory usage: ~550MB RAM (V2 + scraping: +10-50MB)
- Scraping speed: 200-500ms typical, <10ms on cache hit
- Total latency: 2-5 seconds (scrape + summarize)
- Success rate: 95%+ article extraction
V4 (Structured Summarization with Qwen)
- V4 Models: Qwen/Qwen2.5-1.5B-Instruct (default) or Qwen/Qwen2.5-3B-Instruct (higher quality)
- Memory usage:
  - 1.5B model: ~2-3GB RAM (FP16 on GPU), ~1GB (4-bit NF4 on CUDA)
  - 3B model: ~6-7GB RAM (FP16 on GPU), ~3-4GB (4-bit NF4 on CUDA)
- Inference speed:
  - 1.5B model: 20-46 seconds per request
  - 3B model: 40-60 seconds per request
  - NDJSON streaming: 43% faster time-to-first-token
- GPU acceleration: CUDA > MPS (Apple Silicon) > CPU (4x speed difference)
- Output format: Structured JSON with 6 fields (title, main_summary, key_points, category, sentiment, read_time_min)
- Styles: executive, skimmer, eli5
Memory Optimization
- V1 warmup disabled by default (ENABLE_V1_WARMUP=false)
- V2 warmup disabled by default (ENABLE_V2_WARMUP=false)
- V4 warmup disabled by default (ENABLE_V4_WARMUP=false), saving 2-7GB RAM
- HuggingFace Spaces deployment options:
  - V2-only: ~500MB (fits free tier)
  - V2+V3: ~550MB (fits free tier)
  - V2+V3+V4 (1.5B): ~3GB (requires paid tier)
  - V2+V3+V4 (3B): ~7GB (requires paid tier)
- Local development: All versions can run simultaneously with 8-10GB RAM
- GPU deployment: V4 benefits significantly from CUDA or MPS acceleration
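The deployment options above amount to a small lookup from profile to approximate RAM. The sketch below picks the most capable profile that fits a given budget. The profile names and the `largest_profile_that_fits` helper are hypothetical, and the figures are the rough estimates from this README, with no headroom added.

```python
# Approximate RAM per deployment option in MB, copied from the list above.
# Ordered from smallest to largest footprint.
PROFILE_RAM_MB = {
    "v2": 500,
    "v2+v3": 550,
    "v2+v3+v4-1.5b": 3000,
    "v2+v3+v4-3b": 7000,
}


def largest_profile_that_fits(available_mb: int):
    """Return the most capable profile whose estimate fits, or None if none do."""
    best = None
    for name, needed_mb in PROFILE_RAM_MB.items():
        if needed_mb <= available_mb:
            best = name
    return best
```

In practice you would leave headroom for the OS and request spikes rather than sizing to the exact estimate.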
Development
Setup
# Install dependencies
pip install -r requirements.txt
# Run locally
uvicorn app.main:app --host 0.0.0.0 --port 7860
Testing
# Run tests
pytest
# Run with coverage
pytest --cov=app
Usage Examples
V1 API (Ollama)
import requests
import json

# V1 streaming summarization
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream",
    json={
        "text": "Your long article or text here...",
        "max_tokens": 256
    },
    stream=True
)
for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        print(data["content"], end="")
        if data["done"]:
            break
V2 API (HuggingFace Streaming) - Recommended
import requests
import json

# V2 streaming summarization (same request format as V1)
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream",
    json={
        "text": "Your long article or text here...",
        "max_tokens": 128  # V2 uses max_new_tokens
    },
    stream=True
)
for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        print(data["content"], end="")
        if data["done"]:
            break
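The V1 and V2 loops share the same line handling, which can be factored into a small parser. This is a sketch assuming the event shape used in the examples above (a `data: ` prefix followed by JSON with `content` and `done` keys); `parse_sse_line` and `collect_content` are hypothetical helper names.

```python
import json
from typing import Iterable, Optional

SSE_PREFIX = b"data: "


def parse_sse_line(line: bytes) -> Optional[dict]:
    """Decode one 'data: {...}' line; return None for blank or non-data lines."""
    if not line.startswith(SSE_PREFIX):
        return None
    try:
        return json.loads(line[len(SSE_PREFIX):])
    except ValueError:  # malformed JSON or undecodable bytes
        return None


def collect_content(lines: Iterable[bytes]) -> str:
    """Join the 'content' of each event until an event reports done."""
    parts = []
    for line in lines:
        event = parse_sse_line(line)
        if event is None:
            continue
        parts.append(event.get("content", ""))
        if event.get("done"):
            break
    return "".join(parts)
```

With `requests`, `collect_content(response.iter_lines())` would return the full summary once the stream finishes.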
V3 API (Web Scraping + Summarization) - Android App Primary Use Case
V3 supports two modes: URL scraping or direct text summarization
Mode 1: URL Scraping (recommended for articles)
import requests
import json

# V3 scrape article from URL and stream summarization
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
    json={
        "url": "https://example.com/article",
        "max_tokens": 256,
        "include_metadata": True,  # Get article title, author, etc.
        "use_cache": True  # Use cached content if available
    },
    stream=True
)
for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        # First event: metadata
        if data.get("type") == "metadata":
            print(f"Input type: {data['data']['input_type']}")  # 'url'
            print(f"Title: {data['data']['title']}")
            print(f"Author: {data['data']['author']}")
            print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")
        # Content events
        elif "content" in data:
            print(data["content"], end="")
            if data.get("done"):
                print(f"\n\nTotal time: {data['latency_ms']}ms")
                break
Mode 2: Direct Text Summarization (fallback when scraping fails)
import requests
import json

# V3 direct text summarization (no scraping)
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
    json={
        "text": "Your article text here... (minimum 50 characters)",
        "max_tokens": 256,
        "include_metadata": True
    },
    stream=True
)
for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        # First event: metadata
        if data.get("type") == "metadata":
            print(f"Input type: {data['data']['input_type']}")  # 'text'
            print(f"Text length: {data['data']['text_length']} chars\n")
        # Content events
        elif "content" in data:
            print(data["content"], end="")
            if data.get("done"):
                break
Note: Provide either url OR text, not both. Text mode is useful as a fallback when:
- Article is behind a paywall
- Website blocks scrapers
- User has already extracted the text manually
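The fallback described in the note can be sketched as a wrapper that tries URL mode first and retries in text mode. Everything here is illustrative: `send` is a hypothetical callable that posts a payload to the V3 endpoint and returns the summary, raising an exception on failure, and the 50-character minimum is taken from the text-mode example.

```python
from typing import Callable, Optional

MIN_TEXT_CHARS = 50  # minimum text length noted in the text-mode example


def summarize_with_fallback(send: Callable[[dict], str], url: str,
                            fallback_text: Optional[str] = None,
                            max_tokens: int = 256) -> str:
    """Try URL mode; if it fails and usable text is available, retry in text mode."""
    try:
        return send({"url": url, "max_tokens": max_tokens})
    except Exception:
        # Without long-enough fallback text there is nothing to retry with.
        if fallback_text is None or len(fallback_text) < MIN_TEXT_CHARS:
            raise
        return send({"text": fallback_text, "max_tokens": max_tokens})
```

Keeping the transport behind `send` means the same logic works with `requests`, `httpx`, or a test double.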
V4 API (Structured Output with Qwen) - High-Quality Summaries
V4 supports two streaming formats and three summarization styles
Streaming Format 1: Standard JSON Streaming (stream)
import requests
import json

# V4 scrape article from URL and stream structured JSON
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream",
    json={
        "url": "https://example.com/article",
        "style": "executive",  # Options: "executive", "skimmer", "eli5"
        "max_tokens": 256,
        "include_metadata": True,
        "use_cache": True
    },
    stream=True
)
accumulated_content = ""  # Collects streamed JSON tokens for final parsing
for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        # First event: metadata
        if data.get("type") == "metadata":
            print(f"Style: {data['data']['style']}")
            print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")
        # Content events (streaming JSON tokens)
        elif "content" in data:
            accumulated_content += data["content"]
            print(data["content"], end="")
            if data.get("done"):
                # Parse final JSON
                summary = json.loads(accumulated_content)
                print(f"\n\nTitle: {summary['title']}")
                print(f"Category: {summary['category']}")
                print(f"Sentiment: {summary['sentiment']}")
                print(f"Key Points: {summary['key_points']}")
                break
Streaming Format 2: NDJSON Patch Streaming (stream-ndjson) - 43% Faster
import requests
import json

# V4 NDJSON streaming - progressive JSON updates for real-time UI
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson",
    json={
        "text": "Your article text here (minimum 50 characters)...",
        "style": "skimmer",  # Brief, fact-focused summary
        "max_tokens": 512,
        "include_metadata": True
    },
    stream=True
)
summary = {}
for line in response.iter_lines():
    if line.startswith(b'data: '):
        event = json.loads(line[6:])
        # First event: metadata
        if event.get("type") == "metadata":
            print(f"Input: {event['data']['input_type']}")
            print(f"Style: {event['data']['style']}\n")
        # NDJSON patch events
        elif "delta" in event:
            delta = event["delta"]
            state = event["state"]
            if delta and delta.get("op") == "set":
                # Field set operation
                field = delta["field"]
                value = delta["value"]
                summary[field] = value
                print(f"{field}: {value}")
            elif delta and delta.get("op") == "append":
                # Array append operation
                field = delta["field"]
                value = delta["value"]
                if field not in summary:
                    summary[field] = []
                summary[field].append(value)
                print(f"+ {field}: {value}")
            elif delta and delta.get("op") == "done":
                # Final state
                print(f"\nComplete! Total time: {event.get('latency_ms', 0):.0f}ms")
                print(f"Tokens used: {event.get('tokens_used', 0)}")
                break
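The patch handling in the loop above can be isolated into a pure function, which makes it easy to unit test without a live stream. This sketch assumes only the three operations shown in the example (`set`, `append`, `done`); `apply_patch` is a hypothetical helper name.

```python
def apply_patch(summary: dict, delta: dict) -> bool:
    """Apply one delta to `summary` in place; return True when the stream is done."""
    op = delta.get("op")
    if op == "set":
        # Replace the whole field value.
        summary[delta["field"]] = delta["value"]
    elif op == "append":
        # Create the list on first use, then add the new item.
        summary.setdefault(delta["field"], []).append(delta["value"])
    elif op == "done":
        return True
    return False
```

A UI can call this for each event and re-render from `summary` after every patch.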
Summarization Styles
Executive Style ("executive"):
- Target audience: Business professionals, decision makers
- Format: Concise, action-oriented, business impact focus
- Example output: Strategic insights, financial implications, market trends
Skimmer Style ("skimmer"):
- Target audience: Busy readers wanting quick facts
- Format: Bullet-point style, scannable, fact-dense
- Example output: Core facts, numbers, dates, names
ELI5 Style ("eli5"):
- Target audience: General public, non-technical readers
- Format: Simple explanations, analogies, relatable examples
- Example output: What it means, why it matters, real-world impact
V4 Output Schema
All V4 responses return structured JSON with these 6 fields:
{
"title": "Click-worthy title (<100 chars)",
"main_summary": "2-4 sentence summary (<500 chars)",
"key_points": [
"Key point 1",
"Key point 2",
"Key point 3"
],
"category": "Technology",
"sentiment": "Positive",
"read_time_min": 5
}
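A client may want to check that a parsed V4 result has this shape before using it. The following is a minimal sketch based only on the example schema above; `validate_summary` is a hypothetical helper, and the type and length checks are assumptions drawn from the field descriptions.

```python
# Expected top-level fields and their types, taken from the schema example.
EXPECTED_FIELDS = {
    "title": str,
    "main_summary": str,
    "key_points": list,
    "category": str,
    "sentiment": str,
    "read_time_min": int,
}

# Length limits stated in the example ("<100 chars", "<500 chars").
MAX_LENGTHS = {"title": 100, "main_summary": 500}


def validate_summary(obj: dict) -> list:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in obj:
            problems.append(f"missing field: {field}")
        elif not isinstance(obj[field], expected_type):
            problems.append(f"wrong type for {field}")
        elif field in MAX_LENGTHS and len(obj[field]) >= MAX_LENGTHS[field]:
            problems.append(f"{field} too long")
    return problems
```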
Android Client (SSE)
// Android SSE client example (written against the OkHttp 3.x API)
import com.google.gson.Gson
import java.io.IOException
import okhttp3.*

val client = OkHttpClient()
val request = Request.Builder()
    .url("https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream")
    .post(RequestBody.create(
        MediaType.parse("application/json"),
        """{"text": "Your text...", "max_tokens": 128}"""
    ))
    .build()
client.newCall(request).enqueue(object : Callback {
    override fun onFailure(call: Call, e: IOException) {
        // Report the network error to the UI
    }
    override fun onResponse(call: Call, response: Response) {
        val source = response.body()?.source()
        source?.use { bufferedSource ->
            while (true) {
                // readUtf8Line() returns null once the stream ends
                val line = bufferedSource.readUtf8Line() ?: break
                if (line.startsWith("data: ")) {
                    val json = line.substring(6)
                    val data = Gson().fromJson(json, Map::class.java)
                    // Update UI with data["content"]
                    if (data["done"] == true) break
                }
            }
        }
    }
})
cURL Examples
# Test live deployment
curl https://colin730-SummarizerApp.hf.space/health
# V1 API (if Ollama is available)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream" \
-H "Content-Type: application/json" \
-d '{"text": "Your text...", "max_tokens": 256}'
# V2 API (HuggingFace streaming - recommended)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream" \
-H "Content-Type: application/json" \
-d '{"text": "Your text...", "max_tokens": 128}'
# V3 API - URL mode (web scraping + summarization)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article", "max_tokens": 256, "include_metadata": true}'
# V3 API - Text mode (direct summarization, no scraping)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
-H "Content-Type: application/json" \
-d '{"text": "Your article text here (minimum 50 characters)...", "max_tokens": 256}'
# V4 API - Standard JSON streaming (URL mode)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article", "style": "executive", "max_tokens": 256}'
# V4 API - NDJSON patch streaming (Text mode) - 43% faster time-to-first-token
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson" \
-H "Content-Type: application/json" \
-d '{"text": "Your article text (minimum 50 chars)...", "style": "skimmer", "max_tokens": 512}'
Test Script
# Use the included test script
./scripts/test_endpoints.sh https://colin730-SummarizerApp.hf.space
Security
- Non-root user execution
- Input validation and sanitization
- SSRF protection: V3 and V4 APIs block localhost and private IP ranges
- Rate limiting: Configurable per-IP rate limits for scraping endpoints
- URL validation: Strict URL format checking (HTTP/HTTPS only)
- Content limits: Maximum text lengths enforced (50,000 chars for V3/V4)
- API key authentication (optional)
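To illustrate the URL validation and SSRF checks listed above, here is a minimal sketch using only the standard library. It is not the service's implementation: `is_url_allowed` is a hypothetical function, and it only inspects literal IP addresses and the name `localhost`. A production check would also resolve hostnames and test every resolved address.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url: str) -> bool:
    """Reject non-HTTP(S) URLs, localhost, and private or loopback IP literals."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname  # lowercased, with IPv6 brackets removed
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. This sketch lets hostnames through;
        # a real check would resolve them and re-test each address.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_unspecified)
```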
Monitoring
The service includes:
- Health check endpoint
- Request logging
- Error tracking
- Performance metrics
Troubleshooting
Common Issues
- Model not loading: Check if Ollama is running and model is pulled (V1 only)
- Out of memory:
  - V1: Ensure 2-4GB RAM available
  - V2/V3: Ensure ~500-550MB RAM available
  - V4 (1.5B): Ensure 2-3GB RAM available
  - V4 (3B): Ensure 6-7GB RAM available
- Slow startup: Normal on first run due to model download
- V4 slow inference: Enable GPU acceleration (CUDA or MPS) and FP16 for 2-4x speedup
- V4 quantization slow: Quantization takes 1-2 minutes on startup; disable warmup to defer until first request
- API errors: Check logs via the /docs endpoint
Logs
View application logs in the Hugging Face Spaces interface or check the health endpoint for service status.
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
Deployment Status
Successfully deployed and tested on Hugging Face Spaces!
- Proxy-aware FastAPI with root_path support
- All endpoints working (health, docs, V1-V4 APIs)
- Real-time streaming summarization
- Structured JSON output with V4 API
- GPU acceleration support (CUDA, MPS, CPU fallback)
- No 404 errors - all paths correctly configured
- Test script included for easy verification
API Versions Available
- V1: Ollama + Transformers (requires external Ollama service)
- V2: HuggingFace streaming (lightweight, ~500MB)
- V3: Web scraping + Summarization (lightweight, ~550MB)
- V4: Structured output with Qwen (GPU-optimized, 2-7GB depending on model)
Recent Features
- Added V4 structured summarization API with Qwen models
- NDJSON patch streaming for 43% faster time-to-first-token
- Three summarization styles: executive, skimmer, eli5
- GPU optimization with CUDA/MPS/CPU auto-detection
- Automatic quantization (4-bit NF4, FP16, INT8)
- Rich metadata output (category, sentiment, reading time)
Live Space: https://colin730-SummarizerApp.hf.space