Spaces:

colin730
/

SummarizerApp

Running

App Files Files Community

SummarizerApp / README.md

ming

docs: Update README.md with comprehensive V4 API documentation

8ca69c7 about 2 months ago

preview code

raw

history blame contribute delete

22.8 kB

	---
	title: Text Summarizer API
	emoji: 📝
	colorFrom: blue
	colorTo: purple
	sdk: docker
	pinned: false
	license: mit
	app_port: 7860
	---

	# Text Summarizer API

	A FastAPI-based text summarization service with multiple summarization engines: Ollama, HuggingFace Transformers, Web Scraping, and Structured Output with Qwen models.

	## 🚀 Features

	- Multiple Summarization Engines: Ollama, HuggingFace Transformers, and Qwen models
	- Structured JSON Output: V4 API returns rich metadata (title, key points, category, sentiment, reading time)
	- Web Scraping Integration: V3 and V4 APIs can scrape articles directly from URLs
	- Real-time Streaming: All endpoints support Server-Sent Events (SSE) streaming
	- GPU Acceleration: V4 supports CUDA, MPS (Apple Silicon), with automatic quantization
	- RESTful API with FastAPI
	- Health monitoring and logging
	- Docker containerized for easy deployment
	- Free deployment on Hugging Face Spaces

	## 📡 API Endpoints

	### Health Check
	```
	GET /health
	```

	### V1 API (Ollama + Transformers Pipeline)
	```
	POST /api/v1/summarize
	POST /api/v1/summarize/stream
	POST /api/v1/summarize/pipeline/stream
	```

	### V2 API (HuggingFace Streaming)
	```
	POST /api/v2/summarize/stream
	```

	### V3 API (Web Scraping + Summarization)
	```
	POST /api/v3/scrape-and-summarize/stream
	```

	### V4 API (Structured Output with Qwen)
	```
	POST /api/v4/scrape-and-summarize/stream
	POST /api/v4/scrape-and-summarize/stream-ndjson
	```

	## 🌐 Live Deployment

	✅ Successfully deployed and tested on Hugging Face Spaces!

	- Live Space: https://colin730-SummarizerApp.hf.space
	- API Documentation: https://colin730-SummarizerApp.hf.space/docs
	- Health Check: https://colin730-SummarizerApp.hf.space/health
	- V2 Streaming API: https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream

	### Quick Test
	```bash
	# Test the live deployment - health check
	curl https://colin730-SummarizerApp.hf.space/health

	# Test V2 API (lightweight streaming)
	curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
	-H "Content-Type: application/json" \
	-d '{"text":"This is a test of the live API.","max_tokens":50}'

	# Test V3 API (web scraping)
	curl -X POST https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream \
	-H "Content-Type: application/json" \
	-d '{"url":"https://example.com/article","max_tokens":128}'

	# Test V4 API (structured output, if enabled)
	curl -X POST https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson \
	-H "Content-Type: application/json" \
	-d '{"text":"This is a test article. It contains important information.","style":"executive","max_tokens":256}'
	```

	Request Formats by API Version:

	V1/V2 (Simple text summarization):
	```json
	{
	"text": "Your long text to summarize here...",
	"max_tokens": 256,
	"prompt": "Summarize the following text concisely:"
	}
	```

	V3 (URL scraping or text):
	```json
	{
	"url": "https://example.com/article",
	"max_tokens": 256,
	"include_metadata": true,
	"use_cache": true
	}
	```

	V4 (Structured output with styles):
	```json
	{
	"url": "https://example.com/article",
	"style": "executive",
	"max_tokens": 512,
	"include_metadata": true,
	"use_cache": true
	}
	```

	Which API to Use?
	- V1: Local deployment with Ollama (requires external service)
	- V2: Lightweight cloud deployment, simple text summaries
	- V3: When you need to scrape articles from URLs + simple summaries
	- V4: When you need rich metadata (category, sentiment, key points) + GPU acceleration

	### API Documentation
	- Swagger UI: `/docs`
	- ReDoc: `/redoc`

	## 🔧 Configuration

	The service uses the following environment variables:

	### V1 Configuration (Ollama)
	- `OLLAMA_MODEL`: Model to use (default: `llama3.2:1b`)
	- `OLLAMA_HOST`: Ollama service host (default: `http://localhost:11434`)
	- `OLLAMA_TIMEOUT`: Request timeout in seconds (default: `60`)
	- `ENABLE_V1_WARMUP`: Enable V1 warmup (default: `false`)

	### V2 Configuration (HuggingFace)
	- `HF_MODEL_ID`: HuggingFace model ID (default: `sshleifer/distilbart-cnn-6-6`)
	- `HF_DEVICE_MAP`: Device mapping (default: `auto` for GPU fallback to CPU)
	- `HF_TORCH_DTYPE`: Torch dtype (default: `auto`)
	- `HF_HOME`: HuggingFace cache directory (default: `/tmp/huggingface`)
	- `HF_MAX_NEW_TOKENS`: Max new tokens (default: `128`)
	- `HF_TEMPERATURE`: Sampling temperature (default: `0.7`)
	- `HF_TOP_P`: Nucleus sampling (default: `0.95`)
	- `ENABLE_V2_WARMUP`: Enable V2 warmup (default: `true`)

	### V3 Configuration (Web Scraping)
	- `ENABLE_V3_SCRAPING`: Enable V3 API (default: `true`)
	- `SCRAPING_TIMEOUT`: HTTP timeout for scraping (default: `10` seconds)
	- `SCRAPING_MAX_TEXT_LENGTH`: Max text to extract (default: `50000` chars)
	- `SCRAPING_CACHE_ENABLED`: Enable caching (default: `true`)
	- `SCRAPING_CACHE_TTL`: Cache TTL (default: `3600` seconds / 1 hour)
	- `SCRAPING_UA_ROTATION`: Enable user-agent rotation (default: `true`)
	- `SCRAPING_RATE_LIMIT_PER_MINUTE`: Rate limit per IP (default: `10`)

	### V4 Configuration (Structured Summarization)
	- `ENABLE_V4_STRUCTURED`: Enable V4 API (default: `true`)
	- `ENABLE_V4_WARMUP`: Load model at startup (default: `false` to save memory)
	- `V4_MODEL_ID`: Model to use (default: `Qwen/Qwen2.5-1.5B-Instruct`, alternative: `Qwen/Qwen2.5-3B-Instruct`)
	- `V4_MAX_TOKENS`: Max tokens to generate (default: `256`, range: 128-2048)
	- `V4_TEMPERATURE`: Sampling temperature (default: `0.2` for consistent output)
	- `V4_ENABLE_QUANTIZATION`: Enable INT8 quantization on CPU or 4-bit NF4 on CUDA (default: `true`)
	- `V4_USE_FP16_FOR_SPEED`: Use FP16 precision for 2-3x faster inference on GPU (default: `false`)

	### Server Configuration
	- `SERVER_HOST`: Server host (default: `127.0.0.1`)
	- `SERVER_PORT`: Server port (default: `8000`)
	- `LOG_LEVEL`: Logging level (default: `INFO`)

	## 🐳 Docker Deployment

	### Local Development
	```bash
	# Build and run with docker-compose
	docker-compose up --build

	# Or run directly
	docker build -f Dockerfile.hf -t summarizer-app .
	docker run -p 7860:7860 summarizer-app
	```

	### Hugging Face Spaces
	This app is optimized for deployment on Hugging Face Spaces using Docker SDK.

	V2-Only Deployment on HF Spaces:
	- Uses `t5-small` model (~250MB) for fast startup
	- No Ollama dependency (saves memory and disk space)
	- Model downloads during warmup for instant first request
	- Optimized for free tier resource limits

	Environment Variables for HF Spaces:

	For memory-constrained deployments (free tier):
	```bash
	ENABLE_V1_WARMUP=false
	ENABLE_V2_WARMUP=false
	ENABLE_V3_SCRAPING=true
	ENABLE_V4_STRUCTURED=false
	HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
	HF_HOME=/tmp/huggingface
	```

	For GPU-enabled deployments (paid tier with 16GB+ RAM):
	```bash
	ENABLE_V1_WARMUP=false
	ENABLE_V2_WARMUP=false
	ENABLE_V3_SCRAPING=true
	ENABLE_V4_STRUCTURED=true
	ENABLE_V4_WARMUP=false
	V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct
	V4_ENABLE_QUANTIZATION=true
	V4_USE_FP16_FOR_SPEED=true
	```

	## 📊 Performance

	### V1 (Ollama + Transformers Pipeline)
	- V1 Models: llama3.2:1b (Ollama) + distilbart-cnn-6-6 (Transformers)
	- Memory usage: ~2-4GB RAM (when V1 warmup enabled)
	- Inference speed: ~2-5 seconds per request
	- Startup time: ~30-60 seconds (when V1 warmup enabled)

	### V2 (HuggingFace Streaming) - Primary on HF Spaces
	- V2 Model: sshleifer/distilbart-cnn-6-6 (~300MB download)
	- Memory usage: ~500MB RAM (when V2 warmup enabled)
	- Inference speed: Real-time token streaming
	- Startup time: ~30-60 seconds (includes model download when V2 warmup enabled)

	### V3 (Web Scraping + Summarization)
	- Dependencies: trafilatura, httpx, lxml (lightweight, no JavaScript rendering)
	- Memory usage: ~550MB RAM (V2 + scraping: +10-50MB)
	- Scraping speed: 200-500ms typical, <10ms on cache hit
	- Total latency: 2-5 seconds (scrape + summarize)
	- Success rate: 95%+ article extraction

	### V4 (Structured Summarization with Qwen)
	- V4 Models: Qwen/Qwen2.5-1.5B-Instruct (default) or Qwen/Qwen2.5-3B-Instruct (higher quality)
	- Memory usage:
	- 1.5B model: ~2-3GB RAM (FP16 on GPU), ~1GB (4-bit NF4 on CUDA)
	- 3B model: ~6-7GB RAM (FP16 on GPU), ~3-4GB (4-bit NF4 on CUDA)
	- Inference speed:
	- 1.5B model: 20-46 seconds per request
	- 3B model: 40-60 seconds per request
	- NDJSON streaming: 43% faster time-to-first-token
	- GPU acceleration: CUDA > MPS (Apple Silicon) > CPU (4x speed difference)
	- Output format: Structured JSON with 6 fields (title, summary, key_points, category, sentiment, read_time_min)
	- Styles: executive, skimmer, eli5

	### Memory Optimization
	- V1 warmup disabled by default (`ENABLE_V1_WARMUP=false`)
	- V2 warmup disabled by default (`ENABLE_V2_WARMUP=false`)
	- V4 warmup disabled by default (`ENABLE_V4_WARMUP=false`) - Saves 2-7GB RAM
	- HuggingFace Spaces deployment options:
	- V2-only: ~500MB (fits free tier)
	- V2+V3: ~550MB (fits free tier)
	- V2+V3+V4 (1.5B): ~3GB (requires paid tier)
	- V2+V3+V4 (3B): ~7GB (requires paid tier)
	- Local development: All versions can run simultaneously with 8-10GB RAM
	- GPU deployment: V4 benefits significantly from CUDA or MPS acceleration

	## 🛠️ Development

	### Setup
	```bash
	# Install dependencies
	pip install -r requirements.txt

	# Run locally
	uvicorn app.main:app --host 0.0.0.0 --port 7860
	```

	### Testing
	```bash
	# Run tests
	pytest

	# Run with coverage
	pytest --cov=app
	```

	## 📝 Usage Examples

	### V1 API (Ollama)
	```python
	import requests
	import json

	# V1 streaming summarization
	response = requests.post(
	"https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream",
	json={
	"text": "Your long article or text here...",
	"max_tokens": 256
	},
	stream=True
	)

	for line in response.iter_lines():
	if line.startswith(b'data: '):
	data = json.loads(line[6:])
	print(data["content"], end="")
	if data["done"]:
	break
	```

	### V2 API (HuggingFace Streaming) - Recommended
	```python
	import requests
	import json

	# V2 streaming summarization (same request format as V1)
	response = requests.post(
	"https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream",
	json={
	"text": "Your long article or text here...",
	"max_tokens": 128 # V2 uses max_new_tokens
	},
	stream=True
	)

	for line in response.iter_lines():
	if line.startswith(b'data: '):
	data = json.loads(line[6:])
	print(data["content"], end="")
	if data["done"]:
	break
	```

	### V3 API (Web Scraping + Summarization) - Android App Primary Use Case

	V3 supports two modes: URL scraping or direct text summarization

	#### Mode 1: URL Scraping (recommended for articles)
	```python
	import requests
	import json

	# V3 scrape article from URL and stream summarization
	response = requests.post(
	"https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
	json={
	"url": "https://example.com/article",
	"max_tokens": 256,
	"include_metadata": True, # Get article title, author, etc.
	"use_cache": True # Use cached content if available
	},
	stream=True
	)

	for line in response.iter_lines():
	if line.startswith(b'data: '):
	data = json.loads(line[6:])

	# First event: metadata
	if data.get("type") == "metadata":
	print(f"Input type: {data['data']['input_type']}") # 'url'
	print(f"Title: {data['data']['title']}")
	print(f"Author: {data['data']['author']}")
	print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")

	# Content events
	elif "content" in data:
	print(data["content"], end="")
	if data["done"]:
	print(f"\n\nTotal time: {data['latency_ms']}ms")
	break
	```

	#### Mode 2: Direct Text Summarization (fallback when scraping fails)
	```python
	import requests
	import json

	# V3 direct text summarization (no scraping)
	response = requests.post(
	"https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
	json={
	"text": "Your article text here... (minimum 50 characters)",
	"max_tokens": 256,
	"include_metadata": True
	},
	stream=True
	)

	for line in response.iter_lines():
	if line.startswith(b'data: '):
	data = json.loads(line[6:])

	# First event: metadata
	if data.get("type") == "metadata":
	print(f"Input type: {data['data']['input_type']}") # 'text'
	print(f"Text length: {data['data']['text_length']} chars\n")

	# Content events
	elif "content" in data:
	print(data["content"], end="")
	if data["done"]:
	break
	```

	Note: Provide either `url` OR `text`, not both. Text mode is useful as a fallback when:
	- Article is behind a paywall
	- Website blocks scrapers
	- User has already extracted the text manually

	### V4 API (Structured Output with Qwen) - High-Quality Summaries

	V4 supports two streaming formats and three summarization styles

	#### Streaming Format 1: Standard JSON Streaming (stream)
	```python
	import requests
	import json

	# V4 scrape article from URL and stream structured JSON
	response = requests.post(
	"https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream",
	json={
	"url": "https://example.com/article",
	"style": "executive", # Options: "executive", "skimmer", "eli5"
	"max_tokens": 256,
	"include_metadata": True,
	"use_cache": True
	},
	stream=True
	)

	for line in response.iter_lines():
	if line.startswith(b'data: '):
	data = json.loads(line[6:])

	# First event: metadata
	if data.get("type") == "metadata":
	print(f"Style: {data['data']['style']}")
	print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")

	# Content events (streaming JSON tokens)
	elif "content" in data:
	print(data["content"], end="")
	if data["done"]:
	# Parse final JSON
	summary = json.loads(accumulated_content)
	print(f"\n\nTitle: {summary['title']}")
	print(f"Category: {summary['category']}")
	print(f"Sentiment: {summary['sentiment']}")
	print(f"Key Points: {summary['key_points']}")
	break
	```

	#### Streaming Format 2: NDJSON Patch Streaming (stream-ndjson) - 43% Faster
	```python
	import requests
	import json

	# V4 NDJSON streaming - progressive JSON updates for real-time UI
	response = requests.post(
	"https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson",
	json={
	"text": "Your article text here (minimum 50 characters)...",
	"style": "skimmer", # Brief, fact-focused summary
	"max_tokens": 512,
	"include_metadata": True
	},
	stream=True
	)

	summary = {}

	for line in response.iter_lines():
	if line.startswith(b'data: '):
	event = json.loads(line[6:])

	# First event: metadata
	if event.get("type") == "metadata":
	print(f"Input: {event['data']['input_type']}")
	print(f"Style: {event['data']['style']}\n")

	# NDJSON patch events
	elif "delta" in event:
	delta = event["delta"]
	state = event["state"]

	if delta and delta.get("op") == "set":
	# Field set operation
	field = delta["field"]
	value = delta["value"]
	summary[field] = value
	print(f"{field}: {value}")

	elif delta and delta.get("op") == "append":
	# Array append operation
	field = delta["field"]
	value = delta["value"]
	if field not in summary:
	summary[field] = []
	summary[field].append(value)
	print(f"+ {field}: {value}")

	elif delta and delta.get("op") == "done":
	# Final state
	print(f"\n✅ Complete! Total time: {event.get('latency_ms', 0):.0f}ms")
	print(f"Tokens used: {event.get('tokens_used', 0)}")
	break
	```

	#### Summarization Styles

	Executive Style (`"executive"`):
	- Target audience: Business professionals, decision makers
	- Format: Concise, action-oriented, business impact focus
	- Example output: Strategic insights, financial implications, market trends

	Skimmer Style (`"skimmer"`):
	- Target audience: Busy readers wanting quick facts
	- Format: Bullet-point style, scannable, fact-dense
	- Example output: Core facts, numbers, dates, names

	ELI5 Style (`"eli5"`):
	- Target audience: General public, non-technical readers
	- Format: Simple explanations, analogies, relatable examples
	- Example output: What it means, why it matters, real-world impact

	#### V4 Output Schema

	All V4 responses return structured JSON with these 6 fields:

	```json
	{
	"title": "Click-worthy title (<100 chars)",
	"main_summary": "2-4 sentence summary (<500 chars)",
	"key_points": [
	"Key point 1",
	"Key point 2",
	"Key point 3"
	],
	"category": "Technology",
	"sentiment": "Positive",
	"read_time_min": 5
	}
	```

	### Android Client (SSE)
	```kotlin
	// Android SSE client example
	val client = OkHttpClient()
	val request = Request.Builder()
	.url("https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream")
	.post(RequestBody.create(
	MediaType.parse("application/json"),
	"""{"text": "Your text...", "max_tokens": 128}"""
	))
	.build()

	client.newCall(request).enqueue(object : Callback {
	override fun onResponse(call: Call, response: Response) {
	val source = response.body()?.source()
	source?.use { bufferedSource ->
	while (true) {
	val line = bufferedSource.readUtf8Line()
	if (line?.startsWith("data: ") == true) {
	val json = line.substring(6)
	val data = Gson().fromJson(json, Map::class.java)
	// Update UI with data["content"]
	if (data["done"] == true) break
	}
	}
	}
	}
	})
	```

	### cURL Examples
	```bash
	# Test live deployment
	curl https://colin730-SummarizerApp.hf.space/health

	# V1 API (if Ollama is available)
	curl -X POST "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream" \
	-H "Content-Type: application/json" \
	-d '{"text": "Your text...", "max_tokens": 256}'

	# V2 API (HuggingFace streaming - recommended)
	curl -X POST "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream" \
	-H "Content-Type: application/json" \
	-d '{"text": "Your text...", "max_tokens": 128}'

	# V3 API - URL mode (web scraping + summarization)
	curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
	-H "Content-Type: application/json" \
	-d '{"url": "https://example.com/article", "max_tokens": 256, "include_metadata": true}'

	# V3 API - Text mode (direct summarization, no scraping)
	curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
	-H "Content-Type: application/json" \
	-d '{"text": "Your article text here (minimum 50 characters)...", "max_tokens": 256}'

	# V4 API - Standard JSON streaming (URL mode)
	curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream" \
	-H "Content-Type: application/json" \
	-d '{"url": "https://example.com/article", "style": "executive", "max_tokens": 256}'

	# V4 API - NDJSON patch streaming (Text mode) - 43% faster time-to-first-token
	curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson" \
	-H "Content-Type: application/json" \
	-d '{"text": "Your article text (minimum 50 chars)...", "style": "skimmer", "max_tokens": 512}'
	```

	### Test Script
	```bash
	# Use the included test script
	./scripts/test_endpoints.sh https://colin730-SummarizerApp.hf.space
	```

	## 🔒 Security

	- Non-root user execution
	- Input validation and sanitization
	- SSRF protection: V3 and V4 APIs block localhost and private IP ranges
	- Rate limiting: Configurable per-IP rate limits for scraping endpoints
	- URL validation: Strict URL format checking (HTTP/HTTPS only)
	- Content limits: Maximum text lengths enforced (50,000 chars for V3/V4)
	- API key authentication (optional)

	## 📈 Monitoring

	The service includes:
	- Health check endpoint
	- Request logging
	- Error tracking
	- Performance metrics

	## 🆘 Troubleshooting

	### Common Issues

	1. Model not loading: Check if Ollama is running and model is pulled (V1 only)
	2. Out of memory:
	- V1: Ensure 2-4GB RAM available
	- V2/V3: Ensure ~500-550MB RAM available
	- V4 (1.5B): Ensure 2-3GB RAM available
	- V4 (3B): Ensure 6-7GB RAM available
	3. Slow startup: Normal on first run due to model download
	4. V4 slow inference: Enable GPU acceleration (CUDA or MPS) and FP16 for 2-4x speedup
	5. V4 quantization slow: Quantization takes 1-2 minutes on startup; disable warmup to defer until first request
	6. API errors: Check logs via `/docs` endpoint

	### Logs
	View application logs in the Hugging Face Spaces interface or check the health endpoint for service status.

	## 📄 License

	MIT License - see LICENSE file for details.

	## 🤝 Contributing

	1. Fork the repository
	2. Create a feature branch
	3. Make your changes
	4. Add tests
	5. Submit a pull request

	---

	## ✅ Deployment Status

	Successfully deployed and tested on Hugging Face Spaces! 🚀

	- ✅ Proxy-aware FastAPI with `root_path` support
	- ✅ All endpoints working (health, docs, V1-V4 APIs)
	- ✅ Real-time streaming summarization
	- ✅ Structured JSON output with V4 API
	- ✅ GPU acceleration support (CUDA, MPS, CPU fallback)
	- ✅ No 404 errors - all paths correctly configured
	- ✅ Test script included for easy verification

	### API Versions Available
	- V1: Ollama + Transformers (requires external Ollama service)
	- V2: HuggingFace streaming (lightweight, ~500MB)
	- V3: Web scraping + Summarization (lightweight, ~550MB)
	- V4: Structured output with Qwen (GPU-optimized, 2-7GB depending on model)

	### Recent Features
	- Added V4 structured summarization API with Qwen models
	- NDJSON patch streaming for 43% faster time-to-first-token
	- Three summarization styles: executive, skimmer, eli5
	- GPU optimization with CUDA/MPS/CPU auto-detection
	- Automatic quantization (4-bit NF4, FP16, INT8)
	- Rich metadata output (category, sentiment, reading time)

	Live Space: https://colin730-SummarizerApp.hf.space 🎯