---
title: Text Summarizer API
emoji: πŸ“
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

Text Summarizer API

A FastAPI-based text summarization service with multiple summarization backends (Ollama, HuggingFace Transformers, and Qwen models), plus web scraping and structured JSON output.

πŸš€ Features

  • Multiple Summarization Engines: Ollama, HuggingFace Transformers, and Qwen models
  • Structured JSON Output: V4 API returns rich metadata (title, key points, category, sentiment, reading time)
  • Web Scraping Integration: V3 and V4 APIs can scrape articles directly from URLs
  • Real-time Streaming: All endpoints support Server-Sent Events (SSE) streaming
  • GPU Acceleration: V4 supports CUDA and MPS (Apple Silicon), with automatic quantization
  • RESTful API with FastAPI
  • Health monitoring and logging
  • Docker containerized for easy deployment
  • Free deployment on Hugging Face Spaces

πŸ“‘ API Endpoints

Health Check

GET /health

V1 API (Ollama + Transformers Pipeline)

POST /api/v1/summarize
POST /api/v1/summarize/stream
POST /api/v1/summarize/pipeline/stream

V2 API (HuggingFace Streaming)

POST /api/v2/summarize/stream

V3 API (Web Scraping + Summarization)

POST /api/v3/scrape-and-summarize/stream

V4 API (Structured Output with Qwen)

POST /api/v4/scrape-and-summarize/stream
POST /api/v4/scrape-and-summarize/stream-ndjson

🌐 Live Deployment

βœ… Successfully deployed and tested on Hugging Face Spaces!

Quick Test

# Test the live deployment - health check
curl https://colin730-SummarizerApp.hf.space/health

# Test V2 API (lightweight streaming)
curl -X POST https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream \
  -H "Content-Type: application/json" \
  -d '{"text":"This is a test of the live API.","max_tokens":50}'

# Test V3 API (web scraping)
curl -X POST https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/article","max_tokens":128}'

# Test V4 API (structured output, if enabled)
curl -X POST https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson \
  -H "Content-Type: application/json" \
  -d '{"text":"This is a test article. It contains important information.","style":"executive","max_tokens":256}'

Request Formats by API Version:

V1/V2 (Simple text summarization):

{
  "text": "Your long text to summarize here...",
  "max_tokens": 256,
  "prompt": "Summarize the following text concisely:"
}

V3 (URL scraping or text):

{
  "url": "https://example.com/article",
  "max_tokens": 256,
  "include_metadata": true,
  "use_cache": true
}

V4 (Structured output with styles):

{
  "url": "https://example.com/article",
  "style": "executive",
  "max_tokens": 512,
  "include_metadata": true,
  "use_cache": true
}

Which API to Use?

  • V1: Local deployment with Ollama (requires external service)
  • V2: Lightweight cloud deployment, simple text summaries
  • V3: When you need to scrape articles from URLs + simple summaries
  • V4: When you need rich metadata (category, sentiment, key points) + GPU acceleration

API Documentation

  • Swagger UI: /docs
  • ReDoc: /redoc

πŸ”§ Configuration

The service uses the following environment variables:

V1 Configuration (Ollama)

  • OLLAMA_MODEL: Model to use (default: llama3.2:1b)
  • OLLAMA_HOST: Ollama service host (default: http://localhost:11434)
  • OLLAMA_TIMEOUT: Request timeout in seconds (default: 60)
  • ENABLE_V1_WARMUP: Enable V1 warmup (default: false)

V2 Configuration (HuggingFace)

  • HF_MODEL_ID: HuggingFace model ID (default: sshleifer/distilbart-cnn-6-6)
  • HF_DEVICE_MAP: Device mapping (default: auto; uses GPU when available, otherwise falls back to CPU)
  • HF_TORCH_DTYPE: Torch dtype (default: auto)
  • HF_HOME: HuggingFace cache directory (default: /tmp/huggingface)
  • HF_MAX_NEW_TOKENS: Max new tokens (default: 128)
  • HF_TEMPERATURE: Sampling temperature (default: 0.7)
  • HF_TOP_P: Nucleus sampling (default: 0.95)
  • ENABLE_V2_WARMUP: Enable V2 warmup (default: true)

V3 Configuration (Web Scraping)

  • ENABLE_V3_SCRAPING: Enable V3 API (default: true)
  • SCRAPING_TIMEOUT: HTTP timeout for scraping (default: 10 seconds)
  • SCRAPING_MAX_TEXT_LENGTH: Max text to extract (default: 50000 chars)
  • SCRAPING_CACHE_ENABLED: Enable caching (default: true)
  • SCRAPING_CACHE_TTL: Cache TTL (default: 3600 seconds / 1 hour; caching behavior is sketched after this list)
  • SCRAPING_UA_ROTATION: Enable user-agent rotation (default: true)
  • SCRAPING_RATE_LIMIT_PER_MINUTE: Rate limit per IP (default: 10)
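
To illustrate what SCRAPING_CACHE_ENABLED and SCRAPING_CACHE_TTL imply, here is a minimal in-memory TTL cache keyed by URL. This is only a sketch of the caching behavior (names are illustrative), not the service's actual implementation.

import time

SCRAPING_CACHE_TTL = 3600  # seconds, mirroring the default above
_cache = {}  # url -> (expiry_timestamp, extracted_text)

def get_cached(url: str):
    """Return cached article text for a URL if it has not expired, else None."""
    entry = _cache.get(url)
    if entry and entry[0] > time.time():
        return entry[1]
    _cache.pop(url, None)  # drop stale or missing entries
    return None

def put_cached(url: str, text: str):
    """Store extracted text with an expiry SCRAPING_CACHE_TTL seconds from now."""
    _cache[url] = (time.time() + SCRAPING_CACHE_TTL, text)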

V4 Configuration (Structured Summarization)

  • ENABLE_V4_STRUCTURED: Enable V4 API (default: true)
  • ENABLE_V4_WARMUP: Load model at startup (default: false to save memory)
  • V4_MODEL_ID: Model to use (default: Qwen/Qwen2.5-1.5B-Instruct, alternative: Qwen/Qwen2.5-3B-Instruct)
  • V4_MAX_TOKENS: Max tokens to generate (default: 256, range: 128-2048)
  • V4_TEMPERATURE: Sampling temperature (default: 0.2 for consistent output)
  • V4_ENABLE_QUANTIZATION: Enable INT8 quantization on CPU or 4-bit NF4 on CUDA (default: true)
  • V4_USE_FP16_FOR_SPEED: Use FP16 precision for 2-3x faster inference on GPU (default: false)
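
As a rough illustration of what V4_ENABLE_QUANTIZATION and V4_USE_FP16_FOR_SPEED control, the snippet below loads a Qwen model with 4-bit NF4 quantization via transformers + bitsandbytes. This is a typical loading pattern, not the project's exact code; it assumes a CUDA GPU and that bitsandbytes is installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # matches the V4_MODEL_ID default above

# 4-bit NF4 quantization (roughly what V4_ENABLE_QUANTIZATION=true implies on CUDA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # drop this and pass torch_dtype=torch.float16 for plain FP16 (V4_USE_FP16_FOR_SPEED)
    device_map="auto",
)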

Server Configuration

  • SERVER_HOST: Server host (default: 127.0.0.1)
  • SERVER_PORT: Server port (default: 8000)
  • LOG_LEVEL: Logging level (default: INFO)

🐳 Docker Deployment

Local Development

# Build and run with docker-compose
docker-compose up --build

# Or run directly
docker build -f Dockerfile.hf -t summarizer-app .
docker run -p 7860:7860 summarizer-app

Hugging Face Spaces

This app is optimized for deployment on Hugging Face Spaces using Docker SDK.

V2-Only Deployment on HF Spaces:

  • Uses a small summarization model (default: sshleifer/distilbart-cnn-6-6, ~300MB) for fast startup
  • No Ollama dependency (saves memory and disk space)
  • Model downloads during warmup (when ENABLE_V2_WARMUP=true) so the first request is served instantly
  • Optimized for free tier resource limits

Environment Variables for HF Spaces:

For memory-constrained deployments (free tier):

ENABLE_V1_WARMUP=false
ENABLE_V2_WARMUP=false
ENABLE_V3_SCRAPING=true
ENABLE_V4_STRUCTURED=false
HF_MODEL_ID=sshleifer/distilbart-cnn-6-6
HF_HOME=/tmp/huggingface

For GPU-enabled deployments (paid tier with 16GB+ RAM):

ENABLE_V1_WARMUP=false
ENABLE_V2_WARMUP=false
ENABLE_V3_SCRAPING=true
ENABLE_V4_STRUCTURED=true
ENABLE_V4_WARMUP=false
V4_MODEL_ID=Qwen/Qwen2.5-3B-Instruct
V4_ENABLE_QUANTIZATION=true
V4_USE_FP16_FOR_SPEED=true

πŸ“Š Performance

V1 (Ollama + Transformers Pipeline)

  • V1 Models: llama3.2:1b (Ollama) + distilbart-cnn-6-6 (Transformers)
  • Memory usage: ~2-4GB RAM (when V1 warmup enabled)
  • Inference speed: ~2-5 seconds per request
  • Startup time: ~30-60 seconds (when V1 warmup enabled)

V2 (HuggingFace Streaming) - Primary on HF Spaces

  • V2 Model: sshleifer/distilbart-cnn-6-6 (~300MB download)
  • Memory usage: ~500MB RAM (when V2 warmup enabled)
  • Inference speed: Real-time token streaming
  • Startup time: ~30-60 seconds (includes model download when V2 warmup enabled)

V3 (Web Scraping + Summarization)

  • Dependencies: trafilatura, httpx, lxml (lightweight, no JavaScript rendering)
  • Memory usage: ~550MB RAM (V2 + scraping: +10-50MB)
  • Scraping speed: 200-500ms typical, <10ms on cache hit
  • Total latency: 2-5 seconds (scrape + summarize)
  • Success rate: 95%+ article extraction

V4 (Structured Summarization with Qwen)

  • V4 Models: Qwen/Qwen2.5-1.5B-Instruct (default) or Qwen/Qwen2.5-3B-Instruct (higher quality)
  • Memory usage:
    • 1.5B model: ~2-3GB RAM (FP16 on GPU), ~1GB (4-bit NF4 on CUDA)
    • 3B model: ~6-7GB RAM (FP16 on GPU), ~3-4GB (4-bit NF4 on CUDA)
  • Inference speed:
    • 1.5B model: 20-46 seconds per request
    • 3B model: 40-60 seconds per request
    • NDJSON streaming: 43% faster time-to-first-token
  • GPU acceleration: CUDA > MPS (Apple Silicon) > CPU (4x speed difference)
  • Output format: Structured JSON with 6 fields (title, main_summary, key_points, category, sentiment, read_time_min)
  • Styles: executive, skimmer, eli5

Memory Optimization

  • V1 warmup disabled by default (ENABLE_V1_WARMUP=false)
  • V2 warmup disabled by default (ENABLE_V2_WARMUP=false)
  • V4 warmup disabled by default (ENABLE_V4_WARMUP=false) - Saves 2-7GB RAM
  • HuggingFace Spaces deployment options:
    • V2-only: ~500MB (fits free tier)
    • V2+V3: ~550MB (fits free tier)
    • V2+V3+V4 (1.5B): ~3GB (requires paid tier)
    • V2+V3+V4 (3B): ~7GB (requires paid tier)
  • Local development: All versions can run simultaneously with 8-10GB RAM
  • GPU deployment: V4 benefits significantly from CUDA or MPS acceleration

πŸ› οΈ Development

Setup

# Install dependencies
pip install -r requirements.txt

# Run locally
uvicorn app.main:app --host 0.0.0.0 --port 7860

Testing

# Run tests
pytest

# Run with coverage
pytest --cov=app

πŸ“ Usage Examples

V1 API (Ollama)

import requests
import json

# V1 streaming summarization
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream",
    json={
        "text": "Your long article or text here...",
        "max_tokens": 256
    },
    stream=True
)

for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        print(data["content"], end="")
        if data["done"]:
            break

V2 API (HuggingFace Streaming) - Recommended

import requests
import json

# V2 streaming summarization (same request format as V1)
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream",
    json={
        "text": "Your long article or text here...",
        "max_tokens": 128  # V2 uses max_new_tokens
    },
    stream=True
)

for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])
        print(data["content"], end="")
        if data["done"]:
            break

V3 API (Web Scraping + Summarization) - Android App Primary Use Case

V3 supports two modes: URL scraping or direct text summarization

Mode 1: URL Scraping (recommended for articles)

import requests
import json

# V3 scrape article from URL and stream summarization
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
    json={
        "url": "https://example.com/article",
        "max_tokens": 256,
        "include_metadata": True,  # Get article title, author, etc.
        "use_cache": True  # Use cached content if available
    },
    stream=True
)

for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])

        # First event: metadata
        if data.get("type") == "metadata":
            print(f"Input type: {data['data']['input_type']}")  # 'url'
            print(f"Title: {data['data']['title']}")
            print(f"Author: {data['data']['author']}")
            print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")

        # Content events
        elif "content" in data:
            print(data["content"], end="")
            if data["done"]:
                print(f"\n\nTotal time: {data['latency_ms']}ms")
                break

Mode 2: Direct Text Summarization (fallback when scraping fails)

import requests
import json

# V3 direct text summarization (no scraping)
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream",
    json={
        "text": "Your article text here... (minimum 50 characters)",
        "max_tokens": 256,
        "include_metadata": True
    },
    stream=True
)

for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])

        # First event: metadata
        if data.get("type") == "metadata":
            print(f"Input type: {data['data']['input_type']}")  # 'text'
            print(f"Text length: {data['data']['text_length']} chars\n")

        # Content events
        elif "content" in data:
            print(data["content"], end="")
            if data["done"]:
                break

Note: Provide either url OR text, not both. Text mode is useful as a fallback (see the sketch after this list) when:

  • Article is behind a paywall
  • Website blocks scrapers
  • User has already extracted the text manually
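
A minimal sketch of that fallback, assuming the client already has a copy of the article text and that a failed scrape surfaces as an HTTP error status (if the service reports scrape failures as stream events instead, the except branch would need to inspect those events). Helper names and the example URL are illustrative.

import json
import requests

V3_URL = "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream"

def summarize(payload: dict) -> str:
    """POST a V3 request and return the accumulated summary text."""
    summary = ""
    with requests.post(V3_URL, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line.startswith(b'data: '):
                data = json.loads(line[6:])
                if "content" in data:
                    summary += data["content"]
                    if data["done"]:
                        break
    return summary

article_url = "https://example.com/article"  # illustrative
article_text = "Text the user copied from the article, at least fifty characters long..."

try:
    print(summarize({"url": article_url, "max_tokens": 256}))
except requests.HTTPError:
    # Scraping failed (paywall, blocked scraper, etc.); fall back to text mode
    print(summarize({"text": article_text, "max_tokens": 256}))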

V4 API (Structured Output with Qwen) - High-Quality Summaries

V4 supports two streaming formats and three summarization styles

Streaming Format 1: Standard JSON Streaming (stream)

import requests
import json

# V4 scrape article from URL and stream structured JSON
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream",
    json={
        "url": "https://example.com/article",
        "style": "executive",  # Options: "executive", "skimmer", "eli5"
        "max_tokens": 256,
        "include_metadata": True,
        "use_cache": True
    },
    stream=True
)

accumulated_content = ""  # collect the streamed JSON tokens here

for line in response.iter_lines():
    if line.startswith(b'data: '):
        data = json.loads(line[6:])

        # First event: metadata
        if data.get("type") == "metadata":
            print(f"Style: {data['data']['style']}")
            print(f"Scrape time: {data['data']['scrape_latency_ms']}ms\n")

        # Content events (streaming JSON tokens)
        elif "content" in data:
            accumulated_content += data["content"]
            print(data["content"], end="")
            if data["done"]:
                # Parse the fully accumulated JSON
                summary = json.loads(accumulated_content)
                print(f"\n\nTitle: {summary['title']}")
                print(f"Category: {summary['category']}")
                print(f"Sentiment: {summary['sentiment']}")
                print(f"Key Points: {summary['key_points']}")
                break

Streaming Format 2: NDJSON Patch Streaming (stream-ndjson) - 43% Faster

import requests
import json

# V4 NDJSON streaming - progressive JSON updates for real-time UI
response = requests.post(
    "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson",
    json={
        "text": "Your article text here (minimum 50 characters)...",
        "style": "skimmer",  # Brief, fact-focused summary
        "max_tokens": 512,
        "include_metadata": True
    },
    stream=True
)

summary = {}

for line in response.iter_lines():
    if line.startswith(b'data: '):
        event = json.loads(line[6:])
        
        # First event: metadata
        if event.get("type") == "metadata":
            print(f"Input: {event['data']['input_type']}")
            print(f"Style: {event['data']['style']}\n")
        
        # NDJSON patch events
        elif "delta" in event:
            delta = event["delta"]
            state = event["state"]
            
            if delta and delta.get("op") == "set":
                # Field set operation
                field = delta["field"]
                value = delta["value"]
                summary[field] = value
                print(f"{field}: {value}")
            
            elif delta and delta.get("op") == "append":
                # Array append operation
                field = delta["field"]
                value = delta["value"]
                if field not in summary:
                    summary[field] = []
                summary[field].append(value)
                print(f"+ {field}: {value}")
            
            elif delta and delta.get("op") == "done":
                # Final state
                print(f"\nβœ… Complete! Total time: {event.get('latency_ms', 0):.0f}ms")
                print(f"Tokens used: {event.get('tokens_used', 0)}")
                break
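
The patch handling in the loop above can be factored into a small helper. This is just a convenience wrapper around the set/append/done operations shown in the example, not part of the API itself.

def apply_patch(summary: dict, delta: dict) -> bool:
    """Apply one NDJSON patch event to the summary dict.

    Returns True when the 'done' operation is seen, False otherwise.
    """
    op = delta.get("op")
    if op == "set":
        summary[delta["field"]] = delta["value"]
    elif op == "append":
        summary.setdefault(delta["field"], []).append(delta["value"])
    elif op == "done":
        return True
    return False

# Usage inside the streaming loop above:
#     if "delta" in event and event["delta"]:
#         if apply_patch(summary, event["delta"]):
#             break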

Summarization Styles

Executive Style ("executive"):

  • Target audience: Business professionals, decision makers
  • Format: Concise, action-oriented, business impact focus
  • Example output: Strategic insights, financial implications, market trends

Skimmer Style ("skimmer"):

  • Target audience: Busy readers wanting quick facts
  • Format: Bullet-point style, scannable, fact-dense
  • Example output: Core facts, numbers, dates, names

ELI5 Style ("eli5"):

  • Target audience: General public, non-technical readers
  • Format: Simple explanations, analogies, relatable examples
  • Example output: What it means, why it matters, real-world impact
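
To compare the three styles on the same article, the standard streaming endpoint can be called once per style and the accumulated JSON parsed at the end. This sketch reuses the request format shown earlier; the article URL is a placeholder, and cached scrapes (use_cache defaults to true) keep repeat calls cheap.

import json
import requests

V4_URL = "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream"
ARTICLE = "https://example.com/article"  # placeholder

for style in ("executive", "skimmer", "eli5"):
    accumulated = ""
    payload = {"url": ARTICLE, "style": style, "max_tokens": 256}
    with requests.post(V4_URL, json=payload, stream=True) as response:
        for line in response.iter_lines():
            if line.startswith(b'data: '):
                data = json.loads(line[6:])
                if "content" in data:
                    accumulated += data["content"]
                    if data["done"]:
                        break
    summary = json.loads(accumulated)
    print(f"[{style}] {summary['title']}")
    for point in summary["key_points"]:
        print(f"  - {point}")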

V4 Output Schema

All V4 responses return structured JSON with these 6 fields:

{
  "title": "Click-worthy title (<100 chars)",
  "main_summary": "2-4 sentence summary (<500 chars)",
  "key_points": [
    "Key point 1",
    "Key point 2",
    "Key point 3"
  ],
  "category": "Technology",
  "sentiment": "Positive",
  "read_time_min": 5
}
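
For client code, the schema maps onto a small typed structure; the field names and types below mirror the documented schema, while the class itself is illustrative rather than something the API ships.

import json
from dataclasses import dataclass
from typing import List

@dataclass
class StructuredSummary:
    """Mirrors the six fields of the V4 output schema above."""
    title: str            # < 100 chars
    main_summary: str     # 2-4 sentence summary, < 500 chars
    key_points: List[str]
    category: str         # e.g. "Technology"
    sentiment: str        # e.g. "Positive"
    read_time_min: int    # estimated reading time in minutes

# Example: parse a complete V4 payload (e.g. the accumulated stream content)
raw = '{"title": "Example", "main_summary": "Two sentences.", "key_points": ["a", "b"], "category": "Technology", "sentiment": "Positive", "read_time_min": 5}'
summary = StructuredSummary(**json.loads(raw))
print(summary.title, summary.key_points)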

Android Client (SSE)

// Android SSE client example
val client = OkHttpClient()
val request = Request.Builder()
    .url("https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream")
    .post(RequestBody.create(
        MediaType.parse("application/json"),
        """{"text": "Your text...", "max_tokens": 128}"""
    ))
    .build()

client.newCall(request).enqueue(object : Callback {
    override fun onFailure(call: Call, e: IOException) {
        // Handle network errors (e.g. show a retry option in the UI)
        e.printStackTrace()
    }

    override fun onResponse(call: Call, response: Response) {
        val source = response.body()?.source()
        source?.use { bufferedSource ->
            while (true) {
                val line = bufferedSource.readUtf8Line() ?: break  // stop at end of stream
                if (line.startsWith("data: ")) {
                    val json = line.substring(6)
                    val data = Gson().fromJson(json, Map::class.java)
                    // Update UI with data["content"]
                    if (data["done"] == true) break
                }
            }
        }
    }
})

cURL Examples

# Test live deployment
curl https://colin730-SummarizerApp.hf.space/health

# V1 API (if Ollama is available)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v1/summarize/stream" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text...", "max_tokens": 256}'

# V2 API (HuggingFace streaming - recommended)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v2/summarize/stream" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text...", "max_tokens": 128}'

# V3 API - URL mode (web scraping + summarization)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article", "max_tokens": 256, "include_metadata": true}'

# V3 API - Text mode (direct summarization, no scraping)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v3/scrape-and-summarize/stream" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your article text here (minimum 50 characters)...", "max_tokens": 256}'

# V4 API - Standard JSON streaming (URL mode)
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article", "style": "executive", "max_tokens": 256}'

# V4 API - NDJSON patch streaming (Text mode) - 43% faster time-to-first-token
curl -X POST "https://colin730-SummarizerApp.hf.space/api/v4/scrape-and-summarize/stream-ndjson" \
  -H "Content-Type: application/json" \
  -d '{"text": "Your article text (minimum 50 chars)...", "style": "skimmer", "max_tokens": 512}'

Test Script

# Use the included test script
./scripts/test_endpoints.sh https://colin730-SummarizerApp.hf.space

πŸ”’ Security

  • Non-root user execution
  • Input validation and sanitization
  • SSRF protection: V3 and V4 APIs block localhost and private IP ranges (see the sketch after this list)
  • Rate limiting: Configurable per-IP rate limits for scraping endpoints
  • URL validation: Strict URL format checking (HTTP/HTTPS only)
  • Content limits: Maximum text lengths enforced (50,000 chars for V3/V4)
  • API key authentication (optional)
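
The SSRF checks can be illustrated with a small validator: reject non-HTTP(S) schemes and any hostname that resolves to a loopback, private, link-local, or reserved address. This is a sketch of the idea, not the service's actual code.

import ipaddress
import socket
from urllib.parse import urlparse

def is_url_allowed(url: str) -> bool:
    """Return True only for http(s) URLs that do not resolve to internal addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        resolved = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for family, _type, _proto, _canon, sockaddr in resolved:
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_loopback or ip.is_private or ip.is_link_local or ip.is_reserved:
            return False
    return True

# is_url_allowed("http://127.0.0.1:8080/")  -> False
# is_url_allowed("https://example.com/")    -> True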

πŸ“ˆ Monitoring

The service includes:

  • Health check endpoint
  • Request logging
  • Error tracking
  • Performance metrics

πŸ†˜ Troubleshooting

Common Issues

  1. Model not loading: Check if Ollama is running and model is pulled (V1 only)
  2. Out of memory:
    • V1: Ensure 2-4GB RAM available
    • V2/V3: Ensure ~500-550MB RAM available
    • V4 (1.5B): Ensure 2-3GB RAM available
    • V4 (3B): Ensure 6-7GB RAM available
  3. Slow startup: Normal on first run due to model download
  4. V4 slow inference: Enable GPU acceleration (CUDA or MPS) and FP16 for 2-4x speedup
  5. V4 quantization slow: Quantization takes 1-2 minutes on startup; disable warmup to defer until first request
  6. API errors: Inspect requests/responses in the interactive docs at /docs and check the application logs for details

Logs

View application logs in the Hugging Face Spaces interface or check the health endpoint for service status.

πŸ“„ License

MIT License - see LICENSE file for details.

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

βœ… Deployment Status

Successfully deployed and tested on Hugging Face Spaces! πŸš€

  • βœ… Proxy-aware FastAPI with root_path support
  • βœ… All endpoints working (health, docs, V1-V4 APIs)
  • βœ… Real-time streaming summarization
  • βœ… Structured JSON output with V4 API
  • βœ… GPU acceleration support (CUDA, MPS, CPU fallback)
  • βœ… No 404 errors - all paths correctly configured
  • βœ… Test script included for easy verification

API Versions Available

  • V1: Ollama + Transformers (requires external Ollama service)
  • V2: HuggingFace streaming (lightweight, ~500MB)
  • V3: Web scraping + Summarization (lightweight, ~550MB)
  • V4: Structured output with Qwen (GPU-optimized, 2-7GB depending on model)

Recent Features

  • Added V4 structured summarization API with Qwen models
  • NDJSON patch streaming for 43% faster time-to-first-token
  • Three summarization styles: executive, skimmer, eli5
  • GPU optimization with CUDA/MPS/CPU auto-detection
  • Automatic quantization (4-bit NF4, FP16, INT8)
  • Rich metadata output (category, sentiment, reading time)

Live Space: https://colin730-SummarizerApp.hf.space 🎯