# V3 Web Scraping API Implementation Plan

## Table of Contents

1. [Overview](#overview)
2. [Motivation](#motivation)
3. [Architecture Design](#architecture-design)
4. [Component Specifications](#component-specifications)
5. [API Design](#api-design)
6. [Implementation Details](#implementation-details)
7. [Testing Strategy](#testing-strategy)
8. [Deployment Considerations](#deployment-considerations)
9. [Performance Benchmarks](#performance-benchmarks)
10. [Future Enhancements](#future-enhancements)

---

## Overview

The V3 API introduces backend web scraping capabilities to the SummerizerApp, enabling the Android app to send article URLs and receive streamed summarizations without handling web scraping client-side.

**Key Goals:**

- Move web scraping from the Android app to the backend
- Solve JavaScript rendering, performance, and anti-scraping issues
- Maintain HuggingFace Spaces deployment compatibility (<600MB memory)
- Provide consistent, high-quality article extraction
- Enable caching for improved performance

---

## Motivation

### Current Pain Points (Client-Side Scraping)

**1. Performance Issues**
- Mobile devices have limited CPU/network resources
- Scraping takes 5-15 seconds on mobile
- High battery drain
- Excessive data usage (downloads full HTML + assets)

**2. JavaScript Rendering**
- Many modern sites require JavaScript execution
- Mobile webviews are inconsistent across Android versions
- Rendering issues are hard to debug

**3. Inconsistent Extraction**
- Different sites have different structures
- Custom parsing logic is needed per site
- Quality varies significantly

**4. Anti-Scraping Measures**
- Mobile IPs are easily identified and blocked
- Limited control over user-agents and headers
- Rate limiting is hard to implement per-device

### Benefits of Backend Scraping

| Aspect | Client-Side | Backend (V3) |
|--------|-------------|--------------|
| **Performance** | 5-15s | 2-5s |
| **Battery Impact** | High | None |
| **Data Usage** | Full page | Summary only |
| **Success Rate** | 60-70% | 95%+ |
| **Maintenance** | App updates | Instant server updates |
| **Caching** | Per-device | Shared across users |
| **Anti-Scraping** | Easily blocked | Sophisticated rotation |

---

## Architecture Design

### System Overview

```
┌─────────────┐
│ Android App │
└──────┬──────┘
       │ POST /api/v3/scrape-and-summarize/stream
       │ { "url": "https://...", "max_tokens": 256 }
       ↓
┌───────────────────────────────────────────────────┐
│                  FastAPI Backend                  │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │            V3 Router (/api/v3)              │  │
│  │  1. Validate URL & Check Cache              │  │
│  │  2. Scrape Article (ArticleScraperService)  │  │
│  │  3. Validate Content Quality                │  │
│  │  4. Cache Scraped Content                   │  │
│  │  5. Stream Summarization (V2 HF Service)    │  │
│  └─────────────────────────────────────────────┘  │
│                                                   │
│  Services:                                        │
│  ├─ ArticleScraperService (trafilatura)           │
│  ├─ HFStreamingSummarizer (existing V2)           │
│  └─ CacheService (in-memory TTL)                  │
└───────────────────────────────────────────────────┘
       │
       │ Server-Sent Events Stream
       ↓
┌─────────────┐
│ Android App │  Receives summary tokens in real-time
└─────────────┘
```
### Technology Stack

**Primary Stack (Always Enabled):**
- **Trafilatura** - Article extraction (F1 score: 0.958)
- **httpx** - Async HTTP client (already in stack)
- **lxml** - Fast HTML parsing
- **In-Memory Cache** - TTL-based caching

**Optional Stack (Enterprise/Local Only):**
- **Playwright** - JavaScript rendering fallback (NOT for HF Spaces)

### Request Flow

```
1. Android App → POST /api/v3/scrape-and-summarize/stream
   ↓
2. Middleware: Request ID tracking, CORS, timing
   ↓
3. V3 Route Handler: Schema validation
   ↓
4. Check Cache: URL already scraped recently?
   ├─ YES → Use cached content (skip to step 8)
   └─ NO  → Continue to step 5
   ↓
5. ArticleScraperService.scrape_article(url)
   ├─ Generate random user-agent & headers
   ├─ Fetch HTML with httpx (timeout: 10s)
   ├─ Extract with trafilatura
   ├─ Validate content quality (length, structure)
   └─ Extract metadata (title, author, date)
   ↓
6. Validation: Content length > 100 chars?
   ├─ YES → Continue
   └─ NO  → Return 422 error
   ↓
7. Cache: Store scraped content (TTL: 1 hour)
   ↓
8. HFStreamingSummarizer.summarize_text_stream()
   └─ Reuse existing V2 logic
   ↓
9. Stream Response: Server-Sent Events
   ├─ metadata event (title, scrape_latency)
   ├─ content chunks (tokens streaming)
   └─ done event (total_latency)
```

---

## Component Specifications

### 1. Article Scraper Service

**File:** `app/services/article_scraper.py`

**Responsibilities:**
- Fetch HTML from URLs
- Extract article content with trafilatura
- Rotate user-agents to avoid blocks
- Extract metadata (title, author, date, site_name)
- Validate content quality
- Handle errors gracefully

**Key Methods:**

```python
class ArticleScraperService:
    async def scrape_article(
        self,
        url: str,
        use_cache: bool = True
    ) -> Dict[str, Any]:
        """
        Scrape article content from URL.

        Returns:
            {
                'text': str,          # Extracted article text
                'title': str,         # Article title
                'author': str,        # Author name (if available)
                'date': str,          # Publication date (if available)
                'site_name': str,     # Website name
                'url': str,           # Original URL
                'method': str,        # 'static' or 'js_rendered'
                'scrape_time_ms': float
            }
        """
        pass

    def _get_random_headers(self) -> Dict[str, str]:
        """Generate realistic browser headers with random user-agent."""
        pass

    def _validate_content_quality(self, text: str) -> Tuple[bool, str]:
        """Check extracted content against the quality threshold; returns (is_valid, reason)."""
        pass
```

**Dependencies:**
- `trafilatura` - Article extraction
- `httpx` - Async HTTP requests
- `lxml` - HTML parsing
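To make the contract above concrete, here is a minimal sketch of the static scrape path. It assumes `trafilatura.extract()` / `trafilatura.extract_metadata()` for content and metadata and a plain `httpx.AsyncClient` for fetching; the function name `scrape_article_static` is illustrative, error handling is simplified, and caching plus the quality check are left to the caller (the rotated headers come from the Implementation Details section below):

```python
import time
from typing import Any, Dict

import httpx
import trafilatura


async def scrape_article_static(url: str, headers: Dict[str, str]) -> Dict[str, Any]:
    """Sketch of the static scrape path: fetch, extract, package metadata."""
    start = time.time()

    # Fetch raw HTML with a hard timeout (mirrors SCRAPING_TIMEOUT=10).
    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:
        response = await client.get(url, headers=headers)
        response.raise_for_status()

    # Extract the main article text; comments are rarely useful for summaries.
    text = trafilatura.extract(response.text, include_comments=False)
    if not text:
        raise ValueError("trafilatura could not extract article content")

    # Metadata (title, author, date, site name) comes from a separate pass.
    meta = trafilatura.extract_metadata(response.text)

    return {
        "text": text,
        "title": meta.title if meta else None,
        "author": meta.author if meta else None,
        "date": meta.date if meta else None,
        "site_name": meta.sitename if meta else None,
        "url": url,
        "method": "static",
        "scrape_time_ms": (time.time() - start) * 1000,
    }
```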
---

### 2. Caching Layer

**File:** `app/core/cache.py`

**Responsibilities:**
- Store scraped content in memory
- TTL-based expiration (default: 1 hour)
- URL-based key hashing
- Auto-cleanup of expired entries
- Cache statistics logging

**Key Methods:**

```python
class SimpleCache:
    def __init__(self, ttl_seconds: int = 3600):
        """Initialize cache with TTL in seconds."""
        pass

    def get(self, url: str) -> Optional[Dict]:
        """Get cached content for URL, None if not found/expired."""
        pass

    def set(self, url: str, data: Dict) -> None:
        """Cache content with TTL."""
        pass

    def clear_expired(self) -> int:
        """Remove expired entries, return count removed."""
        pass

    def stats(self) -> Dict[str, int]:
        """Return cache statistics (size, hits, misses)."""
        pass
```

**Why In-Memory Cache?**
- Zero additional dependencies
- No external services needed
- Fast (sub-millisecond access)
- Perfect for single-instance HF Spaces deployment
- Simple to implement and maintain
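A minimal sketch of how these methods could fit together. SHA-256 hashing of URLs and simple hit/miss counters are assumptions consistent with the responsibilities above, not fixed by the spec:

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


class SimpleCache:
    """In-memory TTL cache keyed by hashed URL (sketch, not final code)."""

    def __init__(self, ttl_seconds: int = 3600):
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, Dict]] = {}  # key -> (expires_at, data)
        self._hits = 0
        self._misses = 0

    @staticmethod
    def _key(url: str) -> str:
        # Hash URLs so keys have uniform size regardless of URL length.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str) -> Optional[Dict]:
        entry = self._store.get(self._key(url))
        if entry is None or entry[0] < time.time():
            self._misses += 1
            return None
        self._hits += 1
        return entry[1]

    def set(self, url: str, data: Dict) -> None:
        self._store[self._key(url)] = (time.time() + self._ttl, data)

    def clear_expired(self) -> int:
        now = time.time()
        expired = [k for k, (exp, _) in self._store.items() if exp < now]
        for k in expired:
            del self._store[k]
        return len(expired)

    def stats(self) -> Dict[str, int]:
        return {"size": len(self._store), "hits": self._hits, "misses": self._misses}
```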
---

### 3. V3 API Structure

**Directory:** `app/api/v3/`

#### 3.1 Routes (`routes.py`)

```python
from fastapi import APIRouter
from app.api.v3 import scrape_summarize

api_router = APIRouter()
api_router.include_router(
    scrape_summarize.router,
    tags=["V3 - Web Scraping & Summarization"]
)
```

#### 3.2 Schemas (`schemas.py`)

```python
from pydantic import BaseModel, Field, validator
from typing import Optional
import re


class ScrapeAndSummarizeRequest(BaseModel):
    """Request schema for scrape-and-summarize endpoint."""

    url: str = Field(
        ...,
        description="URL of article to scrape and summarize",
        example="https://example.com/article"
    )
    max_tokens: Optional[int] = Field(
        default=256,
        ge=1,
        le=2048,
        description="Maximum tokens in summary"
    )
    temperature: Optional[float] = Field(
        default=0.3,
        ge=0.0,
        le=2.0,
        description="Sampling temperature (lower = more focused)"
    )
    top_p: Optional[float] = Field(
        default=0.9,
        ge=0.0,
        le=1.0,
        description="Nucleus sampling parameter"
    )
    prompt: Optional[str] = Field(
        default="Summarize this article concisely:",
        description="Custom summarization prompt"
    )
    include_metadata: Optional[bool] = Field(
        default=True,
        description="Include article metadata in response"
    )
    use_cache: Optional[bool] = Field(
        default=True,
        description="Use cached content if available"
    )

    @validator('url')
    def validate_url(cls, v):
        """Validate URL format."""
        url_pattern = re.compile(
            r'^https?://'  # http:// or https://
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain
            r'localhost|'  # localhost
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # or IP
            r'(?::\d+)?'  # optional port
            r'(?:/?|[/?]\S+)$',
            re.IGNORECASE
        )
        if not url_pattern.match(v):
            raise ValueError('Invalid URL format')
        return v


class ArticleMetadata(BaseModel):
    """Article metadata extracted during scraping."""

    title: Optional[str] = Field(None, description="Article title")
    author: Optional[str] = Field(None, description="Author name")
    date_published: Optional[str] = Field(None, description="Publication date")
    site_name: Optional[str] = Field(None, description="Website name")
    url: str = Field(..., description="Original URL")
    extracted_text_length: int = Field(..., description="Length of extracted text")
    scrape_method: str = Field(..., description="Scraping method used")
    scrape_latency_ms: float = Field(..., description="Time taken to scrape (ms)")


class ErrorResponse(BaseModel):
    """Error response schema."""

    detail: str = Field(..., description="Error message")
    code: str = Field(..., description="Error code")
    request_id: Optional[str] = Field(None, description="Request tracking ID")
```

#### 3.3 Endpoint Implementation (`scrape_summarize.py`)

**Streaming Endpoint:**

```python
from fastapi import APIRouter, HTTPException, Request
from fastapi.responses import StreamingResponse
from app.api.v3.schemas import ScrapeAndSummarizeRequest
from app.services.article_scraper import article_scraper_service
from app.services.hf_streaming_summarizer import hf_streaming_service
from app.core.logging import get_logger
import json
import time

router = APIRouter()
logger = get_logger(__name__)


@router.post("/scrape-and-summarize/stream")
async def scrape_and_summarize_stream(
    request: Request,
    payload: ScrapeAndSummarizeRequest
):
    """
    Scrape article from URL and stream summarization.

    Process:
    1. Scrape article content from URL (with caching)
    2. Validate content quality
    3. Stream summarization using V2 HF engine

    Returns:
        Server-Sent Events stream with:
        - Metadata event (title, author, scrape latency)
        - Content chunks (streaming summary tokens)
        - Done event (final latency)
    """
    request_id = getattr(request.state, 'request_id', 'unknown')
    logger.info(f"[{request_id}] V3 scrape-and-summarize request for: {payload.url}")

    # Step 1: Scrape article
    scrape_start = time.time()
    try:
        article_data = await article_scraper_service.scrape_article(
            url=payload.url,
            use_cache=payload.use_cache
        )
    except Exception as e:
        logger.error(f"[{request_id}] Scraping failed: {e}")
        raise HTTPException(
            status_code=502,
            detail=f"Failed to scrape article: {str(e)}"
        )

    scrape_latency_ms = (time.time() - scrape_start) * 1000
    logger.info(f"[{request_id}] Scraped in {scrape_latency_ms:.2f}ms, "
                f"extracted {len(article_data['text'])} chars")

    # Step 2: Validate content
    if len(article_data['text']) < 100:
        raise HTTPException(
            status_code=422,
            detail="Insufficient content extracted from URL. "
                   "Article may be behind paywall or site may block scrapers."
        )

    # Step 3: Stream summarization
    return StreamingResponse(
        _stream_generator(article_data, payload, scrape_latency_ms, request_id),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
            "X-Request-ID": request_id,
        }
    )


async def _stream_generator(article_data, payload, scrape_latency_ms, request_id):
    """Generate SSE stream for scraping + summarization."""

    # Send metadata event first
    if payload.include_metadata:
        metadata_event = {
            "type": "metadata",
            "data": {
                "title": article_data.get('title'),
                "author": article_data.get('author'),
                "date": article_data.get('date'),
                "site_name": article_data.get('site_name'),
                "url": article_data.get('url'),
                "scrape_method": article_data.get('method', 'static'),
                "scrape_latency_ms": scrape_latency_ms,
                "extracted_text_length": len(article_data['text']),
            }
        }
        yield f"data: {json.dumps(metadata_event)}\n\n"

    # Stream summarization chunks (reuse V2 HF service)
    summarization_start = time.time()
    tokens_used = 0

    try:
        async for chunk in hf_streaming_service.summarize_text_stream(
            text=article_data['text'],
            max_new_tokens=payload.max_tokens,
            temperature=payload.temperature,
            top_p=payload.top_p,
            prompt=payload.prompt,
        ):
            # Forward V2 chunks as-is
            if not chunk.get('done', False):
                tokens_used = chunk.get('tokens_used', tokens_used)
            yield f"data: {json.dumps(chunk)}\n\n"
    except Exception as e:
        logger.error(f"[{request_id}] Summarization failed: {e}")
        error_event = {
            "type": "error",
            "error": str(e),
            "done": True
        }
        yield f"data: {json.dumps(error_event)}\n\n"
        return

    summarization_latency_ms = (time.time() - summarization_start) * 1000
    total_latency_ms = scrape_latency_ms + summarization_latency_ms
    logger.info(f"[{request_id}] V3 request completed in {total_latency_ms:.2f}ms "
                f"(scrape: {scrape_latency_ms:.2f}ms, summary: {summarization_latency_ms:.2f}ms)")
```
---

### 4. Configuration Updates

**File:** `app/core/config.py`

**New Settings:**

```python
class Settings(BaseSettings):
    # ... existing settings ...

    # V3 Web Scraping Configuration
    enable_v3_scraping: bool = Field(
        default=True,
        env="ENABLE_V3_SCRAPING",
        description="Enable V3 web scraping API"
    )
    scraping_timeout: int = Field(
        default=10,
        env="SCRAPING_TIMEOUT",
        ge=1,
        le=60,
        description="HTTP timeout for scraping requests (seconds)"
    )
    scraping_max_text_length: int = Field(
        default=50000,
        env="SCRAPING_MAX_TEXT_LENGTH",
        description="Maximum text length to extract (chars)"
    )
    scraping_cache_enabled: bool = Field(
        default=True,
        env="SCRAPING_CACHE_ENABLED",
        description="Enable in-memory caching of scraped content"
    )
    scraping_cache_ttl: int = Field(
        default=3600,
        env="SCRAPING_CACHE_TTL",
        description="Cache TTL in seconds (default: 1 hour)"
    )
    scraping_user_agent_rotation: bool = Field(
        default=True,
        env="SCRAPING_UA_ROTATION",
        description="Enable user-agent rotation"
    )
    scraping_rate_limit_per_minute: int = Field(
        default=10,
        env="SCRAPING_RATE_LIMIT_PER_MINUTE",
        ge=1,
        le=100,
        description="Max scraping requests per minute per IP"
    )
```

**Environment Variables (.env):**

```bash
# V3 Web Scraping Configuration
ENABLE_V3_SCRAPING=true
SCRAPING_TIMEOUT=10
SCRAPING_MAX_TEXT_LENGTH=50000
SCRAPING_CACHE_ENABLED=true
SCRAPING_CACHE_TTL=3600
SCRAPING_UA_ROTATION=true
SCRAPING_RATE_LIMIT_PER_MINUTE=10
```
---

### 5. Main Application Integration

**File:** `app/main.py`

**Changes:**

```python
from app.core.config import settings
from app.services.article_scraper import article_scraper_service

# Conditionally include V3 router
if settings.enable_v3_scraping:
    from app.api.v3.routes import api_router as v3_api_router
    app.include_router(v3_api_router, prefix="/api/v3")
    logger.info("✅ V3 Web Scraping API enabled")
else:
    logger.info("⏭️ V3 Web Scraping API disabled")


@app.on_event("startup")
async def startup_event():
    # ... existing V1/V2 warmup ...

    # V3 scraping service info
    if settings.enable_v3_scraping:
        logger.info(f"V3 scraping timeout: {settings.scraping_timeout}s")
        logger.info(f"V3 cache enabled: {settings.scraping_cache_enabled}")
        if settings.scraping_cache_enabled:
            logger.info(f"V3 cache TTL: {settings.scraping_cache_ttl}s")
```

---

## API Design

### Endpoint: POST /api/v3/scrape-and-summarize/stream

**Request Body:**

```json
{
  "url": "https://example.com/article",
  "max_tokens": 256,
  "temperature": 0.3,
  "top_p": 0.9,
  "prompt": "Summarize this article concisely:",
  "include_metadata": true,
  "use_cache": true
}
```

**Response (Server-Sent Events):**

```
data: {"type":"metadata","data":{"title":"Article Title","author":"John Doe","date":"2024-01-15","site_name":"Example Blog","scrape_method":"static","scrape_latency_ms":450.2,"extracted_text_length":3421}}

data: {"content":"The","done":false,"tokens_used":1}

data: {"content":" article","done":false,"tokens_used":3}

data: {"content":" discusses","done":false,"tokens_used":5}

...

data: {"content":"","done":true,"latency_ms":2340.5}
```

**Error Responses:**

| Status Code | Description | Example |
|-------------|-------------|---------|
| 400 | Invalid request | `{"detail":"Invalid URL format","code":"INVALID_REQUEST"}` |
| 422 | Content extraction failed | `{"detail":"Insufficient content extracted","code":"EXTRACTION_FAILED"}` |
| 429 | Rate limit exceeded | `{"detail":"Too many requests","code":"RATE_LIMIT"}` |
| 502 | Scraping failed | `{"detail":"Failed to scrape article: Connection timeout","code":"SCRAPING_ERROR"}` |
| 504 | Timeout | `{"detail":"Scraping timeout exceeded","code":"TIMEOUT"}` |
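For reference, a client can consume this stream with any SSE-capable HTTP library. Below is a minimal Python sketch using `httpx`; the base URL `http://localhost:8000` is an assumption for local testing, and the parsing relies only on the `data: ` framing and event shapes shown above:

```python
import asyncio
import json

import httpx


async def consume_summary(article_url: str) -> None:
    """Stream a V3 summary and print tokens as they arrive (client sketch)."""
    payload = {"url": article_url, "max_tokens": 256, "include_metadata": True}

    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "http://localhost:8000/api/v3/scrape-and-summarize/stream",
            json=payload,
        ) as response:
            response.raise_for_status()
            async for line in response.aiter_lines():
                if not line.startswith("data: "):
                    continue  # skip blank keep-alive lines between events
                event = json.loads(line[len("data: "):])
                if event.get("type") == "metadata":
                    print(f"Title: {event['data'].get('title')}")
                elif event.get("done"):
                    print(f"\nDone in {event.get('latency_ms', 0):.0f}ms")
                else:
                    print(event.get("content", ""), end="", flush=True)


if __name__ == "__main__":
    asyncio.run(consume_summary("https://example.com/article"))
```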
---

## Implementation Details

### User-Agent Rotation

**File:** `app/services/article_scraper.py`

```python
USER_AGENTS = [
    # Chrome on Windows (most common)
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    # Chrome on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
    # Safari on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]


def _get_random_headers(self) -> Dict[str, str]:
    """Generate realistic browser headers."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }
```

### Rate Limiting

**Per-IP Rate Limiting (FastAPI middleware):**

```python
# File: app/core/rate_limiter.py
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)

# In routes.py:
@router.post("/scrape-and-summarize/stream")
@limiter.limit(f"{settings.scraping_rate_limit_per_minute}/minute")
async def scrape_and_summarize_stream(
    request: Request,
    payload: ScrapeAndSummarizeRequest
):
    pass
```

**Per-Domain Rate Limiting:**

```python
# File: app/core/domain_rate_limiter.py
from collections import defaultdict
from datetime import datetime, timedelta
from urllib.parse import urlparse


class DomainRateLimiter:
    """Prevent hammering same domain repeatedly."""

    def __init__(self, max_requests: int = 10, window_seconds: int = 60):
        self._requests = defaultdict(list)
        self._max_requests = max_requests
        self._window = window_seconds

    def check_rate_limit(self, url: str) -> bool:
        """Check if request is within rate limit for domain."""
        domain = urlparse(url).netloc
        now = datetime.now()
        window_start = now - timedelta(seconds=self._window)

        # Clean old requests
        self._requests[domain] = [
            ts for ts in self._requests[domain]
            if ts > window_start
        ]

        # Check limit
        if len(self._requests[domain]) >= self._max_requests:
            return False  # Rate limit exceeded

        # Record request
        self._requests[domain].append(now)
        return True


# Global instance
domain_rate_limiter = DomainRateLimiter(max_requests=10, window_seconds=60)
```

### Content Quality Validation

```python
def _validate_content_quality(self, text: str) -> tuple[bool, str]:
    """
    Validate extracted content meets quality threshold.

    Returns:
        (is_valid, reason)
    """
    # Check minimum length
    if len(text) < 100:
        return False, "Content too short (< 100 chars)"

    # Check for mostly whitespace
    non_whitespace = len(text.replace(' ', '').replace('\n', '').replace('\t', ''))
    if non_whitespace < 50:
        return False, "Mostly whitespace"

    # Check for reasonable sentence structure (basic heuristic)
    sentence_endings = text.count('.') + text.count('!') + text.count('?')
    if sentence_endings < 3:
        return False, "No clear sentence structure"

    # Check word count
    words = text.split()
    if len(words) < 50:
        return False, "Too few words (< 50)"

    return True, "OK"
```

---

## Testing Strategy

### Unit Tests

**File:** `tests/test_article_scraper.py`

**Coverage:**
- Article extraction with various HTML structures
- User-agent rotation
- Content quality validation
- Metadata extraction
- Error handling (timeouts, 404s, invalid HTML)
- Cache hit/miss scenarios

**Example Test:**

```python
import pytest
from unittest.mock import Mock, patch
from app.services.article_scraper import ArticleScraperService


@pytest.mark.asyncio
async def test_scrape_article_success():
    """Test successful article scraping."""
    service = ArticleScraperService()

    # Mock HTML response
    mock_html = """
    <html><body><article>
      <h1>Test Article Title</h1>
      <p>This is a test article with meaningful content.</p>
      <p>It has multiple paragraphs to test extraction.</p>
    </article></body></html>
    """

    # Patch the HTML fetch so no network request is made. The patch target
    # (_fetch_html) is a hypothetical internal helper; adjust to however the
    # service actually wraps httpx.
    with patch.object(service, "_fetch_html", return_value=mock_html, create=True):
        result = await service.scrape_article("https://example.com/article", use_cache=False)

    assert "meaningful content" in result["text"]
    assert result["url"] == "https://example.com/article"
```
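These tests use `@pytest.mark.asyncio`, so running them locally requires the `pytest-asyncio` plugin:

```bash
pip install pytest pytest-asyncio
pytest -q tests/test_article_scraper.py
```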