Spaces:

WORKWITHSHAFISK
/

segmentopulse-backend

Paused

segmentopulse-backend / app /services /utils /image_enricher.py

SHAFI

added Massive Tech News Ingestion , more than 10+ news providers added to the ingestion part

ff4f05b about 2 months ago

8.35 kB

	"""
	app/services/utils/image_enricher.py
	─────────────────────────────────────────────────────────────────────────────
	Shared Image Enrichment Utility for Segmento Pulse.

	What this does:
	Given any article URL, this tool visits the page and tries to find the
	main (top) image that the website publisher chose for that article.

	It does this by reading two standard HTML meta tags:
	1. og:image — Open Graph (used by Facebook, LinkedIn, Twitter)
	2. twitter:image — Twitter Card image

	Almost every modern news website, blog, and tech publication sets at least
	one of these tags. They are the industry-standard way to declare "this is
	my article's main image".

	── WHY WE USE bs4 + httpx INSTEAD OF newspaper4k ────────────────────────────

	The user directive requested newspaper4k (a modern async fork of newspaper3k).
	However, newspaper4k is not in our requirements.txt and would add a heavy new
	dependency with many sub-packages (including lxml, Pillow, and others).

	Our current stack already has everything we need:
	✓ httpx — async HTTP client (already in requirements.txt)
	✓ beautifulsoup4 — HTML parser (already in requirements.txt)
	✓ lxml — fast XML/HTML parser (already in requirements.txt)

	The og:image meta tag approach is exactly what newspaper4k uses internally
	for its top_image property. We get the same result without a new dependency.

	This decision follows our Version First-Scan Protocol: never add a library
	when an existing installed library can do the same job.

	── HOW THE TIMEOUT PROTECTION WORKS ─────────────────────────────────────────

	Some websites are slow, broken, or behind Cloudflare protection pages.
	If we waited forever for them, our entire ingestion pipeline would freeze.

	Two layers of protection:
	1. httpx timeout: 3 seconds max to receive any response at all.
	If the server doesn't respond in 3 seconds, httpx raises TimeoutException.

	2. asyncio.wait_for: 4 seconds total ceiling for the entire function.
	Even if httpx somehow hangs (rare), this outer guard kills it.

	3. Universal try/except: Catches EVERYTHING. A bad image URL will NEVER
	crash a provider. The worst it can do is return "".

	The function signature is intentionally similar to newspaper4k's approach
	so that future migration is a one-line change if newspaper4k is later added.
	"""

	# ── Standard Library ──────────────────────────────────────────────────────────
	import asyncio
	import logging
	from typing import Optional

	# ── Third-party (already in requirements.txt) ─────────────────────────────────
	import httpx
	from bs4 import BeautifulSoup

	logger = logging.getLogger(__name__)

	# ── Timing constants ──────────────────────────────────────────────────────────

	# How long to wait for the target website to respond (seconds).
	# 3 seconds is generous enough for normal websites, short enough to not
	# freeze our pipeline if a URL is broken or behind Cloudflare.
	HTTP_FETCH_TIMEOUT = 3.0

	# Hard outer ceiling for the entire extract_top_image() call.
	# Even if httpx somehow hangs past its own timeout, asyncio.wait_for
	# will forcibly cancel the task at this point.
	OUTER_TIMEOUT_SECONDS = 4.0


	async def extract_top_image(url: str) -> str:
	"""
	Visit an article URL and extract its main (top) image.

	Looks for the image in two standard HTML meta tags, in this order:
	1. <meta property="og:image" content="...">
	2. <meta name="twitter:image" content="...">

	Args:
	url (str): Full article URL (must start with "http").

	Returns:
	str: The image URL if found and valid. "" if not found or any error.

	This function NEVER raises an exception. If anything goes wrong
	(timeout, bad HTML, no meta tag found), it returns "" silently.
	The pipeline treats "" as "no image" and shows the Pulse banner instead.
	"""
	if not url or not url.startswith("http"):
	return ""

	try:
	# Wrap everything in asyncio.wait_for so we have a hard ceiling.
	# If _fetch_and_extract takes longer than OUTER_TIMEOUT_SECONDS, it
	# is cancelled automatically and we return "" from the except block.
	image_url = await asyncio.wait_for(
	_fetch_and_extract(url),
	timeout=OUTER_TIMEOUT_SECONDS,
	)
	return image_url

	except asyncio.TimeoutError:
	logger.debug(f"[ImageEnricher] Outer timeout for: {url[:60]}")
	return ""
	except Exception as e:
	logger.debug(f"[ImageEnricher] Failed for '{url[:60]}': {e}")
	return ""


	async def _fetch_and_extract(url: str) -> str:
	"""
	Internal helper: download the HTML and pull out the og:image tag.

	Separated from extract_top_image() so asyncio.wait_for() has a clean
	coroutine to cancel if needed.

	Args:
	url (str): Full article URL.

	Returns:
	str: Image URL from meta tag, or "" if none found.
	"""
	try:
	async with httpx.AsyncClient(timeout=HTTP_FETCH_TIMEOUT) as client:
	response = await client.get(
	url,
	headers={
	# Some sites block requests without a browser User-Agent.
	# We mimic a normal browser to get past basic protections.
	"User-Agent": (
	"Mozilla/5.0 (compatible; SegmentoPulse-ImageBot/1.0; "
	"+https://segmento.in)"
	),
	# Tell the server we only need enough HTML to read the <head>.
	# This does NOT guarantee the server sends less data, but it
	# is polite and some servers respect it.
	"Accept": "text/html",
	},
	follow_redirects=True,
	)

	if response.status_code != 200:
	return ""

	html = response.text

	except Exception:
	# Network error, timeout, SSL error, etc.
	return ""

	# ── Parse the HTML and look for meta image tags ───────────────────────────
	# We only need the <head> section — everything in <body> is irrelevant
	# and would slow down BeautifulSoup's parsing.
	# NOTE: We pass only the first 10,000 characters to avoid processing huge
	# HTML files. og:image is always in the <head> which is near the top.
	try:
	soup = BeautifulSoup(html[:10_000], "lxml")
	except Exception:
	# If lxml fails (malformed HTML), try the built-in html.parser
	try:
	soup = BeautifulSoup(html[:10_000], "html.parser")
	except Exception:
	return ""

	# ── Priority 1: Open Graph image (most reliable) ─────────────────────────
	og_tag = soup.find("meta", property="og:image")
	if og_tag:
	image_url = (og_tag.get("content") or "").strip()
	if image_url and image_url.startswith("http"):
	logger.debug(f"[ImageEnricher] og:image found for {url[:50]}")
	return image_url

	# ── Priority 2: Twitter Card image (common fallback) ─────────────────────
	tw_tag = soup.find("meta", attrs={"name": "twitter:image"})
	if tw_tag:
	image_url = (tw_tag.get("content") or "").strip()
	if image_url and image_url.startswith("http"):
	logger.debug(f"[ImageEnricher] twitter:image found for {url[:50]}")
	return image_url

	# No image tag found — return empty, let the banner fallback handle it.
	logger.debug(f"[ImageEnricher] No meta image tag found for: {url[:60]}")
	return ""