feat(dashboard): separate keep-alive and DB readiness monitoring (#3)

- .claude/settings.local.json +2 -1
- .plans/separate-keepalive-and-readiness-monitoring.md +68 -0
- CHANGELOG.md +13 -1
- README.md +24 -10
- app.py +44 -16
- main.py +8 -3
- pyproject.toml +1 -1
`.claude/settings.local.json` (CHANGED)

```diff
@@ -1,7 +1,8 @@
 {
   "permissions": {
     "allow": [
-      "Bash(gh api:*)"
+      "Bash(gh api:*)",
+      "Bash(uv run pyright:*)"
     ]
   }
 }
```
`.plans/separate-keepalive-and-readiness-monitoring.md` (ADDED)

# Separate Keep-Alive and DB Readiness Monitoring

**Status:** Completed
**Date:** 2026-01-27

## Goal

Refactor the monitoring system to clearly separate two concerns:

1. **Keep-Alive Pings** - Simple health checks to prevent Spaces from sleeping
2. **DB Readiness Monitoring** - Health checks with failure tracking and auto-restart

## Problem

The original implementation conflated keep-alive pings with DB readiness monitoring. This made it unclear what each worker entry was doing and prevented independent tuning of intervals.

## Solution

### Two-Entry Pattern

For Spaces that need both keep-alive and DB monitoring (like n8n), use two separate config entries:

```python
# Keep-alive (prevents sleep, frequent)
{"url": "https://oharu121-n8n-workflow.hf.space/health", "interval_minutes": 5},

# DB readiness monitoring (auto-restart, less frequent)
{
    "url": "https://oharu121-n8n-workflow.hf.space/healthz/readiness",
    "interval_minutes": 60,
    "space_id": "oharu121/n8n-workflow",
},
```

### Two-Table Dashboard

The Gradio UI now displays two separate tables:

**Table 1: Keep-Alive Pings**

| Worker | Interval | Last Ping | Status |
|--------|----------|-----------|--------|

Simple view showing whether Spaces are running.

**Table 2: DB Readiness Monitoring**

| Space | Interval | Last Check | Status | Failures | Last Restart | Restarts |
|-------|----------|------------|--------|----------|--------------|----------|

Detailed view with failure tracking and restart history.

### Enhanced Tracking

Added new fields for monitored workers:

- `last_restart_time` - When the last restart was triggered
- `total_restarts` - Total number of restarts since startup

## Files Modified

- [main.py](../main.py) - Updated WORKERS config, added restart tracking
- [app.py](../app.py) - Two-table dashboard, `extract_worker_name()` helper
- [README.md](../README.md) - Updated configuration documentation

## Benefits

1. **Clear separation of concerns** - Each config entry has one purpose
2. **Independent tuning** - Adjust keep-alive and DB check intervals separately
3. **Self-documenting config** - Reading the config shows what each entry does
4. **Better failure visibility** - DB failures get their own dedicated table
5. **Restart history** - Track when and how often restarts occur
`CHANGELOG.md` (CHANGED) — added entry:

## [0.3.0] - 2026-01-27

### Added

- Two-table dashboard separating Keep-Alive Pings from DB Readiness Monitoring
- `extract_worker_name()` helper to display friendly Space names from URLs
- Restart history tracking with `last_restart_time` and `total_restarts` fields
- n8n keep-alive entry (`/health` at 5 min) separate from DB readiness (`/healthz/readiness` at 60 min)

### Changed

- Dashboard now shows two distinct tables for different monitoring purposes
- Worker display names extracted from URLs (e.g., `n8n-workflow` instead of the full URL)
`README.md` (CHANGED) — updated Configuration section:

## Configuration

Edit `main.py` to configure workers. There are two types of workers:

### Keep-Alive Pings

Simple health checks to prevent Spaces from sleeping:

```python
WORKERS = [
    {"url": "https://your-space.hf.space/health", "interval_minutes": 5},
    {"url": "https://another-space.hf.space/healthz", "interval_minutes": 10},
]
```

### DB Readiness Monitoring with Auto-Restart

For Spaces that need database connectivity monitoring and automatic restart on failure, add `space_id`:

```python
WORKERS = [
    # Keep-alive (prevents sleep)
    {"url": "https://your-space.hf.space/health", "interval_minutes": 5},
    # DB readiness monitoring (triggers restart on failure)
    {
        "url": "https://your-space.hf.space/healthz/readiness",
        "interval_minutes": 60,
        "space_id": "username/space-name",
    },
]
```

This pattern allows you to:

- Keep a Space awake with frequent pings (5 min)
- Monitor DB health less frequently (60 min)
- Auto-restart after 3 consecutive DB check failures

| Setting | Default | Description |
|---------|---------|-------------|
| `url` | Required | Health check endpoint URL |
| `interval_minutes` | Required | Ping frequency in minutes |
| `space_id` | None | HF Space ID for auto-restart (e.g., `username/space-name`) |
| Failure threshold | 3 | Consecutive failures before auto-restart |

**Requirements for auto-restart:**

- `HF_RESTART_TOKEN` environment variable with write permission
- Generate token at https://huggingface.co/settings/tokens

## Endpoints
`app.py` (CHANGED) — updated version of the changed hunks:

```python
import re
from datetime import datetime

import gradio as gr
import uvicorn

from main import WORKERS, app, start_time, worker_status


def extract_worker_name(url: str) -> str:
    """Extract Space name from HF URL."""
    # https://oharu121-n8n-workflow.hf.space/health -> n8n-workflow
    match = re.search(r"https://[^-]+-(.+?)\.hf\.space", url)
    return match.group(1) if match else url


def get_status() -> tuple[str, str, list[list[str]], list[list[str]]]:
    """Get current status for Gradio UI."""
    # Overall status
    status = "🟢 Online" if start_time else "🔴 Offline"
    # ... uptime calculation unchanged; sets uptime, or "N/A" when unavailable ...

    # Split workers into two tables
    keepalive_data = []
    readiness_data = []

    for worker in WORKERS:
        url = str(worker["url"])
        name = extract_worker_name(url)
        interval = f"{worker['interval_minutes']} min"
        ws = worker_status.get(url, {})
        last_ping = str(ws.get("last_ping", "Not yet"))
        ping_status = "✅" if ws.get("status") == "ok" else ("❌" if ws.get("status") == "failed" else "⏳")

        if worker.get("space_id"):
            # DB readiness table
            consecutive_failures = str(ws.get("consecutive_failures", 0))
            last_restart = str(ws.get("last_restart_time", "Never"))
            total_restarts = str(ws.get("total_restarts", 0))
            readiness_data.append([name, interval, last_ping, ping_status, consecutive_failures, last_restart, total_restarts])
        else:
            # Keep-alive table
            keepalive_data.append([name, interval, last_ping, ping_status])

    return status, uptime, keepalive_data, readiness_data


def create_ui() -> gr.Blocks:
    """Create Gradio UI."""
    with gr.Blocks(title="HF Master Pinger") as demo:
        gr.Markdown("# HF Master Pinger")
        gr.Markdown("Centralized pinger to keep HF Spaces alive")

        with gr.Row():
            status_text = gr.Textbox(label="Status", interactive=False)
            uptime_text = gr.Textbox(label="Uptime", interactive=False)

        gr.Markdown("## Keep-Alive Pings")
        gr.Markdown("Simple health checks to prevent Spaces from sleeping")
        keepalive_table = gr.Dataframe(
            headers=["Worker", "Interval", "Last Ping", "Status"],
            label="Keep-Alive Workers",
            interactive=False,
        )

        gr.Markdown("## DB Readiness Monitoring")
        gr.Markdown("Health checks with failure tracking and auto-restart")
        readiness_table = gr.Dataframe(
            headers=["Space", "Interval", "Last Check", "Status", "Failures", "Last Restart", "Restarts"],
            label="Monitored Workers",
            interactive=False,
        )

        refresh_btn = gr.Button("Refresh")

        def refresh():
            return get_status()

        refresh_btn.click(fn=refresh, outputs=[status_text, uptime_text, keepalive_table, readiness_table])
        demo.load(fn=refresh, outputs=[status_text, uptime_text, keepalive_table, readiness_table])

    return demo
```
`main.py` (CHANGED) — updated version of the changed hunks:

```python
logger = logging.getLogger(__name__)

WORKERS = [
    # Keep-alive pings (no space_id)
    {"url": "https://oharu121-discord-gemini-bot.hf.space/healthz", "interval_minutes": 5},
    {"url": "https://oharu121-n8n-workflow.hf.space/health", "interval_minutes": 5},
    {"url": "https://oharu121-keiba-oracle.hf.space/healthz", "interval_minutes": 10},
    {"url": "https://oharu121-rag-demo.hf.space/healthz", "interval_minutes": 10},
    {"url": "https://oharu121-rich-chat-demo.hf.space/healthz", "interval_minutes": 10},
    # DB readiness monitoring (with space_id for auto-restart)
    {
        "url": "https://oharu121-n8n-workflow.hf.space/healthz/readiness",
        "interval_minutes": 60,
        "space_id": "oharu121/n8n-workflow",
    },
]

HF_RESTART_TOKEN = os.getenv("HF_RESTART_TOKEN")

# ... (unchanged) ...

# In ping_worker_job(), after a restart is triggered:
            restarted = await restart_space(str(space_id))
            if restarted:
                worker_status[url]["consecutive_failures"] = 0
                worker_status[url]["last_restart_time"] = datetime.now().isoformat()
                worker_status[url]["total_restarts"] = int(worker_status[url].get("total_restarts", 0)) + 1
```
`pyproject.toml` (CHANGED)

```diff
@@ -1,6 +1,6 @@
 [project]
 name = "hf-master-pinger"
-version = "0.2.0"
+version = "0.3.0"
 description = "Centralized pinger to keep HF Spaces alive"
 readme = "README.md"
 requires-python = ">=3.12"
```