oharu121 committed
Commit 3f64fde · 1 Parent(s): 65f215b

feat(dashboard): separate keep-alive and DB readiness monitoring (#3)

.claude/settings.local.json CHANGED
@@ -1,7 +1,8 @@
 {
   "permissions": {
     "allow": [
-      "Bash(gh api:*)"
+      "Bash(gh api:*)",
+      "Bash(uv run pyright:*)"
     ]
   }
 }
.plans/separate-keepalive-and-readiness-monitoring.md ADDED
@@ -0,0 +1,68 @@
+# Separate Keep-Alive and DB Readiness Monitoring
+
+**Status:** Completed
+**Date:** 2026-01-27
+
+## Goal
+
+Refactor the monitoring system to clearly separate two concerns:
+1. **Keep-Alive Pings** - Simple health checks to prevent Spaces from sleeping
+2. **DB Readiness Monitoring** - Health checks with failure tracking and auto-restart
+
+## Problem
+
+The original implementation conflated keep-alive pings with DB readiness monitoring. This made it unclear what each worker entry was doing and prevented independent tuning of intervals.
+
+## Solution
+
+### Two-Entry Pattern
+
+For Spaces that need both keep-alive and DB monitoring (like n8n), use two separate config entries:
+
+```python
+# Keep-alive (prevents sleep, frequent)
+{"url": "https://oharu121-n8n-workflow.hf.space/health", "interval_minutes": 5},
+
+# DB readiness monitoring (auto-restart, less frequent)
+{
+    "url": "https://oharu121-n8n-workflow.hf.space/healthz/readiness",
+    "interval_minutes": 60,
+    "space_id": "oharu121/n8n-workflow",
+},
+```
+
+### Two-Table Dashboard
+
+The Gradio UI now displays two separate tables:
+
+**Table 1: Keep-Alive Pings**
+| Worker | Interval | Last Ping | Status |
+|--------|----------|-----------|--------|
+
+Simple view showing if Spaces are running.
+
+**Table 2: DB Readiness Monitoring**
+| Space | Interval | Last Check | Status | Failures | Last Restart | Restarts |
+|-------|----------|------------|--------|----------|--------------|----------|
+
+Detailed view with failure tracking and restart history.
+
+### Enhanced Tracking
+
+Added new fields for monitored workers:
+- `last_restart_time` - When the last restart was triggered
+- `total_restarts` - Total number of restarts since startup
+
+## Files Modified
+
+- [main.py](../main.py) - Updated WORKERS config, added restart tracking
+- [app.py](../app.py) - Two-table dashboard, `extract_worker_name()` helper
+- [README.md](../README.md) - Updated configuration documentation
+
+## Benefits
+
+1. **Clear separation of concerns** - Each config entry has one purpose
+2. **Independent tuning** - Adjust keep-alive and DB check intervals separately
+3. **Self-documenting config** - Reading the config shows what each entry does
+4. **Better failure visibility** - DB failures get their own dedicated table
+5. **Restart history** - Track when and how often restarts occur
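Note on the two-entry pattern: the plan above implies that the presence of `space_id` is what distinguishes a DB readiness entry from a plain keep-alive entry. A minimal sketch of that split, assuming plain dict entries as in the plan; `split_workers` is a hypothetical helper for illustration and is not part of this commit:

```python
# Two entries for the same Space, as described in the plan above.
WORKERS = [
    # Keep-alive (prevents sleep, frequent)
    {"url": "https://oharu121-n8n-workflow.hf.space/health", "interval_minutes": 5},
    # DB readiness monitoring (auto-restart, less frequent)
    {
        "url": "https://oharu121-n8n-workflow.hf.space/healthz/readiness",
        "interval_minutes": 60,
        "space_id": "oharu121/n8n-workflow",
    },
]


def split_workers(workers):
    """Hypothetical helper: partition entries by whether `space_id` is set."""
    keepalive = [w for w in workers if not w.get("space_id")]
    readiness = [w for w in workers if w.get("space_id")]
    return keepalive, readiness


keepalive, readiness = split_workers(WORKERS)
print(len(keepalive), "keep-alive entry,", len(readiness), "readiness entry")
```

Because the two entries use different URLs, any per-URL status bookkeeping stays separate for the keep-alive ping and the readiness check even though both target the same Space.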
CHANGELOG.md CHANGED
@@ -5,7 +5,19 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unreleased]
+## [0.3.0] - 2026-01-27
+
+### Added
+
+- Two-table dashboard separating Keep-Alive Pings from DB Readiness Monitoring
+- `extract_worker_name()` helper to display friendly Space names from URLs
+- Restart history tracking with `last_restart_time` and `total_restarts` fields
+- n8n keep-alive entry (`/health` at 5 min) separate from DB readiness (`/healthz/readiness` at 60 min)
+
+### Changed
+
+- Dashboard now shows two distinct tables for different monitoring purposes
+- Worker display names extracted from URLs (e.g., `n8n-workflow` instead of full URL)
 
 ## [0.2.0] - 2026-01-27
 
README.md CHANGED
@@ -49,36 +49,50 @@ A centralized pinger service that keeps multiple Hugging Face Spaces alive by pi
 
 ## Configuration
 
-Edit `main.py` to configure workers:
+Edit `main.py` to configure workers. There are two types of workers:
+
+### Keep-Alive Pings
+
+Simple health checks to prevent Spaces from sleeping:
 
 ```python
 WORKERS = [
-    {"url": "https://your-space.hf.space/health", "interval_minutes": 2},
-    {"url": "https://another-space.hf.space/health", "interval_minutes": 5},
+    {"url": "https://your-space.hf.space/health", "interval_minutes": 5},
+    {"url": "https://another-space.hf.space/healthz", "interval_minutes": 10},
 ]
 ```
 
-### Auto-Restart Configuration
+### DB Readiness Monitoring with Auto-Restart
 
-To enable automatic restart of a Space after consecutive health check failures, add `space_id` to the worker config:
+For Spaces that need database connectivity monitoring and automatic restart on failure, add `space_id`:
 
 ```python
 WORKERS = [
+    # Keep-alive (prevents sleep)
+    {"url": "https://your-space.hf.space/health", "interval_minutes": 5},
+    # DB readiness monitoring (triggers restart on failure)
     {
         "url": "https://your-space.hf.space/healthz/readiness",
         "interval_minutes": 60,
-        "space_id": "username/space-name",  # Enables auto-restart
+        "space_id": "username/space-name",
     },
 ]
 ```
 
+This pattern allows you to:
+- Keep a Space awake with frequent pings (5 min)
+- Monitor DB health less frequently (60 min)
+- Auto-restart after 3 consecutive DB check failures
+
 | Setting | Default | Description |
 |---------|---------|-------------|
-| `space_id` | None | HF Space ID (e.g., `username/space-name`). Required for auto-restart. |
-| Failure threshold | 3 | Number of consecutive failures before auto-restart |
+| `url` | Required | Health check endpoint URL |
+| `interval_minutes` | Required | Ping frequency in minutes |
+| `space_id` | None | HF Space ID for auto-restart (e.g., `username/space-name`) |
+| Failure threshold | 3 | Consecutive failures before auto-restart |
 
-**Requirements:**
-- `HF_TOKEN` environment variable with write permission
+**Requirements for auto-restart:**
+- `HF_RESTART_TOKEN` environment variable with write permission
 - Generate token at https://huggingface.co/settings/tokens
 
 ## Endpoints
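Note on the settings table in the README hunk: it marks `url` and `interval_minutes` as required and `space_id` as optional. A small sketch of how those rules could be checked before the scheduler starts; `validate_worker` is a hypothetical helper, not something this commit adds, and the exact rules (positive integer interval, `username/space-name` shape) are assumptions read off the table:

```python
def validate_worker(worker: dict) -> list[str]:
    """Hypothetical check of one WORKERS entry against the README table."""
    problems = []

    url = worker.get("url")
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        problems.append("url must be an http(s) URL")

    interval = worker.get("interval_minutes")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int) or interval <= 0:
        problems.append("interval_minutes must be a positive integer")

    space_id = worker.get("space_id")
    if space_id is not None:
        if not isinstance(space_id, str) or space_id.count("/") != 1:
            problems.append("space_id must look like 'username/space-name'")

    return problems


good = {"url": "https://your-space.hf.space/health", "interval_minutes": 5}
bad = {"url": "your-space", "interval_minutes": 0, "space_id": "space-name"}
print(validate_worker(good))
print(validate_worker(bad))
```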
app.py CHANGED
@@ -1,3 +1,4 @@
+import re
 from datetime import datetime
 
 import gradio as gr
@@ -6,7 +7,14 @@ import uvicorn
 from main import WORKERS, app, start_time, worker_status
 
 
-def get_status() -> tuple[str, str, list[list[str]]]:
+def extract_worker_name(url: str) -> str:
+    """Extract Space name from HF URL."""
+    # https://oharu121-n8n-workflow.hf.space/health -> n8n-workflow
+    match = re.search(r"https://[^-]+-(.+?)\.hf\.space", url)
+    return match.group(1) if match else url
+
+
+def get_status() -> tuple[str, str, list[list[str]], list[list[str]]]:
     """Get current status for Gradio UI."""
     # Overall status
     status = "🟢 Online" if start_time else "🔴 Offline"
@@ -20,44 +28,64 @@ def get_status() -> tuple[str, str, list[list[str]]]:
     else:
         uptime = "N/A"
 
-    # Worker table data
-    table_data = []
+    # Split workers into two tables
+    keepalive_data = []
+    readiness_data = []
+
     for worker in WORKERS:
-        url = worker["url"]
+        url = str(worker["url"])
+        name = extract_worker_name(url)
         interval = f"{worker['interval_minutes']} min"
         ws = worker_status.get(url, {})
-        last_ping = ws.get("last_ping", "Not yet")
+        last_ping = str(ws.get("last_ping", "Not yet"))
         ping_status = "✅" if ws.get("status") == "ok" else ("❌" if ws.get("status") == "failed" else "⏳")
-        consecutive_failures = ws.get("consecutive_failures", 0)
-        auto_restart = "🔄" if worker.get("space_id") else ""
-        table_data.append([url, interval, str(last_ping), ping_status, str(consecutive_failures), auto_restart])
 
-    return status, uptime, table_data
+        if worker.get("space_id"):
+            # DB Readiness table
+            consecutive_failures = str(ws.get("consecutive_failures", 0))
+            last_restart = str(ws.get("last_restart_time", "Never"))
+            total_restarts = str(ws.get("total_restarts", 0))
+            readiness_data.append([name, interval, last_ping, ping_status, consecutive_failures, last_restart, total_restarts])
+        else:
+            # Keep-alive table
+            keepalive_data.append([name, interval, last_ping, ping_status])
+
+    return status, uptime, keepalive_data, readiness_data
 
 
 def create_ui() -> gr.Blocks:
     """Create Gradio UI."""
     with gr.Blocks(title="HF Master Pinger") as demo:
-        gr.Markdown("# 🏓 HF Master Pinger")
+        gr.Markdown("# HF Master Pinger")
         gr.Markdown("Centralized pinger to keep HF Spaces alive")
 
         with gr.Row():
             status_text = gr.Textbox(label="Status", interactive=False)
             uptime_text = gr.Textbox(label="Uptime", interactive=False)
 
-        worker_table = gr.Dataframe(
-            headers=["URL", "Interval", "Last Ping", "Status", "Failures", "Auto-Restart"],
-            label="Workers",
+        gr.Markdown("## Keep-Alive Pings")
+        gr.Markdown("Simple health checks to prevent Spaces from sleeping")
+        keepalive_table = gr.Dataframe(
+            headers=["Worker", "Interval", "Last Ping", "Status"],
+            label="Keep-Alive Workers",
+            interactive=False,
+        )
+
+        gr.Markdown("## DB Readiness Monitoring")
+        gr.Markdown("Health checks with failure tracking and auto-restart")
+        readiness_table = gr.Dataframe(
+            headers=["Space", "Interval", "Last Check", "Status", "Failures", "Last Restart", "Restarts"],
+            label="Monitored Workers",
             interactive=False,
         )
 
-        refresh_btn = gr.Button("🔄 Refresh")
+        refresh_btn = gr.Button("Refresh")
 
         def refresh():
             return get_status()
 
-        refresh_btn.click(fn=refresh, outputs=[status_text, uptime_text, worker_table])
-        demo.load(fn=refresh, outputs=[status_text, uptime_text, worker_table])
+        refresh_btn.click(fn=refresh, outputs=[status_text, uptime_text, keepalive_table, readiness_table])
+        demo.load(fn=refresh, outputs=[status_text, uptime_text, keepalive_table, readiness_table])
 
     return demo
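Note on `extract_worker_name()`: the pattern `[^-]+` stops at the first hyphen in the host name, so the part before that hyphen is treated as the owner. The sketch below copies the regex from this commit and runs it on three inputs; the `localhost` and `my-org` URLs are made-up examples used only to show the fallback and an edge case:

```python
import re


def extract_worker_name(url: str) -> str:
    """Same regex as the helper added to app.py in this commit."""
    match = re.search(r"https://[^-]+-(.+?)\.hf\.space", url)
    return match.group(1) if match else url


examples = [
    "https://oharu121-n8n-workflow.hf.space/health",  # URL from WORKERS
    "http://localhost:7860/health",                   # hypothetical non-Space URL
    "https://my-org-demo.hf.space/health",            # hypothetical hyphenated owner
]

for url in examples:
    print(f"{url} -> {extract_worker_name(url)}")
```

For the URLs configured in this commit the owner `oharu121` contains no hyphen, so the displayed names are the Space names. With a hyphenated owner such as the hypothetical `my-org`, the helper would return `org-demo` rather than `demo`; URLs that do not match fall back to the full URL.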
main.py CHANGED
@@ -13,15 +13,18 @@ logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
 
 WORKERS = [
+    # Keep-alive pings (no space_id)
     {"url": "https://oharu121-discord-gemini-bot.hf.space/healthz", "interval_minutes": 5},
+    {"url": "https://oharu121-n8n-workflow.hf.space/health", "interval_minutes": 5},
+    {"url": "https://oharu121-keiba-oracle.hf.space/healthz", "interval_minutes": 10},
+    {"url": "https://oharu121-rag-demo.hf.space/healthz", "interval_minutes": 10},
+    {"url": "https://oharu121-rich-chat-demo.hf.space/healthz", "interval_minutes": 10},
+    # DB readiness monitoring (with space_id for auto-restart)
     {
         "url": "https://oharu121-n8n-workflow.hf.space/healthz/readiness",
         "interval_minutes": 60,
         "space_id": "oharu121/n8n-workflow",
     },
-    {"url": "https://oharu121-keiba-oracle.hf.space/healthz", "interval_minutes": 10},
-    {"url": "https://oharu121-rag-demo.hf.space/healthz", "interval_minutes": 10},
-    {"url": "https://oharu121-rich-chat-demo.hf.space/healthz", "interval_minutes": 10},
 ]
 
 HF_RESTART_TOKEN = os.getenv("HF_RESTART_TOKEN")
@@ -105,6 +108,8 @@ async def ping_worker_job(worker: dict[str, str | int]):
         restarted = await restart_space(str(space_id))
         if restarted:
             worker_status[url]["consecutive_failures"] = 0
+            worker_status[url]["last_restart_time"] = datetime.now().isoformat()
+            worker_status[url]["total_restarts"] = int(worker_status[url].get("total_restarts", 0)) + 1
 
 
 async def self_ping():
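Note on the restart bookkeeping: the hunk above shows only the lines that run after a successful restart, not the surrounding failure handling. The sketch below isolates that bookkeeping on a plain dict so it can be exercised without the scheduler. `record_failure`, `record_restart` and `FAILURE_THRESHOLD` are hypothetical names introduced here; the threshold of 3 comes from the README table, and the failure-counting logic is an assumption about code this diff does not show.

```python
from datetime import datetime
from typing import Optional

FAILURE_THRESHOLD = 3  # documented default in the README table


def record_failure(status: dict) -> bool:
    """Count one failed check; return True once a restart is due (assumed logic)."""
    status["status"] = "failed"
    status["consecutive_failures"] = int(status.get("consecutive_failures", 0)) + 1
    return status["consecutive_failures"] >= FAILURE_THRESHOLD


def record_restart(status: dict, now: Optional[datetime] = None) -> None:
    """Mirror the three updates made after a successful restart in this commit."""
    status["consecutive_failures"] = 0
    status["last_restart_time"] = (now or datetime.now()).isoformat()
    status["total_restarts"] = int(status.get("total_restarts", 0)) + 1


status: dict = {}
due = [record_failure(status) for _ in range(3)]
if due[-1]:
    record_restart(status, now=datetime(2026, 1, 27, 12, 0, 0))
print(status)
```

`datetime.now()` returns a naive local timestamp, so the "Last Restart" column shows the server's local time without an offset. The counters live in memory, which matches the plan's description of `total_restarts` as counted since startup.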
pyproject.toml CHANGED
@@ -1,6 +1,6 @@
 [project]
 name = "hf-master-pinger"
-version = "0.2.0"
+version = "0.3.0"
 description = "Centralized pinger to keep HF Spaces alive"
 readme = "README.md"
 requires-python = ">=3.12"