feat(dashboard): separate keep-alive and DB readiness monitoring (#3)

- .claude/settings.local.json +2 -1
- .plans/separate-keepalive-and-readiness-monitoring.md +68 -0
- CHANGELOG.md +13 -1
- README.md +24 -10
- app.py +44 -16
- main.py +8 -3
- pyproject.toml +1 -1
`.claude/settings.local.json` (CHANGED)

```diff
@@ -1,7 +1,8 @@
 {
   "permissions": {
     "allow": [
-      "Bash(gh api:*)"
+      "Bash(gh api:*)",
+      "Bash(uv run pyright:*)"
     ]
   }
 }
```
`.plans/separate-keepalive-and-readiness-monitoring.md` (ADDED)

# Separate Keep-Alive and DB Readiness Monitoring

**Status:** Completed
**Date:** 2026-01-27

## Goal

Refactor the monitoring system to clearly separate two concerns:

1. **Keep-Alive Pings** - Simple health checks to prevent Spaces from sleeping
2. **DB Readiness Monitoring** - Health checks with failure tracking and auto-restart

## Problem

The original implementation conflated keep-alive pings with DB readiness monitoring. This made it unclear what each worker entry was doing and prevented independent tuning of intervals.

## Solution

### Two-Entry Pattern

For Spaces that need both keep-alive and DB monitoring (like n8n), use two separate config entries:

```python
# Keep-alive (prevents sleep, frequent)
{"url": "https://oharu121-n8n-workflow.hf.space/health", "interval_minutes": 5},

# DB readiness monitoring (auto-restart, less frequent)
{
    "url": "https://oharu121-n8n-workflow.hf.space/healthz/readiness",
    "interval_minutes": 60,
    "space_id": "oharu121/n8n-workflow",
},
```

### Two-Table Dashboard

The Gradio UI now displays two separate tables:

**Table 1: Keep-Alive Pings**

| Worker | Interval | Last Ping | Status |
|--------|----------|-----------|--------|

Simple view showing whether Spaces are running.

**Table 2: DB Readiness Monitoring**

| Space | Interval | Last Check | Status | Failures | Last Restart | Restarts |
|-------|----------|------------|--------|----------|--------------|----------|

Detailed view with failure tracking and restart history.

### Enhanced Tracking

Added new fields for monitored workers:

- `last_restart_time` - When the last restart was triggered
- `total_restarts` - Total number of restarts since startup

## Files Modified

- [main.py](../main.py) - Updated WORKERS config, added restart tracking
- [app.py](../app.py) - Two-table dashboard, `extract_worker_name()` helper
- [README.md](../README.md) - Updated configuration documentation

## Benefits

1. **Clear separation of concerns** - Each config entry has one purpose
2. **Independent tuning** - Adjust keep-alive and DB check intervals separately
3. **Self-documenting config** - Reading the config shows what each entry does
4. **Better failure visibility** - DB failures get their own dedicated table
5. **Restart history** - Track when and how often restarts occur
`CHANGELOG.md` (CHANGED) — added entry:

## [0.3.0] - 2026-01-27

### Added

- Two-table dashboard separating Keep-Alive Pings from DB Readiness Monitoring
- `extract_worker_name()` helper to display friendly Space names from URLs
- Restart history tracking with `last_restart_time` and `total_restarts` fields
- n8n keep-alive entry (`/health` at 5 min) separate from DB readiness (`/healthz/readiness` at 60 min)

### Changed

- Dashboard now shows two distinct tables for different monitoring purposes
- Worker display names extracted from URLs (e.g., `n8n-workflow` instead of the full URL)
`README.md` (CHANGED) — updated Configuration section:

## Configuration

Edit `main.py` to configure workers. There are two types of workers:

### Keep-Alive Pings

Simple health checks to prevent Spaces from sleeping:

```python
WORKERS = [
    {"url": "https://your-space.hf.space/health", "interval_minutes": 5},
    {"url": "https://another-space.hf.space/healthz", "interval_minutes": 10},
]
```

### DB Readiness Monitoring with Auto-Restart

For Spaces that need database connectivity monitoring and automatic restart on failure, add `space_id`:

```python
WORKERS = [
    # Keep-alive (prevents sleep)
    {"url": "https://your-space.hf.space/health", "interval_minutes": 5},
    # DB readiness monitoring (triggers restart on failure)
    {
        "url": "https://your-space.hf.space/healthz/readiness",
        "interval_minutes": 60,
        "space_id": "username/space-name",
    },
]
```

This pattern allows you to:

- Keep a Space awake with frequent pings (5 min)
- Monitor DB health less frequently (60 min)
- Auto-restart after 3 consecutive DB check failures

| Setting | Default | Description |
|---------|---------|-------------|
| `url` | Required | Health check endpoint URL |
| `interval_minutes` | Required | Ping frequency in minutes |
| `space_id` | None | HF Space ID for auto-restart (e.g., `username/space-name`) |
| Failure threshold | 3 | Consecutive failures before auto-restart |

**Requirements for auto-restart:**

- `HF_RESTART_TOKEN` environment variable with write permission
- Generate token at https://huggingface.co/settings/tokens

## Endpoints
`app.py` (CHANGED) — updated version of the changed hunks:

```python
import re
from datetime import datetime

import gradio as gr
import uvicorn

from main import WORKERS, app, start_time, worker_status


def extract_worker_name(url: str) -> str:
    """Extract Space name from HF URL."""
    # https://oharu121-n8n-workflow.hf.space/health -> n8n-workflow
    match = re.search(r"https://[^-]+-(.+?)\.hf\.space", url)
    return match.group(1) if match else url


def get_status() -> tuple[str, str, list[list[str]], list[list[str]]]:
    """Get current status for Gradio UI."""
    # Overall status
    status = "🟢 Online" if start_time else "🔴 Offline"
    # ... uptime calculation unchanged; sets uptime, or "N/A" when unavailable ...

    # Split workers into two tables
    keepalive_data = []
    readiness_data = []

    for worker in WORKERS:
        url = str(worker["url"])
        name = extract_worker_name(url)
        interval = f"{worker['interval_minutes']} min"
        ws = worker_status.get(url, {})
        last_ping = str(ws.get("last_ping", "Not yet"))
        ping_status = "✅" if ws.get("status") == "ok" else ("❌" if ws.get("status") == "failed" else "⏳")

        if worker.get("space_id"):
            # DB readiness table
            consecutive_failures = str(ws.get("consecutive_failures", 0))
            last_restart = str(ws.get("last_restart_time", "Never"))
            total_restarts = str(ws.get("total_restarts", 0))
            readiness_data.append([name, interval, last_ping, ping_status, consecutive_failures, last_restart, total_restarts])
        else:
            # Keep-alive table
            keepalive_data.append([name, interval, last_ping, ping_status])

    return status, uptime, keepalive_data, readiness_data


def create_ui() -> gr.Blocks:
    """Create Gradio UI."""
    with gr.Blocks(title="HF Master Pinger") as demo:
        gr.Markdown("# HF Master Pinger")
        gr.Markdown("Centralized pinger to keep HF Spaces alive")

        with gr.Row():
            status_text = gr.Textbox(label="Status", interactive=False)
            uptime_text = gr.Textbox(label="Uptime", interactive=False)

        gr.Markdown("## Keep-Alive Pings")
        gr.Markdown("Simple health checks to prevent Spaces from sleeping")
        keepalive_table = gr.Dataframe(
            headers=["Worker", "Interval", "Last Ping", "Status"],
            label="Keep-Alive Workers",
            interactive=False,
        )

        gr.Markdown("## DB Readiness Monitoring")
        gr.Markdown("Health checks with failure tracking and auto-restart")
        readiness_table = gr.Dataframe(
            headers=["Space", "Interval", "Last Check", "Status", "Failures", "Last Restart", "Restarts"],
            label="Monitored Workers",
            interactive=False,
        )

        refresh_btn = gr.Button("Refresh")

        def refresh():
            return get_status()

        refresh_btn.click(fn=refresh, outputs=[status_text, uptime_text, keepalive_table, readiness_table])
        demo.load(fn=refresh, outputs=[status_text, uptime_text, keepalive_table, readiness_table])

    return demo
```
`main.py` (CHANGED) — updated version of the changed hunks:

```python
logger = logging.getLogger(__name__)

WORKERS = [
    # Keep-alive pings (no space_id)
    {"url": "https://oharu121-discord-gemini-bot.hf.space/healthz", "interval_minutes": 5},
    {"url": "https://oharu121-n8n-workflow.hf.space/health", "interval_minutes": 5},
    {"url": "https://oharu121-keiba-oracle.hf.space/healthz", "interval_minutes": 10},
    {"url": "https://oharu121-rag-demo.hf.space/healthz", "interval_minutes": 10},
    {"url": "https://oharu121-rich-chat-demo.hf.space/healthz", "interval_minutes": 10},
    # DB readiness monitoring (with space_id for auto-restart)
    {
        "url": "https://oharu121-n8n-workflow.hf.space/healthz/readiness",
        "interval_minutes": 60,
        "space_id": "oharu121/n8n-workflow",
    },
]

HF_RESTART_TOKEN = os.getenv("HF_RESTART_TOKEN")

# ... (unchanged) ...

# In ping_worker_job(), after a restart is triggered:
            restarted = await restart_space(str(space_id))
            if restarted:
                worker_status[url]["consecutive_failures"] = 0
                worker_status[url]["last_restart_time"] = datetime.now().isoformat()
                worker_status[url]["total_restarts"] = int(worker_status[url].get("total_restarts", 0)) + 1
```
`pyproject.toml` (CHANGED)

```diff
@@ -1,6 +1,6 @@
 [project]
 name = "hf-master-pinger"
-version = "0.2.0"
+version = "0.3.0"
 description = "Centralized pinger to keep HF Spaces alive"
 readme = "README.md"
 requires-python = ">=3.12"
```