zhimin-z committed
Commit 21c6416 · 0 Parent(s)
Files changed (9)
  1. .gitattributes +35 -0
  2. .github/workflows/hf_sync.yml +35 -0
  3. .gitignore +6 -0
  4. Dockerfile +22 -0
  5. README.md +80 -0
  6. app.py +651 -0
  7. docker-compose.yml +20 -0
  8. msr.py +808 -0
  9. requirements.txt +10 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.github/workflows/hf_sync.yml ADDED
@@ -0,0 +1,35 @@
+ name: Sync to Hugging Face Space
+
+ on:
+   push:
+     branches:
+       - main
+
+ jobs:
+   sync:
+     runs-on: ubuntu-latest
+
+     steps:
+       - name: Checkout GitHub Repository
+         uses: actions/checkout@v3
+         with:
+           fetch-depth: 0  # Fetch the entire history to avoid shallow clone issues
+
+       - name: Install Git LFS
+         run: |
+           curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
+           sudo apt-get install git-lfs
+           git lfs install
+
+       - name: Configure Git
+         run: |
+           git config --global user.name "GitHub Actions Bot"
+           git config --global user.email "actions@github.com"
+
+       - name: Push to Hugging Face
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: |
+           git remote add huggingface https://user:${HF_TOKEN}@huggingface.co/spaces/SWE-Arena/SWE-Release
+           git fetch huggingface
+           git push huggingface main --force
.gitignore ADDED
@@ -0,0 +1,6 @@
+ *.claude
+ *.env
+ *.venv
+ *.ipynb
+ *.pyc
+ *.duckdb
Dockerfile ADDED
@@ -0,0 +1,22 @@
+ FROM python:3.12-slim
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     gcc \
+     g++ \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements file
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+
+ # Run the mining script with scheduler
+ CMD ["python", "msr.py"]
README.md ADDED
@@ -0,0 +1,80 @@
+ ---
+ title: SWE-Release
+ emoji: 📢
+ colorFrom: gray
+ colorTo: yellow
+ sdk: gradio
+ sdk_version: 5.49.1
+ app_file: app.py
+ hf_oauth: true
+ pinned: false
+ short_description: Track GitHub release statistics for SWE assistants
+ ---
+
+ # SWE Agent Release Leaderboard
+
+ SWE-Release ranks software engineering assistants by their real-world GitHub release activity.
+
+ No benchmarks. No sandboxes. Just real releases tracked from public repositories.
+
+ ## Why This Exists
+
+ Most AI coding agent benchmarks use synthetic tasks and simulated environments. This leaderboard measures real-world activity: how many releases is the agent publishing? How active is it across different projects? Is the agent's usage growing?
+
+ If an agent is consistently publishing releases across different projects, that tells you something no benchmark can.
+
+ ## What We Track
+
+ Key metrics from the last 180 days:
+
+ **Leaderboard Table**
+ - **Total Releases**: Total number of releases published by the agent
+ - **Agent Name**: Display name of the agent
+ - **Website**: Link to the agent's homepage or documentation
+
+ **Monthly Trends**
+ - Release volume over time (bar charts)
+ - Activity patterns across months
+
+ We focus on the most recent 180 days to highlight current capabilities and active assistants.
+
+ ## How It Works
+
+ **Data Collection**
+ We mine GitHub activity from [GHArchive](https://www.gharchive.org/), tracking:
+ - Releases published by the agent (`ReleaseEvent` data)
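+
+ For reference, each GHArchive hourly file is gzip-compressed, newline-delimited JSON, one event per line. Here is a minimal sketch of pulling release events out of one file (the filename is illustrative; the field names follow the GitHub events payload that the mining pipeline queries):
+
+ ```python
+ import gzip
+ import json
+
+ # GHArchive files are named YYYY-MM-DD-H.json.gz (one file per hour).
+ with gzip.open("2025-01-01-0.json.gz", "rt", encoding="utf-8") as f:
+     for line in f:
+         event = json.loads(line)
+         if event.get("type") == "ReleaseEvent":
+             # Who released what: the acting account and the release tag.
+             print(event["actor"]["login"],
+                   event["payload"]["release"]["tag_name"])
+ ```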
+
+ **Regular Updates**
+ The leaderboard refreshes weekly (Thursday at 00:00 UTC).
+
+ **Community Submissions**
+ Anyone can submit an agent. We store agent metadata in `SWE-Arena/bot_metadata` and results in `SWE-Arena/leaderboard_metadata`. All submissions are validated via the GitHub API.
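+
+ For illustration, the saved results file is a single JSON document shaped roughly as follows (values are made up; the exact fields come from the weekly mining job):
+
+ ```json
+ {
+   "metadata": {
+     "last_updated": "2025-01-02T00:00:00+00:00",
+     "leaderboard_time_frame_days": 180
+   },
+   "leaderboard": {
+     "my-agent[bot]": {
+       "name": "My Agent",
+       "website": "https://example.com",
+       "github_identifier": "my-agent[bot]",
+       "total_releases": 42
+     }
+   },
+   "monthly_metrics": {
+     "agents": ["My Agent"],
+     "months": ["2024-12", "2025-01"],
+     "data": { "My Agent": { "total_releases": [30, 12] } }
+   }
+ }
+ ```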
+
+ ## Using the Leaderboard
+
+ ### Browsing
+ The Leaderboard tab features:
+ - Searchable table (by agent name or website)
+ - Monthly charts (release volumes and activity trends)
+ - Sortable columns (by releases published)
+
+ ### Adding Your Agent
+ The Submit Agent tab requires:
+ - **GitHub identifier**: Agent's GitHub username (e.g., `my-agent[bot]`)
+ - **Agent name**: Display name for the leaderboard
+ - **Organization**: Your organization or team name
+ - **Website**: Link to homepage or documentation
+
+ Submissions are validated against GitHub's API, and release data is populated automatically during the next weekly update. Each accepted submission is stored as a small metadata file, as in the example below.
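+
+ A sketch of the stored metadata file, named `{github_identifier}.json` (fields mirror the submission form; values are placeholders):
+
+ ```json
+ {
+   "name": "My Agent",
+   "organization": "Example Org",
+   "github_identifier": "my-agent[bot]",
+   "website": "https://example.com",
+   "status": "active"
+ }
+ ```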
+
+ ## What's Next
+
+ Planned improvements:
+ - Repository-based analysis (which repos agents are releasing to)
+ - Extended metrics (release types, pre-releases vs stable)
+ - Organization and team breakdown
+ - Release patterns (frequency, versioning strategies)
+
+ ## Questions or Issues?
+
+ [Open an issue](https://github.com/SWE-Arena/SWE-Release/issues) for bugs, feature requests, or data concerns.
app.py ADDED
@@ -0,0 +1,651 @@
+ import gradio as gr
+ from gradio_leaderboard import Leaderboard
+ import json
+ import os
+ import time
+ import requests
+ from huggingface_hub import HfApi, hf_hub_download
+ from huggingface_hub.errors import HfHubHTTPError
+ import backoff
+ from dotenv import load_dotenv
+ import pandas as pd
+ import random
+ import plotly.graph_objects as go
+ from apscheduler.schedulers.background import BackgroundScheduler
+ from apscheduler.triggers.cron import CronTrigger
+
+ # Load environment variables
+ load_dotenv()
+
+ # =============================================================================
+ # CONFIGURATION
+ # =============================================================================
+
+ AGENTS_REPO = "SWE-Arena/bot_metadata"  # HuggingFace dataset for agent metadata
+ LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
+ LEADERBOARD_REPO = "SWE-Arena/leaderboard_metadata"  # HuggingFace dataset for leaderboard data
+ MAX_RETRIES = 5
+
+ LEADERBOARD_COLUMNS = [
+     ("Agent Name", "string"),
+     ("Website", "string"),
+     ("Total Releases", "number"),
+ ]
+
+ # =============================================================================
+ # HUGGINGFACE API WRAPPERS WITH BACKOFF
+ # =============================================================================
+
+ def is_rate_limit_error(e):
+     """Check if exception is a HuggingFace rate limit error (429)."""
+     if isinstance(e, HfHubHTTPError):
+         return e.response.status_code == 429
+     return False
+
+
+ @backoff.on_exception(
+     backoff.expo,
+     HfHubHTTPError,
+     max_tries=MAX_RETRIES,
+     base=300,
+     max_value=3600,
+     giveup=lambda e: not is_rate_limit_error(e),
+     on_backoff=lambda details: print(
+         f"Rate limited. Retrying in {details['wait']/60:.1f} minutes ({details['wait']:.0f}s) - attempt {details['tries']}/{MAX_RETRIES}..."
+     )
+ )
+ def list_repo_files_with_backoff(api, **kwargs):
+     """Wrapper for api.list_repo_files() with exponential backoff for rate limits."""
+     return api.list_repo_files(**kwargs)
+
+
+ @backoff.on_exception(
+     backoff.expo,
+     HfHubHTTPError,
+     max_tries=MAX_RETRIES,
+     base=300,
+     max_value=3600,
+     giveup=lambda e: not is_rate_limit_error(e),
+     on_backoff=lambda details: print(
+         f"Rate limited. Retrying in {details['wait']/60:.1f} minutes ({details['wait']:.0f}s) - attempt {details['tries']}/{MAX_RETRIES}..."
+     )
+ )
+ def hf_hub_download_with_backoff(**kwargs):
+     """Wrapper for hf_hub_download() with exponential backoff for rate limits."""
+     return hf_hub_download(**kwargs)
+
+
+ # =============================================================================
+ # GITHUB USERNAME VALIDATION
+ # =============================================================================
+
+ def validate_github_username(identifier):
+     """Verify that a GitHub identifier exists."""
+     try:
+         response = requests.get(f'https://api.github.com/users/{identifier}', timeout=10)
+         if response.status_code == 200:
+             return True, "Username is valid"
+         if response.status_code == 404:
+             return False, "GitHub identifier not found"
+         return False, f"Validation error: HTTP {response.status_code}"
+     except Exception as e:
+         return False, f"Validation error: {str(e)}"
+
+
+ # =============================================================================
+ # HUGGINGFACE DATASET OPERATIONS
+ # =============================================================================
+
+ def load_agents_from_hf():
+     """Load all agent metadata JSON files from HuggingFace dataset."""
+     try:
+         api = HfApi()
+         agents = []
+
+         # List all files in the repository
+         files = list_repo_files_with_backoff(api=api, repo_id=AGENTS_REPO, repo_type="dataset")
+
+         # Filter for JSON files only
+         json_files = [f for f in files if f.endswith('.json')]
+
+         # Download and parse each JSON file
+         for json_file in json_files:
+             try:
+                 file_path = hf_hub_download_with_backoff(
+                     repo_id=AGENTS_REPO,
+                     filename=json_file,
+                     repo_type="dataset"
+                 )
+
+                 with open(file_path, 'r') as f:
+                     agent_data = json.load(f)
+
+                 # Only process agents with status == "active"
+                 if agent_data.get('status') != 'active':
+                     continue
+
+                 # Extract github_identifier from filename (e.g., "agent[bot].json" -> "agent[bot]")
+                 filename_identifier = json_file.replace('.json', '')
+
+                 # Add or override github_identifier to match filename
+                 agent_data['github_identifier'] = filename_identifier
+
+                 agents.append(agent_data)
+
+             except Exception as e:
+                 print(f"Warning: Could not load {json_file}: {str(e)}")
+                 continue
+
+         print(f"Loaded {len(agents)} agents from HuggingFace")
+         return agents
+
+     except Exception as e:
+         print(f"Could not load agents from HuggingFace: {str(e)}")
+         return None
+
+
+ def get_hf_token():
+     """Get HuggingFace token from environment variables."""
+     token = os.getenv('HF_TOKEN')
+     if not token:
+         print("Warning: HF_TOKEN not found in environment variables")
+     return token
+
+
+ def upload_with_retry(api, path_or_fileobj, path_in_repo, repo_id, repo_type, token, max_retries=5):
+     """
+     Upload file to HuggingFace with exponential backoff retry logic.
+
+     Args:
+         api: HfApi instance
+         path_or_fileobj: Local file path to upload
+         path_in_repo: Target path in the repository
+         repo_id: Repository ID
+         repo_type: Type of repository (e.g., "dataset")
+         token: HuggingFace token
+         max_retries: Maximum number of retry attempts
+
+     Returns:
+         True if upload succeeded; raises an exception if all retries failed
+     """
+     delay = 2.0  # Initial delay in seconds
+
+     for attempt in range(max_retries):
+         try:
+             api.upload_file(
+                 path_or_fileobj=path_or_fileobj,
+                 path_in_repo=path_in_repo,
+                 repo_id=repo_id,
+                 repo_type=repo_type,
+                 token=token
+             )
+             if attempt > 0:
+                 print(f"  Upload succeeded on attempt {attempt + 1}/{max_retries}")
+             return True
+
+         except Exception as e:
+             if attempt < max_retries - 1:
+                 wait_time = delay + random.uniform(0, 1.0)
+                 print(f"  Upload failed (attempt {attempt + 1}/{max_retries}): {str(e)}")
+                 print(f"  Retrying in {wait_time:.1f} seconds...")
+                 time.sleep(wait_time)
+                 delay = min(delay * 2, 60.0)  # Exponential backoff, max 60s
+             else:
+                 print(f"  Upload failed after {max_retries} attempts: {str(e)}")
+                 raise
+
+
+ def save_agent_to_hf(data):
+     """Save a new agent to HuggingFace dataset as {identifier}.json in root."""
+     try:
+         api = HfApi()
+         token = get_hf_token()
+
+         if not token:
+             raise Exception("No HuggingFace token found. Please set HF_TOKEN in your Space settings.")
+
+         identifier = data['github_identifier']
+         filename = f"{identifier}.json"
+
+         # Save locally first
+         with open(filename, 'w') as f:
+             json.dump(data, f, indent=2)
+
+         try:
+             # Upload to HuggingFace (root directory)
+             upload_with_retry(
+                 api=api,
+                 path_or_fileobj=filename,
+                 path_in_repo=filename,
+                 repo_id=AGENTS_REPO,
+                 repo_type="dataset",
+                 token=token
+             )
+             print(f"Saved agent to HuggingFace: {filename}")
+             return True
+         finally:
+             # Always clean up local file, even if upload fails
+             if os.path.exists(filename):
+                 os.remove(filename)
+
+     except Exception as e:
+         print(f"Error saving agent: {str(e)}")
+         return False
+
+
+ def load_leaderboard_data_from_hf():
+     """
+     Load leaderboard data and monthly metrics from HuggingFace dataset.
+
+     Returns:
+         dict: Dictionary with 'leaderboard', 'monthly_metrics', and 'metadata' keys.
+         Returns None if the file doesn't exist or an error occurs.
+     """
+     try:
+         token = get_hf_token()
+
+         # Download file
+         file_path = hf_hub_download_with_backoff(
+             repo_id=LEADERBOARD_REPO,
+             filename=LEADERBOARD_FILENAME,
+             repo_type="dataset",
+             token=token
+         )
+
+         # Load JSON data
+         with open(file_path, 'r') as f:
+             data = json.load(f)
+
+         last_updated = data.get('metadata', {}).get('last_updated', 'Unknown')
+         print(f"Loaded leaderboard data from HuggingFace (last updated: {last_updated})")
+
+         return data
+
+     except Exception as e:
+         print(f"Could not load leaderboard data from HuggingFace: {str(e)}")
+         return None
+
+
+ # =============================================================================
+ # UI FUNCTIONS
+ # =============================================================================
+
+ def create_monthly_metrics_plot(top_n=5):
+     """
+     Create a Plotly figure showing monthly total releases as bar charts.
+
+     Args:
+         top_n: Number of top agents to show (default: 5)
+     """
+     # Load from saved dataset
+     saved_data = load_leaderboard_data_from_hf()
+
+     if not saved_data or 'monthly_metrics' not in saved_data:
+         # Return an empty figure with a message
+         fig = go.Figure()
+         fig.add_annotation(
+             text="No data available for visualization",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, showarrow=False,
+             font=dict(size=16)
+         )
+         fig.update_layout(
+             title=None,
+             xaxis_title=None,
+             height=500
+         )
+         return fig
+
+     metrics = saved_data['monthly_metrics']
+     print("Loaded monthly metrics from saved dataset")
+
+     # Apply top_n filter if specified
+     if top_n is not None and top_n > 0 and metrics.get('agents'):
+         # Calculate total releases for each agent
+         agent_totals = []
+         for agent_name in metrics['agents']:
+             agent_data = metrics['data'].get(agent_name, {})
+             total_releases = sum(agent_data.get('total_releases', []))
+             agent_totals.append((agent_name, total_releases))
+
+         # Sort by total releases and take top N
+         agent_totals.sort(key=lambda x: x[1], reverse=True)
+         top_agents = [agent_name for agent_name, _ in agent_totals[:top_n]]
+
+         # Filter metrics to only include top agents
+         metrics = {
+             'agents': top_agents,
+             'months': metrics['months'],
+             'data': {agent: metrics['data'][agent] for agent in top_agents if agent in metrics['data']}
+         }
+
+     if not metrics['agents'] or not metrics['months']:
+         # Return an empty figure with a message
+         fig = go.Figure()
+         fig.add_annotation(
+             text="No data available for visualization",
+             xref="paper", yref="paper",
+             x=0.5, y=0.5, showarrow=False,
+             font=dict(size=16)
+         )
+         fig.update_layout(
+             title=None,
+             xaxis_title=None,
+             height=500
+         )
+         return fig
+
+     # Create figure
+     fig = go.Figure()
+
+     # Generate unique colors for many agents using HSL color space
+     def generate_color(index, total):
+         """Generate distinct colors using HSL color space for better distribution."""
+         hue = (index * 360 / total) % 360
+         saturation = 70 + (index % 3) * 10  # Vary saturation slightly
+         lightness = 45 + (index % 2) * 10  # Vary lightness slightly
+         return f'hsl({hue}, {saturation}%, {lightness}%)'
+
+     agents = metrics['agents']
+     months = metrics['months']
+     data = metrics['data']
+
+     # Generate colors for all agents
+     agent_colors = {agent: generate_color(idx, len(agents)) for idx, agent in enumerate(agents)}
+
+     # Add bar traces for each agent
+     for idx, agent_name in enumerate(agents):
+         color = agent_colors[agent_name]
+         agent_data = data[agent_name]
+
+         # Add bar trace for total releases.
+         # Only show bars for months where the agent has releases.
+         x_bars = []
+         y_bars = []
+         for month, count in zip(months, agent_data['total_releases']):
+             if count > 0:  # Only include months with releases
+                 x_bars.append(month)
+                 y_bars.append(count)
+
+         if x_bars and y_bars:  # Only add trace if there's data
+             fig.add_trace(
+                 go.Bar(
+                     x=x_bars,
+                     y=y_bars,
+                     name=agent_name,
+                     marker=dict(color=color, opacity=0.7),
+                     hovertemplate='<b>Agent: %{fullData.name}</b><br>' +
+                                   'Month: %{x}<br>' +
+                                   'Total Releases: %{y}<br>' +
+                                   '<extra></extra>',
+                     offsetgroup=agent_name  # Group bars by agent for proper spacing
+                 )
+             )
+
+     # Update axes labels
+     fig.update_xaxes(title_text=None)
+     fig.update_yaxes(title_text="<b>Total Releases</b>")
+
+     # Update layout
+     show_legend = (top_n is not None and top_n <= 10)
+     fig.update_layout(
+         title=None,
+         hovermode='closest',  # Show individual agent info on hover
+         barmode='group',
+         height=600,
+         showlegend=show_legend,
+         margin=dict(l=50, r=150 if show_legend else 50, t=50, b=50)  # More right margin when legend is shown
+     )
+
+     return fig
+
+
+ def get_leaderboard_dataframe():
+     """
+     Load leaderboard from saved dataset and convert to pandas DataFrame for display.
+     Returns a formatted DataFrame sorted by total releases.
+     """
+     # Load from saved dataset
+     saved_data = load_leaderboard_data_from_hf()
+
+     if not saved_data or 'leaderboard' not in saved_data:
+         print("No leaderboard data available")
+         # Return empty DataFrame with correct columns if no data
+         column_names = [col[0] for col in LEADERBOARD_COLUMNS]
+         return pd.DataFrame(columns=column_names)
+
+     cache_dict = saved_data['leaderboard']
+     last_updated = saved_data.get('metadata', {}).get('last_updated', 'Unknown')
+     print(f"Loaded leaderboard from saved dataset (last updated: {last_updated})")
+     print(f"Cache dict size: {len(cache_dict)}")
+
+     if not cache_dict:
+         print("WARNING: cache_dict is empty!")
+         # Return empty DataFrame with correct columns if no data
+         column_names = [col[0] for col in LEADERBOARD_COLUMNS]
+         return pd.DataFrame(columns=column_names)
+
+     rows = []
+     filtered_count = 0
+     for identifier, data in cache_dict.items():
+         total_releases = data.get('total_releases', 0)
+         print(f"  Agent '{identifier}': {total_releases} total releases")
+
+         # Filter out agents with zero releases
+         if total_releases == 0:
+             filtered_count += 1
+             continue
+
+         # Only include display-relevant fields
+         rows.append([
+             data.get('name', 'Unknown'),
+             data.get('website', 'N/A'),
+             total_releases,
+         ])
+
+     print(f"Filtered out {filtered_count} agents with 0 total releases")
+     print(f"Leaderboard will show {len(rows)} agents")
+
+     # Create DataFrame
+     column_names = [col[0] for col in LEADERBOARD_COLUMNS]
+     df = pd.DataFrame(rows, columns=column_names)
+
+     # Ensure numeric types
+     numeric_cols = ["Total Releases"]
+     for col in numeric_cols:
+         if col in df.columns:
+             df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
+
+     # Sort by Total Releases descending
+     if "Total Releases" in df.columns and not df.empty:
+         df = df.sort_values(by="Total Releases", ascending=False).reset_index(drop=True)
+
+     print(f"Final DataFrame shape: {df.shape}")
+     print("="*60 + "\n")
+
+     return df
+
+
+ def submit_agent(identifier, agent_name, organization, website):
+     """
+     Submit a new agent to the leaderboard.
+     Validates input and saves the submission.
+     """
+     # Validate required fields
+     if not identifier or not identifier.strip():
+         return "ERROR: GitHub identifier is required", gr.update()
+     if not agent_name or not agent_name.strip():
+         return "ERROR: Agent name is required", gr.update()
+     if not organization or not organization.strip():
+         return "ERROR: Organization name is required", gr.update()
+     if not website or not website.strip():
+         return "ERROR: Website URL is required", gr.update()
+
+     # Clean inputs
+     identifier = identifier.strip()
+     agent_name = agent_name.strip()
+     organization = organization.strip()
+     website = website.strip()
+
+     # Validate GitHub identifier
+     is_valid, message = validate_github_username(identifier)
+     if not is_valid:
+         return f"ERROR: {message}", gr.update()
+
+     # Check for duplicates by loading agents from HuggingFace
+     agents = load_agents_from_hf()
+     if agents:
+         existing_names = {agent['github_identifier'] for agent in agents}
+         if identifier in existing_names:
+             return f"WARNING: Agent with identifier '{identifier}' already exists", gr.update()
+
+     # Create submission
+     submission = {
+         'name': agent_name,
+         'organization': organization,
+         'github_identifier': identifier,
+         'website': website,
+         'status': 'active'
+     }
+
+     # Save to HuggingFace
+     if not save_agent_to_hf(submission):
+         return "ERROR: Failed to save submission", gr.update()
+
+     # Return success message - data will be populated by backend updates
+     return f"SUCCESS: Submitted {agent_name}! Release data will be populated automatically during the next scheduled backend update.", gr.update()
+
+
+ # =============================================================================
+ # DATA RELOAD FUNCTION
+ # =============================================================================
+
+ def reload_leaderboard_data():
+     """
+     Reload leaderboard data from HuggingFace.
+     This function is called by the scheduler on a daily basis.
+     """
+     print(f"\n{'='*80}")
+     print("Reloading leaderboard data from HuggingFace...")
+     print(f"{'='*80}\n")
+
+     try:
+         data = load_leaderboard_data_from_hf()
+         if data:
+             print("Successfully reloaded leaderboard data")
+             print(f"  Last updated: {data.get('metadata', {}).get('last_updated', 'Unknown')}")
+             print(f"  Agents: {len(data.get('leaderboard', {}))}")
+         else:
+             print("No data available")
+     except Exception as e:
+         print(f"Error reloading leaderboard data: {str(e)}")
+
+     print(f"{'='*80}\n")
+
+
+ # =============================================================================
+ # GRADIO APPLICATION
+ # =============================================================================
+
+ print("\nStarting SWE Agent Release Leaderboard")
+ print(f"  Data source: {LEADERBOARD_REPO}")
+ print("  Reload frequency: Daily at 00:00 UTC\n")
+
+ # Start APScheduler for daily data reload at 00:00 UTC
+ scheduler = BackgroundScheduler(timezone="UTC")
+ scheduler.add_job(
+     reload_leaderboard_data,
+     trigger=CronTrigger(hour=0, minute=0),  # 00:00 UTC daily
+     id='daily_data_reload',
+     name='Daily Data Reload',
+     replace_existing=True
+ )
+ scheduler.start()
+ print(f"\n{'='*80}")
+ print("Scheduler initialized successfully")
+ print("Reload schedule: Daily at 00:00 UTC")
+ print("On startup: Loads cached data from HuggingFace on demand")
+ print(f"{'='*80}\n")
+
+ # Create Gradio interface
+ with gr.Blocks(title="SWE Agent Release Leaderboard", theme=gr.themes.Soft()) as app:
+     gr.Markdown("# SWE Agent Release Leaderboard")
+     gr.Markdown("Track and compare total releases by SWE agents")
+
+     with gr.Tabs():
+
+         # Leaderboard Tab
+         with gr.Tab("Leaderboard"):
+             gr.Markdown("*Statistics are based on total releases by agents*")
+             leaderboard_table = Leaderboard(
+                 value=pd.DataFrame(columns=[col[0] for col in LEADERBOARD_COLUMNS]),  # Empty initially
+                 datatype=LEADERBOARD_COLUMNS,
+                 search_columns=["Agent Name", "Website"],
+                 filter_columns=[]
+             )
+
+             # Load leaderboard data when app starts
+             app.load(
+                 fn=get_leaderboard_dataframe,
+                 inputs=[],
+                 outputs=[leaderboard_table]
+             )
+
+             # Monthly Metrics Section
+             gr.Markdown("---")  # Divider
+             gr.Markdown("### Monthly Performance - Top 5 Agents")
+             gr.Markdown("*Shows total releases for the most active agents*")
+
+             monthly_metrics_plot = gr.Plot(label="Monthly Metrics")
+
+             # Load monthly metrics when app starts
+             app.load(
+                 fn=lambda: create_monthly_metrics_plot(),
+                 inputs=[],
+                 outputs=[monthly_metrics_plot]
+             )
+
+         # Submit Agent Tab
+         with gr.Tab("Submit Agent"):
+
+             gr.Markdown("### Submit Your Agent")
+             gr.Markdown("Fill in the details below to add your agent to the leaderboard.")
+
+             with gr.Row():
+                 with gr.Column():
+                     github_input = gr.Textbox(
+                         label="GitHub Identifier*",
+                         placeholder="Your agent username (e.g., my-agent[bot])"
+                     )
+                     name_input = gr.Textbox(
+                         label="Agent Name*",
+                         placeholder="Your agent's display name"
+                     )
+
+                 with gr.Column():
+                     organization_input = gr.Textbox(
+                         label="Organization*",
+                         placeholder="Your organization or team name"
+                     )
+                     website_input = gr.Textbox(
+                         label="Website*",
+                         placeholder="https://your-agent-website.com"
+                     )
+
+             submit_button = gr.Button(
+                 "Submit Agent",
+                 variant="primary"
+             )
+             submission_status = gr.Textbox(
+                 label="Submission Status",
+                 interactive=False
+             )
+
+             # Event handler
+             submit_button.click(
+                 fn=submit_agent,
+                 inputs=[github_input, name_input, organization_input, website_input],
+                 outputs=[submission_status, leaderboard_table]
+             )
+
+
+ # Launch application
+ if __name__ == "__main__":
+     app.launch()
docker-compose.yml ADDED
@@ -0,0 +1,20 @@
+ services:
+   miner:
+     build:
+       context: .
+       dockerfile: Dockerfile
+     container_name: ${COMPOSE_PROJECT_NAME}
+     restart: unless-stopped
+     env_file:
+       - .env
+     volumes:
+       - .:/app
+       - ../gharchive:/gharchive:ro
+       - ../bot_data:/bot_data:ro
+     environment:
+       - PYTHONUNBUFFERED=1
+     logging:
+       driver: "json-file"
+       options:
+         max-size: "10m"
+         max-file: "3"
msr.py ADDED
@@ -0,0 +1,808 @@
+ import json
+ import os
+ import time
+ from datetime import datetime, timezone, timedelta
+ from collections import defaultdict
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ from huggingface_hub import HfApi, hf_hub_download
+ from huggingface_hub.errors import HfHubHTTPError
+ from dotenv import load_dotenv
+ import duckdb
+ import backoff
+ import requests
+ import requests.exceptions
+ from apscheduler.schedulers.blocking import BlockingScheduler
+ from apscheduler.triggers.cron import CronTrigger
+ import logging
+ import traceback
+ import subprocess
+ import re
+
+ # Load environment variables
+ load_dotenv()
+
+ # =============================================================================
+ # CONFIGURATION
+ # =============================================================================
+
+ AGENTS_REPO = "SWE-Arena/bot_data"
+ AGENTS_REPO_LOCAL_PATH = os.path.expanduser("~/bot_data")  # Local git clone path
+ DUCKDB_CACHE_FILE = "cache.duckdb"
+ GHARCHIVE_DATA_LOCAL_PATH = os.path.expanduser("~/gharchive/data")
+ LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
+ LEADERBOARD_REPO = "SWE-Arena/leaderboard_data"
+ LEADERBOARD_TIME_FRAME_DAYS = 180
+
+ # Git sync configuration (mandatory to get latest bot data)
+ GIT_SYNC_TIMEOUT = 300  # 5 minutes timeout for git pull
+
+ # OPTIMIZED DUCKDB CONFIGURATION
+ DUCKDB_THREADS = 8
+ DUCKDB_MEMORY_LIMIT = "64GB"
+
+ # Streaming batch configuration
+ BATCH_SIZE_DAYS = 7  # Process 1 week at a time (~168 hourly files)
+ # At this size: ~7 days × 24 files × ~100MB per file = ~16GB uncompressed per batch
+
+ # Download configuration
+ DOWNLOAD_WORKERS = 4
+ DOWNLOAD_RETRY_DELAY = 2
+ MAX_RETRIES = 5
+
+ # Upload configuration
+ UPLOAD_DELAY_SECONDS = 5
+ UPLOAD_MAX_BACKOFF = 3600
+
+ # Scheduler configuration
+ SCHEDULE_ENABLED = True
+ SCHEDULE_DAY_OF_WEEK = 'thu'  # Thursday
+ SCHEDULE_HOUR = 0
+ SCHEDULE_MINUTE = 0
+ SCHEDULE_TIMEZONE = 'UTC'
+
+ # =============================================================================
+ # UTILITY FUNCTIONS
+ # =============================================================================
+
+ def load_jsonl(filename):
+     """Load JSONL file and return list of dictionaries."""
+     if not os.path.exists(filename):
+         return []
+
+     data = []
+     with open(filename, 'r', encoding='utf-8') as f:
+         for line in f:
+             line = line.strip()
+             if line:
+                 try:
+                     data.append(json.loads(line))
+                 except json.JSONDecodeError as e:
+                     print(f"Warning: Skipping invalid JSON line: {e}")
+     return data
+
+
+ def save_jsonl(filename, data):
+     """Save list of dictionaries to JSONL file."""
+     with open(filename, 'w', encoding='utf-8') as f:
+         for item in data:
+             f.write(json.dumps(item) + '\n')
+
+
+ def normalize_date_format(date_string):
+     """Convert date strings or datetime objects to standardized ISO 8601 format with Z suffix."""
+     if not date_string or date_string == 'N/A':
+         return 'N/A'
+
+     try:
+         if isinstance(date_string, datetime):
+             return date_string.strftime('%Y-%m-%dT%H:%M:%SZ')
+
+         date_string = re.sub(r'\s+', ' ', date_string.strip())
+         date_string = date_string.replace(' ', 'T')
+
+         if len(date_string) >= 3:
+             if date_string[-3:-2] in ('+', '-') and ':' not in date_string[-3:]:
+                 date_string = date_string + ':00'
+
+         dt = datetime.fromisoformat(date_string.replace('Z', '+00:00'))
+         return dt.strftime('%Y-%m-%dT%H:%M:%SZ')
+     except Exception as e:
+         print(f"Warning: Could not parse date '{date_string}': {e}")
+         return date_string
+
+
+ def get_hf_token():
+     """Get HuggingFace token from environment variables."""
+     token = os.getenv('HF_TOKEN')
+     if not token:
+         print("Warning: HF_TOKEN not found in environment variables")
+     return token
+
+
+ # =============================================================================
+ # GHARCHIVE DOWNLOAD FUNCTIONS
+ # =============================================================================
+
+ def download_file(url):
+     """Download a GHArchive file with retry logic."""
+     filename = url.split("/")[-1]
+     filepath = os.path.join(GHARCHIVE_DATA_LOCAL_PATH, filename)
+
+     if os.path.exists(filepath):
+         return True
+
+     for attempt in range(MAX_RETRIES):
+         try:
+             response = requests.get(url, timeout=30)
+             response.raise_for_status()
+             with open(filepath, "wb") as f:
+                 f.write(response.content)
+             return True
+
+         except requests.exceptions.HTTPError as e:
+             # 404 means the file doesn't exist in GHArchive - skip without retry
+             if e.response.status_code == 404:
+                 if attempt == 0:  # Only log once, not for each retry
+                     print(f"  ○ {filename}: Not available (404) - skipping")
+                 return False
+
+             # Other HTTP errors (5xx, etc.) should be retried
+             wait_time = DOWNLOAD_RETRY_DELAY * (2 ** attempt)
+             print(f"  ○ {filename}: {e}, retrying in {wait_time}s (attempt {attempt + 1}/{MAX_RETRIES})")
+             time.sleep(wait_time)
+
+         except Exception as e:
+             # Network errors, timeouts, etc. should be retried
+             wait_time = DOWNLOAD_RETRY_DELAY * (2 ** attempt)
+             print(f"  ○ {filename}: {e}, retrying in {wait_time}s (attempt {attempt + 1}/{MAX_RETRIES})")
+             time.sleep(wait_time)
+
+     return False
+
+
+ def download_all_gharchive_data():
+     """Download all GHArchive data files for the last LEADERBOARD_TIME_FRAME_DAYS."""
+     os.makedirs(GHARCHIVE_DATA_LOCAL_PATH, exist_ok=True)
+
+     end_date = datetime.now(timezone.utc)
+     start_date = end_date - timedelta(days=LEADERBOARD_TIME_FRAME_DAYS)
+
+     urls = []
+     current_date = start_date
+     while current_date <= end_date:
+         date_str = current_date.strftime("%Y-%m-%d")
+         for hour in range(24):
+             url = f"https://data.gharchive.org/{date_str}-{hour}.json.gz"
+             urls.append(url)
+         current_date += timedelta(days=1)
+
+     downloads_processed = 0
+
+     try:
+         with ThreadPoolExecutor(max_workers=DOWNLOAD_WORKERS) as executor:
+             futures = [executor.submit(download_file, url) for url in urls]
+             for future in as_completed(futures):
+                 downloads_processed += 1
+
+         print(f"  Download complete: {downloads_processed} files")
+         return True
+
+     except Exception as e:
+         print(f"Error during download: {str(e)}")
+         traceback.print_exc()
+         return False
+
+
+ # =============================================================================
+ # HUGGINGFACE API WRAPPERS
+ # =============================================================================
+
+ def is_retryable_error(e):
+     """Check if exception is retryable (rate limit or timeout error)."""
+     if isinstance(e, HfHubHTTPError):
+         if e.response.status_code == 429:
+             return True
+
+     if isinstance(e, (requests.exceptions.Timeout,
+                       requests.exceptions.ReadTimeout,
+                       requests.exceptions.ConnectTimeout)):
+         return True
+
+     if isinstance(e, Exception):
+         error_str = str(e).lower()
+         if 'timeout' in error_str or 'timed out' in error_str:
+             return True
+
+     return False
+
+
+ @backoff.on_exception(
+     backoff.expo,
+     (HfHubHTTPError, requests.exceptions.Timeout, requests.exceptions.RequestException, Exception),
+     max_tries=MAX_RETRIES,
+     base=300,
+     max_value=3600,
+     giveup=lambda e: not is_retryable_error(e),
+     on_backoff=lambda details: print(
+         f"  {details['exception']} error. Retrying in {details['wait']/60:.1f} minutes ({details['wait']:.0f}s) - attempt {details['tries']}/{MAX_RETRIES}..."
+     )
+ )
+ def list_repo_files_with_backoff(api, **kwargs):
+     """Wrapper for api.list_repo_files() with exponential backoff."""
+     return api.list_repo_files(**kwargs)
+
+
+ @backoff.on_exception(
+     backoff.expo,
+     (HfHubHTTPError, requests.exceptions.Timeout, requests.exceptions.RequestException, Exception),
+     max_tries=MAX_RETRIES,
+     base=300,
+     max_value=3600,
+     giveup=lambda e: not is_retryable_error(e),
+     on_backoff=lambda details: print(
+         f"  {details['exception']} error. Retrying in {details['wait']/60:.1f} minutes ({details['wait']:.0f}s) - attempt {details['tries']}/{MAX_RETRIES}..."
+     )
+ )
+ def hf_hub_download_with_backoff(**kwargs):
+     """Wrapper for hf_hub_download() with exponential backoff."""
+     return hf_hub_download(**kwargs)
+
+
+ @backoff.on_exception(
+     backoff.expo,
+     (HfHubHTTPError, requests.exceptions.Timeout, requests.exceptions.RequestException, Exception),
+     max_tries=MAX_RETRIES,
+     base=300,
+     max_value=3600,
+     giveup=lambda e: not is_retryable_error(e),
+     on_backoff=lambda details: print(
+         f"  {details['exception']} error. Retrying in {details['wait']/60:.1f} minutes ({details['wait']:.0f}s) - attempt {details['tries']}/{MAX_RETRIES}..."
+     )
+ )
+ def upload_file_with_backoff(api, **kwargs):
+     """Wrapper for api.upload_file() with exponential backoff."""
+     return api.upload_file(**kwargs)
+
+
+ @backoff.on_exception(
+     backoff.expo,
+     (HfHubHTTPError, requests.exceptions.Timeout, requests.exceptions.RequestException, Exception),
+     max_tries=MAX_RETRIES,
+     base=300,
+     max_value=3600,
+     giveup=lambda e: not is_retryable_error(e),
+     on_backoff=lambda details: print(
+         f"  {details['exception']} error. Retrying in {details['wait']/60:.1f} minutes ({details['wait']:.0f}s) - attempt {details['tries']}/{MAX_RETRIES}..."
+     )
+ )
+ def upload_folder_with_backoff(api, **kwargs):
+     """Wrapper for api.upload_folder() with exponential backoff."""
+     return api.upload_folder(**kwargs)
+
+
+ def get_duckdb_connection():
+     """
+     Initialize DuckDB connection with OPTIMIZED memory settings.
+     Uses a persistent database and a reduced memory footprint.
+     Automatically removes the cache file if a lock conflict is detected.
+     """
+     try:
+         conn = duckdb.connect(DUCKDB_CACHE_FILE)
+     except Exception as e:
+         # Check if it's a locking error
+         error_msg = str(e)
+         if "lock" in error_msg.lower() or "conflicting" in error_msg.lower():
+             print(f"  ⚠ Lock conflict detected, removing {DUCKDB_CACHE_FILE}...")
+             if os.path.exists(DUCKDB_CACHE_FILE):
+                 os.remove(DUCKDB_CACHE_FILE)
+                 print("  ✓ Cache file removed, retrying connection...")
+             # Retry connection after removing cache
+             conn = duckdb.connect(DUCKDB_CACHE_FILE)
+         else:
+             # Re-raise if it's not a locking error
+             raise
+
+     # OPTIMIZED SETTINGS
+     conn.execute(f"SET threads TO {DUCKDB_THREADS};")
+     conn.execute("SET preserve_insertion_order = false;")
+     conn.execute("SET enable_object_cache = true;")
+     conn.execute("SET temp_directory = '/tmp/duckdb_temp';")
+     conn.execute(f"SET memory_limit = '{DUCKDB_MEMORY_LIMIT}';")  # Per-query limit
+     conn.execute(f"SET max_memory = '{DUCKDB_MEMORY_LIMIT}';")  # Hard cap
+
+     return conn
+
+
+ def generate_file_path_patterns(start_date, end_date, data_dir=GHARCHIVE_DATA_LOCAL_PATH):
+     """Generate file path patterns for GHArchive data in date range (only existing files)."""
+     file_patterns = []
+     missing_dates = set()
+
+     current_date = start_date.replace(hour=0, minute=0, second=0, microsecond=0)
+     end_day = end_date.replace(hour=0, minute=0, second=0, microsecond=0)
+
+     while current_date <= end_day:
+         date_has_files = False
+         for hour in range(24):
+             pattern = os.path.join(data_dir, f"{current_date.strftime('%Y-%m-%d')}-{hour}.json.gz")
+             if os.path.exists(pattern):
+                 file_patterns.append(pattern)
+                 date_has_files = True
+
+         if not date_has_files:
+             missing_dates.add(current_date.strftime('%Y-%m-%d'))
+
+         current_date += timedelta(days=1)
+
+     if missing_dates:
+         print(f"  ○ Skipping {len(missing_dates)} date(s) with no data")
+
+     return file_patterns
+
+
+ # =============================================================================
+ # STREAMING BATCH PROCESSING
+ # =============================================================================
+
+ def fetch_all_release_metadata_streaming(conn, identifiers, start_date, end_date):
+     """
+     OPTIMIZED: Fetch release metadata using streaming batch processing.
+
+     Processes GHArchive files in BATCH_SIZE_DAYS chunks to limit memory usage.
+     Instead of loading 180 days (4,344 files) at once, processes 7 days at a time.
+
+     This prevents OOM errors by:
+     1. Only keeping ~168 hourly files in memory per batch (vs 4,344)
+     2. Incrementally building the results dictionary
+     3. Allowing DuckDB to garbage collect after each batch
+
+     Args:
+         conn: DuckDB connection instance
+         identifiers: List of GitHub usernames/bot identifiers (~1500)
+         start_date: Start datetime (timezone-aware)
+         end_date: End datetime (timezone-aware)
+
+     Returns:
+         Dictionary mapping agent identifier to list of release metadata
+     """
+     identifier_list = ', '.join([f"'{id}'" for id in identifiers])
+     metadata_by_agent = defaultdict(list)
+
+     # Calculate total batches
+     total_days = (end_date - start_date).days
+     total_batches = (total_days // BATCH_SIZE_DAYS) + 1
+
+     # Process in configurable batches
+     current_date = start_date
+     batch_num = 0
+     total_releases = 0
+
+     print(f"  Streaming {total_batches} batches of {BATCH_SIZE_DAYS}-day intervals...")
+
+     while current_date <= end_date:
+         batch_num += 1
+         batch_end = min(current_date + timedelta(days=BATCH_SIZE_DAYS - 1), end_date)
+
+         # Get file patterns for THIS BATCH ONLY (not all 180 days)
+         file_patterns = generate_file_path_patterns(current_date, batch_end)
+
+         if not file_patterns:
+             print(f"  Batch {batch_num}/{total_batches}: {current_date.date()} to {batch_end.date()} - NO DATA")
+             current_date = batch_end + timedelta(days=1)
+             continue
+
+         # Progress indicator
+         print(f"  Batch {batch_num}/{total_batches}: {current_date.date()} to {batch_end.date()} ({len(file_patterns)} files)... ", end="", flush=True)
+
+         # Build file patterns SQL for THIS BATCH
+         file_patterns_sql = '[' + ', '.join([f"'{fp}'" for fp in file_patterns]) + ']'
+
+         # Query for this batch.
+         # Extract release information from ReleaseEvent payloads.
+         query = f"""
+             SELECT DISTINCT
+                 actor.login as agent,
+                 TRY_CAST(json_extract_string(to_json(payload), '$.release.tag_name') AS VARCHAR) as release_tag,
+                 TRY_CAST(json_extract_string(to_json(payload), '$.action') AS VARCHAR) as action,
+                 created_at
+             FROM read_json(
+                 {file_patterns_sql},
+                 union_by_name=true,
+                 filename=true,
+                 compression='gzip',
+                 format='newline_delimited',
+                 ignore_errors=true,
+                 maximum_object_size=2147483648
+             )
+             WHERE type = 'ReleaseEvent'
+                 AND TRY_CAST(json_extract_string(to_json(payload), '$.release.tag_name') AS VARCHAR) IS NOT NULL
+                 AND TRY_CAST(json_extract_string(to_json(actor), '$.login') AS VARCHAR) IN ({identifier_list})
+         """
+
+         try:
+             results = conn.execute(query).fetchall()
+
+             batch_releases = 0
+             for row in results:
+                 agent = row[0]
+                 release_tag = row[1]
+                 action = row[2]
+                 created_at = normalize_date_format(row[3]) if row[3] else None
+
+                 if not agent or not release_tag:
+                     continue
+
+                 # Build release metadata
+                 release_metadata = {
+                     'release_tag': release_tag,
+                     'action': action,
+                     'created_at': created_at,
+                 }
+
+                 metadata_by_agent[agent].append(release_metadata)
+                 batch_releases += 1
+                 total_releases += 1
+
+             print(f"✓ {batch_releases} releases found")
+
+         except Exception as e:
+             print(f"\n  ✗ Batch {batch_num} error: {str(e)}")
+             traceback.print_exc()
+
+         # Move to next batch
+         current_date = batch_end + timedelta(days=1)
+
+     # Final summary
+     agents_with_data = sum(1 for releases in metadata_by_agent.values() if releases)
+     print(f"\n  ✓ Complete: {total_releases} releases found for {agents_with_data}/{len(identifiers)} agents")
+
+     return dict(metadata_by_agent)
+
+
+ def sync_agents_repo():
+     """
+     Sync local bot_data repository with remote using git pull.
+     This is MANDATORY to ensure we have the latest bot data.
+     Raises an exception if the sync fails.
+     """
+     if not os.path.exists(AGENTS_REPO_LOCAL_PATH):
+         error_msg = f"Local repository not found at {AGENTS_REPO_LOCAL_PATH}"
+         print(f"  ✗ {error_msg}")
+         print(f"    Please clone it first: git clone https://huggingface.co/datasets/{AGENTS_REPO}")
+         raise FileNotFoundError(error_msg)
+
+     if not os.path.exists(os.path.join(AGENTS_REPO_LOCAL_PATH, '.git')):
+         error_msg = f"{AGENTS_REPO_LOCAL_PATH} exists but is not a git repository"
+         print(f"  ✗ {error_msg}")
+         raise ValueError(error_msg)
+
+     try:
+         # Run git pull with extended timeout due to large repository
+         result = subprocess.run(
+             ['git', 'pull'],
+             cwd=AGENTS_REPO_LOCAL_PATH,
+             capture_output=True,
+             text=True,
+             timeout=GIT_SYNC_TIMEOUT
+         )
+
+         if result.returncode == 0:
+             output = result.stdout.strip()
+             if "Already up to date" in output or "Already up-to-date" in output:
+                 print("  ✓ Repository is up to date")
+             else:
+                 print("  ✓ Repository synced successfully")
+                 if output:
+                     # Print first few lines of output
+                     lines = output.split('\n')[:5]
+                     for line in lines:
+                         print(f"    {line}")
+             return True
+         else:
+             error_msg = f"Git pull failed: {result.stderr.strip()}"
+             print(f"  ✗ {error_msg}")
+             raise RuntimeError(error_msg)
+
+     except subprocess.TimeoutExpired:
+         error_msg = f"Git pull timed out after {GIT_SYNC_TIMEOUT} seconds"
+         print(f"  ✗ {error_msg}")
+         raise TimeoutError(error_msg)
+     except (FileNotFoundError, ValueError, RuntimeError, TimeoutError):
+         raise  # Re-raise expected exceptions
+     except Exception as e:
+         error_msg = f"Error syncing repository: {str(e)}"
+         print(f"  ✗ {error_msg}")
+         raise RuntimeError(error_msg) from e
+
+
+ def load_agents_from_hf():
+     """
+     Load all agent metadata JSON files from the local git repository.
+     ALWAYS syncs with the remote first to ensure we have the latest bot data.
+     """
+     # MANDATORY: Sync with remote first to get latest bot data
+     print("  Syncing bot_data repository to get latest agents...")
+     sync_agents_repo()  # Will raise exception if sync fails
+
+     agents = []
+
+     # Scan local directory for JSON files
+     if not os.path.exists(AGENTS_REPO_LOCAL_PATH):
+         raise FileNotFoundError(f"Local repository not found at {AGENTS_REPO_LOCAL_PATH}")
+
+     # Walk through the directory to find all JSON files
+     files_processed = 0
+     print(f"  Loading agent metadata from {AGENTS_REPO_LOCAL_PATH}...")
+
+     for root, dirs, files in os.walk(AGENTS_REPO_LOCAL_PATH):
+         # Skip .git directory
+         if '.git' in root:
+             continue
+
+         for filename in files:
+             if not filename.endswith('.json'):
+                 continue
+
+             files_processed += 1
+             file_path = os.path.join(root, filename)
+
+             try:
+                 with open(file_path, 'r', encoding='utf-8') as f:
+                     agent_data = json.load(f)
+
+                 # Only include active agents
+                 if agent_data.get('status') != 'active':
+                     continue
+
+                 # Extract github_identifier from filename
+                 github_identifier = filename.replace('.json', '')
+                 agent_data['github_identifier'] = github_identifier
+
+                 agents.append(agent_data)
+
+             except Exception as e:
+                 print(f"  ○ Error loading {filename}: {str(e)}")
+                 continue
+
+     print(f"  ✓ Loaded {len(agents)} active agents (from {files_processed} total files)")
+     return agents
+
+
+ def calculate_release_stats_from_metadata(metadata_list):
+     """Calculate statistics from a list of release metadata."""
+     total_releases = len(metadata_list)
+
+     return {
+         'total_releases': total_releases,
+     }
+
+
+ def calculate_monthly_metrics_by_agent(all_metadata_dict, agents):
+     """Calculate monthly metrics for all agents for visualization."""
+     identifier_to_name = {agent.get('github_identifier'): agent.get('name') for agent in agents if agent.get('github_identifier')}
+
+     if not all_metadata_dict:
+         return {'agents': [], 'months': [], 'data': {}}
+
+     agent_month_data = defaultdict(lambda: defaultdict(list))
+
+     for agent_identifier, metadata_list in all_metadata_dict.items():
+         for release_meta in metadata_list:
+             created_at = release_meta.get('created_at')
+
+             if not created_at:
+                 continue
+
+             agent_name = identifier_to_name.get(agent_identifier, agent_identifier)
+
+             try:
+                 dt = datetime.fromisoformat(created_at.replace('Z', '+00:00'))
+                 month_key = f"{dt.year}-{dt.month:02d}"
+                 agent_month_data[agent_name][month_key].append(release_meta)
+             except Exception as e:
+                 print(f"Warning: Could not parse date '{created_at}': {e}")
+                 continue
+
+     all_months = set()
+     for agent_data in agent_month_data.values():
+         all_months.update(agent_data.keys())
+     months = sorted(list(all_months))
+
+     result_data = {}
+     for agent_name, month_dict in agent_month_data.items():
+         total_releases_list = []
+
+         for month in months:
+             releases_in_month = month_dict.get(month, [])
+             total_count = len(releases_in_month)
+
+             total_releases_list.append(total_count)
+
+         result_data[agent_name] = {
+             'total_releases': total_releases_list,
+         }
+
+     agents_list = sorted(list(agent_month_data.keys()))
+
+     return {
+         'agents': agents_list,
+         'months': months,
+         'data': result_data
+     }
+
+
+ def construct_leaderboard_from_metadata(all_metadata_dict, agents):
+     """Construct leaderboard from in-memory release metadata."""
+     if not agents:
+         print("Error: No agents found")
+         return {}
+
+     cache_dict = {}
+
+     for agent in agents:
+         identifier = agent.get('github_identifier')
+         agent_name = agent.get('name', 'Unknown')
+
+         bot_metadata = all_metadata_dict.get(identifier, [])
+         stats = calculate_release_stats_from_metadata(bot_metadata)
+
+         cache_dict[identifier] = {
+             'name': agent_name,
+             'website': agent.get('website', 'N/A'),
+             'github_identifier': identifier,
+             **stats
+         }
+
+     return cache_dict
+
+
+ def save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics):
+     """Save leaderboard data and monthly metrics to HuggingFace dataset."""
+     try:
+         token = get_hf_token()
+         if not token:
+             raise Exception("No HuggingFace token found")
+
+         api = HfApi(token=token)
+
+         combined_data = {
+             'leaderboard': leaderboard_dict,
+             'monthly_metrics': monthly_metrics,
+             'metadata': {
+                 # app.py reads last_updated from the metadata block
+                 'last_updated': datetime.now(timezone.utc).isoformat(),
+                 'leaderboard_time_frame_days': LEADERBOARD_TIME_FRAME_DAYS
+             }
+         }
+
+         with open(LEADERBOARD_FILENAME, 'w') as f:
+             json.dump(combined_data, f, indent=2)
+
+         try:
+             upload_file_with_backoff(
+                 api=api,
+                 path_or_fileobj=LEADERBOARD_FILENAME,
+                 path_in_repo=LEADERBOARD_FILENAME,
+                 repo_id=LEADERBOARD_REPO,
+                 repo_type="dataset"
+             )
+             return True
+         finally:
+             if os.path.exists(LEADERBOARD_FILENAME):
+                 os.remove(LEADERBOARD_FILENAME)
+
+     except Exception as e:
+         print(f"Error saving leaderboard data: {str(e)}")
+         traceback.print_exc()
+         return False
+
+
+ # =============================================================================
+ # MINING FUNCTION
+ # =============================================================================
+
+ def mine_all_agents():
+     """
+     Mine release metadata for all agents using STREAMING batch processing.
+     Downloads GHArchive data, then uses BATCH-based DuckDB queries.
+     """
+     print("\n[1/4] Downloading GHArchive data...")
+
+     if not download_all_gharchive_data():
+         print("Warning: Download had errors, continuing with available data...")
+
+     print("\n[2/4] Loading agent metadata...")
+
+     agents = load_agents_from_hf()
+     if not agents:
+         print("Error: No agents found")
+         return
+
+     identifiers = [agent['github_identifier'] for agent in agents if agent.get('github_identifier')]
+     if not identifiers:
+         print("Error: No valid agent identifiers found")
+         return
+
+     print(f"\n[3/4] Mining release metadata ({len(identifiers)} agents, {LEADERBOARD_TIME_FRAME_DAYS} days)...")
+
+     try:
+         conn = get_duckdb_connection()
+     except Exception as e:
+         print(f"Failed to initialize DuckDB connection: {str(e)}")
+         return
+
+     current_time = datetime.now(timezone.utc)
+     end_date = current_time.replace(hour=0, minute=0, second=0, microsecond=0)
+     start_date = end_date - timedelta(days=LEADERBOARD_TIME_FRAME_DAYS)
+
+     try:
+         # USE STREAMING FUNCTION
+         all_metadata = fetch_all_release_metadata_streaming(
+             conn, identifiers, start_date, end_date
+         )
+
+     except Exception as e:
+         print(f"Error during DuckDB fetch: {str(e)}")
+         traceback.print_exc()
+         return
+     finally:
+         conn.close()
+
+     print("\n[4/4] Saving leaderboard...")
+
+     try:
+         leaderboard_dict = construct_leaderboard_from_metadata(all_metadata, agents)
+         monthly_metrics = calculate_monthly_metrics_by_agent(all_metadata, agents)
+         save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics)
+
+     except Exception as e:
+         print(f"Error saving leaderboard: {str(e)}")
+         traceback.print_exc()
+
+
+ # =============================================================================
+ # SCHEDULER SETUP
+ # =============================================================================
+
+ def setup_scheduler():
+     """Set up APScheduler to run mining jobs periodically."""
+     logging.basicConfig(
+         level=logging.INFO,
+         format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+     )
+
+     logging.getLogger('httpx').setLevel(logging.WARNING)
+
+     scheduler = BlockingScheduler(timezone=SCHEDULE_TIMEZONE)
+
+     trigger = CronTrigger(
+         day_of_week=SCHEDULE_DAY_OF_WEEK,
+         hour=SCHEDULE_HOUR,
+         minute=SCHEDULE_MINUTE,
+         timezone=SCHEDULE_TIMEZONE
+     )
+
+     scheduler.add_job(
+         mine_all_agents,
+         trigger=trigger,
+         id='mine_all_agents',
+         name='Mine GHArchive data for all agents',
+         replace_existing=True
+     )
+
+     next_run = trigger.get_next_fire_time(None, datetime.now(trigger.timezone))
+     print(f"Scheduler: Weekly on {SCHEDULE_DAY_OF_WEEK} at {SCHEDULE_HOUR:02d}:{SCHEDULE_MINUTE:02d} {SCHEDULE_TIMEZONE}")
+     print(f"Next run: {next_run}\n")
+
+     print("\nScheduler started")
+     scheduler.start()
+
+
+ # =============================================================================
+ # ENTRY POINT
+ # =============================================================================
+
+ if __name__ == "__main__":
+     if SCHEDULE_ENABLED:
+         setup_scheduler()
+     else:
+         mine_all_agents()
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ APScheduler
+ backoff
+ duckdb[all]
+ gradio
+ gradio_leaderboard
+ huggingface_hub
+ pandas
+ plotly
+ python-dotenv
+ requests