asvs committed
Commit 173d7ec · 1 Parent(s): 9815010

Add GAM and GDM metrics to the table
.claude/skills/hf-dataset-storage/SKILL.md DELETED
@@ -1,387 +0,0 @@
- ---
- name: hf-dataset-storage
- description: Implement persistent storage for Hugging Face Spaces using dataset storage. Use when working with HF Spaces persistence, saving Space data to datasets, scheduled uploads to the Hugging Face Hub, or when the user mentions dataset storage, Space persistence, CommitScheduler, or backing up Space data.
- allowed-tools: Read, Write, Edit, Bash, Grep, Glob
- ---
-
- # Hugging Face Dataset Storage for Spaces
-
- This skill helps you implement persistent storage for Hugging Face Spaces using dataset repositories as a data store.
-
- ## When to Use Dataset Storage
-
- Use dataset storage for Hugging Face Spaces when:
- - You need data to persist beyond the Space's lifecycle
- - You want to collect user feedback or logs from a Space
- - You need append-only storage for analytics or training data
- - You want to avoid paying for persistent storage upgrades
- - You need to version your data over time
-
- ## Quick Start
-
- ### 1. Install the Required Package
-
- ```bash
- uv add huggingface_hub
- ```
-
- ### 2. Basic Setup with CommitScheduler (Recommended)
-
- For append-only data that should be uploaded periodically (e.g., logs, user feedback):
-
- ```python
- import json
- import uuid
- from pathlib import Path
- from huggingface_hub import CommitScheduler
-
- # Create a unique file to avoid conflicts across restarts
- feedback_file = Path("user_feedback/") / f"data_{uuid.uuid4()}.json"
- feedback_folder = feedback_file.parent
-
- # Schedule uploads every 10 minutes (minimum recommended: 5 minutes)
- scheduler = CommitScheduler(
-     repo_id="username/my-dataset",  # Created if it doesn't exist
-     repo_type="dataset",
-     folder_path=feedback_folder,
-     path_in_repo="data",  # Upload to the /data folder in the dataset
-     every=10,  # Upload every 10 minutes
- )
-
- # Append data with thread safety
- def save_data(data_dict):
-     """Save data to the file with a thread lock for concurrent writes."""
-     with scheduler.lock:
-         with feedback_file.open("a") as f:
-             f.write(json.dumps(data_dict))
-             f.write("\n")
- ```
-
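Apart from the Hub upload itself, the scheduler's append pattern above is plain Python; the same lock-guarded JSONL write can be exercised locally, with a `threading.Lock` standing in for `scheduler.lock` (a sketch, not the library API):

```python
import json
import threading
from pathlib import Path
from tempfile import TemporaryDirectory

lock = threading.Lock()  # stand-in for scheduler.lock

def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a JSON line, guarded by the lock."""
    with lock:
        with path.open("a") as f:
            f.write(json.dumps(record))
            f.write("\n")

with TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data.jsonl"
    for i in range(3):
        append_jsonl(data_file, {"id": i, "value": i * 10})
    # Read the lines back to confirm one record per line
    records = [json.loads(line) for line in data_file.read_text().splitlines()]

print(len(records))  # 3
```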
- ### 3. Manual Upload Methods
-
- For one-time or controlled uploads:
-
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi()
-
- # Upload a single file
- api.upload_file(
-     path_or_fileobj="/path/to/local/file.json",
-     path_in_repo="data/file.json",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
- )
-
- # Upload an entire folder
- api.upload_folder(
-     folder_path="/path/to/local/folder",
-     path_in_repo="data",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
- )
- ```
-
- ## Authentication
-
- Before uploading, you need to authenticate with Hugging Face:
-
- ### Option 1: Log in via the CLI
- ```bash
- huggingface-cli login
- ```
-
- ### Option 2: Use a Token Programmatically
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi(token="hf_...")
- ```
-
- ### Option 3: Set an Environment Variable
- ```bash
- export HF_TOKEN="hf_..."
- ```
-
- For Spaces, add `HF_TOKEN` as a secret in the Space settings.
-
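In a Space, the secret arrives as an ordinary environment variable. A small helper (hypothetical, not part of `huggingface_hub`) keeps the lookup in one place and lets local runs fall back to cached CLI credentials:

```python
import os

def get_hf_token() -> "str | None":
    """Read the token from the environment.

    Returns None when HF_TOKEN is unset or empty, so callers can fall
    back to credentials cached by `huggingface-cli login`.
    """
    return os.environ.get("HF_TOKEN") or None

# Simulate a Space secret being present
os.environ["HF_TOKEN"] = "hf_example"
token = get_hf_token()
```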
- ## Advanced Patterns
-
- ### Pattern 1: Gradio Space with User Feedback
-
- ```python
- import json
- import uuid
- from datetime import datetime
- from pathlib import Path
- import gradio as gr
- from huggingface_hub import CommitScheduler
-
- # Setup
- feedback_file = Path("user_feedback/") / f"data_{uuid.uuid4()}.json"
- feedback_folder = feedback_file.parent
- feedback_folder.mkdir(parents=True, exist_ok=True)
-
- scheduler = CommitScheduler(
-     repo_id="username/user-feedback-dataset",
-     repo_type="dataset",
-     folder_path=feedback_folder,
-     path_in_repo="feedback",
-     every=10,
- )
-
- def save_feedback(input_text, output_text, rating):
-     """Save user feedback with thread safety."""
-     with scheduler.lock:
-         with feedback_file.open("a") as f:
-             f.write(json.dumps({
-                 "input": input_text,
-                 "output": output_text,
-                 "rating": rating,
-                 "timestamp": datetime.now().isoformat()
-             }))
-             f.write("\n")
-
- # Use in a Gradio interface
- with gr.Blocks() as demo:
-     # ... define your Gradio UI ...
-     submit_btn.click(save_feedback, inputs=[input_box, output_box, rating])
-
- demo.launch()
- ```
-
- ### Pattern 2: Training Logs with Progress Tracking
-
- ```python
- import json
- from pathlib import Path
- from huggingface_hub import CommitScheduler
- from tqdm import tqdm
-
- # Setup
- log_file = Path("training_logs/") / "metrics.jsonl"
- log_folder = log_file.parent
- log_folder.mkdir(exist_ok=True)
-
- scheduler = CommitScheduler(
-     repo_id="username/training-logs",
-     repo_type="dataset",
-     folder_path=log_folder,
-     path_in_repo="logs",
-     every=5,  # Upload every 5 minutes
- )
-
- # Training loop
- for epoch in tqdm(range(num_epochs), desc="Training"):
-     # ... training code ...
-
-     # Log metrics
-     with scheduler.lock:
-         with log_file.open("a") as f:
-             f.write(json.dumps({
-                 "epoch": epoch,
-                 "loss": loss,
-                 "accuracy": accuracy
-             }))
-             f.write("\n")
- ```
-
- ### Pattern 3: Large File Upload with Background Processing
-
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi()
-
- # Upload large files in the background (non-blocking)
- future = api.upload_folder(
-     repo_id="username/large-dataset",
-     folder_path="./data",
-     repo_type="dataset",
-     run_as_future=True,  # Non-blocking upload
- )
-
- # Continue working while the upload happens
- # ... do other work ...
-
- # Wait for the upload to complete when needed
- future.result()  # Blocks until the upload finishes
- ```
-
- ### Pattern 4: Scheduled Uploads with Multiple File Types
-
- ```python
- import zipfile
- import tempfile
- from pathlib import Path
- from huggingface_hub import CommitScheduler
-
- class ImageArchiveScheduler(CommitScheduler):
-     """Custom scheduler that zips images before uploading."""
-
-     def push_to_hub(self):
-         # Find all PNG files
-         png_files = list(self.folder_path.glob("*.png"))
-         if len(png_files) == 0:
-             return None  # Skip if there is nothing to commit
-
-         # Zip the files
-         with tempfile.TemporaryDirectory() as tmpdir:
-             archive_path = Path(tmpdir) / "images.zip"
-             with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zip_file:
-                 for png_file in png_files:
-                     zip_file.write(filename=png_file, arcname=png_file.name)
-
-             # Upload the archive (while the temporary file still exists)
-             self.api.upload_file(
-                 path_or_fileobj=archive_path,
-                 path_in_repo=f"{self.path_in_repo}/images.zip",
-                 repo_id=self.repo_id,
-                 repo_type=self.repo_type,
-             )
-
-         # Clean up local files
-         for png_file in png_files:
-             png_file.unlink()
-
- # Usage
- scheduler = ImageArchiveScheduler(
-     repo_id="username/image-dataset",
-     repo_type="dataset",
-     folder_path=Path("./images"),
-     path_in_repo="archives",
-     every=15,
- )
- ```
-
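The archiving step in Pattern 4 is plain `zipfile` work; isolated from the Hub upload, it can be run and checked locally (a sketch with fabricated PNG bytes, under the same naming assumptions as above):

```python
import tempfile
import zipfile
from pathlib import Path

def archive_pngs(folder: Path, archive_path: Path) -> int:
    """Zip every PNG in `folder` into `archive_path`; return the file count."""
    png_files = sorted(folder.glob("*.png"))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for png in png_files:
            zf.write(filename=png, arcname=png.name)
    return len(png_files)

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("a.png", "b.png"):
        (folder / name).write_bytes(b"fake image bytes")
    archive = folder / "images.zip"
    count = archive_pngs(folder, archive)
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
```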
- ## Best Practices
-
- ### 1. File Naming for Concurrent Access
- Always use UUIDs or timestamps to avoid filename conflicts:
- ```python
- import uuid
- filename = f"data_{uuid.uuid4()}.json"
- ```
-
- ### 2. Thread Safety
- Always hold the scheduler lock while writing:
- ```python
- with scheduler.lock:
-     with file.open("a") as f:
-         f.write(data)
- ```
-
- ### 3. Append-Only Data
- CommitScheduler assumes append-only operations:
- - Create new files
- - Append to existing files
- - Never delete or overwrite files (this can corrupt the repo)
-
- ### 4. Upload Frequency
- - Minimum recommended: 5 minutes
- - For user-facing apps: 10-15 minutes
- - For training logs: 5-10 minutes
-
- ### 5. Data Format
- Use formats readable by the Datasets library:
- - JSON Lines (`.jsonl`) for structured data
- - CSV for tabular data
- - Parquet for large datasets
- - ZIP for grouped files
-
- ### 6. Error Handling
- The scheduler handles errors silently and retries. For critical data, enable logging so errors are visible:
- ```python
- import logging
-
- logging.basicConfig(level=logging.INFO)
- # The scheduler will log errors automatically
- ```
-
- ### 7. Large Files
- For very large datasets (>1 GB), consider:
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi()
-
- # Upload large folders with automatic chunking
- api.upload_large_folder(
-     repo_id="username/huge-dataset",
-     folder_path="./data",
-     repo_type="dataset",
- )
- ```
-
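Because the scheduler only ever appends JSON lines, reading the collected data back is one `json.loads` per line; with the `datasets` library this is what `load_dataset("json", data_files=...)` does for you, but a plain-Python reader is enough for quick checks:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def read_jsonl(path: Path) -> list:
    """Parse a .jsonl file into a list of records, skipping blank lines."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

with TemporaryDirectory() as tmp:
    p = Path(tmp) / "metrics.jsonl"
    p.write_text('{"epoch": 0, "loss": 0.9}\n{"epoch": 1, "loss": 0.5}\n')
    rows = read_jsonl(p)
```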
- ## Common Use Cases
-
- ### Use Case 1: A/B Testing Results
- ```python
- from datetime import datetime
-
- # Save A/B test results from a Gradio Space
- def log_ab_test(user_id, variant, conversion):
-     with scheduler.lock:
-         with ab_test_file.open("a") as f:
-             f.write(json.dumps({
-                 "user_id": user_id,
-                 "variant": variant,
-                 "conversion": conversion,
-                 "timestamp": datetime.now().isoformat()
-             }))
-             f.write("\n")
- ```
-
- ### Use Case 2: Model Predictions Storage
- ```python
- # Store model predictions for analysis
- def save_prediction(input_data, prediction, confidence):
-     with scheduler.lock:
-         with predictions_file.open("a") as f:
-             f.write(json.dumps({
-                 "input": input_data,
-                 "prediction": prediction,
-                 "confidence": confidence,
-                 "model_version": "v1.0"
-             }))
-             f.write("\n")
- ```
-
- ### Use Case 3: Dataset Versioning
- ```python
- from datetime import datetime
-
- # Create versioned snapshots of data
- api = HfApi()
-
- api.upload_folder(
-     folder_path="./current_data",
-     path_in_repo=f"snapshots/{datetime.now().strftime('%Y%m%d')}",
-     repo_id="username/versioned-dataset",
-     repo_type="dataset",
-     commit_message=f"Snapshot for {datetime.now().date()}",
- )
- ```
-
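Once the A/B records from Use Case 1 have accumulated (on the Hub or still local), per-variant conversion rates fall out of a simple group-by. A sketch over in-memory records:

```python
from collections import defaultdict

def conversion_rates(records: list) -> dict:
    """Per-variant conversion rate from records with "variant" and "conversion" keys."""
    seen = defaultdict(int)
    converted = defaultdict(int)
    for r in records:
        seen[r["variant"]] += 1
        converted[r["variant"]] += int(bool(r["conversion"]))
    return {v: converted[v] / seen[v] for v in seen}

records = [
    {"variant": "A", "conversion": True},
    {"variant": "A", "conversion": False},
    {"variant": "B", "conversion": True},
    {"variant": "B", "conversion": True},
]
rates = conversion_rates(records)
```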
- ## Troubleshooting
-
- ### Issue: Upload fails with an authentication error
- **Solution:** Ensure you're logged in or have set the `HF_TOKEN` environment variable.
-
- ### Issue: Empty commits are being created
- **Solution:** The scheduler automatically skips empty commits. If you see them, check whether files are being created correctly.
-
- ### Issue: Files are not appearing in the dataset
- **Solution:** Wait for the next scheduled upload (check the `every` parameter); otherwise the scheduler may be encountering errors.
-
- ### Issue: Out-of-memory errors
- **Solution:** Use `upload_large_folder()` for very large datasets, or upload files one at a time.
-
- ### Issue: Concurrent write conflicts
- **Solution:** Always hold `scheduler.lock` when writing to files that the scheduler is monitoring.
-
- ## References
-
- - [HF Hub Dataset Storage Documentation](https://huggingface.co/docs/hub/spaces-storage#dataset-storage)
- - [CommitScheduler API Reference](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.CommitScheduler)
- - [Upload Guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload)
- - [Space to Dataset Saver Example](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver)
-
- ## Example: Complete Gradio App with Dataset Storage
-
- See [examples.md](examples.md) for a complete working example of a Gradio Space with dataset storage.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
.claude/skills/hf-dataset-storage/demo.py DELETED
@@ -1,217 +0,0 @@
- #!/usr/bin/env python3
- """
- Demo script showing Hugging Face Dataset Storage patterns.
- This demonstrates the skill's capabilities without requiring actual HF authentication.
- """
-
- import json
- from pathlib import Path
- from datetime import datetime
-
-
- def demo_file_structure():
-     """Show how to structure files for dataset storage."""
-     print("=" * 60)
-     print("📁 File Structure Demo")
-     print("=" * 60)
-
-     # Example 1: JSON Lines format (recommended for append-only data)
-     demo_folder = Path("demo_data")
-     demo_folder.mkdir(exist_ok=True)
-
-     # Create a sample JSONL file
-     sample_file = demo_folder / "sample_data.jsonl"
-     sample_data = [
-         {"id": 1, "timestamp": datetime.now().isoformat(), "value": "first entry"},
-         {"id": 2, "timestamp": datetime.now().isoformat(), "value": "second entry"},
-         {"id": 3, "timestamp": datetime.now().isoformat(), "value": "third entry"},
-     ]
-
-     with sample_file.open("w") as f:
-         for entry in sample_data:
-             f.write(json.dumps(entry))
-             f.write("\n")
-
-     print(f"✅ Created sample JSONL file: {sample_file}")
-     print(f"   Contains {len(sample_data)} entries\n")
-
-     # Show the content
-     print("📄 File contents:")
-     print(sample_file.read_text())
-     print()
-
-
- def demo_scheduler_pattern():
-     """Show the CommitScheduler pattern (without an actual upload)."""
-     print("=" * 60)
-     print("🔄 CommitScheduler Pattern")
-     print("=" * 60)
-
-     code_example = '''
-     import json
-     from pathlib import Path
-     from huggingface_hub import CommitScheduler
-
-     # Setup
-     feedback_folder = Path("user_feedback")
-     feedback_file = feedback_folder / "data.jsonl"
-
-     # Create a scheduler (uploads every 10 minutes)
-     scheduler = CommitScheduler(
-         repo_id="username/my-dataset",
-         repo_type="dataset",
-         folder_path=feedback_folder,
-         path_in_repo="feedback",
-         every=10,
-     )
-
-     # Save data with thread safety
-     def save_feedback(data):
-         with scheduler.lock:
-             with feedback_file.open("a") as f:
-                 f.write(json.dumps(data))
-                 f.write("\\n")
-     '''
-
-     print("💡 Recommended Pattern for Continuous Data Collection:")
-     print(code_example)
-     print()
-
-
- def demo_manual_upload_pattern():
-     """Show manual upload patterns."""
-     print("=" * 60)
-     print("📤 Manual Upload Pattern")
-     print("=" * 60)
-
-     code_example = '''
-     from huggingface_hub import HfApi
-
-     api = HfApi()
-
-     # Method 1: Upload a single file
-     api.upload_file(
-         path_or_fileobj="data.json",
-         path_in_repo="data/data.json",
-         repo_id="username/my-dataset",
-         repo_type="dataset",
-     )
-
-     # Method 2: Upload an entire folder
-     api.upload_folder(
-         folder_path="./my_data",
-         repo_id="username/my-dataset",
-         repo_type="dataset",
-     )
-
-     # Method 3: Large folder (resumable)
-     api.upload_large_folder(
-         repo_id="username/huge-dataset",
-         folder_path="/path/to/huge/folder",
-         repo_type="dataset",
-         num_workers=4,
-     )
-     '''
-
-     print("💡 Manual Upload Options:")
-     print(code_example)
-     print()
-
-
- def demo_use_cases():
-     """Show common use cases."""
-     print("=" * 60)
-     print("🎯 Common Use Cases")
-     print("=" * 60)
-
-     use_cases = {
-         "1. Gradio Space User Feedback": {
-             "description": "Collect and store user interactions",
-             "pattern": "CommitScheduler",
-             "frequency": "Every 10-15 minutes",
-             "format": "JSON Lines (.jsonl)",
-         },
-         "2. Training Logs": {
-             "description": "Store model training metrics over time",
-             "pattern": "CommitScheduler",
-             "frequency": "Every 5 minutes",
-             "format": "JSON Lines (.jsonl) or CSV",
-         },
-         "3. A/B Testing Results": {
-             "description": "Track experiment variants and conversions",
-             "pattern": "CommitScheduler",
-             "frequency": "Every 10 minutes",
-             "format": "JSON Lines (.jsonl)",
-         },
-         "4. Dataset Versioning": {
-             "description": "Create snapshots of evolving datasets",
-             "pattern": "Manual upload_folder()",
-             "frequency": "On-demand",
-             "format": "Any format (Parquet recommended)",
-         },
-         "5. Image Collection": {
-             "description": "Archive images periodically",
-             "pattern": "Custom scheduler (zipped files)",
-             "frequency": "Every 15 minutes",
-             "format": "ZIP archives",
-         },
-     }
-
-     for use_case, details in use_cases.items():
-         print(f"\n{use_case}")
-         print(f"   📝 {details['description']}")
-         print(f"   🔧 Pattern: {details['pattern']}")
-         print(f"   ⏱️  Frequency: {details['frequency']}")
-         print(f"   📄 Format: {details['format']}")
-
-     print()
-
-
- def demo_best_practices():
-     """Show best practices."""
-     print("=" * 60)
-     print("⭐ Best Practices")
-     print("=" * 60)
-
-     practices = [
-         "✅ Use UUID filenames to avoid conflicts across restarts",
-         "✅ Always use scheduler.lock for thread-safe writes",
-         "✅ Use JSON Lines (.jsonl) for structured append-only data",
-         "✅ Set the minimum upload frequency to 5 minutes",
-         "✅ Never delete or overwrite files with CommitScheduler",
-         "✅ Use upload_large_folder() for datasets > 1 GB",
-         "✅ Store HF_TOKEN as an environment variable or Space secret",
-         "✅ Use the Parquet format for very large tabular datasets",
-     ]
-
-     for practice in practices:
-         print(f"  {practice}")
-
-     print()
-
-
- def main():
-     """Run all demonstrations."""
-     print("\n" + "=" * 60)
-     print("🚀 Hugging Face Dataset Storage Skill Demo")
-     print("=" * 60)
-     print()
-
-     demo_file_structure()
-     demo_scheduler_pattern()
-     demo_manual_upload_pattern()
-     demo_use_cases()
-     demo_best_practices()
-
-     print("=" * 60)
-     print("✅ Demo Complete!")
-     print("=" * 60)
-     print()
-     print("📚 For more information, see:")
-     print("   - examples.md: Complete working examples")
-     print("   - reference.md: Detailed API documentation")
-     print("   - SKILL.md: Quick start guide")
-     print()
-
-
- if __name__ == "__main__":
-     main()
.claude/skills/hf-dataset-storage/demo_data/sample_data.jsonl DELETED
@@ -1,3 +0,0 @@
- {"id": 1, "timestamp": "2025-12-31T12:54:45.073335", "value": "first entry"}
- {"id": 2, "timestamp": "2025-12-31T12:54:45.073347", "value": "second entry"}
- {"id": 3, "timestamp": "2025-12-31T12:54:45.073348", "value": "third entry"}
.claude/skills/hf-dataset-storage/examples.md DELETED
@@ -1,728 +0,0 @@
- # Complete Examples for HF Dataset Storage
-
- This file contains complete, working examples for implementing dataset storage in Hugging Face Spaces.
-
- ## Example 1: Complete Gradio App with User Feedback Storage
-
- This example shows a complete Gradio application that collects user feedback and saves it to a dataset.
-
- ```python
- # app.py
- import json
- import uuid
- from pathlib import Path
- from datetime import datetime
- import gradio as gr
- from huggingface_hub import CommitScheduler
-
- # ============================================================================
- # Setup Dataset Storage
- # ============================================================================
-
- # Create a unique feedback file using a UUID to avoid conflicts
- feedback_file = Path("user_feedback") / f"feedback_{uuid.uuid4()}.jsonl"
- feedback_folder = feedback_file.parent
-
- # Create the folder if it doesn't exist
- feedback_folder.mkdir(parents=True, exist_ok=True)
-
- # Initialize the CommitScheduler
- # This will automatically upload data every 10 minutes
- scheduler = CommitScheduler(
-     repo_id="your-username/app-feedback-dataset",  # Replace with your repo
-     repo_type="dataset",
-     folder_path=feedback_folder,
-     path_in_repo="feedback",
-     every=10,  # Upload every 10 minutes
- )
-
- print(f"✅ Dataset storage initialized. Data will be saved to: {scheduler.repo_id}")
-
- # ============================================================================
- # Application Logic
- # ============================================================================
-
- def translate_text(text, target_language):
-     """
-     Mock translation function. Replace with your actual model/API.
-     """
-     # Simulated translations
-     translations = {
-         "French": f"[FR] {text}",
-         "Spanish": f"[ES] {text}",
-         "German": f"[DE] {text}",
-     }
-     return translations.get(target_language, text)
-
- def save_feedback(input_text, translation, language, rating, comments):
-     """
-     Save user feedback to the dataset with thread safety.
-     """
-     if not input_text or not translation:
-         return "⚠️ No data to save"
-
-     feedback_data = {
-         "timestamp": datetime.now().isoformat(),
-         "input_text": input_text,
-         "translation": translation,
-         "target_language": language,
-         "rating": rating,
-         "comments": comments,
-         "session_id": str(uuid.uuid4()),
-     }
-
-     # Use the scheduler lock for thread-safe writes
-     with scheduler.lock:
-         with feedback_file.open("a") as f:
-             f.write(json.dumps(feedback_data))
-             f.write("\n")
-
-     return "✅ Feedback saved! Thank you!"
-
- # ============================================================================
- # Gradio Interface
- # ============================================================================
-
- with gr.Blocks(title="Translation App with Feedback") as demo:
-     gr.Markdown("# Translation App")
-     gr.Markdown("Translate text and provide feedback to help us improve!")
-
-     with gr.Row():
-         with gr.Column():
-             input_text = gr.Textbox(
-                 label="Enter text to translate",
-                 placeholder="Type something...",
-                 lines=3
-             )
-             language = gr.Dropdown(
-                 choices=["French", "Spanish", "German"],
-                 label="Target Language",
-                 value="French"
-             )
-             translate_btn = gr.Button("Translate", variant="primary")
-
-         with gr.Column():
-             output_text = gr.Textbox(
-                 label="Translation",
-                 lines=3,
-                 interactive=False
-             )
-
-     gr.Markdown("### How was the translation?")
-
-     with gr.Row():
-         rating = gr.Slider(
-             minimum=1,
-             maximum=5,
-             step=1,
-             label="Rating (1-5 stars)",
-             value=3
-         )
-         comments = gr.Textbox(
-             label="Additional comments (optional)",
-             placeholder="Any suggestions?",
-             lines=2
-         )
-
-     feedback_status = gr.Textbox(label="Status", interactive=False)
-     submit_feedback_btn = gr.Button("Submit Feedback", variant="secondary")
-
-     # Connect the functions
-     translate_btn.click(
-         fn=translate_text,
-         inputs=[input_text, language],
-         outputs=output_text
-     )
-
-     submit_feedback_btn.click(
-         fn=save_feedback,
-         inputs=[input_text, output_text, language, rating, comments],
-         outputs=feedback_status
-     )
-
-     gr.Markdown("---")
-     gr.Markdown(
-         f"💾 Feedback is automatically saved to the dataset: "
-         f"[{scheduler.repo_id}](https://huggingface.co/datasets/{scheduler.repo_id})"
-     )
-
- if __name__ == "__main__":
-     demo.launch()
- ```
-
- ### Requirements for Example 1
-
- ```toml
- # pyproject.toml
- [project]
- dependencies = [
-     "gradio>=4.0.0",
-     "huggingface_hub>=0.20.0",
- ]
- ```
-
- ---
-
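Before appending, it can pay to validate the record shape so malformed rows never reach the dataset. A small hypothetical helper (not part of the app above), using the same field names:

```python
def validate_feedback(record: dict) -> list:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    for key in ("timestamp", "input_text", "translation", "rating"):
        if key not in record:
            problems.append(f"missing field: {key}")
    rating = record.get("rating")
    if rating is not None and not (1 <= rating <= 5):
        problems.append("rating out of range 1-5")
    return problems

ok = validate_feedback({"timestamp": "2024-01-01T00:00:00", "input_text": "hi",
                        "translation": "salut", "rating": 4})
bad = validate_feedback({"rating": 9})
```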
- ## Example 2: Training Logger with Dataset Storage
-
- This example shows how to log training metrics to a dataset during model training.
-
- ```python
- # train.py
- import json
- import time
- from pathlib import Path
- from datetime import datetime
- from tqdm import tqdm
- from huggingface_hub import CommitScheduler
-
- # ============================================================================
- # Setup Dataset Storage for Training Logs
- # ============================================================================
-
- log_folder = Path("training_logs")
- log_folder.mkdir(exist_ok=True)
-
- # Create separate files for different log types
- metrics_file = log_folder / "metrics.jsonl"
- checkpoints_file = log_folder / "checkpoints.jsonl"
-
- # Initialize the scheduler - uploads every 5 minutes during training
- scheduler = CommitScheduler(
-     repo_id="your-username/training-logs",
-     repo_type="dataset",
-     folder_path=log_folder,
-     path_in_repo="runs",
-     every=5,
- )
-
- print(f"📊 Training logs will be saved to: {scheduler.repo_id}")
-
- # ============================================================================
- # Training Configuration
- # ============================================================================
-
- config = {
-     "model": "my-model-v1",
-     "learning_rate": 0.001,
-     "batch_size": 32,
-     "num_epochs": 10,
-     "dataset": "training-data-v1",
- }
-
- # Save the configuration
- with scheduler.lock:
-     config_file = log_folder / "config.json"
-     with config_file.open("w") as f:
-         json.dump(config, f, indent=2)
-
- # ============================================================================
- # Training Functions
- # ============================================================================
-
- def log_metrics(epoch, step, loss, accuracy, learning_rate):
-     """Log training metrics."""
-     metrics = {
-         "timestamp": datetime.now().isoformat(),
-         "epoch": epoch,
-         "step": step,
-         "loss": float(loss),
-         "accuracy": float(accuracy),
-         "learning_rate": learning_rate,
-     }
-
-     with scheduler.lock:
-         with metrics_file.open("a") as f:
-             f.write(json.dumps(metrics))
-             f.write("\n")
-
- def log_checkpoint(epoch, model_path, metrics):
-     """Log checkpoint information."""
-     checkpoint_info = {
-         "timestamp": datetime.now().isoformat(),
-         "epoch": epoch,
-         "model_path": model_path,
-         "metrics": metrics,
-     }
-
-     with scheduler.lock:
-         with checkpoints_file.open("a") as f:
-             f.write(json.dumps(checkpoint_info))
-             f.write("\n")
-
- def train_epoch(epoch, num_steps=100):
-     """Mock training epoch."""
-     epoch_loss = 0
-     pbar = tqdm(range(num_steps), desc=f"Epoch {epoch}")
-
-     for step in pbar:
-         # Simulate training
-         time.sleep(0.1)
-
-         # Mock metrics
-         loss = 1.0 / (step + 1 + epoch * 10)
-         accuracy = min(0.95, 0.5 + step * 0.005 + epoch * 0.05)
-
-         epoch_loss += loss
-
-         # Log every 10 steps
-         if step % 10 == 0:
-             log_metrics(
-                 epoch=epoch,
-                 step=step,
-                 loss=loss,
-                 accuracy=accuracy,
-                 learning_rate=config["learning_rate"],
-             )
-
-         pbar.set_postfix({"loss": f"{loss:.4f}", "acc": f"{accuracy:.4f}"})
-
-     return epoch_loss / num_steps, accuracy
-
- # ============================================================================
- # Main Training Loop
- # ============================================================================
-
- def main():
-     print("🚀 Starting training...")
-
-     for epoch in range(config["num_epochs"]):
-         print(f"\n📍 Epoch {epoch + 1}/{config['num_epochs']}")
-
-         # Train for one epoch
-         avg_loss, final_accuracy = train_epoch(epoch)
-
-         # Log a checkpoint
-         checkpoint_path = f"checkpoints/model_epoch_{epoch}.pt"
-         log_checkpoint(
-             epoch=epoch,
-             model_path=checkpoint_path,
-             metrics={"loss": avg_loss, "accuracy": final_accuracy},
-         )
-
-         print(f"✅ Epoch {epoch + 1} complete - Loss: {avg_loss:.4f}, Acc: {final_accuracy:.4f}")
-
-     print(f"\n🎉 Training complete! Logs saved to: {scheduler.repo_id}")
-
-     # The scheduler uploads automatically; give the final batch time to go out
-     print("📤 Uploading final logs...")
-     time.sleep(2)
-
- if __name__ == "__main__":
-     main()
- ```
-
- ---
-
- ## Example 3: Dataset Snapshot Saver
-
- This example shows how to create versioned snapshots of data.
-
- ```python
- # snapshot_saver.py
- import json
- from pathlib import Path
- from datetime import datetime
- from huggingface_hub import HfApi
- from tqdm import tqdm
-
- class DatasetSnapshotSaver:
-     """Save versioned snapshots of data to Hugging Face datasets."""
-
-     def __init__(self, repo_id, repo_type="dataset"):
-         self.api = HfApi()
-         self.repo_id = repo_id
-         self.repo_type = repo_type
-
-         # Create the repo if it doesn't exist
-         try:
-             self.api.create_repo(
-                 repo_id=repo_id,
-                 repo_type=repo_type,
-                 exist_ok=True,
-             )
-             print(f"✅ Repository ready: {repo_id}")
-         except Exception as e:
-             print(f"❌ Error creating repo: {e}")
-
-     def save_snapshot(self, data_folder, snapshot_name=None):
-         """
-         Save a snapshot of the data folder to the dataset.
-
-         Args:
-             data_folder: Path to the local folder containing data
-             snapshot_name: Optional custom name; defaults to a timestamp
-         """
-         if snapshot_name is None:
-             snapshot_name = datetime.now().strftime("%Y%m%d_%H%M%S")
-
-         data_path = Path(data_folder)
-         if not data_path.exists():
-             raise ValueError(f"Data folder not found: {data_folder}")
-
-         print(f"📸 Creating snapshot: {snapshot_name}")
-
-         # Upload the folder
-         self.api.upload_folder(
-             folder_path=str(data_path),
-             path_in_repo=f"snapshots/{snapshot_name}",
-             repo_id=self.repo_id,
-             repo_type=self.repo_type,
-             commit_message=f"Snapshot: {snapshot_name}",
-         )
-
-         print(f"✅ Snapshot saved: snapshots/{snapshot_name}")
-         return snapshot_name
-
-     def save_metadata(self, snapshot_name, metadata):
-         """Save metadata for a snapshot."""
-         metadata_content = json.dumps(metadata, indent=2)
-
-         self.api.upload_file(
-             path_or_fileobj=metadata_content.encode(),
-             path_in_repo=f"snapshots/{snapshot_name}/metadata.json",
-             repo_id=self.repo_id,
-             repo_type=self.repo_type,
-             commit_message=f"Add metadata for {snapshot_name}",
-         )
-
-         print(f"✅ Metadata saved for snapshot: {snapshot_name}")
-
- # ============================================================================
- # Usage Example
- # ============================================================================
-
- if __name__ == "__main__":
-     # Initialize the saver
-     saver = DatasetSnapshotSaver(
-         repo_id="your-username/data-snapshots"
-     )
-
-     # Create sample data
-     data_folder = Path("./sample_data")
-     data_folder.mkdir(exist_ok=True)
-
-     # Generate some sample files
-     print("📝 Generating sample data...")
-     for i in tqdm(range(10)):
-         sample_file = data_folder / f"data_{i}.json"
-         with sample_file.open("w") as f:
-             json.dump({"id": i, "value": i * 10}, f)
-
-     # Save a snapshot
-     snapshot_name = saver.save_snapshot(
-         data_folder=data_folder,
-         snapshot_name="initial_snapshot",
-     )
-
-     # Save metadata
-     saver.save_metadata(
-         snapshot_name=snapshot_name,
-         metadata={
-             "created_at": datetime.now().isoformat(),
-             "num_files": 10,
-             "description": "Initial data snapshot",
-             "version": "1.0",
-         },
-     )
-
-     print(f"\n🎉 Complete! View at: https://huggingface.co/datasets/{saver.repo_id}")
- ```
-
- ---
-
435
- ## Example 4: Image Collection Archiver
436
-
437
- This example shows how to collect images and periodically archive them to a dataset.
438
-
439
- ```python
440
- # image_archiver.py
441
- import zipfile
442
- import tempfile
443
- from pathlib import Path
444
- from datetime import datetime
445
- from huggingface_hub import CommitScheduler
446
-
447
- class ImageArchiveScheduler(CommitScheduler):
448
- """
449
- Custom scheduler that collects images and uploads them as ZIP archives.
450
- """
451
-
452
- def push_to_hub(self):
453
- """Override to zip images before uploading."""
454
-
455
- # Find all image files
456
- image_extensions = ["*.png", "*.jpg", "*.jpeg", "*.gif"]
457
- image_files = []
458
- for ext in image_extensions:
459
- image_files.extend(list(self.folder_path.glob(ext)))
460
-
461
- if len(image_files) == 0:
462
- print("No images to archive")
463
- return None
464
-
465
- print(f"📦 Archiving {len(image_files)} images...")
466
-
467
- # Create ZIP archive
468
- timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
469
- archive_name = f"images_{timestamp}.zip"
470
-
471
- with tempfile.TemporaryDirectory() as tmpdir:
472
- archive_path = Path(tmpdir) / archive_name
473
-
474
- with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zip_file:
475
- for img_file in image_files:
476
- zip_file.write(
477
- filename=img_file,
478
- arcname=f"{timestamp}/{img_file.name}"
479
- )
480
-
481
- # Upload archive
482
- self.api.upload_file(
483
- path_or_fileobj=str(archive_path),
484
- path_in_repo=f"{self.path_in_repo}/{archive_name}",
485
- repo_id=self.repo_id,
486
- repo_type=self.repo_type,
487
- commit_message=f"Archive {len(image_files)} images from {timestamp}"
488
- )
489
-
490
- # Delete local images after successful upload
491
- for img_file in image_files:
492
- img_file.unlink()
493
-
494
- print(f"✅ Archived and uploaded {len(image_files)} images")
495
-
496
- # ============================================================================
497
- # Usage Example
498
- # ============================================================================
499
-
500
- if __name__ == "__main__":
501
- import time
502
- from PIL import Image
503
- import numpy as np
504
-
505
- # Setup image folder
506
- image_folder = Path("./collected_images")
507
- image_folder.mkdir(exist_ok=True)
508
-
509
- # Initialize custom scheduler
510
- scheduler = ImageArchiveScheduler(
511
- repo_id="your-username/image-archives",
512
- repo_type="dataset",
513
- folder_path=image_folder,
514
- path_in_repo="archives",
515
- every=5, # Archive every 5 minutes
516
- )
517
-
518
- print(f"📸 Image archiver started. Saving to: {scheduler.repo_id}")
519
-
520
- # Simulate image collection
521
- print("Generating sample images...")
522
- for i in range(5):
523
- # Create a random image
524
- img_array = np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8)
525
- img = Image.fromarray(img_array)
526
-
527
- # Save image
528
- img_path = image_folder / f"image_{i}_{datetime.now().strftime('%H%M%S')}.png"
529
- img.save(img_path)
530
- print(f" Generated: {img_path.name}")
531
-
532
- time.sleep(2)
533
-
534
- print("\n⏳ Waiting for scheduler to archive images...")
535
- print(" (In production, your app would continue running)")
536
-
537
- # In a real application, the scheduler runs in the background
538
- # and you can continue processing
539
- ```
540
-
541
- ---
542
-
543
- ## Example 5: A/B Testing Results Collector
544
-
545
- This example shows how to collect A/B testing results from a Gradio Space.
546
-
547
- ```python
548
- # ab_testing_app.py
549
- import json
550
- import uuid
551
- import random
552
- from pathlib import Path
553
- from datetime import datetime
554
- import gradio as gr
555
- from huggingface_hub import CommitScheduler
556
-
557
- # ============================================================================
558
- # Setup Dataset Storage for A/B Testing
559
- # ============================================================================
560
-
561
- results_folder = Path("ab_test_results")
562
- results_folder.mkdir(exist_ok=True)
563
-
564
- results_file = results_folder / f"results_{uuid.uuid4()}.jsonl"
565
-
566
- scheduler = CommitScheduler(
567
- repo_id="your-username/ab-test-results",
568
- repo_type="dataset",
569
- folder_path=results_folder,
570
- path_in_repo="experiments",
571
- every=10,
572
- )
573
-
574
- print(f"📊 A/B test results will be saved to: {scheduler.repo_id}")
575
-
576
- # ============================================================================
577
- # A/B Testing Logic
578
- # ============================================================================
579
-
580
- def assign_variant():
581
- """Randomly assign user to variant A or B."""
582
- return random.choice(["A", "B"])
583
-
584
- def get_recommendation(user_input, variant):
585
- """
586
- Generate recommendation based on variant.
587
- Variant A: Conservative recommendations
588
- Variant B: Aggressive recommendations
589
- """
590
- if variant == "A":
591
- return f"Conservative recommendation for: {user_input}"
592
- else:
593
- return f"Aggressive recommendation for: {user_input}"
594
-
595
- def log_interaction(session_id, variant, user_input, recommendation, user_clicked):
596
- """Log A/B test interaction."""
597
- result = {
598
- "timestamp": datetime.now().isoformat(),
599
- "session_id": session_id,
600
- "variant": variant,
601
- "user_input": user_input,
602
- "recommendation": recommendation,
603
- "user_clicked": user_clicked,
604
- "conversion": user_clicked
605
- }
606
-
607
- with scheduler.lock:
608
- with results_file.open("a") as f:
609
- f.write(json.dumps(result))
610
- f.write("\n")
611
-
612
- # ============================================================================
613
- # Gradio Interface
614
- # ============================================================================
615
-
616
- def process_request(user_input, session_state):
617
- """Process user request and assign variant."""
618
- if session_state is None:
619
- session_state = {
620
- "session_id": str(uuid.uuid4()),
621
- "variant": assign_variant()
622
- }
623
-
624
- recommendation = get_recommendation(user_input, session_state["variant"])
625
-
626
- return (
627
- recommendation,
628
- f"You are in variant: {session_state['variant']}",
629
- session_state,
630
- session_state["session_id"],
631
- session_state["variant"],
632
- user_input,
633
- recommendation
634
- )
635
-
636
- def log_click(session_id, variant, user_input, recommendation):
637
- """Log when user clicks the recommendation."""
638
- if session_id:
639
- log_interaction(session_id, variant, user_input, recommendation, True)
640
- return "✅ Click logged!"
641
- return "⚠️ No session data"
642
-
643
- with gr.Blocks(title="A/B Testing Demo") as demo:
644
- gr.Markdown("# A/B Testing Demo")
645
- gr.Markdown("Test two different recommendation strategies")
646
-
647
- # Session state
648
- session_state = gr.State(None)
649
- session_id_state = gr.State(None)
650
- variant_state = gr.State(None)
651
- input_state = gr.State(None)
652
- recommendation_state = gr.State(None)
653
-
654
- with gr.Row():
655
- user_input = gr.Textbox(
656
- label="What are you looking for?",
657
- placeholder="Enter your query..."
658
- )
659
- submit_btn = gr.Button("Get Recommendation", variant="primary")
660
-
661
- recommendation_output = gr.Textbox(
662
- label="Recommendation",
663
- interactive=False
664
- )
665
-
666
- variant_display = gr.Textbox(
667
- label="Your Test Variant",
668
- interactive=False
669
- )
670
-
671
- click_btn = gr.Button("I like this recommendation!", variant="secondary")
672
- click_status = gr.Textbox(label="Status", interactive=False)
673
-
674
- # Connect functions
675
- submit_btn.click(
676
- fn=process_request,
677
- inputs=[user_input, session_state],
678
- outputs=[
679
- recommendation_output,
680
- variant_display,
681
- session_state,
682
- session_id_state,
683
- variant_state,
684
- input_state,
685
- recommendation_state
686
- ]
687
- )
688
-
689
- click_btn.click(
690
- fn=log_click,
691
- inputs=[session_id_state, variant_state, input_state, recommendation_state],
692
- outputs=click_status
693
- )
694
-
695
- gr.Markdown("---")
696
- gr.Markdown(f"📊 Results are saved to: [{scheduler.repo_id}](https://huggingface.co/datasets/{scheduler.repo_id})")
697
-
698
- if __name__ == "__main__":
699
- demo.launch()
700
- ```
701
-
702
- ---
703
-
704
- ## Running the Examples
705
-
706
- For any of these examples:
707
-
708
- 1. **Install dependencies:**
709
- ```bash
710
- uv add gradio huggingface_hub
711
- # For image example: uv add pillow numpy
712
- ```
713
-
714
- 2. **Login to Hugging Face:**
715
- ```bash
716
- huggingface-cli login
717
- ```
718
-
719
- 3. **Update repo_id:**
720
- Replace `"your-username/repo-name"` with your actual HuggingFace username and desired dataset name.
721
-
722
- 4. **Run the script:**
723
- ```bash
724
- uv run app.py
725
- ```
726
-
727
- 5. **View results:**
728
- Visit `https://huggingface.co/datasets/your-username/repo-name` to see your data!
.claude/skills/hf-dataset-storage/reference.md DELETED
@@ -1,767 +0,0 @@
- # API Reference for HF Dataset Storage
-
- This reference provides detailed documentation for all the APIs and configuration options related to Hugging Face dataset storage.
-
- ## Table of Contents
-
- 1. [CommitScheduler](#commitscheduler)
- 2. [HfApi Upload Methods](#hfapi-upload-methods)
- 3. [Commit Operations](#commit-operations)
- 4. [Authentication](#authentication)
- 5. [Configuration](#configuration)
- 6. [Error Handling](#error-handling)
-
- ---
-
- ## CommitScheduler
-
- The `CommitScheduler` class automatically uploads files to a dataset repository at regular intervals.
-
- ### Constructor
-
- ```python
- from huggingface_hub import CommitScheduler
-
- scheduler = CommitScheduler(
-     repo_id: str,
-     folder_path: str | Path,
-     *,
-     repo_type: str = "dataset",
-     revision: str = "main",
-     path_in_repo: str = ".",
-     every: int | float = 5,
-     token: str | None = None,
-     allow_patterns: str | List[str] | None = None,
-     ignore_patterns: str | List[str] | None = None,
- )
- ```
-
- ### Parameters
-
- | Parameter | Type | Default | Description |
- |-----------|------|---------|-------------|
- | `repo_id` | `str` | Required | Repository ID (e.g., "username/dataset-name") |
- | `folder_path` | `str \| Path` | Required | Local folder to monitor and upload |
- | `repo_type` | `str` | `"dataset"` | Type of repo: "dataset", "model", or "space" |
- | `revision` | `str` | `"main"` | Git revision/branch to commit to |
- | `path_in_repo` | `str` | `"."` | Path in the repo where files will be uploaded |
- | `every` | `int \| float` | `5` | Minutes between uploads (minimum 5 recommended) |
- | `token` | `str \| None` | `None` | HuggingFace token (uses cached token if None) |
- | `allow_patterns` | `str \| List[str] \| None` | `None` | Glob patterns for files to include |
- | `ignore_patterns` | `str \| List[str] \| None` | `None` | Glob patterns for files to exclude |
-
- ### Attributes
-
- | Attribute | Type | Description |
- |-----------|------|-------------|
- | `lock` | `threading.Lock` | Thread lock for safe concurrent writes |
- | `api` | `HfApi` | HuggingFace API client instance |
- | `repo_id` | `str` | The repository ID |
- | `folder_path` | `Path` | The monitored folder path |
-
- ### Methods
-
- #### `push_to_hub()`
-
- Manually trigger an upload. Called automatically by the scheduler.
-
- ```python
- scheduler.push_to_hub()
- ```
-
- **Note:** You can override this method in a subclass for custom behavior.
-
- ### Example: Basic Usage
-
- ```python
- from pathlib import Path
- from huggingface_hub import CommitScheduler
-
- # Create scheduler
- scheduler = CommitScheduler(
-     repo_id="username/my-dataset",
-     folder_path=Path("./data"),
-     every=10,
- )
-
- # Files in ./data will be uploaded every 10 minutes
- # Use scheduler.lock when writing to ensure thread safety
- ```
-
- ### Example: Custom Upload Logic
-
- ```python
- from huggingface_hub import CommitScheduler
- import zipfile
- from pathlib import Path
-
- class CustomScheduler(CommitScheduler):
-     def push_to_hub(self):
-         """Custom logic to zip files before upload."""
-         files = list(self.folder_path.glob("*.txt"))
-         if not files:
-             return None
-
-         # Create archive
-         archive_path = self.folder_path / "archive.zip"
-         with zipfile.ZipFile(archive_path, "w") as zf:
-             for file in files:
-                 zf.write(file, file.name)
-
-         # Upload using parent's API
-         self.api.upload_file(
-             path_or_fileobj=str(archive_path),
-             path_in_repo="archives/archive.zip",
-             repo_id=self.repo_id,
-             repo_type=self.repo_type,
-         )
-
-         # Cleanup
-         archive_path.unlink()
-         for file in files:
-             file.unlink()
- ```
-
- ---
-
- ## HfApi Upload Methods
-
- The `HfApi` class provides methods for uploading files and folders.
-
- ### upload_file()
-
- Upload a single file to a repository.
-
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi()
-
- api.upload_file(
-     path_or_fileobj: str | Path | bytes | BinaryIO,
-     path_in_repo: str,
-     repo_id: str,
-     *,
-     repo_type: str = "model",
-     revision: str = "main",
-     commit_message: str | None = None,
-     commit_description: str | None = None,
-     token: str | None = None,
-     run_as_future: bool = False,
- )
- ```
-
- #### Parameters
-
- | Parameter | Type | Description |
- |-----------|------|-------------|
- | `path_or_fileobj` | `str \| Path \| bytes \| BinaryIO` | File path or file-like object to upload |
- | `path_in_repo` | `str` | Destination path in the repository |
- | `repo_id` | `str` | Repository ID |
- | `repo_type` | `str` | "model", "dataset", or "space" |
- | `revision` | `str` | Branch/tag to commit to |
- | `commit_message` | `str \| None` | Custom commit message |
- | `commit_description` | `str \| None` | Extended commit description |
- | `token` | `str \| None` | Authentication token |
- | `run_as_future` | `bool` | Run upload in background (returns Future) |
-
- #### Returns
-
- - `str`: Commit hash (URL to the commit)
- - `concurrent.futures.Future`: If `run_as_future=True`
-
- #### Example
-
- ```python
- # Upload file from path
- api.upload_file(
-     path_or_fileobj="/path/to/file.json",
-     path_in_repo="data/file.json",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
-     commit_message="Add new data file"
- )
-
- # Upload bytes
- data = b'{"key": "value"}'
- api.upload_file(
-     path_or_fileobj=data,
-     path_in_repo="config.json",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
- )
-
- # Background upload
- future = api.upload_file(
-     path_or_fileobj="large_file.bin",
-     path_in_repo="large_file.bin",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
-     run_as_future=True,
- )
- # Do other work...
- future.result()  # Wait for completion
- ```
-
- ---
-
- ### upload_folder()
-
- Upload an entire folder to a repository.
-
- ```python
- api.upload_folder(
-     folder_path: str | Path,
-     repo_id: str,
-     *,
-     repo_type: str = "model",
-     revision: str = "main",
-     path_in_repo: str = ".",
-     commit_message: str | None = None,
-     commit_description: str | None = None,
-     token: str | None = None,
-     allow_patterns: str | List[str] | None = None,
-     ignore_patterns: str | List[str] | None = None,
-     delete_patterns: str | List[str] | None = None,
-     run_as_future: bool = False,
- )
- ```
-
- #### Additional Parameters
-
- | Parameter | Type | Description |
- |-----------|------|-------------|
- | `folder_path` | `str \| Path` | Local folder to upload |
- | `allow_patterns` | `str \| List[str] \| None` | Glob patterns to include |
- | `ignore_patterns` | `str \| List[str] \| None` | Glob patterns to exclude |
- | `delete_patterns` | `str \| List[str] \| None` | Patterns to delete from repo before upload |
-
- #### Example
-
- ```python
- # Upload entire folder
- api.upload_folder(
-     folder_path="./my_dataset",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
- )
-
- # Upload only CSV files
- api.upload_folder(
-     folder_path="./data",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
-     allow_patterns="*.csv",
- )
-
- # Upload and delete old files
- api.upload_folder(
-     folder_path="./new_data",
-     path_in_repo="data",
-     repo_id="username/my-dataset",
-     repo_type="dataset",
-     delete_patterns="*.old",  # Delete .old files first
- )
- ```
-
- ---
-
- ### upload_large_folder()
-
- Upload very large folders with resume capability.
-
- ```python
- api.upload_large_folder(
-     repo_id: str,
-     folder_path: str | Path,
-     *,
-     repo_type: str = "model",
-     revision: str = "main",
-     private: bool = False,
-     token: str | None = None,
-     allow_patterns: str | List[str] | None = None,
-     ignore_patterns: str | List[str] | None = None,
-     num_workers: int = 1,
- )
- ```
-
- #### Key Features
-
- - **Resumable**: Caches progress locally, can resume after interruption
- - **Multi-threaded**: Parallel uploads with `num_workers`
- - **Resilient**: Automatic retries on errors
-
- #### Parameters
-
- | Parameter | Type | Description |
- |-----------|------|-------------|
- | `num_workers` | `int` | Number of parallel upload threads |
-
- #### Limitations
-
- - Cannot set custom `path_in_repo` (upload to root)
- - Cannot set custom commit message
- - Cannot delete files while uploading
- - Cannot create PR directly
-
- #### Example
-
- ```python
- # Upload huge dataset
- api.upload_large_folder(
-     repo_id="username/huge-dataset",
-     folder_path="/data/massive_dataset",
-     repo_type="dataset",
-     num_workers=4,  # Use 4 parallel threads
- )
-
- # If interrupted, re-run the same command to resume
- ```
-
- ---
-
- ## Commit Operations
-
- For fine-grained control over commits, use `create_commit()` with operation objects.
-
- ### create_commit()
-
- Create a commit with multiple operations (add/delete/copy files).
-
- ```python
- api.create_commit(
-     repo_id: str,
-     operations: List[CommitOperation],
-     *,
-     commit_message: str,
-     commit_description: str | None = None,
-     repo_type: str = "model",
-     revision: str = "main",
-     token: str | None = None,
-     create_pr: bool = False,
- )
- ```
-
- ### Operation Types
-
- #### CommitOperationAdd
-
- Add or update a file.
-
- ```python
- from huggingface_hub import CommitOperationAdd
-
- op = CommitOperationAdd(
-     path_in_repo="path/to/file.txt",
-     path_or_fileobj="/local/path/file.txt"  # or bytes or file object
- )
- ```
-
- #### CommitOperationDelete
-
- Delete a file or folder.
-
- ```python
- from huggingface_hub import CommitOperationDelete
-
- op = CommitOperationDelete(
-     path_in_repo="path/to/delete.txt"  # or "folder/" for directories
- )
- ```
-
- #### CommitOperationCopy
-
- Copy a file within the repository.
-
- ```python
- from huggingface_hub import CommitOperationCopy
-
- op = CommitOperationCopy(
-     src_path_in_repo="original.txt",
-     path_in_repo="copy.txt",
-     src_revision="main"  # optional: copy from different branch
- )
- ```
-
- ### Example: Multi-operation Commit
-
- ```python
- from huggingface_hub import HfApi, CommitOperationAdd, CommitOperationDelete
-
- api = HfApi()
-
- operations = [
-     CommitOperationAdd(
-         path_in_repo="data/new_file.json",
-         path_or_fileobj="/local/new_file.json"
-     ),
-     CommitOperationAdd(
-         path_in_repo="config.yaml",
-         path_or_fileobj=b"key: value"
-     ),
-     CommitOperationDelete(
-         path_in_repo="old_data/"  # Delete entire folder
-     ),
- ]
-
- api.create_commit(
-     repo_id="username/my-dataset",
-     operations=operations,
-     commit_message="Update dataset files",
-     commit_description="Added new data and removed old files",
-     repo_type="dataset",
- )
- ```
-
- ---
-
- ## Authentication
-
- ### Method 1: CLI Login (Recommended)
-
- ```bash
- huggingface-cli login
- ```
-
- This caches your token locally. All subsequent API calls use this token automatically.
-
- ### Method 2: Environment Variable
-
- ```bash
- export HF_TOKEN="hf_..."
- ```
-
- ```python
- import os
- from huggingface_hub import HfApi
-
- api = HfApi(token=os.environ["HF_TOKEN"])
- ```
-
- ### Method 3: Programmatic Token
-
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi(token="hf_your_token_here")
- ```
-
- ### For Hugging Face Spaces
-
- 1. Go to Space Settings → Repository secrets
- 2. Add secret: `HF_TOKEN` = your token value
- 3. Access in code:
-
- ```python
- import os
- token = os.environ.get("HF_TOKEN")
- ```
-
- ---
-
- ## Configuration
-
- ### Environment Variables
-
- | Variable | Description | Default |
- |----------|-------------|---------|
- | `HF_TOKEN` | Authentication token | None |
- | `HF_HOME` | Cache directory | `~/.cache/huggingface` |
- | `HF_HUB_CACHE` | Hub cache directory | `$HF_HOME/hub` |
- | `HF_ENDPOINT` | Hub endpoint URL | `https://huggingface.co` |
- | `HF_XET_CACHE` | Xet cache directory | `$HF_HOME/xet` |
- | `HF_XET_HIGH_PERFORMANCE` | Enable high-performance mode | `0` |
-
- ### Example: Custom Cache Location
-
- ```python
- import os
-
- # Set cache to local SSD for better performance
- os.environ["HF_HOME"] = "/mnt/local-ssd/.cache/huggingface"
-
- from huggingface_hub import HfApi
- api = HfApi()
- ```
-
- ---
-
- ## Error Handling
-
- ### Common Errors and Solutions
-
- #### 1. Authentication Error
-
- ```python
- from huggingface_hub import HfApi
- from huggingface_hub.utils import HfHubHTTPError
-
- api = HfApi()
-
- try:
-     api.upload_file(
-         path_or_fileobj="file.txt",
-         path_in_repo="file.txt",
-         repo_id="username/dataset",
-         repo_type="dataset",
-     )
- except HfHubHTTPError as e:
-     if e.response.status_code == 401:
-         print("❌ Authentication failed. Run: huggingface-cli login")
-     else:
-         raise
- ```
-
- #### 2. Repository Not Found
-
- ```python
- from huggingface_hub import HfApi
- from huggingface_hub.utils import RepositoryNotFoundError
-
- api = HfApi()
-
- try:
-     api.upload_file(...)
- except RepositoryNotFoundError:
-     print("❌ Repository not found. Creating...")
-     api.create_repo(repo_id="username/dataset", repo_type="dataset")
-     api.upload_file(...)  # Retry
- ```
-
- #### 3. File Too Large
-
- ```python
- import os
-
- from huggingface_hub import HfApi
-
- api = HfApi()
-
- file_size = os.path.getsize("huge_file.bin")
-
- if file_size > 5 * 1024**3:  # > 5GB
-     print("⚠️ Large file detected, using upload_large_folder")
-     # Move file to folder and use upload_large_folder
- else:
-     api.upload_file(
-         path_or_fileobj="huge_file.bin",
-         path_in_repo="huge_file.bin",
-         repo_id="username/dataset",
-         repo_type="dataset",
-     )
- ```
-
- #### 4. Network Interruption
-
- ```python
- from huggingface_hub import HfApi
- import time
-
- api = HfApi()
-
- max_retries = 3
- for attempt in range(max_retries):
-     try:
-         api.upload_folder(
-             folder_path="./data",
-             repo_id="username/dataset",
-             repo_type="dataset",
-         )
-         break
-     except Exception as e:
-         if attempt < max_retries - 1:
-             wait_time = 2 ** attempt  # Exponential backoff
-             print(f"⚠️ Upload failed, retrying in {wait_time}s...")
-             time.sleep(wait_time)
-         else:
-             print("❌ Upload failed after all retries")
-             raise
- ```
-
- ---
-
- ## Advanced Patterns
-
- ### Pattern 1: Atomic Updates
-
- Ensure all files are updated together or not at all.
-
- ```python
- from huggingface_hub import HfApi, CommitOperationAdd
-
- api = HfApi()
-
- # Prepare all operations
- operations = [
-     CommitOperationAdd("file1.json", path_or_fileobj=data1),
-     CommitOperationAdd("file2.json", path_or_fileobj=data2),
-     CommitOperationAdd("file3.json", path_or_fileobj=data3),
- ]
-
- # Single atomic commit
- api.create_commit(
-     repo_id="username/dataset",
-     operations=operations,
-     commit_message="Atomic update of all files",
-     repo_type="dataset",
- )
- ```
-
- ### Pattern 2: Concurrent Uploads
-
- Upload multiple files in parallel.
-
- ```python
- from huggingface_hub import HfApi
- from concurrent.futures import ThreadPoolExecutor
-
- api = HfApi()
- files = ["file1.txt", "file2.txt", "file3.txt"]
-
- def upload_file(filename):
-     api.upload_file(
-         path_or_fileobj=filename,
-         path_in_repo=filename,
-         repo_id="username/dataset",
-         repo_type="dataset",
-     )
-
- with ThreadPoolExecutor(max_workers=3) as executor:
-     executor.map(upload_file, files)
- ```
-
- ### Pattern 3: Progressive Dataset Building
-
- Build dataset incrementally with versioning.
-
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi()
-
- for version in range(1, 11):
-     # Generate data for this version
-     data = generate_data(version)
-
-     # Upload to versioned path
-     api.upload_file(
-         path_or_fileobj=data,
-         path_in_repo=f"versions/v{version}/data.json",
-         repo_id="username/dataset",
-         repo_type="dataset",
-         commit_message=f"Add version {version}",
-     )
- ```
-
- ---
-
- ## Performance Tips
-
- ### 1. Use High-Performance Mode
-
- For maximum upload speed (uses all CPU cores and bandwidth):
-
- ```bash
- export HF_XET_HIGH_PERFORMANCE=1
- ```
-
- ### 2. Local Cache for Cluster Uploads
-
- When uploading from distributed filesystems:
-
- ```bash
- # Point cache to local SSD, not network filesystem
- export HF_XET_CACHE=/local-ssd/.cache/xet
- ```
-
- ### 3. Batch Small Files
-
- Instead of uploading thousands of small files individually:
-
- ```python
- import zipfile
- from huggingface_hub import HfApi
-
- # Zip small files
- with zipfile.ZipFile("archive.zip", "w") as zf:
-     for file in small_files:
-         zf.write(file)
-
- # Upload single archive
- api.upload_file(
-     path_or_fileobj="archive.zip",
-     path_in_repo="data/archive.zip",
-     repo_id="username/dataset",
-     repo_type="dataset",
- )
- ```
-
- ### 4. Use Background Uploads
-
- Don't block your main thread:
-
- ```python
- from huggingface_hub import HfApi
-
- api = HfApi()
-
- # Start upload in background
- future = api.upload_folder(
-     folder_path="./data",
-     repo_id="username/dataset",
-     repo_type="dataset",
-     run_as_future=True,
- )
-
- # Do other work
- process_more_data()
-
- # Wait for completion when ready
- future.result()
- ```
-
- ---
-
- ## Comparison Table
-
- | Feature | CommitScheduler | upload_folder() | upload_large_folder() |
- |---------|----------------|-----------------|----------------------|
- | Automatic uploads | ✅ Yes | ❌ No | ❌ No |
- | Resumable | ❌ No | ❌ No | ✅ Yes |
- | Custom commit message | ❌ No | ✅ Yes | ❌ No |
- | Background operation | ✅ Yes | ✅ Yes (with flag) | ❌ No |
- | Path in repo | ✅ Yes | ✅ Yes | ❌ No (root only) |
- | Multi-threaded | ❌ No | ❌ No | ✅ Yes |
- | Best for | Continuous logging | One-time uploads | Huge datasets |
-
- ---
-
- ## Quick Reference
-
- ### Upload Single File
- ```python
- api.upload_file(path_or_fileobj="file.txt", path_in_repo="file.txt",
-                 repo_id="user/dataset", repo_type="dataset")
- ```
-
- ### Upload Folder
- ```python
- api.upload_folder(folder_path="./data", repo_id="user/dataset",
-                   repo_type="dataset")
- ```
-
- ### Scheduled Uploads
- ```python
- scheduler = CommitScheduler(repo_id="user/dataset", folder_path="./data",
-                             every=10, repo_type="dataset")
- ```
-
- ### Background Upload
- ```python
- future = api.upload_folder(..., run_as_future=True)
- future.result()  # Wait for completion
- ```
-
- ### Large Folder
- ```python
- api.upload_large_folder(repo_id="user/dataset", folder_path="./big_data",
-                         repo_type="dataset", num_workers=4)
- ```
app.py CHANGED
@@ -85,7 +85,7 @@ def calculate_table(matches_list):
         all_teams.add(match[2])  # away team
 
     # Initialize stats for all teams
-    table = {t: {"P": 0, "W": 0, "D": 0, "L": 0, "GF": 0, "GA": 0, "Pts": 0, "GPM": 0.0, "WP": 0.0}
+    table = {t: {"P": 0, "W": 0, "D": 0, "L": 0, "GF": 0, "GA": 0, "Pts": 0, "GPM": 0.0, "GAM": 0.0, "GDM": 0.0, "WP": 0.0}
              for t in all_teams}
 
     # Process each match
@@ -117,6 +117,8 @@ def calculate_table(matches_list):
     for t in all_teams:
         if table[t]["P"] > 0:
             table[t]["GPM"] = round(table[t]["GF"] / table[t]["P"], 2)
+            table[t]["GAM"] = round(table[t]["GA"] / table[t]["P"], 2)
+            table[t]["GDM"] = round((table[t]["GF"] - table[t]["GA"]) / table[t]["P"], 2)
             table[t]["WP"] = round((table[t]["W"] / table[t]["P"]) * 100, 2)
 
     # Create DataFrame
@@ -126,7 +128,7 @@ def calculate_table(matches_list):
     df.rename(columns={"index": "Team"}, inplace=True)
 
     # Sort by WP descending (as per requirements)
-    df = df[["Team", "WP", "GPM", "P", "W", "D", "L", "GF", "GA", "GD", "Pts"]]
+    df = df[["Team", "WP", "GPM", "GAM", "GDM", "P", "W", "D", "L", "GF", "GA", "GD", "Pts"]]
     df = df.sort_values(by=["WP"], ascending=False)
 
     return df
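The arithmetic behind the new per-match columns can be sketched in isolation. This is a minimal standalone illustration of the GPM (goals for per match), GAM (goals against per match), and GDM (goal difference per match) formulas the commit adds; the match tuples, team names, and dict layout here are illustrative, not taken from the app's actual data model:

```python
# Illustrative matches: (home, home_goals, away, away_goals)
matches = [
    ("Ajax", 3, "PSV", 1),
    ("PSV", 2, "Ajax", 2),
]

# Accumulate matches played, goals for, and goals against per team
table = {t: {"P": 0, "GF": 0, "GA": 0} for t in ("Ajax", "PSV")}
for home, hg, away, ag in matches:
    for team, gf, ga in ((home, hg, ag), (away, ag, hg)):
        table[team]["P"] += 1
        table[team]["GF"] += gf
        table[team]["GA"] += ga

# Per-match averages, rounded to 2 decimals as in the commit
for s in table.values():
    s["GPM"] = round(s["GF"] / s["P"], 2)
    s["GAM"] = round(s["GA"] / s["P"], 2)
    s["GDM"] = round((s["GF"] - s["GA"]) / s["P"], 2)

# Ajax: GF=5, GA=3 over 2 matches -> GPM 2.5, GAM 1.5, GDM 1.0
print(table["Ajax"])
```

Note that GDM is simply GPM minus GAM before rounding, so the three columns stay mutually consistent up to rounding error.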