Add GAM and GDM metrics to the table
.claude/skills/hf-dataset-storage/SKILL.md
DELETED
@@ -1,387 +0,0 @@

---
name: hf-dataset-storage
description: Implement persistent storage for Hugging Face Spaces using dataset storage. Use when working with HF Spaces persistence, saving space data to datasets, scheduled uploads to HuggingFace Hub, or when the user mentions dataset storage, space persistence, CommitScheduler, or backing up space data.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---

# Hugging Face Dataset Storage for Spaces

This skill helps you implement persistent storage for Hugging Face Spaces using dataset repositories as a data store.

## When to Use Dataset Storage

Use dataset storage for Hugging Face Spaces when:
- You need data to persist beyond the Space's lifecycle
- You want to collect user feedback or logs from a Space
- You need append-only storage for analytics or training data
- You want to avoid paying for persistent storage upgrades
- You need to version your data over time

## Quick Start

### 1. Install Required Package

```bash
uv add huggingface_hub
```

### 2. Basic Setup with CommitScheduler (Recommended)

For append-only data that should be uploaded periodically (e.g., logs, user feedback):

```python
import json
import uuid
from pathlib import Path
from huggingface_hub import CommitScheduler

# Create a unique file to avoid conflicts across restarts
feedback_file = Path("user_feedback/") / f"data_{uuid.uuid4()}.json"
feedback_folder = feedback_file.parent

# Schedule uploads every 10 minutes (minimum recommended: 5 minutes)
scheduler = CommitScheduler(
    repo_id="username/my-dataset",  # Will be created if it doesn't exist
    repo_type="dataset",
    folder_path=feedback_folder,
    path_in_repo="data",  # Upload to the /data folder in the dataset
    every=10,  # Upload every 10 minutes
)

# Append data with thread safety
def save_data(data_dict):
    """Save data to the file with a thread lock for concurrent writes."""
    with scheduler.lock:
        with feedback_file.open("a") as f:
            f.write(json.dumps(data_dict))
            f.write("\n")
```

### 3. Manual Upload Methods

For one-time or controlled uploads:

```python
from huggingface_hub import HfApi

api = HfApi()

# Upload a single file
api.upload_file(
    path_or_fileobj="/path/to/local/file.json",
    path_in_repo="data/file.json",
    repo_id="username/my-dataset",
    repo_type="dataset",
)

# Upload an entire folder
api.upload_folder(
    folder_path="/path/to/local/folder",
    path_in_repo="data",
    repo_id="username/my-dataset",
    repo_type="dataset",
)
```

## Authentication

Before uploading, you need to authenticate with Hugging Face:

### Option 1: Login via CLI

```bash
huggingface-cli login
```

### Option 2: Use Token Programmatically

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")
```

### Option 3: Set Environment Variable

```bash
export HF_TOKEN="hf_..."
```

For Spaces, add `HF_TOKEN` as a secret in Space settings.

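In application code, a small helper can fail fast with a clear message when the secret is missing. This is a sketch, not part of `huggingface_hub`; the `get_hf_token` name is hypothetical:

```python
import os

def get_hf_token() -> str:
    """Read the Hub token from the environment, failing fast if it is missing."""
    # When HF_TOKEN is set as a Space secret, it appears as an environment variable.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it as a Space secret or export it locally.")
    return token
```

Note that `huggingface_hub` clients also pick up `HF_TOKEN` from the environment on their own, so the helper only buys you an earlier, clearer error.
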
## Advanced Patterns

### Pattern 1: Gradio Space with User Feedback

```python
import json
import uuid
from datetime import datetime
from pathlib import Path
import gradio as gr
from huggingface_hub import CommitScheduler

# Setup
feedback_file = Path("user_feedback/") / f"data_{uuid.uuid4()}.json"
feedback_folder = feedback_file.parent

scheduler = CommitScheduler(
    repo_id="username/user-feedback-dataset",
    repo_type="dataset",
    folder_path=feedback_folder,
    path_in_repo="feedback",
    every=10,
)

def save_feedback(input_text, output_text, rating):
    """Save user feedback with thread safety."""
    with scheduler.lock:
        with feedback_file.open("a") as f:
            f.write(json.dumps({
                "input": input_text,
                "output": output_text,
                "rating": rating,
                "timestamp": datetime.now().isoformat()
            }))
            f.write("\n")

# Use in Gradio interface
with gr.Blocks() as demo:
    # ... define your Gradio UI (input_box, output_box, rating, submit_btn)
    submit_btn.click(save_feedback, inputs=[input_box, output_box, rating])

demo.launch()
```

### Pattern 2: Training Logs with Progress Tracking

```python
import json
from pathlib import Path
from huggingface_hub import CommitScheduler
from tqdm import tqdm

# Setup
log_file = Path("training_logs/") / "metrics.jsonl"
log_folder = log_file.parent
log_folder.mkdir(exist_ok=True)

scheduler = CommitScheduler(
    repo_id="username/training-logs",
    repo_type="dataset",
    folder_path=log_folder,
    path_in_repo="logs",
    every=5,  # Upload every 5 minutes
)

# Training loop (num_epochs, loss, and accuracy come from your training code)
for epoch in tqdm(range(num_epochs), desc="Training"):
    # ... training code ...

    # Log metrics
    with scheduler.lock:
        with log_file.open("a") as f:
            f.write(json.dumps({
                "epoch": epoch,
                "loss": loss,
                "accuracy": accuracy
            }))
            f.write("\n")
```

### Pattern 3: Large File Upload with Background Processing

```python
from huggingface_hub import HfApi

api = HfApi()

# Upload large files in the background (non-blocking)
future = api.upload_folder(
    repo_id="username/large-dataset",
    folder_path="./data",
    repo_type="dataset",
    run_as_future=True,  # Non-blocking upload
)

# Continue working while the upload happens
# ... do other work ...

# Wait for the upload to complete when needed
future.result()  # This blocks until the upload finishes
```

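`run_as_future=True` hands back a standard `concurrent.futures.Future`, so the workflow can be sketched with a stdlib executor; the `slow_upload` stand-in below is hypothetical, not a real Hub call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_upload(folder: str) -> str:
    """Stand-in for a long-running upload call."""
    time.sleep(0.1)
    return f"uploaded {folder}"

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(slow_upload, "./data")  # returns immediately

    # ... do other work while the "upload" runs ...

    result = future.result()  # blocks until the task finishes
    print(result)  # uploaded ./data
```

The real future from `upload_folder` works the same way: submit, keep working, then call `.result()` when you need to know the upload finished.
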
### Pattern 4: Scheduled Uploads with Multiple File Types

```python
import zipfile
import tempfile
from pathlib import Path
from huggingface_hub import CommitScheduler

class ImageArchiveScheduler(CommitScheduler):
    """Custom scheduler that zips images before uploading."""

    def push_to_hub(self):
        # Find all PNG files
        png_files = list(self.folder_path.glob("*.png"))
        if len(png_files) == 0:
            return None  # Skip if nothing to commit

        # Zip files
        with tempfile.TemporaryDirectory() as tmpdir:
            archive_path = Path(tmpdir) / "images.zip"
            with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zip_file:
                for png_file in png_files:
                    zip_file.write(filename=png_file, arcname=png_file.name)

            # Upload archive (while the temporary directory still exists)
            self.api.upload_file(
                path_or_fileobj=archive_path,
                path_in_repo=f"{self.path_in_repo}/images.zip",
                repo_id=self.repo_id,
                repo_type=self.repo_type,
            )

        # Clean up local files
        for png_file in png_files:
            png_file.unlink()

# Usage
scheduler = ImageArchiveScheduler(
    repo_id="username/image-dataset",
    repo_type="dataset",
    folder_path=Path("./images"),
    path_in_repo="archives",
    every=15,
)
```

## Best Practices

### 1. File Naming for Concurrent Access
Always use UUIDs or timestamps to avoid filename conflicts:
```python
import uuid
filename = f"data_{uuid.uuid4()}.json"
```

### 2. Thread Safety
Always use the scheduler lock when writing, so the background upload never reads a half-written file:
```python
with scheduler.lock:
    with file.open("a") as f:
        f.write(data)
```

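The scheduler's lock behaves like a plain `threading.Lock`, so the pattern can be demonstrated with the standard library alone. The sketch below writes to a hypothetical local file and does no Hub upload; the lock keeps each JSON line intact even with many writer threads:

```python
import json
import threading
from pathlib import Path

lock = threading.Lock()  # stands in for scheduler.lock
data_file = Path("demo_feedback.jsonl")  # hypothetical local file
data_file.unlink(missing_ok=True)  # start fresh for the demo

def save_data(record: dict) -> None:
    # Hold the lock for the whole append so lines from different threads never interleave.
    with lock:
        with data_file.open("a") as f:
            f.write(json.dumps(record))
            f.write("\n")

threads = [threading.Thread(target=save_data, args=({"i": i},)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

lines = data_file.read_text().splitlines()
print(len(lines))  # 20 complete JSON records
```
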
### 3. Append-Only Data
CommitScheduler assumes append-only operations:
- Create new files
- Append to existing files
- Never delete or overwrite files (this can corrupt the repo)

### 4. Upload Frequency
- Minimum recommended: 5 minutes
- For user-facing apps: 10-15 minutes
- For training logs: 5-10 minutes

### 5. Data Format
Use formats readable by the Datasets library:
- JSON Lines (`.jsonl`) for structured data
- CSV for tabular data
- Parquet for large datasets
- ZIP for grouped files

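JSON Lines files written this way can be read back with the standard library alone; a minimal sketch with hypothetical file contents:

```python
import json
from pathlib import Path

data_file = Path("feedback_sample.jsonl")  # hypothetical downloaded file
data_file.write_text(
    '{"input": "hi", "rating": 5}\n'
    '{"input": "bye", "rating": 3}\n'
)

# One JSON object per line, so parse line by line
records = [json.loads(line) for line in data_file.read_text().splitlines()]
print(len(records), records[0]["rating"])  # 2 5
```

The `datasets` library can load the same file directly, e.g. with `load_dataset("json", data_files=...)`.
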
### 6. Error Handling
The scheduler catches upload errors and retries on the next scheduled run. For critical data, enable logging so failures are visible:
```python
import logging

logging.basicConfig(level=logging.INFO)
# The scheduler will log errors automatically
```

### 7. Large Files
For very large datasets (>1GB), consider:
```python
from huggingface_hub import HfApi

api = HfApi()

# Upload large folders with automatic chunking
api.upload_large_folder(
    repo_id="username/huge-dataset",
    folder_path="./data",
    repo_type="dataset",
)
```

## Common Use Cases

### Use Case 1: A/B Testing Results
```python
# Save A/B test results from a Gradio Space
# (json, scheduler, and ab_test_file come from the setup above)
from datetime import datetime

def log_ab_test(user_id, variant, conversion):
    with scheduler.lock:
        with ab_test_file.open("a") as f:
            f.write(json.dumps({
                "user_id": user_id,
                "variant": variant,
                "conversion": conversion,
                "timestamp": datetime.now().isoformat()
            }))
            f.write("\n")
```

### Use Case 2: Model Predictions Storage
```python
# Store model predictions for analysis
# (json, scheduler, and predictions_file come from the setup above)
def save_prediction(input_data, prediction, confidence):
    with scheduler.lock:
        with predictions_file.open("a") as f:
            f.write(json.dumps({
                "input": input_data,
                "prediction": prediction,
                "confidence": confidence,
                "model_version": "v1.0"
            }))
            f.write("\n")
```

### Use Case 3: Dataset Versioning
```python
# Create versioned snapshots of data
from datetime import datetime
from huggingface_hub import HfApi

api = HfApi()

api.upload_folder(
    folder_path="./current_data",
    path_in_repo=f"snapshots/{datetime.now().strftime('%Y%m%d')}",
    repo_id="username/versioned-dataset",
    repo_type="dataset",
    commit_message=f"Snapshot for {datetime.now().date()}",
)
```

## Troubleshooting

### Issue: Upload fails with authentication error
**Solution:** Ensure you're logged in or have set the `HF_TOKEN` environment variable.

### Issue: Empty commits being created
**Solution:** The scheduler automatically skips empty commits. If you see them, check that files are being created correctly.

### Issue: Files not appearing in dataset
**Solution:** Wait for the next scheduled upload (check the `every` parameter). If files still don't appear, the scheduler may be encountering errors; enable logging to see them.

### Issue: Out of memory errors
**Solution:** Use `upload_large_folder()` for very large datasets, or upload files one at a time.

### Issue: Concurrent write conflicts
**Solution:** Always use `scheduler.lock` when writing to files that the scheduler is monitoring.

## References

- [HF Hub Dataset Storage Documentation](https://huggingface.co/docs/hub/spaces-storage#dataset-storage)
- [CommitScheduler API Reference](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.CommitScheduler)
- [Upload Guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload)
- [Space to Dataset Saver Example](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver)

## Example: Complete Gradio App with Dataset Storage

See [examples.md](examples.md) for a complete working example of a Gradio Space with dataset storage.
.claude/skills/hf-dataset-storage/demo.py
DELETED
@@ -1,217 +0,0 @@

#!/usr/bin/env python3
"""
Demo script showing HuggingFace Dataset Storage patterns.
This demonstrates the skill's capabilities without requiring actual HF authentication.
"""

import json
from pathlib import Path
from datetime import datetime


def demo_file_structure():
    """Show how to structure files for dataset storage."""
    print("=" * 60)
    print("📁 File Structure Demo")
    print("=" * 60)

    # Example 1: JSON Lines format (recommended for append-only data)
    demo_folder = Path("demo_data")
    demo_folder.mkdir(exist_ok=True)

    # Create sample JSONL file
    sample_file = demo_folder / "sample_data.jsonl"
    sample_data = [
        {"id": 1, "timestamp": datetime.now().isoformat(), "value": "first entry"},
        {"id": 2, "timestamp": datetime.now().isoformat(), "value": "second entry"},
        {"id": 3, "timestamp": datetime.now().isoformat(), "value": "third entry"},
    ]

    with sample_file.open("w") as f:
        for entry in sample_data:
            f.write(json.dumps(entry))
            f.write("\n")

    print(f"✅ Created sample JSONL file: {sample_file}")
    print(f"   Contains {len(sample_data)} entries\n")

    # Show the content
    print("📄 File contents:")
    print(sample_file.read_text())
    print()


def demo_scheduler_pattern():
    """Show the CommitScheduler pattern (without actual upload)."""
    print("=" * 60)
    print("🔄 CommitScheduler Pattern")
    print("=" * 60)

    code_example = '''
import json
from pathlib import Path
from huggingface_hub import CommitScheduler

# Setup
feedback_folder = Path("user_feedback")
feedback_file = feedback_folder / "data.jsonl"

# Create scheduler (uploads every 10 minutes)
scheduler = CommitScheduler(
    repo_id="username/my-dataset",
    repo_type="dataset",
    folder_path=feedback_folder,
    path_in_repo="feedback",
    every=10,
)

# Save data with thread safety
def save_feedback(data):
    with scheduler.lock:
        with feedback_file.open("a") as f:
            f.write(json.dumps(data))
            f.write("\\n")
'''

    print("💡 Recommended Pattern for Continuous Data Collection:")
    print(code_example)
    print()


def demo_manual_upload_pattern():
    """Show manual upload patterns."""
    print("=" * 60)
    print("📤 Manual Upload Pattern")
    print("=" * 60)

    code_example = '''
from huggingface_hub import HfApi

api = HfApi()

# Method 1: Upload single file
api.upload_file(
    path_or_fileobj="data.json",
    path_in_repo="data/data.json",
    repo_id="username/my-dataset",
    repo_type="dataset",
)

# Method 2: Upload entire folder
api.upload_folder(
    folder_path="./my_data",
    repo_id="username/my-dataset",
    repo_type="dataset",
)

# Method 3: Large folder (resumable)
api.upload_large_folder(
    repo_id="username/huge-dataset",
    folder_path="/path/to/huge/folder",
    repo_type="dataset",
    num_workers=4,
)
'''

    print("💡 Manual Upload Options:")
    print(code_example)
    print()


def demo_use_cases():
    """Show common use cases."""
    print("=" * 60)
    print("🎯 Common Use Cases")
    print("=" * 60)

    use_cases = {
        "1. Gradio Space User Feedback": {
            "description": "Collect and store user interactions",
            "pattern": "CommitScheduler",
            "frequency": "Every 10-15 minutes",
            "format": "JSON Lines (.jsonl)",
        },
        "2. Training Logs": {
            "description": "Store model training metrics over time",
            "pattern": "CommitScheduler",
            "frequency": "Every 5 minutes",
            "format": "JSON Lines (.jsonl) or CSV",
        },
        "3. A/B Testing Results": {
            "description": "Track experiment variants and conversions",
            "pattern": "CommitScheduler",
            "frequency": "Every 10 minutes",
            "format": "JSON Lines (.jsonl)",
        },
        "4. Dataset Versioning": {
            "description": "Create snapshots of evolving datasets",
            "pattern": "Manual upload_folder()",
            "frequency": "On-demand",
            "format": "Any format (Parquet recommended)",
        },
        "5. Image Collection": {
            "description": "Archive images periodically",
            "pattern": "Custom Scheduler (zip files)",
            "frequency": "Every 15 minutes",
            "format": "ZIP archives",
        },
    }

    for use_case, details in use_cases.items():
        print(f"\n{use_case}")
        print(f"   📝 {details['description']}")
        print(f"   🔧 Pattern: {details['pattern']}")
        print(f"   ⏱️ Frequency: {details['frequency']}")
        print(f"   📄 Format: {details['format']}")

    print()


def demo_best_practices():
    """Show best practices."""
    print("=" * 60)
    print("⭐ Best Practices")
    print("=" * 60)

    practices = [
        "✅ Use UUID filenames to avoid conflicts across restarts",
        "✅ Always use scheduler.lock for thread-safe writes",
        "✅ Use JSON Lines (.jsonl) for structured append-only data",
        "✅ Set minimum upload frequency to 5 minutes",
        "✅ Never delete or overwrite files with CommitScheduler",
        "✅ Use upload_large_folder() for datasets > 1GB",
        "✅ Store HF_TOKEN as environment variable or Space secret",
        "✅ Use Parquet format for very large tabular datasets",
    ]

    for practice in practices:
        print(f"  {practice}")

    print()


def main():
    """Run all demonstrations."""
    print("\n" + "=" * 60)
    print("🚀 HuggingFace Dataset Storage Skill Demo")
    print("=" * 60)
    print()

    demo_file_structure()
    demo_scheduler_pattern()
    demo_manual_upload_pattern()
    demo_use_cases()
    demo_best_practices()

    print("=" * 60)
    print("✅ Demo Complete!")
    print("=" * 60)
    print()
    print("📚 For more information, see:")
    print("   - examples.md: Complete working examples")
    print("   - reference.md: Detailed API documentation")
    print("   - SKILL.md: Quick start guide")
    print()


if __name__ == "__main__":
    main()
.claude/skills/hf-dataset-storage/demo_data/sample_data.jsonl
DELETED
@@ -1,3 +0,0 @@

{"id": 1, "timestamp": "2025-12-31T12:54:45.073335", "value": "first entry"}
{"id": 2, "timestamp": "2025-12-31T12:54:45.073347", "value": "second entry"}
{"id": 3, "timestamp": "2025-12-31T12:54:45.073348", "value": "third entry"}
.claude/skills/hf-dataset-storage/examples.md
DELETED
@@ -1,728 +0,0 @@

# Complete Examples for HF Dataset Storage
|
| 2 |
-
|
| 3 |
-
This file contains complete, working examples for implementing dataset storage in Hugging Face Spaces.
|
| 4 |
-
|
| 5 |
-
## Example 1: Complete Gradio App with User Feedback Storage
|
| 6 |
-
|
| 7 |
-
This example shows a complete Gradio application that collects user feedback and saves it to a dataset.
|
| 8 |
-
|
| 9 |
-
```python
|
| 10 |
-
# app.py
|
| 11 |
-
import json
|
| 12 |
-
import uuid
|
| 13 |
-
from pathlib import Path
|
| 14 |
-
from datetime import datetime
|
| 15 |
-
import gradio as gr
|
| 16 |
-
from huggingface_hub import CommitScheduler
|
| 17 |
-
|
| 18 |
-
# ============================================================================
|
| 19 |
-
# Setup Dataset Storage
|
| 20 |
-
# ============================================================================
|
| 21 |
-
|
| 22 |
-
# Create unique feedback file using UUID to avoid conflicts
|
| 23 |
-
feedback_file = Path("user_feedback") / f"feedback_{uuid.uuid4()}.jsonl"
|
| 24 |
-
feedback_folder = feedback_file.parent
|
| 25 |
-
|
| 26 |
-
# Create folder if it doesn't exist
|
| 27 |
-
feedback_folder.mkdir(parents=True, exist_ok=True)
|
| 28 |
-
|
| 29 |
-
# Initialize CommitScheduler
|
| 30 |
-
# This will automatically upload data every 10 minutes
|
| 31 |
-
scheduler = CommitScheduler(
|
| 32 |
-
repo_id="your-username/app-feedback-dataset", # Replace with your repo
|
| 33 |
-
repo_type="dataset",
|
| 34 |
-
folder_path=feedback_folder,
|
| 35 |
-
path_in_repo="feedback",
|
| 36 |
-
every=10, # Upload every 10 minutes
|
| 37 |
-
)
|
| 38 |
-
|
| 39 |
-
print(f"✅ Dataset storage initialized. Data will be saved to: {scheduler.repo_id}")
|
| 40 |
-
|
| 41 |
-
# ============================================================================
|
| 42 |
-
# Application Logic
|
| 43 |
-
# ============================================================================
|
| 44 |
-
|
| 45 |
-
def translate_text(text, target_language):
|
| 46 |
-
"""
|
| 47 |
-
Mock translation function. Replace with your actual model/API.
|
| 48 |
-
"""
|
| 49 |
-
# Simulated translations
|
| 50 |
-
translations = {
|
| 51 |
-
"French": f"[FR] {text}",
|
| 52 |
-
"Spanish": f"[ES] {text}",
|
| 53 |
-
"German": f"[DE] {text}",
|
| 54 |
-
}
|
| 55 |
-
return translations.get(target_language, text)
|
| 56 |
-
|
| 57 |
-
def save_feedback(input_text, translation, language, rating, comments):
|
| 58 |
-
"""
|
| 59 |
-
Save user feedback to the dataset with thread safety.
|
| 60 |
-
"""
|
| 61 |
-
if not input_text or not translation:
|
| 62 |
-
return "⚠️ No data to save"
|
| 63 |
-
|
| 64 |
-
feedback_data = {
|
| 65 |
-
"timestamp": datetime.now().isoformat(),
|
| 66 |
-
"input_text": input_text,
|
| 67 |
-
"translation": translation,
|
| 68 |
-
"target_language": language,
|
| 69 |
-
"rating": rating,
|
| 70 |
-
"comments": comments,
|
| 71 |
-
"session_id": str(uuid.uuid4())
|
| 72 |
-
}
|
| 73 |
-
|
| 74 |
-
# Use scheduler lock for thread-safe writes
|
| 75 |
-
with scheduler.lock:
|
| 76 |
-
with feedback_file.open("a") as f:
|
| 77 |
-
f.write(json.dumps(feedback_data))
|
| 78 |
-
f.write("\n")
|
| 79 |
-
|
| 80 |
-
return "✅ Feedback saved! Thank you!"
|
| 81 |
-
|
| 82 |
-
# ============================================================================
|
| 83 |
-
# Gradio Interface
|
| 84 |
-
# ============================================================================
|
| 85 |
-
|
| 86 |
-
with gr.Blocks(title="Translation App with Feedback") as demo:
|
| 87 |
-
gr.Markdown("# Translation App")
|
| 88 |
-
gr.Markdown("Translate text and provide feedback to help us improve!")
|
| 89 |
-
|
| 90 |
-
with gr.Row():
|
| 91 |
-
with gr.Column():
|
| 92 |
-
input_text = gr.Textbox(
|
| 93 |
-
label="Enter text to translate",
|
| 94 |
-
placeholder="Type something...",
|
| 95 |
-
                lines=3
            )
            language = gr.Dropdown(
                choices=["French", "Spanish", "German"],
                label="Target Language",
                value="French"
            )
            translate_btn = gr.Button("Translate", variant="primary")

        with gr.Column():
            output_text = gr.Textbox(
                label="Translation",
                lines=3,
                interactive=False
            )

            gr.Markdown("### How was the translation?")

            with gr.Row():
                rating = gr.Slider(
                    minimum=1,
                    maximum=5,
                    step=1,
                    label="Rating (1-5 stars)",
                    value=3
                )
                comments = gr.Textbox(
                    label="Additional comments (optional)",
                    placeholder="Any suggestions?",
                    lines=2
                )

            feedback_status = gr.Textbox(label="Status", interactive=False)
            submit_feedback_btn = gr.Button("Submit Feedback", variant="secondary")

    # Connect the functions
    translate_btn.click(
        fn=translate_text,
        inputs=[input_text, language],
        outputs=output_text
    )

    submit_feedback_btn.click(
        fn=save_feedback,
        inputs=[input_text, output_text, language, rating, comments],
        outputs=feedback_status
    )

    gr.Markdown("---")
    gr.Markdown(
        f"💾 Feedback is automatically saved to the dataset: "
        f"[{scheduler.repo_id}](https://huggingface.co/datasets/{scheduler.repo_id})"
    )

if __name__ == "__main__":
    demo.launch()
```

### Requirements for Example 1

```toml
# pyproject.toml
[project]
dependencies = [
    "gradio>=4.0.0",
    "huggingface_hub>=0.20.0",
]
```

---

## Example 2: Training Logger with Dataset Storage

This example shows how to log training metrics to a dataset during model training.

```python
# train.py
import json
import time
from pathlib import Path
from datetime import datetime
from tqdm import tqdm
from huggingface_hub import CommitScheduler

# ============================================================================
# Setup Dataset Storage for Training Logs
# ============================================================================

log_folder = Path("training_logs")
log_folder.mkdir(exist_ok=True)

# Create separate files for different log types
metrics_file = log_folder / "metrics.jsonl"
checkpoints_file = log_folder / "checkpoints.jsonl"

# Initialize scheduler - uploads every 5 minutes during training
scheduler = CommitScheduler(
    repo_id="your-username/training-logs",
    repo_type="dataset",
    folder_path=log_folder,
    path_in_repo="runs",
    every=5,
)

print(f"📊 Training logs will be saved to: {scheduler.repo_id}")

# ============================================================================
# Training Configuration
# ============================================================================

config = {
    "model": "my-model-v1",
    "learning_rate": 0.001,
    "batch_size": 32,
    "num_epochs": 10,
    "dataset": "training-data-v1"
}

# Save configuration
with scheduler.lock:
    config_file = log_folder / "config.json"
    with config_file.open("w") as f:
        json.dump(config, f, indent=2)

# ============================================================================
# Training Functions
# ============================================================================

def log_metrics(epoch, step, loss, accuracy, learning_rate):
    """Log training metrics."""
    metrics = {
        "timestamp": datetime.now().isoformat(),
        "epoch": epoch,
        "step": step,
        "loss": float(loss),
        "accuracy": float(accuracy),
        "learning_rate": learning_rate
    }

    with scheduler.lock:
        with metrics_file.open("a") as f:
            f.write(json.dumps(metrics))
            f.write("\n")

def log_checkpoint(epoch, model_path, metrics):
    """Log checkpoint information."""
    checkpoint_info = {
        "timestamp": datetime.now().isoformat(),
        "epoch": epoch,
        "model_path": model_path,
        "metrics": metrics
    }

    with scheduler.lock:
        with checkpoints_file.open("a") as f:
            f.write(json.dumps(checkpoint_info))
            f.write("\n")

def train_epoch(epoch, num_steps=100):
    """Mock training epoch."""
    epoch_loss = 0
    pbar = tqdm(range(num_steps), desc=f"Epoch {epoch}")

    for step in pbar:
        # Simulate training
        time.sleep(0.1)

        # Mock metrics
        loss = 1.0 / (step + 1 + epoch * 10)
        accuracy = min(0.95, 0.5 + step * 0.005 + epoch * 0.05)

        epoch_loss += loss

        # Log every 10 steps
        if step % 10 == 0:
            log_metrics(
                epoch=epoch,
                step=step,
                loss=loss,
                accuracy=accuracy,
                learning_rate=config["learning_rate"]
            )

        pbar.set_postfix({"loss": f"{loss:.4f}", "acc": f"{accuracy:.4f}"})

    return epoch_loss / num_steps, accuracy

# ============================================================================
# Main Training Loop
# ============================================================================

def main():
    print("🚀 Starting training...")

    for epoch in range(config["num_epochs"]):
        print(f"\n📍 Epoch {epoch + 1}/{config['num_epochs']}")

        # Train for one epoch
        avg_loss, final_accuracy = train_epoch(epoch)

        # Log checkpoint
        checkpoint_path = f"checkpoints/model_epoch_{epoch}.pt"
        log_checkpoint(
            epoch=epoch,
            model_path=checkpoint_path,
            metrics={"loss": avg_loss, "accuracy": final_accuracy}
        )

        print(f"✅ Epoch {epoch + 1} complete - Loss: {avg_loss:.4f}, Acc: {final_accuracy:.4f}")

    print(f"\n🎉 Training complete! Logs saved to: {scheduler.repo_id}")

    # Force final upload
    # Note: scheduler will upload automatically, but we can trigger it manually if needed
    print("📤 Uploading final logs...")
    time.sleep(2)  # Give scheduler time to complete

if __name__ == "__main__":
    main()
```
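Once the run finishes, the JSONL log is easy to analyze offline. The sketch below is an illustrative addition (not part of the original skill): it writes a small stand-in `metrics.jsonl` in a temporary directory, in the same shape as the records produced by `log_metrics` above, then parses it to find the best epoch by accuracy.

```python
import json
import tempfile
from pathlib import Path

# Stand-in metrics file in the JSONL shape written by log_metrics above.
metrics_file = Path(tempfile.mkdtemp()) / "metrics.jsonl"
with metrics_file.open("w") as f:
    for epoch, acc in [(0, 0.60), (1, 0.72), (2, 0.69)]:
        f.write(json.dumps({"epoch": epoch, "accuracy": acc, "loss": 1.0 - acc}) + "\n")

# Parse one JSON object per line and pick the best epoch by accuracy.
records = [json.loads(line) for line in metrics_file.read_text().splitlines() if line.strip()]
best = max(records, key=lambda r: r["accuracy"])
print(best["epoch"])  # → 1
```

The same pattern works on the real log downloaded from the dataset repo: each line is an independent JSON object, so partial uploads never corrupt earlier records.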

---

## Example 3: Dataset Snapshot Saver

This example shows how to create versioned snapshots of data.

```python
# snapshot_saver.py
import json
from pathlib import Path
from datetime import datetime
from huggingface_hub import HfApi
from tqdm import tqdm

class DatasetSnapshotSaver:
    """Save versioned snapshots of data to HuggingFace datasets."""

    def __init__(self, repo_id, repo_type="dataset"):
        self.api = HfApi()
        self.repo_id = repo_id
        self.repo_type = repo_type

        # Create repo if it doesn't exist
        try:
            self.api.create_repo(
                repo_id=repo_id,
                repo_type=repo_type,
                exist_ok=True
            )
            print(f"✅ Repository ready: {repo_id}")
        except Exception as e:
            print(f"❌ Error creating repo: {e}")

    def save_snapshot(self, data_folder, snapshot_name=None):
        """
        Save a snapshot of the data folder to the dataset.

        Args:
            data_folder: Path to local folder containing data
            snapshot_name: Optional custom name, defaults to timestamp
        """
        if snapshot_name is None:
            snapshot_name = datetime.now().strftime("%Y%m%d_%H%M%S")

        data_path = Path(data_folder)
        if not data_path.exists():
            raise ValueError(f"Data folder not found: {data_folder}")

        print(f"📸 Creating snapshot: {snapshot_name}")

        # Upload folder
        self.api.upload_folder(
            folder_path=str(data_path),
            path_in_repo=f"snapshots/{snapshot_name}",
            repo_id=self.repo_id,
            repo_type=self.repo_type,
            commit_message=f"Snapshot: {snapshot_name}"
        )

        print(f"✅ Snapshot saved: snapshots/{snapshot_name}")
        return snapshot_name

    def save_metadata(self, snapshot_name, metadata):
        """Save metadata for a snapshot."""
        metadata_content = json.dumps(metadata, indent=2)

        self.api.upload_file(
            path_or_fileobj=metadata_content.encode(),
            path_in_repo=f"snapshots/{snapshot_name}/metadata.json",
            repo_id=self.repo_id,
            repo_type=self.repo_type,
            commit_message=f"Add metadata for {snapshot_name}"
        )

        print(f"✅ Metadata saved for snapshot: {snapshot_name}")

# ============================================================================
# Usage Example
# ============================================================================

if __name__ == "__main__":
    # Initialize saver
    saver = DatasetSnapshotSaver(
        repo_id="your-username/data-snapshots"
    )

    # Create sample data
    data_folder = Path("./sample_data")
    data_folder.mkdir(exist_ok=True)

    # Generate some sample files
    print("📝 Generating sample data...")
    for i in tqdm(range(10)):
        sample_file = data_folder / f"data_{i}.json"
        with sample_file.open("w") as f:
            json.dump({"id": i, "value": i * 10}, f)

    # Save snapshot
    snapshot_name = saver.save_snapshot(
        data_folder=data_folder,
        snapshot_name="initial_snapshot"
    )

    # Save metadata
    saver.save_metadata(
        snapshot_name=snapshot_name,
        metadata={
            "created_at": datetime.now().isoformat(),
            "num_files": 10,
            "description": "Initial data snapshot",
            "version": "1.0"
        }
    )

    print(f"\n🎉 Complete! View at: https://huggingface.co/datasets/{saver.repo_id}")
```

---

## Example 4: Image Collection Archiver

This example shows how to collect images and periodically archive them to a dataset.

```python
# image_archiver.py
import zipfile
import tempfile
from pathlib import Path
from datetime import datetime
from huggingface_hub import CommitScheduler

class ImageArchiveScheduler(CommitScheduler):
    """
    Custom scheduler that collects images and uploads them as ZIP archives.
    """

    def push_to_hub(self):
        """Override to zip images before uploading."""

        # Find all image files
        image_extensions = ["*.png", "*.jpg", "*.jpeg", "*.gif"]
        image_files = []
        for ext in image_extensions:
            image_files.extend(list(self.folder_path.glob(ext)))

        if len(image_files) == 0:
            print("No images to archive")
            return None

        print(f"📦 Archiving {len(image_files)} images...")

        # Create ZIP archive
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        archive_name = f"images_{timestamp}.zip"

        with tempfile.TemporaryDirectory() as tmpdir:
            archive_path = Path(tmpdir) / archive_name

            with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zip_file:
                for img_file in image_files:
                    zip_file.write(
                        filename=img_file,
                        arcname=f"{timestamp}/{img_file.name}"
                    )

            # Upload archive
            self.api.upload_file(
                path_or_fileobj=str(archive_path),
                path_in_repo=f"{self.path_in_repo}/{archive_name}",
                repo_id=self.repo_id,
                repo_type=self.repo_type,
                commit_message=f"Archive {len(image_files)} images from {timestamp}"
            )

        # Delete local images after successful upload
        for img_file in image_files:
            img_file.unlink()

        print(f"✅ Archived and uploaded {len(image_files)} images")

# ============================================================================
# Usage Example
# ============================================================================

if __name__ == "__main__":
    import time
    from PIL import Image
    import numpy as np

    # Setup image folder
    image_folder = Path("./collected_images")
    image_folder.mkdir(exist_ok=True)

    # Initialize custom scheduler
    scheduler = ImageArchiveScheduler(
        repo_id="your-username/image-archives",
        repo_type="dataset",
        folder_path=image_folder,
        path_in_repo="archives",
        every=5,  # Archive every 5 minutes
    )

    print(f"📸 Image archiver started. Saving to: {scheduler.repo_id}")

    # Simulate image collection
    print("Generating sample images...")
    for i in range(5):
        # Create a random image
        img_array = np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8)
        img = Image.fromarray(img_array)

        # Save image
        img_path = image_folder / f"image_{i}_{datetime.now().strftime('%H%M%S')}.png"
        img.save(img_path)
        print(f"  Generated: {img_path.name}")

        time.sleep(2)

    print("\n⏳ Waiting for scheduler to archive images...")
    print("   (In production, your app would continue running)")

    # In a real application, the scheduler runs in the background
    # and you can continue processing
```
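The zip-then-delete step at the heart of `push_to_hub` can be exercised entirely offline. This sketch (an illustrative addition, with placeholder bytes standing in for real images) bundles a folder of files into one archive under a common prefix, removes the originals, and inspects the result:

```python
import tempfile
import zipfile
from pathlib import Path

# Create a few placeholder "images" in a temporary folder.
folder = Path(tempfile.mkdtemp())
for i in range(3):
    (folder / f"image_{i}.png").write_bytes(b"\x89PNG placeholder")

# Bundle everything into a single ZIP under a common prefix,
# then delete each original once it is in the archive.
archive_path = folder / "images.zip"
with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
    for img in sorted(folder.glob("*.png")):
        zf.write(img, arcname=f"batch/{img.name}")
        img.unlink()

with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
print(names)  # → ['batch/image_0.png', 'batch/image_1.png', 'batch/image_2.png']
```

Shipping one archive per interval keeps the dataset repo at a handful of files per commit instead of hundreds of small images, which is kinder to both Git history and download speed.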

---

## Example 5: A/B Testing Results Collector

This example shows how to collect A/B testing results from a Gradio Space.

```python
# ab_testing_app.py
import json
import uuid
import random
from pathlib import Path
from datetime import datetime
import gradio as gr
from huggingface_hub import CommitScheduler

# ============================================================================
# Setup Dataset Storage for A/B Testing
# ============================================================================

results_folder = Path("ab_test_results")
results_folder.mkdir(exist_ok=True)

results_file = results_folder / f"results_{uuid.uuid4()}.jsonl"

scheduler = CommitScheduler(
    repo_id="your-username/ab-test-results",
    repo_type="dataset",
    folder_path=results_folder,
    path_in_repo="experiments",
    every=10,
)

print(f"📊 A/B test results will be saved to: {scheduler.repo_id}")

# ============================================================================
# A/B Testing Logic
# ============================================================================

def assign_variant():
    """Randomly assign user to variant A or B."""
    return random.choice(["A", "B"])

def get_recommendation(user_input, variant):
    """
    Generate recommendation based on variant.
    Variant A: Conservative recommendations
    Variant B: Aggressive recommendations
    """
    if variant == "A":
        return f"Conservative recommendation for: {user_input}"
    else:
        return f"Aggressive recommendation for: {user_input}"

def log_interaction(session_id, variant, user_input, recommendation, user_clicked):
    """Log A/B test interaction."""
    result = {
        "timestamp": datetime.now().isoformat(),
        "session_id": session_id,
        "variant": variant,
        "user_input": user_input,
        "recommendation": recommendation,
        "user_clicked": user_clicked,
        "conversion": user_clicked
    }

    with scheduler.lock:
        with results_file.open("a") as f:
            f.write(json.dumps(result))
            f.write("\n")

# ============================================================================
# Gradio Interface
# ============================================================================

def process_request(user_input, session_state):
    """Process user request and assign variant."""
    if session_state is None:
        session_state = {
            "session_id": str(uuid.uuid4()),
            "variant": assign_variant()
        }

    recommendation = get_recommendation(user_input, session_state["variant"])

    return (
        recommendation,
        f"You are in variant: {session_state['variant']}",
        session_state,
        session_state["session_id"],
        session_state["variant"],
        user_input,
        recommendation
    )

def log_click(session_id, variant, user_input, recommendation):
    """Log when user clicks the recommendation."""
    if session_id:
        log_interaction(session_id, variant, user_input, recommendation, True)
        return "✅ Click logged!"
    return "⚠️ No session data"

with gr.Blocks(title="A/B Testing Demo") as demo:
    gr.Markdown("# A/B Testing Demo")
    gr.Markdown("Test two different recommendation strategies")

    # Session state
    session_state = gr.State(None)
    session_id_state = gr.State(None)
    variant_state = gr.State(None)
    input_state = gr.State(None)
    recommendation_state = gr.State(None)

    with gr.Row():
        user_input = gr.Textbox(
            label="What are you looking for?",
            placeholder="Enter your query..."
        )
        submit_btn = gr.Button("Get Recommendation", variant="primary")

    recommendation_output = gr.Textbox(
        label="Recommendation",
        interactive=False
    )

    variant_display = gr.Textbox(
        label="Your Test Variant",
        interactive=False
    )

    click_btn = gr.Button("I like this recommendation!", variant="secondary")
    click_status = gr.Textbox(label="Status", interactive=False)

    # Connect functions
    submit_btn.click(
        fn=process_request,
        inputs=[user_input, session_state],
        outputs=[
            recommendation_output,
            variant_display,
            session_state,
            session_id_state,
            variant_state,
            input_state,
            recommendation_state
        ]
    )

    click_btn.click(
        fn=log_click,
        inputs=[session_id_state, variant_state, input_state, recommendation_state],
        outputs=click_status
    )

    gr.Markdown("---")
    gr.Markdown(f"📊 Results are saved to: [{scheduler.repo_id}](https://huggingface.co/datasets/{scheduler.repo_id})")

if __name__ == "__main__":
    demo.launch()
```
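After the experiment runs, the logged JSONL rows can be aggregated into per-variant conversion rates. The snippet below is an illustrative addition using hypothetical in-memory rows in the same shape as the records written by `log_interaction` above:

```python
import json
from collections import defaultdict

# Hypothetical logged interactions, trimmed to the fields the analysis needs.
rows = [
    {"variant": "A", "conversion": True},
    {"variant": "A", "conversion": False},
    {"variant": "B", "conversion": True},
    {"variant": "B", "conversion": True},
]
lines = [json.dumps(r) for r in rows]  # stands in for results_*.jsonl lines

# Aggregate: variant -> [conversions, impressions]
totals = defaultdict(lambda: [0, 0])
for line in lines:
    r = json.loads(line)
    totals[r["variant"]][0] += int(r["conversion"])
    totals[r["variant"]][1] += 1

rates = {v: conv / n for v, (conv, n) in totals.items()}
print(rates)  # → {'A': 0.5, 'B': 1.0}
```

Because each Space restart writes to a fresh `results_{uuid}.jsonl`, a real analysis would concatenate every file under `experiments/` in the dataset before aggregating.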

---

## Running the Examples

For any of these examples:

1. **Install dependencies:**
   ```bash
   uv add gradio huggingface_hub
   # For the image example: uv add pillow numpy
   ```

2. **Log in to Hugging Face:**
   ```bash
   huggingface-cli login
   ```

3. **Update `repo_id`:**
   Replace `"your-username/repo-name"` with your actual Hugging Face username and desired dataset name.

4. **Run the script:**
   ```bash
   uv run app.py
   ```

5. **View results:**
   Visit `https://huggingface.co/datasets/your-username/repo-name` to see your data!
.claude/skills/hf-dataset-storage/reference.md
DELETED

@@ -1,767 +0,0 @@
# API Reference for HF Dataset Storage

This reference provides detailed documentation for all the APIs and configuration options related to Hugging Face dataset storage.

## Table of Contents

1. [CommitScheduler](#commitscheduler)
2. [HfApi Upload Methods](#hfapi-upload-methods)
3. [Commit Operations](#commit-operations)
4. [Authentication](#authentication)
5. [Configuration](#configuration)
6. [Error Handling](#error-handling)

---

## CommitScheduler

The `CommitScheduler` class automatically uploads files to a dataset repository at regular intervals.

### Constructor

```python
from huggingface_hub import CommitScheduler

scheduler = CommitScheduler(
    repo_id: str,
    folder_path: str | Path,
    *,
    repo_type: str = "dataset",
    revision: str = "main",
    path_in_repo: str = ".",
    every: int | float = 5,
    token: str | None = None,
    allow_patterns: str | List[str] | None = None,
    ignore_patterns: str | List[str] | None = None,
)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `repo_id` | `str` | Required | Repository ID (e.g., "username/dataset-name") |
| `folder_path` | `str \| Path` | Required | Local folder to monitor and upload |
| `repo_type` | `str` | `"dataset"` | Type of repo: "dataset", "model", or "space" |
| `revision` | `str` | `"main"` | Git revision/branch to commit to |
| `path_in_repo` | `str` | `"."` | Path in the repo where files will be uploaded |
| `every` | `int \| float` | `5` | Minutes between uploads (minimum 5 recommended) |
| `token` | `str \| None` | `None` | HuggingFace token (uses cached token if None) |
| `allow_patterns` | `str \| List[str] \| None` | `None` | Glob patterns for files to include |
| `ignore_patterns` | `str \| List[str] \| None` | `None` | Glob patterns for files to exclude |

### Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| `lock` | `threading.Lock` | Thread lock for safe concurrent writes |
| `api` | `HfApi` | HuggingFace API client instance |
| `repo_id` | `str` | The repository ID |
| `folder_path` | `Path` | The monitored folder path |

### Methods

#### `push_to_hub()`

Manually trigger an upload. Called automatically by the scheduler.

```python
scheduler.push_to_hub()
```

**Note:** You can override this method in a subclass for custom behavior.

### Example: Basic Usage

```python
from pathlib import Path
from huggingface_hub import CommitScheduler

# Create scheduler
scheduler = CommitScheduler(
    repo_id="username/my-dataset",
    folder_path=Path("./data"),
    every=10,
)

# Files in ./data will be uploaded every 10 minutes
# Use scheduler.lock when writing to ensure thread safety
```
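The `lock` attribute matters because the scheduler's background upload can run at any moment; writers must hold the same lock so a commit never sees a half-written line. This offline sketch (an illustrative addition using a plain `threading.Lock` in place of `scheduler.lock`, and a temporary file) shows the pattern with many concurrent writers:

```python
import json
import tempfile
import threading
from pathlib import Path

# Stand-in for scheduler.lock: the scheduler would acquire the same lock
# during push_to_hub, so appends never interleave with an upload.
lock = threading.Lock()
log_file = Path(tempfile.mkdtemp()) / "events.jsonl"

def log_event(i):
    with lock:
        with log_file.open("a") as f:
            f.write(json.dumps({"event": i}) + "\n")

threads = [threading.Thread(target=log_event, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

count = len(log_file.read_text().splitlines())
print(count)  # → 20
```

With the real scheduler, replace `lock` by `scheduler.lock`; everything else is identical.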
|
| 90 |
-
|
| 91 |
-
### Example: Custom Upload Logic
|
| 92 |
-
|
| 93 |
-
```python
|
| 94 |
-
from huggingface_hub import CommitScheduler
|
| 95 |
-
import zipfile
|
| 96 |
-
from pathlib import Path
|
| 97 |
-
|
| 98 |
-
class CustomScheduler(CommitScheduler):
|
| 99 |
-
def push_to_hub(self):
|
| 100 |
-
"""Custom logic to zip files before upload."""
|
| 101 |
-
files = list(self.folder_path.glob("*.txt"))
|
| 102 |
-
if not files:
|
| 103 |
-
return None
|
| 104 |
-
|
| 105 |
-
# Create archive
|
| 106 |
-
archive_path = self.folder_path / "archive.zip"
|
| 107 |
-
with zipfile.ZipFile(archive_path, "w") as zf:
|
| 108 |
-
for file in files:
|
| 109 |
-
zf.write(file, file.name)
|
| 110 |
-
|
| 111 |
-
# Upload using parent's API
|
| 112 |
-
self.api.upload_file(
|
| 113 |
-
path_or_fileobj=str(archive_path),
|
| 114 |
-
path_in_repo="archives/archive.zip",
|
| 115 |
-
repo_id=self.repo_id,
|
| 116 |
-
repo_type=self.repo_type,
|
| 117 |
-
)
|
| 118 |
-
|
| 119 |
-
# Cleanup
|
| 120 |
-
archive_path.unlink()
|
| 121 |
-
for file in files:
|
| 122 |
-
file.unlink()
|
| 123 |
-
```
|
| 124 |
-
|
| 125 |
-
---
|
| 126 |
-
|
| 127 |
-
## HfApi Upload Methods
|
| 128 |
-
|
| 129 |
-
The `HfApi` class provides methods for uploading files and folders.
|
| 130 |
-
|
| 131 |
-
### upload_file()
|
| 132 |
-
|
| 133 |
-
Upload a single file to a repository.
|
| 134 |
-
|
| 135 |
-
```python
|
| 136 |
-
from huggingface_hub import HfApi
|
| 137 |
-
|
| 138 |
-
api = HfApi()
|
| 139 |
-
|
| 140 |
-
api.upload_file(
|
| 141 |
-
path_or_fileobj: str | Path | bytes | BinaryIO,
|
| 142 |
-
path_in_repo: str,
|
| 143 |
-
repo_id: str,
|
| 144 |
-
*,
|
| 145 |
-
repo_type: str = "model",
|
| 146 |
-
revision: str = "main",
|
| 147 |
-
commit_message: str | None = None,
|
| 148 |
-
commit_description: str | None = None,
|
| 149 |
-
token: str | None = None,
|
| 150 |
-
run_as_future: bool = False,
|
| 151 |
-
)
|
| 152 |
-
```
|
| 153 |
-
|
| 154 |
-
#### Parameters
|
| 155 |
-
|
| 156 |
-
| Parameter | Type | Description |
|
| 157 |
-
|-----------|------|-------------|
|
| 158 |
-
| `path_or_fileobj` | `str | Path | bytes | BinaryIO` | File path or file-like object to upload |
|
| 159 |
-
| `path_in_repo` | `str` | Destination path in the repository |
|
| 160 |
-
| `repo_id` | `str` | Repository ID |
|
| 161 |
-
| `repo_type` | `str` | "model", "dataset", or "space" |
|
| 162 |
-
| `revision` | `str` | Branch/tag to commit to |
|
| 163 |
-
| `commit_message` | `str | None` | Custom commit message |
|
| 164 |
-
| `commit_description` | `str | None` | Extended commit description |
|
| 165 |
-
| `token` | `str | None` | Authentication token |
|
| 166 |
-
| `run_as_future` | `bool` | Run upload in background (returns Future) |
|
| 167 |
-
|
| 168 |
-
#### Returns
|
| 169 |
-
|
| 170 |
-
- `str`: Commit hash (URL to the commit)
|
| 171 |
-
- `concurrent.futures.Future`: If `run_as_future=True`
|

#### Example

```python
# Upload file from path
api.upload_file(
    path_or_fileobj="/path/to/file.json",
    path_in_repo="data/file.json",
    repo_id="username/my-dataset",
    repo_type="dataset",
    commit_message="Add new data file"
)

# Upload bytes
data = b'{"key": "value"}'
api.upload_file(
    path_or_fileobj=data,
    path_in_repo="config.json",
    repo_id="username/my-dataset",
    repo_type="dataset",
)

# Background upload
future = api.upload_file(
    path_or_fileobj="large_file.bin",
    path_in_repo="large_file.bin",
    repo_id="username/my-dataset",
    repo_type="dataset",
    run_as_future=True,
)
# Do other work...
future.result()  # Wait for completion
```

---

### upload_folder()

Upload an entire folder to a repository.

```python
api.upload_folder(
    folder_path: str | Path,
    repo_id: str,
    *,
    repo_type: str = "model",
    revision: str = "main",
    path_in_repo: str = ".",
    commit_message: str | None = None,
    commit_description: str | None = None,
    token: str | None = None,
    allow_patterns: str | List[str] | None = None,
    ignore_patterns: str | List[str] | None = None,
    delete_patterns: str | List[str] | None = None,
    run_as_future: bool = False,
)
```

#### Additional Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `folder_path` | `str \| Path` | Local folder to upload |
| `allow_patterns` | `str \| List[str] \| None` | Glob patterns to include |
| `ignore_patterns` | `str \| List[str] \| None` | Glob patterns to exclude |
| `delete_patterns` | `str \| List[str] \| None` | Patterns to delete from repo before upload |

#### Example

```python
# Upload entire folder
api.upload_folder(
    folder_path="./my_dataset",
    repo_id="username/my-dataset",
    repo_type="dataset",
)

# Upload only CSV files
api.upload_folder(
    folder_path="./data",
    repo_id="username/my-dataset",
    repo_type="dataset",
    allow_patterns="*.csv",
)

# Upload and delete old files
api.upload_folder(
    folder_path="./new_data",
    path_in_repo="data",
    repo_id="username/my-dataset",
    repo_type="dataset",
    delete_patterns="*.old",  # Delete .old files first
)
```

---

### upload_large_folder()

Upload very large folders with resume capability.

```python
api.upload_large_folder(
    repo_id: str,
    folder_path: str | Path,
    *,
    repo_type: str = "model",
    revision: str = "main",
    private: bool = False,
    token: str | None = None,
    allow_patterns: str | List[str] | None = None,
    ignore_patterns: str | List[str] | None = None,
    num_workers: int = 1,
)
```

#### Key Features

- **Resumable**: Caches progress locally, can resume after interruption
- **Multi-threaded**: Parallel uploads with `num_workers`
- **Resilient**: Automatic retries on errors

#### Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `num_workers` | `int` | Number of parallel upload threads |

#### Limitations

- Cannot set a custom `path_in_repo` (uploads to the repo root)
- Cannot set a custom commit message
- Cannot delete files while uploading
- Cannot create a PR directly

#### Example

```python
# Upload huge dataset
api.upload_large_folder(
    repo_id="username/huge-dataset",
    folder_path="/data/massive_dataset",
    repo_type="dataset",
    num_workers=4,  # Use 4 parallel threads
)

# If interrupted, re-run the same command to resume
```

---

## Commit Operations

For fine-grained control over commits, use `create_commit()` with operation objects.

### create_commit()

Create a commit with multiple operations (add/delete/copy files).

```python
api.create_commit(
    repo_id: str,
    operations: List[CommitOperation],
    *,
    commit_message: str,
    commit_description: str | None = None,
    repo_type: str = "model",
    revision: str = "main",
    token: str | None = None,
    create_pr: bool = False,
)
```

### Operation Types

#### CommitOperationAdd

Add or update a file.

```python
from huggingface_hub import CommitOperationAdd

op = CommitOperationAdd(
    path_in_repo="path/to/file.txt",
    path_or_fileobj="/local/path/file.txt"  # or bytes or file object
)
```

#### CommitOperationDelete

Delete a file or folder.

```python
from huggingface_hub import CommitOperationDelete

op = CommitOperationDelete(
    path_in_repo="path/to/delete.txt"  # or "folder/" for directories
)
```

#### CommitOperationCopy

Copy a file within the repository.

```python
from huggingface_hub import CommitOperationCopy

op = CommitOperationCopy(
    src_path_in_repo="original.txt",
    path_in_repo="copy.txt",
    src_revision="main"  # optional: copy from a different branch
)
```

### Example: Multi-operation Commit

```python
from huggingface_hub import HfApi, CommitOperationAdd, CommitOperationDelete

api = HfApi()

operations = [
    CommitOperationAdd(
        path_in_repo="data/new_file.json",
        path_or_fileobj="/local/new_file.json"
    ),
    CommitOperationAdd(
        path_in_repo="config.yaml",
        path_or_fileobj=b"key: value"
    ),
    CommitOperationDelete(
        path_in_repo="old_data/"  # Delete entire folder
    ),
]

api.create_commit(
    repo_id="username/my-dataset",
    operations=operations,
    commit_message="Update dataset files",
    commit_description="Added new data and removed old files",
    repo_type="dataset",
)
```

---

## Authentication

### Method 1: CLI Login (Recommended)

```bash
huggingface-cli login
```

This caches your token locally. All subsequent API calls use this token automatically.

### Method 2: Environment Variable

```bash
export HF_TOKEN="hf_..."
```

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
```

### Method 3: Programmatic Token

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_your_token_here")
```

### For Hugging Face Spaces

1. Go to Space Settings → Repository secrets
2. Add secret: `HF_TOKEN` = your token value
3. Access in code:

```python
import os
token = os.environ.get("HF_TOKEN")
```

---

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `HF_TOKEN` | Authentication token | None |
| `HF_HOME` | Cache directory | `~/.cache/huggingface` |
| `HF_HUB_CACHE` | Hub cache directory | `$HF_HOME/hub` |
| `HF_ENDPOINT` | Hub endpoint URL | `https://huggingface.co` |
| `HF_XET_CACHE` | Xet cache directory | `$HF_HOME/xet` |
| `HF_XET_HIGH_PERFORMANCE` | Enable high-performance mode | `0` |

### Example: Custom Cache Location

```python
import os

# Set cache to local SSD for better performance
os.environ["HF_HOME"] = "/mnt/local-ssd/.cache/huggingface"

from huggingface_hub import HfApi
api = HfApi()
```
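
The defaults in the table form a fallback chain: `HF_HUB_CACHE` overrides `$HF_HOME/hub`, which in turn falls back to `~/.cache/huggingface/hub`. A minimal sketch of that resolution order (an illustration of the precedence only, not the library's actual code; the function name is made up):

```python
import os
from pathlib import Path

def resolve_hub_cache(env: dict) -> str:
    """Resolve the hub cache directory via the documented fallback chain.

    Takes the environment as a dict so the sketch is easy to test;
    the real library reads os.environ directly.
    """
    if "HF_HUB_CACHE" in env:
        return env["HF_HUB_CACHE"]          # explicit override wins
    hf_home = env.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return os.path.join(hf_home, "hub")     # derived from HF_HOME otherwise
```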

---

## Error Handling

### Common Errors and Solutions

#### 1. Authentication Error

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()

try:
    api.upload_file(
        path_or_fileobj="file.txt",
        path_in_repo="file.txt",
        repo_id="username/dataset",
        repo_type="dataset",
    )
except HfHubHTTPError as e:
    if e.response.status_code == 401:
        print("❌ Authentication failed. Run: huggingface-cli login")
    else:
        raise
```

#### 2. Repository Not Found

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

api = HfApi()

try:
    api.upload_file(...)
except RepositoryNotFoundError:
    print("❌ Repository not found. Creating...")
    api.create_repo(repo_id="username/dataset", repo_type="dataset")
    api.upload_file(...)  # Retry
```

#### 3. File Too Large

```python
import os

from huggingface_hub import HfApi

api = HfApi()

file_size = os.path.getsize("huge_file.bin")

if file_size > 5 * 1024**3:  # > 5GB
    print("⚠️ Large file detected, using upload_large_folder")
    # Move file to folder and use upload_large_folder
else:
    api.upload_file(
        path_or_fileobj="huge_file.bin",
        path_in_repo="huge_file.bin",
        repo_id="username/dataset",
        repo_type="dataset",
    )
```

#### 4. Network Interruption

```python
import time

from huggingface_hub import HfApi

api = HfApi()

max_retries = 3
for attempt in range(max_retries):
    try:
        api.upload_folder(
            folder_path="./data",
            repo_id="username/dataset",
            repo_type="dataset",
        )
        break
    except Exception as e:
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt  # Exponential backoff
            print(f"⚠️ Upload failed, retrying in {wait_time}s...")
            time.sleep(wait_time)
        else:
            print("❌ Upload failed after all retries")
            raise
```
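
The retry loop in example 4 can be factored into a reusable helper. This is a sketch, not part of `huggingface_hub`; `with_retries` and its defaults are made-up names, and the `sleep` parameter is injectable so the backoff schedule can be tested without actually waiting:

```python
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Usage would then look like `with_retries(lambda: api.upload_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset"))`.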

---

## Advanced Patterns

### Pattern 1: Atomic Updates

Ensure all files are updated together or not at all.

```python
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()

# Prepare all operations
operations = [
    CommitOperationAdd("file1.json", path_or_fileobj=data1),
    CommitOperationAdd("file2.json", path_or_fileobj=data2),
    CommitOperationAdd("file3.json", path_or_fileobj=data3),
]

# Single atomic commit
api.create_commit(
    repo_id="username/dataset",
    operations=operations,
    commit_message="Atomic update of all files",
    repo_type="dataset",
)
```

### Pattern 2: Concurrent Uploads

Upload multiple files in parallel.

```python
from huggingface_hub import HfApi
from concurrent.futures import ThreadPoolExecutor

api = HfApi()
files = ["file1.txt", "file2.txt", "file3.txt"]

def upload_file(filename):
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id="username/dataset",
        repo_type="dataset",
    )

with ThreadPoolExecutor(max_workers=3) as executor:
    executor.map(upload_file, files)
```

### Pattern 3: Progressive Dataset Building

Build a dataset incrementally with versioning.

```python
from huggingface_hub import HfApi

api = HfApi()

for version in range(1, 11):
    # Generate data for this version
    data = generate_data(version)

    # Upload to versioned path
    api.upload_file(
        path_or_fileobj=data,
        path_in_repo=f"versions/v{version}/data.json",
        repo_id="username/dataset",
        repo_type="dataset",
        commit_message=f"Add version {version}",
    )
```

---

## Performance Tips

### 1. Use High-Performance Mode

For maximum upload speed (uses all CPU cores and bandwidth):

```bash
export HF_XET_HIGH_PERFORMANCE=1
```

### 2. Local Cache for Cluster Uploads

When uploading from distributed filesystems:

```bash
# Point cache to local SSD, not network filesystem
export HF_XET_CACHE=/local-ssd/.cache/xet
```

### 3. Batch Small Files

Instead of uploading thousands of small files individually:

```python
import zipfile
from huggingface_hub import HfApi

# Zip small files
with zipfile.ZipFile("archive.zip", "w") as zf:
    for file in small_files:
        zf.write(file)

# Upload single archive
api.upload_file(
    path_or_fileobj="archive.zip",
    path_in_repo="data/archive.zip",
    repo_id="username/dataset",
    repo_type="dataset",
)
```

### 4. Use Background Uploads

Don't block your main thread:

```python
from huggingface_hub import HfApi

api = HfApi()

# Start upload in background
future = api.upload_folder(
    folder_path="./data",
    repo_id="username/dataset",
    repo_type="dataset",
    run_as_future=True,
)

# Do other work
process_more_data()

# Wait for completion when ready
future.result()
```

---

## Comparison Table

| Feature | CommitScheduler | upload_folder() | upload_large_folder() |
|---------|----------------|-----------------|----------------------|
| Automatic uploads | ✅ Yes | ❌ No | ❌ No |
| Resumable | ❌ No | ❌ No | ✅ Yes |
| Custom commit message | ❌ No | ✅ Yes | ❌ No |
| Background operation | ✅ Yes | ✅ Yes (with flag) | ❌ No |
| Path in repo | ✅ Yes | ✅ Yes | ❌ No (root only) |
| Multi-threaded | ❌ No | ❌ No | ✅ Yes |
| Best for | Continuous logging | One-time uploads | Huge datasets |
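
Read as a decision rule, the table reduces to: continuous writes → `CommitScheduler`, otherwise pick by total size. A sketch of that rule (the function name and the 50 GB threshold are illustrative choices, not values from the library):

```python
def pick_upload_method(continuous: bool, total_bytes: int,
                       large_threshold: int = 50 * 1024**3) -> str:
    """Map the comparison table above to a method name."""
    if continuous:
        return "CommitScheduler"      # periodic background commits
    if total_bytes >= large_threshold:
        return "upload_large_folder"  # resumable, multi-threaded
    return "upload_folder"            # simple one-shot upload
```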

---

## Quick Reference

### Upload Single File
```python
api.upload_file(path_or_fileobj="file.txt", path_in_repo="file.txt",
                repo_id="user/dataset", repo_type="dataset")
```

### Upload Folder
```python
api.upload_folder(folder_path="./data", repo_id="user/dataset",
                  repo_type="dataset")
```

### Scheduled Uploads
```python
scheduler = CommitScheduler(repo_id="user/dataset", folder_path="./data",
                            every=10, repo_type="dataset")
```

### Background Upload
```python
future = api.upload_folder(..., run_as_future=True)
future.result()  # Wait for completion
```

### Large Folder
```python
api.upload_large_folder(repo_id="user/dataset", folder_path="./big_data",
                        repo_type="dataset", num_workers=4)
```

app.py
CHANGED

```diff
@@ -85,7 +85,7 @@ def calculate_table(matches_list):
         all_teams.add(match[2])  # away team

     # Initialize stats for all teams
-    table = {t: {"P": 0, "W": 0, "D": 0, "L": 0, "GF": 0, "GA": 0, "Pts": 0, "GPM": 0.0, "WP": 0.0}
+    table = {t: {"P": 0, "W": 0, "D": 0, "L": 0, "GF": 0, "GA": 0, "Pts": 0, "GPM": 0.0, "GAM": 0.0, "GDM": 0.0, "WP": 0.0}
             for t in all_teams}

     # Process each match
@@ -117,6 +117,8 @@ def calculate_table(matches_list):
     for t in all_teams:
         if table[t]["P"] > 0:
             table[t]["GPM"] = round(table[t]["GF"] / table[t]["P"], 2)
+            table[t]["GAM"] = round(table[t]["GA"] / table[t]["P"], 2)
+            table[t]["GDM"] = round((table[t]["GF"] - table[t]["GA"]) / table[t]["P"], 2)
             table[t]["WP"] = round((table[t]["W"] / table[t]["P"]) * 100, 2)

     # Create DataFrame
@@ -126,7 +128,7 @@ def calculate_table(matches_list):
     df.rename(columns={"index": "Team"}, inplace=True)

     # Sort by WP descending (as per requirements)
-    df = df[["Team", "WP", "GPM", "P", "W", "D", "L", "GF", "GA", "GD", "Pts"]]
+    df = df[["Team", "WP", "GPM", "GAM", "GDM", "P", "W", "D", "L", "GF", "GA", "GD", "Pts"]]
     df = df.sort_values(by=["WP"], ascending=False)

     return df
```
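
The three per-match averages this commit maintains can be checked in isolation. A standalone helper mirroring the arithmetic above (an illustration, not the app's actual function):

```python
def per_match_averages(gf: int, ga: int, played: int) -> dict:
    """GPM/GAM/GDM: goals for, against, and goal difference per match played."""
    return {
        "GPM": round(gf / played, 2),
        "GAM": round(ga / played, 2),
        "GDM": round((gf - ga) / played, 2),
    }
```

For a team with GF=7, GA=4 over 3 matches this gives GPM 2.33, GAM 1.33, GDM 1.0, matching the rounding used in `calculate_table`.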