burtenshaw HF Staff commited on
Commit
6ab17a7
·
verified ·
1 Parent(s): 02b5f0a

Upload folder using huggingface_hub

Browse files
SKILL.md ADDED
@@ -0,0 +1,706 @@
1
+ ---
2
+ name: model-trainer
3
+ description: This skill should be used when users want to train or fine-tune language models using TRL (Transformer Reinforcement Learning) on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, and model persistence. Should be invoked for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.
4
+ license: Complete terms in LICENSE.txt
5
+ ---
6
+
7
+ # TRL Training on Hugging Face Jobs
8
+
9
+ ## Overview
10
+
11
+ Train language models using TRL (Transformer Reinforcement Learning) on fully managed Hugging Face infrastructure. No local GPU setup required—models train on cloud GPUs and results are automatically saved to the Hugging Face Hub.
12
+
13
+ **TRL provides multiple training methods:**
14
+ - **SFT** (Supervised Fine-Tuning) - Standard instruction tuning
15
+ - **DPO** (Direct Preference Optimization) - Alignment from preference data
16
+ - **GRPO** (Group Relative Policy Optimization) - Online RL training
17
+ - **Reward Modeling** - Train reward models for RLHF
18
+
19
+ **For detailed TRL method documentation:**
20
+ ```python
21
+ hf_doc_search("your query", product="trl")
22
+ hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer") # SFT
23
+ hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer") # DPO
24
+ # etc.
25
+ ```
26
+
27
+ **See also:** `references/training_methods.md` for method overviews and selection guidance
28
+
29
+ ## When to Use This Skill
30
+
31
+ Use this skill when users want to:
32
+ - Fine-tune language models on cloud GPUs without local infrastructure
33
+ - Train with TRL methods (SFT, DPO, GRPO, etc.)
34
+ - Run training jobs on Hugging Face Jobs infrastructure
35
+ - Convert trained models to GGUF for local deployment (Ollama, LM Studio, llama.cpp)
36
+ - Ensure trained models are permanently saved to the Hub
37
+ - Use modern workflows with optimized defaults
38
+
39
+ ## Key Directives
40
+
41
+ When assisting with training jobs:
42
+
43
+ 1. **ALWAYS use `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})`, NOT bash `trl-jobs` commands. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`. If user asks to "train a model", "fine-tune", or similar requests, you MUST create the training script AND submit the job immediately using `hf_jobs()`.
44
+
45
+ 2. **Always include Trackio** - Every training script should include Trackio for real-time monitoring. Use example scripts in `scripts/` as templates.
46
+
47
+ 3. **Provide job details after submission** - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
48
+
49
+ 4. **Use example scripts as templates** - Reference `scripts/train_sft_example.py`, `scripts/train_dpo_example.py`, etc. as starting points.
50
+
51
+ ## Local Script Dependencies
52
+
53
+ To run scripts locally (like `estimate_cost.py`), install dependencies:
54
+ ```bash
55
+ pip install -r requirements.txt
56
+ ```
57
+
58
+ ## Prerequisites Checklist
59
+
60
+ Before starting any training job, verify:
61
+
62
+ ### ✅ **Account & Authentication**
63
+ - Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
64
+ - Authenticated login: Check with `hf_whoami()`
65
+ - **HF_TOKEN for Hub Push** ⚠️ CRITICAL - Training environment is ephemeral, must push to Hub or ALL training results are lost
66
+ - Token must have write permissions
67
+ - **MUST pass `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config** to make token available (the `$HF_TOKEN` syntax
68
+ references your actual token value)
69
+
70
+ ### ✅ **Dataset Requirements**
71
+ - Dataset must exist on Hub or be loadable via `datasets.load_dataset()`
72
+ - Format must match training method (SFT: "messages"/text/prompt-completion; DPO: chosen/rejected; GRPO: prompt-only)
73
+ - **ALWAYS validate unknown datasets** before GPU training to prevent format failures (see Dataset Validation section below)
74
+ - Size appropriate for hardware (Demo: 50-100 examples on t4-small; Production: 1K-10K+ on a10g-large/a100-large)
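As a pre-flight sanity check, the required columns can be compared against a dataset's schema before any GPU time is spent. A hedged sketch (column names only; TRL accepts more layouts than the ones listed here):

```python
# Hedged sketch: map each training method to the column sets it accepts.
EXPECTED = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
    "grpo": [{"prompt"}],
}

def compatible_methods(columns):
    """Return the training methods whose required columns are all present."""
    cols = set(columns)
    return [m for m, options in EXPECTED.items()
            if any(req <= cols for req in options)]

print(compatible_methods(["messages"]))                      # ['sft']
print(compatible_methods(["prompt", "chosen", "rejected"]))  # ['dpo', 'grpo']
```

For anything beyond this quick check, use the dataset inspector described in the Dataset Validation section.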
75
+
76
+ ### ⚠️ **Critical Settings**
77
+ - **Timeout must exceed expected training time** - Default 30min is TOO SHORT for most training. Minimum recommended: 1-2 hours. Job fails and loses all progress if timeout is exceeded.
78
+ - **Hub push must be enabled** - Config: `push_to_hub=True`, `hub_model_id="username/model-name"`; Job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
79
+
80
+ ## Asynchronous Job Guidelines
81
+
82
+ **⚠️ IMPORTANT: Training jobs run asynchronously and can take hours**
83
+
84
+ ### Action Required
85
+
86
+ **When user requests training:**
87
+ 1. **Create the training script** with Trackio included (use `scripts/train_sft_example.py` as template)
88
+ 2. **Submit immediately** using `hf_jobs()` MCP tool with script content inline - don't save to file unless user requests
89
+ 3. **Report submission** with job ID, monitoring URL, and estimated time
90
+ 4. **Wait for user** to request status checks - don't poll automatically
91
+
92
+ ### Ground Rules
93
+ - **Jobs run in background** - Submission returns immediately; training continues independently
94
+ - **Initial logs delayed** - Can take 30-60 seconds for logs to appear
95
+ - **User checks status** - Wait for user to request status updates
96
+ - **Avoid polling** - Check logs only on user request; provide monitoring links instead
97
+
98
+ ### After Submission
99
+
100
+ **Provide to user:**
101
+ - ✅ Job ID and monitoring URL
102
+ - ✅ Expected completion time
103
+ - ✅ Trackio dashboard URL
104
+ - ✅ Note that user can request status checks later
105
+
106
+ **Example Response:**
107
+ ```
108
+ ✅ Job submitted successfully!
109
+
110
+ Job ID: abc123xyz
111
+ Monitor: https://huggingface.co/jobs/username/abc123xyz
112
+
113
+ Expected time: ~2 hours
114
+ Estimated cost: ~$10
115
+
116
+ The job is running in the background. Ask me to check status/logs when ready!
117
+ ```
118
+
119
+ ## Quick Start: Three Approaches
120
+
121
+ **💡 Tip for Demos:** For quick demos on smaller GPUs (t4-small), omit `eval_dataset` and `eval_strategy` to save ~40% memory. You'll still see training loss and learning progress.
122
+
123
+ ### Sequence Length Configuration
124
+
125
+ **TRL config classes use `max_length` (not `max_seq_length`)** to control tokenized sequence length:
126
+
127
+ ```python
128
+ # ✅ CORRECT - If you need to set sequence length
129
+ SFTConfig(max_length=512) # Truncate sequences to 512 tokens
130
+ DPOConfig(max_length=2048) # Longer context (2048 tokens)
131
+
132
+ # ❌ WRONG - This parameter doesn't exist
133
+ SFTConfig(max_seq_length=512) # TypeError!
134
+ ```
135
+
136
+ **Default behavior:** `max_length=1024` (truncates from right). This works well for most training.
137
+
138
+ **When to override:**
139
+ - **Longer context**: Set higher (e.g., `max_length=2048`)
140
+ - **Memory constraints**: Set lower (e.g., `max_length=512`)
141
+ - **Vision models**: Set `max_length=None` (prevents cutting image tokens)
142
+
143
+ **Usually you don't need to set this parameter at all** - the examples below use the sensible default.
144
+
145
+ ### Approach 1: UV Scripts (Recommended—Default Choice)
146
+
147
+ UV scripts use PEP 723 inline dependencies for clean, self-contained training. **This is the primary approach for Claude Code.**
148
+
149
+ ```python
150
+ hf_jobs("uv", {
151
+     "script": """
152
+ # /// script
153
+ # dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio"]
154
+ # ///
155
+ 
156
+ from datasets import load_dataset
157
+ from peft import LoraConfig
158
+ from trl import SFTTrainer, SFTConfig
159
+ import trackio
160
+ 
161
+ dataset = load_dataset("trl-lib/Capybara", split="train")
162
+ 
163
+ # Create train/eval split for monitoring
164
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
165
+ 
166
+ trainer = SFTTrainer(
167
+     model="Qwen/Qwen2.5-0.5B",
168
+     train_dataset=dataset_split["train"],
169
+     eval_dataset=dataset_split["test"],
170
+     peft_config=LoraConfig(r=16, lora_alpha=32),
171
+     args=SFTConfig(
172
+         output_dir="my-model",
173
+         push_to_hub=True,
174
+         hub_model_id="username/my-model",
175
+         num_train_epochs=3,
176
+         eval_strategy="steps",
177
+         eval_steps=50,
178
+         report_to="trackio",
179
+         project="meaningful_project_name",  # trackio project name
180
+         run_name="meaningful_run_name",  # descriptive name for this specific training run
181
+     )
182
+ )
183
+ 
184
+ trainer.train()
185
+ trainer.push_to_hub()
186
+ """,
187
+     "flavor": "a10g-large",
188
+     "timeout": "2h",
189
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}
190
+ })
191
+ ```
192
+
193
+ **Benefits:** Direct MCP tool usage, clean code, dependencies declared inline (PEP 723), no file saving required, full control
194
+ **When to use:** Default choice for all training tasks in Claude Code, custom training logic, any scenario requiring `hf_jobs()`
195
+
196
+ #### Working with Scripts
197
+
198
+ ⚠️ **Important:** The `script` parameter accepts either inline code (as shown above) OR a URL. **Local file paths do NOT work.**
199
+
200
+ **Why local paths don't work:**
201
+ Jobs run in isolated Docker containers without access to your local filesystem. Scripts must be:
202
+ - Inline code (recommended for custom training)
203
+ - Publicly accessible URLs
204
+ - Private repo URLs (with HF_TOKEN)
205
+
206
+ **Common mistakes:**
207
+ ```python
208
+ # ❌ These will all fail
209
+ hf_jobs("uv", {"script": "train.py"})
210
+ hf_jobs("uv", {"script": "./scripts/train.py"})
211
+ hf_jobs("uv", {"script": "/path/to/train.py"})
212
+ ```
213
+
214
+ **Correct approaches:**
215
+ ```python
216
+ # ✅ Inline code (recommended)
217
+ hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<your code>"})
218
+
219
+ # ✅ From Hugging Face Hub
220
+ hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/train.py"})
221
+
222
+ # ✅ From GitHub
223
+ hf_jobs("uv", {"script": "https://raw.githubusercontent.com/user/repo/main/train.py"})
224
+
225
+ # ✅ From Gist
226
+ hf_jobs("uv", {"script": "https://gist.githubusercontent.com/user/id/raw/train.py"})
227
+ ```
228
+
229
+ **To use local scripts:** Upload to HF Hub first:
230
+ ```bash
231
+ huggingface-cli repo create my-training-scripts --type model
232
+ huggingface-cli upload my-training-scripts ./train.py train.py
233
+ # Use: https://huggingface.co/USERNAME/my-training-scripts/resolve/main/train.py
234
+ ```
235
+
236
+ ### Approach 2: TRL Maintained Scripts (Official Examples)
237
+
238
+ TRL provides battle-tested scripts for all methods. Can be run from URLs:
239
+
240
+ ```python
241
+ hf_jobs("uv", {
242
+     "script": "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py",
243
+     "script_args": [
244
+         "--model_name_or_path", "Qwen/Qwen2.5-0.5B",
245
+         "--dataset_name", "trl-lib/Capybara",
246
+         "--output_dir", "my-model",
247
+         "--push_to_hub",
248
+         "--hub_model_id", "username/my-model"
249
+     ],
250
+     "flavor": "a10g-large",
251
+     "timeout": "2h",
252
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}
253
+ })
254
+ ```
255
+
256
+ **Benefits:** No code to write, maintained by TRL team, production-tested
257
+ **When to use:** Standard TRL training, quick experiments, don't need custom code
258
+ **Available:** Scripts are available from https://github.com/huggingface/trl/tree/main/examples/scripts
259
+
260
+ ### Finding More UV Scripts on Hub
261
+
262
+ The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
263
+
264
+ ```python
265
+ # Discover available UV script collections
266
+ dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
267
+
268
+ # Explore a specific collection
269
+ hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
270
+ ```
271
+
272
+ **Popular collections:** ocr, classification, synthetic-data, vllm, dataset-creation
273
+
274
+ ### Approach 3: HF Jobs CLI (Direct Terminal Commands)
275
+
276
+ When the `hf_jobs()` MCP tool is unavailable, use the `hf jobs` CLI directly.
277
+
278
+ **⚠️ CRITICAL: CLI Syntax Rules**
279
+
280
+ ```bash
281
+ # ✅ CORRECT syntax - flags BEFORE script URL
282
+ hf jobs uv run --flavor a10g-large --timeout 2h --secrets HF_TOKEN "https://example.com/train.py"
283
+
284
+ # ❌ WRONG - "run uv" instead of "uv run"
285
+ hf jobs run uv "https://example.com/train.py" --flavor a10g-large
286
+
287
+ # ❌ WRONG - flags AFTER script URL (will be ignored!)
288
+ hf jobs uv run "https://example.com/train.py" --flavor a10g-large
289
+
290
+ # ❌ WRONG - "--secret" instead of "--secrets" (plural)
291
+ hf jobs uv run --secret HF_TOKEN "https://example.com/train.py"
292
+ ```
293
+
294
+ **Key syntax rules:**
295
+ 1. Command order is `hf jobs uv run` (NOT `hf jobs run uv`)
296
+ 2. All flags (`--flavor`, `--timeout`, `--secrets`) must come BEFORE the script URL
297
+ 3. Use `--secrets` (plural), not `--secret`
298
+ 4. Script URL must be the last positional argument
299
+
300
+ **Complete CLI example:**
301
+ ```bash
302
+ hf jobs uv run \
303
+ --flavor a10g-large \
304
+ --timeout 2h \
305
+ --secrets HF_TOKEN \
306
+ "https://huggingface.co/user/repo/resolve/main/train.py"
307
+ ```
308
+
309
+ **Check job status via CLI:**
310
+ ```bash
311
+ hf jobs ps # List all jobs
312
+ hf jobs logs <job-id> # View logs
313
+ hf jobs inspect <job-id> # Job details
314
+ hf jobs cancel <job-id> # Cancel a job
315
+ ```
316
+
317
+ ### Approach 4: TRL Jobs Package (Simplified Training)
318
+
319
+ The `trl-jobs` package provides optimized defaults and one-liner training.
320
+
321
+ ```bash
322
+ # Install
323
+ pip install trl-jobs
324
+
325
+ # Train with SFT (simplest possible)
326
+ trl-jobs sft \
327
+ --model_name Qwen/Qwen2.5-0.5B \
328
+ --dataset_name trl-lib/Capybara
329
+ ```
330
+
331
+ **Benefits:** Pre-configured settings, automatic Trackio integration, automatic Hub push, one-line commands
332
+ **When to use:** User working in terminal directly (not Claude Code context), quick local experimentation
333
+ **Repository:** https://github.com/huggingface/trl-jobs
334
+
335
+ ⚠️ **In Claude Code context, prefer using `hf_jobs()` MCP tool (Approach 1) when available.**
336
+
337
+ ## Hardware Selection
338
+
339
+ | Model Size | Recommended Hardware | Cost (approx/hr) | Use Case |
340
+ |------------|---------------------|------------------|----------|
341
+ | <1B params | `t4-small` | ~$0.75 | Demos, quick tests (skip eval steps) |
342
+ | 1-3B params | `t4-medium`, `l4x1` | ~$1.50-2.50 | Development |
343
+ | 3-7B params | `a10g-small`, `a10g-large` | ~$3.50-5.00 | Production training |
344
+ | 7-13B params | `a10g-large`, `a100-large` | ~$5-10 | Large models (use LoRA) |
345
+ | 13B+ params | `a100-large`, `a10g-largex2` | ~$10-20 | Very large (use LoRA) |
346
+
347
+ **GPU Flavors:** cpu-basic/upgrade/performance/xl, t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8
348
+
349
+ **Guidelines:**
350
+ - Use **LoRA/PEFT** for models >7B to reduce memory
351
+ - Multi-GPU automatically handled by TRL/Accelerate
352
+ - Start with smaller hardware for testing
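To see why LoRA matters for larger models, a rough VRAM estimate helps. This heuristic assumes bf16 weights and AdamW optimizer states (real usage also depends on batch size, sequence length, and gradient checkpointing, so treat it as a sketch, not a guarantee):

```python
# Rough VRAM heuristic (assumption: bf16 weights, AdamW; illustrative only).
def est_vram_gb(params_billions, lora=False):
    if lora:
        # frozen bf16 base weights (~2 bytes/param) + small adapter/optimizer overhead
        return params_billions * 2 + 2
    # bf16 weights + bf16 grads + fp32 AdamW states: ~16 bytes/param total
    return params_billions * 16

print(est_vram_gb(7))             # 112 -> full fine-tune far exceeds a single 24 GB GPU
print(est_vram_gb(7, lora=True))  # 16  -> LoRA fits on an a10g-large (24 GB)
```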
353
+
354
+ **See:** `references/hardware_guide.md` for detailed specifications
355
+
356
+ ## Critical: Saving Results to Hub
357
+
358
+ **⚠️ EPHEMERAL ENVIRONMENT—MUST PUSH TO HUB**
359
+
360
+ The Jobs environment is temporary. All files are deleted when the job ends. If the model isn't pushed to Hub, **ALL TRAINING IS LOST**.
361
+
362
+ ### Required Configuration
363
+
364
+ **In training script/config:**
365
+ ```python
366
+ SFTConfig(
367
+ push_to_hub=True,
368
+ hub_model_id="username/model-name", # MUST specify
369
+ hub_strategy="every_save", # Optional: push checkpoints
370
+ )
371
+ ```
372
+
373
+ **In job submission:**
374
+ ```python
375
+ {
376
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Enables authentication
377
+ }
378
+ ```
379
+
380
+ ### Verification Checklist
381
+
382
+ Before submitting:
383
+ - [ ] `push_to_hub=True` set in config
384
+ - [ ] `hub_model_id` includes username/repo-name
385
+ - [ ] `secrets` parameter includes HF_TOKEN
386
+ - [ ] User has write access to target repo
387
+
388
+ **See:** `references/hub_saving.md` for detailed troubleshooting
389
+
390
+ ## Timeout Management
391
+
392
+ **⚠️ DEFAULT: 30 MINUTES—TOO SHORT FOR TRAINING**
393
+
394
+ ### Setting Timeouts
395
+
396
+ ```python
397
+ {
398
+     "timeout": "2h"  # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
399
+ }
400
+ ```
401
+
402
+ ### Timeout Guidelines
403
+
404
+ | Scenario | Recommended | Notes |
405
+ |----------|-------------|-------|
406
+ | Quick demo (50-100 examples) | 10-30 min | Verify setup |
407
+ | Development training | 1-2 hours | Small datasets |
408
+ | Production (3-7B model) | 4-6 hours | Full datasets |
409
+ | Large model with LoRA | 3-6 hours | Depends on dataset |
410
+
411
+ **Always add 20-30% buffer** for model/dataset loading, checkpoint saving, Hub push operations, and network delays.
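The buffer rule can be applied mechanically when turning an estimate into a timeout value. A small sketch (illustrative helper, not part of any package):

```python
# Sketch: derive a timeout string from an estimated runtime plus a safety buffer.
def timeout_with_buffer(estimated_minutes, buffer=0.30):
    """Return a Jobs-style timeout string, e.g. '117m'."""
    return f"{int(estimated_minutes * (1 + buffer))}m"

print(timeout_with_buffer(90))  # '117m' -> pass as {"timeout": "117m"}
```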
412
+
413
+ **On timeout:** Job killed immediately, all unsaved progress lost, must restart from beginning
414
+
415
+ ## Cost Estimation
416
+
417
+ **Offer to estimate cost when planning jobs with known parameters.** Use `scripts/estimate_cost.py`:
418
+
419
+ ```bash
420
+ python scripts/estimate_cost.py \
421
+ --model meta-llama/Llama-2-7b-hf \
422
+ --dataset trl-lib/Capybara \
423
+ --hardware a10g-large \
424
+ --dataset-size 16000 \
425
+ --epochs 3
426
+ ```
427
+
428
+ Output includes estimated time, cost, recommended timeout (with buffer), and optimization suggestions.
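For a quick mental check without running the script, multiply the hourly rate from the hardware table by the expected runtime (the rate below is the approximate figure quoted above for a10g-large, not exact billing):

```python
# Back-of-envelope cost check (assumed rate; actual Jobs pricing may differ).
rate_per_hour = 3.50   # approx a10g-large rate from the hardware table
estimated_hours = 3
print(f"~${rate_per_hour * estimated_hours:.2f}")  # ~$10.50
```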
429
+
430
+ **When to offer:** User planning a job, asks about cost/time, choosing hardware, job will run >1 hour or cost >$5
431
+
432
+ ## Example Training Scripts
433
+
434
+ **Production-ready templates with all best practices:**
435
+
436
+ Use these scripts as ready-made templates:
437
+
438
+ - **`scripts/train_sft_example.py`** - Complete SFT training with Trackio, LoRA, checkpoints
439
+ - **`scripts/train_dpo_example.py`** - DPO training for preference learning
440
+ - **`scripts/train_grpo_example.py`** - GRPO training for online RL
441
+
442
+ These scripts demonstrate proper Hub saving, Trackio integration, checkpoint management, and optimized parameters. Pass their content inline to `hf_jobs()` or use as templates for custom scripts.
443
+
444
+ ## Monitoring and Tracking
445
+
446
+ **Trackio** provides real-time metrics visualization. See `references/trackio_guide.md` for complete setup guide.
447
+
448
+ **Key points:**
449
+ - Add `trackio` to dependencies
450
+ - Configure the trainer with `report_to="trackio"` and `run_name="meaningful_name"`
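A minimal configuration sketch tying these points together (names are illustrative; `project` and `run_name` are used the same way as in the UV script example earlier):

```python
# Hedged sketch: wire Trackio into a TRL config (illustrative values).
from trl import SFTConfig

args = SFTConfig(
    output_dir="my-model",
    report_to="trackio",                # enable Trackio logging
    project="my_project",               # groups related runs together
    run_name="qwen05b-capybara-lora",   # descriptive, user-recognizable name
)
```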
451
+
452
+ ### Trackio Configuration Defaults
453
+
454
+ **Use sensible defaults unless user specifies otherwise.** When generating training scripts with Trackio:
455
+
456
+ **Default Configuration:**
457
+ - **Space ID**: `{username}/trackio` (use "trackio" as default space name)
458
+ - **Run naming**: Unless otherwise specified, name the run in a way the user will recognize (e.g., descriptive of the task, model, or purpose)
459
+ - **Config**: Keep minimal - only include hyperparameters and model/dataset info
460
+ - **Project name**: Set a project name to group related runs under one project
461
+
462
+ **User overrides:** If user requests specific trackio configuration (custom space, run naming, grouping, or additional config), apply their preferences instead of defaults.
463
+
464
+
466
+
467
+ See `references/trackio_guide.md` for complete documentation including grouping runs for experiments.
468
+
469
+ ### Check Job Status
470
+
471
+ ```python
472
+ # List all jobs
473
+ hf_jobs("ps")
474
+
475
+ # Inspect specific job
476
+ hf_jobs("inspect", {"job_id": "your-job-id"})
477
+
478
+ # View logs
479
+ hf_jobs("logs", {"job_id": "your-job-id"})
480
+ ```
481
+
482
+ **Remember:** Wait for user to request status checks. Avoid polling repeatedly.
483
+
484
+ ## Dataset Validation
485
+
486
+ **Validate dataset format BEFORE launching GPU training to prevent the #1 cause of training failures: format mismatches.**
487
+
488
+ ### Why Validate
489
+
490
+ - 50%+ of training failures are due to dataset format issues
491
+ - DPO especially strict: requires exact column names (`prompt`, `chosen`, `rejected`)
492
+ - Failed GPU jobs waste $1-10 and 30-60 minutes
493
+ - Validation on CPU costs ~$0.01 and takes <1 minute
494
+
495
+ ### When to Validate
496
+
497
+ **ALWAYS validate for:**
498
+ - Unknown or custom datasets
499
+ - DPO training (CRITICAL - 90% of datasets need mapping)
500
+ - Any dataset not explicitly TRL-compatible
501
+
502
+ **Skip validation for known TRL datasets:**
503
+ - `trl-lib/ultrachat_200k`, `trl-lib/Capybara`, `HuggingFaceH4/ultrachat_200k`, etc.
504
+
505
+ ### Usage
506
+
507
+ ```python
508
+ hf_jobs("uv", {
509
+     "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
510
+     "script_args": ["--dataset", "username/dataset-name", "--split", "train"]
511
+ })
512
+ ```
513
+
514
+ The script runs quickly and will usually complete within a minute.
515
+
516
+ ### Reading Results
517
+
518
+ The output shows compatibility for each training method:
519
+
520
+ - **`✓ READY`** - Dataset is compatible, use directly
521
+ - **`✗ NEEDS MAPPING`** - Compatible but needs preprocessing (mapping code provided)
522
+ - **`✗ INCOMPATIBLE`** - Cannot be used for this method
523
+
524
+ When mapping is needed, the output includes a **"MAPPING CODE"** section with copy-paste ready Python code.
525
+
526
+ ### Example Workflow
527
+
528
+ ```python
529
+ # 1. Inspect dataset (costs ~$0.01, <1 min on CPU)
530
+ hf_jobs("uv", {
531
+     "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
532
+     "script_args": ["--dataset", "argilla/distilabel-math-preference-dpo", "--split", "train"]
533
+ })
534
+ 
535
+ # 2. Check output markers:
536
+ # ✓ READY → proceed with training
537
+ # ✗ NEEDS MAPPING → apply mapping code below
538
+ # ✗ INCOMPATIBLE → choose different method/dataset
539
+ 
540
+ # 3. If mapping needed, apply before training:
541
+ def format_for_dpo(example):
542
+     return {
543
+         'prompt': example['instruction'],
544
+         'chosen': example['chosen_response'],
545
+         'rejected': example['rejected_response'],
546
+     }
547
+ dataset = dataset.map(format_for_dpo, remove_columns=dataset.column_names)
548
+ 
549
+ # 4. Launch training job with confidence
550
+ ```
551
+
552
+ ### Common Scenario: DPO Format Mismatch
553
+
554
+ Most DPO datasets use non-standard column names. Example:
555
+
556
+ ```
557
+ Dataset has: instruction, chosen_response, rejected_response
558
+ DPO expects: prompt, chosen, rejected
559
+ ```
560
+
561
+ The validator detects this and provides exact mapping code to fix it.
562
+
563
+ ## Converting Models to GGUF
564
+
565
+ After training, convert models to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.
566
+
567
+ **What is GGUF:**
568
+ - Optimized for CPU/GPU inference with llama.cpp
569
+ - Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
570
+ - Compatible with Ollama, LM Studio, Jan, GPT4All, llama.cpp
571
+ - Typically 2-8GB for 7B models (vs 14GB unquantized)
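The size figures follow from simple arithmetic. A sketch assuming ~4.5 effective bits per weight for a 4-bit K-quant (actual GGUF sizes vary by quantization type and metadata):

```python
# Back-of-envelope quantized model size (assumption: ~4.5 effective bits/weight
# for a 4-bit K-quant; real GGUF sizes vary by quant type).
def gguf_size_gb(params_billions, bits_per_weight=4.5):
    return params_billions * bits_per_weight / 8

print(f"{gguf_size_gb(7):.1f} GB")      # ~3.9 GB for a 7B model, 4-bit quant
print(f"{gguf_size_gb(7, 16):.1f} GB")  # 14.0 GB unquantized (16-bit)
```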
572
+
573
+ **When to convert:**
574
+ - Running models locally with Ollama or LM Studio
575
+ - Reducing model size with quantization
576
+ - Deploying to edge devices
577
+ - Sharing models for local-first use
578
+
579
+ **See:** `references/gguf_conversion.md` for complete conversion guide, including production-ready conversion script, quantization options, hardware requirements, usage examples, and troubleshooting.
580
+
581
+ **Quick conversion:**
582
+ ```python
583
+ hf_jobs("uv", {
584
+     "script": "<see references/gguf_conversion.md for complete script>",
585
+     "flavor": "a10g-large",
586
+     "timeout": "45m",
587
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"},
588
+     "env": {
589
+         "ADAPTER_MODEL": "username/my-finetuned-model",
590
+         "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
591
+         "OUTPUT_REPO": "username/my-model-gguf"
592
+     }
593
+ })
594
+ ```
595
+
596
+ ## Common Training Patterns
597
+
598
+ See `references/training_patterns.md` for detailed examples including:
599
+ - Quick demo (5-10 minutes)
600
+ - Production with checkpoints
601
+ - Multi-GPU training
602
+ - DPO training (preference learning)
603
+ - GRPO training (online RL)
604
+
605
+ ## Common Failure Modes
606
+
607
+ ### Out of Memory (OOM)
608
+
609
+ **Fix (try in order):**
610
+ 1. Reduce batch size: `per_device_train_batch_size=1`, increase `gradient_accumulation_steps=8`. Effective batch size is `per_device_train_batch_size` x `gradient_accumulation_steps`. For best performance keep effective batch size close to 128.
611
+ 2. Enable: `gradient_checkpointing=True`
612
+ 3. Upgrade hardware: t4-small → l4x1, a10g-small → a10g-large, etc.
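The first two fixes can be combined in one config. A hedged sketch with illustrative values, not tuned for any particular model:

```python
# Hedged sketch: an OOM-conscious SFTConfig combining the fixes above.
from trl import SFTConfig

args = SFTConfig(
    output_dir="my-model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch = 1 x 8 = 8
    gradient_checkpointing=True,     # trade compute for memory
    max_length=512,                  # shorter sequences also reduce memory
    push_to_hub=True,
    hub_model_id="username/my-model",
)
```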
613
+
614
+ ### Dataset Misformatted
615
+
616
+ **Fix:**
617
+ 1. Validate first with dataset inspector:
618
+ ```bash
619
+ uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
620
+   --dataset name --split train
621
+ ```
622
+ 2. Check output for compatibility markers (✓ READY, ✗ NEEDS MAPPING, ✗ INCOMPATIBLE)
623
+ 3. Apply mapping code from inspector output if needed
624
+
625
+ ### Job Timeout
626
+
627
+ **Fix:**
628
+ 1. Check logs for actual runtime: `hf_jobs("logs", {"job_id": "..."})`
629
+ 2. Increase timeout with buffer: `"timeout": "3h"` (add 30% to estimated time)
630
+ 3. Or reduce training: lower `num_train_epochs`, use smaller dataset, enable `max_steps`
631
+ 4. Save checkpoints: `save_strategy="steps"`, `save_steps=500`, `hub_strategy="every_save"`
632
+
633
+ **Note:** Default 30min is insufficient for real training. Minimum 1-2 hours.
634
+
635
+ ### Hub Push Failures
636
+
637
+ **Fix:**
638
+ 1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
639
+ 2. Add to config: `push_to_hub=True`, `hub_model_id="username/model-name"`
640
+ 3. Verify auth: `mcp__huggingface__hf_whoami()`
641
+ 4. Check token has write permissions and repo exists (or set `hub_private_repo=True`)
642
+
643
+ ### Missing Dependencies
644
+
645
+ **Fix:**
646
+ Add to PEP 723 header:
647
+ ```python
648
+ # /// script
649
+ # dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio", "missing-package"]
650
+ # ///
651
+ ```
652
+
653
+ ## Troubleshooting
654
+
655
+ **Common issues:**
656
+ - Job times out → Increase timeout, reduce epochs/dataset, use smaller model/LoRA
657
+ - Model not saved to Hub → Check push_to_hub=True, hub_model_id, secrets=HF_TOKEN
658
+ - Out of Memory (OOM) → Reduce batch size, increase gradient accumulation, enable LoRA, use larger GPU
659
+ - Dataset format error → Validate with dataset inspector (see Dataset Validation section)
660
+ - Import/module errors → Add PEP 723 header with dependencies, verify format
661
+ - Authentication errors → Check `mcp__huggingface__hf_whoami()`, token permissions, secrets parameter
662
+
663
+ **See:** `references/troubleshooting.md` for complete troubleshooting guide
664
+
665
+ ## Resources
666
+
667
+ ### References (In This Skill)
668
+ - `references/training_methods.md` - Overview of SFT, DPO, GRPO, KTO, PPO, Reward Modeling
669
+ - `references/training_patterns.md` - Common training patterns and examples
670
+ - `references/gguf_conversion.md` - Complete GGUF conversion guide
671
+ - `references/trackio_guide.md` - Trackio monitoring setup
672
+ - `references/hardware_guide.md` - Hardware specs and selection
673
+ - `references/hub_saving.md` - Hub authentication troubleshooting
674
+ - `references/troubleshooting.md` - Common issues and solutions
+
+ ### Scripts (In This Skill)
+ - `scripts/train_sft_example.py` - Production SFT template
+ - `scripts/train_dpo_example.py` - Production DPO template
+ - `scripts/train_grpo_example.py` - Production GRPO template
+ - `scripts/estimate_cost.py` - Estimate time and cost (offer when appropriate)
+ - `scripts/convert_to_gguf.py` - Complete GGUF conversion script
+
+ ### External Scripts
+ - [Dataset Inspector](https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py) - Validate dataset format before training (use via `uv run` or `hf_jobs`)
+
+ ### External Links
+ - [TRL Documentation](https://huggingface.co/docs/trl)
+ - [TRL Jobs Training Guide](https://huggingface.co/docs/trl/en/jobs_training)
+ - [TRL Jobs Package](https://github.com/huggingface/trl-jobs)
+ - [HF Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
+ - [TRL Example Scripts](https://github.com/huggingface/trl/tree/main/examples/scripts)
+ - [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/)
+ - [UV Scripts Organization](https://huggingface.co/uv-scripts)
+
+ ## Key Takeaways
+
+ 1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless the user requests it
+ 2. **Jobs are asynchronous** - Don't wait or poll; let the user check results when ready
+ 3. **Always set a timeout** - The default 30 minutes is insufficient; a minimum of 1-2 hours is recommended
+ 4. **Always enable Hub push** - The environment is ephemeral; without a push, all results are lost
+ 5. **Include Trackio** - Use the example scripts as templates for real-time monitoring
+ 6. **Offer cost estimation** - When parameters are known, use `scripts/estimate_cost.py`
+ 7. **Use UV scripts (Approach 1)** - Default to `hf_jobs("uv", {...})` with inline scripts; use TRL-maintained scripts for standard training; avoid bash `trl-jobs` commands in Claude Code
+ 8. **Use hf_doc_fetch/hf_doc_search** for the latest TRL documentation
+ 9. **Validate dataset format** before training with the dataset inspector (see Dataset Validation section)
+ 10. **Choose appropriate hardware** for the model size; use LoRA for models >7B
references/gguf_conversion.md ADDED
@@ -0,0 +1,296 @@
+ # GGUF Conversion Guide
+
+ After training models with TRL on Hugging Face Jobs, convert them to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.
+
+ **This guide provides production-ready, tested code based on successful conversions.** All critical dependencies and build steps are included.
+
+ ## What is GGUF?
+
+ **GGUF** (GPT-Generated Unified Format):
+ - Optimized format for CPU/GPU inference with llama.cpp
+ - Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
+ - Compatible with: Ollama, LM Studio, Jan, GPT4All, llama.cpp
+ - Typically 2-8GB for 7B models (vs 14GB unquantized)
+
+ ## When to Convert to GGUF
+
+ **Convert when:**
+ - Running models locally with Ollama or LM Studio
+ - Using CPU-optimized inference
+ - Reducing model size with quantization
+ - Deploying to edge devices
+ - Sharing models for local-first use
+
+ ## Critical Success Factors
+
+ Based on production testing, these are **essential** for reliable conversion:
+
+ ### 1. βœ… Install Build Tools FIRST
+ **Before cloning llama.cpp**, install build dependencies:
+ ```python
+ import subprocess
+
+ subprocess.run(["apt-get", "update", "-qq"], check=True, capture_output=True)
+ subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True, capture_output=True)
+ ```
+
+ **Why:** The quantization tool requires gcc and cmake. Installing after cloning doesn't help.
+
+ ### 2. βœ… Use CMake (Not Make)
+ **Build the quantize tool with CMake:**
+ ```python
+ import os
+ import subprocess
+
+ # Create build directory
+ os.makedirs("/tmp/llama.cpp/build", exist_ok=True)
+
+ # Configure
+ subprocess.run([
+     "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
+     "-DGGML_CUDA=OFF"  # Faster build; CUDA not needed for quantization
+ ], check=True, capture_output=True, text=True)
+
+ # Build
+ subprocess.run([
+     "cmake", "--build", "/tmp/llama.cpp/build",
+     "--target", "llama-quantize", "-j", "4"
+ ], check=True, capture_output=True, text=True)
+
+ # Binary path
+ quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
+ ```
+
+ **Why:** CMake is more reliable than `make` and produces consistent binary paths.
+
+ ### 3. βœ… Include All Dependencies
+ **PEP 723 header must include:**
+ ```python
+ # /// script
+ # dependencies = [
+ #     "transformers>=4.36.0",
+ #     "peft>=0.7.0",
+ #     "torch>=2.0.0",
+ #     "accelerate>=0.24.0",
+ #     "huggingface_hub>=0.20.0",
+ #     "sentencepiece>=0.1.99",  # Required for tokenizer
+ #     "protobuf>=3.20.0",       # Required for tokenizer
+ #     "numpy",
+ #     "gguf",
+ # ]
+ # ///
+ ```
+
+ **Why:** `sentencepiece` and `protobuf` are critical for tokenizer conversion. Missing them causes silent failures.
+
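A quick pre-submission check can catch a missing dependency before the job even starts. The sketch below is a hypothetical helper (not part of this skill's scripts) that scans a script's PEP 723 header for the packages this guide treats as required:

```python
import re

# Packages this guide treats as required for GGUF conversion
REQUIRED_DEPS = {"transformers", "peft", "torch", "sentencepiece", "protobuf", "gguf"}

def missing_dependencies(script_text: str, required=frozenset(REQUIRED_DEPS)) -> set:
    """Return required package names absent from a script's PEP 723 header."""
    match = re.search(r"# /// script\n(.*?)\n# ///", script_text, re.DOTALL)
    if not match:
        return set(required)  # no header at all: everything is missing
    # Dependencies appear as quoted strings; strip version specifiers/extras
    names = {re.split(r"[<>=!\[~]", dep)[0].strip()
             for dep in re.findall(r'"([^"]+)"', match.group(1))}
    return set(required) - names
```

Running it against a header that omits the tokenizer libraries would flag `sentencepiece` and `protobuf` before any GPU time is spent.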
+ ### 4. βœ… Verify Names Before Use
+ **Always verify repos exist:**
+ ```python
+ # Before submitting the job, verify:
+ hub_repo_details([ADAPTER_MODEL], repo_type="model")
+ hub_repo_details([BASE_MODEL], repo_type="model")
+ ```
+
+ **Why:** Non-existent dataset/model names cause job failures that could be caught in seconds.
+
+ ## Complete Conversion Script
+
+ See `scripts/convert_to_gguf.py` for the complete, production-ready script.
+
+ **Key features:**
+ - βœ… All dependencies in PEP 723 header
+ - βœ… Build tools installed automatically
+ - βœ… CMake build process (reliable)
+ - βœ… Comprehensive error handling
+ - βœ… Environment variable configuration
+ - βœ… Automatic README generation
+
+ ## Quick Conversion Job
+
+ ```python
+ # Before submitting: VERIFY MODELS EXIST
+ hub_repo_details(["username/my-finetuned-model"], repo_type="model")
+ hub_repo_details(["Qwen/Qwen2.5-0.5B"], repo_type="model")
+
+ # Submit conversion job
+ hf_jobs("uv", {
+     "script": open("trl/scripts/convert_to_gguf.py").read(),  # Or inline the script
+     "flavor": "a10g-large",
+     "timeout": "45m",
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"},
+     "env": {
+         "ADAPTER_MODEL": "username/my-finetuned-model",
+         "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
+         "OUTPUT_REPO": "username/my-model-gguf",
+         "HF_USERNAME": "username"  # Optional, for README
+     }
+ })
+ ```
+
+ ## Conversion Process
+
+ The script performs these steps:
+
+ 1. **Load and Merge** - Load base model and LoRA adapter, merge them
+ 2. **Install Build Tools** - Install gcc, cmake (CRITICAL: before cloning llama.cpp)
+ 3. **Setup llama.cpp** - Clone repo, install Python dependencies
+ 4. **Convert to GGUF** - Create FP16 GGUF using llama.cpp converter
+ 5. **Build Quantize Tool** - Use CMake to build `llama-quantize`
+ 6. **Quantize** - Create Q4_K_M, Q5_K_M, Q8_0 versions
+ 7. **Upload** - Upload all versions + README to Hub
+
+ ## Quantization Options
+
+ Common quantization formats (from smallest to largest; sizes shown are for a ~0.5B model like the examples in this guide):
+
+ | Format | Size | Quality | Use Case |
+ |--------|------|---------|----------|
+ | **Q4_K_M** | ~300MB | Good | **Recommended** - best balance of size/quality |
+ | **Q5_K_M** | ~350MB | Better | Higher quality, slightly larger |
+ | **Q8_0** | ~500MB | Very High | Near-original quality |
+ | **F16** | ~1GB | Original | Full precision, largest file |
+
+ **Recommendation:** Create Q4_K_M, Q5_K_M, and Q8_0 versions to give users options.
+
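The table above scales roughly linearly with parameter count. As a rule of thumb, GGUF file size is parameters Γ— bits-per-weight / 8; the bits-per-weight figures below are approximations (K-quants carry some metadata overhead), not exact values:

```python
# Approximate bits per weight for common GGUF quantizations (rough rules of thumb)
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Estimate GGUF file size in GB from parameter count and quantization."""
    # params (1e9) * bits / 8 bytes, expressed directly in GB
    return round(params_billions * BITS_PER_WEIGHT[quant] / 8, 2)
```

For example, a 0.5B model at Q4_K_M comes out around 0.3GB and a 7B model at F16 around 14GB, matching the sizes quoted earlier in this guide.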
+ ## Hardware Requirements
+
+ **For conversion:**
+ - Small models (<1B): CPU-basic works, but slow
+ - Medium models (1-7B): a10g-large recommended
+ - Large models (7B+): a10g-large or a100-large
+
+ **Time estimates:**
+ - 0.5B model: ~15-25 minutes on A10G
+ - 3B model: ~30-45 minutes on A10G
+ - 7B model: ~45-60 minutes on A10G
+
+ ## Using GGUF Models
+
+ **GGUF models work on both CPU and GPU.** They're optimized for CPU inference but can also leverage GPU acceleration when available.
+
+ ### With Ollama (auto-detects GPU)
+ ```bash
+ # Download GGUF
+ huggingface-cli download username/my-model-gguf model-q4_k_m.gguf
+
+ # Create Modelfile
+ echo "FROM ./model-q4_k_m.gguf" > Modelfile
+
+ # Create and run (uses GPU automatically if available)
+ ollama create my-model -f Modelfile
+ ollama run my-model
+ ```
+
+ ### With llama.cpp
+ ```bash
+ # CPU only
+ ./llama-cli -m model-q4_k_m.gguf -p "Your prompt"
+
+ # With GPU acceleration (offload 32 layers to GPU)
+ ./llama-cli -m model-q4_k_m.gguf -ngl 32 -p "Your prompt"
+ ```
+
+ ### With LM Studio
+ 1. Download the `.gguf` file
+ 2. Import into LM Studio
+ 3. Start chatting
+
+ ## Best Practices
+
+ ### βœ… DO:
+ 1. **Verify repos exist** before submitting jobs (use `hub_repo_details`)
+ 2. **Install build tools FIRST** before cloning llama.cpp
+ 3. **Use CMake** for building the quantize tool (not make)
+ 4. **Include all dependencies** in the PEP 723 header (especially sentencepiece, protobuf)
+ 5. **Create multiple quantizations** - Give users choice
+ 6. **Test on known models** before production use
+ 7. **Use an A10G GPU** for faster conversion
+
+ ### ❌ DON'T:
+ 1. **Assume repos exist** - Always verify with hub tools
+ 2. **Use make** instead of CMake - Less reliable
+ 3. **Remove dependencies** to "simplify" - They're all needed
+ 4. **Skip build tools** - Quantization will fail silently
+ 5. **Use default paths** - CMake puts binaries in build/bin/
+
+ ## Common Issues
+
+ ### Out of memory during merge
+ **Fix:**
+ - Use a larger GPU (a10g-large or a100-large)
+ - Ensure `device_map="auto"` for automatic placement
+ - Use `dtype=torch.float16` or `torch.bfloat16`
+
+ ### Conversion fails with architecture error
+ **Fix:**
+ - Ensure llama.cpp supports the model architecture
+ - Check for a standard architecture (Qwen, Llama, Mistral, etc.)
+ - Update llama.cpp to latest: `git clone --depth 1 https://github.com/ggerganov/llama.cpp.git`
+ - Check llama.cpp documentation for model support
+
+ ### Quantization fails
+ **Fix:**
+ - Verify build tools are installed: `apt-get install build-essential cmake`
+ - Use CMake (not make) to build the quantize tool
+ - Check the binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
+ - Verify the FP16 GGUF exists before quantizing
+
+ ### Missing sentencepiece error
+ **Fix:**
+ - Add to the PEP 723 header: `"sentencepiece>=0.1.99", "protobuf>=3.20.0"`
+ - Don't remove dependencies to "simplify" - all are required
+
+ ### Upload fails or times out
+ **Fix:**
+ - Large models (>2GB) need a longer timeout: `"timeout": "1h"`
+ - Upload quantized versions separately if needed
+ - Check network/Hub status
+
+ ## Lessons Learned
+
+ These are from production testing and real failures:
+
+ ### 1. Always Verify Before Use
+ **Lesson:** Don't assume repos/datasets exist. Check first.
+ ```python
+ # BEFORE submitting the job
+ hub_repo_details(["trl-lib/argilla-dpo-mix-7k"], repo_type="dataset")  # Would catch the error
+ ```
+ **Prevented failures:** Non-existent dataset names, typos in model names
+
+ ### 2. Prioritize Reliability Over Performance
+ **Lesson:** Default to what's most likely to succeed.
+ - Use CMake (not make) - more reliable
+ - Disable CUDA in the build - faster, not needed
+ - Include all dependencies - don't "simplify"
+
+ **Prevented failures:** Build failures, missing binaries
+
+ ### 3. Create Atomic, Self-Contained Scripts
+ **Lesson:** Don't remove dependencies or steps. Scripts should work as a unit.
+ - All dependencies in the PEP 723 header
+ - All build steps included
+ - Clear error messages
+
+ **Prevented failures:** Missing tokenizer libraries, build tool failures
+
+ ## References
+
+ **In this skill:**
+ - `scripts/convert_to_gguf.py` - Complete, production-ready script
+
+ **External:**
+ - [llama.cpp Repository](https://github.com/ggerganov/llama.cpp)
+ - [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
+ - [Ollama Documentation](https://ollama.ai)
+ - [LM Studio](https://lmstudio.ai)
+
+ ## Summary
+
+ **Critical checklist for GGUF conversion:**
+ - [ ] Verify adapter and base models exist on Hub
+ - [ ] Use the production script from `scripts/convert_to_gguf.py`
+ - [ ] All dependencies in the PEP 723 header (including sentencepiece, protobuf)
+ - [ ] Build tools installed before cloning llama.cpp
+ - [ ] CMake used for building the quantize tool (not make)
+ - [ ] Correct binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
+ - [ ] A10G GPU selected for reasonable conversion time
+ - [ ] Timeout set to 45m minimum
+ - [ ] HF_TOKEN in secrets for Hub upload
+
+ **The script in `scripts/convert_to_gguf.py` incorporates all these lessons and has been tested successfully in production.**
references/hardware_guide.md ADDED
@@ -0,0 +1,283 @@
+ # Hardware Selection Guide
+
+ Choosing the right hardware (flavor) is critical for cost-effective training.
+
+ ## Available Hardware
+
+ ### CPU
+ - `cpu-basic` - Basic CPU, testing only
+ - `cpu-upgrade` - Enhanced CPU
+
+ **Use cases:** Dataset validation, preprocessing, testing scripts
+ **Not recommended for training:** Too slow for any meaningful training
+
+ ### GPU Options
+
+ | Flavor | GPU | Memory | Use Case | Cost/hour |
+ |--------|-----|--------|----------|-----------|
+ | `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
+ | `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
+ | `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
+ | `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
+ | `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
+ | `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
+ | `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
+ | `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
+ | `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
+
+ ### TPU Options
+
+ | Flavor | Type | Use Case |
+ |--------|------|----------|
+ | `v5e-1x1` | TPU v5e | Small TPU workloads |
+ | `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
+ | `v5e-2x4` | 8x TPU v5e | Large TPU workloads |
+
+ **Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.
+
+ ## Selection Guidelines
+
+ ### By Model Size
+
+ **Tiny Models (<1B parameters)**
+ - **Recommended:** `t4-small`
+ - **Example:** Qwen2.5-0.5B, TinyLlama
+ - **Batch size:** 4-8
+ - **Training time:** 1-2 hours for 1K examples
+
+ **Small Models (1-3B parameters)**
+ - **Recommended:** `t4-medium` or `a10g-small`
+ - **Example:** Qwen2.5-1.5B, Phi-2
+ - **Batch size:** 2-4
+ - **Training time:** 2-4 hours for 10K examples
+
+ **Medium Models (3-7B parameters)**
+ - **Recommended:** `a10g-small` or `a10g-large`
+ - **Example:** Qwen2.5-7B, Mistral-7B
+ - **Batch size:** 1-2 (or LoRA with 4-8)
+ - **Training time:** 4-8 hours for 10K examples
+
+ **Large Models (7-13B parameters)**
+ - **Recommended:** `a10g-large` or `a100-large`
+ - **Example:** Llama-3-8B, Mixtral-8x7B (with LoRA)
+ - **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
+ - **Training time:** 6-12 hours for 10K examples
+ - **Note:** Always use LoRA/PEFT
+
+ **Very Large Models (13B+ parameters)**
+ - **Recommended:** `a100-large` with LoRA
+ - **Example:** Llama-3-13B, Llama-3-70B (LoRA only)
+ - **Batch size:** 1-2 with LoRA
+ - **Training time:** 8-24 hours for 10K examples
+ - **Note:** Full fine-tuning is not feasible; use LoRA/PEFT
+
+ ### By Budget
+
+ **Minimal Budget (<$5 total)**
+ - Use `t4-small`
+ - Train on a subset of data (100-500 examples)
+ - Limit to 1-2 epochs
+ - Use a small model (<1B)
+
+ **Small Budget ($5-20)**
+ - Use `t4-medium` or `a10g-small`
+ - Train on 1K-5K examples
+ - 2-3 epochs
+ - Model up to 3B parameters
+
+ **Medium Budget ($20-50)**
+ - Use `a10g-small` or `a10g-large`
+ - Train on 5K-20K examples
+ - 3-5 epochs
+ - Model up to 7B parameters
+
+ **Large Budget ($50-200)**
+ - Use `a10g-large` or `a100-large`
+ - Full dataset training
+ - Multiple epochs
+ - Model up to 13B parameters with LoRA
+
+ ### By Training Type
+
+ **Quick Demo/Experiment**
+ - `t4-small`
+ - 50-100 examples
+ - 5-10 steps
+ - ~10-15 minutes
+
+ **Development/Iteration**
+ - `t4-medium` or `a10g-small`
+ - 1K examples
+ - 1 epoch
+ - ~30-60 minutes
+
+ **Production Training**
+ - `a10g-large` or `a100-large`
+ - Full dataset
+ - 3-5 epochs
+ - 4-12 hours
+
+ **Research/Experimentation**
+ - `a100-large`
+ - Multiple runs
+ - Various hyperparameters
+ - Budget for 20-50 hours
+
+ ## Memory Considerations
+
+ ### Estimating Memory Requirements
+
+ **Full fine-tuning:**
+ ```
+ Memory (GB) β‰ˆ (Model params in billions) Γ— 20
+ ```
+
+ **LoRA fine-tuning:**
+ ```
+ Memory (GB) β‰ˆ (Model params in billions) Γ— 4
+ ```
+
+ **Examples:**
+ - Qwen2.5-0.5B full: ~10GB βœ… fits t4-small
+ - Qwen2.5-1.5B full: ~30GB ❌ exceeds most GPUs
+ - Qwen2.5-1.5B LoRA: ~6GB βœ… fits t4-small
+ - Qwen2.5-7B full: ~140GB ❌ not feasible
+ - Qwen2.5-7B LoRA: ~28GB ❌ exceeds a10g-large (24GB); fits a100-large (40GB), or apply the optimizations below
+
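The two rules of thumb above can be wrapped in a couple of helpers. This is a sketch of the guide's own formulas, not a measurement of real memory usage:

```python
def estimate_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rule-of-thumb training memory from this guide: 20x params (full), 4x (LoRA)."""
    return params_billions * (4 if lora else 20)

def fits(params_billions: float, gpu_memory_gb: float, lora: bool = False) -> bool:
    """Check whether the estimated footprint fits a given GPU."""
    return estimate_memory_gb(params_billions, lora) <= gpu_memory_gb
```

For example, `fits(7, 24, lora=True)` is False (28GB estimated against a 24GB A10G), which is why 7B LoRA runs need either a larger GPU or the optimizations below.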
+ ### Memory Optimization
+
+ If hitting memory limits:
+
+ 1. **Use LoRA/PEFT**
+    ```python
+    peft_config=LoraConfig(r=16, lora_alpha=32)
+    ```
+
+ 2. **Reduce batch size**
+    ```python
+    per_device_train_batch_size=1
+    ```
+
+ 3. **Increase gradient accumulation**
+    ```python
+    gradient_accumulation_steps=8  # Effective batch size = 1Γ—8
+    ```
+
+ 4. **Enable gradient checkpointing**
+    ```python
+    gradient_checkpointing=True
+    ```
+
+ 5. **Use mixed precision**
+    ```python
+    bf16=True  # or fp16=True
+    ```
+
+ 6. **Upgrade to a larger GPU**
+    - t4 β†’ a10g β†’ a100
+
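Gradient accumulation trades memory for steps without changing what the optimizer sees. The effective batch size works out as a simple product (a minimal sketch of the arithmetic):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Batch size seen by the optimizer: per-device batch x accumulation x GPUs."""
    return per_device * grad_accum * num_gpus
```

So `per_device_train_batch_size=1` with `gradient_accumulation_steps=8` keeps the effective batch size at 8 while only one example at a time occupies GPU memory.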
+ ## Cost Estimation
+
+ ### Formula
+
+ ```
+ Total Cost = (Hours of training) Γ— (Cost per hour)
+ ```
+
+ ### Example Calculations
+
+ **Quick demo:**
+ - Hardware: t4-small ($0.75/hour)
+ - Time: 15 minutes (0.25 hours)
+ - Cost: $0.19
+
+ **Development training:**
+ - Hardware: a10g-small ($3.50/hour)
+ - Time: 2 hours
+ - Cost: $7.00
+
+ **Production training:**
+ - Hardware: a10g-large ($5/hour)
+ - Time: 6 hours
+ - Cost: $30.00
+
+ **Large model with LoRA:**
+ - Hardware: a100-large ($10/hour)
+ - Time: 8 hours
+ - Cost: $80.00
+
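The formula is trivial to encode, which makes it easy to quote a cost whenever hardware and expected duration are known (the hourly rates above are approximate, so treat results as estimates):

```python
def training_cost(hours: float, rate_per_hour: float) -> float:
    """Total cost = training hours x hourly rate, rounded to cents."""
    return round(hours * rate_per_hour, 2)
```

For example, the quick demo above is `training_cost(0.25, 0.75)` and the production run is `training_cost(6, 5)`.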
+ ### Cost Optimization Tips
+
+ 1. **Start small:** Test on t4-small with a subset
+ 2. **Use LoRA:** 4-5x cheaper than full fine-tuning
+ 3. **Optimize hyperparameters:** Fewer epochs if possible
+ 4. **Set an appropriate timeout:** Don't waste compute on stalled jobs
+ 5. **Use checkpointing:** Resume if a job fails
+ 6. **Monitor costs:** Check running jobs regularly
+
+ ## Multi-GPU Training
+
+ TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.
+
+ **Multi-GPU flavors:**
+ - `l4x4` - 4x L4 GPUs
+ - `a10g-largex2` - 2x A10G GPUs
+ - `a10g-largex4` - 4x A10G GPUs
+
+ **When to use:**
+ - Models >13B parameters
+ - Need faster training (linear speedup)
+ - Large datasets (>50K examples)
+
+ **Example:**
+ ```python
+ hf_jobs("uv", {
+     "script": "train.py",
+     "flavor": "a10g-largex2",  # 2 GPUs
+     "timeout": "4h",
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}
+ })
+ ```
+
+ No code changes neededβ€”TRL/Accelerate handles distribution automatically.
+
+ ## Choosing Between Options
+
+ ### a10g vs a100
+
+ **Choose a10g when:**
+ - Model <13B parameters
+ - Budget conscious
+ - Training time not critical
+
+ **Choose a100 when:**
+ - Model 13B+ parameters
+ - Need fastest training
+ - Memory requirements high
+ - Budget allows
+
+ ### Single vs Multi-GPU
+
+ **Choose single GPU when:**
+ - Model <7B parameters
+ - Budget constrained
+ - Simpler debugging
+
+ **Choose multi-GPU when:**
+ - Model >13B parameters
+ - Need faster training
+ - Large batch sizes required
+ - Cost-effective for large jobs
+
+ ## Quick Reference
+
+ ```python
+ # Model size β†’ Hardware selection
+ HARDWARE_MAP = {
+     "<1B": "t4-small",
+     "1-3B": "a10g-small",
+     "3-7B": "a10g-large",
+     "7-13B": "a10g-large (LoRA) or a100-large",
+     ">13B": "a100-large (LoRA required)"
+ }
+ ```
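The map above can also be expressed as a small selector function. The thresholds below mirror the guide's table; this is a sketch for convenience, not part of any official API:

```python
def pick_flavor(params_billions: float) -> str:
    """Suggest a job flavor from model size, following the guide's table."""
    if params_billions < 1:
        return "t4-small"
    if params_billions <= 3:
        return "a10g-small"
    if params_billions <= 7:
        return "a10g-large"
    if params_billions <= 13:
        return "a10g-large (LoRA) or a100-large"
    return "a100-large (LoRA required)"
```

Example: `pick_flavor(0.5)` suggests `t4-small`, while `pick_flavor(70)` points at `a100-large (LoRA required)`.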
references/hub_saving.md ADDED
@@ -0,0 +1,364 @@
+ # Saving Training Results to Hugging Face Hub
+
+ **⚠️ CRITICAL:** Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.
+
+ ## Why Hub Push is Required
+
+ When running on Hugging Face Jobs:
+ - Environment is temporary
+ - All files deleted on job completion
+ - No local disk persistence
+ - Cannot access results after job ends
+
+ **Without Hub push, training is completely wasted.**
+
+ ## Required Configuration
+
+ ### 1. Training Configuration
+
+ In your SFTConfig or trainer config:
+
+ ```python
+ SFTConfig(
+     push_to_hub=True,                     # Enable Hub push
+     hub_model_id="username/model-name",   # Target repository
+ )
+ ```
+
+ ### 2. Job Configuration
+
+ When submitting the job:
+
+ ```python
+ hf_jobs("uv", {
+     "script": "train.py",
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Provide authentication
+ })
+ ```
+
+ **The `$HF_TOKEN` placeholder is automatically replaced with your Hugging Face token.**
+
+ ## Complete Example
+
+ ```python
+ # train.py
+ # /// script
+ # dependencies = ["trl"]
+ # ///
+
+ from trl import SFTTrainer, SFTConfig
+ from datasets import load_dataset
+
+ dataset = load_dataset("trl-lib/Capybara", split="train")
+
+ # Configure with Hub push
+ config = SFTConfig(
+     output_dir="my-model",
+     num_train_epochs=3,
+
+     # βœ… CRITICAL: Hub push configuration
+     push_to_hub=True,
+     hub_model_id="myusername/my-trained-model",
+     hub_token=None,  # None uses the HF_TOKEN from the environment
+ )
+
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset,
+     args=config,
+ )
+
+ trainer.train()
+
+ # βœ… Push final model
+ trainer.push_to_hub()
+
+ print("βœ… Model saved to: https://huggingface.co/myusername/my-trained-model")
+ ```
+
+ **Submit with authentication:**
+
+ ```python
+ hf_jobs("uv", {
+     "script": "train.py",
+     "flavor": "a10g-large",
+     "timeout": "2h",
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # βœ… Required!
+ })
+ ```
+
+ ## What Gets Saved
+
+ When `push_to_hub=True`:
+
+ 1. **Model weights** - Final trained parameters
+ 2. **Tokenizer** - Associated tokenizer
+ 3. **Configuration** - Model config (config.json)
+ 4. **Training arguments** - Hyperparameters used
+ 5. **Model card** - Auto-generated documentation
+ 6. **Checkpoints** - If `save_strategy="steps"` enabled
+
+ ## Checkpoint Saving
+
+ Save intermediate checkpoints during training:
+
+ ```python
+ SFTConfig(
+     output_dir="my-model",
+     push_to_hub=True,
+     hub_model_id="username/my-model",
+
+     # Checkpoint configuration
+     save_strategy="steps",
+     save_steps=100,       # Save every 100 steps
+     save_total_limit=3,   # Keep only last 3 checkpoints
+ )
+ ```
+
+ **Benefits:**
+ - Resume training if a job fails
+ - Compare checkpoint performance
+ - Use intermediate models
+
+ **Checkpoints are pushed to:** `username/my-model` (same repo)
+
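To reason about how many checkpoints a run will actually leave on the Hub, the settings above combine like this. A rough sketch that ignores gradient accumulation and assumes one optimizer step per batch:

```python
import math

def checkpoints_kept(num_examples: int, batch_size: int, epochs: int,
                     save_steps: int, save_total_limit: int) -> int:
    """Rough count of checkpoints remaining after training finishes."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    saved = total_steps // save_steps      # one checkpoint every save_steps steps
    return min(saved, save_total_limit)    # older ones are rotated out
```

With 1,000 examples, batch size 4, 3 epochs, and `save_steps=100`, 7 checkpoints are written but `save_total_limit=3` keeps only the last 3.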
+ ## Authentication Methods
+
+ ### Method 1: Automatic Token (Recommended)
+
+ ```python
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
+ ```
+
+ Uses your logged-in Hugging Face token automatically.
+
+ ### Method 2: Explicit Token
+
+ ```python
+ "secrets": {"HF_TOKEN": "hf_abc123..."}
+ ```
+
+ Provide the token explicitly (not recommended for security).
+
+ ### Method 3: Environment Variable
+
+ ```python
+ "env": {"HF_TOKEN": "hf_abc123..."}
+ ```
+
+ Pass as a regular environment variable (less secure than secrets).
+
+ **Always prefer Method 1** for security and convenience.
+
+ ## Verification Checklist
+
+ Before submitting any training job, verify:
+
+ - [ ] `push_to_hub=True` in training config
+ - [ ] `hub_model_id` is specified (format: `username/model-name`)
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
+ - [ ] Repository name doesn't conflict with existing repos
+ - [ ] You have write access to the target namespace
+
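The checklist above can be partially automated before submission. The sketch below is a hypothetical helper operating on plain dicts (not a real TRL or Jobs API); it flags the configuration mistakes that lose results:

```python
def preflight_errors(training_cfg: dict, job_cfg: dict) -> list:
    """Return a list of problems that would lose or block training results."""
    errors = []
    if not training_cfg.get("push_to_hub"):
        errors.append("push_to_hub is not enabled; results will be lost")
    if "/" not in training_cfg.get("hub_model_id", ""):
        errors.append("hub_model_id must look like 'username/model-name'")
    if job_cfg.get("secrets", {}).get("HF_TOKEN") is None:
        errors.append("job secrets must include HF_TOKEN")
    return errors
```

An empty list means the three critical settings are in place; anything else should be fixed before spending GPU time.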
+ ## Repository Setup
+
+ ### Automatic Creation
+
+ If the repository doesn't exist, it's created automatically on the first push.
+
+ ### Manual Creation
+
+ Create the repository before training:
+
+ ```python
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ api.create_repo(
+     repo_id="username/model-name",
+     repo_type="model",
+     private=False,  # or True for a private repo
+ )
+ ```
+
+ ### Repository Naming
+
+ **Valid names:**
+ - `username/my-model`
+ - `username/model-name`
+ - `organization/model-name`
+
+ **Invalid names:**
+ - `model-name` (missing username)
+ - `username/model name` (spaces not allowed)
+ - `username/MODEL` (uppercase discouraged)
+
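The naming rules above fit in a small validator. This sketch treats uppercase names as invalid, following the list above, even though the Hub technically only discourages them:

```python
def valid_repo_id(repo_id: str) -> bool:
    """Check 'namespace/name' shape: exactly one slash, no spaces, lowercase name."""
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        return False  # missing namespace or name
    if " " in repo_id:
        return False  # spaces are not allowed
    name = parts[1]
    return name == name.lower()  # uppercase discouraged; rejected here
```

Checking `hub_model_id` with something like this before submission catches the malformed names listed above.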
+ ## Troubleshooting
+
+ ### Error: 401 Unauthorized
+
+ **Cause:** HF_TOKEN not provided or invalid
+
+ **Solutions:**
+ 1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
+ 2. Check you're logged in: `huggingface-cli whoami`
+ 3. Re-login: `huggingface-cli login`
+
+ ### Error: 403 Forbidden
+
+ **Cause:** No write access to repository
+
+ **Solutions:**
+ 1. Check the repository namespace matches your username
+ 2. Verify you're a member of the organization (if using an org namespace)
+ 3. Check the repository isn't private (if accessing an org repo)
+
+ ### Error: Repository not found
+
+ **Cause:** Repository doesn't exist and auto-creation failed
+
+ **Solutions:**
+ 1. Manually create the repository first
+ 2. Check the repository name format
+ 3. Verify the namespace exists
+
+ ### Error: Push failed during training
+
+ **Cause:** Network issues or Hub unavailable
+
+ **What happens:** Training itself continues; only the push fails, and checkpoints pushed earlier may already be on the Hub.
+
+ **Solutions:**
+ 1. Check network/Hub status and retry
+ 2. Re-run the push manually before the job completes (see Manual Push After Training below)
+
+ ### Issue: Model saved but not visible
+
+ **Possible causes:**
+ 1. Repository is privateβ€”check https://huggingface.co/username
+ 2. Wrong namespaceβ€”verify `hub_model_id` matches your login
+ 3. Push still in progressβ€”wait a few minutes
+
+ ## Manual Push After Training
+
+ If training completes but the push fails, push manually:
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load from local checkpoint
+ model = AutoModel.from_pretrained("./output_dir")
+ tokenizer = AutoTokenizer.from_pretrained("./output_dir")
+
+ # Push to Hub
+ model.push_to_hub("username/model-name", token="hf_abc123...")
+ tokenizer.push_to_hub("username/model-name", token="hf_abc123...")
+ ```
+
+ **Note:** Only possible if the job hasn't completed yet (the files still exist).
+
+ ## Best Practices
+
+ 1. **Always enable `push_to_hub=True`**
+ 2. **Use checkpoint saving** for long training runs
+ 3. **Verify Hub push** in logs before the job completes
+ 4. **Set an appropriate `save_total_limit`** to avoid excessive checkpoints
+ 5. **Use descriptive repo names** (e.g., `qwen-capybara-sft` not `model1`)
+ 6. **Add a model card** with training details
+ 7. **Tag models** with relevant tags (e.g., `text-generation`, `fine-tuned`)
+
+ ## Monitoring Push Progress
+
+ Check logs for push progress:
+
+ ```python
+ hf_jobs("logs", {"job_id": "your-job-id"})
+ ```
+
+ **Look for:**
+ ```
+ Pushing model to username/model-name...
+ Upload file pytorch_model.bin: 100%
+ βœ… Model pushed successfully
+ ```
+
+ ## Example: Full Production Setup
+
+ ```python
+ # production_train.py
+ # /// script
+ # dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
+ # ///
+
+ from datasets import load_dataset
+ from peft import LoraConfig
+ from trl import SFTTrainer, SFTConfig
+ import os
+
+ # Verify token is available
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"
+
+ # Load dataset
+ dataset = load_dataset("trl-lib/Capybara", split="train")
+ print(f"βœ… Dataset loaded: {len(dataset)} examples")
+
+ # Configure with comprehensive Hub settings
+ config = SFTConfig(
+     output_dir="qwen-capybara-sft",
+
+     # Hub configuration
+     push_to_hub=True,
+     hub_model_id="myusername/qwen-capybara-sft",
+     hub_strategy="checkpoint",  # Push checkpoints
+
+     # Checkpoint configuration
+     save_strategy="steps",
+     save_steps=100,
+     save_total_limit=3,
+
+     # Training settings
+     num_train_epochs=3,
+     per_device_train_batch_size=4,
+
+     # Logging
+     logging_steps=10,
+     logging_first_step=True,
+ )
+
+ # Train with LoRA
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset,
+     args=config,
+     peft_config=LoraConfig(r=16, lora_alpha=32),
+ )
+
+ print("πŸš€ Starting training...")
+ trainer.train()
+
+ print("πŸ’Ύ Pushing final model to Hub...")
+ trainer.push_to_hub()
+
+ print("βœ… Training complete!")
+ print("Model available at: https://huggingface.co/myusername/qwen-capybara-sft")
+ ```
+
+ **Submit:**
+
+ ```python
+ hf_jobs("uv", {
+     "script": "production_train.py",
+     "flavor": "a10g-large",
+     "timeout": "6h",
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}
+ })
+ ```
+
+ ## Key Takeaway
+
+ **Without `push_to_hub=True` and `secrets={"HF_TOKEN": "$HF_TOKEN"}`, all training results are permanently lost.**
+
+ Always verify both are configured before submitting any training job.
references/reliability_principles.md ADDED
@@ -0,0 +1,371 @@
# Reliability Principles for Training Jobs

These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution.

## Principle 1: Always Verify Before Use

**Rule:** Never assume repos, datasets, or resources exist. Verify with tools first.

### What It Prevents

- **Non-existent datasets** - Jobs fail immediately when the dataset doesn't exist
- **Typos in names** - Simple mistakes like "argilla-dpo-mix-7k" vs "ultrafeedback_binarized"
- **Incorrect paths** - Old or moved repos, renamed files
- **Missing dependencies** - Undocumented requirements

### How to Apply

**Before submitting ANY job:**

```python
# Verify dataset exists
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
hub_repo_details(["author/dataset-name"], repo_type="dataset")

# Verify model exists
hub_repo_details(["org/model-name"], repo_type="model")

# Check script/file paths (for URL-based scripts)
# Verify before using: https://github.com/user/repo/blob/main/script.py
```

**Examples that would have caught errors:**

```python
# ❌ WRONG: Assumed dataset exists
hf_jobs("uv", {
    "script": """...""",
    "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"}  # Doesn't exist!
})

# βœ… CORRECT: Verify first
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name

hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
# Confirms it exists before using
```

### Implementation Checklist

- [ ] Check dataset exists before training
- [ ] Verify base model exists before fine-tuning
- [ ] Confirm adapter model exists before GGUF conversion
- [ ] Test script URLs are valid before submitting
- [ ] Validate file paths in repositories
- [ ] Check for recent updates/renames of resources

**Time cost:** 5-10 seconds
**Time saved:** Hours of failed job time + debugging

---

## Principle 2: Prioritize Reliability Over Performance

**Rule:** Default to what is most likely to succeed, not what is theoretically fastest.

### What It Prevents

- **Hardware incompatibilities** - Features that fail on certain GPUs
- **Unstable optimizations** - Speed-ups that cause crashes
- **Complex configurations** - More failure points
- **Build system issues** - Unreliable compilation methods

### How to Apply

**Choose reliability:**

```python
# ❌ RISKY: Aggressive optimization that may fail
SFTConfig(
    torch_compile=True,      # Can fail on T4, A10G GPUs
    optim="adamw_bnb_8bit",  # Requires specific setup
    fp16=False,              # May cause training instability
    ...
)

# βœ… SAFE: Proven defaults
SFTConfig(
    # torch_compile=True,    # Enable on H100 for ~20% speedup
    optim="adamw_torch",     # Standard, always works
    fp16=True,               # Stable and fast
    ...
)
```

**For build processes:**

```python
# ❌ UNRELIABLE: Uses make (platform-dependent)
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"], check=True)

# βœ… RELIABLE: Uses CMake (consistent, documented)
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Disable CUDA for faster, more reliable build
], check=True)

subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True)
```

### Real-World Example

**The `torch.compile` failure:**
- Added for a "20% speedup" on H100
- **Failed fatally on T4-medium** with a cryptic error
- Misdiagnosed as a dataset issue (cost hours)
- **Fix:** Disable by default, add as an optional comment

**Result:** Reliability > 20% performance gain

### Implementation Checklist

- [ ] Use proven, standard configurations by default
- [ ] Comment out performance optimizations with hardware notes
- [ ] Use stable build systems (CMake > make)
- [ ] Test on target hardware before production
- [ ] Document known incompatibilities
- [ ] Provide "safe" and "fast" variants when needed

**Performance loss:** 10-20% in the best case
**Reliability gain:** 95%+ success rate vs 60-70%

---

## Principle 3: Create Atomic, Self-Contained Scripts

**Rule:** Scripts should work as complete, independent units. Don't remove parts to "simplify."

### What It Prevents

- **Missing dependencies** - Removed "unnecessary" packages that are actually required
- **Incomplete processes** - Skipped steps that seem redundant
- **Environment assumptions** - Scripts that need pre-setup
- **Partial failures** - Some parts work, others fail silently

### How to Apply

**Complete dependency specifications:**

```python
# ❌ INCOMPLETE: "Simplified" by removing dependencies
# /// script
# dependencies = [
#     "transformers",
#     "peft",
#     "torch",
# ]
# ///

# βœ… COMPLETE: All dependencies explicit
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizers
#     "protobuf>=3.20.0",       # Required for tokenizers
#     "numpy",
#     "gguf",
# ]
# ///
```

**Complete build processes:**

```python
# ❌ INCOMPLETE: Assumes build tools exist
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"])  # FAILS: no gcc/make

# βœ… COMPLETE: Installs all requirements
subprocess.run(["apt-get", "update", "-qq"], check=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True)
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
# ... then build
```

### Real-World Example

**The `sentencepiece` failure:**
- Original script had it: worked fine
- "Simplified" version removed it: "doesn't look necessary"
- **GGUF conversion failed silently** - the tokenizer couldn't convert
- Hard to debug: no obvious error message
- **Fix:** Restore all original dependencies

**Result:** Don't remove dependencies without thorough testing

### Implementation Checklist

- [ ] All dependencies in PEP 723 header with version pins
- [ ] All system packages installed by script
- [ ] No assumptions about pre-existing environment
- [ ] No "optional" steps that are actually required
- [ ] Test scripts in clean environment
- [ ] Document why each dependency is needed

**Complexity:** Slightly longer scripts
**Reliability:** Scripts "just work" every time

---

## Principle 4: Provide Clear Error Context

**Rule:** When things fail, make it obvious what went wrong and how to fix it.

### How to Apply

**Wrap subprocess calls:**

```python
# ❌ UNCLEAR: Silent failure
subprocess.run([...], check=True, capture_output=True)

# βœ… CLEAR: Shows what failed
try:
    result = subprocess.run(
        [...],
        check=True,
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.stderr:
        print("Warnings:", result.stderr)
except subprocess.CalledProcessError as e:
    print("❌ Command failed!")
    print("STDOUT:", e.stdout)
    print("STDERR:", e.stderr)
    raise
```

**Validate inputs:**

```python
# ❌ UNCLEAR: Fails later with cryptic error
model = load_model(MODEL_NAME)

# βœ… CLEAR: Fails fast with clear message
if not MODEL_NAME:
    raise ValueError("MODEL_NAME environment variable not set!")

print(f"Loading model: {MODEL_NAME}")
try:
    model = load_model(MODEL_NAME)
    print("βœ… Model loaded successfully")
except Exception as e:
    print(f"❌ Failed to load model: {MODEL_NAME}")
    print(f"Error: {e}")
    print("Hint: Check that model exists on Hub")
    raise
```

### Implementation Checklist

- [ ] Wrap external calls with try/except
- [ ] Print stdout/stderr on failure
- [ ] Validate environment variables early
- [ ] Add progress indicators (βœ…, ❌, πŸ”„)
- [ ] Include hints for common failures
- [ ] Log configuration at start

---

## Principle 5: Test the Happy Path on Known-Good Inputs

**Rule:** Before using new code in production, test with inputs you know work.

### How to Apply

**Known-good test inputs:**

```python
# For training
TEST_DATASET = "trl-lib/Capybara"  # Small, well-formatted, widely used
TEST_MODEL = "Qwen/Qwen2.5-0.5B"   # Small, fast, reliable

# For GGUF conversion
TEST_ADAPTER = "evalstate/qwen-capybara-medium"  # Known working model
TEST_BASE = "Qwen/Qwen2.5-0.5B"                  # Compatible base
```

**Testing workflow:**

1. Test with known-good inputs first
2. If that works, try production inputs
3. If production fails, you know it's the inputs (not the code)
4. Isolate the difference

### Implementation Checklist

- [ ] Maintain list of known-good test models/datasets
- [ ] Test new scripts with test inputs first
- [ ] Document what makes inputs "good"
- [ ] Keep test jobs cheap (small models, short timeouts)
- [ ] Only move to production after test succeeds

**Time cost:** 5-10 minutes for test run
**Debugging time saved:** Hours

---

## Summary: The Reliability Checklist

Before submitting ANY job:

### Pre-Flight Checks
- [ ] **Verified** all repos/datasets exist (hub_repo_details)
- [ ] **Tested** with known-good inputs if new code
- [ ] **Using** proven hardware/configuration
- [ ] **Included** all dependencies in PEP 723 header
- [ ] **Installed** system requirements (build tools, etc.)
- [ ] **Set** appropriate timeout (not the default 30m)
- [ ] **Configured** Hub push with HF_TOKEN
- [ ] **Added** clear error handling

### Script Quality
- [ ] Self-contained (no external setup needed)
- [ ] Complete dependencies listed
- [ ] Build tools installed by script
- [ ] Progress indicators included
- [ ] Error messages are clear
- [ ] Configuration logged at start

### Job Configuration
- [ ] Timeout > expected runtime + 30% buffer
- [ ] Hardware appropriate for model size
- [ ] Secrets include HF_TOKEN
- [ ] Environment variables set correctly
- [ ] Cost estimated and acceptable

**Following these principles transforms the job success rate from ~60-70% to ~95%+**

---

## When Principles Conflict

Sometimes reliability and performance conflict. Here's how to choose:

| Scenario | Choose | Rationale |
|----------|--------|-----------|
| Demo/test | Reliability | Fast failure is worse than slow success |
| Production (first run) | Reliability | Prove it works before optimizing |
| Production (proven) | Performance | Safe to optimize after validation |
| Time-critical | Reliability | Failures cause more delay than slow runs |
| Cost-critical | Balanced | Test with small model, then optimize |

**General rule:** Reliability first, optimize second.

---

## Further Reading

- `troubleshooting.md` - Common issues and fixes
- `training_patterns.md` - Proven training configurations
- `gguf_conversion.md` - Production GGUF workflow
references/trackio_guide.md ADDED
@@ -0,0 +1,189 @@
# Trackio Integration for TRL Training

**Trackio** is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure.

⚠️ **IMPORTANT**: For Jobs training (remote cloud GPUs):
- Training happens on ephemeral cloud runners (not your local machine)
- Trackio syncs metrics to a Hugging Face Space for real-time monitoring
- Without a Space, metrics are lost when the job completes
- The Space dashboard persists your training metrics permanently

## Setting Up Trackio for Jobs

**Step 1: Add trackio dependency**
```python
# /// script
# dependencies = [
#     "trl>=0.12.0",
#     "trackio",  # Required!
# ]
# ///
```

**Step 2: Create a Trackio Space (one-time setup)**

**Option A: Let Trackio auto-create (Recommended)**
Pass a `space_id` to `trackio.init()` and Trackio will automatically create the Space if it doesn't exist.

**Option B: Create manually**
- Create a Space via the Hub UI at https://huggingface.co/new-space
- Select the Gradio SDK
- OR use the command: `huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio`

**Step 3: Initialize Trackio with space_id**
```python
import trackio

trackio.init(
    project="my-training",
    space_id="username/trackio",  # CRITICAL for Jobs! Replace 'username' with your HF username
    config={
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
    }
)
```

**Step 4: Configure TRL to use Trackio**
```python
SFTConfig(
    report_to="trackio",
    # ... other config
)
```

**Step 5: Finish tracking**
```python
trainer.train()
trackio.finish()  # Ensures final metrics are synced
```

## What Trackio Tracks

Trackio automatically logs:
- βœ… Training loss
- βœ… Learning rate
- βœ… GPU utilization
- βœ… Memory usage
- βœ… Training throughput
- βœ… Custom metrics

## How It Works with Jobs

1. **Training runs** β†’ Metrics logged to a local SQLite DB
2. **Every 5 minutes** β†’ Trackio syncs the DB to an HF Dataset (Parquet)
3. **Space dashboard** β†’ Reads from the Dataset, displays metrics in real time
4. **Job completes** β†’ Final sync ensures all metrics are persisted

## Default Configuration Pattern

**Use sensible defaults for trackio configuration unless the user requests otherwise.**

### Recommended Defaults

```python
import trackio

trackio.init(
    project="qwen-capybara-sft",
    name="baseline-run",          # Descriptive name the user will recognize
    space_id="username/trackio",  # Default space: {username}/trackio
    config={
        # Keep config minimal - hyperparameters and model/dataset info only
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
        "num_epochs": 3,
    }
)
```

**Key principles:**
- **Space ID**: Use `{username}/trackio` with "trackio" as the default space name
- **Run naming**: Unless otherwise specified, name the run in a way the user will recognize
- **Config**: Keep it minimal - don't automatically capture job metadata unless requested
- **Grouping**: Optional - only use it if the user requests organizing related experiments

## Grouping Runs (Optional)

The `group` parameter helps organize related runs together in the dashboard sidebar. This is useful when the user is running multiple experiments with different configurations but wants to compare them together:

```python
# Example: Group runs by experiment type
trackio.init(project="my-project", run_name="baseline-run-1", group="baseline")
trackio.init(project="my-project", run_name="augmented-run-1", group="augmented")
trackio.init(project="my-project", run_name="tuned-run-1", group="tuned")
```

Runs with the same group name can be grouped together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter:

```python
# Hyperparameter sweep - group by learning rate
trackio.init(project="hyperparam-sweep", run_name="lr-0.001-run", group="lr_0.001")
trackio.init(project="hyperparam-sweep", run_name="lr-0.01-run", group="lr_0.01")
```

## Environment Variables for Jobs

You can configure trackio using environment variables instead of passing parameters to `trackio.init()`. This is useful for managing configuration across multiple jobs.

**`HF_TOKEN`**
Required for creating Spaces and writing to datasets (passed via `secrets`):
```python
hf_jobs("uv", {
    "script": "...",
    "secrets": {
        "HF_TOKEN": "$HF_TOKEN"  # Enables Space creation and Hub push
    }
})
```

### Example with Environment Variables

```python
hf_jobs("uv", {
    "script": """
# Training script - trackio config from environment
import trackio
from datetime import datetime

# Auto-generate run name
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
run_name = f"sft_qwen25_{timestamp}"

# Project and space_id can come from environment variables
trackio.init(run_name=run_name, group="SFT")

# ... training code ...
trackio.finish()
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**When to use environment variables:**
- Managing multiple jobs with the same configuration
- Keeping training scripts portable across projects
- Separating configuration from code

**When to use direct parameters:**
- Single job with a specific configuration
- When clarity in code is preferred
- When each job has a different project/space

## Viewing the Dashboard

After starting training:
1. Navigate to the Space: `https://huggingface.co/spaces/username/trackio`
2. The Gradio dashboard shows all tracked experiments
3. Filter by project, compare runs, view charts with smoothing

## Recommendation

- **Trackio**: Best for real-time monitoring during long training runs
- **Weights & Biases**: Best for team collaboration, requires an account
references/training_methods.md ADDED
@@ -0,0 +1,150 @@
# TRL Training Methods Overview

TRL (Transformer Reinforcement Learning) provides multiple training methods for fine-tuning and aligning language models. This reference provides a brief overview of each method.

## Supervised Fine-Tuning (SFT)

**What it is:** Standard instruction tuning with supervised learning on demonstration data.

**When to use:**
- Initial fine-tuning of base models on task-specific data
- Teaching new capabilities or domains
- Most common starting point for fine-tuning

**Dataset format:** Conversational format with a "messages" field, OR a plain "text" field, OR prompt/completion pairs
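As a rough sketch, one record in each of the three accepted formats looks like the following (field names follow TRL's dataset-format conventions; the content strings are made up for illustration):

```python
# Hypothetical example records for the three accepted SFT formats.

# 1. Conversational format: a "messages" list of chat turns
conversational = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# 2. Plain-text format: a single "text" field with the full training text
text_only = {"text": "What is the capital of France? The capital of France is Paris."}

# 3. Prompt/completion format: separate fields for input and target
prompt_completion = {
    "prompt": "What is the capital of France?",
    "completion": "The capital of France is Paris.",
}
```

A dataset only needs to use one of these layouts consistently; the trainer infers the format from the column names.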

**Example:**
```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="my-model",
        push_to_hub=True,
        hub_model_id="username/my-model",
        eval_strategy="no",  # Disable eval for simple example
        # max_length=1024 is the default - only set if you need a different length
    )
)
trainer.train()
```

**Note:** For production training with evaluation monitoring, see `scripts/train_sft_example.py`

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer")`

## Direct Preference Optimization (DPO)

**What it is:** Alignment method that trains directly on preference pairs (chosen vs rejected responses) without requiring a reward model.

**When to use:**
- Aligning models to human preferences
- Improving response quality after SFT
- You have paired preference data (chosen/rejected responses)

**Dataset format:** Preference pairs with "chosen" and "rejected" fields

**Example:**
```python
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use an instruct model
    train_dataset=dataset,
    args=DPOConfig(
        output_dir="dpo-model",
        beta=0.1,  # KL penalty coefficient
        eval_strategy="no",  # Disable eval for simple example
        # max_length=1024 is the default - only set if you need a different length
    )
)
trainer.train()
```

**Note:** For production training with evaluation monitoring, see `scripts/train_dpo_example.py`

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`

## Group Relative Policy Optimization (GRPO)

**What it is:** Online RL method that optimizes relative to group performance, useful for tasks with verifiable rewards.

**When to use:**
- Tasks with automatic reward signals (code execution, math verification)
- Online learning scenarios
- When offline DPO data is insufficient

**Dataset format:** Prompt-only format (the model generates responses, and the reward is computed online)

**Example:**
```python
# Use the TRL-maintained script
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/math_shepherd",
        "--output_dir", "grpo-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`

## Reward Modeling

**What it is:** Train a reward model to score responses, used as a component in RLHF pipelines.

**When to use:**
- Building an RLHF pipeline
- Need automatic quality scoring
- Creating reward signals for PPO training

**Dataset format:** Preference pairs with "chosen" and "rejected" responses
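As with DPO, each record pairs one preferred and one dispreferred response to the same prompt. A hypothetical record (the field names follow the "chosen"/"rejected" convention above; the strings are illustrative):

```python
# Hypothetical preference-pair record, as consumed by reward modeling and DPO.
preference_record = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes the training data and fails to generalize.",
    "rejected": "Overfitting is bad.",
}

# Both responses answer the same prompt; they differ only in quality
assert set(preference_record) == {"prompt", "chosen", "rejected"}
```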

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/reward_trainer")`

## Method Selection Guide

| Method | Complexity | Data Required | Use Case |
|--------|-----------|---------------|----------|
| **SFT** | Low | Demonstrations | Initial fine-tuning |
| **DPO** | Medium | Paired preferences | Post-SFT alignment |
| **GRPO** | Medium | Prompts + reward fn | Online RL with automatic rewards |
| **Reward** | Medium | Paired preferences | Building RLHF pipeline |

## Recommended Pipeline

**For most use cases:**
1. **Start with SFT** - Fine-tune the base model on task data
2. **Follow with DPO** - Align to preferences using paired data
3. **Optional: GGUF conversion** - Deploy for local inference

**For advanced RL scenarios:**
1. **Start with SFT** - Fine-tune the base model
2. **Train a reward model** - On preference data

## Dataset Format Reference

For complete dataset format specifications, use:
```python
hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
```

Or validate your dataset:
```bash
uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
  --dataset your/dataset --split train
```

## See Also

- `references/training_patterns.md` - Common training patterns and examples
- `scripts/train_sft_example.py` - Complete SFT template
- `scripts/train_dpo_example.py` - Complete DPO template
- [Dataset Inspector](https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py) - Dataset format validation tool
references/training_patterns.md ADDED
@@ -0,0 +1,203 @@
# Common Training Patterns

This guide provides common training patterns and use cases for TRL on Hugging Face Jobs.

## Multi-GPU Training

Automatic distributed training across multiple GPUs. TRL/Accelerate handles distribution automatically:

```python
hf_jobs("uv", {
    "script": """
# Your training script here (same as single GPU)
# No changes needed - Accelerate detects multiple GPUs
""",
    "flavor": "a10g-largex2",  # 2x A10G GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**Tips for multi-GPU:**
- No code changes needed
- Use `per_device_train_batch_size` (per GPU, not total)
- Effective batch size = `per_device_train_batch_size` Γ— `num_gpus` Γ— `gradient_accumulation_steps`
- Monitor GPU utilization to ensure both GPUs are being used
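For instance, the effective-batch-size formula works out as follows for the 2x A10G flavor (the per-device batch and accumulation values here are illustrative, not recommendations):

```python
# Illustrative effective batch size on a10g-largex2 (2 GPUs)
per_device_train_batch_size = 4
num_gpus = 2
gradient_accumulation_steps = 8

effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(effective_batch_size)  # 64 samples per optimizer step
```

Keeping the effective batch size constant when changing GPU count (by adjusting `gradient_accumulation_steps`) preserves comparable training dynamics.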

## DPO Training (Preference Learning)

Train with preference data for alignment:

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["trl>=0.12.0", "trackio"]
# ///

from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
import trackio

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Create train/eval split
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

config = DPOConfig(
    output_dir="dpo-model",
    push_to_hub=True,
    hub_model_id="username/dpo-model",
    num_train_epochs=1,
    beta=0.1,  # KL penalty coefficient
    eval_strategy="steps",
    eval_steps=50,
    report_to="trackio",
    run_name="baseline_run",  # Use a meaningful run name
    # max_length=1024,  # Default - only set if you need a different sequence length
)

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use an instruct model as base
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # IMPORTANT: Provide eval_dataset when eval_strategy is enabled
    args=config,
)

trainer.train()
trainer.push_to_hub()
trackio.finish()
""",
    "flavor": "a10g-large",
    "timeout": "3h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**For DPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`

## GRPO Training (Online RL)

Group Relative Policy Optimization for online reinforcement learning:

```python
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/math_shepherd",
        "--output_dir", "grpo-model",
        "--push_to_hub",
        "--hub_model_id", "username/grpo-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**For GRPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`

## Trackio Configuration

**Use sensible defaults for trackio setup.** See `references/trackio_guide.md` for complete documentation, including grouping runs for experiments.

### Basic Pattern

```python
import trackio

trackio.init(
    project="my-training",
    run_name="baseline-run",      # Descriptive name the user will recognize
    space_id="username/trackio",  # Default space: {username}/trackio
    config={
        # Keep config minimal - hyperparameters and model/dataset info only
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
    }
)

# Your training code...

trackio.finish()
```

### Grouping for Experiments (Optional)

When the user wants to compare related runs, use the `group` parameter:

```python
# Hyperparameter sweep
trackio.init(project="hyperparam-sweep", run_name="lr-0.001", group="lr_0.001")
trackio.init(project="hyperparam-sweep", run_name="lr-0.01", group="lr_0.01")
```

## Pattern Selection Guide

| Use Case | Pattern | Hardware | Time |
|----------|---------|----------|------|
| SFT training | `scripts/train_sft_example.py` | a10g-large | 2-6 hours |
| Large dataset (>10K) | Multi-GPU | a10g-largex2 | 4-12 hours |
| Preference learning | DPO Training | a10g-large | 2-4 hours |
| Online RL | GRPO Training | a10g-large | 3-6 hours |

## Critical: Evaluation Dataset Requirements

**⚠️ IMPORTANT**: If you set `eval_strategy="steps"` or `eval_strategy="epoch"`, you **MUST** provide an `eval_dataset` to the trainer, or the training will hang.

### βœ… CORRECT - With eval dataset:
```python
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
    args=SFTConfig(eval_strategy="steps", ...),
)
```

### ❌ WRONG - Will hang:
```python
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # NO eval_dataset but eval_strategy="steps" ← WILL HANG
    args=SFTConfig(eval_strategy="steps", ...),
)
```

### Option: Disable evaluation if no eval dataset
```python
config = SFTConfig(
    eval_strategy="no",  # ← Explicitly disable evaluation
    # ... other config
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # No eval_dataset needed
    args=config,
)
```

## Best Practices

1. **Use train/eval splits** - Create an evaluation split for monitoring progress
2. **Enable Trackio** - Monitor progress in real time
3. **Add a 20-30% buffer to the timeout** - Account for loading/saving overhead
4. **Test with TRL official scripts first** - Use maintained examples before custom code
5. **Always provide eval_dataset** - When using eval_strategy, or set it to "no"
6. **Use multi-GPU for large models** - 7B+ models benefit significantly

## See Also

- `scripts/train_sft_example.py` - Complete SFT template with Trackio and eval split
- `scripts/train_dpo_example.py` - Complete DPO template
- `scripts/train_grpo_example.py` - Complete GRPO template
- `references/hardware_guide.md` - Detailed hardware specifications
- `references/training_methods.md` - Overview of all TRL training methods
- `references/troubleshooting.md` - Common issues and solutions
references/troubleshooting.md ADDED
@@ -0,0 +1,282 @@
1
+ # Troubleshooting TRL Training Jobs
2
+
3
+ Common issues and solutions when training with TRL on Hugging Face Jobs.
4
+
5
+ ## Training Hangs at "Starting training..." Step
6
+
7
+ **Problem:** Job starts but hangs at the training step - never progresses, never times out, just sits there.
8
+
9
+ **Root Cause:** Using `eval_strategy="steps"` or `eval_strategy="epoch"` without providing an `eval_dataset` to the trainer.
10
+
11
+ **Solution:**
12
+
13
+ **Option A: Provide eval_dataset (recommended)**
14
+ ```python
15
+ # Create train/eval split
16
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
17
+
18
+ trainer = SFTTrainer(
19
+ model="Qwen/Qwen2.5-0.5B",
20
+ train_dataset=dataset_split["train"],
21
+ eval_dataset=dataset_split["test"], # ← MUST provide when eval_strategy is enabled
22
+ args=SFTConfig(
23
+ eval_strategy="steps",
24
+ eval_steps=50,
25
+ ...
26
+ ),
27
+ )
28
+ ```
29
+
30
+ **Option B: Disable evaluation**
31
+ ```python
32
+ trainer = SFTTrainer(
33
+ model="Qwen/Qwen2.5-0.5B",
34
+ train_dataset=dataset,
35
+ # No eval_dataset
36
+ args=SFTConfig(
37
+ eval_strategy="no", # ← Explicitly disable
38
+ ...
39
+ ),
40
+ )
41
+ ```
42
+
43
+ **Prevention:**
44
+ - Always create train/eval split for better monitoring
45
+ - Use `dataset.train_test_split(test_size=0.1, seed=42)`
46
+ - Check example scripts: `scripts/train_sft_example.py` includes proper eval setup
47
+
48
+ ## Job Times Out
49
+
50
+ **Problem:** Job terminates before training completes, all progress lost.
51
+
52
+ **Solutions:**
53
+ - Increase timeout parameter (e.g., `"timeout": "4h"`)
54
+ - Reduce `num_train_epochs` or use smaller dataset slice
55
+ - Use smaller model or enable LoRA/PEFT to speed up training
56
+ - Add 20-30% buffer to estimated time for loading/saving overhead
57
+
58
+ **Prevention:**
59
+ - Always start with a quick demo run to estimate timing
60
+ - Use `scripts/estimate_cost.py` to get time estimates
61
+ - Monitor first runs closely via Trackio or logs
62
+
63
+ ## Model Not Saved to Hub
64
+
65
+ **Problem:** Training completes but model doesn't appear on Hub - all work lost.
66
+
67
+ **Check:**
68
+ - [ ] `push_to_hub=True` in training config
69
+ - [ ] `hub_model_id` specified with username (e.g., `"username/model-name"`)
70
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job submission
71
+ - [ ] User has write access to target repo
72
+ - [ ] Token has write permissions (check at https://huggingface.co/settings/tokens)
73
+ - [ ] Training script calls `trainer.push_to_hub()` at the end
74
+
75
+ **See:** `references/hub_saving.md` for detailed Hub authentication troubleshooting
76
+
77
+ ## Out of Memory (OOM)
78
+
79
+ **Problem:** Job fails with CUDA out of memory error.
80
+
81
+ **Solutions (in order of preference):**
82
+ 1. **Reduce batch size:** Lower `per_device_train_batch_size` (try 4 β†’ 2 β†’ 1)
83
+ 2. **Increase gradient accumulation:** Raise `gradient_accumulation_steps` to maintain effective batch size
84
+ 3. **Disable evaluation:** Remove `eval_dataset` and `eval_strategy` (saves ~40% memory, good for demos)
85
+ 4. **Enable LoRA/PEFT:** Use `peft_config=LoraConfig(r=8, lora_alpha=16)` to train adapters only (smaller rank = less memory)
86
+ 5. **Use larger GPU:** Switch from `t4-small` β†’ `l4x1` β†’ `a10g-large` β†’ `a100-large`
87
+ 6. **Enable gradient checkpointing:** Set `gradient_checkpointing=True` in config (slower but saves memory)
88
+ 7. **Use smaller model:** Try a smaller variant (e.g., 0.5B instead of 3B)
89
+
90
+ **Memory guidelines:**
91
+ - T4 (16GB): <1B models with LoRA
92
+ - A10G (24GB): 1-3B models with LoRA, <1B full fine-tune
93
+ - A100 (40GB/80GB): 7B+ models with LoRA, 3B full fine-tune
94
+
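When applying solutions 1-2, keep the effective batch size constant so training dynamics don't change; a quick sketch of the arithmetic:

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    # The batch the optimizer actually sees per update step
    return per_device * grad_accum * num_gpus

# Halving the per-device batch while doubling accumulation preserves the
# effective batch size (and typically resolves OOM at some speed cost).
assert effective_batch_size(4, 4) == effective_batch_size(2, 8) == 16
```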
95
+ ## Parameter Naming Issues
96
+
97
+ **Problem:** `TypeError: SFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'`
98
+
99
+ **Cause:** TRL config classes use `max_length`, not `max_seq_length`.
100
+
101
+ **Solution:**
102
+ ```python
103
+ # βœ… CORRECT - TRL uses max_length
104
+ SFTConfig(max_length=512)
105
+ DPOConfig(max_length=512)
106
+
107
+ # ❌ WRONG - This will fail
108
+ SFTConfig(max_seq_length=512)
109
+ ```
110
+
111
+ **Note:** Most TRL configs don't require explicit max_length - the default (1024) works well. Only set if you need a specific value.
112
+
113
+ ## Dataset Format Error
114
+
115
+ **Problem:** Training fails with dataset format errors or missing fields.
116
+
117
+ **Solutions:**
118
+ 1. **Check format documentation:**
119
+ ```python
120
+ hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
121
+ ```
122
+
123
+ 2. **Validate dataset before training:**
124
+ ```bash
125
+ uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
126
+ --dataset <dataset-name> --split train
127
+ ```
128
+ Or via hf_jobs:
129
+ ```python
130
+ hf_jobs("uv", {
131
+ "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
132
+ "script_args": ["--dataset", "dataset-name", "--split", "train"]
133
+ })
134
+ ```
135
+
136
+ 3. **Verify field names:**
137
+ - **SFT:** Needs "messages" field (conversational), OR "text" field, OR "prompt"/"completion"
138
+ - **DPO:** Needs "chosen" and "rejected" fields
139
+ - **GRPO:** Needs prompt-only format
140
+
141
+ 4. **Check dataset split:**
142
+ - Ensure split exists (e.g., `split="train"`)
143
+ - Preview dataset: `load_dataset("name", split="train[:5]")`
144
+
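These field checks can be done locally in a few lines before submitting a job; a simplified sketch of the logic `dataset_inspector.py` applies (the helper name is illustrative):

```python
def guess_trl_method(columns):
    cols = set(columns)
    if {"chosen", "rejected"} <= cols:
        return "DPO"
    if "messages" in cols or "text" in cols or {"prompt", "completion"} <= cols:
        return "SFT"
    if "prompt" in cols:
        return "GRPO"  # prompt-only format
    return None

print(guess_trl_method(["prompt", "chosen", "rejected"]))
```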
145
+ ## Import/Module Errors
146
+
147
+ **Problem:** Job fails with "ModuleNotFoundError" or import errors.
148
+
149
+ **Solutions:**
150
+ 1. **Add PEP 723 header with dependencies:**
151
+ ```python
152
+ # /// script
153
+ # dependencies = [
154
+ # "trl>=0.12.0",
155
+ # "peft>=0.7.0",
156
+ # "transformers>=4.36.0",
157
+ # ]
158
+ # ///
159
+ ```
160
+
161
+ 2. **Verify exact format:**
162
+ - Must have `# ///` delimiters (with space after `#`)
163
+ - Dependencies must be valid PyPI package names
164
+ - Check spelling and version constraints
165
+
166
+ 3. **Test locally first:**
167
+ ```bash
168
+ uv run train.py # Tests if dependencies are correct
169
+ ```
170
+
171
+ ## Authentication Errors
172
+
173
+ **Problem:** Job fails with authentication or permission errors when pushing to Hub.
174
+
175
+ **Solutions:**
176
+ 1. **Verify authentication:**
177
+ ```python
178
+ mcp__huggingface__hf_whoami() # Check who's authenticated
179
+ ```
180
+
181
+ 2. **Check token permissions:**
182
+ - Go to https://huggingface.co/settings/tokens
183
+ - Ensure token has "write" permission
184
+ - Token must not be "read-only"
185
+
186
+ 3. **Verify token in job:**
187
+ ```python
188
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Must be in job config
189
+ ```
190
+
191
+ 4. **Check repo permissions:**
192
+ - User must have write access to target repo
193
+ - If org repo, user must be member with write access
194
+ - Repo must exist or user must have permission to create
195
+
196
+ ## Job Stuck or Not Starting
197
+
198
+ **Problem:** Job shows "pending" or "starting" for extended period.
199
+
200
+ **Solutions:**
201
+ - Check Jobs dashboard for status: https://huggingface.co/jobs
202
+ - Verify hardware availability (some GPU types may have queues)
203
+ - Try different hardware flavor if one is heavily utilized
204
+ - Check for account billing issues (Jobs requires a paid plan)
205
+
206
+ **Typical startup times:**
207
+ - CPU jobs: 10-30 seconds
208
+ - GPU jobs: 30-90 seconds
209
+ - If >3 minutes: likely queued or stuck
210
+
211
+ ## Training Loss Not Decreasing
212
+
213
+ **Problem:** Training runs but loss stays flat or doesn't improve.
214
+
215
+ **Solutions:**
216
+ 1. **Check learning rate:** May be too low (try 2e-5 to 5e-5) or too high (try 1e-6)
217
+ 2. **Verify dataset quality:** Inspect examples to ensure they're reasonable
218
+ 3. **Check model size:** Very small models may not have capacity for task
219
+ 4. **Increase training steps:** May need more epochs or larger dataset
220
+ 5. **Verify dataset format:** Wrong format may cause degraded training
221
+
222
+ ## Logs Not Appearing
223
+
224
+ **Problem:** Cannot see training logs or progress.
225
+
226
+ **Solutions:**
227
+ 1. **Wait 30-60 seconds:** Initial logs can be delayed
228
+ 2. **Check logs via MCP tool:**
229
+ ```python
230
+ hf_jobs("logs", {"job_id": "your-job-id"})
231
+ ```
232
+ 3. **Use Trackio for real-time monitoring:** See `references/trackio_guide.md`
233
+ 4. **Verify job is actually running:**
234
+ ```python
235
+ hf_jobs("inspect", {"job_id": "your-job-id"})
236
+ ```
237
+
238
+ ## Checkpoint/Resume Issues
239
+
240
+ **Problem:** Cannot resume from checkpoint or checkpoint not saved.
241
+
242
+ **Solutions:**
243
+ 1. **Enable checkpoint saving:**
244
+ ```python
245
+ SFTConfig(
246
+ save_strategy="steps",
247
+ save_steps=100,
248
+ hub_strategy="every_save", # Push each checkpoint
249
+ )
250
+ ```
251
+
252
+ 2. **Verify checkpoints pushed to Hub:** Check model repo for checkpoint folders
253
+
254
+ 3. **Resume from checkpoint:**
255
+ ```python
256
+ trainer = SFTTrainer(
257
+ model="username/model-name", # Can be checkpoint path
258
+ resume_from_checkpoint="username/model-name/checkpoint-1000",
259
+ )
260
+ ```
261
+
262
+ ## Getting Help
263
+
264
+ If issues persist:
265
+
266
+ 1. **Check TRL documentation:**
267
+ ```python
268
+ hf_doc_search("your issue", product="trl")
269
+ ```
270
+
271
+ 2. **Check Jobs documentation:**
272
+ ```python
273
+ hf_doc_fetch("https://huggingface.co/docs/huggingface_hub/guides/jobs")
274
+ ```
275
+
276
+ 3. **Review related guides:**
277
+ - `references/hub_saving.md` - Hub authentication issues
278
+ - `references/hardware_guide.md` - Hardware selection and specs
279
+ - `references/training_patterns.md` - Eval dataset requirements
280
+ - SKILL.md "Working with Scripts" section - Script format and URL issues
281
+
282
+ 4. **Ask in HF forums:** https://discuss.huggingface.co/
scripts/convert_to_gguf.py ADDED
@@ -0,0 +1,350 @@
1
+ #!/usr/bin/env python3
2
+ # /// script
3
+ # dependencies = [
4
+ # "transformers>=4.36.0",
5
+ # "peft>=0.7.0",
6
+ # "torch>=2.0.0",
7
+ # "accelerate>=0.24.0",
8
+ # "huggingface_hub>=0.20.0",
9
+ # "sentencepiece>=0.1.99",
10
+ # "protobuf>=3.20.0",
11
+ # "numpy",
12
+ # "gguf",
13
+ # ]
14
+ # ///
15
+
16
+ """
17
+ GGUF Conversion Script - Production Ready
18
+
19
+ This script converts a LoRA fine-tuned model to GGUF format for use with:
20
+ - llama.cpp
21
+ - Ollama
22
+ - LM Studio
23
+ - Other GGUF-compatible tools
24
+
25
+ Usage:
26
+ Set environment variables:
27
+ - ADAPTER_MODEL: Your fine-tuned model (e.g., "username/my-finetuned-model")
28
+ - BASE_MODEL: Base model used for fine-tuning (e.g., "Qwen/Qwen2.5-0.5B")
29
+ - OUTPUT_REPO: Where to upload GGUF files (e.g., "username/my-model-gguf")
30
+ - HF_USERNAME: Your Hugging Face username (optional, for README)
31
+
32
+ Dependencies: All required packages are declared in PEP 723 header above.
33
+ Build tools (gcc, cmake) are installed automatically by this script.
34
+ """
35
+
36
+ import os
37
+ import torch
38
+ from transformers import AutoModelForCausalLM, AutoTokenizer
39
+ from peft import PeftModel
40
+ from huggingface_hub import HfApi
41
+ import subprocess
42
+
43
+ print("πŸ”„ GGUF Conversion Script")
44
+ print("=" * 60)
45
+
46
+ # Configuration from environment variables
47
+ ADAPTER_MODEL = os.environ.get("ADAPTER_MODEL", "evalstate/qwen-capybara-medium")
48
+ BASE_MODEL = os.environ.get("BASE_MODEL", "Qwen/Qwen2.5-0.5B")
49
+ OUTPUT_REPO = os.environ.get("OUTPUT_REPO", "evalstate/qwen-capybara-medium-gguf")
50
+ username = os.environ.get("HF_USERNAME", ADAPTER_MODEL.split('/')[0])
51
+
52
+ print(f"\nπŸ“¦ Configuration:")
53
+ print(f" Base model: {BASE_MODEL}")
54
+ print(f" Adapter model: {ADAPTER_MODEL}")
55
+ print(f" Output repo: {OUTPUT_REPO}")
56
+
57
+ # Step 1: Load base model and adapter
58
+ print("\nπŸ”§ Step 1: Loading base model and LoRA adapter...")
59
+ print(" (This may take a few minutes)")
60
+
61
+ base_model = AutoModelForCausalLM.from_pretrained(
62
+ BASE_MODEL,
63
+ torch_dtype=torch.float16,
64
+ device_map="auto",
65
+ trust_remote_code=True,
66
+ )
67
+ print(" βœ… Base model loaded")
68
+
69
+ # Load and merge adapter
70
+ print(" Loading LoRA adapter...")
71
+ model = PeftModel.from_pretrained(base_model, ADAPTER_MODEL)
72
+ print(" βœ… Adapter loaded")
73
+
74
+ print(" Merging adapter with base model...")
75
+ merged_model = model.merge_and_unload()
76
+ print(" βœ… Models merged!")
77
+
78
+ # Load tokenizer
79
+ tokenizer = AutoTokenizer.from_pretrained(ADAPTER_MODEL, trust_remote_code=True)
80
+ print(" βœ… Tokenizer loaded")
81
+
82
+ # Step 2: Save merged model temporarily
83
+ print("\nπŸ’Ύ Step 2: Saving merged model...")
84
+ merged_dir = "/tmp/merged_model"
85
+ merged_model.save_pretrained(merged_dir, safe_serialization=True)
86
+ tokenizer.save_pretrained(merged_dir)
87
+ print(f" βœ… Merged model saved to {merged_dir}")
88
+
89
+ # Step 3: Install llama.cpp for conversion
90
+ print("\nπŸ“₯ Step 3: Setting up llama.cpp for GGUF conversion...")
91
+
92
+ # CRITICAL: Install build tools FIRST (before cloning llama.cpp)
93
+ print(" Installing build tools...")
94
+ subprocess.run(
95
+ ["apt-get", "update", "-qq"],
96
+ check=True,
97
+ capture_output=True
98
+ )
99
+ subprocess.run(
100
+ ["apt-get", "install", "-y", "-qq", "build-essential", "cmake"],
101
+ check=True,
102
+ capture_output=True
103
+ )
104
+ print(" βœ… Build tools installed")
105
+
106
+ print(" Cloning llama.cpp repository...")
107
+ subprocess.run(
108
+ ["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"],
109
+ check=True,
110
+ capture_output=True
111
+ )
112
+ print(" βœ… llama.cpp cloned")
113
+
114
+ print(" Installing Python dependencies...")
115
+ subprocess.run(
116
+ ["pip", "install", "-r", "/tmp/llama.cpp/requirements.txt"],
117
+ check=True,
118
+ capture_output=True
119
+ )
120
+ # sentencepiece and protobuf are needed for tokenizer conversion
121
+ subprocess.run(
122
+ ["pip", "install", "sentencepiece", "protobuf"],
123
+ check=True,
124
+ capture_output=True
125
+ )
126
+ print(" βœ… Dependencies installed")
127
+
128
+ # Step 4: Convert to GGUF (FP16)
129
+ print("\nπŸ”„ Step 4: Converting to GGUF format (FP16)...")
130
+ gguf_output_dir = "/tmp/gguf_output"
131
+ os.makedirs(gguf_output_dir, exist_ok=True)
132
+
133
+ convert_script = "/tmp/llama.cpp/convert_hf_to_gguf.py"
134
+ model_name = ADAPTER_MODEL.split('/')[-1]
135
+ gguf_file = f"{gguf_output_dir}/{model_name}-f16.gguf"
136
+
137
+ print(f" Running: python {convert_script} {merged_dir}")
138
+ try:
139
+ result = subprocess.run(
140
+ [
141
+ "python", convert_script,
142
+ merged_dir,
143
+ "--outfile", gguf_file,
144
+ "--outtype", "f16"
145
+ ],
146
+ check=True,
147
+ capture_output=True,
148
+ text=True
149
+ )
150
+ print(result.stdout)
151
+ if result.stderr:
152
+ print("Warnings:", result.stderr)
153
+ except subprocess.CalledProcessError as e:
154
+ print(f"❌ Conversion failed!")
155
+ print("STDOUT:", e.stdout)
156
+ print("STDERR:", e.stderr)
157
+ raise
158
+ print(f" βœ… FP16 GGUF created: {gguf_file}")
159
+
160
+ # Step 5: Quantize to different formats
161
+ print("\nβš™οΈ Step 5: Creating quantized versions...")
162
+
163
+ # Build quantize tool using CMake (more reliable than make)
164
+ print(" Building quantize tool with CMake...")
165
+ try:
166
+ # Create build directory
167
+ os.makedirs("/tmp/llama.cpp/build", exist_ok=True)
168
+
169
+ # Configure with CMake
170
+ subprocess.run(
171
+ ["cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
172
+ "-DGGML_CUDA=OFF"], # Disable CUDA for faster build
173
+ check=True,
174
+ capture_output=True,
175
+ text=True
176
+ )
177
+
178
+ # Build just the quantize tool
179
+ subprocess.run(
180
+ ["cmake", "--build", "/tmp/llama.cpp/build", "--target", "llama-quantize", "-j", "4"],
181
+ check=True,
182
+ capture_output=True,
183
+ text=True
184
+ )
185
+ print(" βœ… Quantize tool built")
186
+ except subprocess.CalledProcessError as e:
187
+ print(f" ❌ Build failed!")
188
+ print("STDOUT:", e.stdout)
189
+ print("STDERR:", e.stderr)
190
+ raise
191
+
192
+ # Use the CMake build output path
193
+ quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
194
+
195
+ # Common quantization formats
196
+ quant_formats = [
197
+ ("Q4_K_M", "4-bit, medium quality (recommended)"),
198
+ ("Q5_K_M", "5-bit, higher quality"),
199
+ ("Q8_0", "8-bit, very high quality"),
200
+ ]
201
+
202
+ quantized_files = []
203
+ for quant_type, description in quant_formats:
204
+ print(f" Creating {quant_type} quantization ({description})...")
205
+ quant_file = f"{gguf_output_dir}/{model_name}-{quant_type.lower()}.gguf"
206
+
207
+ subprocess.run(
208
+ [quantize_bin, gguf_file, quant_file, quant_type],
209
+ check=True,
210
+ capture_output=True
211
+ )
212
+ quantized_files.append((quant_file, quant_type))
213
+
214
+ # Get file size
215
+ size_mb = os.path.getsize(quant_file) / (1024 * 1024)
216
+ print(f" βœ… {quant_type}: {size_mb:.1f} MB")
217
+
218
+ # Step 6: Upload to Hub
219
+ print("\n☁️ Step 6: Uploading to Hugging Face Hub...")
220
+ api = HfApi()
221
+
222
+ # Create repo
223
+ print(f" Creating repository: {OUTPUT_REPO}")
224
+ try:
225
+ api.create_repo(repo_id=OUTPUT_REPO, repo_type="model", exist_ok=True)
226
+ print(" βœ… Repository created")
227
+ except Exception as e:
228
+ print(f" ℹ️ Repository may already exist: {e}")
229
+
230
+ # Upload FP16 version
231
+ print(" Uploading FP16 GGUF...")
232
+ api.upload_file(
233
+ path_or_fileobj=gguf_file,
234
+ path_in_repo=f"{model_name}-f16.gguf",
235
+ repo_id=OUTPUT_REPO,
236
+ )
237
+ print(" βœ… FP16 uploaded")
238
+
239
+ # Upload quantized versions
240
+ for quant_file, quant_type in quantized_files:
241
+ print(f" Uploading {quant_type}...")
242
+ api.upload_file(
243
+ path_or_fileobj=quant_file,
244
+ path_in_repo=f"{model_name}-{quant_type.lower()}.gguf",
245
+ repo_id=OUTPUT_REPO,
246
+ )
247
+ print(f" βœ… {quant_type} uploaded")
248
+
249
+ # Create README
250
+ print("\nπŸ“ Creating README...")
251
+ readme_content = f"""---
252
+ base_model: {BASE_MODEL}
253
+ tags:
254
+ - gguf
255
+ - llama.cpp
256
+ - quantized
257
+ - trl
258
+ - sft
259
+ ---
260
+
261
+ # {OUTPUT_REPO.split('/')[-1]}
262
+
263
+ This is a GGUF conversion of [{ADAPTER_MODEL}](https://huggingface.co/{ADAPTER_MODEL}), which is a LoRA fine-tuned version of [{BASE_MODEL}](https://huggingface.co/{BASE_MODEL}).
264
+
265
+ ## Model Details
266
+
267
+ - **Base Model:** {BASE_MODEL}
268
+ - **Fine-tuned Model:** {ADAPTER_MODEL}
269
+ - **Training:** Supervised Fine-Tuning (SFT) with TRL
270
+ - **Format:** GGUF (for llama.cpp, Ollama, LM Studio, etc.)
271
+
272
+ ## Available Quantizations
273
+
274
+ | File | Quant | Size | Description | Use Case |
275
+ |------|-------|------|-------------|----------|
276
+ | {model_name}-f16.gguf | F16 | ~1GB | Full precision | Best quality, slower |
277
+ | {model_name}-q8_0.gguf | Q8_0 | ~500MB | 8-bit | High quality |
278
+ | {model_name}-q5_k_m.gguf | Q5_K_M | ~350MB | 5-bit medium | Good quality, smaller |
279
+ | {model_name}-q4_k_m.gguf | Q4_K_M | ~300MB | 4-bit medium | Recommended - good balance |
280
+
281
+ ## Usage
282
+
283
+ ### With llama.cpp
284
+
285
+ ```bash
286
+ # Download model
287
+ huggingface-cli download {OUTPUT_REPO} {model_name}-q4_k_m.gguf
288
+
289
+ # Run with llama.cpp
290
+ ./llama-cli -m {model_name}-q4_k_m.gguf -p "Your prompt here"
291
+ ```
292
+
293
+ ### With Ollama
294
+
295
+ 1. Create a `Modelfile`:
296
+ ```
297
+ FROM ./{model_name}-q4_k_m.gguf
298
+ ```
299
+
300
+ 2. Create the model:
301
+ ```bash
302
+ ollama create my-model -f Modelfile
303
+ ollama run my-model
304
+ ```
305
+
306
+ ### With LM Studio
307
+
308
+ 1. Download the `.gguf` file
309
+ 2. Import into LM Studio
310
+ 3. Start chatting!
311
+
312
+ ## License
313
+
314
+ Inherits the license from the base model: {BASE_MODEL}
315
+
316
+ ## Citation
317
+
318
+ ```bibtex
319
+ @misc{{{OUTPUT_REPO.split('/')[-1].replace('-', '_')},
320
+ author = {{{username}}},
321
+ title = {{{OUTPUT_REPO.split('/')[-1]}}},
322
+ year = {{2025}},
323
+ publisher = {{Hugging Face}},
324
+ url = {{https://huggingface.co/{OUTPUT_REPO}}}
325
+ }}
326
+ ```
327
+
328
+ ---
329
+
330
+ *Converted to GGUF format using llama.cpp*
331
+ """
332
+
333
+ api.upload_file(
334
+ path_or_fileobj=readme_content.encode(),
335
+ path_in_repo="README.md",
336
+ repo_id=OUTPUT_REPO,
337
+ )
338
+ print(" βœ… README uploaded")
339
+
340
+ print("\n" + "=" * 60)
341
+ print("βœ… GGUF Conversion Complete!")
342
+ print(f"πŸ“¦ Repository: https://huggingface.co/{OUTPUT_REPO}")
343
+ print(f"\nπŸ“₯ Download with:")
344
+ print(f" huggingface-cli download {OUTPUT_REPO} {model_name}-q4_k_m.gguf")
345
+ print(f"\nπŸš€ Use with Ollama:")
346
+ print(" 1. Download the GGUF file")
347
+ print(f" 2. Create Modelfile: FROM ./{model_name}-q4_k_m.gguf")
348
+ print(" 3. ollama create my-model -f Modelfile")
349
+ print(" 4. ollama run my-model")
350
+ print("=" * 60)
scripts/dataset_inspector.py ADDED
@@ -0,0 +1,416 @@
1
+ #!/usr/bin/env python3
2
+ # /// script
3
+ # dependencies = []
4
+ # ///
5
+ """
6
+ Dataset Format Inspector for TRL Training (LLM-Optimized Output)
7
+
8
+ Inspects Hugging Face datasets to determine TRL training compatibility.
9
+ Uses Datasets Server API for instant results - no dataset download needed!
10
+
11
+ ULTRA-EFFICIENT: Uses HF Datasets Server API - completes in <2 seconds.
12
+
13
+ Usage with HF Jobs:
14
+ hf_jobs("uv", {
15
+ "script": "https://huggingface.co/datasets/evalstate/trl-helpers/raw/main/dataset_inspector.py",
16
+ "script_args": ["--dataset", "your/dataset", "--split", "train"]
17
+ })
18
+ """
19
+
20
+ import argparse
21
+ import sys
22
+ import json
23
+ import urllib.request
24
+ import urllib.parse
25
+ from typing import List, Dict, Any
26
+
27
+
28
+ def parse_args():
29
+ parser = argparse.ArgumentParser(description="Inspect dataset format for TRL training")
30
+ parser.add_argument("--dataset", type=str, required=True, help="Dataset name")
31
+ parser.add_argument("--split", type=str, default="train", help="Dataset split (default: train)")
32
+ parser.add_argument("--config", type=str, default="default", help="Dataset config name (default: default)")
33
+ parser.add_argument("--preview", type=int, default=150, help="Max chars per field preview")
34
+ parser.add_argument("--samples", type=int, default=5, help="Number of samples to fetch (default: 5)")
35
+ parser.add_argument("--json-output", action="store_true", help="Output as JSON")
36
+ return parser.parse_args()
37
+
38
+
39
+ def api_request(url: str) -> Dict:
40
+ """Make API request to Datasets Server"""
41
+ try:
42
+ with urllib.request.urlopen(url, timeout=10) as response:
43
+ return json.loads(response.read().decode())
44
+ except urllib.error.HTTPError as e:
45
+ if e.code == 404:
46
+ return None
47
+ raise Exception(f"API request failed: {e.code} {e.reason}")
48
+ except Exception as e:
49
+ raise Exception(f"API request failed: {str(e)}")
50
+
51
+
52
+ def get_splits(dataset: str) -> Dict:
53
+ """Get available splits for dataset"""
54
+ url = f"https://datasets-server.huggingface.co/splits?dataset={urllib.parse.quote(dataset)}"
55
+ return api_request(url)
56
+
57
+
58
+ def get_rows(dataset: str, config: str, split: str, offset: int = 0, length: int = 5) -> Dict:
59
+ """Get rows from dataset"""
60
+ url = f"https://datasets-server.huggingface.co/rows?dataset={urllib.parse.quote(dataset)}&config={config}&split={split}&offset={offset}&length={length}"
61
+ return api_request(url)
62
+
63
+
64
+ def find_columns(columns: List[str], patterns: List[str]) -> List[str]:
65
+ """Find columns matching patterns"""
66
+ return [c for c in columns if any(p in c.lower() for p in patterns)]
67
+
68
+
69
+ def check_sft_compatibility(columns: List[str]) -> Dict[str, Any]:
70
+ """Check SFT compatibility"""
71
+ has_messages = "messages" in columns
72
+ has_text = "text" in columns
73
+ has_prompt_completion = "prompt" in columns and "completion" in columns
74
+
75
+ ready = has_messages or has_text or has_prompt_completion
76
+
77
+ possible_prompt = find_columns(columns, ["prompt", "instruction", "question", "input"])
78
+ possible_response = find_columns(columns, ["response", "completion", "output", "answer"])
79
+
80
+ return {
81
+ "ready": ready,
82
+ "reason": "messages" if has_messages else "text" if has_text else "prompt+completion" if has_prompt_completion else None,
83
+ "possible_prompt": possible_prompt[0] if possible_prompt else None,
84
+ "possible_response": possible_response[0] if possible_response else None,
85
+ "has_context": "context" in columns,
86
+ }
87
+
88
+
89
+ def check_dpo_compatibility(columns: List[str]) -> Dict[str, Any]:
90
+ """Check DPO compatibility"""
91
+ has_standard = "prompt" in columns and "chosen" in columns and "rejected" in columns
92
+
93
+ possible_prompt = find_columns(columns, ["prompt", "instruction", "question", "input"])
94
+ possible_chosen = find_columns(columns, ["chosen", "preferred", "winner"])
95
+ possible_rejected = find_columns(columns, ["rejected", "dispreferred", "loser"])
96
+
97
+ can_map = bool(possible_prompt and possible_chosen and possible_rejected)
98
+
99
+ return {
100
+ "ready": has_standard,
101
+ "can_map": can_map,
102
+ "prompt_col": possible_prompt[0] if possible_prompt else None,
103
+ "chosen_col": possible_chosen[0] if possible_chosen else None,
104
+ "rejected_col": possible_rejected[0] if possible_rejected else None,
105
+ }
106
+
107
+
108
+ def check_grpo_compatibility(columns: List[str]) -> Dict[str, Any]:
109
+ """Check GRPO compatibility"""
110
+ has_prompt = "prompt" in columns
111
+ has_no_responses = "chosen" not in columns and "rejected" not in columns
112
+
113
+ possible_prompt = find_columns(columns, ["prompt", "instruction", "question", "input"])
114
+
115
+ return {
116
+ "ready": has_prompt and has_no_responses,
117
+ "can_map": bool(possible_prompt) and has_no_responses,
118
+ "prompt_col": possible_prompt[0] if possible_prompt else None,
119
+ }
120
+
121
+
122
+ def check_kto_compatibility(columns: List[str]) -> Dict[str, Any]:
123
+ """Check KTO compatibility"""
124
+ return {"ready": "prompt" in columns and "completion" in columns and "label" in columns}
125
+
126
+
127
+ def generate_mapping_code(method: str, info: Dict[str, Any]) -> str:
128
+ """Generate mapping code for a training method"""
129
+ if method == "SFT":
130
+ if info["ready"]:
131
+ return None
132
+
133
+ prompt_col = info.get("possible_prompt")
134
+ response_col = info.get("possible_response")
135
+ has_context = info.get("has_context", False)
136
+
137
+ if not prompt_col:
138
+ return None
139
+
140
+ if has_context and response_col:
141
+ return f"""def format_for_sft(example):
142
+ text = f"Instruction: {{example['{prompt_col}']}}\\n\\n"
143
+ if example.get('context'):
144
+ text += f"Context: {{example['context']}}\\n\\n"
145
+ text += f"Response: {{example['{response_col}']}}"
146
+ return {{'text': text}}
147
+
148
+ dataset = dataset.map(format_for_sft, remove_columns=dataset.column_names)"""
149
+ elif response_col:
150
+ return f"""def format_for_sft(example):
151
+ return {{'text': f"{{example['{prompt_col}']}}\\n\\n{{example['{response_col}']}}"}}
152
+
153
+ dataset = dataset.map(format_for_sft, remove_columns=dataset.column_names)"""
154
+ else:
155
+ return f"""def format_for_sft(example):
156
+ return {{'text': example['{prompt_col}']}}
157
+
158
+ dataset = dataset.map(format_for_sft, remove_columns=dataset.column_names)"""
159
+
160
+ elif method == "DPO":
161
+ if info["ready"] or not info["can_map"]:
162
+ return None
163
+
164
+ return f"""def format_for_dpo(example):
165
+ return {{
166
+ 'prompt': example['{info['prompt_col']}'],
167
+ 'chosen': example['{info['chosen_col']}'],
168
+ 'rejected': example['{info['rejected_col']}'],
169
+ }}
170
+
171
+ dataset = dataset.map(format_for_dpo, remove_columns=dataset.column_names)"""
172
+
173
+ elif method == "GRPO":
174
+ if info["ready"] or not info["can_map"]:
175
+ return None
176
+
177
+ return f"""def format_for_grpo(example):
178
+ return {{'prompt': example['{info['prompt_col']}']}}
179
+
180
+ dataset = dataset.map(format_for_grpo, remove_columns=dataset.column_names)"""
181
+
182
+ return None
183
+
184
+
185
+ def format_value_preview(value: Any, max_chars: int) -> str:
186
+ """Format value for preview"""
187
+ if value is None:
188
+ return "None"
189
+ elif isinstance(value, str):
190
+ return value[:max_chars] + ("..." if len(value) > max_chars else "")
191
+ elif isinstance(value, list):
192
+ if len(value) > 0 and isinstance(value[0], dict):
193
+ return f"[{len(value)} items] Keys: {list(value[0].keys())}"
194
+ preview = str(value)
195
+ return preview[:max_chars] + ("..." if len(preview) > max_chars else "")
196
+ else:
197
+ preview = str(value)
198
+ return preview[:max_chars] + ("..." if len(preview) > max_chars else "")
199
+
200
+
201
+ def main():
202
+ args = parse_args()
203
+
204
+ print(f"Fetching dataset info via Datasets Server API...")
205
+
206
+ try:
207
+ # Get splits info
208
+ splits_data = get_splits(args.dataset)
209
+ if not splits_data or "splits" not in splits_data:
210
+ print(f"ERROR: Could not fetch splits for dataset '{args.dataset}'")
211
+ print(f" Dataset may not exist or is not accessible via Datasets Server API")
212
+ sys.exit(1)
213
+
214
+ # Find the right config
215
+ available_configs = set()
216
+ split_found = False
217
+ config_to_use = args.config
218
+
219
+ for split_info in splits_data["splits"]:
220
+ available_configs.add(split_info["config"])
221
+ if split_info["config"] == args.config and split_info["split"] == args.split:
222
+ split_found = True
223
+
224
+ # If default config not found, try first available
225
+ if not split_found and available_configs:
226
+ config_to_use = list(available_configs)[0]
227
+ print(f"Config '{args.config}' not found, trying '{config_to_use}'...")
228
+
229
+ # Get rows
230
+ rows_data = get_rows(args.dataset, config_to_use, args.split, offset=0, length=args.samples)
231
+
232
+ if not rows_data or "rows" not in rows_data:
233
+ print(f"ERROR: Could not fetch rows for dataset '{args.dataset}'")
234
+ print(f" Split '{args.split}' may not exist")
235
+ print(f" Available configs: {', '.join(sorted(available_configs))}")
236
+ sys.exit(1)
237
+
238
+ rows = rows_data["rows"]
239
+ if not rows:
240
+ print(f"ERROR: No rows found in split '{args.split}'")
241
+ sys.exit(1)
242
+
243
+ # Extract column info from first row
244
+ first_row = rows[0]["row"]
245
+ columns = list(first_row.keys())
246
+ features = rows_data.get("features", [])
247
+
248
+ # Get total count if available
249
+ total_examples = "Unknown"
250
+ for split_info in splits_data["splits"]:
251
+ if split_info["config"] == config_to_use and split_info["split"] == args.split:
252
+ total_examples = f"{split_info.get('num_examples', 'Unknown'):,}" if isinstance(split_info.get('num_examples'), int) else "Unknown"
253
+ break
254
+
255
+ except Exception as e:
256
+ print(f"ERROR: {str(e)}")
257
+ sys.exit(1)
258
+
259
+ # Run compatibility checks
260
+ sft_info = check_sft_compatibility(columns)
261
+ dpo_info = check_dpo_compatibility(columns)
262
+ grpo_info = check_grpo_compatibility(columns)
263
+ kto_info = check_kto_compatibility(columns)
264
+
265
+ # Determine recommended methods
266
+ recommended = []
267
+ if sft_info["ready"]:
268
+ recommended.append("SFT")
269
+ elif sft_info["possible_prompt"]:
270
+ recommended.append("SFT (needs mapping)")
271
+
272
+ if dpo_info["ready"]:
273
+ recommended.append("DPO")
274
+ elif dpo_info["can_map"]:
275
+ recommended.append("DPO (needs mapping)")
276
+
277
+ if grpo_info["ready"]:
278
+ recommended.append("GRPO")
279
+ elif grpo_info["can_map"]:
280
+ recommended.append("GRPO (needs mapping)")
281
+
282
+ if kto_info["ready"]:
283
+ recommended.append("KTO")
284
+
285
+ # JSON output mode
286
+ if args.json_output:
287
+ result = {
288
+ "dataset": args.dataset,
289
+ "config": config_to_use,
290
+ "split": args.split,
291
+ "total_examples": total_examples,
292
+ "columns": columns,
293
+ "features": [{"name": f["name"], "type": f["type"]} for f in features] if features else [],
294
+ "compatibility": {
295
+ "SFT": sft_info,
296
+ "DPO": dpo_info,
297
+ "GRPO": grpo_info,
298
+ "KTO": kto_info,
299
+ },
300
+ "recommended_methods": recommended,
301
+ }
302
+ print(json.dumps(result, indent=2))
303
+ sys.exit(0)
304
+
305
+ # Human-readable output optimized for LLM parsing
306
+ print("=" * 80)
307
+ print(f"DATASET INSPECTION RESULTS")
308
+ print("=" * 80)
309
+
310
+ print(f"\nDataset: {args.dataset}")
311
+ print(f"Config: {config_to_use}")
312
+ print(f"Split: {args.split}")
313
+ print(f"Total examples: {total_examples}")
314
+ print(f"Samples fetched: {len(rows)}")
315
+
316
+ print(f"\n{'COLUMNS':-<80}")
317
+ if features:
318
+ for feature in features:
319
+ print(f" {feature['name']}: {feature['type']}")
320
+ else:
321
+ for col in columns:
322
+ print(f" {col}: (type info not available)")
323
+
324
+ print(f"\n{'EXAMPLE DATA':-<80}")
325
+ example = first_row
326
+ for col in columns:
327
+ value = example.get(col)
328
+ display = format_value_preview(value, args.preview)
329
+ print(f"\n{col}:")
330
+ print(f" {display}")
331
+
332
+ print(f"\n{'TRAINING METHOD COMPATIBILITY':-<80}")
333
+
334
+ # SFT
335
+ print(f"\n[SFT] {'βœ“ READY' if sft_info['ready'] else 'βœ— NEEDS MAPPING'}")
336
+ if sft_info["ready"]:
337
+ print(f" Reason: Dataset has '{sft_info['reason']}' field")
338
+ print(f" Action: Use directly with SFTTrainer")
339
+ elif sft_info["possible_prompt"]:
340
+ print(f" Detected: prompt='{sft_info['possible_prompt']}' response='{sft_info['possible_response']}'")
341
+ print(f" Action: Apply mapping code (see below)")
342
+ else:
343
+ print(f" Status: Cannot determine mapping - manual inspection needed")
344
+
345
+ # DPO
346
+ print(f"\n[DPO] {'βœ“ READY' if dpo_info['ready'] else 'βœ— NEEDS MAPPING' if dpo_info['can_map'] else 'βœ— INCOMPATIBLE'}")
347
+ if dpo_info["ready"]:
348
+ print(f" Reason: Dataset has 'prompt', 'chosen', 'rejected' fields")
349
+ print(f" Action: Use directly with DPOTrainer")
350
+ elif dpo_info["can_map"]:
351
+ print(f" Detected: prompt='{dpo_info['prompt_col']}' chosen='{dpo_info['chosen_col']}' rejected='{dpo_info['rejected_col']}'")
352
+ print(f" Action: Apply mapping code (see below)")
353
+ else:
354
+ print(f" Status: Missing required fields (prompt + chosen + rejected)")
355
+
356
+ # GRPO
357
+ print(f"\n[GRPO] {'βœ“ READY' if grpo_info['ready'] else 'βœ— NEEDS MAPPING' if grpo_info['can_map'] else 'βœ— INCOMPATIBLE'}")
358
+ if grpo_info["ready"]:
359
+ print(f" Reason: Dataset has 'prompt' field")
360
+ print(f" Action: Use directly with GRPOTrainer")
361
+ elif grpo_info["can_map"]:
362
+ print(f" Detected: prompt='{grpo_info['prompt_col']}'")
363
+ print(f" Action: Apply mapping code (see below)")
364
+ else:
365
+ print(f" Status: Missing prompt field")
366
+
367
+ # KTO
368
+ print(f"\n[KTO] {'βœ“ READY' if kto_info['ready'] else 'βœ— INCOMPATIBLE'}")
369
+ if kto_info["ready"]:
370
+ print(f" Reason: Dataset has 'prompt', 'completion', 'label' fields")
371
+ print(f" Action: Use directly with KTOTrainer")
372
+ else:
373
+ print(f" Status: Missing required fields (prompt + completion + label)")
374
+
375
+ # Mapping code
376
+ print(f"\n{'MAPPING CODE (if needed)':-<80}")
377
+
378
+ mapping_needed = False
379
+
380
+ sft_mapping = generate_mapping_code("SFT", sft_info)
381
+ if sft_mapping:
382
+ print(f"\n# For SFT Training:")
383
+ print(sft_mapping)
384
+ mapping_needed = True
385
+
386
+ dpo_mapping = generate_mapping_code("DPO", dpo_info)
387
+ if dpo_mapping:
388
+ print(f"\n# For DPO Training:")
389
+ print(dpo_mapping)
390
+ mapping_needed = True
391
+
392
+ grpo_mapping = generate_mapping_code("GRPO", grpo_info)
393
+ if grpo_mapping:
394
+ print(f"\n# For GRPO Training:")
395
+ print(grpo_mapping)
396
+ mapping_needed = True
397
+
398
+ if not mapping_needed:
399
+ print("\nNo mapping needed - dataset is ready for training!")
400
+
401
+ print(f"\n{'SUMMARY':-<80}")
402
+ print(f"Recommended training methods: {', '.join(recommended) if recommended else 'None (dataset needs formatting)'}")
403
+ print(f"\nNote: Used Datasets Server API (instant, no download required)")
404
+
405
+ print("\n" + "=" * 80)
406
+ sys.exit(0)
407
+
408
+
409
+ if __name__ == "__main__":
410
+ try:
411
+ main()
412
+ except KeyboardInterrupt:
413
+ sys.exit(0)
414
+ except Exception as e:
415
+ print(f"ERROR: {e}", file=sys.stderr)
416
+ sys.exit(1)
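
The SFT mapping snippets the inspector emits can be sanity-checked on a single example before running a full `dataset.map`. This standalone sketch uses hypothetical `question`/`answer` columns in place of whatever column names the inspector actually detects:

```python
# Standalone sketch of the generated SFT mapping. The column names
# "question" and "answer" are illustrative, not detected from a real dataset.
def format_for_sft(example):
    return {"text": f"{example['question']}\n\n{example['answer']}"}

sample = {"question": "What is 2 + 2?", "answer": "4"}
print(format_for_sft(sample))
```

Running the function on one dict like this mirrors what `dataset.map(format_for_sft, remove_columns=...)` does per row.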
scripts/estimate_cost.py ADDED
@@ -0,0 +1,149 @@
+ #!/usr/bin/env python3
+ # /// script
+ # dependencies = []
+ # ///
+ """
+ Estimate training time and cost for TRL jobs.
+ 
+ Usage:
+     python estimate_cost.py --model <model> --dataset <dataset> --hardware <flavor>
+ 
+ Example:
+     python estimate_cost.py --model Qwen/Qwen2.5-0.5B --dataset trl-lib/Capybara --hardware a10g-large
+ """
+ 
+ import argparse
+ 
+ # Hardware costs per hour (approximate)
+ HARDWARE_COSTS = {
+     "t4-small": 0.75,
+     "t4-medium": 1.50,
+     "l4x1": 2.50,
+     "a10g-small": 3.50,
+     "a10g-large": 5.00,
+     "a10g-largex2": 10.00,
+     "a10g-largex4": 20.00,
+     "a100-large": 10.00,
+ }
+ 
+ # Model sizes in billions of parameters
+ MODEL_SIZES = {
+     "0.5B": 0.5,
+     "1.5B": 1.5,
+     "3B": 3,
+     "7B": 7,
+     "13B": 13,
+ }
+ 
+ 
+ def estimate_training_time(model_params, dataset_size, epochs, hardware):
+     """Estimate training time in hours."""
+     # Rough estimates based on empirical observations.
+     # These are approximations; actual times will vary.
+     base_time_per_1k_examples = 0.1  # hours for a 1B model on a10g-large
+ 
+     # Scale with model size, dataset size, and epochs
+     time = base_time_per_1k_examples * model_params * (dataset_size / 1000) * epochs
+ 
+     # Adjust for hardware (relative to the a10g-large baseline)
+     hardware_multipliers = {
+         "t4-small": 2.0,
+         "t4-medium": 1.5,
+         "l4x1": 1.2,
+         "a10g-small": 1.3,
+         "a10g-large": 1.0,
+         "a10g-largex2": 0.6,
+         "a10g-largex4": 0.4,
+         "a100-large": 0.7,
+     }
+ 
+     multiplier = hardware_multipliers.get(hardware, 1.0)
+     time *= multiplier
+ 
+     return time
+ 
+ 
+ def parse_args():
+     parser = argparse.ArgumentParser(description="Estimate training cost for TRL jobs")
+     parser.add_argument("--model", required=True, help="Model name or size (e.g., 'Qwen/Qwen2.5-0.5B' or '0.5B')")
+     parser.add_argument("--dataset", required=True, help="Dataset name")
+     parser.add_argument("--hardware", required=True, choices=HARDWARE_COSTS.keys(), help="Hardware flavor")
+     parser.add_argument("--dataset-size", type=int, help="Override dataset size (number of examples)")
+     parser.add_argument("--epochs", type=int, default=3, help="Number of training epochs")
+     return parser.parse_args()
+ 
+ 
+ def extract_model_size(model_name):
+     """Extract model size in billions of parameters from the model name."""
+     for size_str, size_val in MODEL_SIZES.items():
+         if size_str in model_name:
+             return size_val
+ 
+     # Try to parse directly (e.g., "7B" -> 7.0)
+     try:
+         if "B" in model_name:
+             return float(model_name.replace("B", ""))
+     except ValueError:
+         pass
+ 
+     return 1.0  # Default to 1B if the size can't be determined
+ 
+ 
+ def main():
+     args = parse_args()
+ 
+     # Extract model parameters
+     model_params = extract_model_size(args.model)
+     print(f"πŸ“Š Model: {args.model} (~{model_params}B parameters)")
+ 
+     # Estimate dataset size (loading the dataset would give the real size)
+     if args.dataset_size:
+         dataset_size = args.dataset_size
+     else:
+         # Common dataset sizes (approximations)
+         dataset_sizes = {
+             "trl-lib/Capybara": 16000,
+             "Anthropic/hh-rlhf": 160000,
+         }
+         dataset_size = dataset_sizes.get(args.dataset, 10000)
+ 
+     print(f"πŸ“¦ Dataset: {args.dataset} (~{dataset_size} examples)")
+     print(f"πŸ”„ Epochs: {args.epochs}")
+     print(f"πŸ’» Hardware: {args.hardware}")
+     print()
+ 
+     # Estimate training time and cost
+     estimated_hours = estimate_training_time(model_params, dataset_size, args.epochs, args.hardware)
+     estimated_cost = estimated_hours * HARDWARE_COSTS[args.hardware]
+ 
+     # Recommend a timeout with a buffer
+     recommended_timeout_hours = estimated_hours * 1.3  # 30% buffer
+ 
+     print(f"⏱️  Estimated training time: {estimated_hours:.1f} hours")
+     print(f"πŸ’° Estimated cost: ${estimated_cost:.2f}")
+     print(f"⏰ Recommended timeout: {recommended_timeout_hours:.1f}h (with 30% buffer)")
+     print()
+ 
+     # Warnings and recommendations
+     if estimated_hours > 4:
+         print("⚠️  Long training time - consider:")
+         print("   - Using faster hardware")
+         print("   - Reducing epochs")
+         print("   - Using a smaller dataset subset for testing")
+ 
+     if model_params >= 7 and args.hardware not in ["a10g-largex2", "a10g-largex4", "a100-large"]:
+         print("⚠️  Large model - consider using:")
+         print("   - Larger GPU (a100-large)")
+         print("   - Multi-GPU setup (a10g-largex2 or a10g-largex4)")
+         print("   - LoRA/PEFT for memory efficiency")
+ 
+     print()
+     print("πŸ“‹ Example job configuration:")
+     print(f"""
+ hf_jobs("uv", {{
+     "script": "your_training_script.py",
+     "flavor": "{args.hardware}",
+     "timeout": "{recommended_timeout_hours:.0f}h",
+     "secrets": {{"HF_TOKEN": "$HF_TOKEN"}}
+ }})
+ """)
+ 
+ 
+ if __name__ == "__main__":
+     main()
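
To make the estimator's arithmetic concrete, here is the same formula evaluated by hand for the docstring's example (a ~0.5B model on ~16,000 Capybara examples, 3 epochs, a10g-large). The constants are the script's own rough heuristics, not measured figures:

```python
# Reproduce the estimate for: 0.5B model, 16,000 examples, 3 epochs,
# a10g-large ($5.00/h, hardware multiplier 1.0).
base_time_per_1k = 0.1  # hours per 1k examples for a 1B model on a10g-large
hours = base_time_per_1k * 0.5 * (16000 / 1000) * 3 * 1.0
cost = hours * 5.00
timeout = hours * 1.3  # 30% buffer

print(f"{hours:.1f} h, ${cost:.2f}, timeout {timeout:.1f} h")  # 2.4 h, $12.00, timeout 3.1 h
```

So this configuration would trigger no "long training time" warning (under 4 hours) and the script would suggest a ~3h timeout.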
scripts/train_dpo_example.py ADDED
@@ -0,0 +1,105 @@
+ #!/usr/bin/env python3
+ # /// script
+ # dependencies = [
+ #     "trl>=0.12.0",
+ #     "transformers>=4.36.0",
+ #     "accelerate>=0.24.0",
+ #     "trackio",
+ # ]
+ # ///
+ 
+ """
+ Production-ready DPO training example for preference learning.
+ 
+ DPO (Direct Preference Optimization) trains models on preference pairs
+ (chosen vs rejected responses) without requiring a reward model.
+ 
+ Usage with hf_jobs MCP tool:
+     hf_jobs("uv", {
+         "script": '''<paste this entire file>''',
+         "flavor": "a10g-large",
+         "timeout": "3h",
+         "secrets": {"HF_TOKEN": "$HF_TOKEN"},
+     })
+ 
+ Or submit the script content directly inline without saving to a file.
+ """
+ 
+ import trackio
+ from datasets import load_dataset
+ from trl import DPOTrainer, DPOConfig
+ 
+ 
+ # Load preference dataset
+ print("πŸ“¦ Loading dataset...")
+ dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
+ print(f"βœ… Dataset loaded: {len(dataset)} preference pairs")
+ 
+ # Create train/eval split
+ print("πŸ”€ Creating train/eval split...")
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
+ train_dataset = dataset_split["train"]
+ eval_dataset = dataset_split["test"]
+ print(f"  Train: {len(train_dataset)} pairs")
+ print(f"  Eval: {len(eval_dataset)} pairs")
+ 
+ # Training configuration
+ config = DPOConfig(
+     # CRITICAL: Hub settings
+     output_dir="qwen-dpo-aligned",
+     push_to_hub=True,
+     hub_model_id="username/qwen-dpo-aligned",
+     hub_strategy="every_save",
+ 
+     # DPO-specific parameters
+     beta=0.1,  # KL penalty coefficient (higher = stay closer to reference)
+ 
+     # Training parameters
+     num_train_epochs=1,  # DPO typically needs fewer epochs than SFT
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     learning_rate=5e-7,  # DPO uses a much lower LR than SFT
+     # max_length=1024,  # Default - only set if you need a different sequence length
+ 
+     # Logging & checkpointing
+     logging_steps=10,
+     save_strategy="steps",
+     save_steps=100,
+     save_total_limit=2,
+ 
+     # Evaluation - IMPORTANT: only enable if eval_dataset is provided
+     eval_strategy="steps",
+     eval_steps=100,
+ 
+     # Optimization
+     warmup_ratio=0.1,
+     lr_scheduler_type="cosine",
+ 
+     # Monitoring
+     report_to="trackio",  # Integrate with Trackio
+     project="meaningful_project_name",  # Trackio project name
+     run_name="baseline-run",  # Descriptive name for this training run
+ )
+ 
+ # Initialize and train
+ # Note: DPO requires an instruct-tuned model as the base
+ print("🎯 Initializing trainer...")
+ trainer = DPOTrainer(
+     model="Qwen/Qwen2.5-0.5B-Instruct",  # Use an instruct model, not a base model
+     train_dataset=train_dataset,
+     eval_dataset=eval_dataset,  # CRITICAL: must provide eval_dataset when eval_strategy is enabled
+     args=config,
+ )
+ 
+ print("πŸš€ Starting DPO training...")
+ trainer.train()
+ 
+ print("πŸ’Ύ Pushing to Hub...")
+ trainer.push_to_hub()
+ 
+ # Finish Trackio tracking
+ trackio.finish()
+ 
+ print("βœ… Complete! Model at: https://huggingface.co/username/qwen-dpo-aligned")
+ print("πŸ“Š View metrics at: https://huggingface.co/spaces/username/trackio")
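
One number worth checking when adapting the DPO config above is the effective batch size, since the low DPO learning rate is usually tuned against it. On a single-GPU flavor it works out as:

```python
# Effective batch size for the config above on a single-GPU flavor.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # a10g-large is a single-GPU flavor

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 16
```

Doubling the GPUs (e.g. a10g-largex2) doubles this figure unless you halve one of the other factors.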
scripts/train_grpo_example.py ADDED
@@ -0,0 +1,88 @@
+ #!/usr/bin/env python3
+ # /// script
+ # dependencies = [
+ #     "trl>=0.12.0",
+ #     "transformers>=4.36.0",
+ #     "accelerate>=0.24.0",
+ #     "trackio",
+ # ]
+ # ///
+ 
+ """
+ Production-ready GRPO training example for online RL.
+ 
+ GRPO (Group Relative Policy Optimization) is an online RL method that
+ optimizes relative to group performance. Best for tasks with automatic
+ reward signals like code execution or math verification.
+ 
+ Usage with hf_jobs MCP tool:
+     hf_jobs("uv", {
+         "script": '''<paste this entire file>''',
+         "flavor": "a10g-large",
+         "timeout": "4h",
+         "secrets": {"HF_TOKEN": "$HF_TOKEN"},
+     })
+ 
+ Or submit the script content directly inline without saving to a file.
+ 
+ Note: For most GRPO use cases, the TRL maintained script is recommended:
+ https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py
+ """
+ 
+ import trackio
+ from datasets import load_dataset
+ from trl import GRPOTrainer, GRPOConfig
+ 
+ 
+ # Load dataset (GRPO uses prompt-only format)
+ dataset = load_dataset("trl-lib/math_shepherd", split="train")
+ print(f"βœ… Dataset loaded: {len(dataset)} prompts")
+ 
+ 
+ # Reward function - GRPOTrainer requires at least one reward function.
+ # This placeholder rewards completions close to a target length; replace it
+ # with a task-specific reward (e.g. math answer verification) for real runs.
+ def reward_len(completions, **kwargs):
+     return [-abs(len(completion) - 200) for completion in completions]
+ 
+ 
+ # Training configuration
+ config = GRPOConfig(
+     # CRITICAL: Hub settings
+     output_dir="qwen-grpo-math",
+     push_to_hub=True,
+     hub_model_id="username/qwen-grpo-math",
+     hub_strategy="every_save",
+ 
+     # Training parameters
+     num_train_epochs=1,
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     learning_rate=1e-6,
+ 
+     # Logging & checkpointing
+     logging_steps=10,
+     save_strategy="steps",
+     save_steps=100,
+     save_total_limit=2,
+ 
+     # Optimization
+     warmup_ratio=0.1,
+     lr_scheduler_type="cosine",
+ 
+     # Monitoring
+     report_to="trackio",  # Integrate with Trackio
+     project="meaningful_project_name",  # Trackio project name
+     run_name="baseline-run",  # Descriptive name for this training run
+ )
+ 
+ # Initialize and train
+ # Note: GRPO requires an instruct-tuned model as the base
+ trainer = GRPOTrainer(
+     model="Qwen/Qwen2.5-0.5B-Instruct",
+     reward_funcs=reward_len,
+     train_dataset=dataset,
+     args=config,
+ )
+ 
+ print("πŸš€ Starting GRPO training...")
+ trainer.train()
+ 
+ print("πŸ’Ύ Pushing to Hub...")
+ trainer.push_to_hub()
+ 
+ # Finish Trackio tracking
+ trackio.finish()
+ 
+ print("βœ… Complete! Model at: https://huggingface.co/username/qwen-grpo-math")
+ print("πŸ“Š View metrics at: https://huggingface.co/spaces/username/trackio")
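
Since GRPO shines on automatically verifiable tasks, a minimal sketch of a verifiable reward function may be useful. It assumes a hypothetical `answer` column is passed through from the dataset as a keyword argument; the name and exact-match criterion are illustrative, not part of the script above:

```python
# Hypothetical exact-match reward: 1.0 if the completion contains the
# reference answer string, else 0.0. The "answer" column is an assumption.
def exact_match_reward(completions, answer, **kwargs):
    return [1.0 if ref in comp else 0.0 for comp, ref in zip(completions, answer)]

print(exact_match_reward(["The result is 42.", "No idea."], ["42", "42"]))  # [1.0, 0.0]
```

Binary rewards like this are common for math-style tasks, since the group-relative advantage in GRPO only needs rewards to rank completions within a group.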
scripts/train_sft_example.py ADDED
@@ -0,0 +1,119 @@
+ #!/usr/bin/env python3
+ # /// script
+ # dependencies = [
+ #     "trl>=0.12.0",
+ #     "peft>=0.7.0",
+ #     "transformers>=4.36.0",
+ #     "accelerate>=0.24.0",
+ #     "trackio",  # For real-time monitoring
+ # ]
+ # ///
+ 
+ """
+ Production-ready SFT training example with all best practices.
+ 
+ This script demonstrates:
+ - Trackio integration for real-time monitoring
+ - LoRA/PEFT for efficient training
+ - Proper Hub saving configuration
+ - Train/eval split for monitoring
+ - Checkpoint management
+ - Optimized training parameters
+ 
+ Usage with hf_jobs MCP tool:
+     hf_jobs("uv", {
+         "script": '''<paste this entire file>''',
+         "flavor": "a10g-large",
+         "timeout": "3h",
+         "secrets": {"HF_TOKEN": "$HF_TOKEN"},
+     })
+ 
+ Or submit the script content directly inline without saving to a file.
+ """
+ 
+ import trackio
+ from datasets import load_dataset
+ from peft import LoraConfig
+ from trl import SFTTrainer, SFTConfig
+ 
+ 
+ # Load dataset
+ print("πŸ“¦ Loading dataset...")
+ dataset = load_dataset("trl-lib/Capybara", split="train")
+ print(f"βœ… Dataset loaded: {len(dataset)} examples")
+ 
+ # Create train/eval split
+ print("πŸ”€ Creating train/eval split...")
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
+ train_dataset = dataset_split["train"]
+ eval_dataset = dataset_split["test"]
+ print(f"  Train: {len(train_dataset)} examples")
+ print(f"  Eval: {len(eval_dataset)} examples")
+ 
+ # Note: For memory-constrained demos, skip eval by using the full dataset as
+ # train_dataset and removing eval_dataset, eval_strategy, and eval_steps below
+ 
+ # Training configuration
+ config = SFTConfig(
+     # CRITICAL: Hub settings
+     output_dir="qwen-capybara-sft",
+     push_to_hub=True,
+     hub_model_id="username/qwen-capybara-sft",
+     hub_strategy="every_save",  # Push checkpoints
+ 
+     # Training parameters
+     num_train_epochs=3,
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     learning_rate=2e-5,
+     # max_length=1024,  # Default - only set if you need a different sequence length
+ 
+     # Logging & checkpointing
+     logging_steps=10,
+     save_strategy="steps",
+     save_steps=100,
+     save_total_limit=2,
+ 
+     # Evaluation - IMPORTANT: only enable if eval_dataset is provided
+     eval_strategy="steps",
+     eval_steps=100,
+ 
+     # Optimization
+     warmup_ratio=0.1,
+     lr_scheduler_type="cosine",
+ 
+     # Monitoring
+     report_to="trackio",  # Integrate with Trackio
+     project="meaningful_project_name",  # Trackio project name
+     run_name="baseline-run",  # Descriptive name for this training run
+ )
+ 
+ # LoRA configuration
+ peft_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.05,
+     bias="none",
+     task_type="CAUSAL_LM",
+     target_modules=["q_proj", "v_proj"],
+ )
+ 
+ # Initialize and train
+ print("🎯 Initializing trainer...")
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=train_dataset,
+     eval_dataset=eval_dataset,  # CRITICAL: must provide eval_dataset when eval_strategy is enabled
+     args=config,
+     peft_config=peft_config,
+ )
+ 
+ print("πŸš€ Starting training...")
+ trainer.train()
+ 
+ print("πŸ’Ύ Pushing to Hub...")
+ trainer.push_to_hub()
+ 
+ # Finish Trackio tracking
+ trackio.finish()
+ 
+ print("βœ… Complete! Model at: https://huggingface.co/username/qwen-capybara-sft")
+ print("πŸ“Š View metrics at: https://huggingface.co/spaces/username/trackio")
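
For intuition on why the LoRA settings above keep training cheap: each adapted linear layer of shape (d_out, d_in) adds r * (d_in + d_out) trainable weights. A back-of-envelope count, assuming approximate Qwen2.5-0.5B shapes (hidden size 896, 24 layers, v_proj output 128 under grouped-query attention; these dimensions are assumptions for illustration, not read from the script):

```python
# Rough LoRA trainable-parameter count for r=16 on q_proj and v_proj,
# assuming hypothetical Qwen2.5-0.5B-like shapes.
r = 16
hidden, layers, v_out = 896, 24, 128

per_layer = r * (hidden + hidden) + r * (hidden + v_out)  # q_proj + v_proj adapters
total = per_layer * layers
print(f"{total:,} trainable LoRA params (~{100 * total / 500e6:.2f}% of 0.5B)")
```

On the order of a million trainable parameters versus the full half-billion, which is why the a10g-large flavor is comfortable for this job.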