File size: 10,867 Bytes
6ab17a7
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
# Reliability Principles for Training Jobs

These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution.

## Principle 1: Always Verify Before Use

**Rule:** Never assume repos, datasets, or resources exist. Verify with tools first.

### What It Prevents

- **Non-existent datasets** - Jobs fail immediately when dataset doesn't exist
- **Typos in names** - Simple mistakes like "argilla-dpo-mix-7k" vs "ultrafeedback_binarized"
- **Incorrect paths** - Old or moved repos, renamed files
- **Missing dependencies** - Undocumented requirements

### How to Apply

**Before submitting ANY job:**

```python
# Verify dataset exists
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
hub_repo_details(["author/dataset-name"], repo_type="dataset")

# Verify model exists
hub_repo_details(["org/model-name"], repo_type="model")

# Check script/file paths (for URL-based scripts)
# Verify before using: https://github.com/user/repo/blob/main/script.py
```

**Examples that would have caught errors:**

```python
# ❌ WRONG: Assumed dataset exists
hf_jobs("uv", {
    "script": """...""",
    "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"}  # Doesn't exist!
})

# βœ… CORRECT: Verify first
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name

hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
# Confirms it exists before using
```

### Implementation Checklist

- [ ] Check dataset exists before training
- [ ] Verify base model exists before fine-tuning
- [ ] Confirm adapter model exists before GGUF conversion
- [ ] Test script URLs are valid before submitting
- [ ] Validate file paths in repositories
- [ ] Check for recent updates/renames of resources

**Time cost:** 5-10 seconds  
**Time saved:** Hours of failed job time + debugging

---

## Principle 2: Prioritize Reliability Over Performance

**Rule:** Default to what is most likely to succeed, not what is theoretically fastest.

### What It Prevents

- **Hardware incompatibilities** - Features that fail on certain GPUs
- **Unstable optimizations** - Speed-ups that cause crashes
- **Complex configurations** - More failure points
- **Build system issues** - Unreliable compilation methods

### How to Apply

**Choose reliability:**

```python
# ❌ RISKY: Aggressive optimization that may fail
SFTConfig(
    torch_compile=True,  # Can fail on T4, A10G GPUs
    optim="adamw_bnb_8bit",  # Requires specific setup
    fp16=False,  # May cause training instability
    ...
)

# βœ… SAFE: Proven defaults
SFTConfig(
    # torch_compile=True,  # Commented with note: "Enable on H100 for 20% speedup"
    optim="adamw_torch",  # Standard, always works
    fp16=True,  # Stable and fast
    ...
)
```

**For build processes:**

```python
# ❌ UNRELIABLE: Uses make (platform-dependent)
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"], check=True)

# βœ… RELIABLE: Uses CMake (consistent, documented)
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Disable CUDA for faster, more reliable build
], check=True)

subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True)
```

### Real-World Example

**The `torch.compile` failure:**
- Added for "20% speedup" on H100
- **Failed fatally on T4-medium** with cryptic error
- Misdiagnosed as dataset issue (cost hours)
- **Fix:** Disable by default, add as optional comment

**Result:** Reliability > 20% performance gain

### Implementation Checklist

- [ ] Use proven, standard configurations by default
- [ ] Comment out performance optimizations with hardware notes
- [ ] Use stable build systems (CMake > make)
- [ ] Test on target hardware before production
- [ ] Document known incompatibilities
- [ ] Provide "safe" and "fast" variants when needed

**Performance loss:** 10-20% in best case  
**Reliability gain:** 95%+ success rate vs 60-70%

---

## Principle 3: Create Atomic, Self-Contained Scripts

**Rule:** Scripts should work as complete, independent units. Don't remove parts to "simplify."

### What It Prevents

- **Missing dependencies** - Removed "unnecessary" packages that are actually required
- **Incomplete processes** - Skipped steps that seem redundant
- **Environment assumptions** - Scripts that need pre-setup
- **Partial failures** - Some parts work, others fail silently

### How to Apply

**Complete dependency specifications:**

```python
# ❌ INCOMPLETE: "Simplified" by removing dependencies
# /// script
# dependencies = [
#     "transformers",
#     "peft",
#     "torch",
# ]
# ///

# βœ… COMPLETE: All dependencies explicit
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizers
#     "protobuf>=3.20.0",        # Required for tokenizers
#     "numpy",
#     "gguf",
# ]
# ///
```

**Complete build processes:**

```python
# ❌ INCOMPLETE: Assumes build tools exist
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"])  # FAILS: no gcc/make

# βœ… COMPLETE: Installs all requirements
subprocess.run(["apt-get", "update", "-qq"], check=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True)
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
# ... then build
```

### Real-World Example

**The `sentencepiece` failure:**
- Original script had it: worked fine
- "Simplified" version removed it: "doesn't look necessary"
- **GGUF conversion failed silently** - tokenizer couldn't convert
- Hard to debug: no obvious error message
- **Fix:** Restore all original dependencies

**Result:** Don't remove dependencies without thorough testing

### Implementation Checklist

- [ ] All dependencies in PEP 723 header with version pins
- [ ] All system packages installed by script
- [ ] No assumptions about pre-existing environment
- [ ] No "optional" steps that are actually required
- [ ] Test scripts in clean environment
- [ ] Document why each dependency is needed

**Complexity:** Slightly longer scripts  
**Reliability:** Scripts "just work" every time

---

## Principle 4: Provide Clear Error Context

**Rule:** When things fail, make it obvious what went wrong and how to fix it.

### How to Apply

**Wrap subprocess calls:**

```python
# ❌ UNCLEAR: Silent failure
subprocess.run([...], check=True, capture_output=True)

# βœ… CLEAR: Shows what failed
try:
    result = subprocess.run(
        [...],
        check=True,
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.stderr:
        print("Warnings:", result.stderr)
except subprocess.CalledProcessError as e:
    print(f"❌ Command failed!")
    print("STDOUT:", e.stdout)
    print("STDERR:", e.stderr)
    raise
```

**Validate inputs:**

```python
# ❌ UNCLEAR: Fails later with cryptic error
model = load_model(MODEL_NAME)

# βœ… CLEAR: Fails fast with clear message
if not MODEL_NAME:
    raise ValueError("MODEL_NAME environment variable not set!")

print(f"Loading model: {MODEL_NAME}")
try:
    model = load_model(MODEL_NAME)
    print(f"βœ… Model loaded successfully")
except Exception as e:
    print(f"❌ Failed to load model: {MODEL_NAME}")
    print(f"Error: {e}")
    print("Hint: Check that model exists on Hub")
    raise
```

### Implementation Checklist

- [ ] Wrap external calls with try/except
- [ ] Print stdout/stderr on failure
- [ ] Validate environment variables early
- [ ] Add progress indicators (βœ…, ❌, πŸ”„)
- [ ] Include hints for common failures
- [ ] Log configuration at start

---

## Principle 5: Test the Happy Path on Known-Good Inputs

**Rule:** Before using new code in production, test with inputs you know work.

### How to Apply

**Known-good test inputs:**

```python
# For training
TEST_DATASET = "trl-lib/Capybara"  # Small, well-formatted, widely used
TEST_MODEL = "Qwen/Qwen2.5-0.5B"  # Small, fast, reliable

# For GGUF conversion
TEST_ADAPTER = "evalstate/qwen-capybara-medium"  # Known working model
TEST_BASE = "Qwen/Qwen2.5-0.5B"  # Compatible base
```

**Testing workflow:**

1. Test with known-good inputs first
2. If that works, try production inputs
3. If production fails, you know it's the inputs (not code)
4. Isolate the difference

### Implementation Checklist

- [ ] Maintain list of known-good test models/datasets
- [ ] Test new scripts with test inputs first
- [ ] Document what makes inputs "good"
- [ ] Keep test jobs cheap (small models, short timeouts)
- [ ] Only move to production after test succeeds

**Time cost:** 5-10 minutes for test run  
**Debugging time saved:** Hours

---

## Summary: The Reliability Checklist

Before submitting ANY job:

### Pre-Flight Checks
- [ ] **Verified** all repos/datasets exist (hub_repo_details)
- [ ] **Tested** with known-good inputs if new code
- [ ] **Using** proven hardware/configuration
- [ ] **Included** all dependencies in PEP 723 header
- [ ] **Installed** system requirements (build tools, etc.)
- [ ] **Set** appropriate timeout (not default 30m)
- [ ] **Configured** Hub push with HF_TOKEN
- [ ] **Added** clear error handling

### Script Quality
- [ ] Self-contained (no external setup needed)
- [ ] Complete dependencies listed
- [ ] Build tools installed by script
- [ ] Progress indicators included
- [ ] Error messages are clear
- [ ] Configuration logged at start

### Job Configuration
- [ ] Timeout > expected runtime + 30% buffer
- [ ] Hardware appropriate for model size
- [ ] Secrets include HF_TOKEN
- [ ] Environment variables set correctly
- [ ] Cost estimated and acceptable

**Following these principles transforms job success rate from ~60-70% to ~95%+**

---

## When Principles Conflict

Sometimes reliability and performance conflict. Here's how to choose:

| Scenario | Choose | Rationale |
|----------|--------|-----------|
| Demo/test | Reliability | Fast failure is worse than slow success |
| Production (first run) | Reliability | Prove it works before optimizing |
| Production (proven) | Performance | Safe to optimize after validation |
| Time-critical | Reliability | Failures cause more delay than slow runs |
| Cost-critical | Balanced | Test with small model, then optimize |

**General rule:** Reliability first, optimize second.

---

## Further Reading

- `troubleshooting.md` - Common issues and fixes
- `training_patterns.md` - Proven training configurations
- `gguf_conversion.md` - Production GGUF workflow