# NB2 Kaggle kernel-death fix

Version 5/6 died before the first epoch print. The data/label fixes are correct (`soundscape positive labels: 3122`), so the remaining issue is memory pressure during the first training epoch.

Use these safer NB2 settings before running:

```python
class CFG:
    epochs = 2
    model_name = "b0"
    folds_to_run = [0]          # train ONE fold per Kaggle run first
    batch_size = 4              # micro-batch
    grad_accum_steps = 3        # effective batch 12
    num_workers = 0
    use_data_parallel = False   # DataParallel caused kernel death on T4x2
    max_train_audio_samples = None
    max_sc_train_samples = None
```

Then repeat runs:

```python
# B0: one fold per Kaggle run. Change this value between runs.
folds_to_run = [0]          # then [1], [2], [3], [4], each in its own run

# B3, even safer
model_name = "b3"
folds_to_run = [0]
batch_size = 2
grad_accum_steps = 6        # effective batch 12
```
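
The B0 and B3 settings trade micro-batch size against accumulation steps while keeping the effective batch unchanged, since the effective batch is `batch_size * grad_accum_steps`. A quick sanity check, where `effective_batch` is a hypothetical helper and not part of NB2:

```python
def effective_batch(batch_size: int, grad_accum_steps: int) -> int:
    """Samples contributing to each optimizer step under gradient accumulation."""
    return batch_size * grad_accum_steps


b0 = effective_batch(batch_size=4, grad_accum_steps=3)  # B0 settings from this note
b3 = effective_batch(batch_size=2, grad_accum_steps=6)  # B3 settings from this note
print(b0, b3)
```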

Also patch the optimizer loop: divide the loss by `grad_accum_steps`, step the optimizer only every `grad_accum_steps` batches, and print progress every 100 batches.
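
A minimal sketch of such a loop, assuming a PyTorch-style `model`, `optimizer`, and `loss_fn`, and a `loader` that yields `(inputs, targets)` already on the right device. The function name `train_one_epoch` and its signature are illustrative, not the actual NB2 code:

```python
def train_one_epoch(model, loader, optimizer, loss_fn,
                    grad_accum_steps=3, log_every=100):
    """Run one epoch with gradient accumulation; return the optimizer step count."""
    model.train()
    optimizer.zero_grad()
    n_batches = len(loader)
    optimizer_steps = 0
    running_loss = 0.0

    for i, (x, y) in enumerate(loader, start=1):
        loss = loss_fn(model(x), y)
        # Scale the loss so the accumulated gradient is the average over
        # grad_accum_steps micro-batches rather than their sum.
        (loss / grad_accum_steps).backward()
        running_loss += loss.item()

        # Step every grad_accum_steps micro-batches, plus once at the end so a
        # leftover partial group is not discarded (it is still scaled by
        # 1 / grad_accum_steps, so that final update is slightly smaller).
        if i % grad_accum_steps == 0 or i == n_batches:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1

        # Flush so progress shows up in the Kaggle log even if the kernel
        # dies shortly afterwards.
        if i % log_every == 0:
            print(f"batch {i}/{n_batches}  mean loss {running_loss / i:.4f}",
                  flush=True)

    return optimizer_steps
```

Only the micro-batch of `batch_size` samples is held in memory for each forward and backward pass, which is where the memory saving comes from. The optimizer still sees gradients averaged over the full effective batch.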