Commit b42940f (verified) by eeeeeeeeeeeeee3 · 1 parent: cfef58c

Upload RESUME_TRAINING_20_EPOCHS.md with huggingface_hub

Files changed (1)
  1. RESUME_TRAINING_20_EPOCHS.md +144 -0
RESUME_TRAINING_20_EPOCHS.md ADDED
@@ -0,0 +1,144 @@
+ # Resume Training from 20-Epoch Checkpoint
+
+ ## Checkpoint Information
+
+ - **Checkpoint File:** `/workspace/soccer_cv_ball/models/soccer ball/checkpoint_20_soccer_ball.pth`
+ - **Size:** 474.57 MB
+ - **Epoch:** 19 (completed 20 epochs, 0-19)
+ - **Next Epoch:** 20
+
+ ## Training Configuration
+
+ ### Model Architecture
+ - **Model Type:** RF-DETR Base
+ - **Encoder:** dinov2_windowed_small
+ - **Resolution:** 1288x1288
+ - **Classes:** 2 (ball + background)
+ - **Class Names:** ['ball']
+ - **Num Queries:** 300
+ - **Decoder Layers:** 3
+ - **Hidden Dim:** 256
+ - **Self-Attention Heads:** 8
+ - **Cross-Attention Heads:** 16
+
+ ### Training Hyperparameters
+ - **Batch Size:** 2
+ - **Gradient Accumulation Steps:** 16 (effective batch size: 2 × 16 = 32)
+ - **Learning Rate:** 0.0002
+ - **Encoder Learning Rate:** 0.00015
+ - **Weight Decay:** 0.0001
+ - **Gradient Clip:** 0.1
+ - **Total Epochs:** 20
+ - **Warmup Epochs:** 0.0
+ - **LR Scheduler:** step
+ - **LR Drop:** 100 (epoch at which the step scheduler lowers the LR; not reached in 20 epochs)
+ - **Mixed Precision (AMP):** Enabled
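
The batch-size and scheduler figures above can be checked with a few lines of Python. This is a minimal sketch, not RF-DETR code: both function names are hypothetical, and the 0.1 drop factor is an assumption (a common choice for DETR-style step schedules) that the source does not state.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Number of samples that contribute to each optimizer step."""
    return batch_size * grad_accum_steps


def step_lr(base_lr: float, epoch: int, lr_drop: int, gamma: float = 0.1) -> float:
    """LR under a single-drop step schedule.

    gamma is an ASSUMED drop factor; it is not recorded in the checkpoint notes.
    """
    return base_lr if epoch < lr_drop else base_lr * gamma


print(effective_batch_size(2, 16))  # 32
print(step_lr(2e-4, 19, 100))       # 0.0002 (epoch 19 is before the drop at 100)
```

Under these assumptions the learning rate stays at 0.0002 for any resumed run that ends before epoch 100.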
+
+ ### Loss Configuration
+ - **Classification Loss Coef:** 1.0
+ - **Bbox Loss Coef:** 5
+ - **GIoU Loss Coef:** 2
+ - **Focal Alpha:** 0.25
+ - **Auxiliary Loss:** Enabled
+ - **Set Cost Class:** 2
+ - **Set Cost Bbox:** 5
+ - **Set Cost GIoU:** 2
+
+ ### Optimizer & Scheduler
+ - **Optimizer State:** ✅ Saved in checkpoint
+ - **Scheduler State:** ✅ Saved in checkpoint
+ - **EMA Model:** ✅ Saved (decay: 0.993, tau: 100)
+
+ ### Dataset Information
+ - **Original Dataset Path:** `/workspace/soccer_coach_cv/models/ball_detection_open_soccer_ball/dataset`
+ - **Dataset Format:** Roboflow (YOLO converted to COCO)
+ - **Original Output Dir:** `/workspace/soccer_coach_cv/models/ball_detection_open_soccer_ball`
+
+ ### Checkpoint Contents
+ - ✅ Model state dict (487 layers)
+ - ✅ Optimizer state dict
+ - ✅ Learning rate scheduler state
+ - ✅ EMA model state
+ - ✅ Training arguments
+ - ✅ Epoch number (19)
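
The contents listed above can be confirmed before resuming. The following is a rough sketch only; it is not the `scripts/verify_checkpoint.py` file from this commit. The key names in `EXPECTED_KEYS` are assumptions (only `model` appears elsewhere in this document), and both function names are hypothetical.

```python
from typing import Any, Dict

# ASSUMED key names: only 'model' is confirmed by this document.
# Adjust the rest to whatever the checkpoint actually contains.
EXPECTED_KEYS = ("model", "optimizer", "lr_scheduler", "ema_model", "args", "epoch")


def summarize_checkpoint(ckpt: Dict[str, Any]) -> Dict[str, Any]:
    """Report which expected entries exist and which epoch would come next."""
    epoch = ckpt.get("epoch")
    return {
        "present": [k for k in EXPECTED_KEYS if k in ckpt],
        "missing": [k for k in EXPECTED_KEYS if k not in ckpt],
        "num_model_tensors": len(ckpt["model"]) if "model" in ckpt else None,
        "next_epoch": epoch + 1 if isinstance(epoch, int) else None,
    }


def load_and_summarize(path: str) -> Dict[str, Any]:
    """Load a checkpoint on CPU and summarize it."""
    import torch  # lazy import so summarize_checkpoint works without PyTorch

    # weights_only=False unpickles arbitrary objects; use trusted files only.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)
    return summarize_checkpoint(ckpt)
```

For this checkpoint, `next_epoch` should come back as 20 if the epoch is stored as an integer under `epoch`.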
+
+ ## How to Resume Training
+
+ ### Option 1: Using the Resume Script
+
+ ```bash
+ cd /workspace/soccer_cv_ball
+ bash scripts/resume_from_20_epochs.sh
+ ```
+
+ ### Option 2: Using train_ball.py with Resume Flag
+
+ First, update the dataset path in the config or script to match your current dataset location, then run:
+
+ ```bash
+ cd /workspace/soccer_cv_ball
+ python scripts/train_ball.py \
+   --config configs/resume_20_epochs.yaml \
+   --output-dir models
+ ```
+
+ ### Option 3: Direct RF-DETR Training
+
+ If using RF-DETR directly:
+
+ ```python
+ import torch
+ from rfdetr import RFDETRBase
+
+ # Initialize model
+ model = RFDETRBase(class_names=['ball'])
+
+ # Load checkpoint (weights_only=False unpickles arbitrary objects; use trusted files only)
+ checkpoint_path = "/workspace/soccer_cv_ball/models/soccer ball/checkpoint_20_soccer_ball.pth"
+ checkpoint = torch.load(checkpoint_path, map_location='cpu', weights_only=False)
+
+ # Load model weights, keeping only tensors whose name and shape match the current model
+ if 'model' in checkpoint:
+     model_state = checkpoint['model']
+     if hasattr(model, 'model') and hasattr(model.model, 'model'):
+         current_state = model.model.model.state_dict()
+         filtered_state = {}
+         for key, value in model_state.items():
+             if key in current_state and current_state[key].shape == value.shape:
+                 filtered_state[key] = value
+         model.model.model.load_state_dict(filtered_state, strict=False)
+
+ # Continue training with RF-DETR's train() method
+ # (pass resume=checkpoint_path to resume from epoch 20)
+ ```
+
+ ## Important Notes
+
+ 1. **Dataset Path:** The original training used a dataset at `/workspace/soccer_coach_cv/models/ball_detection_open_soccer_ball/dataset`. You may need to:
+    - Update the dataset path in the config/script to match your current dataset location
+    - Or ensure the dataset exists at the original path
+
+ 2. **Epoch Continuation:** The checkpoint was saved at epoch 19, so resuming starts at epoch 20. To train beyond the original 20 epochs, raise the `epochs` parameter above 20.
+
+ 3. **Output Directory:** The original training saved to `/workspace/soccer_coach_cv/models/ball_detection_open_soccer_ball`. You may want to change this to save in the current workspace.
+
+ 4. **Model Compatibility:** The checkpoint uses RF-DETR format with the model structure: `model.model.model` (RFDETRBase -> Model -> LWDETR).
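
The epoch arithmetic in note 2 is easy to get wrong, because Total Epochs in the configuration above is already 20. The sketch below assumes the training loop iterates over `range(start_epoch, epochs)`, which is a common pattern but is not confirmed by this document; `epochs_remaining` is a hypothetical helper.

```python
def epochs_remaining(checkpoint_epoch: int, total_epochs: int) -> range:
    """Epoch indices a resumed run would execute.

    ASSUMES the loop is `for epoch in range(start_epoch, total_epochs)`.
    """
    start_epoch = checkpoint_epoch + 1
    return range(start_epoch, total_epochs)


print(len(epochs_remaining(19, 20)))   # 0
print(list(epochs_remaining(19, 25)))  # [20, 21, 22, 23, 24]
```

Under that assumption, resuming with `epochs` left at 20 would run no further epochs, while setting it to 25 would run five more (20 through 24).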
+
+ ## Files Created
+
+ 1. **`training_info_20_epochs.json`** - Complete training information extracted from checkpoint
+ 2. **`configs/resume_20_epochs.yaml`** - YAML config for resuming training
+ 3. **`scripts/resume_from_20_epochs.sh`** - Shell script to resume training
+ 4. **`scripts/verify_checkpoint.py`** - Script to verify checkpoint validity
+
+ ## Training Progress
+
+ - **Completed:** 20 epochs (0-19)
+ - **Checkpoint saved:** Epoch 19
+ - **Ready to resume:** Yes ✅
+
+ ## Next Steps
+
+ 1. Verify dataset path exists and is accessible
+ 2. Update paths in config/script if needed
+ 3. Run resume script to continue training from epoch 20
+ 4. Monitor training logs and metrics
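
Step 1 can be scripted. The sketch below assumes a Roboflow-style COCO export with `train`, `valid` and `test` folders, each holding an `_annotations.coco.json` file; that layout is an assumption about this dataset, and `missing_dataset_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# ASSUMED layout: one folder per split, each with its own annotation file.
SPLITS = ("train", "valid", "test")
ANNOTATION_FILE = "_annotations.coco.json"


def missing_dataset_files(dataset_dir: str) -> List[str]:
    """Return the paths that are expected but absent (empty list means OK)."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return [str(root)]
    return [
        str(root / split / ANNOTATION_FILE)
        for split in SPLITS
        if not (root / split / ANNOTATION_FILE).is_file()
    ]


if __name__ == "__main__":
    problems = missing_dataset_files(
        "/workspace/soccer_coach_cv/models/ball_detection_open_soccer_ball/dataset"
    )
    print("dataset OK" if not problems else f"missing: {problems}")
```

If the dataset has moved, pass the new location instead of the original path and update the config to match.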