zsh12787 committed
Commit c2a2e8b · verified · 1 Parent(s): 2a467a5

Delete README.md

Files changed (1): README.md +0 -118
README.md DELETED
@@ -1,118 +0,0 @@
# Checkpoint Compatibility Information

## Checkpoint Location
`/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap/`

## Source
Converted from: `/datasets/objaverse/shuhongz_adobe_ckpts/1023_generated_830k_lap_0_28_only_t2i/checkpoint-18000`

## Changes Applied
- **Removed**: All `image_layerwise_attention_pooling.*` keys (18 keys); see the filtering sketch below
- **Kept**: All other modules, including `text_layerwise_attention_pooling`
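
A filtered copy like this can be produced by loading the original weights, dropping the `image_layerwise_attention_pooling.*` entries, and re-saving. The snippet below is a sketch, not the exact conversion script; in particular, it assumes the original checkpoint also stores its weights as `dit_lora.safetensors`.
```python
from safetensors.torch import load_file, save_file

# Assumed source filename inside the original checkpoint directory
SRC = "/datasets/objaverse/shuhongz_adobe_ckpts/1023_generated_830k_lap_0_28_only_t2i/checkpoint-18000/dit_lora.safetensors"
DST = "/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap/dit_lora.safetensors"

state = load_file(SRC)
# Drop the 18 image_layerwise_attention_pooling.* tensors, keep everything else
cleaned = {k: v for k, v in state.items()
           if not k.startswith("image_layerwise_attention_pooling.")}
save_file(cleaned, DST)
print(f"kept {len(cleaned)} of {len(state)} keys")
```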

## Compatibility
This checkpoint is compatible with models using the **HYBRID STRATEGY**:
- Text tokens: Processed through `text_layerwise_attention_pooling`
- Image tokens: Use ViT features directly (no LAP)

Target model: `uno_debug/1029_mixed_text_lap_internvl_s2i_train_mllm_only_masked_loss_clip_lora.py`
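
Schematically, the hybrid strategy amounts to the sketch below. This is purely illustrative: only `text_layerwise_attention_pooling` is a real module name from the checkpoint; the function and argument names are hypothetical placeholders.
```python
def build_conditioning_tokens(model, text_hidden_states, image_vit_features):
    # Text path: layerwise hidden states are pooled by the learned LAP module
    text_tokens = model.text_layerwise_attention_pooling(text_hidden_states)
    # Image path: ViT features are passed through unchanged (no LAP module)
    image_tokens = image_vit_features
    return text_tokens, image_tokens
```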

## Files Included
1. ✅ `dit_lora.safetensors` - Cleaned model weights (334 keys, ~2.2 GB)
2. ✅ `scheduler.bin` - Learning rate scheduler state
3. ❌ `optimizer.bin` - NOT INCLUDED (see below)

## ⚠️ Important: optimizer.bin NOT Included

**Why optimizer.bin is NOT compatible:**

The optimizer.bin from the original checkpoint stores optimizer states (momentum, variance, etc.) for all 352 parameters, including the 18 `image_layerwise_attention_pooling` parameters that have been removed.

**Problem:**
- Optimizer states are indexed by parameter position/ID, not by name (see the toy example after this list)
- The cleaned model has 334 parameters (18 fewer than the original 352)
- Loading the old optimizer.bin would therefore cause parameter ID mismatches
- This leads to training errors or incorrect optimizer state application
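
A toy illustration of the indexing problem, using a throwaway two-layer model rather than the actual DiT:
```python
import torch

# Toy model, not the actual DiT: just to show how optimizer state is keyed
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model(torch.randn(2, 4)).sum().backward()
optimizer.step()

state = optimizer.state_dict()
print(list(state["state"].keys()))         # [0, 1, 2, 3] -> integer indices
print(state["param_groups"][0]["params"])  # [0, 1, 2, 3] -> still no names

# If parameters are removed from the model, these integer indices shift, so a
# state dict saved for 352 parameters no longer lines up with a 334-parameter
# model. That is why the old optimizer.bin cannot simply be reloaded.
```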

**Solutions:**

### Option 1: Start Fresh (RECOMMENDED)
```python
# In your training config, set:
resume_from_checkpoint = "/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap"

# The training script will:
# ✅ Load model weights from dit_lora.safetensors
# ✅ Load scheduler state from scheduler.bin
# ✅ Initialize a fresh optimizer (no momentum/variance carried over)
```

**Pros:**
- Clean start with no parameter mismatches
- Model weights are preserved
- Safe and reliable

**Cons:**
- Loses the optimizer momentum/variance accumulated during previous training
- May need a brief warm-up period (but usually minimal impact)

### Option 2: Keep Original Checkpoint
If you absolutely need the optimizer state, use the original checkpoint:
```python
resume_from_checkpoint = "/datasets/objaverse/shuhongz_adobe_ckpts/1023_generated_830k_lap_0_28_only_t2i/checkpoint-18000"
```

But you'll need to modify the loading code to skip the incompatible keys:
```python
from safetensors.torch import load_file

# In the resume_from_checkpoint function:
lora_state = load_file(path, device=device)
# Filter out the image_layerwise_attention_pooling keys
lora_state = {k: v for k, v in lora_state.items()
              if not k.startswith('image_layerwise_attention_pooling.')}
unwarp_dit.load_state_dict(lora_state, strict=False)
```

## Verification

To verify the checkpoint structure:
```bash
python3 -c "
from safetensors.torch import load_file
state_dict = load_file('/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap/dit_lora.safetensors')
modules = {}
for key in state_dict.keys():
    module = key.split('.')[0]
    modules[module] = modules.get(module, 0) + 1
print('Modules in checkpoint:')
for m, count in sorted(modules.items()):
    print(f'  {m}: {count} keys')
"
```

Expected output:
- double_blocks: 152 keys
- internvl_projector: 8 keys
- single_blocks: 152 keys
- text_layerwise_attention_pooling: 18 keys
- vector_in: 4 keys

**Total: 334 keys** (vs 352 in the original)

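If you prefer a pass/fail check over eyeballing the module counts, an equivalent assertion-based snippet (same checkpoint path and expected counts as above) is:
```python
from safetensors.torch import load_file

state_dict = load_file('/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap/dit_lora.safetensors')

# The cleaned checkpoint should have exactly 334 keys and no image-LAP entries
assert len(state_dict) == 334, f"unexpected key count: {len(state_dict)}"
assert not any(k.startswith('image_layerwise_attention_pooling.') for k in state_dict), \
    "image_layerwise_attention_pooling keys should have been removed"
print("Checkpoint structure looks as expected")
```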

## Training Command Example

```bash
# Using the cleaned checkpoint without optimizer state
accelerate launch --config_file config/accelerate/default_config.yaml \
    uno_debug/1029_mixed_text_lap_internvl_s2i_train_mllm_only_masked_loss_clip_lora.py \
    --config config/train_config.yaml \
    --resume_from_checkpoint "/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap"
```

The training will automatically (see the sketch after this list):
1. Load `dit_lora.safetensors` with 334 parameters
2. Load `scheduler.bin` for the learning rate schedule
3. Initialize a fresh optimizer for all trainable parameters
4. Continue training from step 18000
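
For intuition, the resume behaviour described above roughly corresponds to the sketch below. This is not the actual code in the training script; the function name, arguments, and the assumption that `scheduler.bin` is a `torch.save` pickle are all hypothetical.
```python
import os
import torch
from safetensors.torch import load_file

def resume_from_checkpoint(ckpt_dir, dit, lr_scheduler, optimizer, device="cpu"):
    # 1. Model weights: always present in this checkpoint
    lora_state = load_file(os.path.join(ckpt_dir, "dit_lora.safetensors"), device=device)
    dit.load_state_dict(lora_state, strict=False)

    # 2. Scheduler state: present in this checkpoint (assumed to be a torch pickle)
    lr_scheduler.load_state_dict(torch.load(os.path.join(ckpt_dir, "scheduler.bin")))

    # 3. Optimizer state: optimizer.bin was deliberately left out, so the freshly
    #    constructed optimizer is kept as-is and no state is loaded
    optimizer_path = os.path.join(ckpt_dir, "optimizer.bin")
    if os.path.exists(optimizer_path):
        optimizer.load_state_dict(torch.load(optimizer_path))
```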