# Checkpoint Compatibility Information

## Checkpoint Location

`/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap/`

## Source

Converted from: `/datasets/objaverse/shuhongz_adobe_ckpts/1023_generated_830k_lap_0_28_only_t2i/checkpoint-18000`

## Changes Applied

- **Removed**: All `image_layerwise_attention_pooling.*` keys (18 keys; the filtering step is sketched below)
- **Kept**: All other modules, including `text_layerwise_attention_pooling`
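
For reference, producing such a cleaned checkpoint amounts to filtering keys out of the safetensors file. A minimal sketch follows; the filename inside the source checkpoint directory is an assumption, while the key prefix and key counts are taken from above:

```python
# Hypothetical reconstruction of the cleaning step; the exact filename inside
# the source checkpoint directory is an assumption.
from safetensors.torch import load_file, save_file

src = "/datasets/objaverse/shuhongz_adobe_ckpts/1023_generated_830k_lap_0_28_only_t2i/checkpoint-18000/dit_lora.safetensors"
dst = "/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap/dit_lora.safetensors"

state_dict = load_file(src)  # 352 keys in the original checkpoint
cleaned = {k: v for k, v in state_dict.items()
           if not k.startswith("image_layerwise_attention_pooling.")}
save_file(cleaned, dst)      # 334 keys remain
print(f"kept {len(cleaned)} of {len(state_dict)} keys")
```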

## Compatibility

This checkpoint is compatible with models using the **HYBRID STRATEGY** (toy routing sketch below):

- Text tokens: Processed through `text_layerwise_attention_pooling`
- Image tokens: Use ViT features directly (no LAP)

Target model: `uno_debug/1029_mixed_text_lap_internvl_s2i_train_mllm_only_masked_loss_clip_lora.py`
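
The routing itself is not part of this checkpoint, but a toy sketch may help visualize the hybrid strategy. Everything here except the `text_layerwise_attention_pooling` attribute name is hypothetical (dimensions, module internals); this is not the target model's actual code:

```python
import torch
import torch.nn as nn

# Toy illustration of the hybrid routing: text features go through a
# layerwise-attention-pooling (LAP) module, image ViT features bypass it.
class HybridContext(nn.Module):
    def __init__(self, dim=64, num_layers=4):
        super().__init__()
        # stand-in for text_layerwise_attention_pooling (the real module differs)
        self.text_layerwise_attention_pooling = nn.Linear(num_layers * dim, dim)

    def forward(self, text_layer_feats, image_vit_feats):
        # text tokens: per-layer features pooled through the LAP module
        pooled_text = self.text_layerwise_attention_pooling(torch.cat(text_layer_feats, dim=-1))
        # image tokens: ViT features used directly (no LAP)
        return torch.cat([pooled_text, image_vit_feats], dim=1)

# usage with dummy tensors: 4 layers of text features plus one set of image features
layers = [torch.randn(1, 8, 64) for _ in range(4)]
ctx = HybridContext()(layers, torch.randn(1, 16, 64))
print(ctx.shape)  # torch.Size([1, 24, 64])
```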

## Files Included

1. ✅ `dit_lora.safetensors` - Cleaned model weights (334 keys, ~2.2GB)
2. ✅ `scheduler.bin` - Learning rate scheduler state
3. ❌ `optimizer.bin` - NOT INCLUDED (see below)

## ⚠️ Important: optimizer.bin NOT Included

**Why optimizer.bin is NOT compatible:**

The optimizer.bin from the original checkpoint stores optimizer states (momentum, variance, etc.) for all 352 parameters, including the 18 `image_layerwise_attention_pooling` parameters that have been removed.

**Problem:**

- Optimizer states are indexed by parameter position/ID, not by name (illustrated below)
- The cleaned model has 334 parameters (18 fewer than the original 352)
- Using the old optimizer.bin would cause parameter ID mismatches
- This leads to training errors or to optimizer state being applied to the wrong parameters
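
To make the mismatch concrete, here is a small, generic PyTorch illustration (deliberately not tied to this training script): a saved optimizer state references parameters by integer position, so a model with fewer parameters cannot load it.

```python
import torch

# Original model: 4 parameters (2 Linear layers, weight + bias each).
original = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 4))
opt = torch.optim.AdamW(original.parameters(), lr=1e-3)
original(torch.randn(2, 4)).sum().backward()
opt.step()  # creates per-parameter state (exp_avg, exp_avg_sq)

print(list(opt.state_dict()["state"].keys()))         # [0, 1, 2, 3] -- integer IDs, no names
print(opt.state_dict()["param_groups"][0]["params"])  # [0, 1, 2, 3]

# A "cleaned" model with parameters removed: only 2 parameters remain.
cleaned = torch.nn.Sequential(torch.nn.Linear(4, 4))
new_opt = torch.optim.AdamW(cleaned.parameters(), lr=1e-3)
try:
    new_opt.load_state_dict(opt.state_dict())  # group sizes no longer match
except ValueError as err:
    print(f"Cannot reuse the old optimizer state: {err}")
```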

**Solutions:**

### Option 1: Start Fresh (RECOMMENDED)

```python
# In your training config, set:
resume_from_checkpoint = "/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap"

# The training script will:
# ✅ Load model weights from dit_lora.safetensors
# ✅ Load scheduler state from scheduler.bin
# ✅ Initialize a fresh optimizer (no momentum/variance carried over)
```

**Pros:**

- Clean start with no parameter mismatches
- Model weights are preserved
- Safe and reliable

**Cons:**

- Loses the optimizer momentum/variance accumulated during previous training
- May need a brief warm-up period (but the impact is usually minimal)

### Option 2: Keep Original Checkpoint

If you absolutely need the optimizer state, use the original checkpoint:

```python
resume_from_checkpoint = "/datasets/objaverse/shuhongz_adobe_ckpts/1023_generated_830k_lap_0_28_only_t2i/checkpoint-18000"
```

But you'll need to modify the loading code to skip the incompatible keys:

```python
# In the resume_from_checkpoint function:
lora_state = load_file(path, device=device)
# Filter out image_layerwise_attention_pooling keys
lora_state = {k: v for k, v in lora_state.items()
              if not k.startswith('image_layerwise_attention_pooling.')}
unwarp_dit.load_state_dict(lora_state, strict=False)
```

## Verification

To verify the checkpoint structure:

```bash
python3 -c "
from safetensors.torch import load_file
state_dict = load_file('/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap/dit_lora.safetensors')
modules = {}
for key in state_dict.keys():
    module = key.split('.')[0]
    modules[module] = modules.get(module, 0) + 1
print('Modules in checkpoint:')
for m, count in sorted(modules.items()):
    print(f'  {m}: {count} keys')
"
```

Expected output:

- double_blocks: 152 keys
- internvl_projector: 8 keys
- single_blocks: 152 keys
- text_layerwise_attention_pooling: 18 keys
- vector_in: 4 keys

**Total: 334 keys** (vs 352 in the original)

## Training Command Example

```bash
# Using the cleaned checkpoint without optimizer state
accelerate launch --config_file config/accelerate/default_config.yaml \
  uno_debug/1029_mixed_text_lap_internvl_s2i_train_mllm_only_masked_loss_clip_lora.py \
  --config config/train_config.yaml \
  --resume_from_checkpoint "/scratch/zsh/shuhongz_adobe_ckpts/ckpt_for_single_lap"
```

The training will automatically (see the sketch after this list):

1. Load `dit_lora.safetensors` with 334 parameters
2. Load `scheduler.bin` for the learning rate schedule
3. Initialize a fresh optimizer for all trainable parameters
4. Continue training from step 18000
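
The resume logic itself lives in the training script and is not reproduced here; the following is only a rough, hypothetical sketch of the four steps above (the function name and signature are illustrative, not the script's actual API):

```python
import os
import torch
from safetensors.torch import load_file

def resume_from_cleaned_checkpoint(ckpt_dir, model, lr_scheduler, device="cpu"):
    """Hypothetical sketch of the resume behaviour described above."""
    # 1. Model weights: 334 keys; strict=False tolerates the removed image-LAP module
    lora_state = load_file(os.path.join(ckpt_dir, "dit_lora.safetensors"), device=device)
    model.load_state_dict(lora_state, strict=False)

    # 2. Learning-rate scheduler state
    scheduler_state = torch.load(os.path.join(ckpt_dir, "scheduler.bin"), map_location=device)
    lr_scheduler.load_state_dict(scheduler_state)

    # 3. No optimizer.bin in ckpt_dir: the caller keeps its freshly constructed optimizer
    # 4. The starting step (18000) would be taken from the checkpoint name or config
```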