Update README.md
README.md CHANGED
@@ -68,9 +68,9 @@ Input Text (Reasoning Trace)
         ↓
 [Frozen Base LM Encoder] ← Pre-trained, frozen during training
         ↓
-[
+[Final Token (EOS) Pooling]
         ↓
-[Lightweight
+[Lightweight Linear Head] ← Only these parameters are trained
         ↓
 Scalar Reward Score
 ```
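For orientation, a minimal sketch of the architecture in this diagram, assuming a Hugging Face `transformers` base model. The `RewardModel` class and its constructor are illustrative, not code from this repository; the diagram names a linear head, while the ~500K-1M trainable-parameter count quoted in the training configuration suggests the actual head may be a small MLP.

```python
# Illustrative sketch, not the repository's code: a frozen decoder-only
# base LM plus a trainable linear scoring head with final-token pooling.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "facebook/opt-1.3b"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # base LM stays frozen during training
        self.scoring_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pool at the last non-padding token; in a causal LM this is the
        # only position whose hidden state has attended to the full trace.
        last = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        pooled = out.last_hidden_state[batch_idx, last]
        return self.scoring_head(pooled).squeeze(-1)  # one scalar per trace
```

Final-token pooling is the natural choice for a decoder-only base like OPT, since earlier positions cannot attend to later tokens.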
@@ -129,14 +129,14 @@ Each pair contains:
 ### Training Configuration
 
 **Hyperparameters:**
-- **Base Model**: facebook/
+- **Base Model**: facebook/opt-1.3b
 - **Trainable Parameters**: Scoring head only (~500K-1M params)
 - **Optimizer**: AdamW
-  - Learning rate:
+  - Learning rate: 2e-5
   - Betas: (0.9, 0.999)
   - Weight decay: 0.01
-- **Learning Rate Schedule**:
-- **Batch Size**:
+- **Learning Rate Schedule**: Linear warmup (50 steps) + constant
+- **Batch Size**: 8 pairs
 - **Gradient Clipping**: Max norm 1.0
 - **Training Steps**: 800
 - **Mixed Precision**: FP16
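A hedged sketch of how these hyperparameters could be wired up in PyTorch. The Bradley-Terry pairwise loss, the `model` and `loader` objects, and the batch field names are assumptions: the "pairs" wording implies chosen/rejected training, but the diff does not show the loss itself.

```python
# Sketch under assumptions: pairwise reward-model training with the
# hyperparameters listed above. model, loader, and field names are
# hypothetical; only the numeric settings come from the README.
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import get_constant_schedule_with_warmup

head_params = [p for p in model.parameters() if p.requires_grad]  # scoring head only
optimizer = AdamW(head_params, lr=2e-5, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=50)
scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

for step, batch in zip(range(800), loader):  # 800 steps, 8 pairs per batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        r_chosen = model(batch["chosen_ids"], batch["chosen_mask"])
        r_rejected = model(batch["rejected_ids"], batch["rejected_mask"])
        # Assumed Bradley-Terry objective: score chosen above rejected
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping
    torch.nn.utils.clip_grad_norm_(head_params, max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()  # linear warmup for 50 steps, then constant
```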
@@ -249,8 +249,9 @@ def score_trace(trace_text: str) -> float:
     with torch.no_grad():
         # Get base model embeddings
         encoder_outputs = base_model(**inputs)
-        # Pool
-
+        # Pool at actual sequence end (accounts for padding)
+        seq_lengths = inputs["attention_mask"].sum(dim=1) - 1
+        pooled = encoder_outputs.last_hidden_state[torch.arange(seq_lengths.size(0)), seq_lengths]
         # Get reward score
         score = scoring_head(pooled).squeeze(-1).cpu().item()
 
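Assembled with the new pooling lines, the whole helper plausibly reads as below. The tokenizer call, `device`, and the `return` are assumptions, since the hunk only shows the function body from line 249 on.

```python
# Sketch of the full helper; tokenization settings are assumed.
def score_trace(trace_text: str) -> float:
    inputs = tokenizer(trace_text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        # Get base model embeddings
        encoder_outputs = base_model(**inputs)
        # Pool at actual sequence end (accounts for padding)
        seq_lengths = inputs["attention_mask"].sum(dim=1) - 1
        pooled = encoder_outputs.last_hidden_state[torch.arange(seq_lengths.size(0)), seq_lengths]
        # Get reward score
        score = scoring_head(pooled).squeeze(-1).cpu().item()
    return score

# Usage: score = score_trace("First, 12 * 8 = 96; then ...")
```

The mask-aware indexing matters once padded batches are scored; for a single unpadded sequence, the pooled position is simply the last token.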