LossFunctionLover committed on
Commit 26dc055 · verified · 1 Parent(s): 1fe85ee

Update README.md

Files changed (1): README.md +9 -8
README.md CHANGED
@@ -68,9 +68,9 @@ Input Text (Reasoning Trace)
  ↓
  [Frozen Base LM Encoder] ← Pre-trained, frozen during training
  ↓
- [Mean Pooling]
+ [Final Token (EOS) Pooling]
  ↓
- [Lightweight MLP Head] ← Only these parameters are trained
+ [Lightweight Linear Head] ← Only these parameters are trained
  ↓
  Scalar Reward Score
  ```
@@ -129,14 +129,14 @@ Each pair contains:
  ### Training Configuration

  **Hyperparameters:**
- - **Base Model**: facebook/OPT1.3b
+ - **Base Model**: facebook/opt-1.3b
  - **Trainable Parameters**: Scoring head only (~500K-1M params)
  - **Optimizer**: AdamW
-   - Learning rate: 1e-4
+   - Learning rate: 2e-5
    - Betas: (0.9, 0.999)
    - Weight decay: 0.01
- - **Learning Rate Schedule**: Cosine decay with 50-step warmup
- - **Batch Size**: 32 pairs
+ - **Learning Rate Schedule**: Linear warmup (50 steps) + constant
+ - **Batch Size**: 8 pairs
  - **Gradient Clipping**: Max norm 1.0
  - **Training Steps**: 800
  - **Mixed Precision**: FP16
@@ -249,8 +249,9 @@ def score_trace(trace_text: str) -> float:
      with torch.no_grad():
          # Get base model embeddings
          encoder_outputs = base_model(**inputs)
-         # Pool final token (EOS)
-         pooled = encoder_outputs.last_hidden_state[:, -1, :]
+         # Pool at actual sequence end (accounts for padding)
+         seq_lengths = inputs["attention_mask"].sum(dim=1) - 1
+         pooled = encoder_outputs.last_hidden_state[torch.arange(seq_lengths.size(0)), seq_lengths]

      # Get reward score
      score = scoring_head(pooled).squeeze(-1).cpu().item()
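The pooling fix in the last hunk can be checked in isolation: with right-padded batches, `[:, -1, :]` reads a padding position, while indexing at `attention_mask.sum(dim=1) - 1` reads each sequence's last real token. A minimal sketch with toy tensors (the helper name `pool_final_token` and the shapes are illustrative, not from the README):

```python
import torch


def pool_final_token(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of each sequence's last non-padding token."""
    # Index of the last position where attention_mask == 1, per sequence
    seq_lengths = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(last_hidden_state.size(0))
    # Advanced indexing: one (batch, position) pair per sequence
    return last_hidden_state[batch_idx, seq_lengths]


# Toy batch: 2 sequences, max length 4, hidden size 3
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1],   # full-length sequence
                     [1, 1, 0, 0]])  # right-padded after 2 tokens
pooled = pool_final_token(hidden, mask)
```

For the padded row, `pooled[1]` equals `hidden[1, 1]` (the last real token) rather than `hidden[1, -1]` (a padding position), which is exactly the bug the commit fixes.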
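The updated hyperparameter block (head-only training, AdamW at 2e-5, linear warmup to a constant rate) can be sketched end to end. This is a toy stand-in, not the repo's training code: `TINY_HIDDEN` and the embedding "encoder" replace the real frozen facebook/opt-1.3b, and only the hyperparameter values come from the diff:

```python
import torch
import torch.nn as nn

TINY_HIDDEN = 16  # stand-in size; opt-1.3b uses a much larger hidden dim

# Stand-in for the frozen base LM: no encoder parameters receive gradients
encoder = nn.Embedding(100, TINY_HIDDEN)
for p in encoder.parameters():
    p.requires_grad = False

# The only trained module: a linear head mapping pooled states to a scalar
scoring_head = nn.Linear(TINY_HIDDEN, 1)

# AdamW over the head alone, with the README's updated hyperparameters
optimizer = torch.optim.AdamW(
    scoring_head.parameters(), lr=2e-5, betas=(0.9, 0.999), weight_decay=0.01
)

# Linear warmup over 50 steps, then constant at the base learning rate
warmup_steps = 50
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

trainable = sum(p.numel() for p in scoring_head.parameters() if p.requires_grad)
```

Freezing via `requires_grad = False` and passing only `scoring_head.parameters()` to the optimizer is what keeps the trainable count in the sub-million range the README quotes for the real model.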