Update README.md
README.md CHANGED
@@ -68,9 +68,9 @@ Input Text (Reasoning Trace)
         ↓
 [Frozen Base LM Encoder] ← Pre-trained, frozen during training
         ↓
-[
+[Final Token (EOS) Pooling]
         ↓
-[Lightweight
+[Lightweight Linear Head] ← Only these parameters are trained
         ↓
 Scalar Reward Score
 ```
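For orientation, a minimal sketch of the architecture in this diagram, assuming a Hugging Face `transformers` base model. The `RewardModel` class and its constructor are illustrative, not code from this repository; the diagram names a linear head, while the ~500K-1M trainable-parameter count quoted in the training configuration suggests the actual head may be a small MLP.

```python
# Illustrative sketch, not the repository's code: a frozen decoder-only
# base LM plus a trainable linear scoring head with final-token pooling.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "facebook/opt-1.3b"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # base LM stays frozen during training
        self.scoring_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pool at the last non-padding token; in a causal LM this is the
        # only position whose hidden state has attended to the full trace.
        last = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        pooled = out.last_hidden_state[batch_idx, last]
        return self.scoring_head(pooled).squeeze(-1)  # one scalar per trace
```

Final-token pooling is the natural choice for a decoder-only base like OPT, since earlier positions cannot attend to later tokens.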
@@ -129,14 +129,14 @@ Each pair contains:
 ### Training Configuration
 
 **Hyperparameters:**
-- **Base Model**: facebook/
+- **Base Model**: facebook/opt-1.3b
 - **Trainable Parameters**: Scoring head only (~500K-1M params)
 - **Optimizer**: AdamW
-  - Learning rate:
+  - Learning rate: 2e-5
   - Betas: (0.9, 0.999)
   - Weight decay: 0.01
-- **Learning Rate Schedule**:
-- **Batch Size**:
+- **Learning Rate Schedule**: Linear warmup (50 steps) + constant
+- **Batch Size**: 8 pairs
 - **Gradient Clipping**: Max norm 1.0
 - **Training Steps**: 800
 - **Mixed Precision**: FP16
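A hedged sketch of how these hyperparameters could be wired up in PyTorch. The Bradley-Terry pairwise loss, the `model` and `loader` objects, and the batch field names are assumptions: the "pairs" wording implies chosen/rejected training, but the diff does not show the loss itself.

```python
# Sketch under assumptions: pairwise reward-model training with the
# hyperparameters listed above. model, loader, and field names are
# hypothetical; only the numeric settings come from the README.
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import get_constant_schedule_with_warmup

head_params = [p for p in model.parameters() if p.requires_grad]  # scoring head only
optimizer = AdamW(head_params, lr=2e-5, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=50)
scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

for step, batch in zip(range(800), loader):  # 800 steps, 8 pairs per batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        r_chosen = model(batch["chosen_ids"], batch["chosen_mask"])
        r_rejected = model(batch["rejected_ids"], batch["rejected_mask"])
        # Assumed Bradley-Terry objective: score chosen above rejected
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale before clipping
    torch.nn.utils.clip_grad_norm_(head_params, max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()  # linear warmup for 50 steps, then constant
```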
@@ -249,8 +249,9 @@ def score_trace(trace_text: str) -> float:
     with torch.no_grad():
         # Get base model embeddings
         encoder_outputs = base_model(**inputs)
-        # Pool
-
+        # Pool at actual sequence end (accounts for padding)
+        seq_lengths = inputs["attention_mask"].sum(dim=1) - 1
+        pooled = encoder_outputs.last_hidden_state[torch.arange(seq_lengths.size(0)), seq_lengths]
         # Get reward score
         score = scoring_head(pooled).squeeze(-1).cpu().item()
 
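Assembled with the new pooling lines, the whole helper plausibly reads as below. The tokenizer call, `device`, and the `return` are assumptions, since the hunk only shows the function body from line 249 on.

```python
# Sketch of the full helper; tokenization settings are assumed.
def score_trace(trace_text: str) -> float:
    inputs = tokenizer(trace_text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        # Get base model embeddings
        encoder_outputs = base_model(**inputs)
        # Pool at actual sequence end (accounts for padding)
        seq_lengths = inputs["attention_mask"].sum(dim=1) - 1
        pooled = encoder_outputs.last_hidden_state[torch.arange(seq_lengths.size(0)), seq_lengths]
        # Get reward score
        score = scoring_head(pooled).squeeze(-1).cpu().item()
    return score

# Usage: score = score_trace("First, 12 * 8 = 96; then ...")
```

The mask-aware indexing matters once padded batches are scored; for a single unpadded sequence, the pooled position is simply the last token.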