LossFunctionLover committed on
Commit e3a4d71 · verified · 1 Parent(s): 26dc055

Delete DATASET_CARD.md

Files changed (1):
  1. DATASET_CARD.md +0 -462
DATASET_CARD.md DELETED
@@ -1,462 +0,0 @@
---
language: en
license: cc-by-4.0
task_categories:
- text-classification
- preference-learning
- reward-modeling
size_categories:
- 10K<n<100K
tags:
- preference-pairs
- reasoning-traces
- outcome-reward-model
- pairwise-ranking
pretty_name: ORM Pairwise Preference Pairs
dataset_info:
  features:
  - name: chosen
    dtype: string
  - name: rejected
    dtype: string
  - name: meta
    dtype:
      chosen_chain_id: string
      rejected_chain_id: string
      chosen_label: int32
      rejected_label: int32
  splits:
  - name: train
    num_examples: 41656
  - name: validation
    num_examples: 1144
  - name: test
    num_examples: 1232
---

# ORM Pairwise Preference Pairs Dataset

<div align="center">

**High-Quality Reasoning Trace Preferences for Training Outcome Reward Models**

[![Paper](https://img.shields.io/badge/Paper-ArXiv-red)](link-to-arxiv)
[![Model](https://img.shields.io/badge/Model-HuggingFace-yellow)](https://huggingface.co/LossFunctionLover/pairwise-orm-model)

</div>

## 📋 Dataset Description

This dataset contains **44,032 curated pairwise preference judgments** over reasoning traces, designed for training robust Outcome Reward Models (ORMs). Each example pairs a preferred (correct) reasoning trace with a rejected (incorrect) one, sampled globally from the curated pool (see Dataset Construction).

**Key Features:**
- ✅ **Weeks of manual curation** and quality control on the source data
- ✅ **Validated quality**: the base model achieves 98.2% pairwise accuracy
- ✅ **Strong signal separation**: Pearson r = 0.87, Spearman ρ = 0.83
- ✅ **Balanced construction**: multiple negatives per positive for robust learning
- ✅ **Full traceability**: chain IDs link every trace back to its source example

## 📊 Dataset Statistics

### Split Sizes

| Split | Pairs | Negatives per Positive | Source Examples |
|-------|-------|------------------------|-----------------|
| **Train** | 41,656 | 8 | 9,482 (pointwise) |
| **Validation** | 1,144 | 4 | 524 (pointwise) |
| **Test** | 1,232 | 4 | 547 (pointwise) |
| **Total** | **44,032** | - | 10,553 |

### Quality Metrics (Source Pointwise Dataset)

**Base Model Log-Probability Analysis:**
- **Pearson correlation**: r = 0.87 (p < 1e-162)
- **Spearman correlation**: ρ = 0.83 (p < 1e-134)
- **Pairwise accuracy**: 98.2%
- **Mean log-prob (positive)**: -2.17
- **Mean log-prob (negative)**: -3.64
- **Separation**: strong discrimination between correct and incorrect traces

![Base Model Distribution](https://your-image-link/distribution.png)

These metrics confirm robust signal quality before the pairwise transformation, validating the dataset's suitability for preference learning.

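To make the metrics concrete, the pairwise-accuracy and correlation computations can be sketched on hypothetical scores (this is not the actual analysis code or data):

```python
import numpy as np

# Hypothetical per-trace scores: base-model mean log-prob and binary label
logprobs = np.array([-2.1, -3.5, -2.3, -3.8, -2.0, -3.6])
labels = np.array([1, 0, 1, 0, 1, 0])

# Pearson correlation between score and label
pearson_r = np.corrcoef(logprobs, labels)[0, 1]

# Pairwise accuracy: fraction of (positive, negative) pairs where the
# positive trace receives the higher log-prob
pos, neg = logprobs[labels == 1], logprobs[labels == 0]
pairwise_acc = (pos[:, None] > neg[None, :]).mean()
```

With well-separated score distributions, as reported above, this pairwise accuracy approaches 1.0.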
## 🏗️ Dataset Construction

### Phase 1: Source Data Curation (Pointwise)

**Original Dataset Structure:**

```json
{
  "qid": "unique-question-id",
  "chain_id": "unique-chain-id",
  "label": 0,
  "orm_label": 0,
  "input_text": "Step-by-step reasoning trace...",
  "prompt": "Original problem statement",
  "steps": ["Step 1", "Step 2", ...],
  "final_answer": "Answer value",
  "meta": {
    "gold_answer": "Ground truth solution",
    "generated_by": "ORM-Repair-Synth-V1",
    "template": "error_type"
  }
}
```

`label` and `orm_label` are binary: 1 = correct, 0 = incorrect.

**Curation Process (Weeks of Work):**
1. **Generation**: synthetic reasoning traces with diverse error patterns
2. **Labeling**: binary correctness annotation (1 = correct, 0 = incorrect)
3. **Quality Control**:
   - Manual review of reasoning validity
   - Verification against ground truth
   - Filtering of ambiguous cases
   - Removal of duplicates and near-duplicates
4. **Validation**: base-model log-probability analysis to confirm signal quality

**Quality Thresholds:**
- Positive examples: logically sound reasoning leading to the correct answer
- Negative examples: clear errors in reasoning or an incorrect final answer
- Filtering: examples whose log-prob is inconsistent with their label are removed

### Phase 2: Pairwise Transformation

**Algorithm: Global Negative Sampling**

```python
import random

def build_global_pairs(data, negatives_per_positive):
    """
    For each positive example, sample N negative examples globally
    (from the whole pool, not just the same question).
    This creates diverse comparison pairs.
    """
    positives = [ex for ex in data if ex["orm_label"] == 1]
    negatives = [ex for ex in data if ex["orm_label"] == 0]

    pairs = []
    for pos in positives:
        sampled_negs = random.sample(negatives, k=negatives_per_positive)
        for neg in sampled_negs:
            pairs.append({
                "chosen": pos["input_text"],
                "rejected": neg["input_text"],
                "meta": {
                    "chosen_chain_id": pos["chain_id"],
                    "rejected_chain_id": neg["chain_id"],
                    "chosen_label": 1,
                    "rejected_label": 0
                }
            })
    return pairs
```

**Sampling Strategy Rationale:**
- **Train (8 neg/pos)**: maximize training signal diversity
- **Val/Test (4 neg/pos)**: balanced evaluation while maintaining diversity
- **Global sampling**: ensures the model learns general preferences, not task-specific patterns
- **Random seed**: fixed for reproducibility (seed=42)

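As a sanity check on the transformation, a toy run of the sampler (restated here so the snippet is self-contained, with hypothetical data) confirms that each positive contributes exactly `negatives_per_positive` pairs:

```python
import random

def build_global_pairs(data, negatives_per_positive):
    """Sample N global negatives per positive, as in the construction above."""
    positives = [ex for ex in data if ex["orm_label"] == 1]
    negatives = [ex for ex in data if ex["orm_label"] == 0]
    pairs = []
    for pos in positives:
        for neg in random.sample(negatives, k=negatives_per_positive):
            pairs.append({
                "chosen": pos["input_text"],
                "rejected": neg["input_text"],
                "meta": {
                    "chosen_chain_id": pos["chain_id"],
                    "rejected_chain_id": neg["chain_id"],
                    "chosen_label": 1,
                    "rejected_label": 0,
                },
            })
    return pairs

random.seed(42)  # fixed seed, as in the construction
toy = (
    [{"orm_label": 1, "input_text": f"good {i}", "chain_id": f"pos-{i}"} for i in range(3)]
    + [{"orm_label": 0, "input_text": f"bad {i}", "chain_id": f"synv1-{i}"} for i in range(10)]
)
pairs = build_global_pairs(toy, negatives_per_positive=4)
print(len(pairs))  # 3 positives x 4 negatives = 12 pairs
```

The same arithmetic explains the split sizes: pairs = positives × negatives-per-positive.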
### Phase 3: Verification

**Post-Construction Checks:**
- ✅ All chosen traces have label=1
- ✅ All rejected traces have label=0
- ✅ No duplicate pairs
- ✅ Chain IDs traceable to source
- ✅ Balanced length distribution across pairs

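The first four checks can be expressed as a short validation pass (a minimal sketch; `verify_pairs` is an illustrative helper, using the pair schema of this dataset):

```python
def verify_pairs(pairs):
    # All chosen traces must carry label 1, all rejected traces label 0
    assert all(p["meta"]["chosen_label"] == 1 for p in pairs)
    assert all(p["meta"]["rejected_label"] == 0 for p in pairs)
    # No duplicate pairs, keyed by the (chosen, rejected) chain-ID tuple
    keys = [(p["meta"]["chosen_chain_id"], p["meta"]["rejected_chain_id"])
            for p in pairs]
    assert len(keys) == len(set(keys)), "duplicate pair found"
    return True

# Tiny hypothetical example
sample = [{
    "chosen": "correct trace", "rejected": "incorrect trace",
    "meta": {"chosen_chain_id": "pos-1", "rejected_chain_id": "synv1-1",
             "chosen_label": 1, "rejected_label": 0},
}]
print(verify_pairs(sample))  # True
```
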
## 📝 Data Format

### Structure

Each example contains:

```json
{
  "chosen": "Step-by-step reasoning trace (correct)",
  "rejected": "Step-by-step reasoning trace (incorrect)",
  "meta": {
    "chosen_chain_id": "pos-xxxxx",
    "rejected_chain_id": "synv1-xxxxx",
    "chosen_label": 1,
    "rejected_label": 0
  }
}
```

### Example

```json
{
  "chosen": "1. Calculate the cost of the unicorn piñata:\n2. Calculate the total cost of the Reese's:\n3. Calculate the total cost of the Snickers:\n4. Calculate the total cost of the Skittles:\n5. Add all the costs together:\n6. Compute the total cost step by step:\n7. Check the arithmetic result:\n8. Verification step:\n\nFinal Answer: 99",
  "rejected": "1. Assume that the weather forecast always grows linearly. However, assume an unrelated constant mistakenly.\n2. Therefore doubling the time doubles the value. However, assume an unrelated constant mistakenly.\n3. Ignore seasonal variations and round the result.\n4. Conclude with the projected incorrect value.\n\nFinal Answer: 15",
  "meta": {
    "chosen_chain_id": "pos-fe119ec6-f4b1-4710-80f9-8e64ced43c7e",
    "rejected_chain_id": "synv1-1b81a660-fa66-4cd0-9606-70b694486752",
    "chosen_label": 1,
    "rejected_label": 0
  }
}
```

### Field Descriptions

| Field | Type | Description |
|-------|------|-------------|
| `chosen` | string | Correct reasoning trace (label=1) |
| `rejected` | string | Incorrect reasoning trace (label=0) |
| `meta.chosen_chain_id` | string | Unique ID for the chosen trace (traceable to source) |
| `meta.rejected_chain_id` | string | Unique ID for the rejected trace (traceable to source) |
| `meta.chosen_label` | int | Always 1 (correct) |
| `meta.rejected_label` | int | Always 0 (incorrect) |

## 💻 Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("LossFunctionLover/orm-pairwise-preference-pairs")

# Access the splits
train_data = dataset["train"]
val_data = dataset["validation"]
test_data = dataset["test"]

# Example usage
print(f"Train size: {len(train_data)}")
print(f"First example: {train_data[0]}")
```

### Training a Pairwise ORM

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Load dataset
dataset = load_dataset("LossFunctionLover/orm-pairwise-preference-pairs")

# Initialize model
base_model = AutoModel.from_pretrained("your-base-model")
tokenizer = AutoTokenizer.from_pretrained("your-base-model")

class PairwiseORM(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.encoder = base_model
        # Freeze the encoder; only the scoring head is trained
        for param in self.encoder.parameters():
            param.requires_grad = False

        # Trainable scoring head
        hidden_size = base_model.config.hidden_size
        self.score_head = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1)
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
        return self.score_head(pooled)

def pairwise_loss(chosen_score, rejected_score):
    """Logistic pairwise ranking loss: -log σ(s_chosen - s_rejected)"""
    return -torch.nn.functional.logsigmoid(chosen_score - rejected_score).mean()

def prepare_batch(examples):
    chosen_inputs = tokenizer(
        examples["chosen"],
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    rejected_inputs = tokenizer(
        examples["rejected"],
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    return chosen_inputs, rejected_inputs

# Training loop (simplified)
model = PairwiseORM(base_model)
optimizer = torch.optim.AdamW(model.score_head.parameters(), lr=1e-4)
dataloader = DataLoader(dataset["train"], batch_size=16, shuffle=True)

for batch in dataloader:
    chosen_inputs, rejected_inputs = prepare_batch(batch)

    chosen_scores = model(chosen_inputs["input_ids"], chosen_inputs["attention_mask"])
    rejected_scores = model(rejected_inputs["input_ids"], rejected_inputs["attention_mask"])

    loss = pairwise_loss(chosen_scores, rejected_scores)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

### Evaluation

```python
import numpy as np
import torch

@torch.no_grad()
def score_text(model, tokenizer, text):
    """Score a single reasoning trace with the trained ORM."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    return model(inputs["input_ids"], inputs["attention_mask"]).item()

def evaluate_pairwise_accuracy(model, tokenizer, test_data):
    """
    Compute pairwise accuracy: the fraction of pairs where
    score(chosen) > score(rejected).
    """
    margins = []
    for example in test_data:
        chosen_score = score_text(model, tokenizer, example["chosen"])
        rejected_score = score_text(model, tokenizer, example["rejected"])
        margins.append(chosen_score - rejected_score)

    margins = np.array(margins)
    return {
        "accuracy": float((margins > 0).mean()),
        "mean_margin": float(margins.mean()),
        "median_margin": float(np.median(margins)),
        "std_margin": float(margins.std()),
    }
```

## 📊 Dataset Analysis

### Length Distribution

| Split | Avg Chosen Length | Avg Rejected Length |
|-------|-------------------|---------------------|
| Train | ~180 tokens | ~175 tokens |
| Validation | ~178 tokens | ~173 tokens |
| Test | ~182 tokens | ~176 tokens |

**Note**: Lengths are balanced to prevent length bias in learning.

### Error Pattern Distribution (Rejected Traces)

Common error types in rejected traces:
- Incorrect arithmetic calculations
- Logical fallacies in reasoning steps
- Missing or redundant steps
- Incorrect application of formulas
- Rounding errors
- Misinterpretation of problem constraints

### Chain ID Traceability

All examples include `chain_id` fields linking back to the source pointwise dataset:
- `pos-*`: positive (correct) reasoning traces
- `synv1-*`: synthetic negative traces from ORM-Repair-Synth-V1

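As an illustration, the ID families can be tallied by prefix (a small sketch over hypothetical `meta` records; `id_prefix` is an illustrative helper):

```python
from collections import Counter

def id_prefix(chain_id):
    # "pos-fe119ec6-..." -> "pos", "synv1-1b81a660-..." -> "synv1"
    return chain_id.split("-", 1)[0]

# Hypothetical metadata records showing the two ID families
metas = [
    {"chosen_chain_id": "pos-aaa", "rejected_chain_id": "synv1-bbb"},
    {"chosen_chain_id": "pos-ccc", "rejected_chain_id": "synv1-ddd"},
]
chosen_sources = Counter(id_prefix(m["chosen_chain_id"]) for m in metas)
rejected_sources = Counter(id_prefix(m["rejected_chain_id"]) for m in metas)
```
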
## 🔬 Experimental Results

Models trained on this dataset achieve:

- **Pairwise Accuracy**: 96.3% (CI [95.3%, 97.1%])
- **Training Stability**: converges in 800 steps
- **Anti-symmetry**: -0.998 correlation on the label-swap test
- **Length Robustness**: 95.5%-99.7% accuracy across token-length ranges

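The anti-symmetry figure comes from a label-swap test: score every pair in both orders and correlate the resulting margins; a faithful scorer gives a correlation near -1. A toy illustration (using trace length as a hypothetical stand-in scorer, not the trained ORM):

```python
import numpy as np

def margin(score, pair):
    return score(pair["chosen"]) - score(pair["rejected"])

def swapped(pair):
    return {"chosen": pair["rejected"], "rejected": pair["chosen"]}

# With a deterministic scorer, the swapped margin is exactly the negation,
# so the correlation between original and swapped margins is -1.
score = len  # toy stand-in scorer: trace length
pairs = [{"chosen": "a" * i, "rejected": "b" * (10 - i)} for i in range(1, 6)]
m = np.array([margin(score, p) for p in pairs])
m_swapped = np.array([margin(score, swapped(p)) for p in pairs])
corr = np.corrcoef(m, m_swapped)[0, 1]
print(round(corr, 3))  # -1.0
```

A trained model scores each order independently, so a correlation of -0.998 rather than exactly -1 reflects small numerical and batching effects.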
See the [Model Card](https://huggingface.co/LossFunctionLover/pairwise-orm-model) for full evaluation details.

## 🔗 Related Resources

- 📄 **Paper**: [ArXiv](link-to-arxiv) - "An Empirical Study of Robust Preference Learning under Minimal Supervision"
- 🤖 **Trained Model**: [HuggingFace](https://huggingface.co/LossFunctionLover/pairwise-orm-model)
- 💻 **Training Code**: [GitHub](https://github.com/Coder-12)
- 📊 **Source Pointwise Dataset**: available upon request

## 📧 Contact & Citation

**Author**: Aklesh Mishra  
**Email**: akleshmishra7@gmail.com

If you use this dataset, please cite:

```bibtex
@misc{mishra2025orm-dataset,
  author       = {Mishra, Aklesh},
  title        = {ORM Pairwise Preference Pairs: A Curated Dataset for Training Outcome Reward Models},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/LossFunctionLover/orm-pairwise-preference-pairs}}
}

@article{mishra2025orm,
  title   = {An Empirical Study of Robust Preference Learning under Minimal Supervision},
  author  = {Mishra, Aklesh},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}
```

## 📝 License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.

You are free to:
- **Share**: copy and redistribute the material
- **Adapt**: remix, transform, and build upon the material

Under the following terms:
- **Attribution**: give appropriate credit

## 🙏 Acknowledgments

This dataset represents **weeks of dedicated curation work** in preference learning for agentic reasoning. Special thanks to:

- The **MagiCore-Agentic** project for inspiring robust multi-step reasoning research
- The ML community for foundational work in reward modeling and RLHF
- All contributors to the open-source ecosystem

## ⚠️ Limitations & Considerations

### Known Limitations

1. **Domain**: primarily math/reasoning tasks; may not generalize to all domains
2. **Synthetic negatives**: some rejected traces are synthetically generated from error templates
3. **English only**: all reasoning traces are in English
4. **Length range**: optimized for traces up to 512 tokens

### Ethical Considerations

- This dataset is designed for research on improving AI reasoning
- Models trained on this data should not be used as sole arbiters of correctness in high-stakes decisions
- Users should validate model outputs independently in production settings

### Future Work

- [ ] Expand to multi-domain reasoning (code, science, etc.)
- [ ] Include multi-turn reasoning dialogues
- [ ] Add fine-grained error annotations
- [ ] Create multilingual versions

---

**Dataset Version**: 1.0  
**Last Updated**: November 27, 2025  
**Status**: ✅ Stable