nielsr (HF Staff) committed
Commit 980775f · verified · 1 parent: 262bf99

Improve model card: add metadata, links, highlights, and usage example

This PR significantly improves the model card by:
- Adding the `pipeline_tag: text-generation` for better discoverability and functionality on the Hugging Face Hub.
- Specifying `library_name: transformers` to ensure correct library usage and enable the "how to use" widget.
- Including a direct link to the GitHub repository (https://github.com/apple/ml-reversal-blessing) for the code.
- Integrating "Highlights" and "Key Findings" from the original GitHub README for a better overview of the paper's contributions.
- Providing a clear "Usage" section with a Python code example using the `transformers` library.
- Updating image links to point to the raw content on GitHub.
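In front-matter terms, the first two bullets amount to extending the YAML block at the top of README.md (field values taken from this PR's diff):

```yaml
---
license: apple-amlr
pipeline_tag: text-generation
library_name: transformers
---
```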

Files changed (1)
  1. README.md +123 -5
README.md CHANGED
@@ -1,25 +1,108 @@
  ---
  license: apple-amlr
  ---

  <h1 align="center"> Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions </h1>

  <p align="center">
- <a href="https://arxiv.org/abs/2502.18435">📃 Paper</a>

  <a href="https://machinelearning.apple.com" >📝 Blog</a>

  </p>

- This model card accompanies the research paper, [Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions](https://arxiv.org/abs/2502.18435).

- Here we release 4 models' checkpoints trained with simulation data described in out paper Section 4.
- Please follow our github README to download and evaluate these models.

  <div align="center">

  ### Results of the Controlled Simulation Study of 4-Digits Multiplication

- | || **Forward X** | || **Reverse X** | |
  |:--|:--:|:--:|:--:|:--:|:--:|:--:|
  | | **L2R** | **R2L(m,n)** | **R2L(m)** | **R2L** | **L2R(m,n)** | **L2R(n)** |
  | **Test Accuracy (%)** | **99.81±0.15** | 59.71±1.99 | 60.93±0.88 | **100±0** | 97.82±0.35 | 99.85±0.10 |
@@ -30,3 +113,38 @@ Please follow our github README to download and evaluate these models.
  | **Training loss** | **0.86** | 0.94 | 0.94 | **0.86** | 0.94 | 0.94 |

  </div>
README.md (after):

---
license: apple-amlr
pipeline_tag: text-generation
library_name: transformers
---

<h1 align="center"> Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions </h1>

<p align="center">
<a href="https://huggingface.co/papers/2502.18435">📃 Paper (Hugging Face)</a>

<a href="https://arxiv.org/abs/2502.18435">📃 Paper (arXiv)</a>

<a href="https://machinelearning.apple.com" >📝 Blog</a>

<a href="https://github.com/apple/ml-reversal-blessing">💻 Code</a>
</p>

This model card accompanies the research paper, [Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions](https://huggingface.co/papers/2502.18435).

Here we release checkpoints for 4 models trained on the simulation data described in Section 4 of our paper.
Please follow our GitHub README to download and evaluate these models.

## 🌟 Highlights

### 💡 Key Concept
**Reversal Blessing** demonstrates that right-to-left (R2L) factorization can outperform the traditional left-to-right (L2R) factorization on specific MCQ reasoning tasks. We introduce "reverse thinking": evaluating answer choices based on their likelihood of generating the question.
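The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the repository's evaluation code: `logprob_fn` is a hypothetical stand-in for a trained R2L language model, and the token-reversal convention and length normalization here are expository assumptions.

```python
def reverse_score(logprob_fn, question, choice):
    """Length-normalized log P(question | choice) under an R2L model.

    An R2L model reads right-to-left, so the sequence is presented reversed:
    the candidate answer serves as context, and the question tokens are then
    scored one at a time. `logprob_fn(context, token)` stands in for a trained
    R2L LM returning log P(token | context).
    """
    context = list(reversed(choice))
    total = 0.0
    for tok in reversed(question):
        total += logprob_fn(tuple(context), tok)
        context.append(tok)
    return total / max(len(question), 1)


def pick_answer(logprob_fn, question, choices):
    """Reverse thinking: pick the choice most likely to generate the question."""
    return max(choices, key=lambda c: reverse_score(logprob_fn, question, c))
```

With a real checkpoint, `logprob_fn` would be computed from the model's token-level log-probabilities; the paper's exact scoring and normalization may differ.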
<div align="center">
<img src="https://github.com/apple/ml-reversal-blessing/raw/main/figures/figure2.png" width="95%" alt="Comparison of forward vs. reverse thinking in MCQs">
</div>

### 🔍 Key Findings
- **Theoretical Insights**: We analyze three key factors: calibration, computability, and conditional entropy.
- **Empirical Evidence**: Lower conditional entropy correlates with higher task accuracy in both L2R and R2L models.
- **Consistent Performance**: R2L models outperform L2R models on several MCQ benchmarks.
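The conditional-entropy finding can be made concrete: one way to estimate a model's conditional entropy is to average the next-token entropy H(X_t | X_<t) over sampled prefixes. A hedged sketch follows, where `prob_fn` is a hypothetical stand-in for a trained LM's next-token distribution, not an API of the released models.

```python
import math


def mean_conditional_entropy(prob_fn, sequences, vocab):
    """Average next-token entropy H(X_t | X_<t) over a sample of sequences.

    `prob_fn(context, token)` returns P(token | context) -- a stand-in for a
    trained L2R or R2L language model's next-token distribution.
    """
    total, count = 0.0, 0
    for seq in sequences:
        for t in range(len(seq)):
            ctx = tuple(seq[:t])
            h = 0.0
            for tok in vocab:
                p = prob_fn(ctx, tok)
                if p > 0:
                    h -= p * math.log(p)  # -sum p log p, in nats
            total += h
            count += 1
    return total / count
```

Comparing this quantity between an L2R and an R2L model of the same data is the kind of diagnostic the finding refers to: lower values should track higher MCQ accuracy.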
## Usage

You can load the model with the `transformers` library and perform text generation. Make sure you have `transformers` installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "apple/ml-reversal-blessing"  # or the specific sub-model checkpoint you want
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example of text generation (e.g., answering a question)
prompt = "Which of these is a fruit? (A) Carrot (B) Apple (C) Potato (D) Onion\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a response
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=20,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# For detailed usage, especially the L2R/R2L inference described in the paper,
# refer to the official GitHub repository's `simulation/run.sh` script and examples.
```
## 📊 Results

<div align="center">
<img src="https://github.com/apple/ml-reversal-blessing/raw/main/figures/sim.png" width="45%" alt="Correlation between conditional entropy and task accuracy">
</div>

<div align="center">

### Comparing L2R and R2L on MCQs

| | **DCLM-2B** ||| **EDU-2B** ||| **EDU-8B** ||| **HF-2B** |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| | **L2R** | **R2L** | **% Change** | **L2R** | **R2L** | **% Change** | **L2R** | **R2L** | **% Change** | **L2R** |
| **Training loss** | **2.668** | 2.724 | +2.10 | **2.345** | 2.396 | +2.17 | **2.087** | 2.138 | +2.44 | - |
| **LogiQA** | 30.57 | **31.64** | +3.52 | 27.96 | **31.49** | +12.64 | 29.95 | **31.03** | +3.61 | - |
| **OpenbookQA** | 36.00 | **38.40** | +6.67 | 42.40 | **44.40** | +4.72 | 45.00 | **48.40** | +7.56 | 41.04 |
| **TruthfulQA** | 19.82 | **29.99** | +51.23 | 24.36 | **28.76** | +18.09 | 24.97 | **31.70** | +26.95 | - |
| **CommonsenseQA** | 42.83 | **45.29** | +5.74 | 42.92 | **45.13** | +5.15 | 39.15 | **44.96** | +14.84 | 36.60 |
| **Social IQA** | **41.56** | 40.94 | -1.48 | **42.78** | 42.22 | -1.32 | **44.58** | 43.50 | -2.42 | 40.52 |
| **ARC** | **54.11** | 43.88 | -18.91 | **60.65** | 52.31 | -13.75 | **68.29** | 56.22 | -17.67 | 57.47 |
| **HellaSwag** | **60.87** | 45.89 | -24.62 | **60.57** | 42.22 | -26.78 | **71.60** | 49.22 | -31.26 | 59.34 |
| **MathQA** | **26.50** | 22.21 | -16.18 | **26.80** | 24.86 | -7.25 | **28.77** | 25.33 | -11.96 | - |
| **MMLU** | **31.66** | 31.31 | -1.10 | **34.57** | 34.35 | -0.62 | **38.90** | 37.11 | -4.60 | 37.35 |
| **PIQA** | **74.43** | 58.05 | -22.00 | **74.48** | 57.13 | -23.30 | **77.80** | 59.14 | -23.98 | 76.70 |
| **Winogrande** | **61.01** | 53.51 | -12.29 | **60.93** | 54.85 | -9.97 | **65.75** | 54.70 | -16.81 | 57.54 |

</div>

*Note: All models are trained on 350B non-repeating tokens. The HF-2B baseline is from Penedo et al. (2024). The EDU-2B, EDU-8B, and HF-2B models are trained on the same 350B-token FineWeb-EDU dataset. Positive % Change values (R2L wins) appear on LogiQA, OpenbookQA, TruthfulQA, and CommonsenseQA; negative values (R2L loses) appear on the remaining benchmarks.*

<div align="center">

### Results of the Controlled Simulation Study of 4-Digits Multiplication

| | **Forward X** || | **Reverse X** || |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|
| | **L2R** | **R2L(m,n)** | **R2L(m)** | **R2L** | **L2R(m,n)** | **L2R(n)** |
| **Test Accuracy (%)** | **99.81±0.15** | 59.71±1.99 | 60.93±0.88 | **100±0** | 97.82±0.35 | 99.85±0.10 |
| **Training loss** | **0.86** | 0.94 | 0.94 | **0.86** | 0.94 | 0.94 |

</div>

*Note: Theoretical Conditional Entropy (Theo. Cond. Ent.) denotes the expected conditional entropy under an ideal model. L2R consistently outperforms R2L on Forward X, while R2L is superior on Reverse X; lower conditional entropy correlates with higher accuracy.*

## 🚀 Getting Started

### 1. Installation
```bash
pip install -r requirement.txt
```

### 2. Prepare Checkpoints
```bash
python simulation/download_model.py
```

### 3. Run the Model
```bash
# First add your Hugging Face API token
export HF_TOKEN=your_huggingface_token
bash simulation/run.sh multiplication l2r
```

## 📚 Citation

If you find this work useful, please cite our paper:

```bibtex
@article{zhang2025reversal,
  title={Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions},
  author={Zhang, Yizhe and Bai, Richard and Gu, Zijin and Zhang, Ruixiang and Gu, Jiatao and Abbe, Emmanuel and Bengio, Samy and Jaitly, Navdeep},
  journal={arXiv preprint arXiv:2502.18435},
  year={2025},
  url={https://arxiv.org/abs/2502.18435}
}
```