---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- transformers
- reasoning
- reinforcement-learning
- rlvr
- math
- miner
- qwen3
- causal-lm
model-index:
- name: Miner-8B
  results: []
datasets:
- agentica-org/DeepScaleR-Preview-Dataset
base_model:
- Qwen/Qwen3-8B-Base
---

# Miner-8B

This repository hosts the Hugging Face Transformers checkpoint for **MINER**: *Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models*.

- Paper: https://arxiv.org/pdf/2601.04731
- Code: https://github.com/pixas/Miner

## Model Description

Miner-8B is a reasoning model trained with **MINER**, a reinforcement learning method designed to improve data efficiency for large reasoning models. MINER targets the inefficiency of critic-free RL methods on positive homogeneous prompts, where all sampled rollouts are correct and standard relative-advantage training provides little or no learning signal. Instead, MINER leverages the policy’s intrinsic uncertainty as a self-supervised reward signal, without requiring auxiliary reward models or additional inference-time overhead.

The MINER framework introduces two central ideas:
1. **Token-level focal credit assignment**, which amplifies learning on uncertain and critical tokens while suppressing overconfident ones.
2. **Adaptive advantage calibration**, which integrates intrinsic and verifiable rewards in a stable way.

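As an illustrative sketch only (this is not the released implementation, and the names `gamma` and `alpha` are hypothetical), the two ideas can be written as:

```python
def focal_token_weights(token_probs, gamma=2.0):
    # Token-level focal credit assignment (sketch): uncertain tokens
    # (low probability under the policy) keep weights near 1, while
    # overconfident tokens are suppressed toward 0 before the update.
    return [(1.0 - p) ** gamma for p in token_probs]

def calibrated_advantage(intrinsic_adv, verifiable_adv, alpha=0.5):
    # Adaptive advantage calibration (sketch): blend the intrinsic,
    # uncertainty-based advantage with the verifiable-reward advantage.
    return alpha * intrinsic_adv + (1.0 - alpha) * verifiable_adv

# Toy example: an overconfident token contributes far less than an uncertain one.
weights = focal_token_weights([0.95, 0.50, 0.10])
adv = calibrated_advantage(intrinsic_adv=0.8, verifiable_adv=0.0)
```

The intended effect is that prompts where every rollout is already correct (and verifiable rewards are uninformative) still produce a learning signal from the uncertain tokens.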
According to the paper, MINER is evaluated on six reasoning benchmarks using Qwen3-8B-Base, and reports stronger sample efficiency and accuracy than several baseline methods, including GRPO variants.

## Intended Use

This model is intended for **research and experimental use** in:
- reasoning and problem solving
- reinforcement learning for language models
- mathematical and verifiable reasoning tasks
- post-training and evaluation of large reasoning models

Potential use cases include:
- academic research on RL for reasoning models
- evaluation on reasoning benchmarks
- ablation and reproduction studies based on the MINER framework
- further finetuning or post-training from this checkpoint

## How to Use

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "pixas/Miner-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Build a chat-formatted prompt and tokenize it.
prompt = [{"role": "user", "content": "What is 2+3?"}]
inputs = tokenizer(
    tokenizer.apply_chat_template(prompt, add_generation_prompt=True, tokenize=False),
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### vLLM

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "pixas/Miner-8B"

llm = LLM(model=model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=8192,
)

# Apply the chat template, then generate with vLLM.
prompt = [{"role": "user", "content": "What is 2+3?"}]
inputs = tokenizer.apply_chat_template(prompt, add_generation_prompt=True, tokenize=False)
outputs = llm.generate(inputs, sampling_params)

print(outputs[0].outputs[0].text)
```

## Limitations

This model is a research checkpoint and may have several limitations:

* It may produce incorrect, incomplete, or overconfident reasoning outputs.
* Performance may depend heavily on prompt format and decoding setup.
* Results reported in the paper may not transfer exactly to this released checkpoint unless the same base model, data mixture, and evaluation pipeline are used.
* The model is not intended as a substitute for expert judgment in high-stakes domains.

## Bias, Risks, and Safety

Like other large language models, this model may reflect biases present in its training data and may generate harmful, misleading, or factually incorrect outputs. Additional care is required before deployment in user-facing or safety-critical applications.

## Citation

If you use this model, please cite:

```bibtex
@article{jiang2026miner,
  title={Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models},
  author={Jiang, Shuyang and Wang, Yuhao and Zhang, Ya and Wang, Yanfeng and Wang, Yu},
  journal={arXiv preprint arXiv:2601.04731},
  year={2026}
}
```

## Acknowledgements

This model card is based on the official MINER paper and code repository:

* Paper: [https://arxiv.org/pdf/2601.04731](https://arxiv.org/pdf/2601.04731)
* Code: [https://github.com/pixas/Miner](https://github.com/pixas/Miner)