whw06 committed on
Commit
be2fcd9
·
verified ·
1 Parent(s): bdd929f

Update README.md

Files changed (1)
  1. README.md +252 -1
README.md CHANGED
@@ -1,3 +1,254 @@
1
  ---
2
- license: mit
3
  ---
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ pipeline_tag: text-generation
7
+ tags:
8
+ - mira
9
+ - mid-training
10
+ - data-selection
11
+ - rubric-scorer
12
+ - source-aware
13
+ - moe
14
+ - qwen3
15
+ base_model: Qwen/Qwen3.5-35B-A3B-Base
16
  ---
17
+
18
+ # MIRA-Text-Group3
19
+
20
+ A student scorer from **MIRA** (Mid-training Rubric Anchoring for Source-Aware Data Selection), fine-tuned to score **code-task documentation (PR / issue / wiki)** along a group-specific set of anchor rubric dimensions.
21
+
22
+ > 📄 **Paper**: *MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection* (EMNLP 2026)
23
+ > 💻 **Code**: https://github.com/Multilingual-Multimodal-NLP/mira
24
+
25
+ ---
26
+
27
+ ## TL;DR
28
+
29
+ MIRA is a source-aware data selection framework for heterogeneous **mid-training** corpora. Instead of applying a single global quality rubric, MIRA (1) clusters sources into capability-coherent groups, (2) lets a frontier teacher (Kimi-K2.6) freely propose rubric dimensions and *anchors* them per group, (3) distills the anchored teacher into a lightweight **per-group student scorer**, and (4) applies reliability-aware aggregation with per-source retention thresholds.
30
+
31
+ **This repository is one of those student scorers**: variant **3** in the **Text** family, specialized for **code-task documentation (PR / issue / wiki)**. Given an in-distribution record, it produces a numerical score and a short rationale for every anchor dimension in this group's rubric.
32
+
33
+ ---
34
+
35
+ ## Model summary
36
+
37
+ | | |
38
+ |---|---|
39
+ | **Architecture** | Mixture-of-Experts decoder (35B total / ≈3B active params) |
40
+ | **Base model** | [Qwen3.5-35B-A3B-Base](https://huggingface.co/Qwen) |
41
+ | **Fine-tuning** | Full-parameter SFT on Kimi-K2.6 anchored teacher labels |
42
+ | **Domain** | Long code-related documents covering pull-requests, issues, repo wikis, Stack-Overflow notebooks, and templated bug-fix / file-localization / test-generation instructions. Strongest intra-group similarity: `ct_fixbug ↔ ct_unit_generation = 0.949`. |
43
+ | **Anchor rubric** | 15 group-specific dimensions (`group_D_dim_anchors.jsonl` in the project repo) |
44
+ | **Source count** | 6 text sources |
45
+ | **Output** | Structured (score, rationale) per anchor dimension |
46
+ | **Precision** | BF16 |
47
+ | **License** | Apache-2.0 (inherits from Qwen3) |
48
+
49
+ ---
50
+
51
+ ## Sources covered
52
+
53
+ This scorer is calibrated for the following mid-training sources in the **Text / Code-task documentation** group:
54
+
55
+ | Source | Description |
56
+ |---|---|
57
+ | `pr_issue` | PR / Issue learning notes |
58
+ | `deepwiki` | Repository wiki documentation (0420 refresh) |
59
+ | `stackoverflow_notebook` | Stack-Overflow-style notebooks |
60
+ | `ct_file_loc` | GitHub problem → file-localization template |
61
+ | `ct_fixbug` | Bug-solving instruction template |
62
+ | `ct_unit_generation` | Unit-test generation template |
63
+
64
+ The full source-grouping report (KMeans k=4 / 5 clusters, intra-group cosine similarities) is in the [project repo](https://github.com/Multilingual-Multimodal-NLP/mira).
65
+
66
+ ---
67
+
68
+ ## Anchor dimensions (15 slots)
69
+
70
+ The scoring rubric for this group was discovered via Kimi-K2.6 free-form judging and clustered into 15 anchor dimensions (KMeans k=15 over the group's dim-score embeddings). Dimensions below are sorted by cluster size; larger clusters dominate the corpus and carry more signal. Anchor names are read verbatim from this group's `group_D_dim_anchors.jsonl`; **some names recur across slots** because semantically related but distinct rubric facets landed in separate clusters.
71
+
72
+ | Slot | Dimension | Cluster size |
73
+ |---|---|---:|
74
+ | **A1** | Practical Actionability | 83,742 |
75
+ | **A2** | Practical Actionability | 76,628 |
76
+ | **A3** | Analytical Depth | 76,326 |
77
+ | **A4** | Pedagogical Clarity | 69,324 |
78
+ | **A5** | Reasoning Transparency | 61,419 |
79
+ | **A6** | Signal-to-Noise Ratio | 60,615 |
80
+ | **A7** | Repository Tree Navigation | 55,411 |
81
+ | **A8** | Document Structure & Formatting | 47,952 |
82
+ | **A9** | Practical Actionability | 42,272 |
83
+ | **A10** | Signal-to-Noise Ratio | 40,637 |
84
+ | **A11** | Training Utility | 36,132 |
85
+ | **A12** | Code Snippet Fidelity | 32,340 |
86
+ | **A13** | Format Compliance (SEARCH/REPLACE) | 23,085 |
87
+ | **A14** | Safety & Harmlessness | 17,185 |
88
+ | **A15** | Output Format Adherence | 12,529 |
89
+
90
+ The scorer outputs one `[Ai] <dimension>: <score>/10 — <rationale>` line per slot, plus `overall`, `training_recommendation`, `domain_tag`, and `brief`.
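
Because several anchor names repeat across slots, downstream code is probably safer keying scores by slot id (`A1` to `A15`) than by dimension name. The sketch below copies the slot table above into a dictionary and lists the names that recur; `ANCHOR_SLOTS` and `slots_by_name` are illustrative names, not part of the project repo.

```python
from collections import defaultdict

# Slot -> dimension name, copied from the table above.
ANCHOR_SLOTS = {
    "A1": "Practical Actionability",
    "A2": "Practical Actionability",
    "A3": "Analytical Depth",
    "A4": "Pedagogical Clarity",
    "A5": "Reasoning Transparency",
    "A6": "Signal-to-Noise Ratio",
    "A7": "Repository Tree Navigation",
    "A8": "Document Structure & Formatting",
    "A9": "Practical Actionability",
    "A10": "Signal-to-Noise Ratio",
    "A11": "Training Utility",
    "A12": "Code Snippet Fidelity",
    "A13": "Format Compliance (SEARCH/REPLACE)",
    "A14": "Safety & Harmlessness",
    "A15": "Output Format Adherence",
}


def slots_by_name(slots):
    """Group slot ids under their dimension name, preserving table order."""
    grouped = defaultdict(list)
    for slot, name in slots.items():
        grouped[name].append(slot)
    return dict(grouped)


# Names that map to more than one slot would collide in a name-keyed dict.
recurring = {
    name: ids for name, ids in slots_by_name(ANCHOR_SLOTS).items() if len(ids) > 1
}
print(recurring)
# {'Practical Actionability': ['A1', 'A2', 'A9'], 'Signal-to-Noise Ratio': ['A6', 'A10']}
```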
91
+
92
+ ---
93
+
94
+ ## Where this model fits in the MIRA pipeline
95
+
96
+ ```
97
+ ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
+ │ 1. Rubric        │   │ 2. Anchored      │   │ 3. Reliability   │   │ 4. Data          │
+ │    Discovery     │ → │    Judge         │ → │    Aggregation   │ → │    Selection     │
+ │ (Kimi-K2.6,      │   │    Distillation  │   │ (mask unreliable │   │ (per-source      │
+ │  free-form       │   │ ◀── THIS MODEL   │   │  src×dim cells)  │   │  retention)      │
+ │  judging)        │   │                  │   │                  │   │                  │
+ └──────────────────┘   └──────────────────┘   └──────────────────┘   └──────────────────┘
104
+ ```
105
+
106
+ `MIRA-Text-Group3` lives in Stage 2: it scores the full **Text / Code-task documentation** corpus so that downstream stages can apply reliability masking and source-aware retention.
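
To make the hand-off to Stages 3 and 4 concrete, here is a minimal sketch of how per-slot scores from this model might be consumed. It assumes a plain mean over reliable slots and a fixed per-source threshold. The mask, thresholds, records, and the `aggregate` / `select` helpers are all hypothetical, and the paper's actual aggregation rule may differ.

```python
from statistics import mean


def aggregate(scores, reliable_slots):
    """Mean score over the slots that pass the reliability mask (assumed rule)."""
    kept = [score for slot, score in scores.items() if slot in reliable_slots]
    return mean(kept) if kept else None


def select(records, reliability_mask, thresholds):
    """Keep record ids whose masked aggregate meets their source's threshold."""
    selected = []
    for rec in records:
        source = rec["source"]
        agg = aggregate(rec["scores"], reliability_mask.get(source, set()))
        if agg is not None and agg >= thresholds[source]:
            selected.append(rec["id"])
    return selected


# Hypothetical inputs: three scored records, a src x dim mask, and thresholds.
records = [
    {"id": "r1", "source": "pr_issue", "scores": {"A1": 8, "A3": 6, "A14": 2}},
    {"id": "r2", "source": "pr_issue", "scores": {"A1": 4, "A3": 5, "A14": 9}},
    {"id": "r3", "source": "ct_fixbug", "scores": {"A1": 7, "A3": 7, "A14": 1}},
]
reliability_mask = {
    "pr_issue": {"A1", "A3"},          # A14 treated as unreliable for this source
    "ct_fixbug": {"A1", "A3", "A14"},
}
thresholds = {"pr_issue": 6.0, "ct_fixbug": 6.0}

kept_ids = select(records, reliability_mask, thresholds)
print(kept_ids)  # ['r1']
```

Masking per source rather than globally means a slot that is noisy for one source can still contribute to another, which is the point of the src×dim cells in the diagram.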
107
+
108
+ ---
109
+
110
+ ## Intended use
111
+
112
+ - **Primary**: Score code-task documentation (PR / issue / wiki) on this group's anchor dimensions to drive source-aware data selection and filtering.
113
+ - **Secondary**: Research on rubric distillation, semantic quality scoring, and reliability diagnostics for heterogeneous training corpora.
114
+
115
+ **Not intended for**:
116
+ - General-purpose chat or instruction following: the model is fine-tuned to emit structured scores, not freeform dialogue.
117
+ - Single-shot quality judgments without the anchor-dimension prompt template: outputs will be miscalibrated.
118
+ - Records outside the **Text / Code-task documentation** group; use the matching sibling scorer instead.
119
+
120
+ ---
121
+
122
+ ## Deployment
123
+
124
+ The scorer is designed to be served via **vLLM** behind an OpenAI-compatible endpoint and called in batch from the MIRA scoring pipeline.
125
+
126
+ ### 1. Serve with vLLM (recommended)
127
+
128
+ ```bash
129
+ vllm serve whw06/MIRA-Text-Group3 \
+   --tensor-parallel-size 8 \
+   --dtype bfloat16 \
+   --max-model-len 65536 \
+   --max-num-batched-tokens 131072 \
+   --gpu-memory-utilization 0.9 \
+   --trust-remote-code \
+   --port 8000
137
+ ```
138
+
139
+ **Why these values** (verified on H200 141GB during the paper's per-source evaluation):
140
+ - `max-model-len=65536`: 2× the mid-training cutoff. Records can reach ~60K tokens for densely tokenized sources, and a 40K limit runs into prompt-overflow errors.
141
+ - `max-num-batched-tokens=131072`: supports two full-length sequences per scheduling step.
142
+ - `gpu-memory-utilization=0.9`: 35B BF16 weights take ~70GB, leaving ~57GB for the KV cache, or roughly 4 concurrent 65K-context sequences per GPU.
143
+ - 8-way tensor parallelism works well for the 35B MoE on a single 8×H200/A100 node.
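
The budget in these bullets can be reproduced with a few lines of arithmetic. This is a rough sketch using only the figures quoted above. Like the bullet, it treats the full ~70GB of weights as resident on one GPU, and the per-sequence figure is derived rather than measured.

```python
# Figures quoted in this README; nothing here is measured.
GPU_MEMORY_GB = 141            # H200
GPU_MEMORY_UTILIZATION = 0.9   # --gpu-memory-utilization
WEIGHTS_GB = 70                # ~35B parameters x 2 bytes (BF16)
MAX_MODEL_LEN = 65536          # --max-model-len
MAX_NUM_BATCHED_TOKENS = 131072
CONCURRENT_SEQS = 4            # "roughly 4" from the bullet above


def kv_cache_budget_gb(gpu_gb, utilization, weights_gb):
    """Memory left for the KV cache once the weights are loaded."""
    return gpu_gb * utilization - weights_gb


kv_gb = kv_cache_budget_gb(GPU_MEMORY_GB, GPU_MEMORY_UTILIZATION, WEIGHTS_GB)
per_seq_gb = kv_gb / CONCURRENT_SEQS
full_seqs_per_step = MAX_NUM_BATCHED_TOKENS // MAX_MODEL_LEN

print(f"KV-cache budget: {kv_gb:.1f} GB")                       # 56.9 GB
print(f"Implied budget per 65K sequence: {per_seq_gb:.1f} GB")  # 14.2 GB
print(f"Full-length sequences per step: {full_seqs_per_step}")  # 2
```

With 8-way tensor parallelism the weights are sharded across GPUs, so the real per-GPU headroom is likely larger than this single-GPU estimate.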
144
+
145
+ ### 2. Call from Python
146
+
147
+ ```python
148
+ from openai import OpenAI
+
+ # Placeholders: build the real prompts with the project repo's prompt builder
+ # (scoring/score_text_anchored.py, see "Prompt template" below).
+ SYSTEM_PROMPT = "..."  # group-D anchor calibration
+ USER_PROMPT = "..."    # record + [A1]..[A15] template
+
+ # Point the client at the local vLLM server started in step 1.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ resp = client.chat.completions.create(
+     model="whw06/MIRA-Text-Group3",
+     messages=[
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": USER_PROMPT},
+     ],
+     temperature=0.7,
+     top_p=0.95,
+     max_tokens=2048,
+ )
+ print(resp.choices[0].message.content)
163
+ ```
164
+
165
+ ### 3. Prompt template
166
+
167
+ The user message asks for one structured line per anchor dimension (top-15 of this group):
168
+
169
+ ```
170
+ [A1] {anchor_dim_1}: <score>/10 — <justification>
+ [A2] {anchor_dim_2}: <score>/10 — <justification>
+ ...
+ [A15] {anchor_dim_15}: <score>/10 — <justification>
+ overall: <0-100>
+ training_recommendation: <keep | downsample | drop>
+ domain_tag: <short tag>
+ brief: <one-sentence summary>
178
+ ```
179
+
180
+ The system prompt embeds the **top-12 anchor calibration references** (canonical examples from clustering) so the student matches the teacher's scoring scale. The full prompt builder, anchor JSONL files, and output parser are in the project repo's `scoring/score_text_anchored.py`.
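
For readers who want to inspect raw completions without the project tooling, the following is a minimal sketch of a parser for the line format above. It is not the parser from `scoring/score_text_anchored.py`. The regular expressions, the `parse_scorer_output` name, and the sample completion are illustrative assumptions, and the separator class tolerates an em dash, en dash, or hyphen in case the model drifts.

```python
import re

# "[A7] <dimension>: <score>/10 <dash> <rationale>"
SLOT_LINE = re.compile(
    r"^\[A(?P<slot>\d+)\]\s*(?P<dimension>.+?):\s*"
    r"(?P<score>\d+(?:\.\d+)?)/10\s*[\u2014\u2013-]\s*(?P<rationale>.*)$"
)
# Trailing summary fields listed in the template.
FIELD_LINE = re.compile(
    r"^(?P<key>overall|training_recommendation|domain_tag|brief):\s*(?P<value>.*)$"
)


def parse_scorer_output(text):
    """Split a completion into per-slot scores and trailing summary fields."""
    slots, fields = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        slot_match = SLOT_LINE.match(line)
        if slot_match:
            slots["A" + slot_match.group("slot")] = {
                "dimension": slot_match.group("dimension").strip(),
                "score": float(slot_match.group("score")),
                "rationale": slot_match.group("rationale").strip(),
            }
            continue
        field_match = FIELD_LINE.match(line)
        if field_match:
            fields[field_match.group("key")] = field_match.group("value").strip()
    return {"slots": slots, "fields": fields}


# Made-up completion fragment, for illustration only.
sample = "\n".join([
    "[A1] Practical Actionability: 8/10 \u2014 concrete repro steps",
    "[A13] Format Compliance (SEARCH/REPLACE): 6/10 - minor drift",
    "overall: 74",
    "training_recommendation: keep",
])
parsed = parse_scorer_output(sample)
print(parsed["slots"]["A1"]["score"], parsed["fields"]["overall"])  # 8.0 74
```

Lines that match neither pattern are skipped silently; a production parser would probably want to count them so that malformed completions can be re-queued.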
181
+
182
+ ---
183
+
184
+ ## Training details
185
+
186
+ | | |
187
+ |---|---|
188
+ | **Teacher** | Kimi-K2.6 (free-form rubric discovery in Phase 1; anchored re-scoring in Phase 2) |
189
+ | **Training data** | Kimi-K2.6 anchored labels on this group's Phase-2 corpus, split into a distillation set + a held-out validation split for reliability diagnostics |
190
+ | **Loss** | Standard next-token CE over (score, rationale) labels for every anchor dimension |
191
+ | **Hyperparameters** | Held constant across all MIRA student scorers; full settings in paper Appendix A.4 |
192
+ | **Validation** | Per-dimension teacher–student MAE and Spearman ρ on a held-out split; dimensions failing reliability thresholds are masked **post-hoc** (Figure 3 in the paper) |
193
+
194
+ The per-step training loss curve is preserved in `trainer_state.json` for reproducibility.
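
The validation row above names two per-dimension statistics, teacher–student MAE and Spearman ρ. The sketch below shows one way to compute them for a single anchor dimension with the standard library. The score lists and the `MAX_MAE` / `MIN_RHO` cut-offs are hypothetical, not the paper's thresholds.

```python
from statistics import mean


def mae(teacher, student):
    """Mean absolute error between paired teacher and student scores."""
    return mean(abs(t - s) for t, s in zip(teacher, student))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(teacher, student):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(teacher), average_ranks(student))


# Hypothetical held-out scores for one anchor dimension.
teacher_scores = [8, 6, 3, 9, 5]
student_scores = [7, 6, 4, 9, 3]

# Hypothetical cut-offs; the paper's reliability thresholds are not given here.
MAX_MAE, MIN_RHO = 1.5, 0.5

dim_mae = mae(teacher_scores, student_scores)
dim_rho = spearman(teacher_scores, student_scores)
is_reliable = dim_mae <= MAX_MAE and dim_rho >= MIN_RHO
print(f"MAE={dim_mae:.2f} rho={dim_rho:.2f} reliable={is_reliable}")
# MAE=0.80 rho=0.90 reliable=True
```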
195
+
196
+ ---
197
+
198
+ ## Headline results (from the paper)
199
+
200
+ End-to-end downstream evaluation: Qwen2.5-Coder-14B mid-trained on **25B-token MIRA-selected subsets** vs. baselines, then SFT, evaluated on 9 code benchmarks across 4 categories.
201
+
202
+ | Method | Code Gen | MultiPL-E | SQL (EX) | SWE-Multi | **Macro Avg** |
203
+ |-----------------------|---------:|---------:|---------:|----------:|--------------:|
204
+ | Base + SFT (no mid) | 53.91 | 72.57 | 64.24 | 3.67 | 48.60 |
205
+ | Raw Mixture (50B) | 53.71 | 67.42 | 94.18 | 40.00 | 63.83 |
206
+ | Random (25B) | 52.71 | 71.44 | 91.03 | 35.00 | 63.23 |
207
+ | DataMan (25B) | 53.82 | 71.38 | 93.84 | 33.00 | 63.01 |
208
+ | DSIR (25B) | 48.74 | 67.26 | 95.20 | 27.00 | 59.55 |
209
+ | PPL (25B) | 50.52 | 57.74 | 90.66 | 20.00 | 54.73 |
210
+ | MIRA-Global (25B) | 53.12 | 67.84 | 94.26 | 32.00 | 61.81 |
211
+ | **MIRA-Group (25B)** | **54.53**| 71.85 | 94.08 | 36.33 | **64.20** |
212
+ | MIRA-Source (25B) | 54.18 | **72.84**| 94.38 | 30.33 | 62.93 |
213
+
214
+ **MIRA-Group matches the full 50B-token raw mixture while using only half the tokens**, and outperforms all 25B-token selection baselines on the macro average. This scorer is one of the 12 student models used by the MIRA-Group variant.
215
+
216
+ ---
217
+
218
+ ## Sibling models
219
+
220
+ MIRA releases one student scorer per source-group variant. Use the matching scorer for each record's format:
221
+
222
+ - **Agent**: [whw06/MIRA-Agent-Group1](https://huggingface.co/whw06/MIRA-Agent-Group1) · [-Group2](https://huggingface.co/whw06/MIRA-Agent-Group2) · [-Group3](https://huggingface.co/whw06/MIRA-Agent-Group3) · [-Group4](https://huggingface.co/whw06/MIRA-Agent-Group4)
223
+ - **QA**: [whw06/MIRA-QA-Group1](https://huggingface.co/whw06/MIRA-QA-Group1) · [-Group2](https://huggingface.co/whw06/MIRA-QA-Group2) · [-Group3](https://huggingface.co/whw06/MIRA-QA-Group3) · [-Group4](https://huggingface.co/whw06/MIRA-QA-Group4) · [-Group5](https://huggingface.co/whw06/MIRA-QA-Group5)
224
+ - **Text**: [whw06/MIRA-Text-Group1](https://huggingface.co/whw06/MIRA-Text-Group1) · [-Group2](https://huggingface.co/whw06/MIRA-Text-Group2) · **MIRA-Text-Group3 (this model)**
225
+
226
+ ---
227
+
228
+ ## Limitations
229
+
230
+ - MIRA addresses **source-aware filtering** only. Source discovery, mixture-ratio design, curriculum scheduling, deduplication and contamination control remain orthogonal concerns.
231
+ - This scorer is calibrated against the **Text / Code-task documentation** group; cross-domain transfer is not advised, so use the matching sibling for other source formats.
232
+ - Some anchor dimensions exhibit high teacher–student MAE and are **masked post-hoc** during aggregation (see paper §3.4). The model still emits scores for masked dimensions; downstream consumers should re-apply the reliability mask from the project repository.
233
+ - Calibrated on 6 sources within this group; behavior on out-of-distribution formats is unverified.
234
+
235
+ ---
236
+
237
+ ## Citation
238
+
239
+ ```bibtex
240
+ @inproceedings{wang2026mira,
241
+ title = {MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection},
242
+ author = {Wang, Haowen and Du, Yaxin and Yang, Jian and Wu, Jiajun and
243
+ Liu, Shukai and Zhang, Yuxuan and Wang, Pingjie and Chen, Siheng and
244
+ Zheng, Tuney and Zhou, Ming and Liu, Xianglong},
245
+ booktitle = {Proceedings of the 2026 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
246
+ year = {2026}
247
+ }
248
+ ```
249
+
250
+ ---
251
+
252
+ ## Acknowledgments
253
+
254
+ Built on [Qwen3.5-35B-A3B-Base](https://huggingface.co/Qwen) and the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) training stack. Teacher labels generated with [Kimi-K2.6](https://moonshot.ai).