---
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - mira
  - mid-training
  - data-selection
  - rubric-scorer
  - source-aware
  - moe
  - qwen3
base_model: Qwen/Qwen3.5-35B-A3B-Base
---

# MIRA-Text-Group2

A student scorer from **MIRA** (Mid-training Rubric Anchoring for Source-Aware Data Selection), fine-tuned to score **Chinese identity / algorithm explanation text** along a group-specific set of anchor rubric dimensions.

> 📄 **Paper**: *MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection* (EMNLP 2026)
> 💻 **Code**: https://github.com/Multilingual-Multimodal-NLP/mira

---

## TL;DR

MIRA is a source-aware data selection framework for heterogeneous **mid-training** corpora. Instead of applying a single global quality rubric, MIRA (1) clusters sources into capability-coherent groups, (2) lets a frontier teacher (Kimi-K2.6) freely propose rubric dimensions and *anchors* them per group, (3) distills the anchored teacher into a lightweight **per-group student scorer**, and (4) applies reliability-aware aggregation with per-source retention thresholds.

**This repository is one of those student scorers**: variant **2** in the **Text** family, specialized for **Chinese identity / algorithm explanation text**. Given an in-distribution record, it produces a numerical score and a short rationale for every anchor dimension in this group's rubric.

---

## Model summary

| | |
|---|---|
| **Architecture** | Mixture-of-Experts decoder (35B total / ≈3B active params) |
| **Base model** | [Qwen3.5-35B-A3B-Base](https://huggingface.co/Qwen) |
| **Fine-tuning** | Full-parameter SFT on Kimi-K2.6 anchored teacher labels |
| **Domain** | Chinese flat-text identity-guided / algorithm explanation documents: `purchase_code` (distill_minimax final round), its first-round counterpart, and `ct_code_reasoning`. Strongest intra-group similarity: `purchase_code ↔ purchase_code_first = 0.959`. |
| **Anchor rubric** | 15 group-specific dimensions (`group_C_dim_anchors.jsonl` in the project repo) |
| **Source count** | 3 text sources |
| **Output** | Structured (score, rationale) per anchor dimension |
| **Precision** | BF16 |
| **License** | Apache-2.0 (inherits from Qwen3) |

---

## Sources covered

This scorer is calibrated for the following mid-training sources in the **Text / Chinese identity + algorithm QA** group:

| Source | Description |
|---|---|
| `purchase_code` | distill_minimax final-round purchase-code text |
| `purchase_code_first_round` | First-round counterpart of purchase_code |
| `ct_code_reasoning` | Chinese algorithm problems with worked solutions |

The full source-grouping report (KMeans k=4 / 5 clusters, intra-group cosine similarities) is in the [project repo](https://github.com/Multilingual-Multimodal-NLP/mira).

---

## Anchor dimensions (15 slots)

The scoring rubric for this group was discovered via Kimi-K2.6 free-form judging and clustered into 15 anchor dimensions (KMeans k=15 over the group's dim-score embeddings). Dimensions below are sorted by cluster size; larger clusters dominate the corpus and carry more signal. Anchor names are read verbatim from this group's `group_C_dim_anchors.jsonl`; **some names recur across slots** because semantically related but distinct rubric facets were clustered separately by the teacher.

| Slot | Dimension | Cluster size |
|---|---|---:|
| **A1** | Reasoning Quality | 37,321 |
| **A2** | Instruction Following | 36,571 |
| **A3** | Technical Depth | 31,723 |
| **A4** | Bug Identification Accuracy | 25,487 |
| **A5** | Formatting & Structural Clarity | 25,293 |
| **A6** | Training Utility | 25,114 |
| **A7** | Language Consistency & Fluency | 23,793 |
| **A8** | Communication Quality | 23,322 |
| **A9** | Solution Completeness | 22,410 |
| **A10** | Practical Actionability | 21,530 |
| **A11** | Signal-to-Noise Ratio | 21,469 |
| **A12** | Response Completeness | 20,253 |
| **A13** | Structural Organization | 17,329 |
| **A14** | Domain Expertise (Competitive Programming) | 16,016 |
| **A15** | Safety & Harmlessness | 13,024 |

The scorer outputs one `[Ai] <dimension>: <score>/10 — <rationale>` line per slot, plus `overall`, `training_recommendation`, `domain_tag`, and `brief`.
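
A minimal parsing sketch for this output format follows. It assumes the separator after the score is an em dash, an en dash, or a plain hyphen, and that scores are integers or decimals. The name `parse_scorer_output` and the returned dictionary layout are illustrative choices, not the project repo's parser.

```python
import re

# Matches lines such as "[A3] Technical Depth: 7/10 <dash> rationale text".
# The separator class accepts an em dash (U+2014), an en dash (U+2013), or a hyphen.
SLOT_RE = re.compile(
    r"^\[A(\d+)\]\s*(.+?):\s*(\d+(?:\.\d+)?)/10\s*[\u2014\u2013-]\s*(.*)$"
)
# Trailing "key: value" fields listed in this model card.
FIELD_RE = re.compile(r"^(overall|training_recommendation|domain_tag|brief):\s*(.*)$")


def parse_scorer_output(text: str) -> dict:
    """Parse scorer output into per-slot scores plus the trailing fields."""
    result = {"slots": {}, "fields": {}}
    for raw in text.splitlines():
        line = raw.strip()
        slot = SLOT_RE.match(line)
        if slot:
            index, name, score, rationale = slot.groups()
            result["slots"][f"A{index}"] = {
                "dimension": name,
                "score": float(score),
                "rationale": rationale,
            }
            continue
        field = FIELD_RE.match(line)
        if field:
            result["fields"][field.group(1)] = field.group(2)
    return result


# Made-up sample output, for illustration only.
sample = (
    "[A1] Reasoning Quality: 8/10 \u2014 steps are explicit and correct\n"
    "[A5] Formatting & Structural Clarity: 6.5/10 - headings are inconsistent\n"
    "overall: 74\n"
    "training_recommendation: keep\n"
)
parsed = parse_scorer_output(sample)
print(parsed["slots"]["A1"]["score"], parsed["fields"]["overall"])
```

Lines that match neither pattern are skipped silently here; a production parser would likely want to count or log them so that malformed generations can be retried.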

---

## Where this model fits in the MIRA pipeline

```
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ 1. Rubric        │  │ 2. Anchored      │  │ 3. Reliability   │  │ 4. Data          │
│    Discovery     │→ │    Judge         │→ │    Aggregation   │→ │    Selection     │
│ (Kimi-K2.6,      │  │    Distillation  │  │ (mask unreliable │  │ (per-source      │
│  free-form       │  │ ◀── THIS MODEL   │  │  src×dim cells)  │  │  retention)      │
│  judging)        │  │                  │  │                  │  │                  │
└──────────────────┘  └──────────────────┘  └──────────────────┘  └──────────────────┘
```

`MIRA-Text-Group2` lives in Stage 2: it scores the full **Text / Chinese identity + algorithm QA** corpus so that downstream stages can apply reliability masking and source-aware retention.
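
The hand-off from Stage 2 to Stages 3 and 4 can be sketched as follows. This is a minimal illustration, assuming a plain mean over the unmasked dimensions and a fixed score threshold per source. The mask contents, threshold values, and function names are hypothetical, and the paper's actual aggregation rule may differ.

```python
from statistics import mean
from typing import Optional

# Hypothetical reliability mask: (source, slot) cells judged unreliable.
UNRELIABLE = {("ct_code_reasoning", "A15"), ("purchase_code", "A4")}

# Hypothetical per-source retention thresholds on the aggregated 0-10 score.
THRESHOLDS = {"purchase_code": 6.0, "ct_code_reasoning": 7.0}


def aggregate(source: str, scores: dict) -> Optional[float]:
    """Mean of the slot scores whose (source, slot) cell is not masked."""
    kept = [s for slot, s in scores.items() if (source, slot) not in UNRELIABLE]
    return mean(kept) if kept else None


def select(records: list) -> list:
    """Keep records whose aggregated score reaches their source's threshold."""
    selected = []
    for rec in records:
        score = aggregate(rec["source"], rec["scores"])
        if score is not None and score >= THRESHOLDS[rec["source"]]:
            selected.append(rec)
    return selected


# Made-up scored records, for illustration only.
records = [
    {"id": 1, "source": "purchase_code", "scores": {"A1": 8, "A2": 7, "A4": 1}},
    {"id": 2, "source": "ct_code_reasoning", "scores": {"A1": 6, "A2": 7, "A15": 10}},
]
kept_ids = [rec["id"] for rec in select(records)]
print(kept_ids)
```

In this toy input the masked `A4` score no longer drags down record 1, while record 2 loses the benefit of its masked `A15` score and falls below its source's threshold.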

---

## Intended use

- **Primary**: Score Chinese identity / algorithm explanation text on this group's anchor dimensions to drive source-aware data selection and filtering.
- **Secondary**: Research on rubric distillation, semantic quality scoring, and reliability diagnostics for heterogeneous training corpora.

**Not intended for**:
- General-purpose chat or instruction following: the model is fine-tuned to emit structured scores, not free-form dialogue.
- Single-shot quality judgments without the anchor-dimension prompt template: outputs will be miscalibrated.
- Records outside the **Text / Chinese identity + algorithm QA** group; use the matching sibling scorer instead.

---

## Deployment

The scorer is designed to be served via **vLLM** behind an OpenAI-compatible endpoint and called in batch from the MIRA scoring pipeline.

### 1. Serve with vLLM (recommended)

```bash
vllm serve whw06/MIRA-Text-Group2 \
    --tensor-parallel-size 8 \
    --dtype bfloat16 \
    --max-model-len 65536 \
    --max-num-batched-tokens 131072 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --port 8000
```

**Why these values** (verified on H200 141GB during the paper's per-source evaluation):
- `max-model-len=65536`: 2× the mid-training cutoff. Records can hit ~60K tokens for densely tokenized sources; 40K runs into prompt-overflow errors.
- `max-num-batched-tokens=131072`: supports two full-length sequences per scheduling step.
- `gpu-memory-utilization=0.9`: 35B BF16 weights take ~70GB, leaving ~57GB KV cache. Roughly 4 concurrent 65K-context sequences per GPU.
- 8-way tensor parallel works well for the 35B MoE on a single 8×H200/A100 node.
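
The memory figures above follow from simple arithmetic, reproduced here as a sketch. It assumes 2 bytes per BF16 parameter, decimal gigabytes, and the full set of weights counted against a single 141 GB device, as the bullet does. Under tensor parallelism the weights are sharded across GPUs, which this calculation does not model.

```python
GPU_MEMORY_GB = 141          # H200 capacity quoted above
UTILIZATION = 0.9            # --gpu-memory-utilization
PARAMS_BILLION = 35          # total parameters
BYTES_PER_PARAM = 2          # BF16

MAX_MODEL_LEN = 65536        # --max-model-len
MAX_BATCHED_TOKENS = 131072  # --max-num-batched-tokens

budget_gb = GPU_MEMORY_GB * UTILIZATION         # memory vLLM is allowed to use
weights_gb = PARAMS_BILLION * BYTES_PER_PARAM   # 35e9 params * 2 bytes = 70 GB
kv_cache_gb = budget_gb - weights_gb            # remainder available for KV cache
full_sequences_per_step = MAX_BATCHED_TOKENS // MAX_MODEL_LEN

print(f"budget ~{budget_gb:.0f} GB, weights ~{weights_gb} GB, KV cache ~{kv_cache_gb:.0f} GB")
print(f"full-length sequences per scheduling step: {full_sequences_per_step}")
```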

### 2. Call from Python

```python
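from openai import OpenAI

# SYSTEM_PROMPT and USER_PROMPT must be built beforehand (see "Prompt template");
# the values below are placeholders so the snippet is self-contained.
SYSTEM_PROMPT = "<group-C anchor calibration prompt>"
USER_PROMPT = "<record text followed by the [A1]..[A15] response template>"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")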

resp = client.chat.completions.create(
    model="whw06/MIRA-Text-Group2",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},   # group-C anchor calibration
        {"role": "user",   "content": USER_PROMPT},     # record + [A1]..[A15] template
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```

### 3. Prompt template

The user message asks for one structured line per anchor dimension (top-15 of this group):

```
[A1] {anchor_dim_1}: <score>/10 — <justification>
[A2] {anchor_dim_2}: <score>/10 — <justification>
...
[A15] {anchor_dim_15}: <score>/10 — <justification>
overall: <0-100>
training_recommendation: <keep | downsample | drop>
domain_tag: <short tag>
brief: <one-sentence summary>
```

The system prompt embeds the **top-12 anchor calibration references** (canonical examples from clustering) so the student matches the teacher's scoring scale. The full prompt builder, anchor JSONL files, and output parser are in the project repo's `scoring/score_text_anchored.py`.
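
As an illustration of how the response-format portion of the user message could be rendered from the anchor names, here is a minimal sketch. `render_response_format` is a hypothetical helper, not the repo's prompt builder; only the line layout and the dimension names are taken from this card.

```python
# Anchor dimension names in slot order, copied from the table in this card.
ANCHOR_DIMS = [
    "Reasoning Quality",
    "Instruction Following",
    "Technical Depth",
    "Bug Identification Accuracy",
    "Formatting & Structural Clarity",
    "Training Utility",
    "Language Consistency & Fluency",
    "Communication Quality",
    "Solution Completeness",
    "Practical Actionability",
    "Signal-to-Noise Ratio",
    "Response Completeness",
    "Structural Organization",
    "Domain Expertise (Competitive Programming)",
    "Safety & Harmlessness",
]

# Trailing fields requested after the per-slot lines.
TRAILER = [
    "overall: <0-100>",
    "training_recommendation: <keep | downsample | drop>",
    "domain_tag: <short tag>",
    "brief: <one-sentence summary>",
]


def render_response_format(dims: list) -> str:
    """Render one '[Ai] name: <score>/10 (em dash) <justification>' line per slot."""
    slot_lines = [
        f"[A{i}] {name}: <score>/10 \u2014 <justification>"
        for i, name in enumerate(dims, start=1)
    ]
    return "\n".join(slot_lines + TRAILER)


template = render_response_format(ANCHOR_DIMS)
print(template)
```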

---

## Training details

| | |
|---|---|
| **Teacher** | Kimi-K2.6 (free-form rubric discovery in Phase 1; anchored re-scoring in Phase 2) |
| **Training data** | Kimi-K2.6 anchored labels on this group's Phase-2 corpus, split into a distillation set + a held-out validation split for reliability diagnostics |
| **Loss** | Standard next-token CE over (score, rationale) labels for every anchor dimension |
| **Hyperparameters** | Held constant across all MIRA student scorers; full settings in paper Appendix A.4 |
| **Validation** | Per-dimension teacher–student MAE and Spearman ρ on a held-out split; dimensions failing reliability thresholds are masked **post-hoc** (Figure 3 in the paper) |

Training loss / step curve is preserved in `trainer_state.json` for full reproducibility.
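
The per-dimension diagnostics named in the validation row can be sketched with the standard library alone. The thresholds `MAX_MAE` and `MIN_RHO` are hypothetical placeholders, since this card does not restate the paper's reliability thresholds, and the score lists are made-up data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mae(x, y):
    """Mean absolute error between paired scores."""
    return mean(abs(a - b) for a, b in zip(x, y))


MAX_MAE = 1.5   # hypothetical threshold
MIN_RHO = 0.5   # hypothetical threshold

teacher = [8, 6, 9, 3, 7, 5]   # teacher scores for one dimension
student = [7, 6, 9, 4, 8, 5]   # student scores for the same records
dim_mae = mae(teacher, student)
dim_rho = spearman(teacher, student)
reliable = dim_mae <= MAX_MAE and dim_rho >= MIN_RHO
print(f"MAE={dim_mae:.2f} rho={dim_rho:.2f} reliable={reliable}")
```

A dimension for which `reliable` is false would be a candidate for the post-hoc mask applied during aggregation.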

---

## Headline results (from the paper)

End-to-end downstream evaluation: Qwen2.5-Coder-14B mid-trained on **25B-token MIRA-selected subsets** vs. baselines, then SFT, evaluated on 9 code benchmarks across 4 categories.

| Method                | Code Gen | MultiPL-E | SQL (EX) | SWE-Multi | **Macro Avg** |
|-----------------------|---------:|---------:|---------:|----------:|--------------:|
| Base + SFT (no mid)   |    53.91 |    72.57 |    64.24 |      3.67 |         48.60 |
| Raw Mixture (50B)     |    53.71 |    67.42 |    94.18 |     40.00 |         63.83 |
| Random (25B)          |    52.71 |    71.44 |    91.03 |     35.00 |         63.23 |
| DataMan (25B)         |    53.82 |    71.38 |    93.84 |     33.00 |         63.01 |
| DSIR (25B)            |    48.74 |    67.26 |    95.20 |     27.00 |         59.55 |
| PPL (25B)             |    50.52 |    57.74 |    90.66 |     20.00 |         54.73 |
| MIRA-Global (25B)     |    53.12 |    67.84 |    94.26 |     32.00 |         61.81 |
| **MIRA-Group (25B)**  | **54.53**|    71.85 |    94.08 |     36.33 |     **64.20** |
| MIRA-Source (25B)     |    54.18 | **72.84**|    94.38 |     30.33 |         62.93 |

**MIRA-Group matches the full 50B-token raw mixture while using only half the tokens**, and outperforms all 25B-token selection baselines on the macro average. This scorer is one of the 12 student models used by the MIRA-Group variant.
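
The Macro Avg column appears to be the unweighted mean of the four category scores. The sketch below checks that reading for three rows of the table; treating the macro average as a simple mean is an inference from the numbers, not a definition quoted from the paper.

```python
# Category scores (Code Gen, second column, SQL (EX), SWE-Multi) copied from the table.
ROWS = {
    "Raw Mixture (50B)": [53.71, 67.42, 94.18, 40.00],
    "DataMan (25B)": [53.82, 71.38, 93.84, 33.00],
    "MIRA-Group (25B)": [54.53, 71.85, 94.08, 36.33],
}

# Unweighted mean over the four categories.
macro = {name: sum(scores) / len(scores) for name, scores in ROWS.items()}

for name, value in macro.items():
    print(f"{name}: {value:.2f}")

# Gap between MIRA-Group at 25B tokens and the raw mixture at 50B tokens.
gap = macro["MIRA-Group (25B)"] - macro["Raw Mixture (50B)"]
print(f"MIRA-Group minus Raw Mixture: {gap:+.2f}")
```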

---

## Sibling models

MIRA releases one student scorer per source-group variant. Use the matching scorer for each record's format:

- **Agent**: [whw06/MIRA-Agent-Group1](https://huggingface.co/whw06/MIRA-Agent-Group1) · [-Group2](https://huggingface.co/whw06/MIRA-Agent-Group2) · [-Group3](https://huggingface.co/whw06/MIRA-Agent-Group3) · [-Group4](https://huggingface.co/whw06/MIRA-Agent-Group4)
- **QA**: [whw06/MIRA-QA-Group1](https://huggingface.co/whw06/MIRA-QA-Group1) · [-Group2](https://huggingface.co/whw06/MIRA-QA-Group2) · [-Group3](https://huggingface.co/whw06/MIRA-QA-Group3) · [-Group4](https://huggingface.co/whw06/MIRA-QA-Group4) · [-Group5](https://huggingface.co/whw06/MIRA-QA-Group5)
- **Text**: [whw06/MIRA-Text-Group1](https://huggingface.co/whw06/MIRA-Text-Group1) · **MIRA-Text-Group2 (this model)** · [-Group3](https://huggingface.co/whw06/MIRA-Text-Group3)

---

## Limitations

- MIRA addresses **source-aware filtering** only. Source discovery, mixture-ratio design, curriculum scheduling, deduplication and contamination control remain orthogonal concerns.
- This scorer is calibrated against the **Text / Chinese identity + algorithm QA** group; cross-domain transfer is not advised. Use the matching sibling for other source formats.
- Some anchor dimensions exhibit high teacher–student MAE and are **masked post-hoc** during aggregation (see paper §3.4). The model still emits scores for masked dimensions; downstream consumers should re-apply the reliability mask from the project repository.
- Calibrated on 3 sources within this group; behavior on out-of-distribution formats is unverified.

---

## Citation

```bibtex
@inproceedings{wang2026mira,
  title     = {MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection},
  author    = {Wang, Haowen and Du, Yaxin and Yang, Jian and Wu, Jiajun and
               Liu, Shukai and Zhang, Yuxuan and Wang, Pingjie and Chen, Siheng and
               Zheng, Tuney and Zhou, Ming and Liu, Xianglong},
  booktitle = {Proceedings of the 2026 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2026}
}
```

---

## Acknowledgments

Built on [Qwen3.5-35B-A3B-Base](https://huggingface.co/Qwen) and the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) training stack. Teacher labels generated with [Kimi-K2.6](https://moonshot.ai).