---
base_model:
- Qwen/Qwen3-8B
tags:
- difficulty
- scorer
- data_selection
---
# Difficulty Scorer v2 

A Qwen3-8B-based difficulty scorer trained on our own difficulty data, as used in our EMNLP 2025 submission

**Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy** [REF]

The model scores the difficulty of instructions: more challenging instructions are associated with better learning outcomes during instruction tuning.
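
For instance, a downstream pipeline can keep only the highest-scoring instructions. A minimal sketch of score-based filtering (the helper name and toy scores are illustrative, not part of this repo; the paper's stratified strategy is more involved):

```python
def select_hardest(instructions, scores, k):
    """Return the k instructions with the highest difficulty scores."""
    ranked = sorted(zip(scores, instructions), reverse=True)
    return [instr for _, instr in ranked[:k]]

# Toy difficulty scores as the scorer might produce (illustrative values)
pool = ["easy sum", "interval inequality", "word problem"]
difficulty = [-0.12, 0.18, 0.05]
print(select_hardest(pool, difficulty, k=2))
# ['interval inequality', 'word problem']
```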

## Model Architecture

- Finetuned model based on [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) 
- Custom head: a regression head on top of a pooling layer.

For more details, see `model.py`

*TODO: erase doubled weights from regression_head.bin*
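
Schematically, the head can be pictured as mean pooling over the final hidden states followed by a single linear layer. This is a minimal sketch under those assumptions; the module names and pooling choice here are illustrative, see `model.py` for the actual implementation:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Illustrative sketch: mean-pool hidden states, then project to one score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, attention_mask):
        # Mean-pool over non-padding positions only.
        mask = attention_mask.unsqueeze(-1).float()          # (batch, seq, 1)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        return self.linear(pooled)                           # (batch, 1)

# Tiny dimensions for demonstration; the real head sits on Qwen3-8B hidden states.
head = RegressionHead(hidden_size=8)
hidden = torch.randn(2, 5, 8)
attn = torch.ones(2, 5, dtype=torch.long)
scores = head(hidden, attn)
print(scores.shape)  # torch.Size([2, 1])
```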

---

## How to Use

```python
from transformers import AutoModelForCausalLM

# Get model and tokenizer
model = AutoModelForCausalLM.from_pretrained("IIS-NLP-internal/qwen3-8B-difficulty-scorer-v2", trust_remote_code=True)
tokenizer = model.get_tokenizer()

# Prepare input data
current_category = "Math"
system_template = "You are an expert of {category} data. You judge problems for their difficulty."

instructions = [
    "What is the sum of 1 and 2?",
    # Use raw strings so LaTeX escapes such as \frac are not
    # interpreted as Python string escapes.
    r"What are all values of $p$ such that for every $q>0$, "
    r"we have $$\frac{3(pq^2+p^2q+3q^2+3pq)}{p+q}>2p^2q?$$ "
    r"Express your answer in interval notation in decimal form.",
]
convs = [[{"role": "system", "content": system_template.format(category=current_category)}, {"role": "user", "content": instruction}] for instruction in instructions]

conv_1_tokenized = tokenizer.apply_chat_template(convs[0], tokenize=True, return_tensors="pt").to(model.model.device)
conv_2_tokenized = tokenizer.apply_chat_template(convs[1], tokenize=True, return_tensors="pt").to(model.model.device)
difficulty_1 = model(conv_1_tokenized)['logits'].item()
difficulty_2 = model(conv_2_tokenized)['logits'].item()

print(difficulty_1, difficulty_2)
# -0.12232150137424469 0.1787720024585724

```

---

## Model Files

* `pytorch_model-0000x-of-00002.bin` – finetuned model weights
* `regression_head.bin` – custom regression head
* `config.json` – configuration including base model and head details
* `tokenizer.json`, `vocab.txt`, etc. – tokenizer files
* `model.py` – custom regression model implementation

---

## Evaluation

We mainly validated the scorer through its downstream benefits in training (see paper).
As an additional sanity check, we scored coding data from [deepmind/code_contests](https://huggingface.co/datasets/deepmind/code_contests), which ships with difficulty labels:

![Correlation code contest](./scatter_code_contests_vs_difficulty.png)


The correlation of our difficulty scores with the code_contests difficulty labels is `r = 0.41`.
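
Pearson's r, as reported above, can be reproduced in a few lines. The numbers below are illustrative toy values, not the actual code_contests scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: predicted difficulty scores vs. reference difficulty labels
predicted = [-0.12, 0.18, 0.05, 0.31, -0.02]
labels = [1, 4, 2, 5, 2]
print(round(pearson_r(predicted, labels), 2))
```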

---

## Responsible 

Mostly Lucas W.