|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen3-8B |
|
|
tags: |
|
|
- difficulty |
|
|
- scorer |
|
|
- data_selection |
|
|
--- |
|
|
# Difficulty Scorer v2 |
|
|
|
|
|
A Qwen3-8B-based difficulty scorer trained on our own difficulty data, as used in our EMNLP 2025 submission
|
|
|
|
|
**Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy** [REF] |
|
|
|
|
|
The model can be used to score the difficulty of instructions. More challenging instructions are associated with better learning outcomes during training.
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- Finetuned model based on [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) |
|
|
- Custom head: Regression head on top of pooling layer. |
|
|
|
|
|
For more details, see `model.py`.
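The actual head is defined in `model.py`; as a rough, illustrative sketch (our assumption, not the shipped implementation), a regression head on top of a pooling layer could look like:

```python
import torch
import torch.nn as nn


class PooledRegressionHead(nn.Module):
    """Illustrative sketch: mean-pool token states, then project to a scalar score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        # Masked mean over the sequence dimension, guarding against empty masks
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.linear(pooled).squeeze(-1)  # (batch,)


head = PooledRegressionHead(hidden_size=8)
scores = head(torch.randn(2, 5, 8), torch.ones(2, 5))
print(scores.shape)  # torch.Size([2])
```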
|
|
|
|
|
*TODO: erase doubled weights from regression_head.bin* |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM |
|
|
|
|
|
# Get model and tokenizer |
|
|
model = AutoModelForCausalLM.from_pretrained("IIS-NLP-internal/qwen3-8B-difficulty-scorer-v2", trust_remote_code=True) |
|
|
tokenizer = model.get_tokenizer() |
|
|
|
|
|
# Prepare input data |
|
|
current_category = "Math" |
|
|
system_template = "You are an expert of {category} data. You judge problems for their difficulty." |
|
|
|
|
|
instructions = [
    "What is the sum of 1 and 2?",
    # Raw strings so LaTeX backslashes (e.g. \frac) are not treated as escapes
    r"What are all values of $p$ such that for every $q>0$, "
    r"we have $$\frac{3(pq^2+p^2q+3q^2+3pq)}{p+q}>2p^2q?$$ Express your answer in interval notation in decimal form.",
]
|
|
convs = [
    [
        {"role": "system", "content": system_template.format(category=current_category)},
        {"role": "user", "content": instruction},
    ]
    for instruction in instructions
]
|
|
|
|
|
conv_1_tokenized = tokenizer.apply_chat_template(convs[0], tokenize=True, return_tensors="pt").to(model.model.device) |
|
|
conv_2_tokenized = tokenizer.apply_chat_template(convs[1], tokenize=True, return_tensors="pt").to(model.model.device) |
|
|
difficulty_1 = model(conv_1_tokenized)['logits'].item() |
|
|
difficulty_2 = model(conv_2_tokenized)['logits'].item() |
|
|
|
|
|
print(difficulty_1, difficulty_2) |
|
|
# -0.12232150137424469 0.1787720024585724 |
|
|
|
|
|
``` |
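Since the scores are unbounded regression outputs, a typical downstream use is ranking an instruction pool and keeping the hardest fraction. A minimal sketch (`select_hardest` is a hypothetical helper, not part of this repository; the scores stand in for model outputs like those above):

```python
def select_hardest(instructions, scores, keep_fraction=0.5):
    """Keep the top fraction of instructions by difficulty score (higher = harder)."""
    ranked = sorted(zip(instructions, scores), key=lambda pair: pair[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [inst for inst, _ in ranked[:n_keep]]


pool = ["easy sum", "hard inequality", "medium algebra"]
pool_scores = [-0.12, 0.18, 0.03]
print(select_hardest(pool, pool_scores, keep_fraction=0.5))  # ['hard inequality']
```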
|
|
|
|
|
--- |
|
|
|
|
|
## Model Files |
|
|
|
|
|
* `pytorch_model-0000x-of-00002.bin` – finetuned model weights
|
|
* `regression_head.bin` – custom regression head
|
|
* `config.json` – configuration including base model and head details |
|
|
* `tokenizer.json`, `vocab.txt`, etc. – tokenizer files |
|
|
* `model.py` – custom regression model implementation |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
We mainly validated the scorer through its downstream benefits in training (see paper).
|
|
We additionally did a sanity check with coding data from [deepmind/code_contests](https://huggingface.co/datasets/deepmind/code_contests), which contains difficulty scores: |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
The correlation of our difficulty scores with the code_contests difficulty labels is `r = 0.41`.
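A check like this can be reproduced with `scipy.stats.pearsonr` (assuming Pearson's r; the score lists below are hypothetical stand-ins for our model scores and the dataset's difficulty labels):

```python
from scipy.stats import pearsonr

# Hypothetical example data: model scores vs. dataset difficulty labels
model_scores = [0.1, -0.3, 0.5, 0.2, -0.1, 0.4]
dataset_difficulty = [2, 1, 3, 2, 1, 3]

r, p_value = pearsonr(model_scores, dataset_difficulty)
print(f"r = {r:.2f}")
```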
|
|
|
|
|
--- |
|
|
|
|
|
## Responsible |
|
|
|
|
|
Mostly Lucas W. |