|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen3-8B |
|
|
tags: |
|
|
- difficulty |
|
|
- scorer |
|
|
- data_selection |
|
|
--- |
|
|
# Difficulty Scorer v2 |
|
|
|
|
|
A Qwen3-8B-based difficulty scorer trained on our own difficulty data, as used in our EMNLP 2025 submission
|
|
|
|
|
**Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy** [REF] |
|
|
|
|
|
The model can be used to score the difficulty of instructions. More challenging instructions are associated with better learning outcomes during training.
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- Finetuned model based on [`Qwen/Qwen3-8B`](https://huggingface.co/Qwen/Qwen3-8B) |
|
|
- Custom head: Regression head on top of pooling layer. |
|
|
|
|
|
For more details, see `model.py`.
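The actual head is defined in `model.py`; as a rough, illustrative sketch (our assumption, not the shipped implementation), a regression head on top of a pooling layer could look like:

```python
import torch
import torch.nn as nn


class PooledRegressionHead(nn.Module):
    """Illustrative sketch: mean-pool token states, then project to a scalar score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        # Masked mean over the sequence dimension, guarding against empty masks
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.linear(pooled).squeeze(-1)  # (batch,)


head = PooledRegressionHead(hidden_size=8)
scores = head(torch.randn(2, 5, 8), torch.ones(2, 5))
print(scores.shape)  # torch.Size([2])
```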
|
|
|
|
|
*TODO: erase doubled weights from regression_head.bin* |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM |
|
|
|
|
|
# Get model and tokenizer |
|
|
model = AutoModelForCausalLM.from_pretrained("IIS-NLP-internal/qwen3-8B-difficulty-scorer-v2", trust_remote_code=True) |
|
|
tokenizer = model.get_tokenizer() |
|
|
|
|
|
# Prepare input data |
|
|
current_category = "Math" |
|
|
system_template = "You are an expert of {category} data. You judge problems for their difficulty." |
|
|
|
|
|
instructions = [
    "What is the sum of 1 and 2?",
    # Raw strings so LaTeX backslashes (e.g. \frac) are not treated as escapes
    r"What are all values of $p$ such that for every $q>0$, "
    r"we have $$\frac{3(pq^2+p^2q+3q^2+3pq)}{p+q}>2p^2q?$$ Express your answer in interval notation in decimal form.",
]
|
|
convs = [
    [
        {"role": "system", "content": system_template.format(category=current_category)},
        {"role": "user", "content": instruction},
    ]
    for instruction in instructions
]
|
|
|
|
|
conv_1_tokenized = tokenizer.apply_chat_template(convs[0], tokenize=True, return_tensors="pt").to(model.model.device) |
|
|
conv_2_tokenized = tokenizer.apply_chat_template(convs[1], tokenize=True, return_tensors="pt").to(model.model.device) |
|
|
difficulty_1 = model(conv_1_tokenized)['logits'].item() |
|
|
difficulty_2 = model(conv_2_tokenized)['logits'].item() |
|
|
|
|
|
print(difficulty_1, difficulty_2) |
|
|
# -0.12232150137424469 0.1787720024585724 |
|
|
|
|
|
``` |
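Since the scores are unbounded regression outputs, a typical downstream use is ranking an instruction pool and keeping the hardest fraction. A minimal sketch (`select_hardest` is a hypothetical helper, not part of this repository; the scores stand in for model outputs like those above):

```python
def select_hardest(instructions, scores, keep_fraction=0.5):
    """Keep the top fraction of instructions by difficulty score (higher = harder)."""
    ranked = sorted(zip(instructions, scores), key=lambda pair: pair[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return [inst for inst, _ in ranked[:n_keep]]


pool = ["easy sum", "hard inequality", "medium algebra"]
pool_scores = [-0.12, 0.18, 0.03]
print(select_hardest(pool, pool_scores, keep_fraction=0.5))  # ['hard inequality']
```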
|
|
|
|
|
--- |
|
|
|
|
|
## Model Files |
|
|
|
|
|
* `pytorch_model-0000x-of-00002.bin` – finetuned model weights
|
|
* `regression_head.bin` – custom regression head
|
|
* `config.json` – configuration including base model and head details |
|
|
* `tokenizer.json`, `vocab.txt`, etc. – tokenizer files |
|
|
* `model.py` – custom regression model implementation |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
We mainly validated the scorer through its downstream benefits in training (see paper).
|
|
We additionally did a sanity check with coding data from [deepmind/code_contests](https://huggingface.co/datasets/deepmind/code_contests), which contains difficulty scores: |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
The correlation of our difficulty scores with the code_contests difficulty labels is `r = 0.41`.
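A check like this can be reproduced with `scipy.stats.pearsonr` (assuming Pearson's r; the score lists below are hypothetical stand-ins for our model scores and the dataset's difficulty labels):

```python
from scipy.stats import pearsonr

# Hypothetical example data: model scores vs. dataset difficulty labels
model_scores = [0.1, -0.3, 0.5, 0.2, -0.1, 0.4]
dataset_difficulty = [2, 1, 3, 2, 1, 3]

r, p_value = pearsonr(model_scores, dataset_difficulty)
print(f"r = {r:.2f}")
```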
|
|
|
|
|
--- |
|
|
|
|
|
## Responsible |
|
|
|
|
|
Mostly Lucas W. |