import HtmlEmbed from "../../components/HtmlEmbed.astro";
import FigRef from "../../components/FigRef.astro";
## Analyses

Our final experiment explores an even more counterintuitive finding.
{/*
### Does edu-score or DCLM-score predict model performance?

Running these ablations is very expensive, so we looked for informative proxies that can predict whether a given dataset will yield better downstream benchmark performance. Since the FineWeb-Edu score and DCLM score work well for human-written data, we surmised they could also work for synthetic data.

TODO: Run this analysis and add a small report
*/}
### Math Rephrasing: When "Worse" Outputs Win

We compared two ~1.7B-parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.

**Qwen3 produced beautiful, structured outputs:**

- 100% had proper Problem/Solution sections
- 99% had step-by-step formatting
- 60% included LaTeX math notation

Here's a typical Qwen3 output:
```
**Problem:**
A disc rotates at 120 rpm. How many revolutions in 5 minutes?

**Solution:**
1. Revolutions per minute = 120
2. Number of minutes = 5
3. Total revolutions = 120 × 5

$$120 \times 5 = 600$$

The disc makes 600 revolutions in 5 minutes.
```
**SmolLM2 was messier:**

- Only 68% had complete solutions
- Wide variance in output length (4 to 4,000 tokens)
- Mix of formats: questions, partial answers, full solutions

SmolLM2 outputs ranged from proper solutions to bare questions like *"What is the difference between X and Y?"* or even 4-token fragments like *"Areas Where We Service"*.

Yet models trained on SmolLM2's data **outperformed** those trained on Qwen3's data on downstream benchmarks. We suspect this is due to **template collapse**: Qwen3's outputs were *too* consistent. 115 out of 1,000 samples started with identical text, while SmolLM2's most common opening appeared only 3 times.
| Metric | SmolLM2 | Qwen3 |
| --- | --- | --- |
| Most common start | 3/1000 | 115/1000 |
| Output length range (tokens) | 4-4,000 | 100-2,600 |
| Unique patterns | High | Low |
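The collapse check itself is cheap to reproduce. Here is a minimal sketch that counts how often the most frequent opening appears across a set of samples; whitespace tokenization and the prefix length are simplifying assumptions, not the exact procedure behind the numbers above.

```python
from collections import Counter

def most_common_opening(samples, prefix_tokens=8):
    """Return the most frequent opening prefix and how often it appears.

    Assumptions: whitespace tokenization, fixed-length prefixes.
    """
    prefixes = Counter(" ".join(s.split()[:prefix_tokens]) for s in samples)
    return prefixes.most_common(1)[0]

# Toy data: two of three samples share an identical opening.
samples = [
    "**Problem:** A disc rotates at 120 rpm. How many revolutions in 5 minutes?",
    "**Problem:** A disc rotates at 240 rpm. How many revolutions in 2 minutes?",
    "A train leaves the station at noon and travels at 60 mph.",
]
prefix, count = most_common_opening(samples, prefix_tokens=4)
```

A high count relative to the sample size (like Qwen3's 115/1000) signals that one template dominates the generations.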
SmolLM2's quality distribution was actually reasonable:

| Quality | Criteria | Share |
| --- | --- | --- |
| Excellent | Has "solution" + numbered steps + 80+ tokens | 45% |
| Good | Has "solution" + 50+ tokens | 22% |
| Partial | 30+ tokens but missing structure | 25% |
| Poor | {'<'}30 tokens | 8% |
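The rubric above can be sketched as a simple classifier. The thresholds come from the table; the "solution" keyword match, whitespace token counts, and the numbered-step regex are our illustrative assumptions about how each criterion is detected.

```python
import re

def classify_quality(text):
    """Bucket a generated sample using the heuristic rubric above.

    Assumptions: whitespace tokenization, lowercase keyword match,
    and lines starting with "1." etc. as numbered steps.
    """
    tokens = len(text.split())
    has_solution = "solution" in text.lower()
    has_steps = bool(re.search(r"^\s*\d+\.", text, flags=re.MULTILINE))
    if has_solution and has_steps and tokens >= 80:
        return "excellent"
    if has_solution and tokens >= 50:
        return "good"
    if tokens >= 30:
        return "partial"
    return "poor"

classify_quality("Areas Where We Service")  # a 4-token fragment lands in "poor"
```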
For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
### Verbosity Analysis

Different prompt formats produce wildly different output lengths. <FigRef target="verbosity" /> shows the output tokens per document across four prompt types, broken down by model family. Table and Math prompts tend to be concise, while FAQ and Tutorial prompts generate significantly more tokens per document. Notably, the spread within each prompt type varies across model families: some models are consistently verbose regardless of the prompt, while others adapt their output length to the task.
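The per-prompt-type aggregation behind this kind of plot is straightforward. Below is a minimal sketch that groups token counts by prompt type and reports the median; the record field names (`prompt_type`, `output_tokens`) are assumed for illustration and are not necessarily the actual `rephrasing_metadata.json` schema.

```python
from statistics import median

def verbosity_by_prompt(records):
    """Median output tokens per prompt type.

    Assumption: each record carries 'prompt_type' and 'output_tokens'
    fields (illustrative names, not the real metadata schema).
    """
    groups = {}
    for rec in records:
        groups.setdefault(rec["prompt_type"], []).append(rec["output_tokens"])
    return {prompt: median(counts) for prompt, counts in groups.items()}

# Toy records standing in for the real metadata.
records = [
    {"prompt_type": "Math", "output_tokens": 120},
    {"prompt_type": "Math", "output_tokens": 180},
    {"prompt_type": "FAQ", "output_tokens": 900},
]
```

Using the median rather than the mean keeps the summary robust to the long tails seen in some model families.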
<HtmlEmbed
  id="verbosity"
  src="verbosity.html"
  data="rephrasing_metadata.json"
  desc="Output tokens per document across prompt types and model families. Hover over dots to see detailed statistics for each experiment."
/>