Commit b354c96 (parent: 3b14ed8)
replace static plots with dynamic d3 embeds
Files changed:
- app/src/content/assets/image/newplot_2c31384e-bcac-800b-82e8-ff44228f7720.png +0 -3
- app/src/content/assets/image/newplot_2c41384e-bcac-8073-9395-cf2d0e901187.png +0 -3
- app/src/content/assets/image/newplot_2d21384e-bcac-80ab-a6dd-e31a6c150e61.png +0 -3
- app/src/content/assets/image/newplot_2d61384e-bcac-8092-baca-c17346b95734.png +0 -3
- app/src/content/assets/image/newplot_2da1384e-bcac-80d6-a8b9-da80324f8fef.png +0 -3
- app/src/content/assets/image/newplot_2df1384e-bcac-8010-abe7-cf477262b8d6.png +0 -3
- app/src/content/assets/image/newplot_2df1384e-bcac-8018-b1f6-da1dcde1f90a.png +0 -3
- app/src/content/assets/image/newplot_2df1384e-bcac-80bc-b93c-ee8e9cfd5529.png +0 -3
- app/src/content/assets/image/newplot_2e01384e-bcac-8017-9829-cd0c1db928c6.png +0 -3
- app/src/content/assets/image/newplot_2e01384e-bcac-806f-8bf1-f7e5405a2ff9.png +0 -3
- app/src/content/assets/image/newplot_2e11384e-bcac-800a-abc6-d0690da3f955.png +0 -3
- app/src/content/assets/image/newplot_2e11384e-bcac-8032-9835-e1407f4d780d.png +0 -3
- app/src/content/assets/image/newplot_2e11384e-bcac-80a3-a6fa-e8634e0e2206.png +0 -3
- app/src/content/assets/image/newplot_2e11384e-bcac-80bc-810d-d13554c628dc.png +0 -3
- app/src/content/assets/image/newplot_2e11384e-bcac-80dd-972d-cf77d9c3b004.png +0 -3
- app/src/content/assets/image/newplot_2e11384e-bcac-80ea-88cc-c971b2816596.png +0 -3
- app/src/content/assets/image/newplot_2e21384e-bcac-80a2-9bac-c543304d926e.png +0 -3
- app/src/content/assets/image/newplot_2e41384e-bcac-8065-b313-c38a6db4ac31.png +0 -3
- app/src/content/assets/image/newplot_2e41384e-bcac-80c0-aef5-e71fdbaccd8d.png +0 -3
- app/src/content/assets/image/newplot_2e71384e-bcac-8027-ae32-c133627ede4a.png +0 -3
- app/src/content/assets/image/newplot_2ee1384e-bcac-80da-82cd-df97247e2e72.png +0 -3
- app/src/content/assets/image/newplot_2f61384e-bcac-80d9-ab81-d57a228847cf.png +0 -3
- app/src/content/assets/image/newplot_2f71384e-bcac-80c6-a99e-f52084fc497b.png +0 -3
- app/src/content/assets/image/newplot_2f71384e-bcac-80d8-9985-e195d39f1e70.png +0 -3
- app/src/content/chapters/experiments.mdx +408 -49
(All 24 newplot_*.png assets listed above were deleted; each was a Git LFS pointer file.)
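Each deleted image shows `+0 -3` because the repository stores only a Git LFS pointer, a three-line text file of this form (the oid and size below are illustrative, not the actual values):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
```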
app/src/content/chapters/experiments.mdx (CHANGED)
@@ -1,29 +1,4 @@
-import Image from "../../components/Image.astro";
import HtmlEmbed from "../../components/HtmlEmbed.astro";
-import newplot_2c41384e_bcac_8073_9395_cf2d0e901187 from "../assets/image/newplot_2c41384e-bcac-8073-9395-cf2d0e901187.png";
-import newplot_2c31384e_bcac_800b_82e8_ff44228f7720 from "../assets/image/newplot_2c31384e-bcac-800b-82e8-ff44228f7720.png";
-import newplot_2e11384e_bcac_800a_abc6_d0690da3f955 from "../assets/image/newplot_2e11384e-bcac-800a-abc6-d0690da3f955.png";
-import newplot_2e21384e_bcac_80a2_9bac_c543304d926e from "../assets/image/newplot_2e21384e-bcac-80a2-9bac-c543304d926e.png";
-import newplot_2e11384e_bcac_80dd_972d_cf77d9c3b004 from "../assets/image/newplot_2e11384e-bcac-80dd-972d-cf77d9c3b004.png";
-import newplot_2e11384e_bcac_80a3_a6fa_e8634e0e2206 from "../assets/image/newplot_2e11384e-bcac-80a3-a6fa-e8634e0e2206.png";
-import newplot_2e41384e_bcac_80c0_aef5_e71fdbaccd8d from "../assets/image/newplot_2e41384e-bcac-80c0-aef5-e71fdbaccd8d.png";
-import newplot_2da1384e_bcac_80d6_a8b9_da80324f8fef from "../assets/image/newplot_2da1384e-bcac-80d6-a8b9-da80324f8fef.png";
-import newplot_2e71384e_bcac_8027_ae32_c133627ede4a from "../assets/image/newplot_2e71384e-bcac-8027-ae32-c133627ede4a.png";
-import newplot_2f71384e_bcac_80c6_a99e_f52084fc497b from "../assets/image/newplot_2f71384e-bcac-80c6-a99e-f52084fc497b.png";
-import newplot_2f71384e_bcac_80d8_9985_e195d39f1e70 from "../assets/image/newplot_2f71384e-bcac-80d8-9985-e195d39f1e70.png";
-import newplot_2d21384e_bcac_80ab_a6dd_e31a6c150e61 from "../assets/image/newplot_2d21384e-bcac-80ab-a6dd-e31a6c150e61.png";
-import newplot_2e11384e_bcac_80ea_88cc_c971b2816596 from "../assets/image/newplot_2e11384e-bcac-80ea-88cc-c971b2816596.png";
-import newplot_2e11384e_bcac_8032_9835_e1407f4d780d from "../assets/image/newplot_2e11384e-bcac-8032-9835-e1407f4d780d.png";
-import newplot_2df1384e_bcac_80bc_b93c_ee8e9cfd5529 from "../assets/image/newplot_2df1384e-bcac-80bc-b93c-ee8e9cfd5529.png";
-import newplot_2df1384e_bcac_8018_b1f6_da1dcde1f90a from "../assets/image/newplot_2df1384e-bcac-8018-b1f6-da1dcde1f90a.png";
-import newplot_2e01384e_bcac_8017_9829_cd0c1db928c6 from "../assets/image/newplot_2e01384e-bcac-8017-9829-cd0c1db928c6.png";
-import newplot_2e01384e_bcac_806f_8bf1_f7e5405a2ff9 from "../assets/image/newplot_2e01384e-bcac-806f-8bf1-f7e5405a2ff9.png";
-import newplot_2d61384e_bcac_8092_baca_c17346b95734 from "../assets/image/newplot_2d61384e-bcac-8092-baca-c17346b95734.png";
-import newplot_2e41384e_bcac_8065_b313_c38a6db4ac31 from "../assets/image/newplot_2e41384e-bcac-8065-b313-c38a6db4ac31.png";
-import newplot_2df1384e_bcac_8010_abe7_cf477262b8d6 from "../assets/image/newplot_2df1384e-bcac-8010-abe7-cf477262b8d6.png";
-import newplot_2e11384e_bcac_80bc_810d_d13554c628dc from "../assets/image/newplot_2e11384e-bcac-80bc-810d-d13554c628dc.png";
-import newplot_2f61384e_bcac_80d9_ab81_d57a228847cf from "../assets/image/newplot_2f61384e-bcac-80d9-ab81-d57a228847cf.png";
-import newplot_2ee1384e_bcac_80da_82cd_df97247e2e72 from "../assets/image/newplot_2ee1384e-bcac-80da-82cd-df97247e2e72.png";

## Experiments

@@ -99,7 +74,26 @@ DCLM, REWIRE and Nemotron-HQ-Synth are the strongest baselines in our setup by a

Using gemma-3-1b, the prompt from REWIRE (guided_rewrite_original) is on par with DCLM in our setup. Nemotron-HQ-Synth was created using five prompts: diverse_qa_pairs, extract_knowledge, distill, wikipedia_style_rephrasing and knowledge_list. The only prompt that really works well in our setup is diverse_qa_pairs. This is mainly due to very strong performance on SQuAD. We used fineweb-edu-hq as the source dataset for all prompts.

-<Image ... />
+<HtmlEmbed
+  id="dissecting-baselines"
+  src="d3-benchmark-comparison.html"
+  title="Dissecting Synthetic Baselines"
+  desc="Figure: Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu (HQ)."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-diverse_qa_pairs_1b_hq": "Diverse QA Pairs",
+      dclm: "DCLM",
+      "mix-fw_edu_hq-extract_knowledge_1b_hq": "Extract Knowledge",
+      "mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Guided Rewrite (REWIRE)",
+      nemotron_hq_synth: "Nemotron-HQ-Synth",
+      "mix-fw_edu_hq-distill_1b_hq": "Distill",
+      "mix-fw_edu_hq-wikipedia_style_rephrasing_1b_hq": "Wikipedia Rephrasing",
+      "mix-fw_edu_hq-knowledge_list_1b_hq": "Knowledge List",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

We see that dclm is a very strong baseline: apart from the diverse_qa_pairs prompt from the Nemotron-HQ-Synth dataset, no other open prior work outperforms dclm. Can we do better with different prompts?

@@ -107,7 +101,23 @@ We see that dclm is a very strong baseline: apart from the diverse_qa_pairs prom

We found four prompts that outperform both fw_edu_hq and the challenging dclm baseline: math, table, faq and tutorial.

-<Image ... />
+<HtmlEmbed
+  id="new-prompts"
+  src="d3-benchmark-comparison.html"
+  title="New Prompt Performance"
+  desc="Figure: Four new prompts (math, table, faq, tutorial) compared against DCLM and FineWeb-Edu (HQ)."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-math_1b_hq": "Math",
+      "mix-fw_edu_hq-table_1b_hq": "Table",
+      "mix-fw_edu_hq-faq_1b_hq": "FAQ",
+      "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

For now we just used the Gemma-3-1b model, but can we do better by changing the rephrasing model?

@@ -119,11 +129,45 @@ In general, we want to know whether using a stronger model leads to better synth

We compare rephrasing with all Gemma-3 sizes (270m, 1b, 4b, 12b, 27b) using the tutorial prompt. We find that the 270m model underperforms, but otherwise there is no significant difference.

-<Image ... />
+<HtmlEmbed
+  id="model-size-tutorial"
+  src="d3-benchmark-comparison.html"
+  title="Model Size: Tutorial Prompt"
+  desc="Figure: Gemma-3 model sizes (270M to 27B) on the tutorial prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
+      "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
+      "mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
+      "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
+      "mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

Potentially, writing a tutorial is easy enough that we only need larger models for harder prompts such as math. So we tested it there too, but we find similar results, with the 270m underperforming and no large difference between 1b, 4b, 12b and 27b.

-<Image ... />
+<HtmlEmbed
+  id="model-size-math"
+  src="d3-benchmark-comparison.html"
+  title="Model Size: Math Prompt"
+  desc="Figure: Gemma-3 model sizes (270M to 27B) on the math prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-math_1b_hq": "Gemma-3 1B",
+      "mix-fw_edu_hq-math_4b_hq": "Gemma-3 4B",
+      "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
+      "mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
+      "mix-fw_edu_hq-math_270m_hq": "Gemma-3 270M",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

TODO: also run this experiment for the REWIRE prompt, since the original authors claim that larger models are necessary there

@@ -131,34 +175,140 @@ TODO: also run this experiment for the REWIRE prompt since the original authors

The [REWIRE](https://arxiv.org/abs/2506.04689) paper claims that for upcycling low-quality data we need large models (Llama-3.3 70B in their case). Is this true?
Continue prompt: For the 1b model the source data does not seem to matter, but the 12b model can make better use of the hq data.

-<Image ... />
+<HtmlEmbed
+  id="size-quality-continue"
+  src="d3-benchmark-comparison.html"
+  title="Model Size vs Data Quality: Continue Prompt"
+  desc="Figure: 1B vs 12B model on HQ vs LQ data using the continue prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-continue_12b_hq": "12B, HQ Source",
+      "mix-fw_edu_hq-continue_1b_hq": "1B, HQ Source",
+      "mix-fw_edu_hq-continue_1b_lq": "1B, LQ Source",
+      "mix-fw_edu_hq-continue_12b_lq": "12B, LQ Source"
+    }
+  }}
+/>

Tutorial prompt: For the hq data the model size does not seem to matter, whereas for the lq data the larger model is slightly better.

-<Image ... />
+<HtmlEmbed
+  id="size-quality-tutorial"
+  src="d3-benchmark-comparison.html"
+  title="Model Size vs Data Quality: Tutorial Prompt"
+  desc="Figure: 1B vs 12B model on HQ vs LQ data using the tutorial prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
+      "mix-fw_edu_hq-tutorial_12b_hq": "12B, HQ Source",
+      "mix-fw_edu_hq-tutorial_12b_lq": "12B, LQ Source",
+      "mix-fw_edu_hq-tutorial_1b_lq": "1B, LQ Source"
+    }
+  }}
+/>

FAQ prompt: Surprisingly, the 1b model is better for both lq and hq data.

-<Image ... />
+<HtmlEmbed
+  id="size-quality-faq"
+  src="d3-benchmark-comparison.html"
+  title="Model Size vs Data Quality: FAQ Prompt"
+  desc="Figure: 1B vs 12B model on HQ vs LQ data using the FAQ prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-faq_1b_hq": "1B, HQ Source",
+      "mix-fw_edu_hq-faq_1b_lq": "1B, LQ Source",
+      "mix-fw_edu_hq-faq_12b_hq": "12B, HQ Source",
+      "mix-fw_edu_hq-faq_12b_lq": "12B, LQ Source"
+    }
+  }}
+/>

In general we cannot reproduce REWIRE's claim that large models are needed for lq data. Overall we rarely see benefits from using models larger than 1b. So as long as the model has some baseline capability (in our experiments already reached at the 1b scale), we see no evidence of a clear benefit from using larger models for rephrasing. For these reasons we default to the 1b size for maximum throughput from here on. We hypothesize that most rephrasing tasks are simple enough for smaller models to handle sufficiently well.
#### Does the model family matter?

Some model families may be better suited for rephrasing than others based on their training data. This is why we test top families at the 1B scale on the four top-performing prompts: tutorial, faq, table and math. We find that for the tutorial prompt at the 1B scale, Llama-3.2, Granite3, Gemma-3, Qwen3 and Falcon3 perform roughly at the same level. SmolLM2 clearly outperforms them.

-<Image ... />
+<HtmlEmbed
+  id="model-family-tutorial"
+  src="d3-benchmark-comparison.html"
+  title="Model Family: Tutorial Prompt"
+  desc="Figure: Model families compared on the tutorial prompt at ~1B scale."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3",
+      "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3",
+      "mix-fw_edu_hq-tutorial_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2"
+    }
+  }}
+/>

For the faq prompt, SmolLM2 again clearly outperforms the others, while Qwen3 underperforms.

-<Image ... />
+<HtmlEmbed
+  id="model-family-faq"
+  src="d3-benchmark-comparison.html"
+  title="Model Family: FAQ Prompt"
+  desc="Figure: Model families compared on the FAQ prompt at ~1B scale."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-faq_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-faq_llama3.2_1b_hq": "Llama-3.2",
+      "mix-fw_edu_hq-faq_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-faq_1b_hq": "Gemma-3",
+      "mix-fw_edu_hq-faq_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-faq_qwen3_1.7b_hq": "Qwen3"
+    }
+  }}
+/>

For the table prompt we again see SmolLM2 and, to some degree, Falcon3 outperform.

-<Image ... />
+<HtmlEmbed
+  id="model-family-table"
+  src="d3-benchmark-comparison.html"
+  title="Model Family: Table Prompt"
+  desc="Figure: Model families compared on the table prompt at ~1B scale."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-table_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-table_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-table_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-table_qwen3_1.7b_hq": "Qwen3",
+      "mix-fw_edu_hq-table_llama3.2_1b_hq": "Llama-3.2",
+      "mix-fw_edu_hq-table_1b_hq": "Gemma-3"
+    }
+  }}
+/>

Finally, math is again a clear win for SmolLM2, with Qwen3 underperforming.

-<Image ... />
+<HtmlEmbed
+  id="model-family-math"
+  src="d3-benchmark-comparison.html"
+  title="Model Family: Math Prompt"
+  desc="Figure: Model families compared on the math prompt at ~1B scale."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-math_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-math_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-math_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-math_1b_hq": "Gemma-3",
+      "mix-fw_edu_hq-math_llama3.2_1b_hq": "Llama-3.2",
+      "mix-fw_edu_hq-math_qwen3_1.7b_hq": "Qwen3"
+    }
+  }}
+/>

We hypothesize that the consistently strong performance of SmolLM2 originates from [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its training data.
So the model family clearly seems to matter. However, SmolLM2 is already a year old. Are newer models better than older ones?

@@ -166,7 +316,23 @@ So the model family clearly seems to matter. However, SmolLM2 is already a year

We compare rephrasing with Qwen models from versions 1.5, 2, 2.5 and 3 using the tutorial prompt, one of the prompts that outperformed the DCLM baseline. While the differences are small, we find a trend that newer versions lead to higher evaluation performance.

-<Image ... />
+<HtmlEmbed
+  id="model-generation"
+  src="d3-benchmark-comparison.html"
+  title="Model Generation: Qwen Tutorial"
+  desc="Figure: Qwen model generations (1.5 to 3) on the tutorial prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3 (1.7B)",
+      "mix-fw_edu_hq-tutorial_qwen2.5_1.5b_hq": "Qwen2.5 (1.5B)",
+      "mix-fw_edu_hq-tutorial_qwen2_1.5b_hq": "Qwen2 (1.5B)",
+      dclm: "DCLM",
+      "mix-fw_edu_hq-tutorial_qwen1.5_1.8b_hq": "Qwen1.5 (1.8B)",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

So now we know that certain models are better than others, newer models tend to outperform older ones, and rephrasing models can usually be as small as 1B parameters. What difference do the dataset choices make?

@@ -176,11 +342,47 @@ So now we know that certain models are better than others, newer models tend to

To test the effect of the mix-in dataset we apply the tutorial prompt using Gemma-3-1b on fw_edu_hq and mix in dclm, cosmopedia, fw_edu_hq and fw_edu_lq. We find that the mix-in dataset makes a substantial difference, with cosmopedia and fw_edu_lq underperforming dclm and fw_edu_hq. fw_edu_hq and dclm achieve very similar accuracy even though dclm is much better by itself. We see that mixing in the synthetic data improves performance for all mix-in datasets. The effect is more pronounced for the weaker datasets fw_edu_lq and cosmopedia.

-<Image ... />
+<HtmlEmbed
+  id="mixin-dataset-hq-source"
+  src="d3-benchmark-comparison.html"
+  title="Mix-in Dataset Effect (HQ Source)"
+  desc="Figure: Effect of different mix-in datasets with fw_edu_hq as source for the tutorial prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-dclm-tutorial_1b_hq": "Mix-in: DCLM",
+      "mix-fw_edu_hq-tutorial_1b_hq": "Mix-in: FW-Edu (HQ)",
+      dclm: "DCLM",
+      "mix-fw_edu_lq-tutorial_1b_hq": "Mix-in: FW-Edu (LQ)",
+      "mix-cosmopedia-tutorial_1b_hq": "Mix-in: Cosmopedia",
+      fw_edu_hq: "FineWeb-Edu (HQ)",
+      cosmopedia: "Cosmopedia",
+      fw_edu_lq: "FineWeb-Edu (LQ)"
+    }
+  }}
+/>

Does this trend hold for other source datasets? We ran the experiment with fw_edu_lq as source and find similar results: fw_edu_hq and dclm outperform both cosmopedia and fw_edu_lq. For all mix-in datasets except dclm, adding synthetic data is beneficial.

-<Image ... />
+<HtmlEmbed
+  id="mixin-dataset-lq-source"
+  src="d3-benchmark-comparison.html"
+  title="Mix-in Dataset Effect (LQ Source)"
+  desc="Figure: Effect of different mix-in datasets with fw_edu_lq as source for the tutorial prompt."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      dclm: "DCLM",
+      "mix-fw_edu_hq-tutorial_1b_lq": "Mix-in: FW-Edu (HQ)",
+      "mix-dclm-tutorial_1b_lq": "Mix-in: DCLM",
+      fw_edu_hq: "FineWeb-Edu (HQ)",
+      "mix-cosmopedia-tutorial_1b_lq": "Mix-in: Cosmopedia",
+      cosmopedia: "Cosmopedia",
+      "mix-fw_edu_lq-tutorial_1b_lq": "Mix-in: FW-Edu (LQ)",
+      fw_edu_lq: "FineWeb-Edu (LQ)"
+    }
+  }}
+/>

So we know that the mix-in dataset plays a large role. What about the source dataset used for rephrasing?

@@ -188,23 +390,111 @@ So we know that the mix-in dataset plays a large role. What about the source dat

To investigate to what extent the source dataset for rephrasing matters, we rephrased dclm, cosmopedia, fw_edu_hq and fw_edu_lq using the Gemma-3-1B model and the tutorial and faq prompts. When we mix the source dataset in with the rephrased data, we find fw_edu_hq and dclm clearly outperforming fw_edu_lq and cosmopedia for both prompts.

-<Image ... />
+<HtmlEmbed
+  id="source-dataset-tutorial"
+  src="d3-benchmark-comparison.html"
+  title="Source Dataset: Tutorial (Mix-in = Source)"
+  desc="Figure: Effect of source dataset choice for the tutorial prompt when mix-in equals source."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_1b_hq": "Source: FW-Edu (HQ)",
+      "mix-dclm-tutorial_1b_dclm": "Source: DCLM",
+      "mix-cosmopedia-tutorial_1b_cosmopedia": "Source: Cosmopedia",
+      "mix-fw_edu_lq-tutorial_1b_lq": "Source: FW-Edu (LQ)"
+    }
+  }}
+/>

-<Image ... />
+<HtmlEmbed
+  id="source-dataset-faq"
+  src="d3-benchmark-comparison.html"
+  title="Source Dataset: FAQ (Mix-in = Source)"
+  desc="Figure: Effect of source dataset choice for the FAQ prompt when mix-in equals source."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-dclm-faq_1b_dclm": "Source: DCLM",
+      "mix-fw_edu_hq-faq_1b_hq": "Source: FW-Edu (HQ)",
+      "mix-fw_edu_lq-faq_1b_lq": "Source: FW-Edu (LQ)",
+      "mix-cosmopedia-faq_1b_cosmopedia": "Source: Cosmopedia"
+    }
+  }}
+/>

When we fix the mix-in dataset to fw_edu_hq, the difference shrinks drastically for the tutorial prompt and even more for the faq prompt. This corroborates our finding that the mix-in datasets seem to matter much more than the source rephrasing datasets.

-<Image ... />
+<HtmlEmbed
+  id="source-dataset-fixed-mixin-tutorial"
+  src="d3-benchmark-comparison.html"
+  title="Source Dataset: Tutorial (Fixed Mix-in: FW-Edu HQ)"
+  desc="Figure: Effect of source dataset for the tutorial prompt with fw_edu_hq as fixed mix-in."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_1b_dclm": "Source: DCLM",
+      "mix-fw_edu_hq-tutorial_1b_hq": "Source: FW-Edu (HQ)",
+      "mix-fw_edu_hq-tutorial_1b_cosmopedia": "Source: Cosmopedia",
+      "mix-fw_edu_hq-tutorial_1b_lq": "Source: FW-Edu (LQ)"
+    }
+  }}
+/>

-<Image ... />
+<HtmlEmbed
+  id="source-dataset-fixed-mixin-faq"
+  src="d3-benchmark-comparison.html"
+  title="Source Dataset: FAQ (Fixed Mix-in: FW-Edu HQ)"
+  desc="Figure: Effect of source dataset for the FAQ prompt with fw_edu_hq as fixed mix-in."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-faq_1b_dclm": "Source: DCLM",
+      "mix-fw_edu_hq-faq_1b_hq": "Source: FW-Edu (HQ)",
+      "mix-fw_edu_hq-faq_1b_lq": "Source: FW-Edu (LQ)",
+      "mix-fw_edu_hq-faq_1b_cosmopedia": "Source: Cosmopedia"
+    }
+  }}
+/>

#### Is synthetic data enough?

We were wondering whether just training on synthetic data works. While we get increased performance over fw_edu_hq, it does not match the original dataset's performance (DCLM) and is also clearly below the performance of the original dataset mixed with the rephrased one, for both the tutorial and faq prompts. We get the same result when we rephrase fw_edu_hq instead of dclm.

-<Image ... />
+<HtmlEmbed
+  id="synthetic-only-dclm"
+  src="d3-benchmark-comparison.html"
+  title="Is Synthetic Data Enough? (DCLM Source)"
+  desc="Figure: Synthetic-only vs mixed training with DCLM as source."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-dclm-faq_1b_dclm": "Mix: FAQ + DCLM",
+      dclm: "DCLM",
+      "mix-dclm-tutorial_1b_dclm": "Mix: Tutorial + DCLM",
+      faq_1b_dclm: "FAQ Only",
+      tutorial_1b_dclm: "Tutorial Only",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

-<Image ... />
+<HtmlEmbed
+  id="synthetic-only-hq"
+  src="d3-benchmark-comparison.html"
+  title="Is Synthetic Data Enough? (FW-Edu HQ Source)"
+  desc="Figure: Synthetic-only vs mixed training with FW-Edu (HQ) as source."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-faq_1b_hq": "Mix: FAQ + FW-Edu (HQ)",
+      "mix-fw_edu_hq-tutorial_1b_hq": "Mix: Tutorial + FW-Edu (HQ)",
+      dclm: "DCLM",
+      faq_1b_hq: "FAQ Only",
+      tutorial_1b_hq: "Tutorial Only",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

#### Does increased diversity help?

@@ -212,18 +502,71 @@ There are multiple ways of increasing diversity. We can think of mixing rephrasi
**Mixing rephrasing approaches**
We were wondering whether mixing the best-performing rephrasing approaches can improve over the individual approaches. We find no significant increase over the best-performing approach (mix-fw_edu_hq-math_1b_hq). It seems that when we mix together enough different prompts (mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq), we don't necessarily need the source dataset (fw_edu_hq) for good performance. This could mean that when training on just one synthetic dataset we need the original dataset for diversity, but when we mix multiple ones it is not necessary. However, it does not hurt and is an easy way of increasing the dataset size while keeping performance high. To follow up, it would be interesting to study how little synthetic data we can get away with without performance drops.

-<Image ... />
+<HtmlEmbed
+  id="mixing-approaches"
+  src="d3-benchmark-comparison.html"
+  title="Mixing Rephrasing Approaches"
+  desc="Figure: Mixing multiple prompts vs individual prompts."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_1b_hq-fw_edu_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts + FW-Edu (HQ)",
+      "mix-fw_edu_hq-math_1b_hq": "Math",
+      "mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts (No Source)",
+      "mix-fw_edu_hq-table_1b_hq": "Table",
+      "mix-fw_edu_hq-faq_1b_hq": "FAQ",
+      "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

**Mixing model families**
We rephrased using different model families and saw SmolLM2 and Falcon3 clearly outperform Llama3.2 and Granite3. Now we wonder whether mixing the rephrased outputs of multiple models improves performance through increased diversity.

-<Image ... />
+<HtmlEmbed
+  id="mixing-model-families"
+  src="d3-benchmark-comparison.html"
+  title="Mixing Model Families"
+  desc="Figure: Mixing rephrased outputs from different model families."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "SmolLM2 + Falcon3",
+      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_llama3.2_1b_hq": "SmolLM2 + Llama-3.2",
+      "mix-fw_edu_hq-tutorial_llama3.2_1b_hq-tutorial_granite3_1b_hq": "Llama-3.2 + Granite3",
+      "mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

It turns out that benchmark performance does not improve through increased rephrasing-model diversity but is largely an average of the mixed datasets' performance (smollm2 and falcon3 are similar to just smollm2, smollm2 and llama3.2 lie in between smollm2 and llama3.2, llama3.2 and granite3 are similar to just llama3.2).
**Mixing both rephrasing approaches and model families**
Maybe we need more diversity by mixing both rephrasing approaches and model families?

-<Image ... />
+<HtmlEmbed
+  id="mixing-both"
+  src="d3-benchmark-comparison.html"
+  title="Mixing Approaches and Model Families"
+  desc="Figure: Mixing both rephrasing approaches and model families."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-faq_smollm2_1.7b_hq": "FAQ (SmolLM2)",
+      "mix-fw_edu_hq-faq_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "FAQ (SmolLM2) + Tutorial (Falcon3)",
+      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "Tutorial (SmolLM2)",
+      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "Tutorial (SmolLM2) + Tutorial (Falcon3)",
+      "mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Tutorial (Falcon3)",
+      "mix-fw_edu_hq-faq_falcon3_1b_hq": "FAQ (Falcon3)",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

No, we get the same results as for just mixing rephrasing approaches or model families independently: the mix lands around the average performance instead of resulting in a gain.

@@ -231,7 +574,23 @@ No, we get the same results as for just mixing rephrasing approaches or model fa

The original REWIRE prompt contains many typos and grammar errors. To what extent do typos in the prompt hurt performance?

-<Image ... />
+<HtmlEmbed
+  id="typos-effect"
+  src="d3-benchmark-comparison.html"
+  title="Effect of Typos in Prompt"
+  desc="Figure: REWIRE prompt with original typos vs improved version at 1B and 12B scale."
+  config={{
+    defaultView: "line",
+    datasetNames: {
+      "mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Original (12B)",
+      "mix-fw_edu_hq-guided_rewrite_improved_12b_hq": "Improved (12B)",
+      dclm: "DCLM",
+      "mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Original (1B)",
+      "mix-fw_edu_hq-guided_rewrite_improved_1b_hq": "Improved (1B)",
+      fw_edu_hq: "FineWeb-Edu (HQ)"
+    }
+  }}
+/>

Surprisingly, typos don't have a negative effect on downstream model performance. For the 1b model, even the opposite is the case.

|
| 1 |
import HtmlEmbed from "../../components/HtmlEmbed.astro";
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2 |
|
| 3 |
## Experiments
|
| 4 |
|
|
|
|
| 74 |
|
| 75 |
Using gemma-3-1b, the prompt from REWIRE (guided_rewrite_original) is on-par with DCLM in our setup. Nemotron-HQ-Synth was created using five prompts: diverse_qa_pairs, extract_knowledge, distil, wikipedia_style_rephrasing and knowledge_list. The only prompt that really works well in our setup is diverse_qa_pairs. This is mainly due to very strong performance on SQUAD. We used fineweb-edu-hq as the source dataset for all prompts.
|
| 76 |
|
| 77 |
+
<HtmlEmbed
|
| 78 |
+
id="dissecting-baselines"
|
| 79 |
+
src="d3-benchmark-comparison.html"
|
| 80 |
+
title="Dissecting Synthetic Baselines"
|
| 81 |
+
desc="Figure: Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu (HQ)."
|
| 82 |
+
config={{
|
| 83 |
+
defaultView: "line",
|
| 84 |
+
datasetNames: {
|
| 85 |
+
"mix-fw_edu_hq-diverse_qa_pairs_1b_hq": "Diverse QA Pairs",
|
| 86 |
+
dclm: "DCLM",
|
| 87 |
+
"mix-fw_edu_hq-extract_knowledge_1b_hq": "Extract Knowledge",
|
| 88 |
+
"mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Guided Rewrite (REWIRE)",
|
| 89 |
+
nemotron_hq_synth: "Nemotron-HQ-Synth",
|
| 90 |
+
"mix-fw_edu_hq-distill_1b_hq": "Distill",
|
| 91 |
+
"mix-fw_edu_hq-wikipedia_style_rephrasing_1b_hq": "Wikipedia Rephrasing",
|
| 92 |
+
"mix-fw_edu_hq-knowledge_list_1b_hq": "Knowledge List",
|
| 93 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 94 |
+
}
|
| 95 |
+
}}
|
| 96 |
+
/>
|
| 97 |
|
| 98 |
We see that dclm is a very strong baseline: apart from the diverse_qa_pairs prompt from the Nemotron-HQ-Synth dataset, no other open prior work outperforms dclm. Can we do better with different prompts?
|
| 99 |
|
|
|
|
| 101 |
|
| 102 |
We found four prompts that outperform both fw-edu-hq and the challenging dclm baseline: math, table, faq and tutorial.
|
| 103 |
|
| 104 |
+
<HtmlEmbed
|
| 105 |
+
id="new-prompts"
|
| 106 |
+
src="d3-benchmark-comparison.html"
|
| 107 |
+
title="New Prompt Performance"
|
| 108 |
+
desc="Figure: Four new prompts (math, table, faq, tutorial) compared against DCLM and FineWeb-Edu (HQ)."
|
| 109 |
+
config={{
|
| 110 |
+
defaultView: "line",
|
| 111 |
+
datasetNames: {
|
| 112 |
+
"mix-fw_edu_hq-math_1b_hq": "Math",
|
| 113 |
+
"mix-fw_edu_hq-table_1b_hq": "Table",
|
| 114 |
+
"mix-fw_edu_hq-faq_1b_hq": "FAQ",
|
| 115 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
|
| 116 |
+
dclm: "DCLM",
|
| 117 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 118 |
+
}
|
| 119 |
+
}}
|
| 120 |
+
/>
|
| 121 |
|
| 122 |
For now we just used the Gemma-3-1b model, but can we do better by changing the rephrasing model?
|
| 123 |
|
|
|
|
| 129 |
|
| 130 |
We compare rephrasing with all Gemma-3 sizes (270m, 1b, 4b, 12b, 27b) using the tutorial prompt. We find that the 270m model underperforms but otherwise there is no significant difference.
|
| 131 |
|
| 132 |
+
<HtmlEmbed
|
| 133 |
+
id="model-size-tutorial"
|
| 134 |
+
src="d3-benchmark-comparison.html"
|
| 135 |
+
title="Model Size: Tutorial Prompt"
|
| 136 |
+
desc="Figure: Gemma-3 model sizes (270M to 27B) on the tutorial prompt."
|
| 137 |
+
config={{
|
| 138 |
+
defaultView: "line",
|
| 139 |
+
datasetNames: {
|
| 140 |
+
"mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
|
| 141 |
+
"mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
|
| 142 |
+
"mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
|
| 143 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
|
| 144 |
+
"mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
|
| 145 |
+
dclm: "DCLM",
|
| 146 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 147 |
+
}
|
| 148 |
+
}}
|
| 149 |
+
/>
|
| 150 |
|
| 151 |
Potentially, writing a tutorial is easy enough and we only need larger models for harder prompts such as Math. So we tested it there too, but find similar results with the 270m underperforming and no large difference between 1b, 4b, 12b and 27b.
|
| 152 |
|
| 153 |
+
<HtmlEmbed
|
| 154 |
+
id="model-size-math"
|
| 155 |
+
src="d3-benchmark-comparison.html"
|
| 156 |
+
title="Model Size: Math Prompt"
|
| 157 |
+
desc="Figure: Gemma-3 model sizes (270M to 27B) on the math prompt."
|
| 158 |
+
config={{
|
| 159 |
+
defaultView: "line",
|
| 160 |
+
datasetNames: {
|
| 161 |
+
"mix-fw_edu_hq-math_1b_hq": "Gemma-3 1B",
|
| 162 |
+
"mix-fw_edu_hq-math_4b_hq": "Gemma-3 4B",
|
| 163 |
+
"mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
|
| 164 |
+
"mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
|
| 165 |
+
"mix-fw_edu_hq-math_270m_hq": "Gemma-3 270M",
|
| 166 |
+
dclm: "DCLM",
|
| 167 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 168 |
+
}
|
| 169 |
+
}}
|
| 170 |
+
/>
|
| 171 |
|
| 172 |
TODO: also run this experiment for the REWIRE prompt since the original authors claim that larger models are necessary there
|
| 173 |
|
|
|
|
| 175 |
The [REWIRE](https://arxiv.org/abs/2506.04689) paper claims that for upcycling low quality data we need large models (Llama-3.3 70B in their case). Is this true?
|
| 176 |
Continue prompt: For the 1b model the source data does not seem to matter, but the 12b model can make use of the hq data better.
|
| 177 |
|
| 178 |
+
<HtmlEmbed
|
| 179 |
+
id="size-quality-continue"
|
| 180 |
+
src="d3-benchmark-comparison.html"
|
| 181 |
+
title="Model Size vs Data Quality: Continue Prompt"
|
| 182 |
+
desc="Figure: 1B vs 12B model on HQ vs LQ data using the continue prompt."
|
| 183 |
+
config={{
|
| 184 |
+
defaultView: "line",
|
| 185 |
+
datasetNames: {
|
| 186 |
+
"mix-fw_edu_hq-continue_12b_hq": "12B, HQ Source",
|
| 187 |
+
"mix-fw_edu_hq-continue_1b_hq": "1B, HQ Source",
|
| 188 |
+
"mix-fw_edu_hq-continue_1b_lq": "1B, LQ Source",
|
| 189 |
+
"mix-fw_edu_hq-continue_12b_lq": "12B, LQ Source"
|
| 190 |
+
}
|
| 191 |
+
}}
|
| 192 |
+
/>
|
| 193 |
|
| 194 |
Tutorial prompt: For the hq data the model size does not seem to matter whereas for the lq data the larger model is slightly better.
|
| 195 |
|
| 196 |
+
<HtmlEmbed
|
| 197 |
+
id="size-quality-tutorial"
|
| 198 |
+
src="d3-benchmark-comparison.html"
|
| 199 |
+
title="Model Size vs Data Quality: Tutorial Prompt"
|
| 200 |
+
desc="Figure: 1B vs 12B model on HQ vs LQ data using the tutorial prompt."
|
| 201 |
+
config={{
|
| 202 |
+
defaultView: "line",
|
| 203 |
+
datasetNames: {
|
| 204 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
|
| 205 |
+
"mix-fw_edu_hq-tutorial_12b_hq": "12B, HQ Source",
|
| 206 |
+
"mix-fw_edu_hq-tutorial_12b_lq": "12B, LQ Source",
|
| 207 |
+
"mix-fw_edu_hq-tutorial_1b_lq": "1B, LQ Source"
|
| 208 |
+
}
|
| 209 |
+
}}
|
| 210 |
+
/>
|
| 211 |
|
| 212 |
FAQ prompt: Surprisingly, the 1b model is better for both lq and hq data.
|
| 213 |
|
| 214 |
+
<HtmlEmbed
|
| 215 |
+
id="size-quality-faq"
|
| 216 |
+
src="d3-benchmark-comparison.html"
|
| 217 |
+
title="Model Size vs Data Quality: FAQ Prompt"
|
| 218 |
+
desc="Figure: 1B vs 12B model on HQ vs LQ data using the FAQ prompt."
|
| 219 |
+
config={{
|
| 220 |
+
defaultView: "line",
|
| 221 |
+
datasetNames: {
|
| 222 |
+
"mix-fw_edu_hq-faq_1b_hq": "1B, HQ Source",
|
| 223 |
+
"mix-fw_edu_hq-faq_1b_lq": "1B, LQ Source",
|
| 224 |
+
"mix-fw_edu_hq-faq_12b_hq": "12B, HQ Source",
|
| 225 |
+
"mix-fw_edu_hq-faq_12b_lq": "12B, LQ Source"
|
| 226 |
+
}
|
| 227 |
+
}}
|
| 228 |
+
/>
|
| 229 |
|
| 230 |
In general we cannot reproduce REWIRE's claim that large models are needed for lq data. Overall we rarely see benefits of using models larger than 1b. So as long as the model has some baseline level (in our experiments already reached at the 1b scale) we see no evidence for a clear benefit of using larger models for rephrasing. For these reasons we default to the 1b size for maximum throughput from here on. We hypothesize that most rephrasing tasks are simple enough for smaller models to handle sufficiently well.
|
| 231 |
#### Does the model family matter?
|
| 232 |
|
| 233 |
Some model families may be better suited for rephrasing than others based on their training data. This is why we test top families at the 1B scale on the four top-performing prompts tutorial, faq, table, math. We find that for the tutorial prompt at the 1B scale Llama-3.2, Granite-3, Gemma-3, and Qwen3 and Falcon3 perform roughly at the same level. SmolLM2 clearly outperforms.
|
| 234 |
|
| 235 |
+
<HtmlEmbed
|
| 236 |
+
id="model-family-tutorial"
|
| 237 |
+
src="d3-benchmark-comparison.html"
|
| 238 |
+
title="Model Family: Tutorial Prompt"
|
| 239 |
+
desc="Figure: Model families compared on the tutorial prompt at ~1B scale."
|
| 240 |
+
config={{
|
| 241 |
+
defaultView: "line",
|
| 242 |
+
datasetNames: {
|
| 243 |
+
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
|
| 244 |
+
"mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Falcon3",
|
| 245 |
+
"mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3",
|
| 246 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3",
|
| 247 |
+
"mix-fw_edu_hq-tutorial_granite3_1b_hq": "Granite3",
|
| 248 |
+
"mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2"
|
| 249 |
+
}
|
| 250 |
+
}}
|
| 251 |
+
/>
|
| 252 |
|
| 253 |
In the faq prompt SmolLM2 again clearly outperforms the others. Here Qwen3 underperforms.
|
| 254 |
|
| 255 |
+
<HtmlEmbed
|
| 256 |
+
id="model-family-faq"
|
| 257 |
+
src="d3-benchmark-comparison.html"
|
| 258 |
+
title="Model Family: FAQ Prompt"
|
| 259 |
+
desc="Figure: Model families compared on the FAQ prompt at ~1B scale."
|
| 260 |
+
config={{
|
| 261 |
+
defaultView: "line",
|
| 262 |
+
datasetNames: {
|
| 263 |
+
"mix-fw_edu_hq-faq_smollm2_1.7b_hq": "SmolLM2",
|
| 264 |
+
"mix-fw_edu_hq-faq_llama3.2_1b_hq": "Llama-3.2",
|
| 265 |
+
"mix-fw_edu_hq-faq_falcon3_1b_hq": "Falcon3",
|
| 266 |
+
"mix-fw_edu_hq-faq_1b_hq": "Gemma-3",
|
| 267 |
+
"mix-fw_edu_hq-faq_granite3_1b_hq": "Granite3",
|
| 268 |
+
"mix-fw_edu_hq-faq_qwen3_1.7b_hq": "Qwen3"
|
| 269 |
+
}
|
| 270 |
+
}}
|
| 271 |
+
/>
|
| 272 |
|
| 273 |
For the table prompt we again see SmolLM2 and to some degree Falcon3 outperform.
|
| 274 |
|
| 275 |
+
<HtmlEmbed
|
| 276 |
+
id="model-family-table"
|
| 277 |
+
src="d3-benchmark-comparison.html"
|
| 278 |
+
title="Model Family: Table Prompt"
|
| 279 |
+
desc="Figure: Model families compared on the table prompt at ~1B scale."
|
| 280 |
+
config={{
|
| 281 |
+
defaultView: "line",
|
| 282 |
+
datasetNames: {
|
| 283 |
+
"mix-fw_edu_hq-table_smollm2_1.7b_hq": "SmolLM2",
|
| 284 |
+
"mix-fw_edu_hq-table_falcon3_1b_hq": "Falcon3",
|
| 285 |
+
"mix-fw_edu_hq-table_granite3_1b_hq": "Granite3",
|
| 286 |
+
"mix-fw_edu_hq-table_qwen3_1.7b_hq": "Qwen3",
|
| 287 |
+
"mix-fw_edu_hq-table_llama3.2_1b_hq": "Llama-3.2",
|
| 288 |
+
"mix-fw_edu_hq-table_1b_hq": "Gemma-3"
|
| 289 |
+
}
|
| 290 |
+
}}
|
| 291 |
+
/>
|
| 292 |
|
| 293 |
Finally, math is again a clear win for SmolLM2 with Qwen3 underperforming.
|
| 294 |
|
| 295 |
+
<HtmlEmbed
|
| 296 |
+
id="model-family-math"
|
| 297 |
+
src="d3-benchmark-comparison.html"
|
| 298 |
+
title="Model Family: Math Prompt"
|
| 299 |
+
desc="Figure: Model families compared on the math prompt at ~1B scale."
|
| 300 |
+
config={{
|
| 301 |
+
defaultView: "line",
|
| 302 |
+
datasetNames: {
|
| 303 |
+
"mix-fw_edu_hq-math_smollm2_1.7b_hq": "SmolLM2",
|
| 304 |
+
"mix-fw_edu_hq-math_falcon3_1b_hq": "Falcon3",
|
| 305 |
+
"mix-fw_edu_hq-math_granite3_1b_hq": "Granite3",
|
| 306 |
+
"mix-fw_edu_hq-math_1b_hq": "Gemma-3",
|
| 307 |
+
"mix-fw_edu_hq-math_llama3.2_1b_hq": "Llama-3.2",
|
| 308 |
+
"mix-fw_edu_hq-math_qwen3_1.7b_hq": "Qwen3"
|
| 309 |
+
}
|
| 310 |
+
}}
|
| 311 |
+
/>
|
| 312 |
|
| 313 |
We hypothesize that the consistently strong performance of SmolLM2 originates from [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in the training data.
|
| 314 |
So the model family clearly seems to matter. However, SmolLM2 is already a year old. Are newer models better than older ones?
|
|
|
|
| 316 |
|
| 317 |
We compare rephrasing with Qwen models from versions 1.5, 2, 2.5 and 3 using the tutorial prompt, one of the prompts that outperformed the DCLM baseline. While the differences are small we find a trend that newer versions lead to higher evaluation performance.
|
| 318 |
|
| 319 |
+
<HtmlEmbed
|
| 320 |
+
id="model-generation"
|
| 321 |
+
src="d3-benchmark-comparison.html"
|
| 322 |
+
title="Model Generation: Qwen Tutorial"
|
| 323 |
+
desc="Figure: Qwen model generations (1.5 to 3) on the tutorial prompt."
|
| 324 |
+
config={{
|
| 325 |
+
defaultView: "line",
|
| 326 |
+
datasetNames: {
|
| 327 |
+
"mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3 (1.7B)",
|
| 328 |
+
"mix-fw_edu_hq-tutorial_qwen2.5_1.5b_hq": "Qwen2.5 (1.5B)",
|
| 329 |
+
"mix-fw_edu_hq-tutorial_qwen2_1.5b_hq": "Qwen2 (1.5B)",
|
| 330 |
+
dclm: "DCLM",
|
| 331 |
+
"mix-fw_edu_hq-tutorial_qwen1.5_1.8b_hq": "Qwen1.5 (1.8B)",
|
| 332 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 333 |
+
}
|
| 334 |
+
}}
|
| 335 |
+
/>
|
| 336 |
|
| 337 |
So now we know that certain models are better than others, newer models tend to outperform older models and usually rephrasing models can be as small as 1B parameters. What difference do the dataset choices make?
|
| 338 |
|
|
|
|
| 342 |
|
| 343 |
To test the effect of the mix-in dataset we apply the tutorial prompt using Gemma-3-1b on fw_edu_hq and mix in dclm, cosmopedia, fw_edu_hq and fw_edu_lq. We find that the mix-in dataset makes a substantial difference, with cosmopedia and fw_edu_lq underperforming dclm and fw_edu_hq. fw_edu_hq and dclm achieve very similar accuracy even though dclm is much better by itself. We see that mixing in the synthetic data improves performance for all mix-in datasets. The effect is more pronounced for the worse datasets fw_edu_lq and cosmopedia.
|
| 344 |
|
| 345 |
+
<HtmlEmbed
|
| 346 |
+
id="mixin-dataset-hq-source"
|
| 347 |
+
src="d3-benchmark-comparison.html"
|
| 348 |
+
title="Mix-in Dataset Effect (HQ Source)"
|
| 349 |
+
desc="Figure: Effect of different mix-in datasets with fw_edu_hq as source for the tutorial prompt."
|
| 350 |
+
config={{
|
| 351 |
+
defaultView: "line",
|
| 352 |
+
datasetNames: {
|
| 353 |
+
"mix-dclm-tutorial_1b_hq": "Mix-in: DCLM",
|
| 354 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Mix-in: FW-Edu (HQ)",
|
| 355 |
+
dclm: "DCLM",
|
| 356 |
+
"mix-fw_edu_lq-tutorial_1b_hq": "Mix-in: FW-Edu (LQ)",
|
| 357 |
+
"mix-cosmopedia-tutorial_1b_hq": "Mix-in: Cosmopedia",
|
| 358 |
+
fw_edu_hq: "FineWeb-Edu (HQ)",
|
| 359 |
+
cosmopedia: "Cosmopedia",
|
| 360 |
+
fw_edu_lq: "FineWeb-Edu (LQ)"
|
| 361 |
+
}
|
| 362 |
+
}}
|
| 363 |
+
/>
|
| 364 |
|
| 365 |
Does this trend hold for other source datasets? We ran the experiment for fw_edu_lq as source and find similar results: fw_edu_hq and dclm outperform both cosmopedia and fw_edu_lq. For all mix-in datasets except dclm, adding synthetic data is beneficial.
|
| 366 |
|
| 367 |
+
<HtmlEmbed
|
| 368 |
+
id="mixin-dataset-lq-source"
|
| 369 |
+
src="d3-benchmark-comparison.html"
|
| 370 |
+
title="Mix-in Dataset Effect (LQ Source)"
|
| 371 |
+
desc="Figure: Effect of different mix-in datasets with fw_edu_lq as source for the tutorial prompt."
|
| 372 |
+
config={{
|
| 373 |
+
defaultView: "line",
|
| 374 |
+
datasetNames: {
|
| 375 |
+
dclm: "DCLM",
|
| 376 |
+
"mix-fw_edu_hq-tutorial_1b_lq": "Mix-in: FW-Edu (HQ)",
|
| 377 |
+
"mix-dclm-tutorial_1b_lq": "Mix-in: DCLM",
|
| 378 |
+
fw_edu_hq: "FineWeb-Edu (HQ)",
|
| 379 |
+
"mix-cosmopedia-tutorial_1b_lq": "Mix-in: Cosmopedia",
|
| 380 |
+
cosmopedia: "Cosmopedia",
|
| 381 |
+
"mix-fw_edu_lq-tutorial_1b_lq": "Mix-in: FW-Edu (LQ)",
|
| 382 |
+
fw_edu_lq: "FineWeb-Edu (LQ)"
|
| 383 |
+
}
|
| 384 |
+
}}
|
| 385 |
+
/>
|
| 386 |
|
| 387 |
So we know that the mix-in dataset plays a large role. What about the source dataset used for rephrasing?
|
| 388 |
|
|
|
|
| 390 |
|
| 391 |
To investigate to what extent the source dataset for rephrasing matters we rephrased dclm, cosmopedia, fw_edu_hq and fw_edu_lq using the Gemma-3-1B model and the tutorial and faq prompts. When we mix in the source dataset with the rephrased data we find fw_edu_hq and dclm clearly outperforming fw_edu_lq and cosmopedia for both prompts.
|
| 392 |
|
| 393 |
+
<HtmlEmbed
|
| 394 |
+
id="source-dataset-tutorial"
|
| 395 |
+
src="d3-benchmark-comparison.html"
|
| 396 |
+
title="Source Dataset: Tutorial (Mix-in = Source)"
|
| 397 |
+
desc="Figure: Effect of source dataset choice for the tutorial prompt when mix-in equals source."
|
| 398 |
+
config={{
|
| 399 |
+
defaultView: "line",
|
| 400 |
+
datasetNames: {
|
| 401 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Source: FW-Edu (HQ)",
|
| 402 |
+
"mix-dclm-tutorial_1b_dclm": "Source: DCLM",
|
| 403 |
+
"mix-cosmopedia-tutorial_1b_cosmopedia": "Source: Cosmopedia",
|
| 404 |
+
"mix-fw_edu_lq-tutorial_1b_lq": "Source: FW-Edu (LQ)"
|
| 405 |
+
}
|
| 406 |
+
}}
|
| 407 |
+
/>
|
| 408 |
|
| 409 |
+
<HtmlEmbed
|
| 410 |
+
id="source-dataset-faq"
|
| 411 |
+
src="d3-benchmark-comparison.html"
|
| 412 |
+
title="Source Dataset: FAQ (Mix-in = Source)"
|
| 413 |
+
desc="Figure: Effect of source dataset choice for the FAQ prompt when mix-in equals source."
|
| 414 |
+
config={{
|
| 415 |
+
defaultView: "line",
|
| 416 |
+
datasetNames: {
|
| 417 |
+
"mix-dclm-faq_1b_dclm": "Source: DCLM",
|
| 418 |
+
"mix-fw_edu_hq-faq_1b_hq": "Source: FW-Edu (HQ)",
|
| 419 |
+
"mix-fw_edu_lq-faq_1b_lq": "Source: FW-Edu (LQ)",
|
| 420 |
+
"mix-cosmopedia-faq_1b_cosmopedia": "Source: Cosmopedia"
|
| 421 |
+
}
|
| 422 |
+
}}
|
| 423 |
+
/>
|
| 424 |
|
| 425 |
When fix the mix-in dataset to fw_edu_hq, the difference shrinks drastically for the tutorial prompt and even more for the faq prompt. This corroborates our finding that the mix-in datasets seem to matter much more than the source rephrasing datasets.
|
| 426 |
|
| 427 |
+
<HtmlEmbed
|
| 428 |
+
id="source-dataset-fixed-mixin-tutorial"
|
| 429 |
+
src="d3-benchmark-comparison.html"
|
| 430 |
+
title="Source Dataset: Tutorial (Fixed Mix-in: FW-Edu HQ)"
|
| 431 |
+
desc="Figure: Effect of source dataset for the tutorial prompt with fw_edu_hq as fixed mix-in."
|
| 432 |
+
config={{
|
| 433 |
+
defaultView: "line",
|
| 434 |
+
datasetNames: {
|
| 435 |
+
"mix-fw_edu_hq-tutorial_1b_dclm": "Source: DCLM",
|
| 436 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Source: FW-Edu (HQ)",
|
| 437 |
+
"mix-fw_edu_hq-tutorial_1b_cosmopedia": "Source: Cosmopedia",
|
| 438 |
+
"mix-fw_edu_hq-tutorial_1b_lq": "Source: FW-Edu (LQ)"
|
| 439 |
+
}
|
| 440 |
+
}}
|
| 441 |
+
/>
|
| 442 |
|
| 443 |
+
<HtmlEmbed
|
| 444 |
+
id="source-dataset-fixed-mixin-faq"
|
| 445 |
+
src="d3-benchmark-comparison.html"
|
| 446 |
+
title="Source Dataset: FAQ (Fixed Mix-in: FW-Edu HQ)"
|
| 447 |
+
desc="Figure: Effect of source dataset for the FAQ prompt with fw_edu_hq as fixed mix-in."
|
| 448 |
+
config={{
|
| 449 |
+
defaultView: "line",
|
| 450 |
+
datasetNames: {
|
| 451 |
+
"mix-fw_edu_hq-faq_1b_dclm": "Source: DCLM",
|
| 452 |
+
"mix-fw_edu_hq-faq_1b_hq": "Source: FW-Edu (HQ)",
|
| 453 |
+
"mix-fw_edu_hq-faq_1b_lq": "Source: FW-Edu (LQ)",
|
| 454 |
+
"mix-fw_edu_hq-faq_1b_cosmopedia": "Source: Cosmopedia"
|
| 455 |
+
}
|
| 456 |
+
}}
|
| 457 |
+
/>
|
| 458 |
|
| 459 |
#### Is synthetic data enough?
|
| 460 |
|
| 461 |
We were wondering whether just training on synthetic data works. While we get increased performance over fw-edu-hq, it does not match the original dataset performance (DCLM) and also is clearly below the performance of the original dataset mixed with the rephrased one for both the tutorial and faq prompts. We get the same result when we rephrase fw_edu_hq instead of dclm.
|
| 462 |
|
| 463 |
+
<HtmlEmbed
|
| 464 |
+
id="synthetic-only-dclm"
|
| 465 |
+
src="d3-benchmark-comparison.html"
|
| 466 |
+
title="Is Synthetic Data Enough? (DCLM Source)"
|
| 467 |
+
desc="Figure: Synthetic-only vs mixed training with DCLM as source."
|
| 468 |
+
config={{
|
| 469 |
+
defaultView: "line",
|
| 470 |
+
datasetNames: {
|
| 471 |
+
"mix-dclm-faq_1b_dclm": "Mix: FAQ + DCLM",
|
| 472 |
+
dclm: "DCLM",
|
| 473 |
+
"mix-dclm-tutorial_1b_dclm": "Mix: Tutorial + DCLM",
|
| 474 |
+
faq_1b_dclm: "FAQ Only",
|
| 475 |
+
tutorial_1b_dclm: "Tutorial Only",
|
| 476 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 477 |
+
}
|
| 478 |
+
}}
|
| 479 |
+
/>
|
| 480 |
|
| 481 |
+
<HtmlEmbed
|
| 482 |
+
id="synthetic-only-hq"
|
| 483 |
+
src="d3-benchmark-comparison.html"
|
| 484 |
+
title="Is Synthetic Data Enough? (FW-Edu HQ Source)"
|
| 485 |
+
desc="Figure: Synthetic-only vs mixed training with FW-Edu (HQ) as source."
|
| 486 |
+
config={{
|
| 487 |
+
defaultView: "line",
|
| 488 |
+
datasetNames: {
|
| 489 |
+
"mix-fw_edu_hq-faq_1b_hq": "Mix: FAQ + FW-Edu (HQ)",
|
| 490 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Mix: Tutorial + FW-Edu (HQ)",
|
| 491 |
+
dclm: "DCLM",
|
| 492 |
+
faq_1b_hq: "FAQ Only",
|
| 493 |
+
tutorial_1b_hq: "Tutorial Only",
|
| 494 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 495 |
+
}
|
| 496 |
+
}}
|
| 497 |
+
/>
|
| 498 |
|
| 499 |
#### Does increased diversity help?
|
| 500 |
|
|
|
|
| 502 |
**Mixing rephrasing approaches**
|
| 503 |
We were wondering whether mixing the best performing rephrasing approaches can improve over the individual approaches. We find no significant increase over the best performing approach (mix-fw_edu_hq-math_1b_hq). It seems that when we mix together enough different prompts (mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq), we don't necessarily need the source dataset (fw_edu_hq) for good performance. This could mean that when just training on one synthetic dataset we need the original dataset for diversity, but when we mix multiple ones it is not necessary. However, it does not hurt and is an easy way of increasing the dataset size while keeping the performance high. To follow up it would be interesting to study with how little synthetic data we can get away with without performance drops.
|
| 504 |
|
| 505 |
+
<HtmlEmbed
|
| 506 |
+
id="mixing-approaches"
|
| 507 |
+
src="d3-benchmark-comparison.html"
|
| 508 |
+
title="Mixing Rephrasing Approaches"
|
| 509 |
+
desc="Figure: Mixing multiple prompts vs individual prompts."
|
| 510 |
+
config={{
|
| 511 |
+
defaultView: "line",
|
| 512 |
+
datasetNames: {
|
| 513 |
+
"mix-fw_edu_hq-tutorial_1b_hq-fw_edu_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts + FW-Edu (HQ)",
|
| 514 |
+
"mix-fw_edu_hq-math_1b_hq": "Math",
|
| 515 |
+
"mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts (No Source)",
|
| 516 |
+
"mix-fw_edu_hq-table_1b_hq": "Table",
|
| 517 |
+
"mix-fw_edu_hq-faq_1b_hq": "FAQ",
|
| 518 |
+
"mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
|
| 519 |
+
dclm: "DCLM",
|
| 520 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 521 |
+
}
|
| 522 |
+
}}
|
| 523 |
+
/>
|
| 524 |
|
| 525 |
**Mixing model families**
|
| 526 |
We rephrased using different model families and saw SmolLM2 and Falcon3 clearly outperform Llama3.2 and Granite3. Now we wonder whether mixing the rephrased outputs of multiple models improves performance through increased diversity.
|
| 527 |
|
| 528 |
+
<HtmlEmbed
|
| 529 |
+
id="mixing-model-families"
|
| 530 |
+
src="d3-benchmark-comparison.html"
|
| 531 |
+
title="Mixing Model Families"
|
| 532 |
+
desc="Figure: Mixing rephrased outputs from different model families."
|
| 533 |
+
config={{
|
| 534 |
+
defaultView: "line",
|
| 535 |
+
datasetNames: {
|
| 536 |
+
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
|
| 537 |
+
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "SmolLM2 + Falcon3",
|
| 538 |
+
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_llama3.2_1b_hq": "SmolLM2 + Llama-3.2",
|
| 539 |
+
"mix-fw_edu_hq-tutorial_llama3.2_1b_hq-tutorial_granite3_1b_hq": "Llama-3.2 + Granite3",
|
| 540 |
+
"mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2",
|
| 541 |
+
dclm: "DCLM",
|
| 542 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 543 |
+
}
|
| 544 |
+
}}
|
| 545 |
+
/>
|
| 546 |
|
| 547 |
It turns out that benchmark performance does not improve through increased rephrasing model diversity but is largely an average of the mixed datasets performance (smollm2 and falcon3 are similar to just smollm2, smollm2 and llama3.2 lie in between smollm2 and llama3.2, llama3.2 and granite3 are similar to just llama3.2).
|
| 548 |
**Mixing both rephrasing approaches and model families**
|
| 549 |
Maybe we need more diversity by mixing both rephrasing approaches and model families?
|
| 550 |
|
| 551 |
+
<HtmlEmbed
|
| 552 |
+
id="mixing-both"
|
| 553 |
+
src="d3-benchmark-comparison.html"
|
| 554 |
+
title="Mixing Approaches and Model Families"
|
| 555 |
+
desc="Figure: Mixing both rephrasing approaches and model families."
|
| 556 |
+
config={{
|
| 557 |
+
defaultView: "line",
|
| 558 |
+
datasetNames: {
|
| 559 |
+
"mix-fw_edu_hq-faq_smollm2_1.7b_hq": "FAQ (SmolLM2)",
|
| 560 |
+
"mix-fw_edu_hq-faq_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "FAQ (SmolLM2) + Tutorial (Falcon3)",
|
| 561 |
+
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "Tutorial (SmolLM2)",
|
| 562 |
+
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "Tutorial (SmolLM2) + Tutorial (Falcon3)",
|
| 563 |
+
"mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Tutorial (Falcon3)",
|
| 564 |
+
"mix-fw_edu_hq-faq_falcon3_1b_hq": "FAQ (Falcon3)",
|
| 565 |
+
dclm: "DCLM",
|
| 566 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 567 |
+
}
|
| 568 |
+
}}
|
| 569 |
+
/>
|
| 570 |
|
| 571 |
No, we get the same results as for just mixing rephrasing approaches or model families independently: the mix is around the average performance instead of resulting in a gain.
|
| 572 |
|
|
|
|
| 574 |
|
| 575 |
The original REWIRE prompt contains many typos and grammar errors. To what extent do typos in the prompt hurt performance?
|
| 576 |
|
| 577 |
+
<HtmlEmbed
|
| 578 |
+
id="typos-effect"
|
| 579 |
+
src="d3-benchmark-comparison.html"
|
| 580 |
+
title="Effect of Typos in Prompt"
|
| 581 |
+
desc="Figure: REWIRE prompt with original typos vs improved version at 1B and 12B scale."
|
| 582 |
+
config={{
|
| 583 |
+
defaultView: "line",
|
| 584 |
+
datasetNames: {
|
| 585 |
+
"mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Original (12B)",
|
| 586 |
+
"mix-fw_edu_hq-guided_rewrite_improved_12b_hq": "Improved (12B)",
|
| 587 |
+
dclm: "DCLM",
|
| 588 |
+
"mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Original (1B)",
|
| 589 |
+
"mix-fw_edu_hq-guided_rewrite_improved_1b_hq": "Improved (1B)",
|
| 590 |
+
fw_edu_hq: "FineWeb-Edu (HQ)"
|
| 591 |
+
}
|
| 592 |
+
}}
|
| 593 |
+
/>
|
| 594 |
|
| 595 |
Surprisingly, typos don't have a negative effect on downstream model performance. For the 1b model, even the opposite is the case.
|
| 596 |
|
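For context, the `HtmlEmbed.astro` component itself is not part of this commit. A minimal sketch of what a component with this prop shape might look like follows; everything here (the `/embeds/` path, the `data-config` handoff, the figure markup) is an assumption inferred from the usage above, not the actual implementation:

```astro
---
// Hypothetical sketch, NOT the real HtmlEmbed.astro.
// Props mirror the usage in experiments.mdx above.
interface Props {
  id: string;
  src: string;                       // name of the d3 HTML fragment to embed
  title?: string;
  desc?: string;
  config?: Record<string, unknown>;  // e.g. { defaultView, datasetNames }
}
const { id, src, title, desc, config = {} } = Astro.props;
---
<figure id={id} class="html-embed">
  {title && <p class="html-embed__title">{title}</p>}
  <!-- Assumed convention: fragments live under /embeds/ and read their
       options from the data-config attribute of the host iframe. -->
  <iframe
    src={`/embeds/${src}`}
    data-config={JSON.stringify(config)}
    loading="lazy"
    title={title ?? id}
  ></iframe>
  {desc && <figcaption>{desc}</figcaption>}
</figure>
```

The appeal of this pattern over the deleted PNGs is visible in the diff itself: each figure becomes a declarative mapping from run names (e.g. `mix-fw_edu_hq-tutorial_1b_hq`) to display labels, so datasets can be renamed or reordered without regenerating images.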