joelniklaus (HF Staff) committed
Commit 455a326 · 1 Parent(s): 9fab25e

added article and discussion results

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4ff88dedc4e0c1d7dd13f29a3bd9a68072119f1c2c5a9c48f7a6f2c893778615
-size 1245861
+oid sha256:27dd686263a9217a306811036fd361d7616dc6231393f311387d1b5dd065f595
+size 1334642
app/src/content/assets/data/rephrasing_metadata.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:19a1b032f82449d0c9dcaa9cda0c0db42fea5bc11e5007234bc0a2d27e45ff8c
-size 130560
+oid sha256:cac779aca41bc6f868d99a7c7fcc43343591b40ace727098341d52285c1ff856
+size 152802
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -232,7 +232,7 @@ Since model size barely matters, does the model family make a difference?
 
 #### Does the model family matter?
 
-We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on four prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all four prompts (see <FigRef target="model-family" />).
+We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on six prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all six prompts (see <FigRef target="model-family" />).
 
 <Sidenote>
 We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
@@ -244,6 +244,28 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
   desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
   config={{
     setups: {
+      "Article Prompt": {
+        datasets: {
+          "mix-fw_edu_hq-article_smollm2_1.7b_hq": "SmolLM2",
+          "mix-fw_edu_hq-article_falcon3_1b_hq": "Falcon3",
+          "mix-fw_edu_hq-article_granite3_1b_hq": "Granite3",
+          "mix-fw_edu_hq-article_1b_hq": "Gemma-3",
+          "mix-fw_edu_hq-article_llama3.2_1b_hq": "Llama-3.2",
+          "mix-fw_edu_hq-article_qwen3_1.7b_hq": "Qwen3",
+          dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+        }
+      },
+      "Discussion Prompt": {
+        datasets: {
+          "mix-fw_edu_hq-discussion_smollm2_1.7b_hq": "SmolLM2",
+          "mix-fw_edu_hq-discussion_falcon3_1b_hq": "Falcon3",
+          "mix-fw_edu_hq-discussion_granite3_1b_hq": "Granite3",
+          "mix-fw_edu_hq-discussion_1b_hq": "Gemma-3",
+          "mix-fw_edu_hq-discussion_llama3.2_1b_hq": "Llama-3.2",
+          "mix-fw_edu_hq-discussion_qwen3_1.7b_hq": "Qwen3",
+          dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+        }
+      },
       "Tutorial Prompt": {
         datasets: {
           "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
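The dataset keys added in this commit follow a regular naming scheme in which the prompt name is embedded in each key (e.g. `mix-fw_edu_hq-article_falcon3_1b_hq` under "Article Prompt"). A quick way to catch copy-paste mistakes when extending this config is a consistency check like the sketch below. This is a hypothetical helper, not part of the app; the `setups` object here is an abbreviated stand-in for the real config, and the `checkSetups` function name is an assumption.

```typescript
// A non-baseline dataset key in a setup should embed that setup's
// prompt name, e.g. "mix-fw_edu_hq-article_falcon3_1b_hq" under
// "Article Prompt". The baseline "dclm" entry is exempt.
type DatasetEntry = string | { display: string; color?: string; baseline?: boolean };
type Setups = Record<string, { datasets: Record<string, DatasetEntry> }>;

// Abbreviated stand-in for the config added in this commit.
const setups: Setups = {
  "Article Prompt": {
    datasets: {
      "mix-fw_edu_hq-article_smollm2_1.7b_hq": "SmolLM2",
      "mix-fw_edu_hq-article_falcon3_1b_hq": "Falcon3",
      dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
    },
  },
  "Discussion Prompt": {
    datasets: {
      "mix-fw_edu_hq-discussion_qwen3_1.7b_hq": "Qwen3",
      dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
    },
  },
};

// Returns [setupName, key] pairs whose key does not embed the prompt name.
function checkSetups(s: Setups): [string, string][] {
  const bad: [string, string][] = [];
  for (const [name, cfg] of Object.entries(s)) {
    const prompt = name.split(" ")[0].toLowerCase(); // "Article Prompt" -> "article"
    for (const [key, val] of Object.entries(cfg.datasets)) {
      if (typeof val === "object" && val.baseline) continue; // skip baseline entry
      if (!key.includes(`-${prompt}_`)) bad.push([name, key]);
    }
  }
  return bad;
}

console.log(checkSetups(setups)); // -> [] when all keys are consistent
```

A mis-pasted key such as `mix-fw_edu_hq-article_qwen3_1.7b_hq` left inside the "Discussion Prompt" block would show up in the returned list.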