Commit 33e5bb8
Parent(s): 2b72bd7
move model collapse to intro
app/src/content/chapters/1-introduction.mdx
CHANGED
@@ -2,6 +2,7 @@
 import Image from "../../components/Image.astro";
 import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Sidenote from "../../components/Sidenote.astro";
+import Note from "../../components/Note.astro";
 import FigRef from "../../components/FigRef.astro";
 import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";
 
@@ -58,4 +59,10 @@ Lavoisier replaced phlogiston theory with precise measurements and repeatable ex
 </Sidenote>
 
 We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the 90 [Experiments](#experiments) we ran to figure out which prompts, models, and datasets actually work. The [Analyses](#analyses) section zooms out to ask *why* things work the way they do. Next comes the [Infrastructure](#infrastructure) that made all of this possible, including detailed throughput benchmarking of popular models (super important for getting the most data for your buck). Finally, we [put it all together](#applying-the-recipe-at-scale) into FinePhrase, our best configuration.
+The sections below are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
 
+<Note variant="info" title="But wait, what about model collapse?">
+You might be wondering: doesn't training on synthetic data inevitably lead to model collapse? This is a common misconception that stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
+
+In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data. The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
+</Note>
app/src/content/chapters/2-setup.mdx
CHANGED
@@ -76,12 +76,4 @@ Since our model is small and trained on only 20B tokens, we use the **cloze form
 - **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
 
 
-### But Wait, What About Model Collapse?
-
-You might be wondering: doesn't training on synthetic data inevitably lead to model collapse? This is a common misconception that stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
-
-In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data.
-
-The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
-
 With all that context out of the way, let's get to the fun part: the experiments.