# Scientific Blog Writing Guidelines

Guidelines for writing casual, accessible scientific content in interactive blog posts.

## Style References

Use these blog posts as inspiration for writing style:

- [The Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
- [Transformers Tenets](https://huggingface.co/spaces/transformers-community/Transformers-tenets)
- [FineWeb: decanting the web for the finest text data at scale](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)
- [FineVision: Open Data is All You Need](https://huggingface.co/spaces/HuggingFaceM4/FineVision)
- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)

## Voice and Tone

- **Conversational but precise**: Write like explaining to a curious colleague over coffee
- **Direct address**: Use "you" and "we" freely
- **Active voice**: Prefer "the model learns" over "learning is performed by the model"
- **Short sentences**: Break complex ideas into digestible chunks
- **No hedging**: Say "X improves Y" not "X tends to potentially improve Y"

## Punctuation

- **No em-dashes (—)**: Use parentheses, commas, or separate sentences instead
- **En-dashes (–)** for ranges only: "2020–2024", "pages 10–15"
- **Minimal semicolons**: Prefer two sentences over one with a semicolon
- **Angle brackets in MDX**: Use `{'<'}` and `{'>'}` to escape literal `<` and `>` characters in prose, since MDX interprets bare `<` as the start of a JSX tag
- **Curly braces in MDX**: Avoid bare `{` and `}` in prose since MDX interprets them as JSX expressions. Remove them or replace with parentheses/other punctuation
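
The punctuation rules above lend themselves to a quick automated pass. The block below is a minimal sketch, not part of the project tooling: `punctuation_issues` is a hypothetical helper that skips fenced and inline code, ignores the `{'<'}` and `{'>'}` escapes, and flags em-dashes and any remaining curly braces. It treats every brace outside code as suspect, so lines with JSX props or imports will still need a manual look.

```python
import re

FENCE = "`" * 3                          # marker that opens or closes a fenced code block
INLINE_CODE = re.compile(r"`[^`]*`")     # inline code spans
MDX_ESCAPES = re.compile(r"\{'[<>]'\}")  # the {'<'} and {'>'} escapes
EM_DASH = "\u2014"


def punctuation_issues(mdx_text):
    """Return (line_number, issue) pairs for prose lines (hypothetical helper)."""
    issues = []
    in_fence = False
    for number, line in enumerate(mdx_text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence marker
            continue
        if in_fence:
            continue  # code blocks are exempt from prose rules
        # Drop inline code first, then the allowed escapes.
        prose = MDX_ESCAPES.sub("", INLINE_CODE.sub("", line))
        if EM_DASH in prose:
            issues.append((number, "em-dash"))
        if "{" in prose or "}" in prose:
            issues.append((number, "bare curly brace"))
    return issues
```

Pass it the text of a chapter file and review each reported line by hand before changing anything.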

## Formatting

- **Bold for key terms** on first introduction
- **Inline code** for technical identifiers: model names, hyperparameters, file paths
- Lists for features, steps, or comparisons
- Keep paragraphs short (3–5 sentences max)
- **Minimize horizontal scrolling in tables**: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose

## Structure

- Start sections with the conclusion or key insight
- Use progressive disclosure: overview first, details on demand
- Every claim should have a visible artifact (chart, code, demo) nearby
- Sidenotes for tangential but useful context

## Technical Writing

- Define jargon on first use
- Concrete examples over abstract explanations
- Show, don't tell: prefer an interactive demo over a paragraph of description
- Link to sources, don't just cite them
- **Never show plain URLs as link text**: Always integrate links into descriptive text (e.g., write `[DataTrove inference benchmark](https://github.com/...)` not `[github.com/huggingface/datatrove/...](https://github.com/...)`)

## Interactivity

- Treat figures as first-class content, not decoration
- **All images/plots must be wrapped as figures** (using `<figure>` with `id` and `<figcaption>`) and every figure must be mentioned in the surrounding text using `<FigRef target="figure-id" />`. Do NOT include "Figure N:" in `<figcaption>` text since the figure environment auto-numbers
- Interactive elements should reveal insight, not just look fancy
- **Hover tooltips everywhere**: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart
- **Readable text size**: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size
- **Minimal padding**: Keep internal chart padding/margins small so the majority of space is used for actual content
- **Discuss every visualization**: Every chart must be cited in the surrounding text with `<FigRef>` and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications
- Ensure all visualizations work in dark mode
- Mobile-friendly is mandatory
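
The figure rules above (every figure has an `id`, is cited with `<FigRef>`, and keeps "Figure N:" out of its caption) can be checked mechanically. Here is a minimal sketch, assuming figures are written as literal `<figure id="...">` elements. `audit_figures` is a hypothetical helper, and it does not look at ids set on `<HtmlEmbed>` or `<Image>`.

```python
import re

FIGURE_ID = re.compile(r'<figure\b[^>]*\bid="([^"]+)"')
FIGREF_TARGET = re.compile(r'<FigRef\b[^>]*\btarget="([^"]+)"')
# Captions that start with a manual "Figure N:" prefix.
NUMBERED_CAPTION = re.compile(r"<figcaption[^>]*>\s*Figure\s+\d+\s*:", re.IGNORECASE)


def audit_figures(mdx_text):
    """Summarize figure/reference mismatches in one MDX string (hypothetical helper)."""
    figure_ids = FIGURE_ID.findall(mdx_text)
    referenced = set(FIGREF_TARGET.findall(mdx_text))
    return {
        # Figures that no <FigRef> points to.
        "unreferenced": [fid for fid in figure_ids if fid not in referenced],
        # <FigRef> targets with no matching <figure id>.
        "dangling_refs": sorted(referenced - set(figure_ids)),
        # Number of captions that repeat the auto-generated numbering.
        "numbered_captions": len(NUMBERED_CAPTION.findall(mdx_text)),
    }
```

An empty `unreferenced` list, an empty `dangling_refs` list, and a `numbered_captions` count of zero suggest the chapter follows the rules. A screenshot is still the final check.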

## Content Structure

- **One file, one article**: Main content lives in `app/src/content/article.mdx`
- **Chapters for length**: Split long articles into `app/src/content/chapters/` and import them
- **Short sections**: Each section answers one question or supports one idea
- **TOC-friendly headings**: Use H2–H4 with short, descriptive titles (auto-generates table of contents)

## Math Notation

- **Introduce symbols** the first time they appear
- **Prefer well-known notation** over custom shorthand
- **Plain-language interpretation** after every significant formula
- Inline math: `$x^2$`, block math: `$$...$$`
- Use `\htmlId{name}{...}` for cross-referenceable equations
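
As an illustration of these rules, here is a hedged sketch of one block equation. The loss, the symbols, and the `eq-loss` id are illustrative choices, not taken from any article.

```latex
$$
\htmlId{eq-loss}{
  \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)
}
$$
```

A matching interpretation could read: $N$ is the number of training examples, $x_i$ is the $i$-th example, and $p_\theta(x_i)$ is the probability that the model with parameters $\theta$ assigns to it. In plain language, the loss is the model's average surprise on the data, so lower is better.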

## Citations and References

- Cite entries from `app/src/content/bibliography.bib` using `[@key]` or `@key`
- Footnotes: `[^id]` with definition `[^id]: explanation`
- Deep-link anything with `id` prop on `Image`, `HtmlEmbed`, `Reference` components
- Link format: `[Figure 1](#my-figure-id)`

### Bibliography Organization

The bibliography file is organized into these sections (in order). Always place new entries in the correct section:

| Section | Comment header | What belongs here |
|---------|---------------|-------------------|
| Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
| Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
| Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
| Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
| Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
| Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
| Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
| Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |
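
The section order in the table above can be verified with a few lines of Python. This is a minimal sketch, assuming each section starts with a comment line that matches the header exactly. `bib_sections` and `sections_in_order` are hypothetical helpers, not existing project scripts.

```python
EXPECTED_SECTIONS = [
    "Datasets",
    "Synthetic data methods",
    "Models",
    "Architecture",
    "Inference",
    "Training",
    "Tools",
    "Benchmarks",
]


def bib_sections(bib_text):
    """Return the known section headers in the order they appear."""
    found = []
    for line in bib_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("%"):
            title = stripped.lstrip("%").strip()
            if title in EXPECTED_SECTIONS:
                found.append(title)
    return found


def sections_in_order(bib_text):
    """True when the headers appear once each, in the expected order."""
    found = bib_sections(bib_text)
    return found == [name for name in EXPECTED_SECTIONS if name in found]
```

A file that is missing some sections still passes, as long as the ones present are in order and not duplicated.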

### Citation Placement Rules

- **Papers: cite every occurrence**: Place `[@key]` or `@key` every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation.
- **Models, datasets, tools: cite on first occurrence only**: Place `[@key]` the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation.
- **Blog chapter order matters**: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file.
- **Everything citable gets cited**: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
- **Hyperlinks are not citations**: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`) it still needs a `[@key]` citation
- **Use `@software` for code-only references**: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post
- **Use `@misc` with `note = {Blog post}`** for references that only exist as blog posts
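
To apply the "first occurrence" rule you need to know where each key first appears across chapters. The sketch below shows one way to find out, assuming you pass the chapter texts in rendering order. `first_citation_chapter` is a hypothetical helper, the citation keys in the usage note are made up, and the regex is a rough match for `[@key]` and `@key` rather than a full parser.

```python
import re

# "@key" not preceded by a word character or another "@" (skips emails).
CITATION = re.compile(r"(?<![\w@])@([A-Za-z][\w:.-]*\w)")


def first_citation_chapter(chapters):
    """Map each citation key to the first chapter that cites it.

    chapters: list of (chapter_name, mdx_text) tuples in rendering order.
    """
    first_seen = {}
    for name, text in chapters:
        for key in CITATION.findall(text):
            first_seen.setdefault(key, name)  # keep only the earliest chapter
    return first_seen
```

For example, with an `introduction` chapter that cites `[@fineweb-example]` and a later `setup` chapter that cites it again, the key maps to `introduction`. A model or dataset mentioned in an earlier chapter without a citation will not show up under that chapter, so compare the result against a plain-text search for the name.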

## Components Quick Reference

| Component | Use for |
|-----------|---------|
| `<Sidenote>` | Contextual notes in the margin |
| `<Note variant="info/success/danger">` | Callouts, tips, warnings |
| `<Accordion>` | Collapsible details, code examples |
| `<Image>` | Optimized images with captions, zoom |
| `<HtmlEmbed>` | D3/Plotly charts, interactive figures |
| `<Wide>` / `<FullWidth>` | Break out of content column |
| `<Glossary>` | Hover definitions for terms |
| `<Quote>` | Attributed quotations |
| `<Stack>` | Multi-column layouts |
| `<FigRef>` | Auto-numbered figure cross-references (e.g., `<FigRef target="my-figure-id" />` renders as "Figure 3") |

## Chart Selection

Match visualization to analytical goal:
- **Compare values**: Bar, dot plot
- **Show distribution**: Histogram, box plot, violin
- **Part-to-whole**: Pie, stacked bar, treemap
- **Trends over time**: Line chart
- **Relationships**: Scatter, bubble
- **Flow/process**: Sankey, flowchart

Prefer D3 over Plotly for custom, web-native visualizations.

---

## Skills

### Vibe Coding D3 Charts

Create interactive D3 charts as self-contained HTML fragments. Charts must be **responsive**, **accessible**, **interactive**, and **dark mode ready**.

**Workflow:**

1. Read the directives at `app/src/content/embeds/vibe-code-d3-embeds-directives.md`
2. Optionally use an existing chart from `app/src/content/embeds/` as a starting point
3. Generate the chart with a prompt like:
   ```
   I want you to code a new d3 chart named `yourchart`.
   I have one CSV file called `yourdata.csv` in the data folder.
   The csv has the following columns: `x`, `y`, `z`.
   ```
4. Iterate with small adjustments until satisfied

**Embedding:**

Use the `<HtmlEmbed>` component in MDX:

```jsx
<HtmlEmbed
  id="fig1"
  src="your-chart.html"
  title="Chart Title"
  desc="Figure 1: Description of the chart."
/>
```

**Reference examples:** `app/src/content/chapters/demo/vibe-coding-charts.mdx`

### Taking Screenshots for Debugging

Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (`npm run dev` in the `app/` directory).

**Script:** `app/scripts/screenshot.mjs`

**Usage (run from the `app/` directory):**

```bash
# Screenshot the current viewport (above the fold)
node scripts/screenshot.mjs

# Screenshot a specific section by heading anchor
node scripts/screenshot.mjs --target "#experiments"

# Screenshot a specific figure by its id
node scripts/screenshot.mjs --target "#baselines-comparison"

# Screenshot a CSS selector (e.g., the first mermaid diagram)
node scripts/screenshot.mjs --target ".mermaid" --padding 80

# Full-page screenshot in dark mode
node scripts/screenshot.mjs --full-page --dark

# Custom output path
node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png
```

**Options:**

| Option | Default | Description |
|--------|---------|-------------|
| `--target <selector>` | (viewport) | CSS selector or `#anchor` to screenshot |
| `--output <path>` | `../../assets/screenshot-<target>.png` | Output file path |
| `--width <px>` | 1400 | Viewport width |
| `--height <px>` | 900 | Viewport height |
| `--full-page` | false | Capture the full scrollable page |
| `--padding <px>` | 40 | Extra padding around the target element |
| `--url <url>` | `http://localhost:4321/` | Page URL |
| `--wait <ms>` | 2000 | Extra wait after navigation for async content |
| `--dark` | false | Use dark theme |

**When to use:** After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues.

---

## End-of-Interaction Checklist

After completing any content edit, perform this reference audit:

1. **Scan for uncited claims**: Look for statements about methods, benchmarks, or prior work that lack citations
2. **Search for BibTeX entries**: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` **in the correct section** (see Bibliography Organization above). Do not dump entries at the end of the file
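4. **Insert citations**: Add `[@key]` inline following the Citation Placement Rules: at every occurrence for papers, and at the first occurrence across the whole blog for models, datasets, and tools (respecting chapter rendering order from `article.mdx`)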
5. **Ask if unsure**: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding

**What needs citations:**
- Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
- Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
- Methods and techniques (speculative decoding, RLHF, etc.)
- Architectural components (GQA, RoPE, Flash Attention, etc.)
- Training details (AdamW optimizer, WSD learning rate schedule, etc.)
- Inference engines (vLLM, SGLang, FlashInfer, etc.)
- Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
- Tools and frameworks (DataTrove, DSPy, etc.)
- Benchmark results or comparisons
- Claims about prior work ("X showed that...")

## Python Dependencies

Always use `uv` to install Python packages instead of calling `pip` directly. Example:

```bash
uv pip install pandas scipy numpy
```