Scientific Blog Writing Guidelines
Guidelines for writing casual, accessible scientific content in interactive blog posts.
Style References
Use these blog posts as inspiration for writing style:
- The Smol Training Playbook
- Transformers Tenets
- FineWeb: decanting the web for the finest text data at scale
- FineVision: Open Data is All You Need
- The Ultra-Scale Playbook
Voice and Tone
- Conversational but precise: Write like explaining to a curious colleague over coffee
- Direct address: Use "you" and "we" freely
- Active voice: Prefer "the model learns" over "learning is performed by the model"
- Short sentences: Break complex ideas into digestible chunks
- No hedging: Say "X improves Y" not "X tends to potentially improve Y"
Punctuation
- No em-dashes (—): Use parentheses, commas, or separate sentences instead
- En-dashes (–) for ranges only: "2020–2024", "pages 10–15"
- Minimal semicolons: Prefer two sentences over one with a semicolon
- Angle brackets in MDX: Use `{'<'}` and `{'>'}` to escape literal `<` and `>` characters in prose, since MDX interprets a bare `<` as the start of a JSX tag
- Curly braces in MDX: Avoid bare `{` and `}` in prose, since MDX interprets them as JSX expressions. Remove them or replace them with parentheses or other punctuation
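Here is a minimal sketch of the two MDX rules above as a text helper. `escapeMdxProse` is a hypothetical name, not something the template ships, and it is only meant for plain prose (not for code spans or real JSX tags).

```javascript
// Hypothetical helper (an assumption, not part of the blog template):
// make a plain-prose string safe to paste into MDX.
const MDX_REPLACEMENTS = {
  "<": "{'<'}", // otherwise MDX reads it as the start of a JSX tag
  ">": "{'>'}",
  "{": "(",     // bare braces would be parsed as a JSX expression
  "}": ")",
};

function escapeMdxProse(text) {
  // Single pass, so the braces introduced by {'<'} are never re-processed.
  return text.replace(/[<>{}]/g, (ch) => MDX_REPLACEMENTS[ch]);
}

console.log(escapeMdxProse("Use batch size < 32 and the set {a, b}"));
// → Use batch size {'<'} 32 and the set (a, b)
```

The single-pass replace matters: escaping `<` first and braces second would mangle the escapes you just inserted.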
Formatting
- Bold for key terms on first introduction
- Inline code for technical identifiers: model names, hyperparameters, file paths
- Lists for features, steps, or comparisons
- Keep paragraphs short (3-5 sentences max)
- Minimize horizontal scrolling in tables: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose
Structure
- Start sections with the conclusion or key insight
- Use progressive disclosure: overview first, details on demand
- Every claim should have a visible artifact (chart, code, demo) nearby
- Sidenotes for tangential but useful context
Technical Writing
- Define jargon on first use
- Concrete examples over abstract explanations
- Show, don't tell: prefer an interactive demo over a paragraph of description
- Link to sources, don't just cite them
- Never show plain URLs as link text: Always integrate links into descriptive text (e.g., write `[DataTrove inference benchmark](https://github.com/...)`, not `[github.com/huggingface/datatrove/...](https://github.com/...)`)
Interactivity
- Treat figures as first-class content, not decoration
- All images/plots must be wrapped as figures (using `<figure>` with `id` and `<figcaption>`), and every figure must be mentioned in the surrounding text using `<FigRef target="figure-id" />`. Do NOT include "Figure N:" in `<figcaption>` text, since the figure environment auto-numbers
- Interactive elements should reveal insight, not just look fancy
- Hover tooltips everywhere: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart
- Readable text size: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size
- Minimal padding: Keep internal chart padding/margins small so the majority of space is used for actual content
- Discuss every visualization: Every chart must be cited in the surrounding text with `<FigRef>` and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications
- Ensure all visualizations work in dark mode
- Mobile-friendly is mandatory
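The figure rules above are easy to check mechanically. This is a rough sketch, assuming figures are written as literal `<figure id="...">` tags in the MDX source. `auditFigures` and the sample ids are made up for illustration.

```javascript
// Hypothetical audit helper (an assumption, not a script shipped with the template).
function auditFigures(mdx) {
  // Collect ids from <figure id="..."> and targets from <FigRef target="..." />.
  const figureIds = [...mdx.matchAll(/<figure\b[^>]*\bid="([^"]+)"/g)].map((m) => m[1]);
  const refTargets = new Set(
    [...mdx.matchAll(/<FigRef\b[^>]*\btarget="([^"]+)"/g)].map((m) => m[1])
  );
  // Captions that start with a manual "Figure N:" prefix.
  const numberedCaptions = [
    ...mdx.matchAll(/<figcaption[^>]*>\s*(Figure\s+\d+\s*:[^<]*)/g),
  ].map((m) => m[1].trim());
  return {
    unreferenced: figureIds.filter((id) => !refTargets.has(id)),
    numberedCaptions,
  };
}

const sample = `
As <FigRef target="loss-curve" /> shows, the loss drops quickly.
<figure id="loss-curve"><figcaption>Training loss over steps.</figcaption></figure>
<figure id="throughput"><figcaption>Figure 2: Tokens per second.</figcaption></figure>
`;
console.log(auditFigures(sample));
// unreferenced: ["throughput"], numberedCaptions: ["Figure 2: Tokens per second."]
```

A regex pass like this misses figures produced by components, so treat it as a first filter and still verify the rendered page.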
Content Structure
- One file, one article: Main content lives in `app/src/content/article.mdx`
- Chapters for length: Split long articles into `app/src/content/chapters/` and import them
- Short sections: Each section answers one question or supports one idea
- TOC-friendly headings: Use H2–H4 with short, descriptive titles (auto-generates table of contents)
Math Notation
- Introduce symbols the first time they appear
- Prefer well-known notation over custom shorthand
- Plain-language interpretation after every significant formula
- Inline math: `$x^2$`, block math: `$$...$$`
- Use `\htmlId{name}{...}` for cross-referenceable equations
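As a small illustration of the rules above, here is what a cross-referenceable block equation might look like. The id `eq-softmax` is a made-up name, and whether `\htmlId` renders depends on how the math renderer is configured in your build.

```latex
$$
\htmlId{eq-softmax}{
  \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
}
$$
```

In plain language: each score $z_i$ is exponentiated and divided by the sum over all $K$ scores, so the outputs are positive and sum to 1.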
Citations and References
- Citations from `app/src/content/bibliography.bib` using `[@key]` or `@key`
- Footnotes: `[^id]` with definition `[^id]: explanation`
- Deep-link anything with `id` prop on `Image`, `HtmlEmbed`, `Reference` components
- Link format: `[Figure 1](#my-figure-id)`
Bibliography Organization
The bibliography file is organized into these sections (in order). Always place new entries in the correct section:
| Section | Comment header | What belongs here |
|---|---|---|
| Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
| Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
| Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
| Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
| Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
| Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
| Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
| Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |
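To make "place new entries in the correct section" checkable, here is a sketch that reads the comment headers from the table above. It assumes each header sits on its own line exactly as written (for example `% Datasets`). The helper names and sample keys are placeholders.

```javascript
// Section headers in the order listed in the table above.
const SECTION_ORDER = [
  "% Datasets", "% Synthetic data methods", "% Models", "% Architecture",
  "% Inference", "% Training", "% Tools", "% Benchmarks",
];

// Hypothetical helper: true if the headers present appear in the required order.
function headersInOrder(bibText) {
  const indices = bibText
    .split("\n")
    .map((line) => SECTION_ORDER.indexOf(line.trim()))
    .filter((index) => index !== -1);
  return indices.every((value, i) => i === 0 || value > indices[i - 1]);
}

// Hypothetical helper: the section header an entry key sits under, or null.
function sectionOfEntry(bibText, key) {
  let current = null;
  for (const rawLine of bibText.split("\n")) {
    const line = rawLine.trim();
    if (SECTION_ORDER.includes(line)) current = line;
    // BibTeX entries start like "@article{key," or "@misc{key,".
    const match = line.match(/^@\w+\{([^,\s]+),/);
    if (match && match[1] === key) return current;
  }
  return null;
}

const sampleBib = [
  "% Datasets",
  "@misc{example_dataset,",
  "  title = {Placeholder dataset entry}",
  "}",
  "% Models",
  "@article{example_model,",
  "  title = {Placeholder model entry}",
  "}",
].join("\n");

console.log(headersInOrder(sampleBib));                  // true
console.log(sectionOfEntry(sampleBib, "example_model")); // % Models
```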
Citation Placement Rules
- Papers: cite every occurrence: Place `[@key]` or `@key` every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation.
- Models, datasets, tools: cite on first occurrence only: Place `[@key]` the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation.
- Blog chapter order matters: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file.
- Everything citable gets cited: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
- Hyperlinks are not citations: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`), it still needs a `[@key]` citation
- Use `@software` for code-only references: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post
- Use `@misc` with `note = {Blog post}` for references that only exist as blog posts
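The "first occurrence across the whole blog" rule is easier to apply with a quick lookup. This sketch assumes you already have the chapter texts in rendering order. `chaptersCiting`, the chapter contents, and the key `example_dataset` are all hypothetical.

```javascript
// Escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: names of the chapters (in rendering order) that cite `key`
// as [@key] or @key. The lookahead keeps "@foo" from matching "@foo_v2".
function chaptersCiting(chapters, key) {
  const pattern = new RegExp("@" + escapeRegExp(key) + "(?![\\w:-])");
  return chapters.filter((c) => pattern.test(c.text)).map((c) => c.name);
}

// Placeholder chapters, listed in the order article.mdx would render them.
const chapters = [
  { name: "Introduction", text: "We start from a web dataset [@example_dataset]." },
  { name: "Setup", text: "No citations in this chapter." },
  { name: "Experiments", text: "The same dataset again [@example_dataset]." },
];

console.log(chaptersCiting(chapters, "example_dataset"));
// → [ 'Introduction', 'Experiments' ]
```

For a dataset, model, or tool, only the first chapter in the result needs the citation. For a paper, every mention should carry one, so a chapter that discusses the paper but is missing from the result needs a fix.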
Components Quick Reference
| Component | Use for |
|---|---|
| `<Sidenote>` | Contextual notes in the margin |
| `<Note variant="info/success/danger">` | Callouts, tips, warnings |
| `<Accordion>` | Collapsible details, code examples |
| `<Image>` | Optimized images with captions, zoom |
| `<HtmlEmbed>` | D3/Plotly charts, interactive figures |
| `<Wide>` / `<FullWidth>` | Break out of content column |
| `<Glossary>` | Hover definitions for terms |
| `<Quote>` | Attributed quotations |
| `<Stack>` | Multi-column layouts |
| `<FigRef>` | Auto-numbered figure cross-references (e.g., `<FigRef target="my-figure-id" />` renders as "Figure 3") |
Chart Selection
Match visualization to analytical goal:
- Compare values: Bar, dot plot
- Show distribution: Histogram, box plot, violin
- Part-to-whole: Pie, stacked bar, treemap
- Trends over time: Line chart
- Relationships: Scatter, bubble
- Flow/process: Sankey, flowchart
Prefer D3 over Plotly for custom, web-native visualizations.
Skills
Vibe Coding D3 Charts
Create interactive D3 charts as self-contained HTML fragments. Charts must be responsive, accessible, interactive, and dark mode ready.
Workflow:
- Read the directives at `app/src/content/embeds/vibe-code-d3-embeds-directives.md`
- Optionally use an existing chart from `app/src/content/embeds/` as a starting point
- Generate the chart with a prompt like:
  > I want you to code a new d3 chart named `yourchart`. I have one CSV file called `yourdata.csv` in the data folder. The csv has the following columns: `x`, `y`, `z`.
- Iterate with small adjustments until satisfied
Embedding:
Use the <HtmlEmbed> component in MDX:
<HtmlEmbed
id="fig1"
src="your-chart.html"
title="Chart Title"
desc="Figure 1: Description of the chart."
/>
Reference examples: app/src/content/chapters/demo/vibe-coding-charts.mdx
Taking Screenshots for Debugging
Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (npm run dev in the app/ directory).
Script: app/scripts/screenshot.mjs
Usage (run from the app/ directory):
# Screenshot the current viewport (above the fold)
node scripts/screenshot.mjs
# Screenshot a specific section by heading anchor
node scripts/screenshot.mjs --target "#experiments"
# Screenshot a specific figure by its id
node scripts/screenshot.mjs --target "#baselines-comparison"
# Screenshot a CSS selector (e.g., the first mermaid diagram)
node scripts/screenshot.mjs --target ".mermaid" --padding 80
# Full-page screenshot in dark mode
node scripts/screenshot.mjs --full-page --dark
# Custom output path
node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png
Options:
| Option | Default | Description |
|---|---|---|
| `--target <selector>` | (viewport) | CSS selector or `#anchor` to screenshot |
| `--output <path>` | `../../assets/screenshot-<target>.png` | Output file path |
| `--width <px>` | 1400 | Viewport width |
| `--height <px>` | 900 | Viewport height |
| `--full-page` | false | Capture the full scrollable page |
| `--padding <px>` | 40 | Extra padding around the target element |
| `--url <url>` | `http://localhost:4321/` | Page URL |
| `--wait <ms>` | 2000 | Extra wait after navigation for async content |
| `--dark` | false | Use dark theme |
When to use: After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues.
End-of-Interaction Checklist
After completing any content edit, perform this reference audit:
- Scan for uncited claims: Look for statements about methods, benchmarks, or prior work that lack citations
- Search for BibTeX entries: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
- Add to bibliography: Place new entries in `app/src/content/bibliography.bib` in the correct section (see Bibliography Organization above). Do not dump entries at the end of the file
- Insert citations: Add `[@key]` inline at the first occurrence across the whole blog (respecting chapter rendering order from `article.mdx`)
- Ask if unsure: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding
What needs citations:
- Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
- Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
- Methods and techniques (speculative decoding, RLHF, etc.)
- Architectural components (GQA, RoPE, Flash Attention, etc.)
- Training details (AdamW optimizer, WSD learning rate schedule, etc.)
- Inference engines (vLLM, SGLang, FlashInfer, etc.)
- Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
- Tools and frameworks (DataTrove, DSPy, etc.)
- Benchmark results or comparisons
- Claims about prior work ("X showed that...")
Python Dependencies
Always use uv to install Python packages (not pip). Example:
uv pip install pandas scipy numpy