# Scientific Blog Writing Guidelines

Guidelines for writing casual, accessible scientific content in interactive blog posts.

## Style References

Use these blog posts as inspiration for writing style:

- [The Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
- [Transformers Tenets](https://huggingface.co/spaces/transformers-community/Transformers-tenets)
- [FineWeb: decanting the web for the finest text data at scale](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)
- [FineVision: Open Data is All You Need](https://huggingface.co/spaces/HuggingFaceM4/FineVision)
- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)

## Voice and Tone

- **Conversational but precise**: Write like explaining to a curious colleague over coffee
- **Direct address**: Use "you" and "we" freely
- **Active voice**: Prefer "the model learns" over "learning is performed by the model"
- **Short sentences**: Break complex ideas into digestible chunks
- **No hedging**: Say "X improves Y" not "X tends to potentially improve Y"

## Punctuation

- **No em-dashes (—)**: Use parentheses, commas, or separate sentences instead
- **En-dashes (–)** for ranges only: "2020–2024", "pages 10–15"
- **Minimal semicolons**: Prefer two sentences over one with a semicolon
- **Angle brackets in MDX**: Use `{'<'}` and `{'>'}` to escape literal `<` and `>` characters in prose, since MDX interprets bare `<` as the start of a JSX tag
- **Curly braces in MDX**: Avoid bare `{` and `}` in prose since MDX interprets them as JSX expressions.
Remove them or replace with parentheses/other punctuation

## Formatting

- **Bold for key terms** on first introduction
- **Inline code** for technical identifiers: model names, hyperparameters, file paths
- Lists for features, steps, or comparisons
- Keep paragraphs short (3–5 sentences max)
- **Minimize horizontal scrolling in tables**: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose

## Structure

- Start sections with the conclusion or key insight
- Use progressive disclosure: overview first, details on demand
- Every claim should have a visible artifact (chart, code, demo) nearby
- Sidenotes for tangential but useful context

## Technical Writing

- Define jargon on first use
- Concrete examples over abstract explanations
- Show, don't tell: prefer an interactive demo over a paragraph of description
- Link to sources, don't just cite them
- **Never show plain URLs as link text**: Always integrate links into descriptive text (e.g., write `[DataTrove inference benchmark](https://github.com/...)` not `[github.com/huggingface/datatrove/...](https://github.com/...)`)

## Interactivity

- Treat figures as first-class content, not decoration
- **All images/plots must be wrapped as figures** (using `…`) and every figure must be mentioned in the surrounding text using `…`. Do NOT include "Figure N:" in `…` text since the figure environment auto-numbers
- Interactive elements should reveal insight, not just look fancy
- **Hover tooltips everywhere**: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart
- **Readable text size**: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size
- **Minimal padding**: Keep internal chart padding/margins small so the majority of space is used for actual content
- **Discuss every visualization**: Every chart must be cited in the surrounding text with `…` and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications
- Ensure all visualizations work in dark mode
- Mobile-friendly is mandatory

## Content Structure

- **One file, one article**: Main content lives in `app/src/content/article.mdx`
- **Chapters for length**: Split long articles into `app/src/content/chapters/` and import them
- **Short sections**: Each section answers one question or supports one idea
- **TOC-friendly headings**: Use H2–H4 with short, descriptive titles (auto-generates table of contents)

## Math Notation

- **Introduce symbols** the first time they appear
- **Prefer well-known notation** over custom shorthand
- **Plain-language interpretation** after every significant formula
- Inline math: `$x^2$`, block math: `$$...$$`
- Use `\htmlId{name}{...}` for cross-referenceable equations

## Citations and References

- Citations from `app/src/content/bibliography.bib` using `[@key]` or `@key`
- Footnotes: `[^id]` with definition `[^id]: explanation`
- Deep-link anything with `id` prop on `Image`, `HtmlEmbed`, `Reference` components
- Link format: `[Figure 1](#my-figure-id)`

### Bibliography Organization

The bibliography file is organized into these sections (in order).
Always place new entries in the correct section:

| Section | Comment header | What belongs here |
|---------|---------------|-------------------|
| Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
| Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
| Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
| Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
| Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
| Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
| Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
| Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |

### Citation Placement Rules

- **Papers: cite every occurrence**: Place `[@key]` or `@key` every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation.
- **Models, datasets, tools: cite on first occurrence only**: Place `[@key]` the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation.
- **Blog chapter order matters**: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file.
- **Everything citable gets cited**: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
- **Hyperlinks are not citations**: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`) it still needs a `[@key]` citation
- **Use `@software` for code-only references**: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post
- **Use `@misc` with `note = {Blog post}`** for references that only exist as blog posts

## Components Quick Reference

| Component | Use for |
|-----------|---------|
| `…` | Contextual notes in the margin |
| `…` | Callouts, tips, warnings |
| `…` | Collapsible details, code examples |
| `…` | Optimized images with captions, zoom |
| `…` | D3/Plotly charts, interactive figures |
| `…` / `…` | Break out of content column |
| `…` | Hover definitions for terms |
| `…` | Attributed quotations |
| `…` | Multi-column layouts |
| `…` | Auto-numbered figure cross-references (e.g., `…` renders as "Figure 3") |

## Chart Selection

Match visualization to analytical goal:

- **Compare values**: Bar, dot plot
- **Show distribution**: Histogram, box plot, violin
- **Part-to-whole**: Pie, stacked bar, treemap
- **Trends over time**: Line chart
- **Relationships**: Scatter, bubble
- **Flow/process**: Sankey, flowchart

Prefer D3 over Plotly for custom, web-native visualizations.

---

## Skills

### Vibe Coding D3 Charts

Create interactive D3 charts as self-contained HTML fragments. Charts must be **responsive**, **accessible**, **interactive**, and **dark mode ready**.

**Workflow:**

1. Read the directives at `app/src/content/embeds/vibe-code-d3-embeds-directives.md`
2. Optionally use an existing chart from `app/src/content/embeds/` as a starting point
3. Generate the chart with a prompt like:
   ```
   I want you to code a new d3 chart named `yourchart`.
   I have one CSV file called `yourdata.csv` in the data folder.
   The csv has the following columns: `x`, `y`, `z`.
   ```
4. Iterate with small adjustments until satisfied

**Embedding:** Use the `…` component in MDX:

```jsx
…
```

**Reference examples:** `app/src/content/chapters/demo/vibe-coding-charts.mdx`

### Taking Screenshots for Debugging

Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (`npm run dev` in the `app/` directory).

**Script:** `app/scripts/screenshot.mjs`

**Usage (run from the `app/` directory):**

```bash
# Screenshot the current viewport (above the fold)
node scripts/screenshot.mjs

# Screenshot a specific section by heading anchor
node scripts/screenshot.mjs --target "#experiments"

# Screenshot a specific figure by its id
node scripts/screenshot.mjs --target "#baselines-comparison"

# Screenshot a CSS selector (e.g., the first mermaid diagram)
node scripts/screenshot.mjs --target ".mermaid" --padding 80

# Full-page screenshot in dark mode
node scripts/screenshot.mjs --full-page --dark

# Custom output path
node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png
```

**Options:**

| Option | Default | Description |
|--------|---------|-------------|
| `--target` | (viewport) | CSS selector or `#anchor` to screenshot |
| `--output` | `../../assets/screenshot-.png` | Output file path |
| `--width` | 1400 | Viewport width |
| `--height` | 900 | Viewport height |
| `--full-page` | false | Capture the full scrollable page |
| `--padding` | 40 | Extra padding around the target element |
| `--url` | `http://localhost:4321/` | Page URL |
| `--wait` | 2000 | Extra wait after navigation for async content |
| `--dark` | false | Use dark theme |

**When to use:** After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues.

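
When several figures change at once, the screenshot commands can be batched. The sketch below is a hypothetical wrapper, not part of the repo. It builds one command per target in light and dark mode, using only the `--target`, `--output`, and `--dark` flags from the options table. The `shots/` folder, the file naming, and the assumption that `--output` is resolved relative to `app/` are ours.

```python
import subprocess
from pathlib import Path

SCRIPT = "scripts/screenshot.mjs"  # path from the usage section, relative to app/


def build_commands(targets: list[str], out_dir: str = "shots") -> list[list[str]]:
    """Build one command per target and theme (hypothetical helper)."""
    commands = []
    for target in targets:
        # "#experiments" -> "experiments", ".mermaid" -> "mermaid"
        name = target.lstrip("#.") or "viewport"
        for dark in (False, True):
            suffix = "dark" if dark else "light"
            output = (Path(out_dir) / f"{name}-{suffix}.png").as_posix()
            cmd = ["node", SCRIPT, "--target", target, "--output", output]
            if dark:
                cmd.append("--dark")
            commands.append(cmd)
    return commands


def run_all(targets: list[str], app_dir: str = "app", out_dir: str = "shots") -> None:
    """Run every command from app/. The dev server must already be running."""
    # Assumption: the script does not create the output folder itself.
    (Path(app_dir) / out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(targets, out_dir):
        subprocess.run(cmd, cwd=app_dir, check=True)
```

Call `run_all(["#experiments", "#baselines-comparison"])` from the repository root. Simple anchors and class selectors give clean file names. Selectors with spaces would need a different naming scheme.
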

---

## End-of-Interaction Checklist

After completing any content edit, perform this reference audit:

1. **Scan for uncited claims**: Look for statements about methods, benchmarks, or prior work that lack citations
2. **Search for BibTeX entries**: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` **in the correct section** (see Bibliography Organization above). Do not dump entries at the end of the file
4. **Insert citations**: Add `[@key]` inline at the first occurrence across the whole blog (respecting chapter rendering order from `article.mdx`)
5. **Ask if unsure**: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding

**What needs citations:**

- Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
- Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
- Methods and techniques (speculative decoding, RLHF, etc.)
- Architectural components (GQA, RoPE, FlashAttention, etc.)
- Training details (AdamW optimizer, WSD learning rate schedule, etc.)
- Inference engines (vLLM, SGLang, FlashInfer, etc.)
- Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
- Tools and frameworks (DataTrove, DSPy, etc.)
- Benchmark results or comparisons
- Claims about prior work ("X showed that...")

## Python Dependencies

Always use `uv` to install Python packages (not `pip`). Example:

```bash
uv pip install pandas scipy numpy
```
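
Small audit scripts for the end-of-interaction checklist often need no extra packages at all. The sketch below is a hypothetical helper, not an existing repo script. It lists citation keys that appear in MDX text as `[@key]` or `@key` but have no entry in the bibliography text. The function names are ours, and the regexes are simplifications rather than a full BibTeX or MDX parser.

```python
import re

# Matches the key in entries such as `@misc{fineweb,` or `@software{datatrove,`.
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
# Matches `@key` when it is not preceded by a word character (skips e-mail addresses).
CITATION = re.compile(r"(?<![\w@])@([A-Za-z0-9_][\w:.-]*)")


def bib_keys(bib_text: str) -> set[str]:
    """Collect every entry key defined in the bibliography text."""
    return set(BIB_ENTRY.findall(bib_text))


def cited_keys(mdx_text: str) -> list[str]:
    """Return cited keys in order of first appearance, without duplicates."""
    seen: list[str] = []
    for match in CITATION.finditer(mdx_text):
        key = match.group(1).rstrip(".:-")  # drop trailing sentence punctuation
        if key and key not in seen:
            seen.append(key)
    return seen


def missing_keys(mdx_text: str, bib_text: str) -> list[str]:
    """Keys cited in the MDX text that have no bibliography entry."""
    known = bib_keys(bib_text)
    return [key for key in cited_keys(mdx_text) if key not in known]
```

To cover the whole blog, concatenate the chapter files in rendering order and pass the result as `mdx_text`. The script only finds cited keys without an entry. Spotting uncited claims still needs a human read.
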