| # Scientific Blog Writing Guidelines | |
| Guidelines for writing casual, accessible scientific content in interactive blog posts. | |
| ## Style References | |
| Use these blog posts as inspiration for writing style: | |
| - [The Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) | |
| - [Transformers Tenets](https://huggingface.co/spaces/transformers-community/Transformers-tenets) | |
| - [FineWeb: decanting the web for the finest text data at scale](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) | |
| - [FineVision: Open Data is All You Need](https://huggingface.co/spaces/HuggingFaceM4/FineVision) | |
| - [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) | |
| ## Voice and Tone | |
| - **Conversational but precise**: Write as if explaining to a curious colleague over coffee | |
| - **Direct address**: Use "you" and "we" freely | |
| - **Active voice**: Prefer "the model learns" over "learning is performed by the model" | |
| - **Short sentences**: Break complex ideas into digestible chunks | |
| - **No hedging**: Say "X improves Y" not "X tends to potentially improve Y" | |
| ## Punctuation | |
| - **No em-dashes (—)**: Use parentheses, commas, or separate sentences instead | |
| - **En-dashes (–)** for ranges only: "2020–2024", "pages 10–15" | |
| - **Minimal semicolons**: Prefer two sentences over one with a semicolon | |
| - **Angle brackets in MDX**: Use `{'<'}` and `{'>'}` to escape literal `<` and `>` characters in prose, since MDX interprets bare `<` as the start of a JSX tag | |
| - **Curly braces in MDX**: Avoid bare `{` and `}` in prose since MDX interprets them as JSX expressions. Remove them or replace with parentheses/other punctuation | |
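The two MDX escaping rules above can be sketched as a small helper. This is a minimal illustration, not part of the blog template: the function name `escapeMdxProse` is hypothetical, and it assumes you only pass it plain prose (never JSX tags, math, or code spans, which must keep their literal characters).

```javascript
// Hypothetical helper (not part of the template): makes prose safe for MDX.
// Angle brackets become JSX string expressions; curly braces become parentheses.
const MDX_REPLACEMENTS = {
  "<": "{'<'}",
  ">": "{'>'}",
  "{": "(",
  "}": ")",
};

function escapeMdxProse(text) {
  // A single pass over the input means the braces introduced by {'<'}
  // are never seen again and cannot be re-escaped.
  return text.replace(/[<>{}]/g, (ch) => MDX_REPLACEMENTS[ch]);
}

console.log(escapeMdxProse("Keep loss < 2.0 when lr > 1e-4"));
// prints: Keep loss {'<'} 2.0 when lr {'>'} 1e-4
```

Replacing braces with parentheses is only one of the options the rule allows. Removing them or rewording the sentence is often cleaner.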
| ## Formatting | |
| - **Bold for key terms** on first introduction | |
| - **Inline code** for technical identifiers: model names, hyperparameters, file paths | |
| - Lists for features, steps, or comparisons | |
| - Keep paragraphs short (3–5 sentences max) | |
| - **Minimize horizontal scrolling in tables**: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose | |
| ## Structure | |
| - Start sections with the conclusion or key insight | |
| - Use progressive disclosure: overview first, details on demand | |
| - Every claim should have a visible artifact (chart, code, demo) nearby | |
| - Sidenotes for tangential but useful context | |
| ## Technical Writing | |
| - Define jargon on first use | |
| - Concrete examples over abstract explanations | |
| - Show, don't tell: prefer an interactive demo over a paragraph of description | |
| - Link to sources, don't just cite them | |
| - **Never show plain URLs as link text**: Always integrate links into descriptive text (e.g., write `[DataTrove inference benchmark](https://github.com/...)` not `[github.com/huggingface/datatrove/...](https://github.com/...)`) | |
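The "no plain URLs as link text" rule lends itself to a quick automated check. The sketch below is a heuristic under stated assumptions: `findPlainUrlLinks` is a hypothetical name, the regular expressions only cover simple inline Markdown links, and the `example.com` URLs are placeholders.

```javascript
// Hypothetical lint helper: flags Markdown links whose visible text looks like a URL.
function findPlainUrlLinks(markdown) {
  // Matches simple inline links of the form [text](href).
  const linkPattern = /\[([^\]]+)\]\(([^)\s]+)\)/g;
  // Heuristic: text starts with a scheme, "www.", or "domain.tld/".
  const urlLikeText = /^(https?:\/\/|www\.)|^[\w-]+(\.[\w-]+)+\//i;
  const offenders = [];
  for (const match of markdown.matchAll(linkPattern)) {
    const [, text, href] = match;
    if (urlLikeText.test(text.trim())) offenders.push({ text, href });
  }
  return offenders;
}

const sample = [
  "See the [benchmark write-up](https://example.com/bench).",
  "See [example.com/bench](https://example.com/bench).",
].join("\n");

console.log(findPlainUrlLinks(sample).map((link) => link.text));
// prints: [ 'example.com/bench' ]
```

Because the check is heuristic, treat its output as a list of candidates to review rather than a hard failure.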
| ## Interactivity | |
| - Treat figures as first-class content, not decoration | |
| - **All images/plots must be wrapped as figures** (using `<figure>` with `id` and `<figcaption>`) and every figure must be mentioned in the surrounding text using `<FigRef target="figure-id" />`. Do NOT include "Figure N:" in `<figcaption>` text since the figure environment auto-numbers | |
| - Interactive elements should reveal insight, not just look fancy | |
| - **Hover tooltips everywhere**: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart | |
| - **Readable text size**: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size | |
| - **Minimal padding**: Keep internal chart padding/margins small so the majority of space is used for actual content | |
| - **Discuss every visualization**: Every chart must be cited in the surrounding text with `<FigRef>` and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications | |
| - Ensure all visualizations work in dark mode | |
| - Mobile-friendly is mandatory | |
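The rule that every figure must be mentioned via `<FigRef>` can be checked mechanically. The following is a minimal sketch: `findUnreferencedFigures` is a hypothetical helper, and it assumes figures are written as `<figure id="...">` with double-quoted attributes. Figures declared through other components (for example `<HtmlEmbed id="...">`) would need an extra pattern.

```javascript
// Hypothetical helper: lists figure ids that no <FigRef target="..."> points at.
function findUnreferencedFigures(mdx) {
  // Collect ids of <figure> elements (double-quoted attributes only).
  const figureIds = [...mdx.matchAll(/<figure\b[^>]*\bid="([^"]+)"/g)].map(
    (match) => match[1]
  );
  // Collect every id referenced through <FigRef target="...">.
  const referenced = new Set(
    [...mdx.matchAll(/<FigRef\b[^>]*\btarget="([^"]+)"/g)].map((match) => match[1])
  );
  return figureIds.filter((id) => !referenced.has(id));
}

const sampleMdx = `
<FigRef target="loss-curve" /> shows the training loss.
<figure id="loss-curve"><img src="loss.png" /><figcaption>Training loss.</figcaption></figure>
<figure id="throughput"><img src="tp.png" /><figcaption>Throughput.</figcaption></figure>
`;

console.log(findUnreferencedFigures(sampleMdx));
// prints: [ 'throughput' ]
```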
| ## Content Structure | |
| - **One file, one article**: Main content lives in `app/src/content/article.mdx` | |
| - **Chapters for length**: Split long articles into `app/src/content/chapters/` and import them | |
| - **Short sections**: Each section answers one question or supports one idea | |
| - **TOC-friendly headings**: Use H2–H4 with short, descriptive titles (the table of contents is generated from them) | |
| ## Math Notation | |
| - **Introduce symbols** the first time they appear | |
| - **Prefer well-known notation** over custom shorthand | |
| - **Plain-language interpretation** after every significant formula | |
| - Inline math: `$x^2$`, block math: `$$...$$` | |
| - Use `\htmlId{name}{...}` for cross-referenceable equations | |
| ## Citations and References | |
| - Cite entries from `app/src/content/bibliography.bib` using `[@key]` or `@key` | |
| - Footnotes: `[^id]` with definition `[^id]: explanation` | |
| - Deep-link to any `Image`, `HtmlEmbed`, or `Reference` component by giving it an `id` prop | |
| - Link format: `[Figure 1](#my-figure-id)` (for figures, prefer `<FigRef target="my-figure-id" />`, which numbers automatically) | |
| ### Bibliography Organization | |
| The bibliography file is organized into these sections (in order). Always place new entries in the correct section: | |
| | Section | Comment header | What belongs here | | |
| |---------|---------------|-------------------| | |
| | Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) | | |
| | Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) | | |
| | Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) | | |
| | Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) | | |
| | Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) | | |
| | Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) | | |
| | Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) | | |
| | Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) | | |
| ### Citation Placement Rules | |
| - **Papers: cite every occurrence**: Place `[@key]` or `@key` every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation. | |
| - **Models, datasets, tools: cite on first occurrence only**: Place `[@key]` the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation. | |
| - **Blog chapter order matters**: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file. | |
| - **Everything citable gets cited**: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference | |
| - **Hyperlinks are not citations**: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`) it still needs a `[@key]` citation | |
| - **Use `@software` for code-only references**: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post | |
| - **Use `@misc` with `note = {Blog post}`** for references that only exist as blog posts | |
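The "first occurrence across the entire concatenated blog" rule can be sketched as a small check. This is an illustration under assumptions, not the project's tooling: `checkFirstMentionCited` is a hypothetical helper, the chapter texts and the keys `fineweb` and `datatrove` are made up, and only the single-key `[@key]` form placed directly after the mention is recognised (grouped citations or the bare `@key` form would need more handling).

```javascript
// Hypothetical helper: finds the first chapter (in rendering order) that mentions
// `term` and reports whether `[@key]` directly follows that first mention.
function checkFirstMentionCited(chapters, term, key) {
  for (const { name, text } of chapters) {
    const at = text.indexOf(term);
    if (at === -1) continue; // term not in this chapter, keep walking in order
    const rest = text.slice(at + term.length).trimStart();
    return { chapter: name, cited: rest.startsWith(`[@${key}]`) };
  }
  return null; // term never appears
}

// Chapters listed in the rendering order defined by article.mdx (sample texts).
const chapters = [
  { name: "introduction", text: "We train on FineWeb. DataTrove [@datatrove] runs the pipeline." },
  { name: "experiments", text: "FineWeb [@fineweb] is filtered before training." },
];

console.log(checkFirstMentionCited(chapters, "FineWeb", "fineweb"));
// prints: { chapter: 'introduction', cited: false }
console.log(checkFirstMentionCited(chapters, "DataTrove", "datatrove"));
// prints: { chapter: 'introduction', cited: true }
```

In the sample, the `FineWeb` citation sits in the experiments chapter even though the introduction mentions the dataset first, which is exactly the mistake the chapter-order rule is meant to catch.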
| ## Components Quick Reference | |
| | Component | Use for | | |
| |-----------|---------| | |
| | `<Sidenote>` | Contextual notes in the margin | | |
| | `<Note variant="info/success/danger">` | Callouts, tips, warnings | | |
| | `<Accordion>` | Collapsible details, code examples | | |
| | `<Image>` | Optimized images with captions, zoom | | |
| | `<HtmlEmbed>` | D3/Plotly charts, interactive figures | | |
| | `<Wide>` / `<FullWidth>` | Break out of content column | | |
| | `<Glossary>` | Hover definitions for terms | | |
| | `<Quote>` | Attributed quotations | | |
| | `<Stack>` | Multi-column layouts | | |
| | `<FigRef>` | Auto-numbered figure cross-references (e.g., `<FigRef target="my-figure-id" />` renders as "Figure 3") | | |
| ## Chart Selection | |
| Match visualization to analytical goal: | |
| - **Compare values**: Bar, dot plot | |
| - **Show distribution**: Histogram, box plot, violin | |
| - **Part-to-whole**: Pie, stacked bar, treemap | |
| - **Trends over time**: Line chart | |
| - **Relationships**: Scatter, bubble | |
| - **Flow/process**: Sankey, flowchart | |
| Prefer D3 over Plotly for custom, web-native visualizations. | |
| --- | |
| ## Skills | |
| ### Vibe Coding D3 Charts | |
| Create interactive D3 charts as self-contained HTML fragments. Charts must be **responsive**, **accessible**, **interactive**, and **dark mode ready**. | |
| **Workflow:** | |
| 1. Read the directives at `app/src/content/embeds/vibe-code-d3-embeds-directives.md` | |
| 2. Optionally use an existing chart from `app/src/content/embeds/` as a starting point | |
| 3. Generate the chart with a prompt like: | |
| ``` | |
| I want you to code a new d3 chart named `yourchart`. | |
| I have one CSV file called `yourdata.csv` in the data folder. | |
| The csv has the following columns: `x`, `y`, `z`. | |
| ``` | |
| 4. Iterate with small adjustments until satisfied | |
| **Embedding:** | |
| Use the `<HtmlEmbed>` component in MDX: | |
| ```jsx | |
| <HtmlEmbed | |
| id="fig1" | |
| src="your-chart.html" | |
| title="Chart Title" | |
| desc="Description of the chart." | |
| /> | |
| ``` | |
| **Reference examples:** `app/src/content/chapters/demo/vibe-coding-charts.mdx` | |
| ### Taking Screenshots for Debugging | |
| Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (`npm run dev` in the `app/` directory). | |
| **Script:** `app/scripts/screenshot.mjs` | |
| **Usage (run from the `app/` directory):** | |
| ```bash | |
| # Screenshot the current viewport (above the fold) | |
| node scripts/screenshot.mjs | |
| # Screenshot a specific section by heading anchor | |
| node scripts/screenshot.mjs --target "#experiments" | |
| # Screenshot a specific figure by its id | |
| node scripts/screenshot.mjs --target "#baselines-comparison" | |
| # Screenshot a CSS selector (e.g., the first mermaid diagram) | |
| node scripts/screenshot.mjs --target ".mermaid" --padding 80 | |
| # Full-page screenshot in dark mode | |
| node scripts/screenshot.mjs --full-page --dark | |
| # Custom output path | |
| node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png | |
| ``` | |
| **Options:** | |
| | Option | Default | Description | | |
| |--------|---------|-------------| | |
| | `--target <selector>` | (viewport) | CSS selector or `#anchor` to screenshot | | |
| | `--output <path>` | `../../assets/screenshot-<target>.png` | Output file path | | |
| | `--width <px>` | 1400 | Viewport width | | |
| | `--height <px>` | 900 | Viewport height | | |
| | `--full-page` | false | Capture the full scrollable page | | |
| | `--padding <px>` | 40 | Extra padding around the target element | | |
| | `--url <url>` | `http://localhost:4321/` | Page URL | | |
| | `--wait <ms>` | 2000 | Extra wait after navigation for async content | | |
| | `--dark` | false | Use dark theme | | |
| **When to use:** After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues. | |
| --- | |
| ## End-of-Interaction Checklist | |
| After completing any content edit, perform this reference audit: | |
| 1. **Scan for uncited claims**: Look for statements about methods, benchmarks, or prior work that lack citations | |
| 2. **Search for BibTeX entries**: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry | |
| 3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` **in the correct section** (see Bibliography Organization above). Do not dump entries at the end of the file | |
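| 4. **Insert citations**: Add `[@key]` inline following the Citation Placement Rules above: at every occurrence for papers, and at the first occurrence across the whole blog for models, datasets, and tools (respecting chapter rendering order from `article.mdx`) | |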
| 5. **Ask if unsure**: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding | |
| **What needs citations:** | |
| - Model names and families (Qwen, Llama, Gemma, SmolLM, etc.) | |
| - Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) | |
| - Methods and techniques (speculative decoding, RLHF, etc.) | |
| - Architectural components (GQA, RoPE, FlashAttention, etc.) | |
| - Training details (AdamW optimizer, WSD learning rate schedule, etc.) | |
| - Inference engines (vLLM, SGLang, FlashInfer, etc.) | |
| - Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) | |
| - Tools and frameworks (DataTrove, DSPy, etc.) | |
| - Benchmark results or comparisons | |
| - Claims about prior work ("X showed that...") | |
| ## Python Dependencies | |
| Always install Python packages through `uv` rather than calling `pip` directly. Example: | |
| ```bash | |
| uv pip install pandas scipy numpy | |
| ``` | |