# Scientific Blog Writing Guidelines
Guidelines for writing casual, accessible scientific content in interactive blog posts.
## Style References
Use these blog posts as inspiration for writing style:
- [The Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
- [Transformers Tenets](https://huggingface.co/spaces/transformers-community/Transformers-tenets)
- [FineWeb: decanting the web for the finest text data at scale](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)
- [FineVision: Open Data is All You Need](https://huggingface.co/spaces/HuggingFaceM4/FineVision)
- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
## Voice and Tone
- **Conversational but precise**: Write like explaining to a curious colleague over coffee
- **Direct address**: Use "you" and "we" freely
- **Active voice**: Prefer "the model learns" over "learning is performed by the model"
- **Short sentences**: Break complex ideas into digestible chunks
- **No hedging**: Say "X improves Y" not "X tends to potentially improve Y"
## Punctuation
- **No em-dashes (—)**: Use parentheses, commas, or separate sentences instead
- **En-dashes (–)** for ranges only: "2020–2024", "pages 10–15"
- **Minimal semicolons**: Prefer two sentences over one with a semicolon
- **Angle brackets in MDX**: Use `{'<'}` and `{'>'}` to escape literal `<` and `>` characters in prose, since MDX interprets bare `<` as the start of a JSX tag
- **Curly braces in MDX**: Avoid bare `{` and `}` in prose since MDX interprets them as JSX expressions. Remove them or replace with parentheses/other punctuation
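
The punctuation rules above are mechanical enough to check with a small script. What follows is a minimal sketch, not part of the project tooling: the function names `escape_angle_brackets` and `punctuation_issues` are hypothetical, and the code assumes it receives prose only (code spans and fenced blocks already stripped).

```python
import re

EM_DASH = "\u2014"  # the character the guideline bans in prose


def escape_angle_brackets(prose: str) -> str:
    """Swap literal < and > for the MDX-safe forms {'<'} and {'>'}."""
    # One regex pass, so braces introduced by the replacement are never re-scanned.
    return re.sub(r"[<>]", lambda m: "{'" + m.group(0) + "'}", prose)


def punctuation_issues(prose: str) -> list[str]:
    """List the punctuation rules that a line of prose breaks (hypothetical helper)."""
    issues = []
    if EM_DASH in prose:
        issues.append("em-dash")
    if ";" in prose:
        issues.append("semicolon")
    if "{" in prose or "}" in prose:
        issues.append("curly brace")
    return issues


print(escape_angle_brackets("Keep loss < 0.5 and lr > 1e-4"))
print(punctuation_issues("We tried it \u2014 it worked; done"))
```

Run `punctuation_issues` before `escape_angle_brackets`, since the escaped forms themselves contain curly braces.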
## Formatting
- **Bold for key terms** on first introduction
- **Inline code** for technical identifiers: model names, hyperparameters, file paths
- Lists for features, steps, or comparisons
- Keep paragraphs short (3–5 sentences max)
- **Minimize horizontal scrolling in tables**: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose
## Structure
- Start sections with the conclusion or key insight
- Use progressive disclosure: overview first, details on demand
- Every claim should have a visible artifact (chart, code, demo) nearby
- Sidenotes for tangential but useful context
## Technical Writing
- Define jargon on first use
- Concrete examples over abstract explanations
- Show, don't tell: prefer an interactive demo over a paragraph of description
- Link to sources, don't just cite them
- **Never show plain URLs as link text**: Always integrate links into descriptive text (e.g., write `[DataTrove inference benchmark](https://github.com/...)` not `[github.com/huggingface/datatrove/...](https://github.com/...)`)
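
The link-text rule can be approximated with a regular expression. This is a heuristic sketch under stated assumptions: `url_as_link_text` is a hypothetical helper, the sample uses placeholder `example.com` URLs, and a bare host name without a path (such as a link whose text is only a domain) is not caught.

```python
import re

# [text](url) Markdown links with an http(s) target.
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")
# Heuristic: link text that starts with a scheme, "www.", or a dotted host plus "/".
URL_LIKE = re.compile(r"(https?://|www\.|[\w-]+(\.[\w-]+)+/)")


def url_as_link_text(mdx: str) -> list[str]:
    """Return the link texts that look like raw URLs (hypothetical helper)."""
    return [text for text, _url in LINK.findall(mdx) if URL_LIKE.match(text.strip())]


sample = (
    "See the [example.com/reports/latency](https://example.com/reports/latency) page "
    "and the [latency report](https://example.com/reports/latency)."
)
print(url_as_link_text(sample))
```

Only the first link is reported. The second one already uses descriptive text.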
## Interactivity
- Treat figures as first-class content, not decoration
- **All images/plots must be wrapped as figures** (using `<figure>` with `id` and `<figcaption>`) and every figure must be mentioned in the surrounding text using `<FigRef target="figure-id" />`. Do NOT include "Figure N:" in `<figcaption>` text since the figure environment auto-numbers
- Interactive elements should reveal insight, not just look fancy
- **Hover tooltips everywhere**: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart
- **Readable text size**: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size
- **Minimal padding**: Keep internal chart padding/margins small so the majority of space is used for actual content
- **Discuss every visualization**: Every chart must be cited in the surrounding text with `<FigRef>` and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications
- Ensure all visualizations work in dark mode
- Mobile-friendly is mandatory
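
The figure rules above (every figure has an `id`, every figure is referenced with `<FigRef>`, no manual "Figure N:" in captions) can be audited roughly as follows. This is a sketch that assumes raw `<figure id="...">` markup with double-quoted attributes. It does not look at `<HtmlEmbed>` or `<Image>` components, and `audit_figures` is a hypothetical name.

```python
import re

FIGURE_ID = re.compile(r'<figure\b[^>]*\bid="([^"]+)"')
FIGREF_TARGET = re.compile(r'<FigRef\b[^>]*\btarget="([^"]+)"')
NUMBERED_CAPTION = re.compile(r"<figcaption>\s*(Figure\s+\d+\s*:)")


def audit_figures(mdx: str) -> dict:
    """Report figures nobody references, references to nothing, and numbered captions."""
    figure_ids = FIGURE_ID.findall(mdx)
    referenced = set(FIGREF_TARGET.findall(mdx))
    return {
        "unreferenced": [fid for fid in figure_ids if fid not in referenced],
        "dangling_refs": sorted(referenced - set(figure_ids)),
        "numbered_captions": NUMBERED_CAPTION.findall(mdx),
    }


# Illustrative MDX: the ids and file names are made up for this example.
sample = """
As <FigRef target="loss-curve" /> shows, the loss drops quickly.
<figure id="loss-curve"><img src="loss.png" alt="Loss" /><figcaption>Training loss.</figcaption></figure>
<figure id="ablation"><img src="ablation.png" alt="Ablation" /><figcaption>Figure 2: Ablation.</figcaption></figure>
"""
report = audit_figures(sample)
print(report)
```

In the sample, `ablation` is never referenced in the text and its caption carries a manual "Figure 2:" prefix, so both show up in the report.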
## Content Structure
- **One file, one article**: Main content lives in `app/src/content/article.mdx`
- **Chapters for length**: Split long articles into `app/src/content/chapters/` and import them
- **Short sections**: Each section answers one question or supports one idea
- **TOC-friendly headings**: Use H2–H4 with short, descriptive titles (auto-generates table of contents)
## Math Notation
- **Introduce symbols** the first time they appear
- **Prefer well-known notation** over custom shorthand
- **Plain-language interpretation** after every significant formula
- Inline math: `$x^2$`, block math: `$$...$$`
- Use `\htmlId{name}{...}` for cross-referenceable equations
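
A short illustration of these rules together. The formula and the id `eq-nll` are hypothetical and chosen only for the example.

```latex
$$
% "eq-nll" is an illustrative id, pick any unique name
\htmlId{eq-nll}{\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)}
$$
```

Here $\theta$ denotes the model parameters, $N$ the number of training examples, $x_i$ the $i$-th example, and $p_\theta(x_i)$ the probability the model assigns to it. In plain language: the loss is the average surprise of the model on the training data, so lower is better.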
## Citations and References
- Citations from `app/src/content/bibliography.bib` using `[@key]` or `@key`
- Footnotes: `[^id]` with definition `[^id]: explanation`
- Deep-link anything with `id` prop on `Image`, `HtmlEmbed`, `Reference` components
- Link format: `[Figure 1](#my-figure-id)`
### Bibliography Organization
The bibliography file is organized into these sections (in order). Always place new entries in the correct section:
| Section | Comment header | What belongs here |
|---------|---------------|-------------------|
| Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
| Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
| Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
| Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
| Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
| Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
| Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
| Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |
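
To check that entries sit under the right header, something like the sketch below may help. It assumes each section starts with a comment line that matches the table exactly (for example `% Datasets`), and the BibTeX keys in the sample are invented. `entry_sections` and `headers_in_order` are hypothetical helpers.

```python
import re

# Section names from the table above, in file order.
SECTIONS = [
    "Datasets", "Synthetic data methods", "Models", "Architecture",
    "Inference", "Training", "Tools", "Benchmarks",
]
ENTRY_KEY = re.compile(r"@\w+\{\s*([^,\s]+)")


def entry_sections(bib_text: str) -> dict:
    """Map each BibTeX key to the section header it sits under (None if before any)."""
    current = None
    placement = {}
    for line in bib_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("%"):
            name = stripped.lstrip("%").strip()
            if name in SECTIONS:
                current = name
        elif stripped.startswith("@"):
            match = ENTRY_KEY.match(stripped)
            if match:
                placement[match.group(1)] = current
    return placement


def headers_in_order(bib_text: str) -> bool:
    """True when the section headers present follow the documented order."""
    found = [ln.strip().lstrip("%").strip() for ln in bib_text.splitlines()
             if ln.strip().startswith("%")]
    found = [name for name in found if name in SECTIONS]
    return found == [name for name in SECTIONS if name in found]


sample_bib = """@misc{stray-entry, title = {Placed before any header}}
% Datasets
@misc{dataset-example, title = {A dataset}}
% Models
@article{model-example, title = {A model report}}
"""
print(entry_sections(sample_bib))
print(headers_in_order(sample_bib))
```

An entry mapped to `None` was placed outside every section and should be moved.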
### Citation Placement Rules
- **Papers: cite every occurrence**: Place `[@key]` or `@key` every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation.
- **Models, datasets, tools: cite on first occurrence only**: Place `[@key]` the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation.
- **Blog chapter order matters**: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file.
- **Everything citable gets cited**: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
- **Hyperlinks are not citations**: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`) it still needs a `[@key]` citation
- **Use `@software` for code-only references**: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post
- **Use `@misc` with `note = {Blog post}`** for references that only exist as blog posts
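
Because "first occurrence" is defined across the whole rendered blog, it helps to scan chapters in rendering order. The sketch below assumes you pass the chapter texts yourself, in the order `article.mdx` imports them. The citation keys are invented, and the regex is a heuristic that will also match other `@name` tokens (for example scoped package names in imports), so strip code blocks first.

```python
import re

# Matches both [@key] and bare @key; the lookbehind skips e-mail addresses.
CITATION = re.compile(r"(?<![\w.])@([A-Za-z][\w:-]*)")


def first_citation_chapter(chapters) -> dict:
    """chapters: (name, mdx_text) pairs in rendering order -> {key: first chapter}."""
    first = {}
    for name, text in chapters:
        for key in CITATION.findall(text):
            first.setdefault(key, name)  # keep only the earliest chapter
    return first


chapters = [
    ("Introduction", "We train on FineWeb [@dataset-example] and serve with vLLM [@engine-example]."),
    ("Experiments", "FineWeb [@dataset-example] again, scored on ARC @benchmark-example."),
]
print(first_citation_chapter(chapters))
```

For a model, dataset, or tool, any citation outside the chapter reported here is a repeat that can be dropped. For papers, repeats are expected.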
## Components Quick Reference
| Component | Use for |
|-----------|---------|
| `<Sidenote>` | Contextual notes in the margin |
| `<Note variant="info/success/danger">` | Callouts, tips, warnings |
| `<Accordion>` | Collapsible details, code examples |
| `<Image>` | Optimized images with captions, zoom |
| `<HtmlEmbed>` | D3/Plotly charts, interactive figures |
| `<Wide>` / `<FullWidth>` | Break out of content column |
| `<Glossary>` | Hover definitions for terms |
| `<Quote>` | Attributed quotations |
| `<Stack>` | Multi-column layouts |
| `<FigRef>` | Auto-numbered figure cross-references (e.g., `<FigRef target="my-figure-id" />` renders as "Figure 3") |
## Chart Selection
Match visualization to analytical goal:
- **Compare values**: Bar, dot plot
- **Show distribution**: Histogram, box plot, violin
- **Part-to-whole**: Pie, stacked bar, treemap
- **Trends over time**: Line chart
- **Relationships**: Scatter, bubble
- **Flow/process**: Sankey, flowchart
Prefer D3 over Plotly for custom, web-native visualizations.
---
## Skills
### Vibe Coding D3 Charts
Create interactive D3 charts as self-contained HTML fragments. Charts must be **responsive**, **accessible**, **interactive**, and **dark mode ready**.
**Workflow:**
1. Read the directives at `app/src/content/embeds/vibe-code-d3-embeds-directives.md`
2. Optionally use an existing chart from `app/src/content/embeds/` as a starting point
3. Generate the chart with a prompt like:
```
I want you to code a new d3 chart named `yourchart`.
I have one CSV file called `yourdata.csv` in the data folder.
The csv has the following columns: `x`, `y`, `z`.
```
4. Iterate with small adjustments until satisfied
**Embedding:**
Use the `<HtmlEmbed>` component in MDX:
```jsx
<HtmlEmbed
id="fig1"
src="your-chart.html"
title="Chart Title"
desc="Figure 1: Description of the chart."
/>
```
**Reference examples:** `app/src/content/chapters/demo/vibe-coding-charts.mdx`
### Taking Screenshots for Debugging
Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (`npm run dev` in the `app/` directory).
**Script:** `app/scripts/screenshot.mjs`
**Usage (run from the `app/` directory):**
```bash
# Screenshot the current viewport (above the fold)
node scripts/screenshot.mjs
# Screenshot a specific section by heading anchor
node scripts/screenshot.mjs --target "#experiments"
# Screenshot a specific figure by its id
node scripts/screenshot.mjs --target "#baselines-comparison"
# Screenshot a CSS selector (e.g., the first mermaid diagram)
node scripts/screenshot.mjs --target ".mermaid" --padding 80
# Full-page screenshot in dark mode
node scripts/screenshot.mjs --full-page --dark
# Custom output path
node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png
```
**Options:**
| Option | Default | Description |
|--------|---------|-------------|
| `--target <selector>` | (viewport) | CSS selector or `#anchor` to screenshot |
| `--output <path>` | `../../assets/screenshot-<target>.png` | Output file path |
| `--width <px>` | 1400 | Viewport width |
| `--height <px>` | 900 | Viewport height |
| `--full-page` | false | Capture the full scrollable page |
| `--padding <px>` | 40 | Extra padding around the target element |
| `--url <url>` | `http://localhost:4321/` | Page URL |
| `--wait <ms>` | 2000 | Extra wait after navigation for async content |
| `--dark` | false | Use dark theme |
**When to use:** After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues.
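
When you need the same shot in several variants, building the command from the documented options avoids typos. This is a sketch: `screenshot_command` is a hypothetical helper, it only uses flags listed in the table above, and it only builds the argument list without running anything.

```python
def screenshot_command(target=None, output=None, dark=False, full_page=False, padding=None):
    """Build the argv list for scripts/screenshot.mjs from the documented options."""
    cmd = ["node", "scripts/screenshot.mjs"]
    if target is not None:
        cmd += ["--target", target]
    if padding is not None:
        cmd += ["--padding", str(padding)]
    if full_page:
        cmd.append("--full-page")
    if dark:
        cmd.append("--dark")
    if output is not None:
        cmd += ["--output", output]
    return cmd


# Light and dark variants of the same section, each with an explicit output path.
commands = [
    screenshot_command(target="#experiments", dark=dark,
                       output=f"./experiments-{'dark' if dark else 'light'}.png")
    for dark in (False, True)
]
for cmd in commands:
    print(" ".join(cmd))
```

To execute a command, pass it to `subprocess.run(cmd, cwd="app", check=True)` while the dev server is running. Explicit `--output` paths are used here because the default file name is derived from the target, so light and dark shots of the same target would likely share a path.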
---
## End-of-Interaction Checklist
After completing any content edit, perform this reference audit:
1. **Scan for uncited claims**: Look for statements about methods, benchmarks, or prior work that lack citations
2. **Search for BibTeX entries**: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` **in the correct section** (see Bibliography Organization above). Do not dump entries at the end of the file
4. **Insert citations**: Add `[@key]` inline at every occurrence for papers, and at the first occurrence across the whole blog for models, datasets, benchmarks, and tools (respecting chapter rendering order from `article.mdx`)
5. **Ask if unsure**: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding
**What needs citations:**
- Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
- Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
- Methods and techniques (speculative decoding, RLHF, etc.)
- Architectural components (GQA, RoPE, FlashAttention, etc.)
- Training details (AdamW optimizer, WSD learning rate schedule, etc.)
- Inference engines (vLLM, SGLang, FlashInfer, etc.)
- Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
- Tools and frameworks (DataTrove, DSPy, etc.)
- Benchmark results or comparisons
- Claims about prior work ("X showed that...")
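
Part of the audit can be automated: any key cited in the MDX but absent from the bibliography is a guaranteed gap. The sketch below is an illustration with invented keys. `missing_bib_keys` is a hypothetical helper, and the same citation heuristic applies as elsewhere (strip code blocks first to avoid false matches).

```python
import re

CITATION = re.compile(r"(?<![\w.])@([A-Za-z][\w:-]*)")
BIB_KEY = re.compile(r"^\s*@\w+\{\s*([^,\s]+)", re.MULTILINE)


def missing_bib_keys(mdx: str, bib: str) -> list[str]:
    """Return cited keys that have no entry in the bibliography, sorted."""
    cited = set(CITATION.findall(mdx))
    defined = set(BIB_KEY.findall(bib))
    return sorted(cited - defined)


mdx = "We use AdamW [@optimizer-example] and evaluate on MMLU [@benchmark-example]."
bib = """% Training
@article{optimizer-example,
  title = {An optimizer paper}
}
"""
print(missing_bib_keys(mdx, bib))
```

This only finds citations that point nowhere. Spotting claims with no citation at all still takes a careful read.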
## Python Dependencies
Always use `uv` to install Python packages rather than calling `pip` directly. Example:
```bash
uv pip install pandas scipy numpy
```