# Scientific Blog Writing Guidelines
Guidelines for writing casual, accessible scientific content in interactive blog posts.
## Style References
Use these blog posts as inspiration for writing style:
- [The Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)
- [Transformers Tenets](https://huggingface.co/spaces/transformers-community/Transformers-tenets)
- [FineWeb: decanting the web for the finest text data at scale](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)
- [FineVision: Open Data is All You Need](https://huggingface.co/spaces/HuggingFaceM4/FineVision)
- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
## Voice and Tone
- **Conversational but precise**: Write like explaining to a curious colleague over coffee
- **Direct address**: Use "you" and "we" freely
- **Active voice**: Prefer "the model learns" over "learning is performed by the model"
- **Short sentences**: Break complex ideas into digestible chunks
- **No hedging**: Say "X improves Y" not "X tends to potentially improve Y"
## Punctuation
- **No em-dashes (—)**: Use parentheses, commas, or separate sentences instead
- **En-dashes (–)** for ranges only: "2020–2024", "pages 10–15"
- **Minimal semicolons**: Prefer two sentences over one with a semicolon
- **Angle brackets in MDX**: Use `{'<'}` and `{'>'}` to escape literal `<` and `>` characters in prose, since MDX interprets bare `<` as the start of a JSX tag
- **Curly braces in MDX**: Avoid bare `{` and `}` in prose since MDX interprets them as JSX expressions. Remove them or replace with parentheses/other punctuation
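A minimal sketch of both escaping rules. The sentences are invented for illustration; only the `{'<'}` and `{'>'}` escapes come from the rules above.

```mdx
{/* Bare angle brackets in prose would be parsed as a JSX tag, so escape them. */}
Keep documents with a quality score {'>'} 0.5 and drop anything {'<'} 200 tokens long.

{/* Bare curly braces would be parsed as a JSX expression, so rephrase with parentheses. */}
The prompt template has two placeholders (topic and style).
```

Inside inline code and fenced code blocks these characters are literal, so they need no escaping there.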
## Formatting
- **Bold for key terms** on first introduction
- **Inline code** for technical identifiers: model names, hyperparameters, file paths
- Lists for features, steps, or comparisons
- Keep paragraphs short (3–5 sentences max)
- **Minimize horizontal scrolling in tables**: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose
## Structure
- Start sections with the conclusion or key insight
- Use progressive disclosure: overview first, details on demand
- Every claim should have a visible artifact (chart, code, demo) nearby
- Sidenotes for tangential but useful context
## Technical Writing
- Define jargon on first use
- Concrete examples over abstract explanations
- Show, don't tell: prefer an interactive demo over a paragraph of description
- Link to sources, don't just cite them
- **Never show plain URLs as link text**: Always integrate links into descriptive text (e.g., write `[DataTrove inference benchmark](https://github.com/...)` not `[github.com/huggingface/datatrove/...](https://github.com/...)`)
## Interactivity
- Treat figures as first-class content, not decoration
- **All images/plots must be wrapped as figures** (using `<figure>` with `id` and `<figcaption>`) and every figure must be mentioned in the surrounding text using `<FigRef target="figure-id" />`. Do NOT include "Figure N:" in `<figcaption>` text since the figure environment auto-numbers
- Interactive elements should reveal insight, not just look fancy
- **Hover tooltips everywhere**: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart
- **Readable text size**: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size
- **Minimal padding**: Keep internal chart padding/margins small so the majority of space is used for actual content
- **Discuss every visualization**: Every chart must be cited in the surrounding text with `<FigRef>` and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications
- Ensure all visualizations work in dark mode
- Mobile-friendly is mandatory
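The figure rules above might look like this in practice. This is a sketch, not a template: the image path, caption, and prose are placeholders, and a plain `<img>` stands in for whatever image or chart the figure actually wraps.

```mdx
<FigRef target="baselines-comparison" /> compares the baselines on each benchmark.

<figure id="baselines-comparison">
  <img src="/placeholder/baselines-comparison.png" alt="Bar chart of baseline scores per benchmark" />
  <figcaption>Baseline scores per benchmark.</figcaption>
</figure>
```

Note that the caption carries no "Figure N:" prefix, and the `target` of `<FigRef>` matches the `id` of the `<figure>`.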
## Content Structure
- **One file, one article**: Main content lives in `app/src/content/article.mdx`
- **Chapters for length**: Split long articles into `app/src/content/chapters/` and import them
- **Short sections**: Each section answers one question or supports one idea
- **TOC-friendly headings**: Use H2–H4 with short, descriptive titles (auto-generates table of contents)
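Splitting an article into chapters could look like the sketch below. It assumes standard MDX imports, and the chapter file names are hypothetical.

```mdx
{/* app/src/content/article.mdx */}
import Introduction from "./chapters/introduction.mdx";
import Experiments from "./chapters/experiments.mdx";

<Introduction />

<Experiments />
```

The order of the rendered components, not the order of the imports, decides the order readers see.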
## Math Notation
- **Introduce symbols** the first time they appear
- **Prefer well-known notation** over custom shorthand
- **Plain-language interpretation** after every significant formula
- Inline math: `$x^2$`, block math: `$$...$$`
- Use `\htmlId{name}{...}` for cross-referenceable equations
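Here is a short sketch that applies these rules to one formula. The equation id `eq-loss` and the surrounding sentences are made up for the example.

```mdx
We train by minimizing the average negative log-likelihood over the $N$ tokens of a batch,
where $p_\theta(x_i)$ is the probability the model with parameters $\theta$ assigns to
token $x_i$ given the tokens before it:

$$
\htmlId{eq-loss}{\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)}
$$

In plain language: the loss is low when the model gives high probability to the tokens
that actually appear. We come back to [the loss](#eq-loss) in the experiments.
```

Every symbol is introduced before the formula, and the sentence after it gives the plain-language reading.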
## Citations and References
- Cite entries from `app/src/content/bibliography.bib` with `[@key]` or `@key`
- Footnotes: `[^id]` with definition `[^id]: explanation`
- Deep-link to any `Image`, `HtmlEmbed`, or `Reference` component by giving it an `id` prop
- Link format: `[Figure 1](#my-figure-id)`
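The syntax above can be sketched in a few lines of MDX. The citation keys `example-paper` and `example-tool` are hypothetical, as are the sentences around them.

```mdx
We build on the approach of @example-paper.
The pipeline runs on an open-source library [@example-tool].[^scale]
See [Figure 1](#my-figure-id) for the full comparison.

[^scale]: A short explanation that would interrupt the main text.
```

`@key` reads as part of the sentence, while `[@key]` sits after the thing it supports.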
### Bibliography Organization
The bibliography file is organized into these sections (in order). Always place new entries in the correct section:
| Section | Comment header | What belongs here |
|---------|---------------|-------------------|
| Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
| Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
| Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
| Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
| Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
| Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
| Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
| Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |
### Citation Placement Rules
- **Papers: cite every occurrence**: Place `[@key]` or `@key` every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation.
- **Models, datasets, tools: cite on first occurrence only**: Place `[@key]` the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation.
- **Blog chapter order matters**: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file.
- **Everything citable gets cited**: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
- **Hyperlinks are not citations**: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`) it still needs a `[@key]` citation
- **Use `@software` for code-only references**: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post
- **Use `@misc` with `note = {Blog post}`** for references that only exist as blog posts
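A sketch of the two placement rules side by side. The keys `example-wrap-key` and `example-datatrove-key` are hypothetical, and the prose is placeholder text.

```mdx
{/* Paper: cite at every mention. */}
We follow the rephrasing setup from WRAP [@example-wrap-key].
Later in the post, we compare against WRAP [@example-wrap-key] again.

{/* Tool: cite at the first mention across the whole blog, even when it is already hyperlinked. */}
We run the pipeline with [DataTrove](https://github.com/...) [@example-datatrove-key].
Later mentions of DataTrove need no citation.
```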
## Components Quick Reference
| Component | Use for |
|-----------|---------|
| `<Sidenote>` | Contextual notes in the margin |
| `<Note variant="info/success/danger">` | Callouts, tips, warnings |
| `<Accordion>` | Collapsible details, code examples |
| `<Image>` | Optimized images with captions, zoom |
| `<HtmlEmbed>` | D3/Plotly charts, interactive figures |
| `<Wide>` / `<FullWidth>` | Break out of content column |
| `<Glossary>` | Hover definitions for terms |
| `<Quote>` | Attributed quotations |
| `<Stack>` | Multi-column layouts |
| `<FigRef>` | Auto-numbered figure cross-references (e.g., `<FigRef target="my-figure-id" />` renders as "Figure 3") |
## Chart Selection
Match visualization to analytical goal:
- **Compare values**: Bar, dot plot
- **Show distribution**: Histogram, box plot, violin
- **Part-to-whole**: Pie, stacked bar, treemap
- **Trends over time**: Line chart
- **Relationships**: Scatter, bubble
- **Flow/process**: Sankey, flowchart
Prefer D3 over Plotly for custom, web-native visualizations.
---
## Skills
### Vibe Coding D3 Charts
Create interactive D3 charts as self-contained HTML fragments. Charts must be **responsive**, **accessible**, **interactive**, and **dark mode ready**.
**Workflow:**
1. Read the directives at `app/src/content/embeds/vibe-code-d3-embeds-directives.md`
2. Optionally use an existing chart from `app/src/content/embeds/` as a starting point
3. Generate the chart with a prompt like:
```
I want you to code a new d3 chart named `yourchart`.
I have one CSV file called `yourdata.csv` in the data folder.
The csv has the following columns: `x`, `y`, `z`.
```
4. Iterate with small adjustments until satisfied
**Embedding:**
Use the `<HtmlEmbed>` component in MDX:
```jsx
<HtmlEmbed
  id="fig1"
  src="your-chart.html"
  title="Chart Title"
  desc="Figure 1: Description of the chart."
/>
```
**Reference examples:** `app/src/content/chapters/demo/vibe-coding-charts.mdx`
### Taking Screenshots for Debugging
Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (`npm run dev` in the `app/` directory).
**Script:** `app/scripts/screenshot.mjs`
**Usage (run from the `app/` directory):**
```bash
# Screenshot the current viewport (above the fold)
node scripts/screenshot.mjs
# Screenshot a specific section by heading anchor
node scripts/screenshot.mjs --target "#experiments"
# Screenshot a specific figure by its id
node scripts/screenshot.mjs --target "#baselines-comparison"
# Screenshot a CSS selector (e.g., the first mermaid diagram)
node scripts/screenshot.mjs --target ".mermaid" --padding 80
# Full-page screenshot in dark mode
node scripts/screenshot.mjs --full-page --dark
# Custom output path
node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png
```
**Options:**
| Option | Default | Description |
|--------|---------|-------------|
| `--target <selector>` | (viewport) | CSS selector or `#anchor` to screenshot |
| `--output <path>` | `../../assets/screenshot-<target>.png` | Output file path |
| `--width <px>` | 1400 | Viewport width |
| `--height <px>` | 900 | Viewport height |
| `--full-page` | false | Capture the full scrollable page |
| `--padding <px>` | 40 | Extra padding around the target element |
| `--url <url>` | `http://localhost:4321/` | Page URL |
| `--wait <ms>` | 2000 | Extra wait after navigation for async content |
| `--dark` | false | Use dark theme |
**When to use:** After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues.
---
## End-of-Interaction Checklist
After completing any content edit, perform this reference audit:
1. **Scan for uncited claims**: Look for statements about methods, benchmarks, or prior work that lack citations
2. **Search for BibTeX entries**: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` **in the correct section** (see Bibliography Organization above). Do not dump entries at the end of the file
4. **Insert citations**: Add `[@key]` inline at every occurrence for papers, and at the first occurrence across the whole blog for models, datasets, tools, and other non-paper references (respecting chapter rendering order from `article.mdx`)
5. **Ask if unsure**: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding
**What needs citations:**
- Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
- Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
- Methods and techniques (speculative decoding, RLHF, etc.)
- Architectural components (GQA, RoPE, Flash Attention, etc.)
- Training details (AdamW optimizer, WSD learning rate schedule, etc.)
- Inference engines (vLLM, SGLang, FlashInfer, etc.)
- Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
- Tools and frameworks (DataTrove, DSPy, etc.)
- Benchmark results or comparisons
- Claims about prior work ("X showed that...")
## Python Dependencies
Always use `uv` to install Python packages (never call `pip` directly). Example:
```bash
uv pip install pandas scipy numpy
```