finephrase / AGENTS.md

joelniklaus HF Staff

adapt agents.md file to cite papers every time they are referenced.

bb0d7f1 about 11 hours ago

Scientific Blog Writing Guidelines

Guidelines for writing casual, accessible scientific content in interactive blog posts.

Style References

Use these blog posts as inspiration for writing style:

Voice and Tone

Conversational but precise: Write as if you were explaining to a curious colleague over coffee
Direct address: Use "you" and "we" freely
Active voice: Prefer "the model learns" over "learning is performed by the model"
Short sentences: Break complex ideas into digestible chunks
No hedging: Say "X improves Y" not "X tends to potentially improve Y"

Punctuation

No em-dashes (—): Use parentheses, commas, or separate sentences instead
En-dashes (–) for ranges only: "2020–2024", "pages 10–15"
Minimal semicolons: Prefer two sentences over one with a semicolon
Angle brackets in MDX: Use {'<'} and {'>'} to escape literal < and > characters in prose, since MDX interprets bare < as the start of a JSX tag
Curly braces in MDX: Avoid bare { and } in prose since MDX interprets them as JSX expressions. Remove them or replace with parentheses/other punctuation
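
The two MDX rules above can be applied mechanically. Here is a minimal sketch, assuming you run it on prose only (not on component tags, code blocks, or math, where these characters are legitimate). The helper name `escape_mdx_prose` is hypothetical and not part of the repo.

```python
import re

# Replacement for each character that MDX would otherwise parse as JSX.
# Braces become parentheses, following the "replace with parentheses" option.
_MDX_REPLACEMENTS = {
    "<": "{'<'}",
    ">": "{'>'}",
    "{": "(",
    "}": ")",
}


def escape_mdx_prose(text: str) -> str:
    """Escape angle brackets and swap curly braces for parentheses.

    A single regex pass is used so that the braces introduced by the
    {'<'} and {'>'} escapes are never rewritten themselves.
    """
    return re.sub(r"[<>{}]", lambda m: _MDX_REPLACEMENTS[m.group(0)], text)


print(escape_mdx_prose("Use a batch size < 32 and set {lr} carefully"))
# → Use a batch size {'<'} 32 and set (lr) carefully
```

The single pass matters: replacing braces in a second step would also rewrite the braces that belong to the escapes.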

Formatting

Bold for key terms on first introduction
Inline code for technical identifiers: model names, hyperparameters, file paths
Lists for features, steps, or comparisons
Keep paragraphs short (3–5 sentences max)
Minimize horizontal scrolling in tables: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose

Structure

Start sections with the conclusion or key insight
Use progressive disclosure: overview first, details on demand
Every claim should have a visible artifact (chart, code, demo) nearby
Sidenotes for tangential but useful context

Technical Writing

Define jargon on first use
Concrete examples over abstract explanations
Show, don't tell: prefer an interactive demo over a paragraph of description
Link to sources, don't just cite them
Never show plain URLs as link text: Always integrate links into descriptive text (e.g., write [DataTrove inference benchmark](https://github.com/...) not [github.com/huggingface/datatrove/...](https://github.com/...))

Interactivity

Treat figures as first-class content, not decoration
All images/plots must be wrapped as figures (using <figure> with id and <figcaption>) and every figure must be mentioned in the surrounding text using <FigRef target="figure-id" />. Do NOT include "Figure N:" in <figcaption> text since the figure environment auto-numbers
Interactive elements should reveal insight, not just look fancy
Hover tooltips everywhere: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart
Readable text size: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size
Minimal padding: Keep internal chart padding/margins small so the majority of space is used for actual content
Discuss every visualization: Every chart must be cited in the surrounding text with <FigRef> and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications
Ensure all visualizations work in dark mode
Mobile-friendly is mandatory
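
The figure rules above lend themselves to an automated check. The sketch below is one possible audit, assuming figures are written as literal `<figure id="...">` tags and referenced with `<FigRef target="..." />` in the same MDX text. The function name `audit_figures` is hypothetical, and figures created through components such as `<HtmlEmbed>` are not covered.

```python
import re

FIGURE_ID = re.compile(r'<figure\b[^>]*\bid="([^"]+)"')
FIGREF_TARGET = re.compile(r'<FigRef\b[^>]*\btarget="([^"]+)"')
# Captions that start with a manual "Figure N:" prefix.
NUMBERED_CAPTION = re.compile(r"<figcaption[^>]*>\s*Figure\s+\d+\s*:", re.IGNORECASE)


def audit_figures(mdx: str) -> dict:
    """Report figures that are never referenced and captions numbered by hand."""
    figure_ids = FIGURE_ID.findall(mdx)
    referenced = set(FIGREF_TARGET.findall(mdx))
    return {
        "unreferenced": [fid for fid in figure_ids if fid not in referenced],
        "numbered_captions": len(NUMBERED_CAPTION.findall(mdx)),
    }


sample = """
<FigRef target="loss-curve" /> shows the training loss.
<figure id="loss-curve"><figcaption>Training loss over steps</figcaption></figure>
<figure id="ablation"><figcaption>Figure 2: Ablation results</figcaption></figure>
"""
print(audit_figures(sample))
# → {'unreferenced': ['ablation'], 'numbered_captions': 1}
```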

Content Structure

One file, one article: Main content lives in app/src/content/article.mdx
Chapters for length: Split long articles into app/src/content/chapters/ and import them
Short sections: Each section answers one question or supports one idea
TOC-friendly headings: Use H2–H4 with short, descriptive titles (the table of contents is auto-generated from them)

Math Notation

Introduce symbols the first time they appear
Prefer well-known notation over custom shorthand
Plain-language interpretation after every significant formula
Inline math: $x^2$, block math: $$...$$
Use \htmlId{name}{...} for cross-referenceable equations
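
As an illustration of these rules, here is a block equation wrapped in `\htmlId`. The label `eq-nll` and the choice of formula are only an example, not taken from the blog.

```latex
$$
\htmlId{eq-nll}{\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)}
$$
```

Here $N$ is the number of training examples, $x_i$ is the $i$-th example, and $p_\theta$ is the model with parameters $\theta$. In plain language: the loss is the average negative log-probability the model assigns to the training data, so a lower value means the model finds the data less surprising.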

Citations and References

Citations: Cite entries from app/src/content/bibliography.bib using [@key] or @key
Footnotes: [^id] with definition [^id]: explanation
Deep links: Set the id prop on Image, HtmlEmbed, or Reference components to make them deep-linkable
Link format: [Figure 1](#my-figure-id)

Bibliography Organization

The bibliography file is organized into these sections (in order). Always place new entries in the correct section:

| Section | Comment header | What belongs here |
| --- | --- | --- |
| Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
| Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
| Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
| Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
| Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
| Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
| Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
| Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |
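
Placing an entry in the correct section can be scripted. This is a rough sketch, assuming each comment header sits on its own line exactly as listed above. The function name `insert_bib_entry` and the entries in the example are placeholders, not real references.

```python
SECTION_HEADERS = [
    "% Datasets",
    "% Synthetic data methods",
    "% Models",
    "% Architecture",
    "% Inference",
    "% Training",
    "% Tools",
    "% Benchmarks",
]


def insert_bib_entry(bib_text: str, section_header: str, entry: str) -> str:
    """Return bib_text with entry appended at the end of the given section."""
    if section_header not in SECTION_HEADERS:
        raise ValueError(f"Unknown section: {section_header}")
    lines = bib_text.splitlines()
    stripped = [line.strip() for line in lines]
    if section_header not in stripped:
        raise ValueError(f"Header not found in file: {section_header}")
    start = stripped.index(section_header)
    # The section ends where the next known header begins, or at end of file.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        if stripped[i] in SECTION_HEADERS:
            end = i
            break
    # Skip back over trailing blank lines so the entry follows the last one.
    insert_at = end
    while insert_at > start + 1 and not stripped[insert_at - 1]:
        insert_at -= 1
    new_lines = lines[:insert_at] + ["", *entry.strip().splitlines()] + lines[insert_at:]
    return "\n".join(new_lines) + "\n"


bib = (
    "% Datasets\n"
    "@misc{existing_dataset, title = {Existing dataset}}\n"
    "\n"
    "% Models\n"
    "@misc{existing_model, title = {Existing model}}\n"
)
print(insert_bib_entry(bib, "% Datasets", "@misc{new_dataset, title = {New dataset}}"))
```

The new entry lands after `existing_dataset` and before the `% Models` header, so nothing is dumped at the end of the file.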

Citation Placement Rules

Papers: cite every occurrence: Place [@key] or @key every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation.
Models, datasets, tools: cite on first occurrence only: Place [@key] the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation.
Blog chapter order matters: The rendering order is defined in app/src/content/article.mdx (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file.
Everything citable gets cited: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
Hyperlinks are not citations: Even if something is already linked (e.g., [DataTrove](https://github.com/...)) it still needs a [@key] citation
Use @software for code-only references: Libraries/tools without a paper get a @software{...} entry pointing to the repo or blog post
Use @misc with note = {Blog post} for references that only exist as blog posts
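
The "first occurrence across the entire concatenated blog" rule is easy to get wrong by hand. Below is a minimal sketch of a checker, assuming you pass the chapter texts in rendering order and a mapping from display name to bibliography key. The function name and the keys in the example are hypothetical. It covers only the first-occurrence rule for models, datasets, and tools, not the every-occurrence rule for papers.

```python
import re


def first_mentions_missing_citation(chapters, term_to_key):
    """Return (term, chapter_index) pairs whose first mention lacks its citation.

    chapters: list of chapter texts, in rendering order.
    term_to_key: mapping such as {"FineWeb": "fineweb"}.
    A citation counts only if it appears in the same paragraph as the mention.
    """
    missing = []
    for term, key in term_to_key.items():
        term_pattern = re.compile(rf"\b{re.escape(term)}\b")
        cite_pattern = re.compile(rf"@{re.escape(key)}\b")
        for index, chapter in enumerate(chapters):
            # Paragraphs are separated by blank lines.
            paragraphs = re.split(r"\n\s*\n", chapter)
            hit = next((p for p in paragraphs if term_pattern.search(p)), None)
            if hit is None:
                continue
            if not cite_pattern.search(hit):
                missing.append((term, index))
            break  # Only the first mention across the whole blog matters.
    return missing


chapters = [
    "We train on FineWeb [@fineweb] and serve with vLLM.",
    "vLLM [@vllm] handles batching. FineWeb is large.",
]
print(first_mentions_missing_citation(chapters, {"FineWeb": "fineweb", "vLLM": "vllm"}))
# → [('vLLM', 0)]
```

In the example, vLLM is cited in the second chapter but first mentioned in the first one, so the check flags chapter index 0.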

Components Quick Reference

| Component | Use for |
| --- | --- |
| `<Sidenote>` | Contextual notes in the margin |
| `<Note variant="info/success/danger">` | Callouts, tips, warnings |
| `<Accordion>` | Collapsible details, code examples |
| `<Image>` | Optimized images with captions, zoom |
| `<HtmlEmbed>` | D3/Plotly charts, interactive figures |
| `<Wide>` / `<FullWidth>` | Break out of content column |
| `<Glossary>` | Hover definitions for terms |
| `<Quote>` | Attributed quotations |
| `<Stack>` | Multi-column layouts |
| `<FigRef>` | Auto-numbered figure cross-references (e.g., `<FigRef target="my-figure-id" />` renders as "Figure 3") |

Chart Selection

Match visualization to analytical goal:

Compare values: Bar, dot plot
Show distribution: Histogram, box plot, violin
Part-to-whole: Pie, stacked bar, treemap
Trends over time: Line chart
Relationships: Scatter, bubble
Flow/process: Sankey, flowchart

Prefer D3 over Plotly for custom, web-native visualizations.

Skills

Vibe Coding D3 Charts

Create interactive D3 charts as self-contained HTML fragments. Charts must be responsive, accessible, interactive, and dark mode ready.

Workflow:

Read the directives at app/src/content/embeds/vibe-code-d3-embeds-directives.md
Optionally use an existing chart from app/src/content/embeds/ as a starting point

Generate the chart with a prompt like:

I want you to code a new d3 chart named `yourchart`.
I have one CSV file called `yourdata.csv` in the data folder.
The csv has the following columns: `x`, `y`, `z`.

Iterate with small adjustments until satisfied

Embedding:

Use the <HtmlEmbed> component in MDX:

<HtmlEmbed
  id="fig1"
  src="your-chart.html"
  title="Chart Title"
  desc="Description of the chart."
/>

Reference examples: app/src/content/chapters/demo/vibe-coding-charts.mdx

Taking Screenshots for Debugging

Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (npm run dev in the app/ directory).

Script: app/scripts/screenshot.mjs

Usage (run from the app/ directory):

# Screenshot the current viewport (above the fold)
node scripts/screenshot.mjs

# Screenshot a specific section by heading anchor
node scripts/screenshot.mjs --target "#experiments"

# Screenshot a specific figure by its id
node scripts/screenshot.mjs --target "#baselines-comparison"

# Screenshot a CSS selector (e.g., the first mermaid diagram)
node scripts/screenshot.mjs --target ".mermaid" --padding 80

# Full-page screenshot in dark mode
node scripts/screenshot.mjs --full-page --dark

# Custom output path
node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png

Options:

| Option | Default | Description |
| --- | --- | --- |
| `--target <selector>` | (viewport) | CSS selector or `#anchor` to screenshot |
| `--output <path>` | `../../assets/screenshot-<target>.png` | Output file path |
| `--width <px>` | 1400 | Viewport width |
| `--height <px>` | 900 | Viewport height |
| `--full-page` | false | Capture the full scrollable page |
| `--padding <px>` | 40 | Extra padding around the target element |
| `--url <url>` | `http://localhost:4321/` | Page URL |
| `--wait <ms>` | 2000 | Extra wait after navigation for async content |
| `--dark` | false | Use dark theme |

When to use: After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues.
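
If you drive the screenshot script from Python, a thin wrapper keeps the flags consistent. This is a sketch under the assumption that the script accepts exactly the options documented above. The function names are hypothetical, and the wrapper does not start the dev server for you.

```python
import subprocess


def build_screenshot_command(target=None, output=None, dark=False,
                             full_page=False, padding=None):
    """Build the argument list for scripts/screenshot.mjs from documented flags."""
    command = ["node", "scripts/screenshot.mjs"]
    if target is not None:
        command += ["--target", target]
    if output is not None:
        command += ["--output", output]
    if padding is not None:
        command += ["--padding", str(padding)]
    if full_page:
        command.append("--full-page")
    if dark:
        command.append("--dark")
    return command


def take_screenshot(app_dir="app", **options):
    """Run the script from the app/ directory (dev server must already be up)."""
    return subprocess.run(build_screenshot_command(**options), cwd=app_dir, check=True)


print(build_screenshot_command(target="#experiments", dark=True))
# → ['node', 'scripts/screenshot.mjs', '--target', '#experiments', '--dark']
```

Passing the arguments as a list avoids shell quoting problems with selectors such as `#experiments`.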

End-of-Interaction Checklist

After completing any content edit, perform this reference audit:

Scan for uncited claims: Look for statements about methods, benchmarks, or prior work that lack citations
Search for BibTeX entries: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
Add to bibliography: Place new entries in app/src/content/bibliography.bib in the correct section (see Bibliography Organization above). Do not dump entries at the end of the file
Insert citations: Add [@key] inline at every occurrence for papers, and at the first occurrence across the whole blog for models, datasets, tools, and other non-paper references (respecting chapter rendering order from article.mdx)
Ask if unsure: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding

What needs citations:

Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
Methods and techniques (speculative decoding, RLHF, etc.)
Architectural components (GQA, RoPE, Flash Attention, etc.)
Training details (AdamW optimizer, WSD learning rate schedule, etc.)
Inference engines (vLLM, SGLang, FlashInfer, etc.)
Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
Tools and frameworks (DataTrove, DSPy, etc.)
Benchmark results or comparisons
Claims about prior work ("X showed that...")
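
The first audit step (scanning for uncited claims) can be given a head start with a simple heuristic. The sketch below flags sentences that contain a claim marker but no `@key` citation. The marker list is an assumption and will miss many claims, so treat the output as a starting point for manual review. The citation key in the example is a placeholder.

```python
import re

# Phrases that usually signal a claim about prior work or a comparison.
CLAIM_MARKERS = re.compile(
    r"\b(showed that|found that|demonstrated that|reported that|outperforms?)\b",
    re.IGNORECASE,
)
CITATION = re.compile(r"@[A-Za-z][\w:-]*")


def flag_uncited_claims(text):
    """Return sentences that look like claims but carry no citation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if CLAIM_MARKERS.search(s) and not CITATION.search(s)]


text = (
    "Prior work showed that rephrasing helps. "
    "WRAP [@wrap] found that style matters. "
    "We use 8 GPUs."
)
print(flag_uncited_claims(text))
# → ['Prior work showed that rephrasing helps.']
```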

Python Dependencies

Always use uv to install Python packages (not pip). Example:

uv pip install pandas scipy numpy