
Scientific Blog Writing Guidelines

Guidelines for writing casual, accessible scientific content in interactive blog posts.

Style References

Use these blog posts as inspiration for writing style:

Voice and Tone

  • Conversational but precise: Write as if you were explaining to a curious colleague over coffee
  • Direct address: Use "you" and "we" freely
  • Active voice: Prefer "the model learns" over "learning is performed by the model"
  • Short sentences: Break complex ideas into digestible chunks
  • No hedging: Say "X improves Y" not "X tends to potentially improve Y"

Punctuation

  • No em-dashes (—): Use parentheses, commas, or separate sentences instead
  • En-dashes (–) for ranges only: "2020–2024", "pages 10–15"
  • Minimal semicolons: Prefer two sentences over one with a semicolon
  • Angle brackets in MDX: Use {'<'} and {'>'} to escape literal < and > characters in prose, since MDX interprets bare < as the start of a JSX tag
  • Curly braces in MDX: Avoid bare { and } in prose since MDX interprets them as JSX expressions. Remove them or replace with parentheses/other punctuation
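
A minimal sketch of both escaping rules. The sentences are invented for illustration. The raw versions would have used a bare < and > in the first line and a {document} placeholder in the second.

```mdx
Quality drops when the filter score is {'<'} 0.5 and recovers once it is {'>'} 0.9.

The prompt template fills a (document) placeholder with the source text.
```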

Formatting

  • Bold for key terms on first introduction
  • Inline code for technical identifiers: model names, hyperparameters, file paths
  • Lists for features, steps, or comparisons
  • Keep paragraphs short (3–5 sentences max)
  • Minimize horizontal scrolling in tables: If a table row has long content that causes horizontal overflow, convert it to a bullet list or restructure the information. Prefer lists over tables when rows have variable-length prose

Structure

  • Start sections with the conclusion or key insight
  • Use progressive disclosure: overview first, details on demand
  • Every claim should have a visible artifact (chart, code, demo) nearby
  • Sidenotes for tangential but useful context

Technical Writing

  • Define jargon on first use
  • Concrete examples over abstract explanations
  • Show, don't tell: prefer an interactive demo over a paragraph of description
  • Link to sources, don't just cite them
  • Never show plain URLs as link text: Always integrate links into descriptive text (e.g., write [DataTrove inference benchmark](https://github.com/...) not [github.com/huggingface/datatrove/...](https://github.com/...))

Interactivity

  • Treat figures as first-class content, not decoration
  • All images/plots must be wrapped as figures (using <figure> with id and <figcaption>) and every figure must be mentioned in the surrounding text using <FigRef target="figure-id" />. Do NOT include "Figure N:" in <figcaption> text since the figure environment auto-numbers
  • Interactive elements should reveal insight, not just look fancy
  • Hover tooltips everywhere: Add informative hover tooltips wherever possible to surface additional detail (model name, exact values, metadata) without cluttering the chart
  • Readable text size: Ensure all text in charts (axis labels, tick marks, legends, annotations) is large enough to read comfortably, comparable to the main body text size
  • Minimal padding: Keep internal chart padding/margins small so the majority of space is used for actual content
  • Discuss every visualization: Every chart must be cited in the surrounding text with <FigRef> and accompanied by 1–3 paragraphs that explain what the chart shows, highlight the key takeaway, and discuss implications
  • Ensure all visualizations work in dark mode
  • Mobile-friendly is mandatory
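
A sketch of a figure that follows the wrapping and referencing rules above. The id, image path, alt text, and caption are hypothetical. The plain <img> tag is also an assumption, since the project's <Image> or <HtmlEmbed> components may be the better fit inside the figure.

```mdx
<FigRef target="throughput-by-model" /> compares throughput across model sizes.

<figure id="throughput-by-model">
  <img src="/images/throughput-by-model.png" alt="Bar chart of tokens per second for each model size" />
  <figcaption>Tokens per second for each model size on a single node.</figcaption>
</figure>
```

The caption carries no "Figure N:" prefix, and the sentence before the figure points to it through <FigRef> rather than a hard-coded number.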

Content Structure

  • One file, one article: Main content lives in app/src/content/article.mdx
  • Chapters for length: Split long articles into app/src/content/chapters/ and import them
  • Short sections: Each section answers one question or supports one idea
  • TOC-friendly headings: Use H2–H4 with short, descriptive titles (auto-generates table of contents)

Math Notation

  • Introduce symbols the first time they appear
  • Prefer well-known notation over custom shorthand
  • Plain-language interpretation after every significant formula
  • Inline math: $x^2$, block math: $$...$$
  • Use \htmlId{name}{...} for cross-referenceable equations
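
A sketch of the notation rules applied to one formula. The symbols, the formula, and the eq-mean-score id are made up for illustration.

```mdx
Let $s_i$ be the score on benchmark $i$ and $N$ the number of benchmarks. We report the macro average:

$$
\htmlId{eq-mean-score}{\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s_i}
$$

In plain words: add up the scores and divide by the number of benchmarks, so every benchmark counts equally.
```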

Citations and References

  • Citations from app/src/content/bibliography.bib using [@key] or @key
  • Footnotes: [^id] with definition [^id]: explanation
  • Deep-link anything with id prop on Image, HtmlEmbed, Reference components
  • Link format: [Figure 1](#my-figure-id)
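
A sketch of the citation and footnote syntax in running text. The fineweb key and the sentences are placeholders, and a real key must exist in app/src/content/bibliography.bib.

```mdx
We start from the FineWeb dataset [@fineweb].[^tokens] The filtering steps are described in @fineweb.

[^tokens]: Token counts in this post use the tokenizer of the model being trained.
```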

Bibliography Organization

The bibliography file is organized into these sections (in order). Always place new entries in the correct section:

| Section | Comment header | What belongs here |
| --- | --- | --- |
| Datasets | % Datasets | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
| Synthetic data methods | % Synthetic data methods | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
| Models | % Models | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
| Architecture | % Architecture | Architectural components (GQA, RoPE, etc.) |
| Inference | % Inference | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
| Training | % Training | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
| Tools | % Tools | Software tools and frameworks (DSPy, DataTrove, etc.) |
| Benchmarks | % Benchmarks | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |

Citation Placement Rules

  • Papers: cite every occurrence: Place [@key] or @key every time a paper is referenced in the blog post. Papers are research contributions that readers may want to look up at any point, so always provide the citation.
  • Models, datasets, tools: cite on first occurrence only: Place [@key] the first time a model, dataset, benchmark, tool, or other non-paper reference appears in the blog post. Subsequent mentions don't need the citation.
  • Blog chapter order matters: The rendering order is defined in app/src/content/article.mdx (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file.
  • Everything citable gets cited: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
  • Hyperlinks are not citations: Even if something is already linked (e.g., [DataTrove](https://github.com/...)) it still needs a [@key] citation
  • Use @software for code-only references: Libraries/tools without a paper get a @software{...} entry pointing to the repo or blog post
  • Use @misc with note = {Blog post} for references that only exist as blog posts
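
A sketch of the two placement rules side by side, treating WRAP as a paper and ARC as a benchmark. The keys wrap and arc are hypothetical, and the sentences are invented.

```mdx
We rephrase web documents following WRAP [@wrap] and evaluate on ARC [@arc].

We reuse the prompt format from WRAP [@wrap]. ARC results are in the Experiments chapter.
```

WRAP is cited at both mentions because it is a paper. ARC is cited only at its first mention.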

Components Quick Reference

| Component | Use for |
| --- | --- |
| <Sidenote> | Contextual notes in the margin |
| <Note variant="info/success/danger"> | Callouts, tips, warnings |
| <Accordion> | Collapsible details, code examples |
| <Image> | Optimized images with captions, zoom |
| <HtmlEmbed> | D3/Plotly charts, interactive figures |
| <Wide> / <FullWidth> | Break out of content column |
| <Glossary> | Hover definitions for terms |
| <Quote> | Attributed quotations |
| <Stack> | Multi-column layouts |
| <FigRef> | Auto-numbered figure cross-references (e.g., <FigRef target="my-figure-id" /> renders as "Figure 3") |

Chart Selection

Match visualization to analytical goal:

  • Compare values: Bar, dot plot
  • Show distribution: Histogram, box plot, violin
  • Part-to-whole: Pie, stacked bar, treemap
  • Trends over time: Line chart
  • Relationships: Scatter, bubble
  • Flow/process: Sankey, flowchart

Prefer D3 over Plotly for custom, web-native visualizations.


Skills

Vibe Coding D3 Charts

Create interactive D3 charts as self-contained HTML fragments. Charts must be responsive, accessible, interactive, and dark mode ready.

Workflow:

  1. Read the directives at app/src/content/embeds/vibe-code-d3-embeds-directives.md
  2. Optionally use an existing chart from app/src/content/embeds/ as a starting point
  3. Generate the chart with a prompt like:
    I want you to code a new d3 chart named `yourchart`.
    I have one CSV file called `yourdata.csv` in the data folder.
    The csv has the following columns: `x`, `y`, `z`.
    
  4. Iterate with small adjustments until satisfied

Embedding:

Use the <HtmlEmbed> component in MDX:

<HtmlEmbed
  id="fig1"
  src="your-chart.html"
  title="Chart Title"
  desc="Figure 1: Description of the chart."
/>

Reference examples: app/src/content/chapters/demo/vibe-coding-charts.mdx

Taking Screenshots for Debugging

Take screenshots of the rendered blog to visually verify how content looks. Requires the dev server to be running (npm run dev in the app/ directory).

Script: app/scripts/screenshot.mjs

Usage (run from the app/ directory):

# Screenshot the current viewport (above the fold)
node scripts/screenshot.mjs

# Screenshot a specific section by heading anchor
node scripts/screenshot.mjs --target "#experiments"

# Screenshot a specific figure by its id
node scripts/screenshot.mjs --target "#baselines-comparison"

# Screenshot a CSS selector (e.g., the first mermaid diagram)
node scripts/screenshot.mjs --target ".mermaid" --padding 80

# Full-page screenshot in dark mode
node scripts/screenshot.mjs --full-page --dark

# Custom output path
node scripts/screenshot.mjs --target "#infrastructure" --output ./infra-shot.png

Options:

| Option | Default | Description |
| --- | --- | --- |
| --target <selector> | (viewport) | CSS selector or #anchor to screenshot |
| --output <path> | ../../assets/screenshot-<target>.png | Output file path |
| --width <px> | 1400 | Viewport width |
| --height <px> | 900 | Viewport height |
| --full-page | false | Capture the full scrollable page |
| --padding <px> | 40 | Extra padding around the target element |
| --url <url> | http://localhost:4321/ | Page URL |
| --wait <ms> | 2000 | Extra wait after navigation for async content |
| --dark | false | Use dark theme |

When to use: After modifying MDX content, charts, CSS, or mermaid diagrams, take a screenshot to verify the rendered output looks correct. Particularly useful for checking figure numbering, chart rendering, and layout issues.


End-of-Interaction Checklist

After completing any content edit, perform this reference audit:

  1. Scan for uncited claims: Look for statements about methods, benchmarks, or prior work that lack citations
  2. Search for BibTeX entries: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
  3. Add to bibliography: Place new entries in app/src/content/bibliography.bib in the correct section (see Bibliography Organization above). Do not dump entries at the end of the file
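  4. Insert citations: Add [@key] inline following the Citation Placement Rules above: at every occurrence for papers, and at the first occurrence across the whole blog for models, datasets, benchmarks, and tools (respecting chapter rendering order from article.mdx)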
  5. Ask if unsure: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding

What needs citations:

  • Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
  • Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
  • Methods and techniques (speculative decoding, RLHF, etc.)
  • Architectural components (GQA, RoPE, FlashAttention, etc.)
  • Training details (AdamW optimizer, WSD learning rate schedule, etc.)
  • Inference engines (vLLM, SGLang, FlashInfer, etc.)
  • Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
  • Tools and frameworks (DataTrove, DSPy, etc.)
  • Benchmark results or comparisons
  • Claims about prior work ("X showed that...")

Python Dependencies

Always use uv to install Python packages (not pip). Example:

uv pip install pandas scipy numpy