finephrase

joelniklaus HF Staff committed on Feb 9

Commit

a55cd01

1 Parent(s): 5477bad

update agents.md file with guidelines for bibliography

Files changed (1)

AGENTS.md +34 -5

@@ -76,6 +76,30 @@ Use these blog posts as inspiration for writing style:
 - Deep-link anything with `id` prop on `Image`, `HtmlEmbed`, `Reference` components
 - Link format: `[Figure 1](#my-figure-id)`
+### Bibliography Organization
+The bibliography file is organized into these sections (in order). Always place new entries in the correct section:
+| Section | Comment header | What belongs here |
+|---------|---------------|-------------------|
+| Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
+| Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
+| Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
+| Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
+| Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
+| Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
+| Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
+| Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |
+### Citation Placement Rules
+- **Cite on first occurrence only**: Place `[@key]` the first time a paper/model/dataset/tool appears in the blog post
+- **Blog chapter order matters**: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file
+- **Everything citable gets cited**: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
+- **Hyperlinks are not citations**: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`) it still needs a `[@key]` citation
+- **Use `@software` for code-only references**: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post
+- **Use `@misc` with `note = {Blog post}`** for references that only exist as blog posts
 ## Components Quick Reference
 | Component | Use for |
@@ -145,13 +169,18 @@ After completing any content edit, perform this reference audit:
 1. **Scan for uncited claims**: Look for statements about methods, benchmarks, or prior work that lack citations
 2. **Search for BibTeX entries**: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
-3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` with a descriptive key (e.g., `@article{vaswani2017attention, ...}`)
-4. **Insert citations**: Add `[@key]` inline where the reference is needed
+3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` **in the correct section** (see Bibliography Organization above). Do not dump entries at the end of the file
+4. **Insert citations**: Add `[@key]` inline at the first occurrence across the whole blog (respecting chapter rendering order from `article.mdx`)
 5. **Ask if unsure**: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding
 **What needs citations:**
-- Model names (BERT, GPT, LLaMA)
-- Datasets (ImageNet, C4, The Pile)
-- Methods and techniques (attention, LoRA, RLHF)
+- Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
+- Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
+- Methods and techniques (speculative decoding, RLHF, etc.)
+- Architectural components (GQA, RoPE, Flash Attention, etc.)
+- Training details (AdamW optimizer, WSD learning rate schedule, etc.)
+- Inference engines (vLLM, SGLang, FlashInfer, etc.)
+- Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
+- Tools and frameworks (DataTrove, DSPy, etc.)
 - Benchmark results or comparisons
 - Claims about prior work ("X showed that...")
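
Reviewer note: the two rules this commit adds (place each entry under the right `% Section` header, and cite only at the first occurrence across the whole blog) could be checked mechanically. The Python sketch below is one possible way to do that and is not part of the commit. The function names are hypothetical, and the regexes assume that citations look like `[@key]` or `[@key1; @key2]` and that each BibTeX entry starts on its own line as `@type{key,`. The caller is expected to supply chapter texts already sorted in the order that `article.mdx` renders them.

```python
import re

# Section headers as listed in the "Bibliography Organization" table.
SECTIONS = [
    "Datasets", "Synthetic data methods", "Models", "Architecture",
    "Inference", "Training", "Tools", "Benchmarks",
]

# Assumption: a citation is a bracketed group containing one or more @keys.
CITATION_GROUP = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
# The lookbehind skips things like e-mail addresses inside brackets.
CITATION_KEY = re.compile(r"(?<!\w)@([A-Za-z0-9_:.\-]+)")
# Assumption: an entry opens with "@type{key," at the start of a line.
ENTRY = re.compile(r"^@\w+\{([^,\s]+),")


def iter_citations(chapters):
    """Yield (key, chapter_name) for every citation, in rendering order."""
    for name, text in chapters:
        for group in CITATION_GROUP.finditer(text):
            for key in CITATION_KEY.findall(group.group(1)):
                yield key, name


def first_occurrences(chapters):
    """Map each citation key to the first chapter that cites it."""
    first = {}
    for key, name in iter_citations(chapters):
        first.setdefault(key, name)
    return first


def repeated_citations(chapters):
    """List (key, chapter_name) for every citation after the first one."""
    seen, repeats = set(), []
    for key, name in iter_citations(chapters):
        if key in seen:
            repeats.append((key, name))
        seen.add(key)
    return repeats


def entry_sections(bib_text):
    """Map each BibTeX key to the "% Section" header it sits under.

    Entries that appear before any known header map to None.
    """
    current, placement = None, {}
    for line in bib_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("%"):
            header = stripped.lstrip("%").strip()
            if header in SECTIONS:
                current = header
            continue
        match = ENTRY.match(stripped)
        if match:
            placement[match.group(1)] = current
    return placement
```

In use, `repeated_citations` would flag any `[@key]` that step 4 of the audit should remove, and comparing `entry_sections` against the intended section for a new key would catch entries dumped at the end of the file. Deciding which section a given paper belongs in still needs human judgment.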