joelniklaus HF Staff committed on
Commit
a55cd01
·
1 Parent(s): 5477bad

update agents.md file with guidelines for bibliography

  1. AGENTS.md +34 -5
AGENTS.md CHANGED
@@ -76,6 +76,30 @@ Use these blog posts as inspiration for writing style:
  - Deep-link anything with `id` prop on `Image`, `HtmlEmbed`, `Reference` components
  - Link format: `[Figure 1](#my-figure-id)`

+ ### Bibliography Organization
+
+ The bibliography file is organized into these sections (in order). Always place new entries in the correct section:
+
+ | Section | Comment header | What belongs here |
+ |---------|---------------|-------------------|
+ | Datasets | `% Datasets` | Training/eval datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.) |
+ | Synthetic data methods | `% Synthetic data methods` | Papers about data generation/rephrasing (WRAP, REWIRE, BeyondWeb, model collapse) |
+ | Models | `% Models` | LLM papers and technical reports (Qwen, Llama, Gemma, SmolLM, etc.) |
+ | Architecture | `% Architecture` | Architectural components (GQA, RoPE, etc.) |
+ | Inference | `% Inference` | Serving/inference engines and techniques (vLLM, SGLang, FlashAttention, speculative decoding) |
+ | Training | `% Training` | Optimizers, schedules, training methods (AdamW, WSD/MiniCPM, etc.) |
+ | Tools | `% Tools` | Software tools and frameworks (DSPy, DataTrove, etc.) |
+ | Benchmarks | `% Benchmarks` | Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.) |
+
+ ### Citation Placement Rules
+
+ - **Cite on first occurrence only**: Place `[@key]` the first time a paper/model/dataset/tool appears in the blog post
+ - **Blog chapter order matters**: The rendering order is defined in `app/src/content/article.mdx` (Introduction → Infrastructure → Setup → Experiments → Conclusions → Appendix). "First occurrence" means first across the entire concatenated blog, not first within a single chapter file
+ - **Everything citable gets cited**: Models, datasets, benchmarks, architectural techniques, optimizers, inference engines, training schedules, and tools all need citations if they have a paper or official reference
+ - **Hyperlinks are not citations**: Even if something is already linked (e.g., `[DataTrove](https://github.com/...)`) it still needs a `[@key]` citation
+ - **Use `@software` for code-only references**: Libraries/tools without a paper get a `@software{...}` entry pointing to the repo or blog post
+ - **Use `@misc` with `note = {Blog post}`** for references that only exist as blog posts
+
  ## Components Quick Reference

  | Component | Use for |
@@ -145,13 +169,18 @@ After completing any content edit, perform this reference audit:

  1. **Scan for uncited claims**: Look for statements about methods, benchmarks, or prior work that lack citations
  2. **Search for BibTeX entries**: For each missing reference, search online (Google Scholar, Semantic Scholar, arXiv) for the correct BibTeX entry
- 3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` with a descriptive key (e.g., `@article{vaswani2017attention, ...}`)
- 4. **Insert citations**: Add `[@key]` inline where the reference is needed
+ 3. **Add to bibliography**: Place new entries in `app/src/content/bibliography.bib` **in the correct section** (see Bibliography Organization above). Do not dump entries at the end of the file
+ 4. **Insert citations**: Add `[@key]` inline at the first occurrence across the whole blog (respecting chapter rendering order from `article.mdx`)
  5. **Ask if unsure**: If multiple papers match or authorship is ambiguous, ask the user which reference is correct before adding

  **What needs citations:**
- - Model names (BERT, GPT, LLaMA)
- - Datasets (ImageNet, C4, The Pile)
- - Methods and techniques (attention, LoRA, RLHF)
+ - Model names and families (Qwen, Llama, Gemma, SmolLM, etc.)
+ - Datasets (FineWeb, DCLM, Cosmopedia, s1K, etc.)
+ - Methods and techniques (speculative decoding, RLHF, etc.)
+ - Architectural components (GQA, RoPE, Flash Attention, etc.)
+ - Training details (AdamW optimizer, WSD learning rate schedule, etc.)
+ - Inference engines (vLLM, SGLang, FlashInfer, etc.)
+ - Evaluation benchmarks (ARC, HellaSwag, MMLU, GSM8K, etc.)
+ - Tools and frameworks (DataTrove, DSPy, etc.)
  - Benchmark results or comparisons
  - Claims about prior work ("X showed that...")
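
Note on the "cite on first occurrence only" rule added in this commit: it could be checked mechanically. The sketch below is not part of the commit. It assumes the chapters have already been loaded as `(name, text)` pairs in the rendering order given by `article.mdx`, and that citations use the plain `[@key]` form. The function name `find_repeated_citations` and the citation keys in the usage example are hypothetical.

```python
import re

# Matches the plain single-key form `[@key]` used in the guidelines.
# Assumption: grouped citations such as `[@a; @b]` are not handled here.
CITATION_RE = re.compile(r"\[@([^\]\s;,]+)\]")


def find_repeated_citations(chapters):
    """Return (key, chapter, first_chapter) for each citation after the first.

    `chapters` is a list of (name, text) pairs in rendering order, so
    "first" means first across the whole concatenated blog.
    """
    first_seen = {}
    repeats = []
    for name, text in chapters:
        for match in CITATION_RE.finditer(text):
            key = match.group(1)
            if key in first_seen:
                repeats.append((key, name, first_seen[key]))
            else:
                first_seen[key] = name
    return repeats


# Hypothetical chapter contents and keys, for illustration only.
chapters = [
    ("introduction", "We train on a web dataset [@example2024dataset]."),
    ("experiments", "The dataset [@example2024dataset] is processed with a tool [@example_tool]."),
]
repeats = find_repeated_citations(chapters)
print(repeats)
```

Each reported tuple names the repeated key, the chapter containing the repeat, and the chapter where the key was first cited, which is where the single `[@key]` should stay.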