Darshan Thakare committed on
Steer agent to HF kernels instead of pip install flash-attn (#204)
* chore: update the agent system prompt
* chore: update the tool documentation
agent/prompts/system_prompt_v3.yaml
CHANGED
@@ -42,7 +42,7 @@ system_prompt: |

  SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.

-
+ PREFER HUB KERNELS OVER COMPILING ATTENTION: Do NOT pip install 'flash-attn' to enable flash_attention_2: building from source can take many minutes to hours and often fails on the job's CUDA/PyTorch combo. Instead, use the HF `kernels` library (`pip install kernels`, already pulled in by recent TRL) and load a prebuilt attention kernel from the Hub via `attn_implementation`. Examples: `AutoModelForCausalLM.from_pretrained(..., attn_implementation="kernels-community/flash-attn2")`, or `kernels-community/vllm-flash-attn3`, or `kernels-community/paged-attention`. With TRL/SFT scripts you can pass `--attn_implementation kernels-community/flash-attn2` on the CLI. Search additional kernels at https://huggingface.co/models?other=kernel. Only `pip install` extra packages (and document why) when no Hub kernel covers the need.

  SCOPE-CHANGING FIXES: Avoid at all costs! When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for and/or change the training task itself — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request and are grounded in research and examples. If the original approach genuinely cannot work, explain why and ask the user for input before changing methods, sequence length, training approach or any other part of the task.

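For reviewers, here is a minimal sketch of the loading pattern the new prompt line steers the agent toward. The repo IDs and the `attn_implementation` argument come from the prompt text itself; the helper names `attention_kwargs` and `load_model` are hypothetical and exist only for illustration. The sketch assumes recent `transformers` and `kernels` installs and has not been checked against a specific version.

```python
from typing import Any

# Hub kernel repo IDs named in the system prompt change.
HUB_ATTENTION_KERNELS = (
    "kernels-community/flash-attn2",
    "kernels-community/vllm-flash-attn3",
    "kernels-community/paged-attention",
)


def attention_kwargs(kernel_repo: str = "kernels-community/flash-attn2") -> dict[str, Any]:
    """Build from_pretrained keyword arguments that select a Hub attention kernel.

    Hypothetical helper: it only accepts the repo IDs listed in the prompt.
    """
    if kernel_repo not in HUB_ATTENTION_KERNELS:
        raise ValueError(f"unknown Hub kernel: {kernel_repo!r}")
    return {"attn_implementation": kernel_repo}


def load_model(model_id: str, kernel_repo: str = "kernels-community/flash-attn2"):
    """Load a causal LM with a prebuilt Hub kernel instead of a compiled flash-attn."""
    # Imported lazily so attention_kwargs works without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, **attention_kwargs(kernel_repo))
```

Calling `load_model("<model id>")` would download the model and the kernel from the Hub, so it needs network access and a GPU setup the kernel supports; `attention_kwargs()` alone has no such requirement.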
agent/tools/docs_tools.py
CHANGED
@@ -932,7 +932,7 @@ EXPLORE_HF_DOCS_TOOL_SPEC = {
  "• argilla — Data annotation, feedback, and human-in-the-loop workflows.\n"
  "• distilabel — Synthetic data generation and distillation pipelines.\n"
  "• microsoft-azure — Azure deployment and integration guides.\n"
- "• kernels —
+ "• kernels — Load prebuilt compute kernels (e.g. flash-attn2) from the Hub via `attn_implementation`; avoids compiling flash-attn from source.\n"
  "• google-cloud — GCP deployment and serving workflows.\n"
  ),
  },