ml-intern


Darshan Thakare committed on May 1

Commit 7599843 (unverified) · 1 Parent(s): 71e1892

Steer agent to HF kernels instead of pip install flash-attn (#204)

* chore: update the agent system prompt

* chore: update the tool documentation

Files changed (2)

agent/prompts/system_prompt_v3.yaml +1 -1
agent/tools/docs_tools.py +1 -1

agent/prompts/system_prompt_v3.yaml CHANGED

@@ -42,7 +42,7 @@ system_prompt: |
   SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.
-  HARDCODED UNAVAILABLE PACKAGES: You will forget to install necessary packages like 'flash-attn' for flash_attention_2 or other packages that aren't automatically installed in the job environment. Fix: install necessary packages before running the job.
+  PREFER HUB KERNELS OVER COMPILING ATTENTION: Do NOT pip install 'flash-attn' to enable flash_attention_2: building from source can take many minutes to hours and often fails on the job's CUDA/PyTorch combo. Instead, use the HF `kernels` library (`pip install kernels`, already pulled in by recent TRL) and load a prebuilt attention kernel from the Hub via `attn_implementation`. Examples: `AutoModelForCausalLM.from_pretrained(..., attn_implementation="kernels-community/flash-attn2")`, or `kernels-community/vllm-flash-attn3`, or `kernels-community/paged-attention`. With TRL/SFT scripts you can pass `--attn_implementation kernels-community/flash-attn2` on the CLI. Search additional kernels at https://huggingface.co/models?other=kernel. Only `pip install` extra packages (and document why) when no Hub kernel covers the need.
   SCOPE-CHANGING FIXES: Avoid at all costs! When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for and/or change the training task itself — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request and is grounded in research and examples. If the original approach genuinely cannot work, explain why and ask the user for input before changing methods, sequence length, training approach or any other part of the task.
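
For reviewers, here is a rough sketch of what the new prompt text asks the agent to do. It is not part of this commit. The helper names (`attention_load_kwargs`, `sft_cli_args`, `load_model`) are hypothetical, and the "owner/name" check is an illustrative assumption rather than anything `transformers` enforces. Only the `attn_implementation` argument, the `--attn_implementation` flag, and the three kernel repo ids come from the prompt itself.

```python
"""Sketch: select a prebuilt Hub attention kernel instead of compiling flash-attn."""

# Kernel repo ids quoted in the prompt text of this commit.
HUB_ATTENTION_KERNELS = (
    "kernels-community/flash-attn2",
    "kernels-community/vllm-flash-attn3",
    "kernels-community/paged-attention",
)


def attention_load_kwargs(kernel_repo: str = HUB_ATTENTION_KERNELS[0]) -> dict:
    """Return keyword arguments for from_pretrained that select a Hub kernel.

    Hypothetical helper: the 'owner/name' shape check is only a guard against
    passing the legacy "flash_attention_2" string by mistake.
    """
    if kernel_repo.count("/") != 1:
        raise ValueError(f"expected an 'owner/name' Hub repo id, got {kernel_repo!r}")
    return {"attn_implementation": kernel_repo}


def sft_cli_args(kernel_repo: str = HUB_ATTENTION_KERNELS[0]) -> list:
    """Return the CLI fragment the prompt suggests for TRL/SFT scripts."""
    kwargs = attention_load_kwargs(kernel_repo)
    return ["--attn_implementation", kwargs["attn_implementation"]]


def load_model(model_id: str, kernel_repo: str = HUB_ATTENTION_KERNELS[0]):
    """Load a causal LM with the chosen Hub kernel.

    Assumes `transformers` and `kernels` are installed and the Hub is reachable;
    the import is deferred so the helpers above work without those packages.
    """
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id, **attention_load_kwargs(kernel_repo)
    )
```

Calling `load_model` with a model id of your choice downloads the kernel on first use, so it is best tried inside the job environment rather than locally.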

agent/tools/docs_tools.py CHANGED

@@ -932,7 +932,7 @@ EXPLORE_HF_DOCS_TOOL_SPEC = {
                     "• argilla — Data annotation, feedback, and human-in-the-loop workflows.\n"
                     "• distilabel — Synthetic data generation and distillation pipelines.\n"
                     "• microsoft-azure — Azure deployment and integration guides.\n"
-                    "• kernels — Lightweight execution environments and notebook-style workflows.\n"
+                    "• kernels — Load prebuilt compute kernels (e.g. flash-attn2) from the Hub via `attn_implementation`; avoids compiling flash-attn from source.\n"
                     "• google-cloud — GCP deployment and serving workflows.\n"
                 ),
             },