Kernels

Upload README.md with huggingface_hub

#9
by sayakpaul HF Staff - opened
Files changed (1)
  1. README.md +29 -69
README.md CHANGED
@@ -1,95 +1,55 @@
  ---
+ library_name: kernels
  license: apache-2.0
- tags:
- - kernels
  ---

- # vllm-flash-attn3
+ <!-- This model card has automatically been generated. You
+ should probably proofread and complete it, then remove this comment. -->

- This is an implementation of Flash Attention 3 CUDA kernels with support for attention sinks. The attention sinks implementation was contributed to Flash Attention by the [vLLM team](https://huggingface.co/vllm-project). The [transformers team](https://huggingface.co/transformers-community) packaged the implementation and pre-built it for use with the [kernels library](https://github.com/huggingface/kernels).

- Kernel source: https://github.com/huggingface/kernels-community/tree/main/vllm-flash-attn3
+ This is the repository card of {repo_id} that has been pushed on the Hub. It was built to be used with the [`kernels` library](https://github.com/huggingface/kernels). This card was automatically generated.

- ## Quickstart

- ```bash
- uv run https://huggingface.co/kernels-community/vllm-flash-attn3/raw/main/readme_example.py
- ```
+ ## How to use

  ```python
- # /// script
- # requires-python = ">=3.10"
- # dependencies = [
- # "torch",
- # "triton",
- # "numpy",
- # "kernels",
- # ]
- # ///
-
- import torch
+ # make sure `kernels` is installed: `pip install -U kernels`
  from kernels import get_kernel

- # Load vllm-flash-attn3 via kernels library
- vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")
-
- # Access Flash Attention function
- flash_attn_func = vllm_flash_attn3.flash_attn_func
-
- # Set device and seed for reproducibility
- device = "cuda"
- torch.manual_seed(42)
- torch.cuda.manual_seed(42)
+ kernel_module = get_kernel("kernels-community/vllm-flash-attn3") # <- change the ID if needed
+ flash_attn_combine = kernel_module.flash_attn_combine

- # Parameters
- batch_size = 2
- seqlen_q = 128 # Query sequence length
- seqlen_k = 256 # Key sequence length
- nheads = 8 # Number of attention heads
- d = 64 # Head dimension
+ flash_attn_combine(...)
+ ```

- # Create input tensors (Q, K, V)
- q = torch.randn(batch_size, seqlen_q, nheads, d, device=device, dtype=torch.bfloat16)
- k = torch.randn(batch_size, seqlen_k, nheads, d, device=device, dtype=torch.bfloat16)
- v = torch.randn(batch_size, seqlen_k, nheads, d, device=device, dtype=torch.bfloat16)
+ ## Available functions

- print(f"Query shape: {q.shape}")
- print(f"Key shape: {k.shape}")
- print(f"Value shape: {v.shape}")
+ - `flash_attn_combine`
+ - `flash_attn_func`
+ - `flash_attn_qkvpacked_func`
+ - `flash_attn_varlen_func`
+ - `flash_attn_with_kvcache`
+ - `get_scheduler_metadata`

- # Run Flash Attention 3
- output, lse = flash_attn_func(q, k, v, causal=True)
+ ## Supported backends

- print(f"\nOutput shape: {output.shape}")
- print(f"LSE (log-sum-exp) shape: {lse.shape}")
- print(f"\nAttention computation successful!")
- print(f"Output tensor stats - Mean: {output.mean().item():.4f}, Std: {output.std().item():.4f}")
- ```
+ - cuda

- ## How to Use
+ ## CUDA Capabilities

- When loading your model with transformers, provide this repository id as the source of the attention implementation:
+ - 8.0
+ - 9.0a

- ```diff
- from transformers import AutoModelForCausalLM, AutoTokenizer
+ ## Benchmarks

- model_id = "<your model id on the Hub>"
+ Benchmarking script is available for this kernel. Make sure to run `kernels benchmark org-id/repo-id` (replace "org-id" and "repo-id" with actual values).

- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
- model_id,
- device_map="auto",
- torch_dtype="auto",
- + # Flash Attention with Sinks
- + attn_implementation="kernels-community/vllm-flash-attn3”,
- )
- ```
+ [TODO: provide benchmarks if available]

- This will automatically resolve and download the appropriate code for your architecture. See more details in [this post](https://huggingface.co/blog/hello-hf-kernels).
+ ## Source code

- ## Credits
+ [TODO: provide original source code and other relevant citations if available]

- - [Tri Dao](https://huggingface.co/tridao) and team for Flash Attention and [Flash Attention 3](https://tridao.me/blog/2024/flash3/).
- - The [vLLM team](https://huggingface.co/vllm-project) for their implementation and their contribution of attention sinks.
- - The [transformers team](https://huggingface.co/transformers-community) for packaging, testing, building and making it available for use with the [kernels library](https://github.com/huggingface/kernels).
+ ## Notes

+ [TODO: provide additional notes about this kernel if needed]
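
The quickstart removed by this change called `flash_attn_func(q, k, v, causal=True)` and unpacked an output tensor and a log-sum-exp (LSE) tensor. The sketch below is a plain-Python, single-head reference for what those two values mean; it is not the kernel's implementation and ignores attention sinks, batching and multiple heads. It assumes a `1/sqrt(d)` softmax scale and a causal mask aligned to the bottom-right corner when `seqlen_k > seqlen_q`. Both are assumptions about the kernel's conventions, and `causal_attention_reference` is a hypothetical helper name.

```python
import math


def causal_attention_reference(q, k, v):
    """Single-head causal attention on nested lists of floats.

    q has shape seqlen_q x d; k and v have shape seqlen_k x d.
    Returns (output, lse), where lse[i] is the log-sum-exp of the
    scaled, unmasked scores for query row i.
    """
    seqlen_q, seqlen_k, d = len(q), len(k), len(q[0])
    if seqlen_k < seqlen_q:
        raise ValueError("this sketch assumes seqlen_k >= seqlen_q")

    scale = 1.0 / math.sqrt(d)    # assumed default softmax scale
    offset = seqlen_k - seqlen_q  # assumed bottom-right mask alignment

    output, lse = [], []
    for i in range(seqlen_q):
        # Query i may attend to keys 0 .. i + offset (inclusive).
        visible = range(min(seqlen_k, i + offset + 1))
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in visible]

        # Numerically stable softmax: subtract the row maximum first.
        row_max = max(scores)
        denom = sum(math.exp(s - row_max) for s in scores)
        lse.append(row_max + math.log(denom))

        weights = [math.exp(s - row_max) / denom for s in scores]
        output.append(
            [sum(w * v[j][c] for w, j in zip(weights, visible)) for c in range(d)]
        )
    return output, lse


# Tiny example: two queries, two keys, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, lse = causal_attention_reference(q, k, v)
print(output[0])  # the first query only sees the first key, so this row equals v[0]
print(lse)
```

A reference like this can be useful for spot-checking kernel output on small inputs, provided the mask alignment and scale assumptions are first confirmed against the kernel source.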