msj19 committed · Commit 50776d2 · verified · 1 Parent(s): 6d0130a

Add files using upload-large-folder tool

Files changed (50):
  1. README.md +188 -0
  2. fla3/ops/path_attn/__pycache__/parallel_path_bwd_inter_dkv.cpython-310.pyc +0 -0
  3. fla3/ops/path_attn/__pycache__/prepare_k_cache.cpython-310.pyc +0 -0
  4. fla3/ops/retention/__pycache__/parallel.cpython-312.pyc +0 -0
  5. fla3/ops/rwkv7/__pycache__/chunk.cpython-312.pyc +0 -0
  6. fla3/ops/rwkv7/__pycache__/fused_addcmul.cpython-310.pyc +0 -0
  7. fla3/ops/rwkv7/__pycache__/fused_k_update.cpython-310.pyc +0 -0
  8. fla3/ops/rwkv7/fused_recurrent.py +328 -0
  9. fla3/ops/simple_gla/__pycache__/chunk.cpython-312.pyc +0 -0
  10. fla3/ops/simple_gla/__pycache__/fused_recurrent.cpython-310.pyc +0 -0
  11. fla3/ops/simple_gla/__pycache__/parallel.cpython-310.pyc +0 -0
  12. fla3/ops/simple_gla/__pycache__/parallel.cpython-312.pyc +0 -0
  13. fla3/ops/simple_gla/fused_recurrent.py +108 -0
  14. fla3/ops/simple_gla/naive.py +54 -0
  15. fla3/ops/ttt/naive.py +126 -0
  16. fla3/ops/utils/__pycache__/__init__.cpython-310.pyc +0 -0
  17. fla3/ops/utils/__pycache__/__init__.cpython-312.pyc +0 -0
  18. fla3/ops/utils/__pycache__/asm.cpython-310.pyc +0 -0
  19. fla3/ops/utils/__pycache__/asm.cpython-312.pyc +0 -0
  20. fla3/ops/utils/__pycache__/cumsum.cpython-310.pyc +0 -0
  21. fla3/ops/utils/__pycache__/cumsum.cpython-312.pyc +0 -0
  22. fla3/ops/utils/__pycache__/index.cpython-310.pyc +0 -0
  23. fla3/ops/utils/__pycache__/index.cpython-312.pyc +0 -0
  24. fla3/ops/utils/__pycache__/logcumsumexp.cpython-310.pyc +0 -0
  25. fla3/ops/utils/__pycache__/logsumexp.cpython-310.pyc +0 -0
  26. fla3/ops/utils/__pycache__/logsumexp.cpython-312.pyc +0 -0
  27. fla3/ops/utils/__pycache__/matmul.cpython-310.pyc +0 -0
  28. fla3/ops/utils/__pycache__/op.cpython-312.pyc +0 -0
  29. fla3/ops/utils/__pycache__/pooling.cpython-310.pyc +0 -0
  30. fla3/ops/utils/cumsum.py +414 -0
  31. fla3/ops/utils/index.py +83 -0
  32. fla3/ops/utils/logcumsumexp.py +52 -0
  33. fla3/ops/utils/matmul.py +245 -0
  34. fla3/ops/utils/op.py +39 -0
  35. fla3/ops/utils/pack.py +208 -0
  36. fla3/ops/utils/pooling.py +207 -0
  37. fla3/ops/utils/softmax.py +111 -0
  38. fla3/ops/utils/solve_tril.py +276 -0
  39. flame/__init__.py +0 -0
  40. flame/__pycache__/__init__.cpython-310.pyc +0 -0
  41. flame/__pycache__/__init__.cpython-312.pyc +0 -0
  42. flame/__pycache__/data.cpython-310.pyc +0 -0
  43. flame/__pycache__/data.cpython-312.pyc +0 -0
  44. flame/__pycache__/logging.cpython-310.pyc +0 -0
  45. flame/__pycache__/logging.cpython-312.pyc +0 -0
  46. flame/__pycache__/parser.cpython-310.pyc +0 -0
  47. flame/__pycache__/parser.cpython-312.pyc +0 -0
  48. flame/data.py +246 -0
  49. flame/logging.py +118 -0
  50. flame/parser.py +94 -0
README.md ADDED
@@ -0,0 +1,188 @@
+ <div align="center">
+
+ # 🔥 Flame: Flash Linear Attention Made Easy
+
+ </div>
+
+ > [!IMPORTANT]
+ > The `flame` project has been migrated to a new project built on torchtitan.
+ > Please visit the [new repository](https://github.com/fla-org/flame) for details and updates.
+ >
+ > The code here is now **archived as legacy**, and no future updates will be synchronized here.
+
+ A minimal framework for training FLA models, whether from scratch or through finetuning.
+
+ Built on the robust infrastructure of 🤗, `flame` enables you to train large language models with just a few lines of code:
+ we use `datasets` for data processing, `transformers` for model definitions, and `accelerate`[^1] for seamless distributed training.
+
+ In this README, we will guide you through the process of using `flame` to train GLA models.
+
+ ## Setup
+
+ To get started, you'll need to install the required packages.
+ Both `fla` and `flame` have minimal dependencies.
+ Clone the `fla` repository and install the necessary packages as follows:
+
+ ```bash
+ git clone https://github.com/sustcsonglin/flash-linear-attention.git
+ cd flash-linear-attention
+ pip install .
+ pip install accelerate
+ ```
+
+ > [!CAUTION]
+ > The 🤗 `tokenizers` have some [memory leak issues](https://github.com/huggingface/tokenizers/issues/1539) when processing very long documents.
+ > To address this, please ensure you install `tokenizers>=0.20.4`.
+
+ ## Preprocessing
+
+ Before training, you need to download and pre-tokenize your dataset.
+ We provide a straightforward script for this.
+ For instance, to tokenize a 10B-token sample of the `fineweb-edu` dataset, run:
+
+ ```bash
+ python preprocess.py \
+   --dataset HuggingFaceFW/fineweb-edu \
+   --name sample-10BT \
+   --split train \
+   --context_length 2048
+ ```
+
+ This will cache the processed dataset at `data/HuggingFaceFW/fineweb-edu/sample-10BT/train`.
+
+ GLA utilizes a subset of SlimPajama for pretraining [in the paper](https://proceedings.mlr.press/v235/yang24ab.html).
+ Given the size of the dataset, the fastest way to download it is using `git lfs` (refer to [this issue](https://huggingface.co/datasets/cerebras/SlimPajama-627B/discussions/2)).
+ ```bash
+ git lfs install
+ git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B --depth 1
+ python preprocess.py \
+   --dataset SlimPajama-627B \
+   --split train \
+   --context_length 2048
+ ```
+
+ ## Training from scratch
+
+ To train a 340M model from scratch, execute the following command:
+
+ ```bash
+ bash train.sh \
+   type=gla \
+   lr=3e-4 \
+   scheduler=cosine_with_min_lr \
+   batch=32 \
+   update=1 \
+   warmup=1024 \
+   steps=20480 \
+   context=2048 \
+   gpus=8 \
+   nodes=1 \
+   path=exp/gla-340M-10B \
+   project=fla \
+   model=configs/gla_340M.json \
+   data=HuggingFaceFW/fineweb-edu \
+   name=sample-10BT \
+   cache=data/HuggingFaceFW/fineweb-edu/sample-10BT/train
+ ```
+
+ Key parameters:
+
+ | Parameter | Description                   | Default              |
+ | :-------- | :---------------------------- | :------------------- |
+ | lr        | `learning_rate`               | `3e-4`               |
+ | scheduler | `lr_scheduler_type`           | `cosine_with_min_lr` |
+ | batch     | `batch_size`                  | `32`                 |
+ | update    | `gradient_accumulation_steps` | `1`                  |
+ | context   | `context_length`              | `2048`               |
+ | gpus      | `num_gpus_per_node`           | `8`                  |
+ | nodes     | `num_nodes`                   | `1`                  |
+ | warmup    | `warmup_steps`                | `1024`               |
+ | steps     | `max_steps`                   | `20480`              |
+
+ The learning rate is set to `3e-4` by default, equipped with a cosine scheduler.
+ Other scheduler types like WSD (`warmup_stable_decay`)[^2] are also supported.
+
+ The total number of tokens processed per batch, referred to as `global_batch_size`, is calculated as
+ `batch_size × gradient_accumulation_steps × context_length × num_gpus_per_node × num_nodes`.
+ For instance, in the 340M model example, the `global_batch_size` comes to $32 \times 1 \times 2048 \times 8 \times 1 = 524,288$ (0.5M tokens).
+
+ The `warmup_steps` parameter indicates the number of steps for the learning rate warmup phase, while `max_steps` represents the maximum number of training steps.
+ Each step processes `global_batch_size` tokens.
+ Consequently, `warmup=1024` and `steps=20480` correspond to processing roughly 0.5B and 10B tokens, respectively.
+
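+ The token accounting above can be sanity-checked with a few lines of plain Python (illustrative variable names mirroring the table, not a `flame` API):
+
+ ```python
+ batch_size = 32
+ gradient_accumulation_steps = 1
+ context_length = 2048
+ num_gpus_per_node, num_nodes = 8, 1
+
+ global_batch_size = (batch_size * gradient_accumulation_steps
+                      * context_length * num_gpus_per_node * num_nodes)
+ assert global_batch_size == 524_288  # ~0.5M tokens per step
+
+ warmup_steps, max_steps = 1024, 20480
+ print(f"warmup: {warmup_steps * global_batch_size / 1e9:.2f}B tokens")  # ~0.54B
+ print(f"total:  {max_steps * global_batch_size / 1e9:.2f}B tokens")     # ~10.74B
+ ```
+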
+ :warning: Monitor the values of `global_batch_size`, `warmup_steps`, and `max_steps` carefully when modifying any of the hyperparameters!
+
+ `flame` also supports resuming interrupted training by specifying the checkpoint path.
+ Simply use the following command:
+
+ ```bash
+ bash train.sh \
+   type=gla \
+   lr=3e-4 \
+   steps=20480 \
+   batch=32 \
+   update=1 \
+   warmup=1024 \
+   context=2048 \
+   gpus=8 \
+   nodes=1 \
+   path=exp/gla-340M-10B \
+   project=fla \
+   model=configs/gla_340M.json \
+   data=HuggingFaceFW/fineweb-edu \
+   name=sample-10BT \
+   cache=data/HuggingFaceFW/fineweb-edu/sample-10BT/train \
+   checkpoint=exp/gla-340M-10B/checkpoint-8192
+ ```
+
+ You can also use `wandb` to monitor your training process effectively.
+
+ ![wandb](https://github.com/user-attachments/assets/05ca031c-1cae-41c9-bfcb-5b6b6d0df729)
+
+ ## Continual Pretraining
+
+ `flame` supports continual training from a pretrained checkpoint.
+ Below, we provide an example of how to finetune Mistral-7B to GLA.
+ You can follow similar steps to reproduce the results in the [GSA paper](https://arxiv.org/abs/2409.07146):
+
+ 1. Initialize a brand-new GLA-7B model from the config and copy the matched pretrained weights from Mistral-7B:
+ ```bash
+ cd ../utils
+ python convert_from_llama.py \
+   --model mistralai/Mistral-7B-v0.1 \
+   --config ../training/configs/gla_7B.json \
+   --output ../training/converted/gla-7B
+ cd -
+ ```
+
+ 2. Directly launch training from the converted checkpoint:
+ ```bash
+ bash train.sh \
+   type=gla \
+   lr=3e-5 \
+   steps=10240 \
+   batch=4 \
+   update=8 \
+   warmup=512 \
+   context=2048 \
+   path=exp/gla-7B-20B \
+   project=fla \
+   model=converted/gla-7B \
+   data=SlimPajama-627B \
+   cache=data/SlimPajama-627B/train
+ ```
+
+ Please be aware that finetuning on a single node may not be the most efficient approach.
+ If available, consider leveraging multi-node GPUs for optimal performance.
+ You can find guidance on how to launch a multi-node job in the [accelerate tutorial](https://github.com/huggingface/accelerate/blob/main/examples/slurm/submit_multinode.sh).
+
+ [^1]: The `accelerate` library supports various distributed frameworks, like `deepspeed` and `megatron`, for large-scale training. We use `deepspeed` in our case.
+ [^2]: https://arxiv.org/abs/2404.06395
fla3/ops/path_attn/__pycache__/parallel_path_bwd_inter_dkv.cpython-310.pyc ADDED
Binary file (5.47 kB). View file
 
fla3/ops/path_attn/__pycache__/prepare_k_cache.cpython-310.pyc ADDED
Binary file (2.26 kB). View file
 
fla3/ops/retention/__pycache__/parallel.cpython-312.pyc ADDED
Binary file (3.75 kB). View file
 
fla3/ops/rwkv7/__pycache__/chunk.cpython-312.pyc ADDED
Binary file (2.56 kB). View file
 
fla3/ops/rwkv7/__pycache__/fused_addcmul.cpython-310.pyc ADDED
Binary file (6.37 kB). View file
 
fla3/ops/rwkv7/__pycache__/fused_k_update.cpython-310.pyc ADDED
Binary file (3.93 kB). View file
 
fla3/ops/rwkv7/fused_recurrent.py ADDED
@@ -0,0 +1,328 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ import warnings
+ from typing import Optional, Tuple
+
+ import torch
+ import triton
+ import triton.language as tl
+ from einops import rearrange
+
+ from fla.ops.generalized_delta_rule import fused_recurrent_dplr_delta_rule
+ from fla.ops.utils.op import exp
+ from fla.utils import input_guard, use_cuda_graph
+
+
+ @triton.heuristics({
+     'USE_INITIAL_STATE': lambda args: args['h0'] is not None,
+     'STORE_FINAL_STATE': lambda args: args['ht'] is not None,
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({'BV': BV}, num_warps=num_warps, num_stages=num_stages)
+         for BV in [16, 32, 64]
+         for num_warps in [2, 4, 8, 16, 32]
+         for num_stages in [2, 3, 4]
+     ],
+     key=['BK'],
+     use_cuda_graph=use_cuda_graph,
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def fused_recurrent_rwkv7_fwd_kernel(
+     r,
+     w,
+     k,
+     v,
+     kk,
+     a,
+     o,
+     h0,
+     ht,
+     cu_seqlens,
+     scale,
+     T,
+     B: tl.constexpr,
+     H: tl.constexpr,
+     K: tl.constexpr,
+     V: tl.constexpr,
+     BK: tl.constexpr,
+     BV: tl.constexpr,
+     REVERSE: tl.constexpr,
+     USE_INITIAL_STATE: tl.constexpr,
+     STORE_FINAL_STATE: tl.constexpr,
+     IS_VARLEN: tl.constexpr,
+     IS_DECODE: tl.constexpr,
+ ):
+     i_v, i_nh = tl.program_id(0).to(tl.int64), tl.program_id(1).to(tl.int64)
+     i_n, i_h = i_nh // H, i_nh % H
+
+     if IS_VARLEN:
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int64), tl.load(cu_seqlens + i_n + 1).to(tl.int64)
+         T = eos - bos
+     else:
+         bos, eos = i_n * T, i_n * T + T
+
+     o_k = tl.arange(0, BK)
+     o_v = i_v * BV + tl.arange(0, BV)
+     p_r = r + (bos + ((T-1) if REVERSE else 0)) * H*K + i_h * K + o_k
+     p_w = w + (bos + ((T-1) if REVERSE else 0)) * H*K + i_h * K + o_k
+     p_k = k + (bos + ((T-1) if REVERSE else 0)) * H*K + i_h * K + o_k
+     p_v = v + (bos + ((T-1) if REVERSE else 0)) * H*V + i_h * V + o_v
+     p_a = a + (bos + ((T-1) if REVERSE else 0)) * H*K + i_h * K + o_k
+     p_kk = kk + (bos + ((T-1) if REVERSE else 0)) * H*K + i_h * K + o_k
+
+     p_o = o + (bos + ((T-1) if REVERSE else 0)) * H*V + i_h * V + o_v
+
+     mask_k = o_k < K
+     mask_v = o_v < V
+     mask_h = mask_k[None, :] & mask_v[:, None]
+     b_h = tl.zeros([BV, BK], dtype=tl.float32)
+
+     if USE_INITIAL_STATE:
+         p_h0 = h0 + i_nh * K*V + o_k[None, :] * V + o_v[:, None]
+         b_h += tl.load(p_h0, mask=mask_h, other=0).to(tl.float32)
+
+     if IS_DECODE:
+         b_r = tl.load(p_r, mask=mask_k, other=0).to(tl.float32) * scale
+         b_w = tl.load(p_w, mask=mask_k, other=0).to(tl.float32)
+         b_k = tl.load(p_k, mask=mask_k, other=0).to(tl.float32)
+         b_v = tl.load(p_v, mask=mask_v, other=0).to(tl.float32)
+         b_a = tl.load(p_a, mask=mask_k, other=0).to(tl.float32)
+         b_kk = tl.load(p_kk, mask=mask_k, other=0).to(tl.float32)
+         b_act_a = -b_kk
+         b_b = b_kk * b_a
+
+         tmp = tl.sum(b_h * b_act_a[None, :], axis=1)
+         b_h = exp(b_w)[None, :] * b_h + (tmp[:, None] * b_b[None, :] + b_k[None, :] * b_v[:, None])
+         b_o = tl.sum(b_h * b_r[None, :], axis=1)
+
+         tl.store(p_o, b_o.to(p_o.dtype.element_ty), mask=mask_v)
+     else:
+         for _ in range(0, T):
+             b_r = tl.load(p_r, mask=mask_k, other=0).to(tl.float32) * scale
+             b_w = tl.load(p_w, mask=mask_k, other=0).to(tl.float32)
+             b_k = tl.load(p_k, mask=mask_k, other=0).to(tl.float32)
+             b_v = tl.load(p_v, mask=mask_v, other=0).to(tl.float32)
+             b_a = tl.load(p_a, mask=mask_k, other=0).to(tl.float32)
+             b_kk = tl.load(p_kk, mask=mask_k, other=0).to(tl.float32)
+             b_act_a = -b_kk
+             b_b = b_kk * b_a
+
+             tmp = tl.sum(b_h * b_act_a[None, :], axis=1)
+             b_h = exp(b_w)[None, :] * b_h + (tmp[:, None] * b_b[None, :] + b_k[None, :] * b_v[:, None])
+             b_o = tl.sum(b_h * b_r[None, :], axis=1)
+
+             tl.store(p_o, b_o.to(p_o.dtype.element_ty), mask=mask_v)
+             p_r += (-1 if REVERSE else 1) * H*K
+             p_w += (-1 if REVERSE else 1) * H*K
+             p_k += (-1 if REVERSE else 1) * H*K
+             p_v += (-1 if REVERSE else 1) * H*V
+             p_a += (-1 if REVERSE else 1) * H*K
+             p_kk += (-1 if REVERSE else 1) * H*K
+             p_o += (-1 if REVERSE else 1) * H*V
+
+     if STORE_FINAL_STATE:
+         p_ht = ht + i_nh * K*V + o_k[None, :] * V + o_v[:, None]
+         tl.store(p_ht, b_h.to(p_ht.dtype.element_ty), mask=mask_h)
+
+
+ @input_guard
+ def fused_recurrent_rwkv7_fwd(
+     r: torch.Tensor,
+     w: torch.Tensor,
+     k: torch.Tensor,
+     v: torch.Tensor,
+     kk: torch.Tensor,
+     a: torch.Tensor,
+     scale: Optional[float] = 1.0,
+     initial_state: Optional[torch.Tensor] = None,
+     output_final_state: bool = False,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.LongTensor] = None,
+ ):
+     B, T, H, K, V = *k.shape, v.shape[-1]
+     N = B if cu_seqlens is None else len(cu_seqlens) - 1
+     BK = triton.next_power_of_2(K)
+     IS_DECODE = (T == 1)
+
+     h0 = initial_state
+     if not output_final_state:
+         ht = None
+     else:
+         ht = r.new_empty(N, H, K, V, dtype=torch.float32)
+     o = torch.empty_like(v)
+
+     def grid(meta): return (triton.cdiv(V, meta['BV']), N * H)
+     fused_recurrent_rwkv7_fwd_kernel[grid](
+         r,
+         w,
+         k,
+         v,
+         kk,
+         a,
+         o,
+         h0,
+         ht,
+         cu_seqlens,
+         scale,
+         T=T,
+         B=B,
+         H=H,
+         K=K,
+         V=V,
+         BK=BK,
+         REVERSE=reverse,
+         IS_DECODE=IS_DECODE
+     )
+     return o, ht
+
+
+ def fused_recurrent_rwkv7(
+     r: torch.Tensor,
+     w: torch.Tensor,
+     k: torch.Tensor,
+     v: torch.Tensor,
+     a: torch.Tensor,
+     b: torch.Tensor,
+     scale: float = 1.0,
+     initial_state: torch.Tensor = None,
+     output_final_state: bool = True,
+     cu_seqlens: Optional[torch.LongTensor] = None,
+     head_first: bool = False,
+ ):
+     """
+     Args:
+         r (torch.Tensor):
+             r of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         w (torch.Tensor):
+             log decay of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         k (torch.Tensor):
+             k of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         v (torch.Tensor):
+             v of shape `[B, T, H, V]` if `head_first=False` else `[B, H, T, V]`.
+         a (torch.Tensor):
+             a of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         b (torch.Tensor):
+             b of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         scale (float):
+             scale of the attention.
+         initial_state (torch.Tensor):
+             initial state of shape `[B, H, K, V]` if cu_seqlens is None else `[N, H, K, V]` where N = len(cu_seqlens) - 1.
+         output_final_state (bool):
+             whether to output the final state.
+         cu_seqlens (torch.LongTensor):
+             Cumulative sequence lengths of shape `[N+1]` used for variable-length training,
+             consistent with the FlashAttention API.
+         head_first (bool):
+             whether to use head-first layout. Recommended to be False to avoid extra transposes.
+             Default: `False`.
+     """
+     return fused_recurrent_dplr_delta_rule(
+         q=r,
+         k=k,
+         v=v,
+         a=a,
+         b=b,
+         gk=w,
+         scale=scale,
+         initial_state=initial_state,
+         output_final_state=output_final_state,
+         cu_seqlens=cu_seqlens,
+         head_first=head_first,
+     )
+
+
+ def fused_mul_recurrent_rwkv7(
+     r: torch.Tensor,
+     w: torch.Tensor,
+     k: torch.Tensor,
+     v: torch.Tensor,
+     kk: torch.Tensor,
+     a: torch.Tensor,
+     scale: Optional[float] = 1.0,
+     initial_state: Optional[torch.Tensor] = None,
+     output_final_state: bool = False,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     head_first: bool = False,
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     r"""
+     This function computes, step by step, the recurrence
+     `S_t = S_{t-1} * exp(w_t) + (S_{t-1} @ (-kk_t)) (kk_t * a_t)^T + v_t k_t^T`
+     (the update implemented by `fused_recurrent_rwkv7_fwd_kernel`).
+
+     Args:
+         r (torch.Tensor):
+             queries of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         w (torch.Tensor):
+             log decay of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`. decay term in log space!
+         k (torch.Tensor):
+             keys of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         v (torch.Tensor):
+             values of shape `[B, T, H, V]` if `head_first=False` else `[B, H, T, V]`.
+         kk (torch.Tensor):
+             removal keys of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         a (torch.Tensor):
+             in-context learning rates of shape `[B, T, H, K]` if `head_first=False` else `[B, H, T, K]`.
+         scale (Optional[float]):
+             Scale factor for the attention scores.
+             If not provided, it will default to `1 / sqrt(K)`. Default: `1`.
+         initial_state (Optional[torch.Tensor]):
+             Initial state of shape `[N, H, K, V]` for `N` input sequences.
+             For equal-length input sequences, `N` equals the batch size `B`.
+             Default: `None`.
+         output_final_state (Optional[bool]):
+             Whether to output the final state of shape `[N, H, K, V]`. Default: `False`.
+         reverse (Optional[bool]):
+             If `True`, process the state passing in reverse order. Default: `False`.
+         cu_seqlens (Optional[torch.Tensor]):
+             Cumulative sequence lengths of shape `[N + 1]` used for variable-length training,
+             consistent with the FlashAttention API.
+         head_first (Optional[bool]):
+             Whether the inputs are in the head-first format, which is not supported for variable-length inputs.
+             Deprecated. Default: `False`.
+     """
+     if head_first:
+         warnings.warn(
+             "head_first is deprecated and will be removed in a future version. "
+             "Please use head_first=False instead.",
+             DeprecationWarning
+         )
+         r, w, k, v, kk, a = map(lambda x: rearrange(x, 'b h t ... -> b t h ...'), (r, w, k, v, kk, a))
+     if not head_first and r.shape[1] < r.shape[2]:
+         warnings.warn(
+             f"Input tensor shape suggests potential format mismatch: seq_len ({r.shape[1]}) < num_heads ({r.shape[2]}). "
+             "This may indicate the inputs were passed in head-first format [B, H, T, ...] "
+             "when head_first=False was specified. "
+             "Please verify your input tensor format matches the expected shape [B, T, H, ...]."
+         )
+     if cu_seqlens is not None:
+         if r.shape[0] != 1:
+             raise ValueError(
+                 f"The batch size is expected to be 1 rather than {r.shape[0]} when using `cu_seqlens`. "
+                 f"Please flatten variable-length inputs before processing."
+             )
+         if initial_state is not None and initial_state.shape[0] != len(cu_seqlens) - 1:
+             raise ValueError(
+                 f"The number of initial states is expected to be equal to the number of input sequences, "
+                 f"i.e., {len(cu_seqlens) - 1} rather than {initial_state.shape[0]}."
+             )
+     if scale is None:
+         scale = r.shape[-1] ** -0.5
+     else:
+         assert scale > 0, "scale must be positive"
+     o, final_state = fused_recurrent_rwkv7_fwd(
+         r,
+         w,
+         k,
+         v,
+         kk,
+         a,
+         scale,
+         initial_state,
+         output_final_state,
+         reverse,
+         cu_seqlens,
+     )
+     if head_first:
+         o = rearrange(o, 'b t h ... -> b h t ...')
+     return o, final_state
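To see what `fused_recurrent_rwkv7_fwd_kernel` computes per step, here is a hedged naive PyTorch reference (an illustrative function, not part of the uploaded module; it assumes seq-first `[B, T, H, ...]` inputs, forward direction, no varlen). It mirrors the kernel's update: decay the state by `exp(w_t)`, remove along `kk_t`, re-inject via `kk_t * a_t`, add `v_t k_t^T`, then read out with `r_t`:

```python
import torch

def naive_mul_recurrent_rwkv7(r, w, k, v, kk, a, scale=1.0):
    # r, w, k, kk, a: [B, T, H, K]; v: [B, T, H, V] (illustrative reference only)
    B, T, H, K = r.shape
    V = v.shape[-1]
    # state is [v, k]-indexed, matching b_h ([BV, BK]) in the kernel
    S = torch.zeros(B, H, V, K, dtype=torch.float32, device=r.device)
    o = torch.zeros(B, T, H, V, dtype=torch.float32, device=r.device)
    for t in range(T):
        r_t = r[:, t].float() * scale
        w_t, k_t, kk_t, a_t = (x[:, t].float() for x in (w, k, kk, a))
        v_t = v[:, t].float()
        removal = torch.einsum('bhvk,bhk->bhv', S, -kk_t)   # `tmp` in the kernel
        S = (S * w_t.exp().unsqueeze(-2)                    # per-key decay exp(w_t)
             + removal.unsqueeze(-1) * (kk_t * a_t).unsqueeze(-2)
             + v_t.unsqueeze(-1) * k_t.unsqueeze(-2))       # v_t k_t^T
        o[:, t] = torch.einsum('bhvk,bhk->bhv', S, r_t)     # readout with r_t
    return o.to(v.dtype), S
```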
fla3/ops/simple_gla/__pycache__/chunk.cpython-312.pyc ADDED
Binary file (10.6 kB). View file
 
fla3/ops/simple_gla/__pycache__/fused_recurrent.cpython-310.pyc ADDED
Binary file (4.13 kB). View file
 
fla3/ops/simple_gla/__pycache__/parallel.cpython-310.pyc ADDED
Binary file (17 kB). View file
 
fla3/ops/simple_gla/__pycache__/parallel.cpython-312.pyc ADDED
Binary file (35.6 kB). View file
 
fla3/ops/simple_gla/fused_recurrent.py ADDED
@@ -0,0 +1,108 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ from typing import Optional, Tuple
+
+ import torch
+
+ from fla.ops.common.fused_recurrent import fused_recurrent
+
+
+ def fused_recurrent_simple_gla(
+     q: torch.Tensor,
+     k: torch.Tensor,
+     v: torch.Tensor,
+     g: torch.Tensor,
+     scale: Optional[float] = None,
+     initial_state: Optional[torch.Tensor] = None,
+     output_final_state: bool = False,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.LongTensor] = None,
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     r"""
+     Args:
+         q (torch.Tensor):
+             queries of shape `[B, T, H, K]`.
+         k (torch.Tensor):
+             keys of shape `[B, T, H, K]`.
+         v (torch.Tensor):
+             values of shape `[B, T, H, V]`.
+         g (torch.Tensor):
+             Forget gates of shape `[B, T, H]`.
+             Compared to GLA, the gating is head-wise instead of elementwise.
+         scale (Optional[float]):
+             Scale factor for the attention scores.
+             If not provided, it will default to `1 / sqrt(K)`. Default: `None`.
+         initial_state (Optional[torch.Tensor]):
+             Initial state of shape `[N, H, K, V]` for `N` input sequences.
+             For equal-length input sequences, `N` equals the batch size `B`.
+             Default: `None`.
+         output_final_state (Optional[bool]):
+             Whether to output the final state of shape `[N, H, K, V]`. Default: `False`.
+         reverse (Optional[bool]):
+             If `True`, process the state passing in reverse order. Default: `False`.
+         cu_seqlens (torch.LongTensor):
+             Cumulative sequence lengths of shape `[N+1]` used for variable-length training,
+             consistent with the FlashAttention API.
+
+     Returns:
+         o (torch.Tensor):
+             Outputs of shape `[B, T, H, V]`.
+         final_state (torch.Tensor):
+             Final state of shape `[N, H, K, V]` if `output_final_state=True` else `None`.
+
+     Examples::
+         >>> import torch
+         >>> import torch.nn.functional as F
+         >>> from einops import rearrange
+         >>> from fla.ops.simple_gla import fused_recurrent_simple_gla
+         # inputs with equal lengths
+         >>> B, T, H, K, V = 4, 2048, 4, 512, 512
+         >>> q = torch.randn(B, T, H, K, device='cuda')
+         >>> k = torch.randn(B, T, H, K, device='cuda')
+         >>> v = torch.randn(B, T, H, V, device='cuda')
+         >>> g = F.logsigmoid(torch.randn(B, T, H, device='cuda'))
+         >>> h0 = torch.randn(B, H, K, V, device='cuda')
+         >>> o, ht = fused_recurrent_simple_gla(
+             q, k, v, g,
+             initial_state=h0,
+             output_final_state=True
+         )
+         # for variable-length inputs, the batch size `B` is expected to be 1 and `cu_seqlens` is required
+         >>> q, k, v = map(lambda x: rearrange(x, 'b t h d -> 1 (b t) h d'), (q, k, v))
+         >>> g = rearrange(g, 'b t h -> 1 (b t) h')
+         # for a batch with 4 sequences, `cu_seqlens` with 5 start/end positions is expected
+         >>> cu_seqlens = q.new_tensor([0, 2048, 4096, 6144, 8192], dtype=torch.long)
+         >>> o_var, ht_var = fused_recurrent_simple_gla(
+             q, k, v, g,
+             initial_state=h0,
+             output_final_state=True,
+             cu_seqlens=cu_seqlens
+         )
+         >>> assert o.allclose(o_var.view(o.shape))
+         >>> assert ht.allclose(ht_var)
+     """
+     if cu_seqlens is not None:
+         if q.shape[0] != 1:
+             raise ValueError(
+                 f"The batch size is expected to be 1 rather than {q.shape[0]} when using `cu_seqlens`. "
+                 f"Please flatten variable-length inputs before processing."
+             )
+         if initial_state is not None and initial_state.shape[0] != len(cu_seqlens) - 1:
+             raise ValueError(
+                 f"The number of initial states is expected to be equal to the number of input sequences, "
+                 f"i.e., {len(cu_seqlens) - 1} rather than {initial_state.shape[0]}."
+             )
+     if scale is None:
+         scale = k.shape[-1] ** -0.5
+     o, final_state = fused_recurrent(
+         q=q,
+         k=k,
+         v=v,
+         g=g,
+         scale=scale,
+         initial_state=initial_state,
+         output_final_state=output_final_state,
+         reverse=reverse,
+         cu_seqlens=cu_seqlens
+     )
+     return o, final_state
fla3/ops/simple_gla/naive.py ADDED
@@ -0,0 +1,54 @@
+ # -*- coding: utf-8 -*-
+
+ import torch
+ from einops import rearrange
+
+
+ def torch_simple_gla(q, k, v, g, chunk_size=64, scale=None):
+     if scale is None:
+         scale = (q.shape[-1] ** -0.5)
+     q = rearrange(q, 'b h (n c) d -> b h n c d', c=chunk_size) * scale
+     k = rearrange(k, 'b h (n c) d -> b h n c d', c=chunk_size)
+     v = rearrange(v, 'b h (n c) d -> b h n c d', c=chunk_size)
+     g = rearrange(g, 'b h (n c) -> b h n c', c=chunk_size)
+     g = g.cumsum(-1)
+     kv = k.transpose(-1, -2) @ (v * (-g + g[:, :, :, -1, None]).exp()[..., None])
+     S = torch.zeros_like(kv)
+
+     for i in range(1, g.shape[-2]):
+         S[:, :, i] = S[:, :, i-1].clone() * g[:, :, i-1, -1, None, None].exp() + kv[:, :, i-1]
+
+     inter = (q * g[..., None].exp()) @ S
+     attn = q @ k.transpose(-1, -2)
+     attn = attn * (g[..., None] - g[..., None, :]).exp()
+     attn = attn.masked_fill(torch.triu(torch.ones(chunk_size, chunk_size, dtype=bool, device=q.device), diagonal=1), 0)
+     intra = attn @ v
+     o = inter + intra
+     return rearrange(o, 'b h n c d -> b h (n c) d')
+
+
+ def torch_simple_gla_recurrent(q, k, v, g, scale=None, initial_state=None, output_final_state=True):
+     B, H, T, DK = q.shape
+     original_dtype = q.dtype
+     q, k, v, g = q.float(), k.float(), v.float(), g.float()
+     if scale is None:
+         scale = DK ** -0.5
+     q = q * scale
+     _, _, _, DV = v.shape
+     if initial_state is None:
+         # allocate the state on the same device/dtype as the (float) inputs
+         S = torch.zeros(B, H, DK, DV, dtype=q.dtype, device=q.device)
+     else:
+         S = initial_state.to(q)
+     o = torch.zeros(B, H, T, DV).to(q)
+     for i in range(T):
+         gate = g[:, :, i].exp()
+         key = k[:, :, i]
+         value = v[:, :, i]
+         kv = key.unsqueeze(-1) * value.unsqueeze(-2)
+         S = S.clone() * gate.unsqueeze(-1).unsqueeze(-1) + kv
+         q_i = q[:, :, i, :]
+         o_i = (q_i.unsqueeze(-1) * S).sum(-2)
+         o[:, :, i] = o_i
+     if not output_final_state:
+         S = None
+     return o.to(original_dtype), S
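With the fused kernel (`fla3/ops/simple_gla/fused_recurrent.py` above) and this naive recurrence side by side, a quick consistency check is natural. A hedged sketch: it assumes the upstream `fla` package is installed for the fused path, that this `fla3` module is importable as uploaded, a CUDA device, and illustrative tolerances. Note the naive reference is head-first, so inputs are transposed:

```python
import torch
import torch.nn.functional as F
from fla.ops.simple_gla import fused_recurrent_simple_gla   # upstream package path
from fla3.ops.simple_gla.naive import torch_simple_gla_recurrent  # path as uploaded here

B, T, H, K, V = 2, 64, 2, 32, 32
q = torch.randn(B, T, H, K, device='cuda')
k = torch.randn(B, T, H, K, device='cuda')
v = torch.randn(B, T, H, V, device='cuda')
g = F.logsigmoid(torch.randn(B, T, H, device='cuda'))  # head-wise log forget gates

o_fused, _ = fused_recurrent_simple_gla(q, k, v, g)
# the naive reference expects head-first [B, H, T, ...] inputs
o_ref, _ = torch_simple_gla_recurrent(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), g.transpose(1, 2)
)
torch.testing.assert_close(o_fused, o_ref.transpose(1, 2), rtol=1e-3, atol=1e-3)
```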
fla3/ops/ttt/naive.py ADDED
@@ -0,0 +1,126 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang, Yuqi Pan
+
+ import torch
+ import torch.nn.functional as F
+
+
+ def ttt_linear(
+     q: torch.Tensor,
+     k: torch.Tensor,
+     v: torch.Tensor,
+     w: torch.Tensor,
+     b: torch.Tensor,
+     eta: torch.Tensor,
+     scale: float,
+     eps: float,
+     mini_batch_size: int,
+     initial_state: torch.Tensor,
+     initial_state_bias: torch.Tensor,
+     output_final_state: bool
+ ):
+     B, H, T, D = q.shape
+     BT = mini_batch_size
+     NT = T // BT
+     # scale queries out-of-place before taking the chunked views below,
+     # so the caller's tensor is not mutated
+     q = q * scale
+     # [NT, B, H, mini_batch_size, D]
+     _q = q.reshape(B, H, NT, BT, D).permute(2, 0, 1, 3, 4)
+     _k = k.reshape(B, H, NT, BT, D).permute(2, 0, 1, 3, 4)
+     _v = v.reshape(B, H, NT, BT, D).permute(2, 0, 1, 3, 4)
+     # [NT, B, H, BT, 1]
+     _eta = eta.reshape(B, H, NT, BT, 1).permute(2, 0, 1, 3, 4)
+     # [H, 1, D]
+     w = w.reshape(H, 1, D).to(torch.float32)
+     b = b.reshape(H, 1, D).to(torch.float32)
+
+     h = torch.zeros((B, H, D, D), device=v.device, dtype=torch.float32) if initial_state is None else initial_state
+     hb = torch.zeros((B, H, 1, D), device=v.device, dtype=torch.float32) if initial_state_bias is None else initial_state_bias
+     # [NT, B, H, BT, D]
+     o = torch.empty_like(_v)
+
+     for i in range(NT):
+         q_i, k_i, v_i, eta_i = [x[i] for x in [_q, _k, _v, _eta]]
+         kh = k_i @ h + hb
+         reconstruction_target = v_i - k_i
+
+         mean = kh.mean(-1, True)
+         var = kh.var(-1, unbiased=False, keepdim=True).to(torch.float32)
+         rstd = torch.sqrt(var + eps).to(torch.float32)
+         kh_hat = (kh - mean) / rstd
+
+         g = w * kh_hat + b - reconstruction_target
+         g *= w
+         v_new = (D * g - g.sum(-1, True) - kh_hat * (g * kh_hat).sum(-1, True)) / (rstd * D)
+
+         Attn = torch.tril(q_i @ k_i.transpose(-2, -1))
+         o_i = q_i @ h - (eta_i * Attn) @ v_new + hb - torch.tril(eta_i.expand_as(Attn)) @ v_new
+         h = h - (eta_i[:, :, -1, :, None] * k_i).transpose(-1, -2) @ v_new
+         hb = hb - torch.sum(eta_i[:, :, -1, :, None] * v_new, dim=-2, keepdim=True)
+
+         # layer norm with residuals
+         mean = o_i.mean(dim=-1, keepdim=True)
+         var = o_i.var(dim=-1, unbiased=False, keepdim=True).to(torch.float32)
+         rstd = torch.sqrt(var + eps).to(torch.float32)
+         o[i] = o_i + (o_i - mean) / rstd * w + b
+
+     # [B, H, T, D]
+     o = o.permute(1, 2, 0, 3, 4).reshape(B, H, T, D)
+     h = h if output_final_state else None
+     hb = hb if output_final_state else None
+     return o, h, hb
+
+
+ def chunk_ttt_linear_ref(
+     q: torch.Tensor,
+     k: torch.Tensor,
+     v: torch.Tensor,
+     w: torch.Tensor,
+     b: torch.Tensor,
+     eta: torch.Tensor,
+     scale: float = None,
+     eps: float = 1e-6,
+     mini_batch_size: int = 16,
+     initial_state: torch.Tensor = None,
+     initial_state_bias: torch.Tensor = None,
+     output_final_state: bool = False,
+     head_first: bool = False,
+ ):
+     assert q.dtype == k.dtype == v.dtype
+     assert k.shape[-1] == v.shape[-1], "The key and value dimension must be the same."
+     if isinstance(eta, float):
+         eta = torch.full_like(q[:, :, :, :1], eta)
+     if scale is None:
+         scale = k.shape[-1] ** -0.5
+     if not head_first:
+         q = q.transpose(1, 2)
+         k = k.transpose(1, 2)
+         v = v.transpose(1, 2)
+         eta = eta.transpose(1, 2)
+     T = q.shape[-2]
+     padded = (mini_batch_size - (T % mini_batch_size)) % mini_batch_size
+     if padded > 0:
+         q = F.pad(q, (0, 0, 0, padded))
+         k = F.pad(k, (0, 0, 0, padded))
+         v = F.pad(v, (0, 0, 0, padded))
+         eta = F.pad(eta, (0, 0, 0, padded))
+         eta[:, :, -1, :] = eta[:, :, -(padded+1), :]
+     assert q.shape[-2] % mini_batch_size == 0, "Sequence length should be a multiple of mini_batch_size."
+     q, k, v, eta, w, b = map(lambda x: x.to(torch.float32), [q, k, v, eta, w, b])
+     o, final_state, final_state_bias = ttt_linear(
+         q,
+         k,
+         v,
+         w,
+         b,
+         eta,
+         scale,
+         eps,
+         mini_batch_size,
+         initial_state,
+         initial_state_bias,
+         output_final_state,
+     )
+     o = o[:, :, :T, :].contiguous()
+     if not head_first:
+         o = o.transpose(1, 2)
+     return o, final_state, final_state_bias
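For orientation, a minimal smoke test of `chunk_ttt_linear_ref` above (a hedged sketch: shapes follow the signature, LayerNorm parameters are identity, and the `fla3.ops.ttt.naive` import path matches this upload):

```python
import torch

from fla3.ops.ttt.naive import chunk_ttt_linear_ref  # path as uploaded here

B, T, H, D = 2, 64, 4, 32
q, k, v = (torch.randn(B, T, H, D) for _ in range(3))
w = torch.ones(H, D)   # per-head LayerNorm weight (identity)
b = torch.zeros(H, D)  # per-head LayerNorm bias (identity)

o, h, hb = chunk_ttt_linear_ref(
    q, k, v, w, b, eta=0.01, mini_batch_size=16, output_final_state=True
)
print(o.shape, h.shape, hb.shape)
# torch.Size([2, 64, 4, 32]) torch.Size([2, 4, 32, 32]) torch.Size([2, 4, 1, 32])
```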
fla3/ops/utils/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (1.16 kB). View file
 
fla3/ops/utils/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (1.2 kB). View file
 
fla3/ops/utils/__pycache__/asm.cpython-310.pyc ADDED
Binary file (482 Bytes). View file
 
fla3/ops/utils/__pycache__/asm.cpython-312.pyc ADDED
Binary file (543 Bytes). View file
 
fla3/ops/utils/__pycache__/cumsum.cpython-310.pyc ADDED
Binary file (10.3 kB). View file
 
fla3/ops/utils/__pycache__/cumsum.cpython-312.pyc ADDED
Binary file (21.4 kB). View file
 
fla3/ops/utils/__pycache__/index.cpython-310.pyc ADDED
Binary file (3.12 kB). View file
 
fla3/ops/utils/__pycache__/index.cpython-312.pyc ADDED
Binary file (5.48 kB). View file
 
fla3/ops/utils/__pycache__/logcumsumexp.cpython-310.pyc ADDED
Binary file (1.54 kB). View file
 
fla3/ops/utils/__pycache__/logsumexp.cpython-310.pyc ADDED
Binary file (2.25 kB). View file
 
fla3/ops/utils/__pycache__/logsumexp.cpython-312.pyc ADDED
Binary file (3.66 kB). View file
 
fla3/ops/utils/__pycache__/matmul.cpython-310.pyc ADDED
Binary file (5.29 kB). View file
 
fla3/ops/utils/__pycache__/op.cpython-312.pyc ADDED
Binary file (1.56 kB). View file
 
fla3/ops/utils/__pycache__/pooling.cpython-310.pyc ADDED
Binary file (5.61 kB). View file
 
fla3/ops/utils/cumsum.py ADDED
@@ -0,0 +1,414 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ import warnings
+ from typing import Optional
+
+ import torch
+ import triton
+ import triton.language as tl
+
+ from ...ops.utils.index import prepare_chunk_indices
+ from ...utils import check_shared_mem, input_guard
+
+ BS_LIST = [32, 64] if check_shared_mem() else [16, 32]
+
+
+ @triton.heuristics({
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=num_warps)
+         for num_warps in [1, 2, 4, 8]
+     ],
+     key=['B', 'H', 'BT', 'IS_VARLEN', 'REVERSE']
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def chunk_local_cumsum_scalar_kernel(
+     s,
+     o,
+     cu_seqlens,
+     chunk_indices,
+     T,
+     B: tl.constexpr,
+     H: tl.constexpr,
+     BT: tl.constexpr,
+     REVERSE: tl.constexpr,
+     IS_VARLEN: tl.constexpr,
+     HEAD_FIRST: tl.constexpr,
+ ):
+     i_t, i_bh = tl.program_id(0), tl.program_id(1)
+     i_b, i_h = i_bh // H, i_bh % H
+     if IS_VARLEN:
+         i_n, i_t = tl.load(chunk_indices + i_t * 2).to(tl.int32), tl.load(chunk_indices + i_t * 2 + 1).to(tl.int32)
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+         T = eos - bos
+     else:
+         bos, eos = i_b * T, i_b * T + T
+
+     if HEAD_FIRST:
+         p_s = tl.make_block_ptr(s + bos*H + i_h*T, (T,), (1,), (i_t * BT,), (BT,), (0,))
+         p_o = tl.make_block_ptr(o + bos*H + i_h*T, (T,), (1,), (i_t * BT,), (BT,), (0,))
+     else:
+         p_s = tl.make_block_ptr(s + bos*H + i_h, (T,), (H,), (i_t * BT,), (BT,), (0,))
+         p_o = tl.make_block_ptr(o + bos*H + i_h, (T,), (H,), (i_t * BT,), (BT,), (0,))
+     # [BT]
+     b_s = tl.load(p_s, boundary_check=(0,)).to(tl.float32)
+     b_o = tl.cumsum(b_s, axis=0)
+     if REVERSE:
+         b_z = tl.sum(b_s, axis=0)
+         b_o = -b_o + b_z[None] + b_s
+     tl.store(p_o, b_o.to(p_o.dtype.element_ty), boundary_check=(0,))
+
+
+ @triton.heuristics({
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({'BS': BS}, num_warps=num_warps)
+         for BS in BS_LIST
+         for num_warps in [2, 4, 8]
+     ],
+     key=['B', 'H', 'S', 'BT', 'IS_VARLEN', 'REVERSE']
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def chunk_local_cumsum_vector_kernel(
+     s,
+     o,
+     cu_seqlens,
+     chunk_indices,
+     T,
+     B: tl.constexpr,
+     H: tl.constexpr,
+     S: tl.constexpr,
+     BT: tl.constexpr,
+     BS: tl.constexpr,
+     REVERSE: tl.constexpr,
+     IS_VARLEN: tl.constexpr,
+     HEAD_FIRST: tl.constexpr,
+ ):
+     i_s, i_t, i_bh = tl.program_id(0), tl.program_id(1), tl.program_id(2)
+     i_b, i_h = i_bh // H, i_bh % H
+     if IS_VARLEN:
+         i_n, i_t = tl.load(chunk_indices + i_t * 2).to(tl.int32), tl.load(chunk_indices + i_t * 2 + 1).to(tl.int32)
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+         T = eos - bos
+     else:
+         bos, eos = i_b * T, i_b * T + T
+
+     o_i = tl.arange(0, BT)
+     if REVERSE:
+         m_s = tl.where(o_i[:, None] <= o_i[None, :], 1., 0.)
+     else:
+         m_s = tl.where(o_i[:, None] >= o_i[None, :], 1., 0.)
+
+     if HEAD_FIRST:
+         p_s = tl.make_block_ptr(s + (bos * H + i_h*T)*S, (T, S), (S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+         p_o = tl.make_block_ptr(o + (bos * H + i_h*T)*S, (T, S), (S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+     else:
+         p_s = tl.make_block_ptr(s + (bos * H + i_h) * S, (T, S), (H*S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+         p_o = tl.make_block_ptr(o + (bos * H + i_h) * S, (T, S), (H*S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+     # [BT, BS]
+     b_s = tl.load(p_s, boundary_check=(0, 1)).to(tl.float32)
+     b_o = tl.dot(m_s, b_s, allow_tf32=False)
+     tl.store(p_o, b_o.to(p_o.dtype.element_ty), boundary_check=(0, 1))
+
+
+ @triton.heuristics({
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({'BT': BT}, num_warps=num_warps, num_stages=num_stages)
+         for BT in [32, 64, 128, 256]
+         for num_warps in [2, 4, 8]
+         for num_stages in [1, 2, 3, 4]
+     ],
+     key=['B', 'H', 'IS_VARLEN', 'REVERSE']
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def chunk_global_cumsum_scalar_kernel(
+     s,
+     o,
+     cu_seqlens,
+     T,
+     B: tl.constexpr,
+     H: tl.constexpr,
+     BT: tl.constexpr,
+     REVERSE: tl.constexpr,
+     IS_VARLEN: tl.constexpr,
+     HEAD_FIRST: tl.constexpr,
+ ):
+     i_nh = tl.program_id(0)
+     i_n, i_h = i_nh // H, i_nh % H
+     if IS_VARLEN:
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+     else:
+         bos, eos = i_n * T, i_n * T + T
+     T = eos - bos
+
+     b_z = tl.zeros([], dtype=tl.float32)
+     NT = tl.cdiv(T, BT)
+     for i_c in range(NT):
+         i_t = NT-1-i_c if REVERSE else i_c
+         if HEAD_FIRST:
+             p_s = tl.make_block_ptr(s + bos*H + i_h*T, (T,), (1,), (i_t * BT,), (BT,), (0,))
+             p_o = tl.make_block_ptr(o + bos*H + i_h*T, (T,), (1,), (i_t * BT,), (BT,), (0,))
+         else:
+             p_s = tl.make_block_ptr(s + bos*H + i_h, (T,), (H,), (i_t * BT,), (BT,), (0,))
+             p_o = tl.make_block_ptr(o + bos*H + i_h, (T,), (H,), (i_t * BT,), (BT,), (0,))
+         b_s = tl.load(p_s, boundary_check=(0,)).to(tl.float32)
+         b_o = tl.cumsum(b_s, axis=0)
+         b_ss = tl.sum(b_s, 0)
+         if REVERSE:
+             b_o = -b_o + b_ss + b_s
+         b_o += b_z
+         # carry the running total across chunks
+         b_z += b_ss
+         tl.store(p_o, b_o.to(p_o.dtype.element_ty), boundary_check=(0,))
+
+
+ @triton.heuristics({
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None,
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({'BT': BT}, num_warps=num_warps, num_stages=num_stages)
+         for BT in [16, 32, 64, 128]
+         for num_warps in [2, 4, 8]
+         for num_stages in [1, 2, 3, 4]
+     ],
+     key=['B', 'H', 'S', 'IS_VARLEN', 'REVERSE']
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def chunk_global_cumsum_vector_kernel(
+     s,
+     z,
+     cu_seqlens,
+     T,
+     B: tl.constexpr,
+     H: tl.constexpr,
+     S: tl.constexpr,
+     BT: tl.constexpr,
+     BS: tl.constexpr,
+     REVERSE: tl.constexpr,
+     IS_VARLEN: tl.constexpr,
+     HEAD_FIRST: tl.constexpr,
+ ):
+     i_s, i_nh = tl.program_id(0), tl.program_id(1)
+     i_n, i_h = i_nh // H, i_nh % H
+     if IS_VARLEN:
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+     else:
+         bos, eos = i_n * T, i_n * T + T
+     T = eos - bos
+
+     o_i = tl.arange(0, BT)
+     if REVERSE:
+         m_s = tl.where(o_i[:, None] <= o_i[None, :], 1., 0.)
+     else:
+         m_s = tl.where(o_i[:, None] >= o_i[None, :], 1., 0.)
+
+     b_z = tl.zeros([BS], dtype=tl.float32)
+     NT = tl.cdiv(T, BT)
+     for i_c in range(NT):
+         i_t = NT-1-i_c if REVERSE else i_c
+         if HEAD_FIRST:
+             p_s = tl.make_block_ptr(s + (bos * H + i_h*T)*S, (T, S), (S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+             p_z = tl.make_block_ptr(z + (bos * H + i_h*T)*S, (T, S), (S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+         else:
+             p_s = tl.make_block_ptr(s + (bos * H + i_h) * S, (T, S), (H*S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+             p_z = tl.make_block_ptr(z + (bos * H + i_h) * S, (T, S), (H*S, 1), (i_t * BT, i_s * BS), (BT, BS), (1, 0))
+         # [BT, BS]
+         b_s = tl.load(p_s, boundary_check=(0, 1)).to(tl.float32)
+         b_c = b_z[None, :] + tl.dot(m_s, b_s, allow_tf32=False)
+         tl.store(p_z, b_c.to(p_z.dtype.element_ty), boundary_check=(0, 1))
+         # carry the running total across chunks
+         b_z += tl.sum(b_s, 0)
+
+
+ def chunk_local_cumsum_scalar(
+     g: torch.Tensor,
+     chunk_size: int,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     head_first: bool = False,
+     output_dtype: Optional[torch.dtype] = torch.float
+ ) -> torch.Tensor:
+     if head_first:
+         B, H, T = g.shape
+     else:
+         B, T, H = g.shape
+     assert chunk_size == 2**(chunk_size.bit_length()-1), "chunk_size must be a power of 2"
+     BT = chunk_size
+     chunk_indices = prepare_chunk_indices(cu_seqlens, BT) if cu_seqlens is not None else None
+     NT = triton.cdiv(T, BT) if cu_seqlens is None else len(chunk_indices)
+     g_org, g = g, torch.empty_like(g, dtype=output_dtype or g.dtype)
+     grid = (NT, B * H)
+     chunk_local_cumsum_scalar_kernel[grid](
+         g_org,
+         g,
+         cu_seqlens,
+         chunk_indices,
+         T=T,
+         B=B,
+         H=H,
+         BT=BT,
+         HEAD_FIRST=head_first,
+         REVERSE=reverse
+     )
+     return g
+
+
+ def chunk_local_cumsum_vector(
+     g: torch.Tensor,
+     chunk_size: int,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     head_first: bool = False,
+     output_dtype: Optional[torch.dtype] = torch.float
+ ) -> torch.Tensor:
+     if head_first:
+         B, H, T, S = g.shape
+     else:
+         B, T, H, S = g.shape
+     BT = chunk_size
+     chunk_indices = prepare_chunk_indices(cu_seqlens, chunk_size) if cu_seqlens is not None else None
+     NT = triton.cdiv(T, BT) if cu_seqlens is None else len(chunk_indices)
+     assert chunk_size == 2**(chunk_size.bit_length()-1), "chunk_size must be a power of 2"
+
+     g_org, g = g, torch.empty_like(g, dtype=output_dtype or g.dtype)
+
+     def grid(meta): return (triton.cdiv(meta['S'], meta['BS']), NT, B * H)
+     # keep the cumulative normalizer in fp32
+     # this kernel is equivalent to
+     # g = g.view(B, H, NT, BT, -1).cumsum(-2).view(B, H, T, -1)
+     chunk_local_cumsum_vector_kernel[grid](
+         g_org,
+         g,
+         cu_seqlens,
+         chunk_indices,
+         T=T,
+         B=B,
+         H=H,
+         S=S,
+         BT=BT,
+         HEAD_FIRST=head_first,
+         REVERSE=reverse
+     )
+     return g
+
+
+ @input_guard
+ def chunk_global_cumsum_scalar(
+     s: torch.Tensor,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     head_first: bool = False,
+     output_dtype: Optional[torch.dtype] = torch.float
+ ) -> torch.Tensor:
+     if head_first:
+         B, H, T = s.shape
+     else:
+         B, T, H = s.shape
+     N = len(cu_seqlens) - 1 if cu_seqlens is not None else B
+
+     z = torch.empty_like(s, dtype=output_dtype or s.dtype)
+     grid = (N * H,)
+     chunk_global_cumsum_scalar_kernel[grid](
+         s,
+         z,
+         cu_seqlens,
+         T=T,
+         B=B,
+         H=H,
+         HEAD_FIRST=head_first,
+         REVERSE=reverse
+     )
+     return z
+
+
+ @input_guard
+ def chunk_global_cumsum_vector(
+     s: torch.Tensor,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     head_first: bool = False,
+     output_dtype: Optional[torch.dtype] = torch.float
+ ) -> torch.Tensor:
+     if head_first:
+         B, H, T, S = s.shape
+     else:
+         B, T, H, S = s.shape
+     N = len(cu_seqlens) - 1 if cu_seqlens is not None else B
+     BS = min(32, triton.next_power_of_2(S))
+
+     z = torch.empty_like(s, dtype=output_dtype or s.dtype)
+     grid = (triton.cdiv(S, BS), N * H)
+     chunk_global_cumsum_vector_kernel[grid](
+         s,
+         z,
+         cu_seqlens,
+         T=T,
+         B=B,
+         H=H,
+         S=S,
+         BS=BS,
+         HEAD_FIRST=head_first,
+         REVERSE=reverse
+     )
+     return z
+
+
+ @input_guard
+ def chunk_global_cumsum(
+     s: torch.Tensor,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     head_first: bool = False,
+     output_dtype: Optional[torch.dtype] = torch.float
+ ) -> torch.Tensor:
+     if cu_seqlens is not None:
+         assert s.shape[0] == 1, "Only batch size 1 is supported when cu_seqlens are provided"
+     if len(s.shape) == 3:
+         return chunk_global_cumsum_scalar(s, reverse, cu_seqlens, head_first, output_dtype)
+     elif len(s.shape) == 4:
+         return chunk_global_cumsum_vector(s, reverse, cu_seqlens, head_first, output_dtype)
+     else:
+         raise ValueError(
+             f"Unsupported input shape {s.shape}, "
+             f"which should be [B, T, H]/[B, T, H, D] if `head_first=False` "
+             f"or [B, H, T]/[B, H, T, D] otherwise"
+         )
+
+
+ @input_guard
+ def chunk_local_cumsum(
+     g: torch.Tensor,
+     chunk_size: int,
+     reverse: bool = False,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     head_first: bool = False,
+     output_dtype: Optional[torch.dtype] = torch.float,
+     **kwargs
+ ) -> torch.Tensor:
+     if not head_first and g.shape[1] < g.shape[2]:
+         warnings.warn(
+             f"Input tensor shape suggests potential format mismatch: seq_len ({g.shape[1]}) < num_heads ({g.shape[2]}). "
+             "This may indicate the inputs were passed in head-first format [B, H, T, ...] "
+             "when head_first=False was specified. "
+             "Please verify your input tensor format matches the expected shape [B, T, H, ...]."
+         )
+     if cu_seqlens is not None:
+         assert g.shape[0] == 1, "Only batch size 1 is supported when cu_seqlens are provided"
+     if len(g.shape) == 3:
+         return chunk_local_cumsum_scalar(g, chunk_size, reverse, cu_seqlens, head_first, output_dtype)
+     elif len(g.shape) == 4:
+         return chunk_local_cumsum_vector(g, chunk_size, reverse, cu_seqlens, head_first, output_dtype)
+     else:
+         raise ValueError(
+             f"Unsupported input shape {g.shape}, "
+             f"which should be (B, T, H, D) if `head_first=False` "
+             f"or (B, H, T, D) otherwise"
+         )
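As the inline comment in `chunk_local_cumsum_vector` notes, these kernels compute a cumulative sum that restarts at every chunk. A hedged pure-PyTorch reference for the simplest case (scalar gates, `head_first=False`, no varlen, no reverse; `ref_chunk_local_cumsum` is an illustrative name, not part of the module):

```python
import torch

def ref_chunk_local_cumsum(g: torch.Tensor, chunk_size: int) -> torch.Tensor:
    # g: [B, T, H]; inclusive cumsum restarted at every chunk of length `chunk_size`
    B, T, H = g.shape
    assert T % chunk_size == 0, "reference assumes T divisible by chunk_size"
    return (
        g.float()
        .view(B, T // chunk_size, chunk_size, H)
        .cumsum(dim=2)
        .view(B, T, H)
    )

g = torch.randn(2, 128, 4)
out = ref_chunk_local_cumsum(g, chunk_size=64)
```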
fla3/ops/utils/index.py ADDED
@@ -0,0 +1,83 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ import torch
+ import torch.nn.functional as F
+ import triton
+ import triton.language as tl
+
+ from ...utils import tensor_cache
+
+
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=num_warps)
+         for num_warps in [4, 8, 16, 32]
+     ],
+     key=['B'],
+ )
+ @triton.jit
+ def prepare_position_ids_kernel(
+     y,
+     cu_seqlens,
+     B: tl.constexpr
+ ):
+     i_n = tl.program_id(0)
+     bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+     T = eos - bos
+
+     o = tl.arange(0, B)
+     for i in range(0, tl.cdiv(T, B) * B, B):
+         o_i = o + i
+         tl.store(y + bos + o_i, o_i, o_i < T)
+
+
+ @tensor_cache
+ def prepare_lens(cu_seqlens: torch.LongTensor) -> torch.LongTensor:
+     return cu_seqlens[1:] - cu_seqlens[:-1]
+
+
+ @tensor_cache
+ def prepare_lens_from_mask(mask: torch.BoolTensor) -> torch.LongTensor:
+     return mask.sum(dim=-1, dtype=torch.int32)
+
+
+ @tensor_cache
+ def prepare_cu_seqlens_from_mask(mask: torch.BoolTensor, out_dtype: torch.dtype = torch.int32) -> torch.LongTensor:
+     return F.pad(prepare_lens_from_mask(mask).cumsum(dim=0, dtype=out_dtype), (1, 0))
+
+
+ @tensor_cache
+ def prepare_position_ids(cu_seqlens: torch.LongTensor) -> torch.LongTensor:
+     return torch.cat([
+         torch.arange(n, dtype=cu_seqlens.dtype, device=cu_seqlens.device)
+         for n in prepare_lens(cu_seqlens).unbind()
+     ])
+
+
+ @tensor_cache
+ def prepare_sequence_ids(cu_seqlens: torch.LongTensor) -> torch.LongTensor:
+     return prepare_position_ids(cu_seqlens).eq(0).cumsum(0) - 1
+
+
+ @tensor_cache
+ def prepare_token_indices(cu_seqlens: torch.LongTensor) -> torch.LongTensor:
+     position_ids = prepare_position_ids(cu_seqlens)
+     return torch.stack([prepare_sequence_ids(cu_seqlens), position_ids], 1).to(cu_seqlens)
+
+
+ @tensor_cache
+ def prepare_chunk_indices(
+     cu_seqlens: torch.LongTensor,
+     chunk_size: int
+ ) -> torch.LongTensor:
+     indices = torch.cat([torch.arange(n) for n in triton.cdiv(prepare_lens(cu_seqlens), chunk_size).tolist()])
+     return torch.stack([indices.eq(0).cumsum(0) - 1, indices], 1).to(cu_seqlens)
+
+
+ @tensor_cache
+ def prepare_chunk_offsets(
+     cu_seqlens: torch.LongTensor,
+     chunk_size: int
+ ) -> torch.LongTensor:
+     return torch.cat([cu_seqlens.new_tensor([0]), triton.cdiv(prepare_lens(cu_seqlens), chunk_size)]).cumsum(-1)
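To make the varlen helpers concrete, here is a small worked example (the values follow directly from the definitions above):

```python
import torch

cu_seqlens = torch.tensor([0, 5, 12])  # two packed sequences of lengths 5 and 7
# prepare_lens(cu_seqlens)         -> tensor([5, 7])
# prepare_position_ids(cu_seqlens) -> tensor([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6])
# prepare_sequence_ids(cu_seqlens) -> tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
# prepare_chunk_indices(cu_seqlens, chunk_size=4) pairs (sequence id, chunk id):
#   tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
# prepare_chunk_offsets(cu_seqlens, chunk_size=4) -> tensor([0, 2, 4])
```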
fla3/ops/utils/logcumsumexp.py ADDED
@@ -0,0 +1,52 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ import triton
+ import triton.language as tl
+
+ from ...ops.utils.op import exp, log
+
+
+ @triton.autotune(
+     configs=[
+         triton.Config({'BT': BT}, num_warps=num_warps)
+         for BT in [16, 32, 64]
+         for num_warps in [2, 4, 8]
+     ],
+     key=['S']
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def logcumsumexp_fwd_kernel(
+     s,
+     z,
+     T,
+     S: tl.constexpr,
+     BT: tl.constexpr
+ ):
+     i_bh = tl.program_id(0)
+     o_i = tl.arange(0, BT)
+     m_s = tl.where(o_i[:, None] >= o_i[None, :], 1., 0.)
+
+     b_mp = tl.full([S,], float('-inf'), dtype=tl.float32)
+     b_zp = tl.zeros([S,], dtype=tl.float32)
+     for i_t in range(tl.cdiv(T, BT)):
+         p_s = tl.make_block_ptr(s + i_bh * T*S, (T, S), (S, 1), (i_t * BT, 0), (BT, S), (1, 0))
+         p_z = tl.make_block_ptr(z + i_bh * T*S, (T, S), (S, 1), (i_t * BT, 0), (BT, S), (1, 0))
+
+         # [BT, S]
+         b_s = tl.load(p_s, boundary_check=(0, 1)).to(tl.float32)
+         # [S,]
+         b_mc = tl.max(b_s, 0)
+         b_mc = tl.maximum(b_mp, b_mc)
+         b_zp = b_zp * exp(b_mp - b_mc)
+         # [BT, S]
+         b_s = exp(b_s - b_mc)
+         b_z = tl.dot(m_s, b_s, allow_tf32=False) + b_zp
+         # [S,]
+         b_zc = tl.max(b_z, 0)
+         b_mp = b_mc
+         b_zp = b_zc
+         # [BT, S]
+         # small eps to prevent underflows
+         b_z = log(tl.where(b_z != 0, b_z, 1e-20)) + b_mc
+         tl.store(p_z, b_z.to(p_z.dtype.element_ty), boundary_check=(0, 1))
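Semantically, the streaming kernel above computes a numerically stable cumulative logsumexp over the time axis; PyTorch offers a direct one-line reference to check against (for the flattened `[B*H, T, S]` layout the kernel assumes):

```python
import torch

s = torch.randn(8, 100, 16)          # [B*H, T, S]
ref = torch.logcumsumexp(s, dim=1)   # target of logcumsumexp_fwd_kernel, up to fp error
```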
fla3/ops/utils/matmul.py ADDED
@@ -0,0 +1,245 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ # code adapted from
+ # https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html
+
+ from typing import Optional
+
+ import torch
+ import triton
+ import triton.language as tl
+
+ from ...ops.utils.op import exp
+ from ...utils import input_guard
+
+
+ # `triton.jit`'ed functions can be auto-tuned by using the `triton.autotune` decorator, which consumes:
+ #   - A list of `triton.Config` objects that define different configurations of
+ #     meta-parameters (e.g., `BM`) and compilation options (e.g., `num_warps`) to try
+ #   - An auto-tuning *key* whose change in values will trigger evaluation of all the
+ #     provided configs
+ @triton.heuristics({
+     'HAS_ALPHA': lambda args: args['alpha'] is not None,
+     'HAS_BETA': lambda args: args['beta'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({'BM': 128, 'BK': 64, 'BN': 256, 'G': 4}, num_stages=3, num_warps=8),
+         triton.Config({'BM': 64, 'BK': 32, 'BN': 256, 'G': 4}, num_stages=4, num_warps=4),
+         triton.Config({'BM': 128, 'BK': 32, 'BN': 128, 'G': 4}, num_stages=4, num_warps=4),
+         triton.Config({'BM': 128, 'BK': 32, 'BN': 64, 'G': 4}, num_stages=4, num_warps=4),
+         triton.Config({'BM': 64, 'BK': 32, 'BN': 128, 'G': 4}, num_stages=4, num_warps=4),
+         triton.Config({'BM': 128, 'BK': 32, 'BN': 32, 'G': 4}, num_stages=4, num_warps=4),
+         triton.Config({'BM': 64, 'BK': 32, 'BN': 32, 'G': 4}, num_stages=5, num_warps=2),
+         triton.Config({'BM': 32, 'BK': 32, 'BN': 64, 'G': 4}, num_stages=5, num_warps=2),
+         # Good configs for fp8 inputs.
+         # triton.Config({'BM': 128, 'BK': 128, 'BN': 256, 'G': 4}, num_stages=3, num_warps=8),
+         # triton.Config({'BM': 256, 'BK': 128, 'BN': 128, 'G': 4}, num_stages=3, num_warps=8),
+         # triton.Config({'BM': 256, 'BK': 128, 'BN': 64, 'G': 4}, num_stages=4, num_warps=4),
+         # triton.Config({'BM': 64, 'BK': 128, 'BN': 256, 'G': 4}, num_stages=4, num_warps=4),
+         # triton.Config({'BM': 128, 'BK': 128, 'BN': 128, 'G': 4}, num_stages=4, num_warps=4),
+         # triton.Config({'BM': 128, 'BK': 64, 'BN': 64, 'G': 4}, num_stages=4, num_warps=4),
+         # triton.Config({'BM': 64, 'BK': 64, 'BN': 128, 'G': 4}, num_stages=4, num_warps=4),
+         # triton.Config({'BM': 128, 'BK': 64, 'BN': 32, 'G': 4}, num_stages=4, num_warps=4)
+     ],
+     key=['M', 'N', 'K']
+ )
+ @triton.jit
+ def matmul_kernel(
+     # Pointers to matrices
+     a,
+     b,
+     c,
+     input,
+     alpha,
+     beta,
+     # Matrix dimensions
+     M,
+     N,
+     K,
+     # The stride variables represent how much to increase the ptr by when moving by 1
+     # element in a particular dimension. E.g. `stride_am` is how much to increase `a`
+     # by to get the element one row down (A has M rows).
+     stride_ab, stride_am, stride_ak,  # a: batch, M, K
+     stride_bk, stride_bn,  # b: K, N
+     stride_cb, stride_cm, stride_cn,  # c: batch, M, N
+     # Meta-parameters
+     BM: tl.constexpr,
+     BK: tl.constexpr,
+     BN: tl.constexpr,
+     G: tl.constexpr,
+     ACTIVATION: tl.constexpr,
+     HAS_INPUT: tl.constexpr,
+     HAS_ALPHA: tl.constexpr,
+     HAS_BETA: tl.constexpr,
+     ALLOW_TF32: tl.constexpr,
+     X_DIM: tl.constexpr = 1,
+ ):
+     """Kernel for computing the matmul C = A x B.
+     A has shape (M, K), B has shape (K, N) and C has shape (M, N)
+     """
+     # -----------------------------------------------------------
+     # Map program ids `pid` to the block of C it should compute.
+     # This is done in a grouped ordering to promote L2 data reuse
+     # (see the `L2 Cache Optimizations` section of the Triton tutorial).
+     i_b, i_m, i_n = tl.program_id(0), tl.program_id(1), tl.program_id(2)
+
+     NM, NN = tl.num_programs(1), tl.num_programs(2)
+     i_m, i_n = tl.swizzle2d(i_m, i_n, NM, NN, G)
+
+     # ----------------------------------------------------------
+     # Create pointers for the first blocks of A and B.
+     # We will advance these pointers as we move in the K direction
+     # and accumulate:
+     # `p_a` is a block of [BM, BK] pointers
+     # `p_b` is a block of [BK, BN] pointers
+     # (see the `Pointer Arithmetic` section of the Triton tutorial)
+     a_batch_ptr = a + i_b * stride_ab
+     o_am = (i_m * BM + tl.arange(0, BM)) % M
+     o_bn = (i_n * BN + tl.arange(0, BN)) % N
+     o_k = tl.arange(0, BK)
+
+     p_a = a_batch_ptr + (o_am[:, None] * stride_am + o_k[None, :] * stride_ak)
+     p_b = b + (o_k[:, None] * stride_bk + o_bn[None, :] * stride_bn)
+
+     b_acc = tl.zeros((BM, BN), dtype=tl.float32)
+     for k in range(0, tl.cdiv(K, BK)):
+         # Load the next block of A and B, generating a mask by checking the K dimension.
+         # If it is out of bounds, set it to 0.
+         b_a = tl.load(p_a, mask=o_k[None, :] < K - k * BK, other=0.0)
111
+ b_b = tl.load(p_b, mask=o_k[:, None] < K - k * BK, other=0.0)
112
+ # We accumulate along the K dimension.
113
+ b_acc = tl.dot(b_a, b_b, acc=b_acc, allow_tf32=ALLOW_TF32)
114
+ # Advance the ptrs to the next K block.
115
+ p_a += BK * stride_ak
116
+ p_b += BK * stride_bk
117
+
118
+ o_cm = i_m * BM + tl.arange(0, BM)
119
+ o_cn = i_n * BN + tl.arange(0, BN)
120
+ mask = (o_cm[:, None] < M) & (o_cn[None, :] < N)
121
+
122
+ b_c = b_acc
123
+ # You can fuse arbitrary activation functions here
124
+ # while the b_acc is still in FP32!
125
+ if ACTIVATION == "leaky_relu":
126
+ b_c = leaky_relu(b_c)
127
+ elif ACTIVATION == "relu":
128
+ b_c = relu(b_c)
129
+ elif ACTIVATION == "sigmoid":
130
+ b_c = sigmoid(b_c)
131
+ elif ACTIVATION == "tanh":
132
+ b_c = tanh(b_c)
133
+
134
+ if HAS_ALPHA:
135
+ b_c *= tl.load(alpha)
136
+
137
+ if HAS_INPUT:
138
+ p_i = input + (stride_cm * o_cm[:, None] if X_DIM == 2 else 0) + stride_cn * o_cn[None, :]
139
+ mask_p = (o_cn[None, :] < N) if X_DIM == 1 else mask
140
+ b_i = tl.load(p_i, mask=mask_p, other=0.0).to(tl.float32)
141
+ if HAS_BETA:
142
+ b_i *= tl.load(beta)
143
+ b_c += b_i
144
+
145
+ # -----------------------------------------------------------
146
+ # Write back the block of the output matrix C with masks.
147
+ c_batch_ptr = c + i_b * stride_cb
148
+ p_c = c_batch_ptr + stride_cm * o_cm[:, None] + stride_cn * o_cn[None, :]
149
+ tl.store(p_c, b_c.to(c.dtype.element_ty), mask=mask)
150
+
151
+
152
+ # We can fuse `leaky_relu` by providing it as an `ACTIVATION` meta-parameter in `matmul_kernel`.
153
+ @triton.jit
154
+ def leaky_relu(x):
155
+ return tl.where(x >= 0, x, 0.01 * x)
156
+
157
+
158
+ @triton.jit
159
+ def sigmoid(x):
160
+ # σ(x) = 1 / (1 + exp(-x))
161
+ return 1.0 / (1.0 + exp(-x))
162
+
163
+
164
+ @triton.jit
165
+ def tanh(x):
166
+ # tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
167
+ # 2 * sigmoid(2x) - 1
168
+ return (exp(x) - exp(-x)) / (exp(x) + exp(-x))
169
+
170
+
171
+ @triton.jit
172
+ def relu(x):
173
+ # ReLU(x) = max(0, x)
174
+ return tl.maximum(x, 0.0)
175
+
176
+
177
+ @input_guard
178
+ def matmul(a, b, activation=''):
179
+ assert a.dim() in [2, 3], "a must be 2D or 3D"
180
+ assert b.dim() == 2, "b must be 2D"
181
+ assert a.shape[-1] == b.shape[0], f"Incompatible dimensions: A {a.shape}, B {b.shape}"
182
+
183
+ if a.dim() == 2:
184
+ a_dim = 2
185
+ a = a.unsqueeze(0).contiguous() # (1, M, K)
186
+ else:
187
+ a_dim = 3
188
+ allow_tf32 = False if a.dtype == torch.float32 else True
189
+
190
+ B, M, K = a.shape[0], a.shape[1], a.shape[2]
191
+ K_b, N = b.shape
192
+ assert K == K_b, f"Incompatible K dimension: A {K} vs B {K_b}"
193
+ c = a.new_empty(B, M, N)
194
+
195
+ def grid(meta): return (B, triton.cdiv(M, meta['BM']), triton.cdiv(N, meta['BN']))
196
+ matmul_kernel[grid](
197
+ a, b, c, None, None, None,
198
+ M, N, K,
199
+ a.stride(0), a.stride(1), a.stride(2), # stride_ab, stride_am, stride_ak
200
+ b.stride(0), b.stride(1), # stride_bk, stride_bn (b.dim() == 2)
201
+ c.stride(0), c.stride(1), c.stride(2), # stride_cb, stride_cm, stride_cn
202
+ ACTIVATION=activation,
203
+ ALLOW_TF32=allow_tf32,
204
+ HAS_INPUT=False,
205
+ )
206
+ return c.squeeze(0) if a_dim == 2 else c
207
+
208
+
209
+ @input_guard
210
+ def addmm(
211
+ x: torch.Tensor,
212
+ a: torch.Tensor,
213
+ b: torch.Tensor,
214
+ alpha: Optional[float] = None,
215
+ beta: Optional[float] = None,
216
+ ) -> torch.Tensor:
217
+ assert a.dim() in [2, 3], "a must be 2D or 3D"
218
+ assert b.dim() == 2, "b must be 2D"
219
+ assert a.shape[-1] == b.shape[0], f"Incompatible dimensions: A {a.shape}, B {b.shape}"
220
+
221
+ if a.dim() == 2:
222
+ a_dim = 2
223
+ a = a.unsqueeze(0).contiguous() # (1, M, K)
224
+ else:
225
+ a_dim = 3
226
+ allow_tf32 = False if a.dtype == torch.float32 else True
227
+
228
+ B, M, K = a.shape[0], a.shape[1], a.shape[2]
229
+ K_b, N = b.shape
230
+ assert K == K_b, f"Incompatible K dimension: A {K} vs B {K_b}"
231
+ c = a.new_empty(B, M, N)
232
+
233
+ def grid(meta): return (B, triton.cdiv(M, meta['BM']), triton.cdiv(N, meta['BN']))
234
+ matmul_kernel[grid](
235
+ a, b, c, x, alpha, beta,
236
+ M, N, K,
237
+ a.stride(0), a.stride(1), a.stride(2), # stride_ab, stride_am, stride_ak
238
+ b.stride(0), b.stride(1), # stride_bk, stride_bn (b.dim() == 2)
239
+ c.stride(0), c.stride(1), c.stride(2), # stride_cb, stride_cm, stride_cn
240
+ ACTIVATION=None,
241
+ ALLOW_TF32=allow_tf32,
242
+ HAS_INPUT=True,
243
+ X_DIM=x.dim(),
244
+ )
245
+ return c.squeeze(0) if a_dim == 2 else c
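As a quick sanity check, the two wrappers line up with their PyTorch counterparts; a minimal sketch, assuming a CUDA device and that the module is importable as `fla3.ops.utils.matmul`:

```python
import torch

from fla3.ops.utils.matmul import addmm, matmul

a = torch.randn(64, 32, device='cuda', dtype=torch.float16)
b = torch.randn(32, 48, device='cuda', dtype=torch.float16)
x = torch.randn(48, device='cuda', dtype=torch.float16)

# C = A @ B, with the optional fused activation left disabled
torch.testing.assert_close(matmul(a, b), a @ b, rtol=1e-2, atol=1e-2)
# C = x + A @ B (alpha/beta default to None, i.e. a plain residual add)
torch.testing.assert_close(addmm(x, a, b), torch.addmm(x, a, b), rtol=1e-2, atol=1e-2)
```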
fla3/ops/utils/op.py ADDED
@@ -0,0 +1,39 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2024, Songlin Yang, Yu Zhang
+
+ import os
+
+ import triton
+ import triton.language as tl
+ import triton.language.extra.libdevice as tldevice
+
+ from ...utils import is_gather_supported
+
+ if os.environ.get('FLA_USE_FAST_OPS', '0') == '1':
+     div = tldevice.fast_dividef
+     exp = tldevice.fast_expf
+     log = tldevice.fast_logf
+     log2 = tldevice.fast_log2f
+ else:
+     @triton.jit
+     def div_normal(x, y):
+         return x / y
+     div = div_normal
+     exp = tl.exp
+     log = tl.log
+     log2 = tl.log2
+
+
+ @triton.jit
+ def safe_exp(x):
+     # clamp positive inputs to -inf so the result never exceeds 1
+     return exp(tl.where(x <= 0, x, float('-inf')))
+
+
+ if not is_gather_supported:
+     @triton.jit
+     def gather(src, index, axis, _builder=None):
+         # This is a fallback implementation used when tl.gather is not supported.
+         # In order to pass the Triton compiler, there is no actual gather operation.
+         return src
+ else:
+     gather = tl.gather
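The fast-math switch is read once at import time, so it has to be set before the module is first imported; a minimal sketch, assuming the package is importable as `fla3`:

```python
import os

# opt into the approximate libdevice routines (fast_expf, fast_logf, ...)
# before the module binds its `exp`/`log`/`log2`/`div` aliases
os.environ['FLA_USE_FAST_OPS'] = '1'

from fla3.ops.utils.op import exp, log  # noqa: E402
```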
fla3/ops/utils/pack.py ADDED
@@ -0,0 +1,208 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ # Code adapted from https://github.com/mayank31398/cute-kernels
+
+ from typing import Optional
+
+ import torch
+ import triton
+ import triton.language as tl
+
+ from ...ops.utils.index import prepare_lens
+ from ...utils import input_guard
+
+
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=num_warps)
+         for num_warps in [4, 8, 16, 32]
+     ],
+     key=['D', 'PADDING_SIDE', 'PACK']
+ )
+ @triton.jit
+ def packunpack_sequence_kernel(
+     x,
+     y,
+     cu_seqlens,
+     S,
+     D,
+     BD: tl.constexpr,
+     PADDING_SIDE: tl.constexpr,
+     PACK: tl.constexpr,
+ ):
+     i_d, i_s, i_b = tl.program_id(0), tl.program_id(1), tl.program_id(2)
+     bos, eos = tl.load(cu_seqlens + i_b), tl.load(cu_seqlens + i_b + 1)
+
+     T = eos - bos
+     if PADDING_SIDE == 'left':
+         NP = S - T
+         if i_s < NP:
+             return
+         i_t = bos + (i_s - NP)
+     else:
+         if i_s >= T:
+             return
+         i_t = bos + i_s
+
+     o_d = i_d * BD + tl.arange(0, BD)
+     mask = o_d < D
+
+     if PACK:
+         b_x = tl.load(x + (i_b * S + i_s) * D + o_d, mask=mask)
+         tl.store(y + i_t * D + o_d, b_x, mask=mask)
+     else:
+         b_x = tl.load(x + i_t * D + o_d, mask=mask)
+         tl.store(y + (i_b * S + i_s) * D + o_d, b_x, mask=mask)
+
+
+ def pack_sequence_fwdbwd(
+     x: torch.Tensor,
+     cu_seqlens: torch.Tensor,
+     padding_side: str,
+ ) -> torch.Tensor:
+     B, S = x.shape[:2]
+     D = x.numel() // (B * S)
+     BD = min(triton.next_power_of_2(D), 4096)
+     ND = triton.cdiv(D, BD)
+
+     y = torch.empty(cu_seqlens[-1].item(), *x.shape[2:], device=x.device, dtype=x.dtype)
+     packunpack_sequence_kernel[ND, S, B](
+         x=x,
+         y=y,
+         cu_seqlens=cu_seqlens,
+         S=S,
+         D=D,
+         BD=BD,
+         PADDING_SIDE=padding_side,
+         PACK=True,
+     )
+     return y
+
+
+ def unpack_sequence_fwdbwd(
+     x: torch.Tensor,
+     cu_seqlens: torch.Tensor,
+     padding_side: str,
+     desired_shape: torch.Size,
+ ) -> torch.Tensor:
+     if desired_shape is None:
+         desired_shape = (len(cu_seqlens) - 1, prepare_lens(cu_seqlens).max().item(), *x.shape[1:])
+     y = torch.zeros(desired_shape, device=x.device, dtype=x.dtype)
+     B, S = y.shape[:2]
+     D = y.numel() // (B * S)
+     BD = min(triton.next_power_of_2(D), 4096)
+     ND = triton.cdiv(D, BD)
+
+     packunpack_sequence_kernel[ND, S, B](
+         x=x,
+         y=y,
+         cu_seqlens=cu_seqlens,
+         S=S,
+         D=D,
+         BD=BD,
+         PADDING_SIDE=padding_side,
+         PACK=False,
+     )
+     return y
+
+
+ class PackSequenceFunction(torch.autograd.Function):
+
+     @staticmethod
+     @input_guard
+     def forward(
+         ctx,
+         x: torch.Tensor,
+         cu_seqlens: torch.Tensor,
+         padding_side: str,
+     ) -> torch.Tensor:
+         assert padding_side in ['left', 'right']
+         assert x.ndim >= 2
+
+         ctx.cu_seqlens = cu_seqlens
+         ctx.padding_side = padding_side
+         ctx.desired_shape = x.shape
+
+         y = pack_sequence_fwdbwd(
+             x=x,
+             cu_seqlens=cu_seqlens,
+             padding_side=padding_side,
+         )
+         return y
+
+     @staticmethod
+     @input_guard
+     def backward(ctx, dy: torch.Tensor) -> tuple[torch.Tensor | None]:
+         dx = unpack_sequence_fwdbwd(
+             x=dy,
+             cu_seqlens=ctx.cu_seqlens,
+             padding_side=ctx.padding_side,
+             desired_shape=ctx.desired_shape,
+         )
+         return dx, *[None] * 10
+
+
+ class UnpackSequenceFunction(torch.autograd.Function):
+
+     @staticmethod
+     @input_guard
+     def forward(
+         ctx,
+         x: torch.Tensor,
+         cu_seqlens: torch.Tensor,
+         padding_side: str,
+         desired_shape: Optional[torch.Size] = None,
+     ) -> torch.Tensor:
+         assert padding_side in ['left', 'right']
+         assert x.ndim >= 2
+         if desired_shape is not None:
+             assert desired_shape[0] == cu_seqlens.shape[0] - 1
+             assert desired_shape[2:] == x.shape[1:]
+
+         ctx.cu_seqlens = cu_seqlens
+         ctx.padding_side = padding_side
+
+         y = unpack_sequence_fwdbwd(
+             x=x,
+             cu_seqlens=cu_seqlens,
+             padding_side=padding_side,
+             desired_shape=desired_shape,
+         )
+         return y
+
+     @staticmethod
+     @input_guard
+     def backward(ctx, dy: torch.Tensor) -> tuple[torch.Tensor | None]:
+         dx = pack_sequence_fwdbwd(
+             x=dy,
+             cu_seqlens=ctx.cu_seqlens,
+             padding_side=ctx.padding_side,
+         )
+         return dx, None, None, None
+
+
+ def pack_sequence(
+     x: torch.Tensor,
+     cu_seqlens: torch.Tensor,
+     padding_side: str = 'left'
+ ) -> torch.Tensor:
+     return PackSequenceFunction.apply(
+         x,
+         cu_seqlens,
+         padding_side,
+     )
+
+
+ def unpack_sequence(
+     x: torch.Tensor,
+     cu_seqlens: torch.Tensor,
+     padding_side: str = 'left',
+     desired_shape: Optional[torch.Size] = None,
+ ) -> torch.Tensor:
+     return UnpackSequenceFunction.apply(
+         x,
+         cu_seqlens,
+         padding_side,
+         desired_shape,
+     )
@@ -0,0 +1,207 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
3
+
4
+ from typing import Optional, Tuple
5
+
6
+ import torch
7
+ import triton
8
+ import triton.language as tl
9
+
10
+ from ...ops.utils.index import prepare_chunk_indices
11
+ from ...utils import autocast_custom_bwd, autocast_custom_fwd, input_guard
12
+
13
+
14
+ @triton.heuristics({
15
+ 'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
16
+ })
17
+ @triton.autotune(
18
+ configs=[
19
+ triton.Config({'BD': BD}, num_warps=num_warps)
20
+ for BD in [16, 32, 64, 128]
21
+ for num_warps in [1, 2, 4, 8]
22
+ ],
23
+ key=['BT']
24
+ )
25
+ @triton.jit(do_not_specialize=['T'])
26
+ def mean_pooling_fwd_kernel(
27
+ x,
28
+ o,
29
+ cu_seqlens,
30
+ chunk_indices,
31
+ T,
32
+ H: tl.constexpr,
33
+ D: tl.constexpr,
34
+ BT: tl.constexpr,
35
+ BD: tl.constexpr,
36
+ IS_VARLEN: tl.constexpr
37
+ ):
38
+ i_d, i_t, i_bh = tl.program_id(0), tl.program_id(1), tl.program_id(2)
39
+ i_b, i_h = i_bh // H, i_bh % H
40
+ if IS_VARLEN:
41
+ i_tg = i_t
42
+ i_n, i_t = tl.load(chunk_indices + i_t * 2).to(tl.int32), tl.load(chunk_indices + i_t * 2 + 1).to(tl.int32)
43
+ bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
44
+ T = eos - bos
45
+ NT = tl.cdiv(T, BT)
46
+ else:
47
+ NT = tl.cdiv(T, BT)
48
+ i_tg = i_b * NT + i_t
49
+ bos, eos = i_b * T, i_b * T + T
50
+
51
+ p_x = tl.make_block_ptr(x + (bos * H + i_h) * D, (T, D), (H*D, 1), (i_t * BT, i_d * BD), (BT, BD), (1, 0))
52
+ p_o = tl.make_block_ptr(o + (i_tg * H + i_h) * D, (D,), (1,), (i_d * BD,), (BD,), (0,))
53
+ # [BT, BD]
54
+ b_x = tl.load(p_x, boundary_check=(0, 1)).to(tl.float32)
55
+ # [BD]
56
+ b_o = tl.sum(b_x, axis=0) / min(BT, T - i_t * BT)
57
+ tl.store(p_o, b_o.to(p_o.dtype.element_ty), boundary_check=(0,))
58
+
59
+
60
+ @triton.heuristics({
61
+ 'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
62
+ })
63
+ @triton.autotune(
64
+ configs=[
65
+ triton.Config({'BD': BD}, num_warps=num_warps)
66
+ for BD in [16, 32, 64, 128]
67
+ for num_warps in [1, 2, 4, 8]
68
+ ],
69
+ key=['BT']
70
+ )
71
+ @triton.jit(do_not_specialize=['T'])
72
+ def mean_pooling_bwd_kernel(
73
+ do,
74
+ dx,
75
+ cu_seqlens,
76
+ chunk_indices,
77
+ T,
78
+ H: tl.constexpr,
79
+ D: tl.constexpr,
80
+ BT: tl.constexpr,
81
+ BD: tl.constexpr,
82
+ IS_VARLEN: tl.constexpr
83
+ ):
84
+ i_d, i_t, i_bh = tl.program_id(0), tl.program_id(1), tl.program_id(2)
85
+ i_b, i_h = i_bh // H, i_bh % H
86
+ if IS_VARLEN:
87
+ i_tg = i_t
88
+ i_n, i_t = tl.load(chunk_indices + i_t * 2).to(tl.int32), tl.load(chunk_indices + i_t * 2 + 1).to(tl.int32)
89
+ bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
90
+ T = eos - bos
91
+ NT = tl.cdiv(T, BT)
92
+ else:
93
+ NT = tl.cdiv(T, BT)
94
+ i_tg = i_b * NT + i_t
95
+ bos, eos = i_b * T, i_b * T + T
96
+
97
+ p_dx = tl.make_block_ptr(dx + (bos * H + i_h) * D, (T, D), (H*D, 1), (i_t * BT, i_d * BD), (BT, BD), (1, 0))
98
+ p_do = tl.make_block_ptr(do + (i_tg * H + i_h) * D, (D,), (1,), (i_d * BD,), (BD,), (0,))
99
+ # [BD]
100
+ b_do = tl.load(p_do, boundary_check=(0,)).to(tl.float32)
101
+ # [BT, BD]
102
+ b_dx = b_do / tl.full((BT,), min(BT, T - i_t * BT), dtype=tl.float32)[:, None]
103
+ tl.store(p_dx, b_dx.to(p_dx.dtype.element_ty), boundary_check=(0, 1))
104
+
105
+
106
+ def mean_pooling_fwd(
107
+ x: torch.Tensor,
108
+ chunk_size: int,
109
+ cu_seqlens: Optional[torch.LongTensor] = None
110
+ ) -> torch.Tensor:
111
+ B, T, H, D = x.shape
112
+ BT = chunk_size
113
+ chunk_indices = prepare_chunk_indices(cu_seqlens, chunk_size) if cu_seqlens is not None else None
114
+ NT = triton.cdiv(T, BT) if cu_seqlens is None else len(chunk_indices)
115
+
116
+ o = x.new_empty(B, NT, H, D)
117
+ def grid(meta): return (triton.cdiv(D, meta['BD']), NT, B * H)
118
+ mean_pooling_fwd_kernel[grid](
119
+ x,
120
+ o,
121
+ cu_seqlens,
122
+ chunk_indices,
123
+ T=T,
124
+ H=H,
125
+ D=D,
126
+ BT=BT,
127
+ )
128
+ return o
129
+
130
+
131
+ def mean_pooling_bwd(
132
+ do: torch.Tensor,
133
+ batch_size: int,
134
+ seq_len: int,
135
+ chunk_size: int,
136
+ cu_seqlens: Optional[torch.LongTensor] = None
137
+ ) -> torch.Tensor:
138
+ B, T, H, D = batch_size, seq_len, *do.shape[-2:]
139
+ BT = chunk_size
140
+ chunk_indices = prepare_chunk_indices(cu_seqlens, chunk_size) if cu_seqlens is not None else None
141
+ NT = triton.cdiv(T, BT) if cu_seqlens is None else len(chunk_indices)
142
+
143
+ dx = do.new_empty(B, T, H, D)
144
+ def grid(meta): return (triton.cdiv(D, meta['BD']), NT, B * H)
145
+ mean_pooling_bwd_kernel[grid](
146
+ do,
147
+ dx,
148
+ cu_seqlens,
149
+ chunk_indices,
150
+ T=T,
151
+ H=H,
152
+ D=D,
153
+ BT=BT,
154
+ )
155
+ return dx
156
+
157
+
158
+ class MeanPoolingFunction(torch.autograd.Function):
159
+
160
+ @staticmethod
161
+ @input_guard
162
+ @autocast_custom_fwd
163
+ def forward(
164
+ ctx,
165
+ x: torch.Tensor,
166
+ chunk_size: int,
167
+ cu_seqlens: Optional[torch.LongTensor] = None
168
+ ) -> torch.Tensor:
169
+ o = mean_pooling_fwd(x, chunk_size, cu_seqlens)
170
+ ctx.batch_size = x.shape[0]
171
+ ctx.seq_len = x.shape[1]
172
+ ctx.chunk_size = chunk_size
173
+ ctx.cu_seqlens = cu_seqlens
174
+ return o
175
+
176
+ @staticmethod
177
+ @input_guard
178
+ @autocast_custom_bwd
179
+ def backward(
180
+ ctx, do
181
+ ) -> Tuple[torch.Tensor, None, None]:
182
+ batch_size = ctx.batch_size
183
+ seq_len = ctx.seq_len
184
+ chunk_size = ctx.chunk_size
185
+ cu_seqlens = ctx.cu_seqlens
186
+ dx = mean_pooling_bwd(do, batch_size, seq_len, chunk_size, cu_seqlens)
187
+ return dx, None, None
188
+
189
+
190
+ def mean_pooling(
191
+ x: torch.Tensor,
192
+ chunk_size: int,
193
+ cu_seqlens: Optional[torch.LongTensor] = None,
194
+ head_first: bool = False
195
+ ) -> torch.Tensor:
196
+ if head_first:
197
+ x = x.transpose(1, 2)
198
+ if cu_seqlens is not None:
199
+ if x.shape[0] != 1:
200
+ raise ValueError(
201
+ f"The batch size is expected to be 1 rather than {x.shape[0]} when using `cu_seqlens`."
202
+ f"Please ..tten variable-length inputs before processing."
203
+ )
204
+ o = MeanPoolingFunction.apply(x, chunk_size, cu_seqlens)
205
+ if head_first:
206
+ o = o.transpose(1, 2)
207
+ return o
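For the fixed-length case, `mean_pooling` reduces each chunk of `chunk_size` timesteps to its mean. A reference check in plain PyTorch (a sketch, assuming `T` divides evenly by the chunk size):

```python
import torch

from fla3.ops.utils.pooling import mean_pooling

B, T, H, D, chunk_size = 2, 128, 4, 64, 32
x = torch.randn(B, T, H, D, device='cuda')

o = mean_pooling(x, chunk_size)  # [B, T // chunk_size, H, D]
ref = x.view(B, T // chunk_size, chunk_size, H, D).mean(2)
torch.testing.assert_close(o, ref, rtol=1e-4, atol=1e-4)
```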
fla3/ops/utils/softmax.py ADDED
@@ -0,0 +1,111 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2024, Songlin Yang, Yu Zhang
+
+ from typing import Optional
+
+ import torch
+ import triton
+ import triton.language as tl
+
+ from ...ops.utils.op import exp
+
+
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=1),
+         triton.Config({}, num_warps=2),
+         triton.Config({}, num_warps=4),
+         triton.Config({}, num_warps=8),
+         triton.Config({}, num_warps=16),
+         triton.Config({}, num_warps=32)
+     ],
+     key=['D']
+ )
+ @triton.jit
+ def softmax_fwd_kernel(
+     x,
+     p,
+     D: tl.constexpr,
+     B: tl.constexpr
+ ):
+     i_n = tl.program_id(0)
+     o_d = tl.arange(0, B)
+     m_d = o_d < D
+
+     b_x = tl.load(x + i_n * D + o_d, mask=m_d, other=-float('inf'))
+     b_m = tl.max(b_x, 0)
+     b_x = exp(b_x - b_m)
+     b_p = b_x / tl.sum(b_x, 0)
+
+     tl.store(p + i_n * D + o_d, b_p.to(p.dtype.element_ty), mask=m_d)
+
+
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=1),
+         triton.Config({}, num_warps=2),
+         triton.Config({}, num_warps=4),
+         triton.Config({}, num_warps=8),
+         triton.Config({}, num_warps=16),
+         triton.Config({}, num_warps=32)
+     ],
+     key=['D']
+ )
+ @triton.jit
+ def softmax_bwd_kernel(
+     p,
+     dp,
+     ds,
+     D: tl.constexpr,
+     B: tl.constexpr
+ ):
+     i_n = tl.program_id(0)
+     o_d = tl.arange(0, B)
+     m_d = o_d < D
+
+     b_p = tl.load(p + i_n * D + o_d, mask=m_d, other=0.)
+     b_dp = tl.load(dp + i_n * D + o_d, mask=m_d, other=0.)
+     b_pp = tl.sum(b_p * b_dp, 0)
+     b_ds = b_p * b_dp - b_p * b_pp
+     tl.store(ds + i_n * D + o_d, b_ds.to(ds.dtype.element_ty), mask=m_d)
+
+
+ def softmax_fwd(
+     x: torch.Tensor,
+     dtype: Optional[torch.dtype] = torch.float
+ ) -> torch.Tensor:
+     shape = x.shape
+     x = x.view(-1, x.shape[-1])
+
+     N, D = x.shape
+     B = triton.next_power_of_2(D)
+
+     p = torch.empty_like(x, dtype=dtype)
+     softmax_fwd_kernel[(N,)](
+         x=x,
+         p=p,
+         D=D,
+         B=B
+     )
+     return p.view(*shape)
+
+
+ def softmax_bwd(
+     p: torch.Tensor,
+     dp: torch.Tensor,
+     dtype: Optional[torch.dtype] = torch.float
+ ) -> torch.Tensor:
+     shape = p.shape
+     p = p.view(-1, p.shape[-1])
+     ds = torch.empty_like(p, dtype=dtype)
+
+     N, D = p.shape
+     B = triton.next_power_of_2(D)
+     softmax_bwd_kernel[(N,)](
+         p=p,
+         dp=dp,
+         ds=ds,
+         D=D,
+         B=B
+     )
+     return ds.view(*shape)
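Both kernels pad the feature dimension up to the next power of two and mask the tail, so a non-power-of-two `D` works. A quick check against the PyTorch reference:

```python
import torch

from fla3.ops.utils.softmax import softmax_bwd, softmax_fwd

x = torch.randn(8, 100, device='cuda')  # D = 100 is deliberately not a power of 2
p = softmax_fwd(x)
torch.testing.assert_close(p, torch.softmax(x.float(), dim=-1))

# softmax_bwd computes ds = p * (dp - sum(p * dp, dim=-1, keepdim=True))
dp = torch.randn_like(p)
ds_ref = p * (dp - (p * dp).sum(-1, keepdim=True))
torch.testing.assert_close(softmax_bwd(p, dp), ds_ref)
```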
fla3/ops/utils/solve_tril.py ADDED
@@ -0,0 +1,276 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2023-2025, Songlin Yang, Yu Zhang
+
+ from typing import Optional
+
+ import torch
+ import triton
+ import triton.language as tl
+
+ from ...ops.utils.index import prepare_chunk_indices
+ from ...utils import input_guard
+
+
+ @triton.heuristics({
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=num_warps, num_stages=num_stages)
+         for num_warps in [1, 2, 4, 8]
+         for num_stages in [2, 3, 4, 5]
+     ],
+     key=['BT'],
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def solve_tril_16x16_kernel(
+     A,
+     Ad,
+     cu_seqlens,
+     chunk_indices,
+     T,
+     H: tl.constexpr,
+     BT: tl.constexpr,
+     IS_VARLEN: tl.constexpr,
+ ):
+     i_t, i_bh = tl.program_id(0), tl.program_id(1)
+     i_b, i_h = i_bh // H, i_bh % H
+     if IS_VARLEN:
+         i_n, i_t = tl.load(chunk_indices + i_t * 2).to(tl.int32), tl.load(chunk_indices + i_t * 2 + 1).to(tl.int32)
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+         T = eos - bos
+     else:
+         bos, eos = i_b * T, i_b * T + T
+
+     A = A + (bos*H + i_h) * BT
+     Ad = Ad + (bos*H + i_h) * 16
+
+     offset = (i_t * 16) % BT
+     p_A = tl.make_block_ptr(A, (T, BT), (H*BT, 1), (i_t * 16, offset), (16, 16), (1, 0))
+     p_Ai = tl.make_block_ptr(Ad, (T, 16), (H*16, 1), (i_t * 16, 0), (16, 16), (1, 0))
+     b_A = tl.load(p_A, boundary_check=(0, 1)).to(tl.float32)
+     b_A = -tl.where(tl.arange(0, 16)[:, None] > tl.arange(0, 16)[None, :], b_A, 0)
+
+     o_i = tl.arange(0, 16)
+     for i in range(1, min(16, T-i_t*16)):
+         b_a = -tl.load(A + (i_t * 16 + i) * H*BT + o_i + offset)
+         b_a = b_a + tl.sum(b_a[:, None] * b_A, 0)
+         mask = o_i == i
+         b_A = tl.where(mask[:, None], b_a, b_A)
+     b_A += o_i[:, None] == o_i[None, :]
+     tl.store(p_Ai, b_A.to(p_Ai.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+
+
+ @triton.heuristics({
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=num_warps, num_stages=num_stages)
+         for num_warps in [1, 2, 4, 8]
+         for num_stages in [2, 3, 4, 5]
+     ],
+     key=['H', 'BT', 'IS_VARLEN'],
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def merge_16x16_to_32x32_inverse_kernel(
+     A,
+     Ad,
+     Ai,
+     cu_seqlens,
+     chunk_indices,
+     T,
+     H: tl.constexpr,
+     BT: tl.constexpr,
+     IS_VARLEN: tl.constexpr
+ ):
+     i_t, i_bh = tl.program_id(0), tl.program_id(1)
+     i_b, i_h = i_bh // H, i_bh % H
+     if IS_VARLEN:
+         i_n, i_t = tl.load(chunk_indices + i_t * 2).to(tl.int32), tl.load(chunk_indices + i_t * 2 + 1).to(tl.int32)
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+         T = eos - bos
+     else:
+         bos, eos = i_b * T, i_b * T + T
+
+     A += (bos*H + i_h) * 32
+     Ad += (bos*H + i_h) * 16
+     Ai += (bos*H + i_h) * 32
+
+     p_A_21 = tl.make_block_ptr(A, (T, 32), (H*32, 1), (i_t * 32 + 16, 0), (16, 16), (1, 0))
+     p_Ad_11 = tl.make_block_ptr(Ad, (T, 16), (H*16, 1), (i_t * 32, 0), (16, 16), (1, 0))
+     p_Ad_22 = tl.make_block_ptr(Ad, (T, 16), (H*16, 1), (i_t * 32 + 16, 0), (16, 16), (1, 0))
+     p_Ai_11 = tl.make_block_ptr(Ai, (T, 32), (H*32, 1), (i_t * 32, 0), (16, 16), (1, 0))
+     p_Ai_22 = tl.make_block_ptr(Ai, (T, 32), (H*32, 1), (i_t * 32 + 16, 16), (16, 16), (1, 0))
+     p_Ai_21 = tl.make_block_ptr(Ai, (T, 32), (H*32, 1), (i_t * 32 + 16, 0), (16, 16), (1, 0))
+
+     A_21 = tl.load(p_A_21, boundary_check=(0, 1)).to(tl.float32)
+     Ai_11 = tl.load(p_Ad_11, boundary_check=(0, 1)).to(tl.float32)
+     Ai_22 = tl.load(p_Ad_22, boundary_check=(0, 1)).to(tl.float32)
+     Ai_21 = -tl.dot(tl.dot(Ai_22, A_21, input_precision='ieee'), Ai_11, input_precision='ieee')
+     tl.store(p_Ai_11, Ai_11.to(p_Ai_11.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_22, Ai_22.to(p_Ai_22.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_21, Ai_21.to(p_Ai_21.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+
+
+ @triton.heuristics({
+     'IS_VARLEN': lambda args: args['cu_seqlens'] is not None
+ })
+ @triton.autotune(
+     configs=[
+         triton.Config({}, num_warps=num_warps, num_stages=num_stages)
+         for num_warps in [2, 4, 8]
+         for num_stages in [2, 3, 4, 5]
+     ],
+     key=['H', 'BT', 'IS_VARLEN'],
+ )
+ @triton.jit(do_not_specialize=['T'])
+ def merge_16x16_to_64x64_inverse_kernel(
+     A,
+     Ad,
+     Ai,
+     cu_seqlens,
+     chunk_indices,
+     T,
+     H: tl.constexpr,
+     BT: tl.constexpr,
+     IS_VARLEN: tl.constexpr
+ ):
+     i_t, i_bh = tl.program_id(0), tl.program_id(1)
+     i_b, i_h = i_bh // H, i_bh % H
+     if IS_VARLEN:
+         i_n, i_t = tl.load(chunk_indices + i_t * 2).to(tl.int32), tl.load(chunk_indices + i_t * 2 + 1).to(tl.int32)
+         bos, eos = tl.load(cu_seqlens + i_n).to(tl.int32), tl.load(cu_seqlens + i_n + 1).to(tl.int32)
+         T = eos - bos
+     else:
+         bos, eos = i_b * T, i_b * T + T
+
+     A += (bos*H + i_h) * 64
+     Ad += (bos*H + i_h) * 16
+     Ai += (bos*H + i_h) * 64
+
+     p_A_21 = tl.make_block_ptr(A, (T, 64), (H*64, 1), (i_t * 64 + 16, 0), (16, 16), (1, 0))
+     p_A_32 = tl.make_block_ptr(A, (T, 64), (H*64, 1), (i_t * 64 + 32, 16), (16, 16), (1, 0))
+     p_A_31 = tl.make_block_ptr(A, (T, 64), (H*64, 1), (i_t * 64 + 32, 0), (16, 16), (1, 0))
+     p_A_43 = tl.make_block_ptr(A, (T, 64), (H*64, 1), (i_t * 64 + 48, 32), (16, 16), (1, 0))
+     p_A_42 = tl.make_block_ptr(A, (T, 64), (H*64, 1), (i_t * 64 + 48, 16), (16, 16), (1, 0))
+     p_A_41 = tl.make_block_ptr(A, (T, 64), (H*64, 1), (i_t * 64 + 48, 0), (16, 16), (1, 0))
+     p_Ad_11 = tl.make_block_ptr(Ad, (T, 16), (H*16, 1), (i_t * 64, 0), (16, 16), (1, 0))
+     p_Ad_22 = tl.make_block_ptr(Ad, (T, 16), (H*16, 1), (i_t * 64 + 16, 0), (16, 16), (1, 0))
+     p_Ad_33 = tl.make_block_ptr(Ad, (T, 16), (H*16, 1), (i_t * 64 + 32, 0), (16, 16), (1, 0))
+     p_Ad_44 = tl.make_block_ptr(Ad, (T, 16), (H*16, 1), (i_t * 64 + 48, 0), (16, 16), (1, 0))
+
+     A_21 = tl.load(p_A_21, boundary_check=(0, 1)).to(tl.float32)
+     A_32 = tl.load(p_A_32, boundary_check=(0, 1)).to(tl.float32)
+     A_31 = tl.load(p_A_31, boundary_check=(0, 1)).to(tl.float32)
+     A_43 = tl.load(p_A_43, boundary_check=(0, 1)).to(tl.float32)
+     A_42 = tl.load(p_A_42, boundary_check=(0, 1)).to(tl.float32)
+     A_41 = tl.load(p_A_41, boundary_check=(0, 1)).to(tl.float32)
+
+     Ai_11 = tl.load(p_Ad_11, boundary_check=(0, 1)).to(tl.float32)
+     Ai_22 = tl.load(p_Ad_22, boundary_check=(0, 1)).to(tl.float32)
+     Ai_33 = tl.load(p_Ad_33, boundary_check=(0, 1)).to(tl.float32)
+     Ai_44 = tl.load(p_Ad_44, boundary_check=(0, 1)).to(tl.float32)
+
+     Ai_21 = -tl.dot(tl.dot(Ai_22, A_21, input_precision='ieee'), Ai_11, input_precision='ieee')
+     Ai_32 = -tl.dot(tl.dot(Ai_33, A_32, input_precision='ieee'), Ai_22, input_precision='ieee')
+     Ai_43 = -tl.dot(tl.dot(Ai_44, A_43, input_precision='ieee'), Ai_33, input_precision='ieee')
+
+     Ai_31 = -tl.dot(
+         Ai_33,
+         tl.dot(A_31, Ai_11, input_precision='ieee') +
+         tl.dot(A_32, Ai_21, input_precision='ieee'),
+         input_precision='ieee'
+     )
+     Ai_42 = -tl.dot(
+         Ai_44,
+         tl.dot(A_42, Ai_22, input_precision='ieee') +
+         tl.dot(A_43, Ai_32, input_precision='ieee'),
+         input_precision='ieee'
+     )
+     Ai_41 = -tl.dot(
+         Ai_44,
+         tl.dot(A_41, Ai_11, input_precision='ieee') +
+         tl.dot(A_42, Ai_21, input_precision='ieee') +
+         tl.dot(A_43, Ai_31, input_precision='ieee'),
+         input_precision='ieee'
+     )
+
+     p_Ai_11 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64, 0), (16, 16), (1, 0))
+     p_Ai_22 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 16, 16), (16, 16), (1, 0))
+     p_Ai_33 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 32, 32), (16, 16), (1, 0))
+     p_Ai_44 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 48, 48), (16, 16), (1, 0))
+     p_Ai_21 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 16, 0), (16, 16), (1, 0))
+     p_Ai_31 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 32, 0), (16, 16), (1, 0))
+     p_Ai_32 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 32, 16), (16, 16), (1, 0))
+     p_Ai_41 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 48, 0), (16, 16), (1, 0))
+     p_Ai_42 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 48, 16), (16, 16), (1, 0))
+     p_Ai_43 = tl.make_block_ptr(Ai, (T, 64), (H*64, 1), (i_t * 64 + 48, 32), (16, 16), (1, 0))
+     tl.store(p_Ai_11, Ai_11.to(p_Ai_11.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_22, Ai_22.to(p_Ai_22.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_33, Ai_33.to(p_Ai_33.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_44, Ai_44.to(p_Ai_44.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_21, Ai_21.to(p_Ai_21.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_31, Ai_31.to(p_Ai_31.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_32, Ai_32.to(p_Ai_32.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_41, Ai_41.to(p_Ai_41.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_42, Ai_42.to(p_Ai_42.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+     tl.store(p_Ai_43, Ai_43.to(p_Ai_43.dtype.element_ty, fp_downcast_rounding="rtne"), boundary_check=(0, 1))
+
+
+ @input_guard
+ def solve_tril(
+     A: torch.Tensor,
+     cu_seqlens: Optional[torch.Tensor] = None,
+     output_dtype: torch.dtype = torch.float
+ ) -> torch.Tensor:
+     """
+     Compute (I + A)^-1, where A is strictly lower triangular block-wise,
+     i.e., `A.triu() == 0` within each block.
+
+     Args:
+         A (torch.Tensor):
+             [B, T, H, BT], where BT is the block size (16, 32 or 64).
+         cu_seqlens (torch.Tensor):
+             The cumulative sequence lengths of the input tensor.
+             Default: None.
+         output_dtype (torch.dtype):
+             The dtype of the output tensor. Default: `torch.float`
+
+     Returns:
+         (I + A)^-1 with the same shape as A.
+     """
+     assert A.shape[-1] in [16, 32, 64]
+
+     B, T, H, BT = A.shape
+     Ad = torch.empty(B, T, H, 16, device=A.device, dtype=torch.float if BT != 16 else output_dtype)
+
+     chunk_indices = prepare_chunk_indices(cu_seqlens, 16) if cu_seqlens is not None else None
+     NT = len(chunk_indices) if cu_seqlens is not None else triton.cdiv(T, 16)
+     solve_tril_16x16_kernel[NT, B * H](
+         A=A,
+         Ad=Ad,
+         cu_seqlens=cu_seqlens,
+         chunk_indices=chunk_indices,
+         T=T,
+         H=H,
+         BT=BT,
+     )
+     if BT == 16:
+         return Ad
+
+     Ai = torch.zeros(B, T, H, BT, device=A.device, dtype=output_dtype)
+     merge_fn = merge_16x16_to_32x32_inverse_kernel if BT == 32 else merge_16x16_to_64x64_inverse_kernel
+     chunk_indices = prepare_chunk_indices(cu_seqlens, BT) if cu_seqlens is not None else None
+     NT = len(chunk_indices) if cu_seqlens is not None else triton.cdiv(T, BT)
+     merge_fn[NT, B * H](
+         A=A,
+         Ad=Ad,
+         Ai=Ai,
+         cu_seqlens=cu_seqlens,
+         chunk_indices=chunk_indices,
+         T=T,
+         H=H,
+         BT=BT,
+     )
+     return Ai
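The kernels invert each `BT x BT` block independently, so the result can be verified block by block against `torch.inverse`; a sketch for the fixed-length case:

```python
import torch

from fla3.ops.utils.solve_tril import solve_tril

B, T, H, BT = 1, 128, 2, 64
NT = T // BT
# build strictly lower triangular blocks, then lay them out as [B, T, H, BT]
A = torch.randn(B, H, NT, BT, BT, device='cuda').tril(-1)
A = A.permute(0, 2, 3, 1, 4).reshape(B, T, H, BT)

Ai = solve_tril(A)
eye = torch.eye(BT, device=A.device)
for h in range(H):
    for t in range(NT):
        blk = A[0, t * BT:(t + 1) * BT, h]
        ref = torch.inverse(eye + blk)
        torch.testing.assert_close(Ai[0, t * BT:(t + 1) * BT, h], ref, rtol=1e-3, atol=1e-3)
```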
flame/__init__.py ADDED
File without changes
flame/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (167 Bytes)
flame/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (167 Bytes)
flame/__pycache__/data.cpython-310.pyc ADDED
Binary file (8.17 kB)
flame/__pycache__/data.cpython-312.pyc ADDED
Binary file (14.9 kB)
flame/__pycache__/logging.cpython-310.pyc ADDED
Binary file (3.56 kB)
flame/__pycache__/logging.cpython-312.pyc ADDED
Binary file (6.44 kB)
flame/__pycache__/parser.cpython-310.pyc ADDED
Binary file (2.89 kB)
flame/__pycache__/parser.cpython-312.pyc ADDED
Binary file (4.07 kB)
flame/data.py ADDED
@@ -0,0 +1,246 @@
+ # -*- coding: utf-8 -*-
+
+ from __future__ import annotations
+
+ from copy import deepcopy
+ from dataclasses import dataclass
+ from typing import Any, Dict, Iterable, List, Union
+
+ import numpy as np
+ import torch
+ from datasets import Dataset, IterableDataset
+ from flame.logging import get_logger
+ from transformers import PreTrainedTokenizer
+
+ logger = get_logger(__name__)
+
+
+ class HuggingfaceDataset(IterableDataset):
+
+     def __init__(
+         self,
+         dataset: Dataset,
+         tokenizer: PreTrainedTokenizer,
+         context_len: int = 2048,
+         rank: int = 0,
+         world_size: int = 1,
+         buffer_size: int = 1024
+     ) -> HuggingfaceDataset:
+
+         self.dataset = dataset
+         self.tokenizer = tokenizer
+
+         self.data = dataset.shard(world_size, rank)
+         self.context_len = context_len
+         self.rank = rank
+         self.world_size = world_size
+         self.buffer_size = buffer_size
+
+         # pick the narrowest integer dtype that can hold the vocabulary
+         if tokenizer.vocab_size < torch.iinfo(torch.int16).max:
+             self.dtype = torch.int16
+         elif tokenizer.vocab_size < torch.iinfo(torch.int32).max:
+             self.dtype = torch.int32
+         else:
+             self.dtype = torch.int64
+         self.states = None
+         self.buffer = torch.tensor([], dtype=self.dtype)
+         self.tokens = []
+         self.rand_id = 0
+         self.token_id = 0
+         self.rng_state = None
+         self._epoch = 0
+
+     def __iter__(self):
+         g = torch.Generator()
+         g.manual_seed(self._epoch + self.rank)
+         if self.rng_state is not None:
+             g.set_state(self.rng_state)
+
+         rand_it = self.randint(0, self.buffer_size, g=g)
+         if self.states is not None:
+             self.data.load_state_dict(self.states)
+
+         # max number of tokens allowed in the chunk buffer
+         n_tokens = self.buffer_size * self.context_len
+
+         while True:
+             for sample in self.tokenize(self.data):
+                 # keep appending the samples to the token buffer
+                 self.tokens += sample
+                 # if the token buffer is full, start sampling
+                 # NOTE: we first convert the token ids to a tensor of shape [n_chunks, context_len] for efficiency
+                 if len(self.buffer) == 0 and len(self.tokens) >= n_tokens:
+                     self.buffer = torch.tensor(self.tokens[:n_tokens], dtype=self.dtype).view(self.buffer_size, -1)
+                     self.tokens = self.tokens[n_tokens:]
+                 if len(self.buffer) == self.buffer_size:
+                     yield from self.sample(rand_it)
+
+             n_chunks = len(self.tokens) // self.context_len
+             # handle the leftover tokens in the buffer
+             if n_chunks > 0:
+                 n_tokens = n_chunks * self.context_len
+                 indices = torch.randperm(n_chunks, generator=g).tolist()
+                 self.buffer = torch.tensor(self.tokens[:n_tokens], dtype=torch.long).view(n_chunks, -1)
+                 self.tokens = self.tokens[n_tokens:]
+                 for i in indices:
+                     yield {'input_ids': self.buffer[i]}
+
+     def tokenize(self, data, batch_size: int = 64):
+         texts, states = [], []
+         for sample in data:
+             texts.append(sample['text'])
+             states.append(self.data.state_dict())
+             if len(texts) == batch_size:
+                 for s, tokenized in zip(states, self.tokenizer(texts, return_attention_mask=False)['input_ids']):
+                     self.states = s
+                     yield tokenized
+                 texts, states = [], []
+         if len(texts) > 0:
+             for s, tokenized in zip(states, self.tokenizer(texts, return_attention_mask=False)['input_ids']):
+                 self.states = s
+                 yield tokenized
+
+     def sample(self, indices):
+         n_tokens = (len(self.tokens) // self.context_len) * self.context_len
+         while self.token_id < n_tokens:
+             i = next(indices)
+             start, end = self.token_id, self.token_id + self.context_len
+             self.token_id += self.context_len
+             yield {'input_ids': self.buffer[i].to(torch.long)}
+             self.buffer[i] = torch.tensor(self.tokens[start:end], dtype=self.dtype)
+         self.token_id = 0
+         self.tokens = self.tokens[n_tokens:]
+
+     def randint(
+         self,
+         low: int,
+         high: int,
+         batch_size: int = 1024,
+         g: torch.Generator = torch.Generator()
+     ) -> Iterable[int]:
+         indices = torch.empty(batch_size, dtype=torch.long)
+         while True:
+             # record the generator states before sampling
+             self.rng_state = g.get_state()
+             indices = torch.randint(low, high, (batch_size,), out=indices, generator=g)
+             for i in indices[self.rand_id:].tolist():
+                 self.rand_id += 1
+                 yield i
+             self.rand_id = 0
+
+     def set_epoch(self, epoch):
+         self._epoch = epoch
+         if hasattr(self.dataset, "set_epoch"):
+             self.dataset.set_epoch(epoch)
+
+     def state_dict(self):
+         return {
+             'states': self.states,
+             'buffer': self.buffer.clone(),
+             'tokens': deepcopy(self.tokens),
+             'rand_id': self.rand_id,
+             'token_id': self.token_id,
+             'rng_state': self.rng_state,
+             'epoch': self._epoch
+         }
+
+     def load_state_dict(self, state_dict):
+         self.states = state_dict['states']
+         self.buffer = state_dict['buffer'].clone()
+         self.tokens = deepcopy(state_dict['tokens'])
+         self.rand_id = state_dict['rand_id']
+         self.token_id = state_dict['token_id']
+         self.rng_state = state_dict['rng_state'].clone() if state_dict['rng_state'] is not None else None
+         self._epoch = state_dict['epoch']
+
+
+ @dataclass
+ class DataCollatorForLanguageModeling:
+     """
+     Data collator used for language modeling.
+
+     Args:
+         tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
+             The tokenizer used for encoding the data.
+         varlen (`bool`):
+             Whether to return sequences with variable lengths.
+             If `True`, the offsets indicating the start and end of each sequence will be returned.
+             For example, if the sequence lengths are `[4, 8, 12]`,
+             the returned `input_ids` will be a long flattened tensor of shape `[1, 24]`, with `offsets` being `[0, 4, 12, 24]`.
+             If `False`, the `input_ids` with shape `[batch_size, seq_len]` will be returned directly.
+         return_tensors (`str`):
+             The type of Tensor to return. Allowable values are "pt".
+     """
+
+     tokenizer: PreTrainedTokenizer
+     varlen: bool = False
+     return_tensors: str = "pt"
+
+     def __call__(
+         self,
+         examples: List[Union[List[int], Dict[str, Any]]]
+     ) -> Dict[str, Any]:
+         if not isinstance(examples[0], Dict):
+             examples = [{'input_ids': example} for example in examples]
+
+         def tensorize(example: Dict[str, Any]) -> Dict[str, Any]:
+             tensorized = {}
+             for key in ['input_ids', 'offsets']:
+                 if key not in example:
+                     continue
+                 if isinstance(example[key], List):
+                     tensorized[key] = torch.tensor(example[key], dtype=torch.long)
+                 elif isinstance(example[key], np.ndarray):
+                     tensorized[key] = torch.from_numpy(example[key])
+                 else:
+                     tensorized[key] = example[key]
+             return tensorized
+
+         examples = list(map(tensorize, examples))
+
+         if not self.varlen:
+             length_of_first = examples[0]['input_ids'].size(0)
+             # Check if padding is necessary.
+             if all(example['input_ids'].size(0) == length_of_first for example in examples):
+                 batch = {
+                     'input_ids': torch.stack([example['input_ids'] for example in examples], dim=0),
+                 }
+             else:
+                 # If yes, check if we have a `pad_token`.
+                 if self.tokenizer._pad_token is None:
+                     raise ValueError(
+                         f"You are attempting to pad samples but the tokenizer you are using "
+                         f"({self.tokenizer.__class__.__name__}) does not have a pad token."
+                     )
+                 batch = self.tokenizer.pad(examples, return_tensors=self.return_tensors, return_attention_mask=False)
+         else:
+             if len(examples) > 1:
+                 raise ValueError("The batch size must be 1 for variable length inputs.")
+             batch = {
+                 'input_ids': torch.cat([example['input_ids'] for example in examples], dim=0).unsqueeze(0)
+             }
+             if 'offsets' in examples[0]:
+                 batch['offsets'] = torch.cat([example['offsets'] for example in examples], dim=0).unsqueeze(0)
+             else:
+                 # determine boundaries by bos/eos positions
+                 if self.tokenizer.add_bos_token:
+                     offsets = []
+                     if batch['input_ids'][0, 0] != self.tokenizer.bos_token_id:
+                         offsets.append(torch.tensor([0], dtype=torch.long))
+                     offsets.append(torch.where(batch['input_ids'].eq(self.tokenizer.bos_token_id))[1])
+                     offsets.append(torch.tensor([len(batch['input_ids'][0])], dtype=torch.long))
+                     batch['offsets'] = torch.cat(offsets, dim=0)
+                 elif self.tokenizer.add_eos_token:
+                     offsets = [torch.tensor([0], dtype=torch.long)]
+                     offsets.append(torch.where(batch['input_ids'].eq(self.tokenizer.eos_token_id))[1] + 1)
+                     if batch['input_ids'][0, -1] != self.tokenizer.eos_token_id:
+                         offsets.append(torch.tensor([len(batch['input_ids'][0])], dtype=torch.long))
+                     batch['offsets'] = torch.cat(offsets, dim=0)
+                 else:
+                     raise ValueError("You must allow the tokenizer to add either a bos or eos token as separators.")
+
+         labels = batch['input_ids'].clone()
+         if self.tokenizer.pad_token_id is not None:
+             labels[labels == self.tokenizer.pad_token_id] = -100
+         batch["labels"] = labels
+         return batch
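A minimal sketch of the collator in its fixed-length mode (the tokenizer name is just this repo's default from `flame/parser.py`; any tokenizer with a pad token works):

```python
from transformers import AutoTokenizer

from flame.data import DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('fla-hub/gla-1.3B-100B')
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, varlen=False)

batch = collator([{'input_ids': [1, 2, 3, 4]}, {'input_ids': [5, 6, 7, 8]}])
# batch['input_ids'] is a [2, 4] tensor;
# batch['labels'] equals input_ids with pad positions replaced by -100
```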
flame/logging.py ADDED
@@ -0,0 +1,118 @@
+ # -*- coding: utf-8 -*-
+
+ import json
+ import logging
+ import os
+ import sys
+ import time
+
+ from transformers.trainer_callback import (ExportableState, TrainerCallback,
+                                            TrainerControl, TrainerState)
+ from transformers.training_args import TrainingArguments
+
+
+ def get_logger(name: str = None) -> logging.Logger:
+     formatter = logging.Formatter(
+         fmt="%(asctime)s - %(levelname)s - %(name)s - %(message)s", datefmt="%m/%d/%Y %H:%M:%S"
+     )
+     handler = logging.StreamHandler(sys.stdout)
+     handler.setFormatter(formatter)
+
+     logger = logging.getLogger(name)
+     # only rank 0 emits INFO-level logs to avoid duplicates in distributed runs
+     if 'RANK' in os.environ and int(os.environ['RANK']) == 0:
+         logger.setLevel(logging.INFO)
+         logger.addHandler(handler)
+
+     return logger
+
+
+ logger = get_logger(__name__)
+
+ LOG_FILE_NAME = "trainer_log.jsonl"
+
+
+ class LogCallback(TrainerCallback, ExportableState):
+     def __init__(self, start_time: float = None, elapsed_time: float = None):
+
+         self.start_time = time.time() if start_time is None else start_time
+         self.elapsed_time = 0 if elapsed_time is None else elapsed_time
+         self.last_time = self.start_time
+
+     def on_train_begin(
+         self,
+         args: TrainingArguments,
+         state: TrainerState,
+         control: TrainerControl,
+         **kwargs
+     ):
+         r"""
+         Event called at the beginning of training.
+         """
+         if state.is_local_process_zero:
+             if not args.resume_from_checkpoint:
+                 self.start_time = time.time()
+                 self.elapsed_time = 0
+             else:
+                 self.start_time = state.stateful_callbacks['LogCallback']['start_time']
+                 self.elapsed_time = state.stateful_callbacks['LogCallback']['elapsed_time']
+
+         if args.save_on_each_node:
+             if not state.is_local_process_zero:
+                 return
+         else:
+             if not state.is_world_process_zero:
+                 return
+
+         self.last_time = time.time()
+         if os.path.exists(os.path.join(args.output_dir, LOG_FILE_NAME)) and args.overwrite_output_dir:
+             logger.warning("Previous log file in this folder will be deleted.")
+             os.remove(os.path.join(args.output_dir, LOG_FILE_NAME))
+
+     def on_log(
+         self,
+         args: TrainingArguments,
+         state: TrainerState,
+         control: TrainerControl,
+         logs,
+         **kwargs
+     ):
+         if args.save_on_each_node:
+             if not state.is_local_process_zero:
+                 return
+         else:
+             if not state.is_world_process_zero:
+                 return
+
+         self.elapsed_time += time.time() - self.last_time
+         self.last_time = time.time()
+         if 'num_input_tokens_seen' in logs:
+             logs['num_tokens'] = logs.pop('num_input_tokens_seen')
+             state.log_history[-1].pop('num_input_tokens_seen')
+             throughput = logs['num_tokens'] / args.world_size / self.elapsed_time
+             state.log_history[-1]['throughput'] = logs['throughput'] = throughput
+         state.stateful_callbacks["LogCallback"] = self.state()
+
+         logs = dict(
+             current_steps=state.global_step,
+             total_steps=state.max_steps,
+             loss=state.log_history[-1].get("loss", None),
+             eval_loss=state.log_history[-1].get("eval_loss", None),
+             predict_loss=state.log_history[-1].get("predict_loss", None),
+             learning_rate=state.log_history[-1].get("learning_rate", None),
+             epoch=state.log_history[-1].get("epoch", None),
+             percentage=round(state.global_step / state.max_steps * 100, 2) if state.max_steps != 0 else 100,
+         )
+
+         os.makedirs(args.output_dir, exist_ok=True)
+         with open(os.path.join(args.output_dir, LOG_FILE_NAME), "a", encoding="utf-8") as f:
+             f.write(json.dumps(logs) + "\n")
+
+     def state(self) -> dict:
+         return {
+             'start_time': self.start_time,
+             'elapsed_time': self.elapsed_time
+         }
+
+     @classmethod
+     def from_state(cls, state):
+         return cls(state['start_time'], state['elapsed_time'])
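Each record that `LogCallback.on_log` appends is one JSON object per line, so the log can be tailed or parsed trivially; a sketch (the path depends on whatever `--output_dir` was set to):

```python
import json

# read back the per-step records written to <output_dir>/trainer_log.jsonl
with open('exp/trainer_log.jsonl', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        print(record['current_steps'], record['total_steps'], record['loss'])
```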
flame/parser.py ADDED
@@ -0,0 +1,94 @@
+ # -*- coding: utf-8 -*-
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+ from typing import Optional
+
+ import transformers
+ from transformers import HfArgumentParser, TrainingArguments
+
+ from flame.logging import get_logger
+
+ logger = get_logger(__name__)
+
+
+ @dataclass
+ class TrainingArguments(TrainingArguments):
+
+     model_name_or_path: str = field(
+         default=None,
+         metadata={
+             "help": "Path to the model weight or identifier from huggingface.co/models or modelscope.cn/models."
+         },
+     )
+     tokenizer: str = field(
+         default="fla-hub/gla-1.3B-100B",
+         metadata={"help": "Name of the tokenizer to use."}
+     )
+     use_fast_tokenizer: bool = field(
+         default=False,
+         metadata={"help": "Whether or not to use one of the fast tokenizers (backed by the tokenizers library)."},
+     )
+     from_config: bool = field(
+         default=True,
+         metadata={"help": "Whether to initialize models from scratch."},
+     )
+     dataset: Optional[str] = field(
+         default=None,
+         metadata={"help": "The dataset(s) to use. Use commas to separate multiple datasets."},
+     )
+     dataset_name: Optional[str] = field(
+         default=None,
+         metadata={"help": "The name of provided dataset(s) to use."},
+     )
+     cache_dir: str = field(
+         default=None,
+         metadata={"help": "Path to the cached tokenized dataset."},
+     )
+     split: str = field(
+         default="train",
+         metadata={"help": "Which dataset split to use for training and evaluation."},
+     )
+     streaming: bool = field(
+         default=False,
+         metadata={"help": "Enable dataset streaming."},
+     )
+     hf_hub_token: Optional[str] = field(
+         default=None,
+         metadata={"help": "Auth token to log in with Hugging Face Hub."},
+     )
+     preprocessing_num_workers: Optional[int] = field(
+         default=None,
+         metadata={"help": "The number of processes to use for the pre-processing."},
+     )
+     buffer_size: int = field(
+         default=2048,
+         metadata={"help": "Size of the buffer to randomly sample examples from in dataset streaming."},
+     )
+     context_length: int = field(
+         default=2048,
+         metadata={"help": "The context length of the tokenized inputs in the dataset."},
+     )
+     varlen: bool = field(
+         default=False,
+         metadata={"help": "Enable training with variable length inputs."},
+     )
+
+
+ def get_train_args():
+     parser = HfArgumentParser(TrainingArguments)
+     args, unknown_args = parser.parse_args_into_dataclasses(return_remaining_strings=True)
+
+     if unknown_args:
+         print(parser.format_help())
+         print("Got unknown args, potentially deprecated arguments: {}".format(unknown_args))
+         raise ValueError("Some specified arguments are not used by the HfArgumentParser: {}".format(unknown_args))
+
+     if args.should_log:
+         transformers.utils.logging.set_verbosity(args.get_process_log_level())
+         transformers.utils.logging.enable_default_handler()
+         transformers.utils.logging.enable_explicit_format()
+     # set seeds manually
+     transformers.set_seed(args.seed)
+     return args
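Since `get_train_args` goes through `HfArgumentParser`, every dataclass field above doubles as a CLI flag; a minimal sketch of driving it programmatically (the argv values are illustrative placeholders):

```python
import sys

from flame.parser import get_train_args

# HfArgumentParser reads flags from sys.argv
sys.argv = [
    'train.py',
    '--output_dir', 'exp/debug',
    '--dataset', 'my_dataset',      # illustrative placeholder
    '--context_length', '4096',
]
args = get_train_args()
print(args.tokenizer, args.context_length, args.varlen)
```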