NickMarkovsky committed on Apr 22

Commit

c8c055f

0 Parent(s):

Duplicate from NickMarkovsky/tenns-llm-1b


Co-authored-by: Nick Markovsky <NickMarkovsky@users.noreply.huggingface.co>

Files changed (29)

.gitattributes +35 -0
README.md +101 -0
config.json +13 -0
configuration_tenns_llm.py +30 -0
model.safetensors +3 -0
modeling_tenns_llm.py +289 -0
tenns_core/__init__.py +50 -0
tenns_core/__pycache__/__init__.cpython-310.pyc +0 -0
tenns_core/__pycache__/__init__.cpython-312.pyc +0 -0
tenns_core/__pycache__/activations.cpython-310.pyc +0 -0
tenns_core/__pycache__/activations.cpython-312.pyc +0 -0
tenns_core/__pycache__/fft_ops.cpython-310.pyc +0 -0
tenns_core/__pycache__/fft_ops.cpython-312.pyc +0 -0
tenns_core/__pycache__/inference.cpython-310.pyc +0 -0
tenns_core/__pycache__/inference.cpython-312.pyc +0 -0
tenns_core/__pycache__/recurrent_ops.cpython-310.pyc +0 -0
tenns_core/__pycache__/recurrent_ops.cpython-312.pyc +0 -0
tenns_core/__pycache__/scan_ops.cpython-310.pyc +0 -0
tenns_core/__pycache__/scan_ops.cpython-312.pyc +0 -0
tenns_core/__pycache__/ssm.cpython-310.pyc +0 -0
tenns_core/__pycache__/ssm.cpython-312.pyc +0 -0
tenns_core/activations.py +158 -0
tenns_core/fft_ops.py +174 -0
tenns_core/inference.py +540 -0
tenns_core/recurrent_ops.py +437 -0
tenns_core/scan_ops.py +515 -0
tenns_core/ssm.py +481 -0
tokenizer.json +0 -0
tokenizer_config.json +17 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,101 @@

+---
+license: cc-by-nc-4.0
+language:
+- en
+library_name: transformers
+tags:
+- ssm
+- causal-lm
+- custom-architecture
+- recurrent
+pipeline_tag: text-generation
+---
+# TENNs LLM 1B
+A 1-billion-parameter causal language model built on gate-mode SSM (State Space Model) layers from [TENNs Core](https://huggingface.co/BrainChipInc/tenns-llm-1b/tree/main/tenns_core). Uses recurrent inference instead of attention, making it efficient for streaming and long-context generation.
+## Architecture
+| Component | Details |
+|-----------|---------|
+| Layers | 24 × TENNsBlock (gate mode) |
+| Hidden dim | 2048 |
+| Inner dim | 4096 |
+| Vocabulary | 32,000 (Mistral-7B tokenizer) |
+| Parameters | ~1B |
+Each TENNsBlock: `RMSNorm → in_proj → causal_conv(4) → SSM(gate) → out_proj → residual`
+## Quick Start (Google Colab / any environment)
+```python
+!pip install transformers torch einops opt_einsum safetensors
+from transformers import AutoModelForCausalLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("BrainChipInc/tenns-llm-1b")
+model = AutoModelForCausalLM.from_pretrained(
+    "BrainChipInc/tenns-llm-1b",
+    trust_remote_code=True,
+)
+output = model.generate_text("The history of artificial intelligence", tokenizer, max_new_tokens=100)
+print(output)
+```
+> **Do not use `pipeline()`** — this model uses a custom recurrent architecture that is not
+> compatible with HuggingFace's standard text-generation pipeline.
+## Installation
+```bash
+pip install transformers torch einops opt_einsum safetensors
+```
+## Usage
+> **Note:** Do **not** use `pipeline()` — this model requires `model.generate_text()` instead of
+> HuggingFace's standard `generate()`. The recurrent SSM architecture is not compatible with the
+> attention KV-cache pipeline.
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("BrainChipInc/tenns-llm-1b")
+model = AutoModelForCausalLM.from_pretrained(
+    "BrainChipInc/tenns-llm-1b",
+    trust_remote_code=True,
+)
+output = model.generate_text("The history of artificial intelligence", tokenizer, max_new_tokens=100)
+print(output)
+```
+### Generation options
+```python
+# Greedy decoding (default)
+output = model.generate_text(prompt, tokenizer, max_new_tokens=50)
+# Top-k sampling with temperature
+output = model.generate_text(prompt, tokenizer, max_new_tokens=100, temperature=0.8, top_k=50)
+```
+## `trust_remote_code=True`
+This model uses custom modeling code bundled in the repository
+(`modeling_tenns_llm.py`, `configuration_tenns_llm.py`, `tenns_core/`).
+Loading requires `trust_remote_code=True`. The bundled `tenns_core/` package
+is a snapshot of the TENNs Core SSM library — no separate installation needed.
+## Training
+Fine-tuned from a base TENNs gate-mode model using LoRA adapters on English instruction data.
+LoRA adapters are merged into base weights at export time.
+## Limitations
+- English only
+- No system prompt or chat template — plain completion model
+- Recurrent state resets between calls to `generate_text()`

config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "model_type": "tenns_llm",
+  "auto_map": {
+    "AutoConfig": "configuration_tenns_llm.TennsLLMConfig",
+    "AutoModelForCausalLM": "modeling_tenns_llm.TennsLLMForCausalLM"
+  },
+  "vocab_size": 32000,
+  "channels": 2048,
+  "num_blocks": 24,
+  "num_coeffs": 16,
+  "repeat": 256,
+  "transformers_version": "4.40.0"
+}
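
As a rough cross-check of the "~1B" figure in the README, the layer shapes visible in `modeling_tenns_llm.py` can be tallied from these config values. The sketch below counts only parameters whose shapes appear in this commit (embedding, head, and the per-block norm, conv weights, projections, and `D`). Parameters inside `SSMLayer` are left out because their shapes are not shown in this excerpt, so the result is a lower bound. The helper name `count_visible_params` is hypothetical.

```python
def count_visible_params(vocab_size=32000, channels=2048, num_blocks=24):
    """Count parameters whose shapes are visible in modeling_tenns_llm.py.

    SSMLayer parameters are deliberately excluded (shapes not shown here).
    """
    d_inner = channels * 2  # TENNsBlock sets d_inner = channels * 2

    embedding = vocab_size * channels   # nn.Embedding(vocab_size, channels)
    head = channels * vocab_size        # nn.Linear(bias=False); head RMSNorm has no weights

    per_block = (
        channels                                    # pre_norm (RMSNorm with affine weight)
        + 4 * d_inner                               # pre_conv weight, shape (4, d_inner)
        + channels * (2 * d_inner) + 2 * d_inner    # in_proj weight + bias
        + d_inner * channels + channels             # out_proj weight + bias
        + d_inner                                   # D
    )
    return embedding + head + num_blocks * per_block


total = count_visible_params()
print(f"{total:,} parameters outside SSMLayer")
```

With the config above this comes to 735,838,208 parameters, roughly 0.74B, before the SSM layers are added.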

configuration_tenns_llm.py ADDED

	@@ -0,0 +1,30 @@

+import os
+import sys
+from transformers import PretrainedConfig
+# Inject the repo directory into sys.path so the bundled tenns_core/ is
+# importable without a pip install, both locally and when loaded from HF hub.
+_HERE = os.path.dirname(os.path.abspath(__file__))
+if _HERE not in sys.path:
+    sys.path.insert(0, _HERE)
+class TennsLLMConfig(PretrainedConfig):
+    model_type = "tenns_llm"
+    def __init__(
+        self,
+        vocab_size=32000,
+        channels=2048,
+        num_blocks=24,
+        num_coeffs=16,
+        repeat=256,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.vocab_size = vocab_size
+        self.channels = channels
+        self.num_blocks = num_blocks
+        self.num_coeffs = num_coeffs
+        self.repeat = repeat

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:695805667bf74d3bb24b8fc0c676e75c26c21191ebe91326429d4f61e43740ff
+size 4957835584

modeling_tenns_llm.py ADDED

	@@ -0,0 +1,289 @@

+import importlib
+import os
+import sys
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.nn import RMSNorm
+from transformers import PreTrainedModel
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from configuration_tenns_llm import TennsLLMConfig
+def _get_tenns_core_path():
+    """Return a directory that contains tenns_core/.
+    HF's from_pretrained only downloads the .py files listed in auto_map —
+    it does not download subdirectories like tenns_core/. We use
+    snapshot_download (with local cache) to ensure tenns_core/ is present.
+    The first call downloads it; subsequent calls are instant cache hits.
+    """
+    # Derive the repo_id from __file__ path in the HF modules cache:
+    # .../modules/transformers_modules/ORG/REPO_SLUG/HASH/modeling_tenns_llm.py
+    here = os.path.dirname(os.path.abspath(__file__))
+    parts = here.replace("\\", "/").split("/")
+    try:
+        idx = next(i for i, p in enumerate(parts) if p == "transformers_modules")
+        org_id  = parts[idx + 1].replace("_hyphen_", "-")
+        repo_id = parts[idx + 2].replace("_hyphen_", "-")
+    except (StopIteration, IndexError):
+        return here  # not in HF cache — assume tenns_core/ is next to this file
+    from huggingface_hub import snapshot_download
+    snapshot = snapshot_download(
+        f"{org_id}/{repo_id}",
+        allow_patterns=["tenns_core/**"],
+    )
+    return snapshot
+_tenns_core_dir = _get_tenns_core_path()
+if _tenns_core_dir not in sys.path:
+    sys.path.insert(0, _tenns_core_dir)
+_tc = importlib.import_module("tenns_core")
+_rc = importlib.import_module("tenns_core.recurrent_ops")
+SSMLayer = _tc.SSMLayer
+recurrent_gate = _rc.recurrent_gate
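
The repo-id recovery in `_get_tenns_core_path` can be tried in isolation, without touching the Hub. The sketch below mirrors the path parsing only. The function name `repo_id_from_module_path` and the example paths are hypothetical, and the `_hyphen_` escaping is taken from the code above rather than verified against a particular `transformers` release.

```python
def repo_id_from_module_path(path):
    """Return 'org/repo' if path lies under a transformers_modules cache dir, else None."""
    parts = path.replace("\\", "/").split("/")
    try:
        idx = parts.index("transformers_modules")
        org, repo = parts[idx + 1], parts[idx + 2]
    except (ValueError, IndexError):
        return None  # not a cache path: the caller falls back to the local directory
    # The commit's code assumes hyphens are stored as "_hyphen_" in the cache path.
    return f"{org.replace('_hyphen_', '-')}/{repo.replace('_hyphen_', '-')}"


cached = ("/home/u/.cache/huggingface/modules/transformers_modules/"
          "BrainChipInc/tenns_hyphen_llm_hyphen_1b/abc123")
print(repo_id_from_module_path(cached))
print(repo_id_from_module_path("/workspace/tenns-llm-1b"))
```

If the cache layout differs from this assumption, the parsed id will be wrong and `snapshot_download` will fail, so the fallback branch matters for local checkouts.
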
+# ============================================================================
+# Model Components (from tenns_llm.py)
+# ============================================================================
+class CausalConvDwFast(nn.Module):
+    """Holds depthwise causal convolution weights for TENNs blocks."""
+    def __init__(self, coeffs, kernel_size):
+        super().__init__()
+        self.weight = nn.Parameter(torch.rand(kernel_size, coeffs))
+class PassthroughConv(nn.Module):
+    """Applies causal convolution via FIFO buffer for streaming inference."""
+    def __init__(self, causal_conv, d_inner):
+        super().__init__()
+        self.causal_conv = causal_conv
+        self.d_inner = d_inner
+        self.fifo = None
+    def apply_conv(self, x):
+        """Apply causal convolution. x: (B, T, C) -> (B, T, C)"""
+        B, T, C = x.shape
+        if self.fifo is None or self.fifo.shape[0] != B:
+            self.fifo = torch.zeros(B, C, 4, device=x.device, dtype=x.dtype)
+        conv_weight = self.causal_conv.weight.squeeze().T  # (C, 4)
+        x_conv = []
+        for t in range(T):
+            self.fifo = self.fifo.roll(-1, dims=-1)
+            self.fifo[:, :, -1] = x[:, t, :]
+            x_t = (self.fifo * conv_weight).sum(-1)
+            x_conv.append(x_t)
+        x_conv = torch.stack(x_conv, dim=1)
+        x_conv = F.silu(x_conv)
+        return x_conv
+    def reset_states(self):
+        if self.fifo is not None:
+            self.fifo.zero_()
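
The FIFO update in `apply_conv` is a streaming form of a causal convolution: each step shifts the buffer, writes the newest sample into the last slot, and takes a weighted sum. A minimal single-channel sketch in plain Python, assuming the same tap order (oldest first, newest last) and omitting the SiLU that follows; `FifoConv` and `causal_conv_direct` are illustrative names, not part of this repository:

```python
def causal_conv_direct(x, w):
    """y[t] = sum_k w[k] * x[t - (K - 1) + k], with x treated as 0 before t = 0."""
    K = len(w)
    padded = [0.0] * (K - 1) + list(x)
    return [sum(w[k] * padded[t + k] for k in range(K)) for t in range(len(x))]


class FifoConv:
    """Streaming version: keeps only the last K samples in a buffer."""

    def __init__(self, w):
        self.w = list(w)
        self.fifo = [0.0] * len(w)

    def step(self, x_t):
        self.fifo = self.fifo[1:] + [x_t]  # shift left, newest sample last
        return sum(f * wk for f, wk in zip(self.fifo, self.w))

    def reset(self):
        self.fifo = [0.0] * len(self.w)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, -1.0, 0.25, 2.0]  # oldest tap first, newest tap last
conv = FifoConv(w)
# Feed the sequence in two chunks; the buffer carries state across them.
streamed = [conv.step(v) for v in x[:2]] + [conv.step(v) for v in x[2:]]
print(streamed == causal_conv_direct(x, w))
```

Because the buffer persists between calls, chunked input gives the same result as processing the whole sequence at once, which is why `reset_states()` has to be called before an unrelated prompt.
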
+class TENNsBlock(nn.Module):
+    """TENNs block with gate-mode SSM for LLM inference."""
+    def __init__(self, channels, num_coeffs, repeat, mode='gate'):
+        super().__init__()
+        d_inner = channels * 2
+        self.d_inner = d_inner
+        self.pre_norm = RMSNorm(channels, elementwise_affine=True)
+        self.pre_conv = CausalConvDwFast(d_inner, 4)
+        self.in_proj = nn.Linear(channels, d_inner * 2, bias=True)
+        self.out_proj = nn.Linear(d_inner, channels, bias=True)
+        self.ssm_layer = SSMLayer(num_coeffs, d_inner, d_inner,
+                                  repeat=repeat, mode=mode, transposed=True)
+        self.ssm_layer.register_buffer('state_lora', torch.zeros(d_inner))
+        self.D = nn.Parameter(torch.ones(d_inner, dtype=torch.float))
+        self._conv_handler = None
+        self.state = None
+    def forward(self, input):
+        x = self.pre_norm(input)
+        x_and_res = self.in_proj(x)
+        x, res = x_and_res.split([self.d_inner, self.d_inner], -1)
+        if self._conv_handler is None:
+            self._conv_handler = PassthroughConv(self.pre_conv, self.d_inner)
+        x_conv = self._conv_handler.apply_conv(x)
+        state = self.state
+        if state is None:
+            state = self.ssm_layer.state_lora
+        y, self.state = recurrent_gate(
+            x_conv,
+            self.ssm_layer.A,
+            self.ssm_layer.B,
+            self.ssm_layer.C,
+            self.ssm_layer.log_dt,
+            state
+        )
+        y = y.transpose(1, 2)
+        y = y + self.D * x_conv
+        output = self.out_proj(y * F.silu(res))
+        return input + output
+    def reset_states(self):
+        if self._conv_handler is not None:
+            self._conv_handler.reset_states()
+        self.state = None
+class TENNsLLM(nn.Module):
+    """TENNs-based language model for autoregressive text generation."""
+    def __init__(self, vocab_size=32000, channels=2048, num_blocks=24,
+                 num_coeffs=16, repeat=256):
+        super().__init__()
+        self.channels = channels
+        self.embedding = nn.Embedding(vocab_size, channels)
+        self.backbone = nn.Sequential(
+            *[TENNsBlock(channels, num_coeffs, repeat, mode='gate')
+              for _ in range(num_blocks)]
+        )
+        self.head = nn.Sequential(
+            RMSNorm(channels, elementwise_affine=False),
+            nn.Linear(channels, vocab_size, bias=False),
+        )
+    def forward(self, tokens):
+        x = self.embedding(tokens)
+        x = self.backbone(x)
+        return self.head(x)
+    def reset_states(self):
+        for module in self.modules():
+            if isinstance(module, TENNsBlock):
+                module.reset_states()
+# ============================================================================
+# HuggingFace wrapper
+# ============================================================================
+class TennsLLMForCausalLM(PreTrainedModel):
+    """HuggingFace PreTrainedModel wrapper for TENNsLLM.
+    Load with:
+        from transformers import AutoModelForCausalLM, AutoTokenizer
+        model = AutoModelForCausalLM.from_pretrained(
+            "aliborji/tenns-llm-1b", trust_remote_code=True
+        )
+        tokenizer = AutoTokenizer.from_pretrained("aliborji/tenns-llm-1b")
+    Generate with:
+        output = model.generate_text("Hello, world!", tokenizer, max_new_tokens=50)
+        print(output)
+    Note: This model uses recurrent SSM states. Use generate_text() rather than
+    model.generate(), which is designed for attention-based KV-cache models.
+    """
+    config_class = TennsLLMConfig
+    # Weights are saved without a 'model.' prefix — flatten components directly
+    # onto this class so state dict keys match the safetensors file exactly.
+    _tied_weights_keys = []
+    @property
+    def all_tied_weights_keys(self):
+        return {}
+    def __init__(self, config: TennsLLMConfig):
+        super().__init__(config)
+        # Assign TENNsLLM components directly (not as self.model) so that
+        # state dict keys match the safetensors: embedding.weight, backbone.0...
+        _backbone = TENNsLLM(
+            vocab_size=config.vocab_size,
+            channels=config.channels,
+            num_blocks=config.num_blocks,
+            num_coeffs=config.num_coeffs,
+            repeat=config.repeat,
+        )
+        self.embedding = _backbone.embedding
+        self.backbone  = _backbone.backbone
+        self.head      = _backbone.head
+    def _reset_states(self):
+        for module in self.modules():
+            if isinstance(module, TENNsBlock):
+                module.reset_states()
+    def forward(self, input_ids, **kwargs):
+        x = self.embedding(input_ids)
+        x = self.backbone(x)
+        logits = self.head(x)
+        return CausalLMOutputWithPast(logits=logits)
+    @torch.no_grad()
+    def generate_text(self, prompt, tokenizer, max_new_tokens=50,
+                      temperature=1.0, top_k=None):
+        """Autoregressive text generation.
+        Args:
+            prompt: Input text string
+            tokenizer: HuggingFace tokenizer
+            max_new_tokens: Maximum number of tokens to generate
+            temperature: Sampling temperature (lower = more deterministic)
+            top_k: If set, sample from top-k tokens; otherwise greedy argmax
+        Returns:
+            Generated text string (not including the prompt)
+        """
+        self.eval()
+        self._reset_states()
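+        # Index the batch dimension rather than squeeze(): a one-token prompt
+        # would otherwise become a 0-d tensor that cannot be iterated.
+        input_ids = tokenizer(prompt, return_tensors='pt',
+                              add_special_tokens=False)['input_ids'][0]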
+        input_ids = input_ids.to(self.device)
+        # Ingest prompt tokens one at a time to build up the recurrent state;
+        # only the logits after the last prompt token are needed.
+        for token in input_ids:
+            logits = self.forward(token.view(1, 1)).logits
+        next_logits = logits[0, -1]
+        # Autoregressive generation
+        output_ids = []
+        for _ in range(max_new_tokens):
+            if temperature != 1.0:
+                next_logits = next_logits / temperature
+            if top_k is not None:
+                v, _ = torch.topk(next_logits, top_k)
+                next_logits[next_logits < v[-1]] = float('-inf')
+            probs = F.softmax(next_logits, dim=-1)
+            token = (torch.multinomial(probs, 1).item() if top_k is not None
+                     else torch.argmax(probs).item())
+            if token == tokenizer.eos_token_id:
+                break
+            # Keep the chosen token (including the first one predicted from
+            # the prompt), then feed it back to get the next distribution.
+            output_ids.append(token)
+            next_logits = self.forward(
+                torch.tensor([[token]], device=self.device)).logits[0, -1]
+        return tokenizer.decode(output_ids)
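
The token-selection rule inside `generate_text` (greedy argmax when `top_k` is unset, otherwise temperature scaling followed by sampling from the `top_k` highest logits) can be sketched without PyTorch. This is a plain-Python illustration of that rule under the assumption `temperature > 0`; `sample_next` is a hypothetical helper, not a function in this repository.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index from raw logits.

    top_k=None -> greedy argmax (temperature does not change the argmax).
    top_k=k    -> sample among the k highest logits after temperature scaling.
    """
    if top_k is None:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    threshold = sorted(scaled, reverse=True)[top_k - 1]  # k-th largest value
    kept = [(i, s) for i, s in enumerate(scaled) if s >= threshold]
    peak = max(s for _, s in kept)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for _, s in kept]
    return rng.choices([i for i, _ in kept], weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
print(sample_next(logits))                                      # greedy
print(sample_next(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

Lower temperatures sharpen the distribution over the kept tokens; with `top_k=1` the sampled result coincides with the greedy choice.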

tenns_core/__init__.py ADDED

	@@ -0,0 +1,50 @@

+"""
+TENNs Core: Efficient State Space Models for Sequence Modeling
+A standalone library providing various SSM (State Space Model) architectures
+for deep learning on sequences. Includes S5, DWS, Neck, Full, and Gate modes
+all implemented in pure PyTorch.
+Quick Start - Training:
+----------------------
+>>> from tenns_core import SSMLayer
+>>> import torch
+>>>
+>>> # Create S5-mode SSM layer
+>>> layer = SSMLayer(
+...     num_coeffs=64,
+...     in_channels=128,
+...     out_channels=256,
+...     mode='s5',
+...     norm='layer',
+...     postact='gelu'
+... )
+>>>
+>>> # Forward pass (training mode - FFT convolution)
+>>> x = torch.randn(4, 128, 512)  # (batch, channels, length)
+>>> y = layer(x)  # (4, 256, 512)
+Quick Start - Streaming Inference:
+----------------------------------
+>>> # Convert trained model to streaming inference
+>>> infer_layer = layer.to_inference()
+>>>
+>>> # Process audio stream chunk-by-chunk
+>>> for chunk in audio_stream:
+...     output = infer_layer(chunk)  # State maintained automatically
+>>>
+>>> # Reset state between utterances
+>>> infer_layer.reset_state()
+"""
+from importlib.metadata import PackageNotFoundError, version
+from .inference import SSMLayerInference
+from .ssm import Kernelizer, SSMLayer
+try:
+    __version__ = version('tenns-core')
+except PackageNotFoundError:
+    __version__ = '0.0.0+unknown'
+__all__ = ['Kernelizer', 'SSMLayer', 'SSMLayerInference']
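
The docstring above describes two execution paths for the same layer: a convolution over the full sequence for training and a stateful recurrence for streaming. The sketch below shows why the two can agree, using a generic one-state linear SSM with scalar parameters `a`, `b`, `c`. These names and the direct-sum convolution are assumptions for illustration; the library's own discretization and FFT path are not reproduced here.

```python
import math


def ssm_recurrent(x, a, b, c):
    """Step-by-step scan: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def ssm_convolution(x, a, b, c):
    """Same map written as a causal convolution with kernel k[l] = c * a**l * b."""
    kernel = [c * (a ** l) * b for l in range(len(x))]
    return [sum(kernel[l] * x[t - l] for l in range(t + 1)) for t in range(len(x))]


x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
rec = ssm_recurrent(x, a=0.9, b=0.3, c=2.0)
conv = ssm_convolution(x, a=0.9, b=0.3, c=2.0)
print(all(math.isclose(r, k, abs_tol=1e-12) for r, k in zip(rec, conv)))
```

The recurrence needs only the previous state per step, which suits chunked input, while the convolution form exposes the whole sequence at once and can be evaluated with an FFT.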

tenns_core/__pycache__/__init__.cpython-310.pyc ADDED

Binary file (1.42 kB).

tenns_core/__pycache__/__init__.cpython-312.pyc ADDED

Binary file (1.65 kB).

tenns_core/__pycache__/activations.cpython-310.pyc ADDED

Binary file (4.51 kB).

tenns_core/__pycache__/activations.cpython-312.pyc ADDED

Binary file (6.34 kB).

tenns_core/__pycache__/fft_ops.cpython-310.pyc ADDED

Binary file (5.04 kB).

tenns_core/__pycache__/fft_ops.cpython-312.pyc ADDED

Binary file (8.45 kB).

tenns_core/__pycache__/inference.cpython-310.pyc ADDED

Binary file (14.7 kB).

tenns_core/__pycache__/inference.cpython-312.pyc ADDED

Binary file (24.4 kB).

tenns_core/__pycache__/recurrent_ops.cpython-310.pyc ADDED

Binary file (16 kB).

tenns_core/__pycache__/recurrent_ops.cpython-312.pyc ADDED

Binary file (14.8 kB).

tenns_core/__pycache__/scan_ops.cpython-310.pyc ADDED

Binary file (4.68 kB).

tenns_core/__pycache__/scan_ops.cpython-312.pyc ADDED

Binary file (20.3 kB).

tenns_core/__pycache__/ssm.cpython-310.pyc ADDED

Binary file (12.6 kB).

tenns_core/__pycache__/ssm.cpython-312.pyc ADDED

Binary file (20.9 kB).

tenns_core/activations.py ADDED

	@@ -0,0 +1,158 @@

+"""
+Activation, normalization, and dropout utilities for SSM layers.
+Extracted from tenns.models.utils to provide activation layer construction.
+"""
+from torch import nn
+from torch.nn import RMSNorm
+class LayerNormFeature(nn.LayerNorm):
+    """LayerNorm that operates on the feature dimension (dim=-2) instead of time (dim=-1)."""
+    def forward(self, input):
+        return super().forward(input.moveaxis(-1, -2)).moveaxis(-1, -2)
+class RmsNormFeature(nn.Module):
+    """RMSNorm that operates on the feature dimension (dim=-2) instead of time (dim=-1)."""
+    def __init__(self, features):
+        super().__init__()
+        self.rms_norm = RMSNorm(features)
+    def forward(self, input):
+        return self.rms_norm(input.moveaxis(-1, -2)).moveaxis(-1, -2)
+def get_norm(norm, num_features, ndim=2):
+    """Get normalization layer by name.
+    Args:
+        norm: Normalization type ('batch', 'layer', 'layer-feature', 'rms', None)
+        num_features: Number of features/channels
+        ndim: Number of dimensions (1, 2, or 3)
+    Returns:
+        Normalization layer module
+    """
+    match norm:
+        case 'batch':
+            match ndim:
+                case 1:
+                    return nn.BatchNorm1d(num_features)
+                case 2:
+                    return nn.BatchNorm2d(num_features)
+                case 3:
+                    return nn.BatchNorm3d(num_features)
+                case _:
+                    raise ValueError(f'Invalid dimensions: {ndim}')
+        case 'layer':
+            return nn.LayerNorm(num_features)
+        case 'layer-feature':
+            if num_features > 1:
+                return LayerNormFeature(num_features)
+            else:
+                return nn.Identity()
+        case 'rms':
+            if num_features > 1:
+                return RmsNormFeature(num_features)
+            else:
+                return nn.Identity()
+        case None:
+            return nn.Identity()
+        case _:
+            raise ValueError(f'Invalid normalization type: {norm}')
+def get_postact(postact):
+    """Get activation function by name.
+    Args:
+        postact: Activation type ('relu', 'gelu', 'silu', etc., or None)
+    Returns:
+        Activation function module
+    """
+    if postact is None:
+        return nn.Identity()
+    postact_registry = {
+        'relu': nn.ReLU(),
+        'relu6': nn.ReLU6(),
+        'lelu': nn.LeakyReLU(0.1),
+        'sigmoid': nn.Sigmoid(),
+        'tanh': nn.Tanh(),
+        'gelu': nn.GELU(),
+        'glu': nn.GLU(dim=1),
+        'silu': nn.SiLU(),
+    }
+    if postact in postact_registry:
+        return postact_registry[postact]
+    else:
+        raise ValueError(f'Invalid activation name: {postact}')
+def get_dropout(p, dropout_dim, num_features):
+    """Get dropout layer by dimension.
+    Args:
+        p: Dropout probability (None for no dropout)
+        dropout_dim: Dimension of dropout (0 for standard, 1 for 1d, etc.)
+        num_features: Number of features (used to determine if dropout should be applied)
+    Returns:
+        Dropout module
+    """
+    if p is None:
+        return nn.Identity()
+    dropout_registry = {
+        0: nn.Dropout,
+        1: nn.Dropout1d,
+        2: nn.Dropout2d,
+        3: nn.Dropout3d,
+    }
+    if dropout_dim in dropout_registry:
+        # Only apply dropout if we have enough features
+        if dropout_dim == 0 or num_features >= 16:
+            return dropout_registry[dropout_dim](p)
+        else:
+            return nn.Identity()
+    else:
+        raise ValueError(f'Invalid dropout dimension: {dropout_dim}')
+def get_activations(ndim, num_features, norm=None, postact=None, p=None, dropout_dim=0):
+    """Build a sequential module with normalization, activation, and dropout.
+    Args:
+        ndim: Number of dimensions (1, 2, or 3)
+        num_features: Number of features/channels
+        norm: Normalization type (None, 'batch', 'layer', 'layer-feature', 'rms')
+        postact: Activation function type (None, 'relu', 'gelu', 'silu', etc.)
+        p: Dropout probability (None for no dropout)
+        dropout_dim: Dimension of dropout (0, 1, 2, or 3)
+    Returns:
+        Sequential module combining norm, activation, and dropout
+    """
+    if (norm is None) and (postact is None) and (p is None):
+        return nn.Identity()
+    activations = nn.Sequential()
+    if norm is not None:
+        activations.append(get_norm(norm, num_features, ndim))
+    if postact is not None:
+        activations.append(get_postact(postact))
+    if p is not None:
+        activations.append(get_dropout(p, dropout_dim, num_features))
+    return activations

tenns_core/fft_ops.py ADDED

	@@ -0,0 +1,174 @@

+"""
+FFT-based convolution operations for SSM layers.
+This module provides optimized FFT convolution operations used in SSM training,
+combining functionality from fft_utils.py and fft_utils_opt.py.
+"""
+import torch
+from torch.amp import custom_bwd, custom_fwd
+class PaddedFFTConv(torch.autograd.Function):
+    """Custom autograd function for padded FFT convolution with efficient gradients.
+    Supports both depthwise ('dw') and full ('full') convolution modes.
+    """
+    @staticmethod
+    @torch.compiler.disable
+    @custom_fwd(device_type='cuda', cast_inputs=torch.float32)
+    def forward(ctx, u, k, n, mode, is_complex=False):
+        """
+        Args:
+            u: Input tensor
+            k: Kernel tensor
+            n: Sequence length
+            mode: 'dw' for depthwise or 'full' for full convolution
+            is_complex: Whether to use complex FFT
+        """
+        if is_complex:
+            uf = torch.fft.fft(u, 2 * n)
+            kf = torch.fft.fft(k, 2 * n)
+        else:
+            uf = torch.fft.rfft(u, 2 * n)
+            kf = torch.fft.rfft(k, 2 * n)
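+        if mode == 'dw':
+            yf = uf * kf
+        elif mode == 'full':
+            yf = torch.einsum('bcl,dcl->bdl', uf, kf)
+        else:
+            # Fail early instead of hitting an undefined `yf` below.
+            raise ValueError(f"Invalid mode: {mode!r} (expected 'dw' or 'full')")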
+        ctx.is_complex = is_complex
+        ctx.mode = mode
+        ctx.n = n
+        ctx.save_for_backward(u, k)
+        if is_complex:
+            return torch.fft.ifft(yf)[..., :n]
+        else:
+            return torch.fft.irfft(yf)[..., :n]
+    @staticmethod
+    @torch.compiler.disable
+    @custom_bwd(device_type='cuda')
+    def backward(ctx, grad_output):
+        is_complex = ctx.is_complex
+        mode = ctx.mode
+        n = ctx.n
+        u, k = ctx.saved_tensors
+        if is_complex:
+            uf = torch.fft.fft(u, 2 * n)
+            kf = torch.fft.fft(k, 2 * n)
+            grad_yf = torch.fft.fft(grad_output, 2 * n)
+        else:
+            uf = torch.fft.rfft(u, 2 * n)
+            kf = torch.fft.rfft(k, 2 * n)
+            grad_yf = torch.fft.rfft(grad_output, 2 * n)
+        if mode == 'dw':
+            grad_uf = grad_yf * torch.conj(kf)
+        elif mode == 'full':
+            grad_uf = torch.einsum('bdl,dcl->bcl', grad_yf, torch.conj(kf))
+        if is_complex:
+            grad_u = torch.fft.ifft(grad_uf, 2 * n)[..., :n]
+        else:
+            grad_u = torch.fft.irfft(grad_uf, 2 * n)[..., :n]
+        if mode == 'dw':
+            grad_kf = torch.einsum('bnl,bnl->nl', grad_yf, torch.conj(uf))
+        elif mode == 'full':
+            grad_kf = torch.einsum('bdl,bcl->dcl', grad_yf, torch.conj(uf))
+        if is_complex:
+            grad_k = torch.fft.ifft(grad_kf, 2 * n)[..., :n]
+        else:
+            grad_k = torch.fft.irfft(grad_kf, 2 * n)[..., :n]
+        return grad_u, grad_k, None, None, None
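
`PaddedFFTConv` transforms both operands at length `2 * n` and keeps the first `n` outputs. The padding matters because a product of length-`n` transforms gives a circular convolution, where late samples wrap around into early outputs. The sketch below demonstrates this with a naive pure-Python DFT as a stand-in for `torch.fft`; the function names are illustrative, and the kernel is assumed to have the same length as the input.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for torch.fft)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * f / n) for m in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def fft_conv(u, k, pad=True):
    """Convolve u with k in the frequency domain, keeping the first len(u) outputs."""
    n = len(u)
    size = 2 * n if pad else n
    uf = dft(list(u) + [0.0] * (size - n))
    kf = dft(list(k) + [0.0] * (size - n))
    y = dft([a * b for a, b in zip(uf, kf)], inverse=True)
    return [v.real for v in y[:n]]


def causal_conv(u, k):
    """Reference: y[t] = sum over l <= t of k[l] * u[t - l]."""
    return [sum(k[l] * u[t - l] for l in range(t + 1)) for t in range(len(u))]


u = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 0.5, 0.25, 0.125]
ref = causal_conv(u, k)
print([round(v, 6) for v in fft_conv(u, k)])             # agrees with ref
print([round(v, 6) for v in fft_conv(u, k, pad=False)])  # wraparound corrupts it
print(ref)
```

Without padding, the first output picks up contributions from the end of the sequence, which would break causality.
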
+def _K(dtA_real, dtA_imag, length, weight=None, dim=-2, complex_proj=False, l_shift=0):
+    """Generate SSM convolution kernel from discretized state matrix.
+    Args:
+        dtA_real: Real part of discretized state matrix diagonal
+        dtA_imag: Imaginary part of discretized state matrix diagonal
+        length: Sequence length
+        weight: Optional weight matrix to apply
+        dim: Dimension to reduce over if weight is provided
+        complex_proj: Whether to use complex projection
+        l_shift: Shift amount for the range
+    Returns:
+        SSM convolution kernel of shape (..., length)
+    """
+    device = dtA_real.device
+    lrange = torch.arange(l_shift, length + l_shift, device=device)
+    with torch.autocast('cuda', enabled=False):
+        dtA_real, dtA_imag = dtA_real.float(), dtA_imag.float()
+        if complex_proj:
+            K = (torch.complex(dtA_real, dtA_imag)[..., None] * lrange).exp()
+        else:
+            K = (dtA_real[..., None] * lrange).exp() * torch.cos(dtA_imag[..., None] * lrange)
+        if weight is not None:
+            return (K * weight[..., None]).sum(dim)
+        else:
+            return K
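
The two branches of `_K` are closely related: the real branch, `exp(a_real * l) * cos(a_imag * l)`, is the real part of the complex branch, `exp((a_real + i * a_imag) * l)`. A short numerical check in plain Python, with illustrative function names and arbitrary parameter values:

```python
import cmath
import math


def kernel_cos(a_real, a_imag, length, l_shift=0):
    """Real form (complex_proj=False branch): exp(a_real * l) * cos(a_imag * l)."""
    return [math.exp(a_real * l) * math.cos(a_imag * l)
            for l in range(l_shift, length + l_shift)]


def kernel_complex(a_real, a_imag, length, l_shift=0):
    """Complex form (complex_proj=True branch): exp((a_real + i * a_imag) * l)."""
    return [cmath.exp(complex(a_real, a_imag) * l)
            for l in range(l_shift, length + l_shift)]


real_form = kernel_cos(-0.1, 0.5, 8)
complex_form = kernel_complex(-0.1, 0.5, 8)
print(all(math.isclose(r, z.real, abs_tol=1e-12)
          for r, z in zip(real_form, complex_form)))
```

A negative real part makes the kernel decay with `l`, while the imaginary part sets its oscillation frequency.
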
+def _full_k(dtA_real, dtA_imag, B, C, E, length):
+    """Generate full SSM kernel by combining B, C, and state kernel.
+    Used for optimizing s5/neck mode when full kernel is more efficient.
+    """
+    K = _K(dtA_real, dtA_imag, length, weight=E)
+    return (B[..., None] * C[..., None, None] * K[:, None, :]).sum(1)
+def padded_fft_conv_opt(input, dtA_real, dtA_imag, B, C, E):
+    """Optimized padded FFT convolution for SSM layers.
+    Automatically chooses between naive and optimized contraction based on
+    tensor shapes to minimize computation.
+    Args:
+        input: Input tensor of shape (batch, in_channels, length)
+        dtA_real: Real part of discretized A matrix
+        dtA_imag: Imaginary part of discretized A matrix
+        B: Input projection matrix (None for dws/full modes)
+        C: Output projection matrix (None for dws/full modes)
+        E: State projection matrix (None for s5/neck modes)
+    Returns:
+        Output tensor of shape (batch, out_channels, length)
+    """
+    batch, chin, length = input.shape
+    # DWS/Full mode: no B/C matrices
+    if B is None:
+        K = _K(dtA_real, dtA_imag, length, weight=E)
+        if K.ndim == 3:
+            return PaddedFFTConv.apply(input, K, length, 'full', False)
+        elif K.ndim == 2:
+            return PaddedFFTConv.apply(input, K, length, 'dw', False)
+    # S5/Neck mode: has B/C matrices
+    chout, coeffs = C.shape
+    # Choose contraction order based on efficiency
+    # Compare cost of: (1) fusing B,C,K vs (2) separate contractions
+    if (1 / chin + 1 / chout) > (1 / batch + 1 / coeffs):
+        # Fuse full kernel and apply single convolution
+        kernel = _full_k(dtA_real, dtA_imag, B, C, E, length)
+        return PaddedFFTConv.apply(input, kernel, length, 'full', False)
+    else:
+        # Separate: project input, convolve, then project output
+        K = _K(dtA_real, dtA_imag, length, weight=E)
+        x = torch.einsum('bcl,nc->bnl', input, B)
+        x = PaddedFFTConv.apply(x, K, length, 'dw', False)
+        return torch.einsum('bnl,dn->bdl', x, C)
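
The branch condition `(1 / chin + 1 / chout) > (1 / batch + 1 / coeffs)` can be read as a comparison of multiply counts. The cost model below is an assumption for illustration: it counts only the dense contractions in each path and ignores the FFTs and the depthwise product. Under that model, dividing both costs by `batch * chin * chout * coeffs * length` gives exactly the rule in the code.

```python
from itertools import product


def fused_cost(batch, chin, chout, coeffs, length):
    # Build a (chout, chin, length) kernel, then apply one full convolution.
    return chin * chout * coeffs * length + batch * chin * chout * length


def separate_cost(batch, chin, chout, coeffs, length):
    # Project in (chin -> coeffs), depthwise conv, project out (coeffs -> chout).
    return batch * chin * coeffs * length + batch * coeffs * chout * length


def prefers_fused(batch, chin, chout, coeffs):
    """The rule used in padded_fft_conv_opt."""
    return (1 / chin + 1 / chout) > (1 / batch + 1 / coeffs)


# Powers of two keep the reciprocal sums exact in floating point.
sizes = [1, 2, 4, 8, 16, 32, 64]
agree = all(
    prefers_fused(b, ci, co, n) == (fused_cost(b, ci, co, n, 128) < separate_cost(b, ci, co, n, 128))
    for b, ci, co, n in product(sizes, repeat=4)
)
print(agree)
```

Intuitively, fusing pays off when channel counts are small relative to the batch size and the number of state coefficients.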

tenns_core/inference.py ADDED

	@@ -0,0 +1,540 @@

+"""
+Inference mode for SSM layers.
+Provides streaming/online inference with stateful processing for real-time applications.
+"""
+import torch
+from torch import nn
+from .recurrent_ops import (
+    discretize_dws,
+    discretize_full,
+    discretize_neck,
+    discretize_s5,
+    recurrent_gate,
+    recurrent_gate_single_step,
+    step_dws,
+    step_full,
+    step_neck,
+    step_s5,
+)
+class SSMLayerInference(nn.Module):
+    """Streaming inference wrapper for SSMLayer.
+    Provides stateful recurrent inference for real-time applications.
+    Maintains internal state across chunks for low-latency streaming.
+    Discretization (Ad, B_hat, etc.) is precomputed once at construction time
+    from the raw SSM parameters, so only the per-timestep step function runs
+    during forward passes.
+    Example:
+        >>> # After training
+        >>> train_layer = SSMLayer(64, 128, 256, mode='s5')
+        >>> # ... training ...
+        >>>
+        >>> # Convert to inference mode
+        >>> infer_layer = SSMLayerInference.from_training(train_layer)
+        >>>
+        >>> # Process streaming chunks (state maintained automatically)
+        >>> for chunk in audio_stream:
+        >>>     output = infer_layer(chunk)
+        >>>
+        >>> # Reset state when starting new utterance
+        >>> infer_layer.reset_state()
+    Note:
+        - Inference mode uses sequential scan (O(T) per chunk)
+        - Training mode uses FFT (O(T log T) for full sequence)
+        - For streaming, inference mode has lower latency
+        - For batch processing full sequences, training mode is faster
+    """
+    def __init__(self, mode, in_channels, out_channels, **kwargs):
+        """Initialize inference layer.
+        Args:
+            mode: SSM mode ('s5', 'dws', 'neck', 'full', 'gate')
+            in_channels: Number of input channels
+            out_channels: Number of output channels
+            **kwargs: Mode-specific parameters (Ad, B_hat, C, dt, B, E, A, log_dt, mixer, etc.)
+        """
+        super().__init__()
+        self.mode = mode
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        if mode == 's5':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('B_hat', kwargs['B_hat'])
+            self.register_buffer('C', kwargs['C'])
+        elif mode == 'dws':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('B_hat', kwargs['B_hat'])
+        elif mode == 'neck':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('dt', kwargs['dt'])
+            self.register_buffer('B', kwargs['B'])
+            self.register_buffer('C', kwargs['C'])
+            self.register_buffer('E', kwargs['E'])
+        elif mode == 'full':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('B_hat', kwargs['B_hat'])
+        elif mode == 'gate':
+            # Gate mode: input-dependent discretization, store raw params
+            self.register_buffer('A', kwargs['A'])
+            self.B = kwargs['B']  # nn.Module
+            self.C = kwargs['C']  # nn.Module
+            self.log_dt = kwargs['log_dt']  # nn.Module
+        else:
+            raise ValueError(f'Unknown mode: {mode}')
+        # Mixer module (for DWS mode to project channels)
+        self.mixer = kwargs.get('mixer') or nn.Identity()
+        # Internal state
+        self.state = None
+    @classmethod
+    def from_training(cls, ssm_layer):
+        """Create inference layer from trained SSMLayer.
+        Args:
+            ssm_layer: Trained SSMLayer instance
+        Returns:
+            SSMLayerInference instance with precomputed discretized weights
+        Example:
+            >>> train_layer = SSMLayer(64, 128, 256, mode='s5')
+            >>> infer_layer = SSMLayerInference.from_training(train_layer)
+        """
+        mode = ssm_layer.mode
+        kwargs = {}
+        if mode == 's5':
+            Ad, B_hat = discretize_s5(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.B.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['B_hat'] = B_hat
+            kwargs['C'] = ssm_layer.C.detach().clone()
+        elif mode == 'dws':
+            Ad, B_hat = discretize_dws(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.E.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['B_hat'] = B_hat
+            kwargs['mixer'] = ssm_layer.mixer
+        elif mode == 'neck':
+            Ad, dt = discretize_neck(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['dt'] = dt
+            kwargs['B'] = ssm_layer.B.detach().clone()
+            kwargs['C'] = ssm_layer.C.detach().clone()
+            kwargs['E'] = ssm_layer.E.detach().clone()
+        elif mode == 'full':
+            Ad, B_hat = discretize_full(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.E.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['B_hat'] = B_hat
+        elif mode == 'gate':
+            kwargs['A'] = ssm_layer.A.detach().clone()
+            kwargs['B'] = ssm_layer.B
+            kwargs['C'] = ssm_layer.C
+            kwargs['log_dt'] = ssm_layer.log_dt
+            kwargs['mixer'] = ssm_layer.mixer
+        else:
+            raise ValueError(f'Unknown mode: {mode}')
+        return cls(
+            mode=mode,
+            in_channels=ssm_layer.in_channels,
+            out_channels=ssm_layer.out_channels,
+            **kwargs,
+        )
+    def forward(self, input, return_state=False):
+        """Forward pass with stateful processing.
+        Args:
+            input: Input tensor of shape (B, C, T) or (C, T) for single sample
+            return_state: If True, return (output, state) tuple
+        Returns:
+            output: Output tensor of shape (B, D, T) or (D, T)
+            state (optional): Internal state if return_state=True
+        Note:
+            State is maintained internally across calls. Use reset_state()
+            to clear it.
+        """
+        # Handle input format
+        squeeze_batch = False
+        if input.dim() == 2:
+            input = input.unsqueeze(0)  # (C, T) -> (1, C, T)
+            squeeze_batch = True
+        B_batch, _C, T = input.shape
+        # Transpose to (B, T, C) for step functions
+        input = input.transpose(1, 2)
+        if self.mode == 'gate':
+            output, self.state = recurrent_gate(
+                input, self.A, self.B, self.C, self.log_dt, self.state
+            )
+        else:
+            # Non-gate modes: loop over timesteps with precomputed discretization
+            # Snapshot the incoming state so per-sample lookups are not affected
+            # by the state updates made while iterating over the batch.
+            prev_state = self.state
+            batched_state = (
+                prev_state is not None
+                and prev_state.dim() > len(self._state_shape())
+            )
+            outputs = []
+            new_states = []
+            for b in range(B_batch):
+                batch_outputs = []
+                # Per-sample state, a shared unbatched state, or None (zero init)
+                x = prev_state[b] if batched_state else prev_state
+                for t in range(T):
+                    u_t = input[b, t]  # (C_in,)
+                    y_t, x = self._step(u_t, x)
+                    batch_outputs.append(y_t)
+                new_states.append(x)
+                outputs.append(torch.stack(batch_outputs, dim=1))  # (D, T)
+            # Update state: batched (B, ...) when B > 1, unbatched otherwise
+            self.state = torch.stack(new_states, dim=0) if B_batch > 1 else new_states[0]
+            output = torch.stack(outputs, dim=0)  # (B, D, T)
+        # Apply mixer (important for DWS mode which projects channels)
+        output = self.mixer(output)
+        if squeeze_batch:
+            output = output.squeeze(0)
+            if self.state is not None and self.state.dim() > len(self._state_shape()):
+                self.state = self.state.squeeze(0)
+        if return_state:
+            return output, self.state
+        return output
+    def _step(self, u, state):
+        """Dispatch to mode-specific step function."""
+        if self.mode == 's5':
+            return step_s5(u, self.Ad, self.B_hat, self.C, state)
+        elif self.mode == 'dws':
+            return step_dws(u, self.Ad, self.B_hat, state)
+        elif self.mode == 'neck':
+            return step_neck(u, self.Ad, self.dt, self.B, self.C, self.E, state)
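+        elif self.mode == 'full':
+            return step_full(u, self.Ad, self.B_hat, state)
+        else:
+            # Gate mode is handled by recurrent_gate in forward(), not here
+            raise ValueError(f'No step function for mode: {self.mode}')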
+    def _state_shape(self):
+        """Return expected unbatched state shape for current mode."""
+        if self.mode == 's5':
+            return self.Ad.shape  # (N, 2)
+        elif self.mode == 'dws':
+            return self.Ad.shape  # (C, N, 2)
+        elif self.mode == 'neck':
+            return self.Ad.shape  # (R, N, 2)
+        elif self.mode == 'full':
+            return self.Ad.shape  # (D, C, N, 2)
+        elif self.mode == 'gate':
+            return (self.A.shape[0],)  # (N,)
+    def reset_state(self):
+        """Reset internal state.
+        Call this when starting a new sequence.
+        Example:
+            >>> for utterance in utterances:
+            >>>     infer_layer.reset_state()  # Clear state
+            >>>     for chunk in utterance:
+            >>>         output = infer_layer(chunk)
+        """
+        # Drop the state entirely; the step functions re-initialize it to zeros
+        # on the next call, which also handles a change in batch size.
+        self.state = None
+    def get_state(self):
+        """Get current internal state for checkpointing or branching.
+        Returns a clone of the state to prevent accidental mutations.
+        Useful for beam search, hypothesis tracking, or state snapshots.
+        Returns:
+            state: Cloned state tensor or None if no state exists
+        Example:
+            >>> # Save state for beam search
+            >>> saved_state = infer_layer.get_state()
+            >>> # Process hypothesis 1
+            >>> output1 = infer_layer(chunk1)
+            >>> # Restore and try hypothesis 2
+            >>> infer_layer.set_state(saved_state)
+            >>> output2 = infer_layer(chunk2)
+        """
+        return self.state.clone() if self.state is not None else None
+    def set_state(self, state):
+        """Restore internal state from checkpoint.
+        Sets the state to a clone of the provided tensor to prevent
+        accidental mutations. Useful for restoring checkpoints or
+        branching hypotheses in beam search.
+        Args:
+            state: State tensor (shape depends on mode) or None to reset
+        Example:
+            >>> # Checkpoint state before branching
+            >>> checkpoint = infer_layer.get_state()
+            >>> # ... process some data ...
+            >>> # Restore to checkpoint
+            >>> infer_layer.set_state(checkpoint)
+        """
+        self.state = state.clone() if state is not None else None
+    def __repr__(self):
+        return (
+            f'SSMLayerInference(mode={self.mode}, '
+            f'in_channels={self.in_channels}, '
+            f'out_channels={self.out_channels}, '
+            f'state={"active" if self.state is not None else "reset"})'
+        )
+class SSMLayerExportable(nn.Module):
+    """Single-timestep exportable SSM layer for ONNX export (B=1, T=1).
+    This class processes one timestep at a time with explicit state input/output,
+    enabling export to ONNX by eliminating dynamic control flow and
+    complex number dtypes.
+    Discretization is precomputed at construction time, so the forward pass
+    only runs the step function.
+    Currently supports S5, DWS, Neck, Full, and Gate modes. For all modes except
+    Gate, state is represented as real tensors (..., 2) where [..., 0] is the
+    real part and [..., 1] is the imaginary part. Gate state is real-valued (N,).
+    Example:
+        >>> # After training
+        >>> train_layer = SSMLayer(num_coeffs=64, in_channels=32, out_channels=32, mode='s5')
+        >>> # ... training ...
+        >>>
+        >>> # Convert to exportable inference mode
+        >>> export_layer = SSMLayerExportable.from_training(train_layer)
+        >>>
+        >>> # Export to ONNX
+        >>> dummy_input = torch.randn(32)
+        >>> torch.onnx.export(export_layer, (dummy_input, None), "model.onnx")
+        >>>
+        >>> # Use in streaming application (external loop)
+        >>> state = None
+        >>> for t in range(audio_length):
+        >>>     output, state = export_layer(audio[t], state)
+    Note:
+        - Processes single sample (B=1), single timestep (T=1) per call
+        - State is automatically initialized to zeros if None
+        - Loop over time must be external to the model
+        - Complex numbers represented as (..., 2) real tensors
+    """
+    def __init__(self, mode, in_channels, out_channels, **kwargs):
+        """Initialize exportable SSM layer.
+        Args:
+            mode: SSM mode ('s5', 'dws', 'neck', 'full', 'gate')
+            in_channels: Number of input channels
+            out_channels: Number of output channels
+            **kwargs: Mode-specific discretized parameters
+        """
+        super().__init__()
+        self.mode = mode
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        if mode == 's5':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('B_hat', kwargs['B_hat'])
+            self.register_buffer('C', kwargs['C'])
+        elif mode == 'dws':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('B_hat', kwargs['B_hat'])
+        elif mode == 'neck':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('dt', kwargs['dt'])
+            self.register_buffer('B', kwargs['B'])
+            self.register_buffer('C', kwargs['C'])
+            self.register_buffer('E', kwargs['E'])
+        elif mode == 'full':
+            self.register_buffer('Ad', kwargs['Ad'])
+            self.register_buffer('B_hat', kwargs['B_hat'])
+        elif mode == 'gate':
+            self.register_buffer('A', kwargs['A'])
+            self.B = kwargs['B']  # nn.Module
+            self.C = kwargs['C']  # nn.Module
+            self.log_dt = kwargs['log_dt']  # nn.Module
+        else:
+            raise ValueError(f'Unknown mode: {mode}')
+        # Mixer module (for DWS mode to project channels)
+        self.mixer = kwargs.get('mixer') or nn.Identity()
+    @classmethod
+    def from_training(cls, ssm_layer):
+        """Create exportable layer from trained SSMLayer.
+        Args:
+            ssm_layer: Trained SSMLayer instance
+        Returns:
+            SSMLayerExportable instance with precomputed discretized weights
+        Raises:
+            ValueError: If ssm_layer.mode is not supported
+        Example:
+            >>> train_layer = SSMLayer(num_coeffs=64, in_channels=32, out_channels=32, mode='s5')
+            >>> export_layer = SSMLayerExportable.from_training(train_layer)
+        """
+        mode = ssm_layer.mode
+        kwargs = {}
+        if mode == 's5':
+            Ad, B_hat = discretize_s5(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.B.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['B_hat'] = B_hat
+            kwargs['C'] = ssm_layer.C.detach().clone()
+        elif mode == 'dws':
+            Ad, B_hat = discretize_dws(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.E.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['B_hat'] = B_hat
+            kwargs['mixer'] = ssm_layer.mixer
+        elif mode == 'neck':
+            Ad, dt = discretize_neck(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['dt'] = dt
+            kwargs['B'] = ssm_layer.B.detach().clone()
+            kwargs['C'] = ssm_layer.C.detach().clone()
+            kwargs['E'] = ssm_layer.E.detach().clone()
+        elif mode == 'full':
+            Ad, B_hat = discretize_full(
+                ssm_layer.A.detach().clone(),
+                ssm_layer.E.detach().clone(),
+                ssm_layer.log_dt.detach().clone(),
+            )
+            kwargs['Ad'] = Ad
+            kwargs['B_hat'] = B_hat
+        elif mode == 'gate':
+            kwargs['A'] = ssm_layer.A.detach().clone()
+            kwargs['B'] = ssm_layer.B
+            kwargs['C'] = ssm_layer.C
+            kwargs['log_dt'] = ssm_layer.log_dt
+        else:
+            raise ValueError(
+                f'SSMLayerExportable only supports S5, DWS, Neck, Full, and Gate modes, got {mode}'
+            )
+        return cls(
+            mode=mode,
+            in_channels=ssm_layer.in_channels,
+            out_channels=ssm_layer.out_channels,
+            **kwargs,
+        )
+    def forward(self, input, state=None):
+        """Forward pass for single timestep.
+        Args:
+            input: Input tensor of shape (C_in,) - single sample, single timestep
+            state: Optional state tensor - shape depends on mode:
+                   - S5: (N, 2) real representation
+                   - DWS: (C, N, 2) real representation
+                   - Neck: (R, N, 2) real representation
+                   - Full: (D, C, N, 2) real representation
+                   - Gate: (N,) real-valued
+                   If None, initializes to zeros internally
+        Returns:
+            output: Output tensor of shape (D,)
+            new_state: Updated state - same shape as state input
+        Example:
+            >>> export_layer = SSMLayerExportable.from_training(trained_layer)
+            >>> x = torch.randn(32)  # Single timestep input
+            >>> y, state = export_layer(x, None)  # First call, state=None
+            >>> y2, state = export_layer(x2, state)  # Subsequent call with state
+        """
+        if self.mode == 's5':
+            output, new_state = step_s5(input, self.Ad, self.B_hat, self.C, state)
+        elif self.mode == 'dws':
+            output, new_state = step_dws(input, self.Ad, self.B_hat, state)
+            # Apply mixer for DWS mode (channel projection)
+            # Mixer expects (B, C, T) format, we have (C,) single timestep
+            output = (
+                self.mixer(output.unsqueeze(0).unsqueeze(-1)).squeeze(0).squeeze(-1)
+            )  # (C,) -> (1, C, 1) -> (1, D, 1) -> (D,)
+        elif self.mode == 'neck':
+            output, new_state = step_neck(input, self.Ad, self.dt, self.B, self.C, self.E, state)
+        elif self.mode == 'full':
+            output, new_state = step_full(input, self.Ad, self.B_hat, state)
+        elif self.mode == 'gate':
+            # Initialize state if None
+            if state is None:
+                N = self.A.shape[0]
+                state = torch.zeros(N, dtype=torch.float32, device=input.device)
+            output, new_state = recurrent_gate_single_step(
+                input, self.A, self.B, self.C, self.log_dt, state
+            )
+        else:
+            raise ValueError(f'Unsupported mode: {self.mode}')
+        return output, new_state
+    def __repr__(self):
+        return (
+            f'SSMLayerExportable(mode={self.mode}, '
+            f'in_channels={self.in_channels}, '
+            f'out_channels={self.out_channels})'
+        )
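
Note on the streaming contract above: the inference classes rely on the fact that carrying the recurrent state across chunks reproduces a single pass over the whole sequence. The following is a minimal sketch of that property in dependency-free Python, assuming an S5-like step with one input channel. It uses Python's built-in `complex` type instead of the `(..., 2)` real representation for brevity. The names `step` and `run_chunk` and all parameter values are hypothetical.

```python
import cmath
import math


def step(u_t, Ad, B_hat, C, state):
    """One recurrent step: x <- Ad * x + B_hat * u_t, y = sum_n C[n] * Re(x[n])."""
    if state is None:
        state = [0j] * len(Ad)  # zero init, as the step_* functions do for state=None
    new_state = [a * x + b * u_t for a, x, b in zip(Ad, state, B_hat)]
    y = sum(c * x.real for c, x in zip(C, new_state))
    return y, new_state


def run_chunk(chunk, params, state):
    """Process a chunk timestep by timestep, returning outputs and the final state."""
    outputs = []
    for u_t in chunk:
        y, state = step(u_t, *params, state)
        outputs.append(y)
    return outputs, state


# Illustrative parameters: two complex poles with negative real part, so |Ad| < 1
Ad = [cmath.exp(complex(-0.1, 0.5)), cmath.exp(complex(-0.3, -1.2))]
B_hat = [0.5, 0.25]
C = [1.0, -2.0]
params = (Ad, B_hat, C)

signal = [math.sin(0.7 * t) for t in range(10)]

# Single pass over the whole sequence
full_pass, _ = run_chunk(signal, params, None)

# Same sequence in three chunks, carrying the state between calls
streamed, state = [], None
for chunk in (signal[0:3], signal[3:6], signal[6:10]):
    out, state = run_chunk(chunk, params, state)
    streamed.extend(out)

# Dropping the state between chunks (a reset) changes the result
reset_each_chunk = []
for chunk in (signal[0:3], signal[3:6], signal[6:10]):
    out, _ = run_chunk(chunk, params, None)
    reset_each_chunk.extend(out)

max_diff = max(abs(a - b) for a, b in zip(full_pass, streamed))
print(f"chunked vs full pass, max abs difference: {max_diff:.2e}")
```

This is why `reset_state()` should only be called at sequence boundaries: resetting mid-sequence discards the history that later outputs depend on.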

tenns_core/recurrent_ops.py ADDED

	@@ -0,0 +1,437 @@

+"""
+Recurrent operations for streaming SSM inference.
+Provides discretize_* functions (called once at init) and step_* functions
+(called per timestep) for each SSM mode, enabling low-latency streaming
+inference by maintaining state across chunks.
+Gate mode is special: its discretization is input-dependent, so it keeps
+combined recurrent_gate / recurrent_gate_single_step functions.
+"""
+import torch
+import torch.nn.functional as F
+# ============================================================================
+# Complex arithmetic helpers for real representation (ONNX compat)
+# ============================================================================
+def complex_mul_real(a, b):
+    """Multiply two complex numbers in real representation (..., 2).
+    Args:
+        a: Complex tensor as real representation (..., 2) where [..., 0] is real, [..., 1] is imag
+        b: Complex tensor as real representation (..., 2)
+    Returns:
+        Complex product as real representation (..., 2)
+        Formula: (a_r + i*a_i) * (b_r + i*b_i) = (a_r*b_r - a_i*b_i) + i*(a_r*b_i + a_i*b_r)
+    """
+    a_real = a[..., 0]
+    a_imag = a[..., 1]
+    b_real = b[..., 0]
+    b_imag = b[..., 1]
+    result_real = a_real * b_real - a_imag * b_imag
+    result_imag = a_real * b_imag + a_imag * b_real
+    return torch.stack([result_real, result_imag], dim=-1)
+# ============================================================================
+# S5 mode
+# ============================================================================
+def discretize_s5(A, B, log_dt):
+    """Precompute discretized parameters for S5 mode.
+    Args:
+        A: State transition parameter of shape (N, 2) - real repr of complex
+        B: Input projection of shape (N, C_in)
+        log_dt: Time step of shape (N,)
+    Returns:
+        Ad: Discretized state transition of shape (N, 2)
+        B_hat: Discretized input projection of shape (N, C_in)
+    """
+    A_real = -F.softplus(A[:, 0])  # (N,)
+    A_imag = A[:, 1]  # (N,)
+    dt = torch.exp(log_dt)  # (N,)
+    scaled_real = dt * A_real
+    scaled_imag = dt * A_imag
+    exp_scaled_real = torch.exp(scaled_real)
+    Ad = torch.stack(
+        [
+            exp_scaled_real * torch.cos(scaled_imag),
+            exp_scaled_real * torch.sin(scaled_imag),
+        ],
+        dim=-1,
+    )  # (N, 2)
+    B_hat = dt[:, None] * B  # (N, C_in)
+    return Ad, B_hat
+def step_s5(u, Ad, B_hat, C, state):
+    """Single timestep for S5 mode using pre-discretized parameters.
+    Args:
+        u: Input tensor of shape (C_in,)
+        Ad: Discretized state transition of shape (N, 2)
+        B_hat: Discretized input projection of shape (N, C_in)
+        C: Output projection of shape (D, N)
+        state: Previous state of shape (N, 2), or None for zero init
+    Returns:
+        y: Output tensor of shape (D,)
+        new_state: Updated state of shape (N, 2)
+    """
+    if state is None:
+        N = Ad.shape[0]
+        state = torch.zeros((N, 2), dtype=torch.float32, device=u.device)
+    # State update: x = Ad * x + B_hat @ u
+    x_new = complex_mul_real(Ad, state)  # (N, 2)
+    Bu = B_hat @ u  # (N,)
+    x_new[..., 0] = x_new[..., 0] + Bu
+    # Output: y = C @ real(x)
+    y = C @ x_new[..., 0]  # (D,)
+    return y, x_new
+# ============================================================================
+# DWS mode
+# ============================================================================
+def discretize_dws(A, E, log_dt):
+    """Precompute discretized parameters for DWS mode.
+    Args:
+        A: State parameter of shape (C, N, 2) - real repr of complex
+        E: Weight matrix of shape (C, N)
+        log_dt: Time step of shape (C, N)
+    Returns:
+        Ad: Discretized state transition of shape (C, N, 2)
+        B_hat: Discretized input projection of shape (C, N)
+    """
+    A_real = -F.softplus(A[..., 0])  # (C, N)
+    A_imag = A[..., 1]  # (C, N)
+    dt = torch.exp(log_dt)  # (C, N)
+    scaled_real = dt * A_real
+    scaled_imag = dt * A_imag
+    exp_scaled_real = torch.exp(scaled_real)
+    Ad = torch.stack(
+        [
+            exp_scaled_real * torch.cos(scaled_imag),
+            exp_scaled_real * torch.sin(scaled_imag),
+        ],
+        dim=-1,
+    )  # (C, N, 2)
+    B_hat = E * dt  # (C, N)
+    return Ad, B_hat
+def step_dws(u, Ad, B_hat, state):
+    """Single timestep for DWS mode using pre-discretized parameters.
+    Args:
+        u: Input tensor of shape (C,)
+        Ad: Discretized state transition of shape (C, N, 2)
+        B_hat: Discretized input projection of shape (C, N)
+        state: Previous state of shape (C, N, 2), or None for zero init
+    Returns:
+        y: Output tensor of shape (C,)
+        new_state: Updated state of shape (C, N, 2)
+    """
+    if state is None:
+        C, N = B_hat.shape
+        state = torch.zeros((C, N, 2), dtype=torch.float32, device=u.device)
+    # State update: x = Ad * x + B_hat * u
+    x_new = complex_mul_real(Ad, state)  # (C, N, 2)
+    Bu = B_hat * u.unsqueeze(1)  # (C, N)
+    x_new[..., 0] = x_new[..., 0] + Bu
+    # Output: y = sum(real(x), dim=1)
+    y = torch.sum(x_new[..., 0], dim=1)  # (C,)
+    return y, x_new
+# ============================================================================
+# Neck mode
+# ============================================================================
+def discretize_neck(A, log_dt):
+    """Precompute discretized parameters for Neck mode.
+    Args:
+        A: State transition parameter of shape (R, N, 2) - real repr of complex
+        log_dt: Time step of shape (R,)
+    Returns:
+        Ad: Discretized state transition of shape (R, N, 2)
+        dt: Discretized time step of shape (R, 1) - needed for input scaling
+    """
+    A_real = -F.softplus(A[..., 0])  # (R, N)
+    A_imag = A[..., 1]  # (R, N)
+    dt = torch.exp(log_dt).reshape(-1, 1)  # (R, 1)
+    scaled_real = dt * A_real
+    scaled_imag = dt * A_imag
+    exp_scaled_real = torch.exp(scaled_real)
+    Ad = torch.stack(
+        [
+            exp_scaled_real * torch.cos(scaled_imag),
+            exp_scaled_real * torch.sin(scaled_imag),
+        ],
+        dim=-1,
+    )  # (R, N, 2)
+    return Ad, dt
+def step_neck(u, Ad, dt, B, C, E, state):
+    """Single timestep for Neck mode using pre-discretized parameters.
+    Args:
+        u: Input tensor of shape (C_in,)
+        Ad: Discretized state transition of shape (R, N, 2)
+        dt: Discretized time step of shape (R, 1)
+        B: Input projection of shape (R, C_in)
+        C: Output projection of shape (D, R)
+        E: State mixing matrix of shape (R, N)
+        state: Previous state of shape (R, N, 2), or None for zero init
+    Returns:
+        y: Output tensor of shape (D,)
+        new_state: Updated state of shape (R, N, 2)
+    """
+    if state is None:
+        R, N = Ad.shape[0], Ad.shape[1]
+        state = torch.zeros((R, N, 2), dtype=torch.float32, device=u.device)
+    # Input projection: v = dt * B @ u
+    v = dt.squeeze(1) * (B @ u)  # (R,)
+    # State update: x = Ad * x + v
+    x_new = complex_mul_real(Ad, state)  # (R, N, 2)
+    x_new[..., 0] = x_new[..., 0] + v.unsqueeze(1)
+    # Output: z = real((x * E).sum(N)), y = C @ z
+    E_cplx = torch.stack([E, torch.zeros_like(E)], dim=-1)  # (R, N, 2)
+    z = torch.sum(complex_mul_real(x_new, E_cplx)[..., 0], dim=1)  # (R,)
+    y = C @ z  # (D,)
+    return y, x_new
+# ============================================================================
+# Full mode
+# ============================================================================
+def discretize_full(A, E, log_dt):
+    """Precompute discretized parameters for Full mode.
+    Args:
+        A: State parameter of shape (D, C, N, 2) - real repr of complex
+        E: Weight matrix of shape (D, C, N)
+        log_dt: Time step of shape (D, N)
+    Returns:
+        Ad: Discretized state transition of shape (D, C, N, 2)
+        B_hat: Discretized input projection of shape (D, C, N)
+    """
+    A_real = -F.softplus(A[..., 0])  # (D, C, N)
+    A_imag = A[..., 1]  # (D, C, N)
+    dt = torch.exp(log_dt)  # (D, N)
+    dt_exp = dt[:, None, :]  # (D, 1, N)
+    scaled_real = dt_exp * A_real
+    scaled_imag = dt_exp * A_imag
+    exp_scaled_real = torch.exp(scaled_real)
+    Ad = torch.stack(
+        [
+            exp_scaled_real * torch.cos(scaled_imag),
+            exp_scaled_real * torch.sin(scaled_imag),
+        ],
+        dim=-1,
+    )  # (D, C, N, 2)
+    B_hat = E * dt_exp  # (D, C, N)
+    return Ad, B_hat
+def step_full(u, Ad, B_hat, state):
+    """Single timestep for Full mode using pre-discretized parameters.
+    Args:
+        u: Input tensor of shape (C,)
+        Ad: Discretized state transition of shape (D, C, N, 2)
+        B_hat: Discretized input projection of shape (D, C, N)
+        state: Previous state of shape (D, C, N, 2), or None for zero init
+    Returns:
+        y: Output tensor of shape (D,)
+        new_state: Updated state of shape (D, C, N, 2)
+    """
+    if state is None:
+        D, C, N = B_hat.shape
+        state = torch.zeros((D, C, N, 2), dtype=torch.float32, device=u.device)
+    # State update: x = Ad * x + B_hat * u
+    x_new = complex_mul_real(Ad, state)  # (D, C, N, 2)
+    u_broadcast = u.unsqueeze(0).unsqueeze(2)  # (1, C, 1)
+    Bu = B_hat * u_broadcast  # (D, C, N)
+    x_new[..., 0] = x_new[..., 0] + Bu
+    # Output: y = sum(real(x), dim=(1, 2))
+    y = torch.sum(x_new[..., 0], dim=(1, 2))  # (D,)
+    return y, x_new
+# ============================================================================
+# Gate mode (input-dependent discretization — cannot precompute)
+# ============================================================================
+def recurrent_gate_single_step(u, A, B_proj, C_proj, log_dt_proj, state):
+    """
+    Gate-style SSM single timestep for ONNX export.
+    Processes single timestep with input-dependent parameters.
+    Unlike other modes, gate uses neural network projections for B, C, and dt.
+    Args:
+        u: Input tensor of shape (C_in,) - single timestep, single batch
+        A: State transition parameter of shape (N,) - in log space, represents decay rates
+        B_proj: nn.Module that projects (C_in,) -> (N,)
+        C_proj: nn.Module that projects (N,) -> (D,)
+        log_dt_proj: nn.Module that projects (C_in,) -> (N,)
+        state: Previous state of shape (N,) - real-valued state
+    Returns:
+        y: Output tensor of shape (D,)
+        new_state: Updated state of shape (N,)
+    State update formula:
+        log_dt = log_dt_proj(u)
+        dt = softplus(log_dt)
+        u_proj = B_proj(u)
+        dta = exp(-dt * exp(A))  # discretized decay
+        x_new = dta * x_old + dt * u_proj
+        y = C_proj(x_new)
+    """
+    # Get input-dependent projections
+    u_proj = B_proj(u)  # (N,)
+    log_dt = log_dt_proj(u)  # (N,)
+    # Discretization
+    dt = F.softplus(log_dt)  # (N,)
+    exp_A = torch.exp(A)  # (N,) - decay rate
+    dta = torch.exp(-dt * exp_A)  # (N,) - discretized decay factor
+    # State update: x_new = dta * x_old + dt * u_proj
+    u_dt = u_proj * dt  # (N,)
+    new_state = dta * state + u_dt  # (N,)
+    # Output projection
+    y = C_proj(new_state)  # (D,)
+    return y, new_state
+def recurrent_gate(u, A, B_proj, C_proj, log_dt_proj, state=None):
+    """
+    Gate-style SSM using sequential scan for streaming inference.
+    Args:
+        u: Input tensor of shape (T, C_in) or (B, T, C_in)
+        A: State transition parameter of shape (N,) - in log space
+        B_proj: nn.Module or callable that projects (*, C_in) -> (*, N)
+        C_proj: nn.Module or callable that projects (*, N) -> (*, D)
+        log_dt_proj: nn.Module or callable that projects (*, C_in) -> (*, N)
+        state: Optional previous state of shape (N,) or (B, N)
+    Returns:
+        y: Output tensor of shape (D, T) or (B, D, T)
+        state: Updated state of shape (N,) or (B, N)
+    """
+    # Handle batched input
+    if u.dim() == 2:
+        u = u.unsqueeze(0)  # (T, C_in) -> (1, T, C_in)
+        squeeze_batch = True
+    else:
+        squeeze_batch = False
+    B_batch, T, C_in = u.shape
+    # Reshape to (B*T, C_in) for vectorized projection
+    u_flat = u.reshape(B_batch * T, C_in)
+    # Get projections
+    u_proj = B_proj(u_flat)  # (B*T, N)
+    log_dt = log_dt_proj(u_flat)  # (B*T, N)
+    N = u_proj.shape[1]
+    # Reshape back to (B, T, N)
+    u_proj = u_proj.reshape(B_batch, T, N)
+    log_dt = log_dt.reshape(B_batch, T, N)
+    # Discretize (vectorized across batch)
+    dt = F.softplus(log_dt).to(torch.float32)  # (B, T, N)
+    exp_A = torch.exp(A).to(torch.float32)  # (N,)
+    log_dta = -dt * exp_A[None, None, :]  # (B, T, N)
+    dta = torch.exp(log_dta)  # (B, T, N)
+    # Prepare scan input
+    u_dt = u_proj * dt  # (B, T, N)
+    # Initialize state
+    if state is None:
+        x = torch.zeros((B_batch, N), dtype=torch.float32, device=u.device)
+    else:
+        if state.dim() == 1:
+            x = state.unsqueeze(0).expand(B_batch, -1)
+        else:
+            x = state
+    # Output accumulator
+    states = torch.zeros((B_batch, T, N), dtype=torch.float32, device=u.device)
+    # Sequential scan over time (vectorized over batch)
+    for t in range(T):
+        x = dta[:, t] * x + u_dt[:, t]  # (B, N)
+        states[:, t] = x
+    # Apply C projection: (B*T, N) -> (B*T, D)
+    states_flat = states.reshape(B_batch * T, N)
+    y_flat = C_proj(states_flat)  # (B*T, D)
+    D = y_flat.shape[1]
+    y = y_flat.reshape(B_batch, T, D)  # (B, T, D)
+    # Return channels-first layout, as documented in Returns
+    y = y.transpose(1, 2)  # (B, D, T)
+    if squeeze_batch:
+        y = y.squeeze(0)  # (D, T)
+        x = x.squeeze(0)  # (N,)
+    return y, x
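
The recurrence in `recurrent_gate` can be illustrated with a minimal pure-Python sketch for a single state channel. This is an illustration under stated assumptions, not code from the commit: `softplus` and `gate_scan` are hypothetical helper names, and plain floats stand in for tensors. It shows that carrying the returned state into the next chunk gives the same states as one pass over the whole sequence.

```python
import math


def softplus(z: float) -> float:
    # log(1 + exp(z)), with the usual shortcut for large z to avoid overflow.
    return z if z > 20 else math.log1p(math.exp(z))


def gate_scan(u_proj, log_dt, A, state=0.0):
    """Hypothetical helper: x_t = exp(-dt_t * exp(A)) * x_{t-1} + dt_t * u_t."""
    x = state
    states = []
    for u_t, ldt_t in zip(u_proj, log_dt):
        dt = softplus(ldt_t)                 # positive, input-dependent timestep
        decay = math.exp(-dt * math.exp(A))  # discretized decay factor in (0, 1)
        x = decay * x + dt * u_t
        states.append(x)
    return states, x


u = [0.5, -1.0, 2.0, 0.25, 1.5]     # projected inputs for one channel
ldt = [-2.0, 0.0, 1.0, -1.0, 0.5]   # pre-softplus timesteps
A = math.log(2.0)                   # decay parameter in log space

# One pass over the full sequence.
full, final_full = gate_scan(u, ldt, A)

# Two chunks, passing the state from the first chunk into the second.
first, carry = gate_scan(u[:2], ldt[:2], A)
second, final_chunked = gate_scan(u[2:], ldt[2:], A, state=carry)
chunked = first + second
```

The same idea applies per batch element and per state channel in the tensor version: the returned state is all that has to be kept between calls.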

tenns_core/scan_ops.py ADDED

	@@ -0,0 +1,515 @@

+"""
+Parallel scan operations for gate mode SSM.
+Implements parallel prefix scan with custom autograd for training support.
+Uses Triton kernels when available on CUDA, falls back to pure PyTorch otherwise.
+"""
+import torch
+from torch import nn
+from torch.nn import functional as F
+try:
+    import triton
+    import triton.language as tl
+    _HAS_TRITON = hasattr(tl, 'associative_scan')
+except ImportError:
+    _HAS_TRITON = False
+# ----------------------------
+# Utility
+# ----------------------------
+def _tp(x: torch.Tensor) -> torch.Tensor:
+    """(B, L, N) -> (B, N, L) contiguous."""
+    return x.moveaxis(-1, -2).contiguous()
+# ----------------------------
+# Reference (naive) scan
+# ----------------------------
+def scan_naive(input, log_dt, A, state=None, dim=-1):
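+    """Naive sequential scan implementation.
+    Useful for testing and understanding, but slow: it takes O(L) sequential
+    steps, where L is the size of the scanned dimension.
+    Args:
+        input: Input tensor; the broadcast of A assumes layout (..., N, L)
+        log_dt: Timestep parameters before softplus, same shape as input
+        A: State decay parameters of shape (N,), in log space
+        state: Optional initial state (defaults to zero)
+        dim: Dimension to scan over
+    Returns:
+        Scanned output tensor, same shape as input
+    """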
+    dt = F.softplus(log_dt)
+    log_dta = -dt * A.exp()[..., None]
+    a = log_dta.exp()
+    if state is None:
+        state = 0
+    output = []
+    u = input * dt
+    for ui, ai in zip(u.moveaxis(dim, 0), a.moveaxis(dim, 0), strict=True):
+        state = ai * state + ui
+        output.append(state)
+    return torch.stack(output, dim=dim)
+# ----------------------------
+# PyTorch parallel scan
+# ----------------------------
+class ParallelScan(torch.autograd.Function):
+    """Parallel prefix scan with custom autograd.
+    Implements the associative scan operation:
+        state[t] = a[t] * state[t-1] + u[t]
+    in O(log L) parallel depth instead of O(L) sequential steps, where L is
+    the sequence length.
+    Note: The backward pass is a sequential reverse-time loop over the saved
+    forward outputs. For production use with very long sequences, a parallel
+    backward scan could be implemented.
+    """
+    @staticmethod
+    def forward(ctx, u, a):
+        """Forward pass: parallel prefix scan.
+        Args:
+            u: Input values (batch, N, length)
+            a: Decay factors (batch, N, length)
+        Returns:
+            Scanned output (batch, N, length)
+        """
+        length = u.shape[-1]
+        strides = [2**i for i in range((length - 1).bit_length())]
+        # Save original inputs for backward
+        u_original = u.clone()
+        a_original = a.clone()
+        # Clone to avoid in-place modifications
+        u = u.clone()
+        a = a.clone()
+        for stride in strides:
+            u[..., stride:] = u[..., stride:] + u[..., :-stride] * a[..., stride:]
+            a[..., stride:] = a[..., stride:] * a[..., :-stride]
+        ctx.save_for_backward(u_original, a_original, u)
+        return u
+    @staticmethod
+    def backward(ctx, grad_output):
+        """Backward pass using a sequential reverse-time loop.
+        For production, this could be parallelized, but the sequential form
+        is simple and easy to verify against the forward recurrence.
+        """
+        u_original, a_original, y = ctx.saved_tensors
+        # Back-propagate through state[t] = a[t] * state[t-1] + u[t] by hand,
+        # reusing the saved forward outputs y instead of recomputing them
+        grad_u = torch.zeros_like(u_original)
+        grad_a = torch.zeros_like(a_original)
+        # Backward scan: process from right to left
+        length = u_original.shape[-1]
+        # Accumulator for gradient flowing backward through time
+        grad_state = torch.zeros_like(u_original[..., 0:1])
+        for t in range(length - 1, -1, -1):
+            # Gradient from output at time t
+            grad_y_t = grad_output[..., t : t + 1]
+            # Total gradient flowing into state[t]
+            grad_state_t = grad_y_t + grad_state
+            # Gradients w.r.t. inputs
+            grad_u[..., t : t + 1] = grad_state_t
+            if t > 0:
+                grad_a[..., t : t + 1] = grad_state_t * y[..., t - 1 : t]
+            # Propagate gradient to previous state
+            if t > 0:
+                grad_state = grad_state_t * a_original[..., t : t + 1]
+        return grad_u, grad_a
+def parallel_scan_pytorch(input, log_dt, A, state=None):
+    """Pure PyTorch parallel scan for SSM.
+    Args:
+        input: Input tensor (batch, length, N)
+        log_dt: Log timestep parameters (batch, length, N)
+        A: State decay parameters (N,)
+        state: Optional initial state (N,)
+    Returns:
+        Scanned output (batch, length, N)
+    """
+    dt = F.softplus(log_dt)
+    log_dta = -dt * A.exp()[None, None, :]
+    a = log_dta.exp()
+    u = input * dt
+    # Fold initial state into first timestep
+    if state is not None:
+        u = u.clone()
+        u[:, 0, :] = u[:, 0, :] + state * a[:, 0, :]
+    # Transpose for scan: (batch, N, length)
+    u = u.transpose(-1, -2)
+    a = a.transpose(-1, -2)
+    # Apply parallel scan
+    output = ParallelScan.apply(u, a)
+    # Transpose back: (batch, length, N)
+    return output.transpose(-1, -2)
+# ----------------------------
+# Triton kernels (guarded)
+# ----------------------------
+if _HAS_TRITON:
+    @triton.jit
+    def _roll_op(x1, y1, x2, y2):
+        return x2, tl.where(y2 == float('inf'), x1, y2)
+    @triton.jit
+    def roll(u, length: tl.constexpr, reverse: tl.constexpr = 0):
+        if reverse:
+            _, u_rol = tl.associative_scan((u, float('inf') + u), 0, _roll_op, reverse=1)
+            u_rol = tl.where(tl.arange(0, length) < length - 1, u_rol, 0)
+        else:
+            _, u_rol = tl.associative_scan((u, float('inf') + u), 0, _roll_op)
+            u_rol = tl.where(tl.arange(0, length) > 0, u_rol, 0)
+        return u_rol
+    @triton.jit
+    def _scan_op(a1, x1, a2, x2):
+        return a1 * a2, a2 * x1 + x2
+    @triton.jit
+    def softplus_tl(x):
+        return tl.where(x < 20, tl.log(1 + tl.exp(x)), x)
+    @triton.jit
+    def scan_heisen_fwd_triton(
+        u_ptr,
+        log_dt_ptr,
+        A_ptr,
+        y_ptr,
+        state_ptr,
+        L,
+        N: tl.constexpr,
+        MAX_L: tl.constexpr,
+        INIT_STATE: tl.constexpr = 1,
+    ):
+        id_BATCH, id_N = tl.program_id(0), tl.program_id(1)
+        id_sample = id_BATCH * N + id_N
+        lrange = tl.arange(0, MAX_L)
+        offsets = id_sample * L + lrange
+        mask = lrange < L
+        A = tl.load(A_ptr + id_N)
+        if INIT_STATE:
+            state = tl.load(state_ptr + id_N)
+        u = tl.load(u_ptr + offsets, mask, 0).to(tl.float32)
+        log_dt = tl.load(log_dt_ptr + offsets, mask, 0).to(tl.float32)
+        dt = softplus_tl(log_dt)
+        log_dta = -1.0 * dt * tl.exp(A)
+        dta = tl.exp(log_dta)
+        if INIT_STATE:
+            u_dt = tl.where(lrange > 0, u * dt, u * dt + state * dta)
+        else:
+            u_dt = u * dt
+        _, y = tl.associative_scan((dta, u_dt), 0, _scan_op)
+        tl.store(y_ptr + offsets, y, mask)
+    @triton.jit
+    def scan_heisen_bwd_triton(
+        u_ptr,
+        grad_x_ptr,
+        log_dt_ptr,
+        A_ptr,
+        state_ptr,
+        grad_u_ptr,
+        grad_log_dt_ptr,
+        grad_A_ptr,
+        grad_x0_ptr,
+        L,
+        N: tl.constexpr,
+        MAX_L: tl.constexpr,
+        INIT_STATE: tl.constexpr = 1,
+    ):
+        id_BATCH, id_N = tl.program_id(0), tl.program_id(1)
+        id_sample = id_BATCH * N + id_N
+        lrange = tl.arange(0, MAX_L)
+        offsets = id_sample * L + lrange
+        mask = lrange < L
+        A = tl.load(A_ptr + id_N)
+        exp_A = tl.exp(A)
+        if INIT_STATE:
+            state = tl.load(state_ptr + id_N)
+        u = tl.load(u_ptr + offsets, mask, 0).to(tl.float32)
+        log_dt = tl.load(log_dt_ptr + offsets, mask, 0).to(tl.float32)
+        dt = softplus_tl(log_dt)
+        log_dta = -1.0 * dt * exp_A
+        dta = tl.exp(log_dta)
+        if INIT_STATE:
+            u_dt = tl.where(lrange > 0, u * dt, u * dt + state * dta)
+        else:
+            u_dt = u * dt
+        _, x = tl.associative_scan((dta, u_dt), 0, _scan_op)
+        x_rol = roll(x, MAX_L)
+        grad_x = tl.load(grad_x_ptr + offsets, mask, 0).to(tl.float32)
+        if INIT_STATE:
+            log_dta_star = tl.cumsum(log_dta, 0)
+            dta_star = tl.exp(log_dta_star)
+            grad_x0 = tl.sum(grad_x * dta_star, 0)
+            tl.store(grad_x0_ptr + id_sample, grad_x0)
+            x_rol = tl.where(lrange > 0, x_rol, state)
+        dta_rol = roll(dta, MAX_L, reverse=1)
+        _, grad_x = tl.associative_scan((dta_rol, grad_x), 0, _scan_op, reverse=1)
+        grad_u = grad_x * dt
+        tl.store(grad_u_ptr + offsets, grad_u, mask)
+        grad_dta = grad_x * x_rol
+        grad_log_dta = tl.exp(log_dta) * grad_dta
+        grad_log_dt = (-1.0 * grad_log_dta * exp_A + u * grad_x) * tl.sigmoid(log_dt)
+        tl.store(grad_log_dt_ptr + offsets, grad_log_dt, mask)
+        grad_A = tl.sum(grad_log_dta * log_dta, 0)
+        tl.store(grad_A_ptr + id_sample, grad_A)
+    class FusedScanTriton(torch.autograd.Function):
+        @staticmethod
+        @torch.compiler.disable
+        @torch.amp.custom_fwd(device_type='cuda')
+        def forward(ctx, u, T1, T2, logdt_bias, A, B1, B2, state=None):
+            INIT_STATE = state is not None
+            uh = u.half()
+            T1, T2, logdt_bias, B1 = T1.half(), T2.half(), logdt_bias.half(), B1.half()
+            if B2 is not None:
+                B2 = B2.half()
+            if B2 is not None:
+                u1 = F.linear(uh, B1)
+                u2_tp = _tp(F.linear(u1, B2))
+            else:
+                u2_tp = _tp(F.linear(uh, B1))
+            logdt_1 = F.linear(uh, T1)
+            logdt_tp = _tp(F.linear(logdt_1, T2, bias=logdt_bias))
+            x_tp = torch.empty_like(u2_tp, dtype=torch.float32)
+            BATCH, N, L = u2_tp.shape
+            grid = (BATCH, N)
+            max_L = triton.next_power_of_2(L)
+            num_warps = max(max_L // 1024, 1)
+            scan_heisen_fwd_triton[grid](
+                u2_tp,
+                logdt_tp,
+                A,
+                x_tp,
+                state,
+                L,
+                N,
+                max_L,
+                INIT_STATE=INIT_STATE,
+                num_warps=num_warps,
+                num_stages=3,
+            )
+            if B2 is not None:
+                ctx.save_for_backward(uh, state, A, T1, T2, logdt_bias, B1, B2)
+                ctx.B2_flag = True
+            else:
+                ctx.save_for_backward(uh, u2_tp, state, A, T1, T2, logdt_bias, B1)
+                ctx.B2_flag = False
+            return x_tp.moveaxis(-1, -2)  # (B, L, N)
+        @staticmethod
+        @torch.compiler.disable
+        @torch.amp.custom_bwd(device_type='cuda')
+        def backward(ctx, grad_x):
+            def back_dot(x, y):
+                return torch.tensordot(x, y, dims=([0], [0]))
+            if ctx.B2_flag:
+                uh, state, A, T1, T2, logdt_bias, B1, B2 = ctx.saved_tensors
+            else:
+                uh, u2_tp, state, A, T1, T2, logdt_bias, B1 = ctx.saved_tensors
+                B2 = None
+            INIT_STATE = state is not None
+            grad_x_tp = _tp(grad_x)
+            if B2 is not None:
+                u1 = F.linear(uh, B1)
+                u2_tp = _tp(F.linear(u1, B2))
+            logdt1 = F.linear(uh, T1)
+            logdt2 = F.linear(logdt1, T2, bias=logdt_bias)
+            logdt2_tp = _tp(logdt2)
+            BATCH, N, L = u2_tp.shape
+            grid = (BATCH, N)
+            grad_u2_tp = torch.empty_like(u2_tp, dtype=torch.float32)
+            grad_logdt_tp = torch.empty_like(u2_tp, dtype=torch.float32)
+            grad_A = torch.empty(grid, dtype=torch.float32, device=A.device)
+            grad_x0 = (
+                torch.empty((BATCH, N), dtype=torch.float32, device=A.device)
+                if INIT_STATE
+                else None
+            )
+            max_L = triton.next_power_of_2(L)
+            num_warps = max(max_L // 1024, 1)
+            scan_heisen_bwd_triton[grid](
+                u2_tp,
+                grad_x_tp,
+                logdt2_tp,
+                A,
+                state,
+                grad_u2_tp,
+                grad_logdt_tp,
+                grad_A,
+                grad_x0,
+                L,
+                N,
+                max_L,
+                INIT_STATE=INIT_STATE,
+                num_warps=num_warps,
+                num_stages=3,
+            )
+            grad_A = grad_A.sum(0)
+            grad_init_state = grad_x0.sum(0) if INIT_STATE else None
+            uh2 = uh.view(-1, uh.shape[-1])  # (B*L, C_in); C_in need not equal N
+            grad_u2 = _tp(grad_u2_tp).view(-1, N)
+            if B2 is not None:
+                grad_u1 = grad_u2 @ B2
+                grad_B2 = back_dot(grad_u2, u1.view(BATCH * L, -1))
+            else:
+                grad_B2 = None
+                grad_u1 = grad_u2
+            grad_u = grad_u1 @ B1
+            grad_B1 = back_dot(grad_u1, uh2)
+            grad_logdt_bias = grad_logdt_tp.sum((0, 2))
+            grad_logdt = _tp(grad_logdt_tp).view(-1, N)
+            grad_logdt_1 = grad_logdt @ T2
+            grad_T2 = back_dot(grad_logdt, logdt1.view(BATCH * L, -1))
+            grad_u = grad_u + grad_logdt_1 @ T1
+            grad_T1 = back_dot(grad_logdt_1, uh2)
+            return (
+                grad_u.view(BATCH, L, -1),  # (B, L, C_in)
+                grad_T1,
+                grad_T2,
+                grad_logdt_bias,
+                grad_A,
+                grad_B1,
+                grad_B2,
+                grad_init_state,
+            )
+# ----------------------------
+# Unified API
+# ----------------------------
+def _can_use_triton(u: torch.Tensor) -> bool:
+    if not _HAS_TRITON:
+        return False
+    if not u.is_cuda:
+        return False
+    try:
+        major, _ = torch.cuda.get_device_capability(u.device)
+        if major < 7:
+            return False
+    except Exception:
+        pass
+    return True
+def fused_scan(u, log_dt_proj, in_proj, A, state=None):
+    """Fused scan operation for gate mode SSM.
+    Uses Triton kernels on CUDA when available, falls back to PyTorch parallel scan.
+    Args:
+        u: Input tensor (batch, length, channels)
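+        log_dt_proj: nn.Sequential of two nn.Linear layers for the timestep
+            projection; the second layer must have a bias on the Triton path
+        in_proj: nn.Linear, or nn.Sequential of two nn.Linear layers, for the
+            input projection; only the weights are used on the Triton path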
+        A: State decay parameters (N,)
+        state: Optional initial state (N,)
+    Returns:
+        Scanned output (batch, length, N)
+    """
+    if _can_use_triton(u):
+        # Extract weights for Triton path
+        if isinstance(in_proj, nn.Linear):
+            B1, B2 = in_proj.weight, None
+        else:
+            B1, B2 = in_proj[0].weight, in_proj[1].weight
+        T1 = log_dt_proj[0].weight
+        T2 = log_dt_proj[1].weight
+        logdt_bias = log_dt_proj[1].bias
+        return FusedScanTriton.apply(
+            u.contiguous(),
+            T1.contiguous(),
+            T2.contiguous(),
+            logdt_bias.contiguous(),
+            A.contiguous(),
+            B1.contiguous(),
+            B2.contiguous() if B2 is not None else None,
+            state.contiguous() if state is not None else None,
+        )
+    # PyTorch fallback (CPU or CUDA without Triton)
+    u_proj = in_proj(u)
+    log_dt = log_dt_proj(u)
+    return parallel_scan_pytorch(u_proj, log_dt, A, state=state)
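
The stride-doubling forward pass, the folding of an initial state into the first timestep, and the reverse-time gradient loop from `ParallelScan` can be checked on a toy example. The sketch below is a pure-Python illustration, not code from the commit: all function names (`scan_sequential`, `scan_doubling`, `scan_backward`, `loss`, `central_diff`) are hypothetical, plain lists stand in for a single (batch, channel) row of the tensors, and the toy loss is an assumption made only so gradients can be compared with finite differences.

```python
def scan_sequential(u, a, state=0.0):
    """Reference recurrence x_t = a_t * x_{t-1} + u_t."""
    out, x = [], state
    for u_t, a_t in zip(u, a):
        x = a_t * x + u_t
        out.append(x)
    return out


def scan_doubling(u, a):
    """Inclusive scan of the same recurrence (zero initial state) by stride doubling."""
    u, a = list(u), list(a)
    stride = 1
    while stride < len(u):
        # Build new lists so every position reads the values from before this pass.
        new_u = u[:stride] + [u[t] + a[t] * u[t - stride] for t in range(stride, len(u))]
        new_a = a[:stride] + [a[t] * a[t - stride] for t in range(stride, len(a))]
        u, a = new_u, new_a
        stride *= 2
    return u


def scan_backward(grad_y, a, y):
    """Reverse-time gradient recurrence for the scan with zero initial state."""
    grad_u, grad_a = [0.0] * len(a), [0.0] * len(a)
    carry = 0.0  # gradient arriving from later timesteps
    for t in range(len(a) - 1, -1, -1):
        g = grad_y[t] + carry
        grad_u[t] = g
        if t > 0:
            grad_a[t] = g * y[t - 1]
            carry = g * a[t]
    return grad_u, grad_a


u = [1.0, -2.0, 0.5, 3.0, -1.0]
a = [0.9, 0.5, 0.8, 0.3, 0.7]
w = [0.2, -0.4, 1.0, 0.6, -0.3]  # weights of a toy loss sum(w_t * y_t)

y_seq = scan_sequential(u, a)
y_par = scan_doubling(u, a)

# Fold a nonzero initial state into the first input, then scan from zero.
x0 = 2.0
u_folded = [u[0] + x0 * a[0]] + u[1:]
y_folded = scan_doubling(u_folded, a)
y_with_state = scan_sequential(u, a, state=x0)

# Analytic gradients of the toy loss, to compare with central finite differences.
grad_u, grad_a = scan_backward(w, a, y_seq)


def loss(u_vals, a_vals):
    return sum(w_t * y_t for w_t, y_t in zip(w, scan_sequential(u_vals, a_vals)))


def central_diff(values, k, fn, eps=1e-6):
    up = values[:k] + [values[k] + eps] + values[k + 1:]
    down = values[:k] + [values[k] - eps] + values[k + 1:]
    return (fn(up) - fn(down)) / (2 * eps)


num_grad_u = [central_diff(u, k, lambda v: loss(v, a)) for k in range(len(u))]
num_grad_a = [central_diff(a, k, lambda v: loss(u, v)) for k in range(len(a))]
```

With a zero initial state the first decay factor never multiplies anything, so its gradient is zero in both the analytic and the numerical result.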

tenns_core/ssm.py ADDED

	@@ -0,0 +1,481 @@

+"""
+State Space Model (SSM) layers for sequence modeling.
+This module provides SSMLayer, a flexible implementation of various SSM architectures
+including S5, DWS, Neck, Full, and Gate modes. The layers are written in PyTorch with
+custom autograd functions for efficient training; gate mode additionally uses Triton
+kernels on CUDA when available (see scan_ops).
+"""
+import math
+import einops
+import numpy as np
+import torch
+from torch import nn
+from torch.nn import functional as F
+from torch.nn.parameter import Parameter
+from .activations import get_activations
+from .fft_ops import padded_fft_conv_opt
+from .scan_ops import fused_scan
+# Utility functions
+def c2r(inputs):
+    return torch.view_as_real(inputs)
+def r2c(inputs):
+    return torch.view_as_complex(inputs)
+def inv_softplus(x):
+    return x + np.log(-np.expm1(-x))
+class Kernelizer(nn.Module):
+    """Core module for SSM operations using FFT convolutions and parallel scans.
+    This is the base class that handles the actual SSM computation.
+    SSMLayer extends this with parameter initialization and training utilities.
+    """
+    def __init__(self, mode='s5', transposed=False, complex_proj=False, **kwargs):
+        """Initialize Kernelizer.
+        Args:
+            mode: SSM mode ('s5', 'dws', 'neck', 'full', 'gate')
+            transposed: If True, gate mode takes and returns (batch, length, channels)
+                tensors instead of (batch, channels, length)
+            complex_proj: Whether to use complex projections
+        """
+        super().__init__()
+        self.mode = mode
+        self.transposed = transposed
+        self.complex_proj = complex_proj
+    @torch.compiler.disable
+    def discretize(self, A: torch.Tensor, weight: torch.Tensor, log_dt: torch.Tensor):
+        """Discretize continuous-time SSM with a simplified zero-order hold.
+        Converts continuous-time parameters (A, B, dt) to discrete-time (A_bar, B_bar):
+            A_bar = exp(A * dt)
+            B_bar = B * dt  (first-order approximation of the exact ZOH input term)
+        This method returns the exponent dt * A as real and imaginary parts,
+        not A_bar itself.
+        NOTE: Assumes diagonal state matrix A.
+        Args:
+            A: State matrix diagonal [real, imag] (shape varies by mode)
+            weight: Input weight matrix B or output weight E (shape varies by mode)
+            log_dt: Log of timestep parameters
+        Returns:
+            Tuple of (dtA_real, dtA_imag, weight_hat) discretized parameters
+        """
+        with torch.autocast('cuda', enabled=False):
+            A_real, A_imag = -F.softplus(A[..., 0]), A[..., 1]
+            dt = log_dt.exp()
+            match self.mode:
+                case 'neck':
+                    dt = dt.unsqueeze(-1)  # (R, :) -> (R, :, 1)
+                    weight_hat = weight * dt
+                case 'full':
+                    dt = dt.unsqueeze(-2)  # (D, :) -> (D, 1, :)
+                    weight_hat = weight * dt
+                case 'dws':
+                    weight_hat = weight * dt  # (C, N)
+                case _:  # s5, gate
+                    weight_hat = weight * dt.unsqueeze(-1)  # (R*N, :) -> (R*N, C)
+            dtA_real, dtA_imag = dt * A_real, dt * A_imag
+            return dtA_real, dtA_imag, weight_hat
+    def forward(
+        self,
+        input: torch.Tensor,
+        A: torch.Tensor,
+        B: torch.Tensor,
+        C: torch.Tensor,
+        log_dt: torch.Tensor,
+        E: torch.Tensor,
+        state=None,
+    ):
+        """Forward pass through SSM layer.
+        Args:
+            input: Input tensor (batch, channels, length)
+            A: State matrix diagonal parameters
+            B: Input projection matrix (for s5/neck/gate modes)
+            C: Output projection matrix (for s5/neck modes) or module (for gate)
+            log_dt: Log timestep parameters
+            E: State mixing matrix (for dws/neck/full modes)
+            state: Optional initial state (for gate mode prefix tuning)
+        Returns:
+            Output tensor (batch, out_channels, length)
+        """
+        match self.mode:
+            case 's5' | 'neck':
+                dtA_real, dtA_imag, B_hat = self.discretize(A, B, log_dt)
+                return padded_fft_conv_opt(input, dtA_real, dtA_imag, B_hat, C, E)
+            case 'dws' | 'full':
+                dtA_real, dtA_imag, E_hat = self.discretize(A, E, log_dt)
+                return padded_fft_conv_opt(input, dtA_real, dtA_imag, None, None, E_hat)
+            case 'gate':
+                # Gate mode can work with both formats
+                # Transpose if needed: (B, C, T) -> (B, T, C)
+                if not self.transposed:
+                    input = input.transpose(1, 2)
+                output = C(fused_scan(input, log_dt, B, A, state=state))
+                # Transpose back if needed: (B, T, D) -> (B, D, T)
+                if not self.transposed:
+                    output = output.transpose(1, 2)
+                return output
+class SSMLayer(Kernelizer):
+    """State Space Model layer with multiple architecture variants.
+    Extends Kernelizer with parameter initialization, activation layers,
+    and training utilities. Supports multiple SSM modes:
+    - **s5**: Standard S5 architecture with shared state space
+    - **dws**: Depthwise separable variant (per-channel state spaces)
+    - **neck**: Bottleneck architecture with low-rank state mixing
+    - **full**: Full parameterization (per-output-channel state spaces)
+    - **gate**: Input-dependent gating (Mamba-style selective SSM)
+    Mode Comparison:
+    ----------------
+    | Mode  | Parameters | Best For                    | Speed   |
+    |-------|------------|-----------------------------|---------|
+    | s5    | Medium     | General sequence modeling   | Fast    |
+    | dws   | Low        | Efficient local processing  | Fastest |
+    | neck  | Low        | Long sequences, low memory  | Fast    |
+    | full  | High       | Rich feature interactions   | Medium  |
+    | gate  | High       | Input-adaptive processing   | Slow    |
+    Usage Example:
+    --------------
+    >>> # S5 mode for sequence classification
+    >>> layer = SSMLayer(
+    ...     num_coeffs=64,      # State space dimension
+    ...     in_channels=128,    # Input features
+    ...     out_channels=256,   # Output features
+    ...     mode='s5',
+    ...     repeat=1,           # Number of parallel SSMs
+    ...     norm='layer',
+    ...     postact='gelu'
+    ... )
+    >>> input = torch.randn(4, 128, 512)  # (batch, channels, length)
+    >>> output = layer(input)  # (4, 256, 512)
+    """
+    def __init__(
+        self,
+        num_coeffs: int,
+        in_channels: int,
+        out_channels: int,
+        repeat=None,
+        norm='batch',
+        postact='relu',
+        dropout=None,
+        dropout_dim=1,
+        use_activations=False,
+        **kwargs,
+    ):
+        """Initialize SSM layer.
+        Args:
+            num_coeffs: Dimension of state space (N in SSM notation)
+            in_channels: Number of input channels
+            out_channels: Number of output channels
+            repeat: Number of parallel SSM blocks (default: 1)
+            norm: Normalization type ('batch', 'layer', 'layer-feature', 'rms', None)
+            postact: Activation function ('relu', 'relu6', 'lelu', 'sigmoid', 'tanh',
+                'gelu', 'glu', 'silu', None)
+            dropout: Dropout probability (None for no dropout)
+            dropout_dim: Dimension for dropout (0, 1, 2, or 3)
+            use_activations: Whether to apply activations to mixer output
+            **kwargs: Additional arguments (mode, transposed, complex_proj, etc.)
+        """
+        _VALID_MODES = {'s5', 'dws', 'neck', 'full', 'gate'}
+        _VALID_NORMS = {'batch', 'layer', 'layer-feature', 'rms', None}
+        _VALID_POSTACTS = {'relu', 'relu6', 'lelu', 'sigmoid', 'tanh', 'gelu', 'glu', 'silu', None}
+        _VALID_DROPOUT_DIMS = {0, 1, 2, 3}
+        mode = kwargs.get('mode', 's5')
+        if mode not in _VALID_MODES:
+            raise ValueError(f"Invalid mode '{mode}'. Must be one of {sorted(_VALID_MODES)}.")
+        if norm not in _VALID_NORMS:
+            raise ValueError(
+                f"Invalid norm '{norm}'. Must be one of {sorted(_VALID_NORMS, key=str)}."
+            )
+        if postact not in _VALID_POSTACTS:
+            raise ValueError(
+                f"Invalid postact '{postact}'. Must be one of {sorted(_VALID_POSTACTS, key=str)}."
+            )
+        if dropout_dim not in _VALID_DROPOUT_DIMS:
+            raise ValueError(
+                f'Invalid dropout_dim {dropout_dim}. Must be one of {sorted(_VALID_DROPOUT_DIMS)}.'
+            )
+        if num_coeffs < 1:
+            raise ValueError(f'num_coeffs must be >= 1, got {num_coeffs}.')
+        if in_channels < 1:
+            raise ValueError(f'in_channels must be >= 1, got {in_channels}.')
+        if out_channels < 1:
+            raise ValueError(f'out_channels must be >= 1, got {out_channels}.')
+        super().__init__(**kwargs)
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.repeat = 1 if repeat is None else repeat
+        self.norm = norm
+        self.postact = postact
+        self.dropout = dropout
+        self.dropout_dim = dropout_dim
+        self.bias = None
+        self.E = None
+        # Initialize state matrix A
+        if self.mode == 'gate':
+            # For gate mode: log-spaced initialization
+            A = np.arange(1, num_coeffs + 1)
+            A = np.log(A)
+        else:
+            # For FFT modes: complex eigenvalues
+            # Real part: decay rate, Imaginary part: frequency
+            A = np.stack([0.5 * np.ones(num_coeffs), math.pi * np.arange(num_coeffs)], -1)
+            A[..., 0] = inv_softplus(A[..., 0])
+        # Initialize timestep parameters
+        if self.mode in ['dws']:
+            dt = np.geomspace(1e-3, 1e-1, in_channels)
+        elif self.mode == 'full':
+            dt = np.geomspace(1e-3, 1e-1, out_channels)
+        else:
+            dt = np.geomspace(1e-3, 1e-1, self.repeat)
+        if self.mode == 'gate':
+            log_dt = inv_softplus(dt)
+        else:
+            log_dt = np.log(dt)
+        # Helper functions for parameter creation
+        def to_parameter(mat, is_complex=False, requires_grad=True):
+            if mat is None:
+                return None
+            tensor = torch.tensor(mat, dtype=torch.float)
+            if is_complex:
+                tensor = tensor.cfloat()
+            return Parameter(tensor, requires_grad=requires_grad)
+        def ones(shape, fan_in):
+            mat = np.ones(shape) / math.sqrt(fan_in)
+            return to_parameter(mat, is_complex=self.complex_proj)
+        def normal(shape, fan_in):
+            mat = np.random.randn(*shape) * math.sqrt(2 / fan_in)
+            return to_parameter(mat, is_complex=self.complex_proj)
+        tot_coeffs = self.repeat * num_coeffs
+        # Mode-specific parameter initialization
+        match self.mode:
+            case 'dws':
+                log_dt = einops.repeat(log_dt, 'c -> c n', n=num_coeffs)
+                A = einops.repeat(A, 'n i -> c n i', c=in_channels)
+                self.B = None
+                self.C = None
+                self.E = ones((in_channels, num_coeffs), num_coeffs)
+            case 's5':
+                log_dt = einops.repeat(log_dt, 'j -> (j n)', n=num_coeffs)
+                A = einops.repeat(A, 'n i -> (j n) i', j=self.repeat)
+                self.B = ones((tot_coeffs, in_channels), in_channels)
+                self.C = normal((out_channels, tot_coeffs), tot_coeffs)
+                self.E = None
+            case 'neck':
+                # Neck mode keeps one log_dt per repeat (not expanded over num_coeffs)
+                A = einops.repeat(A, 'n i -> r n i', r=self.repeat)
+                self.B = ones((self.repeat, in_channels), in_channels)
+                self.C = normal((out_channels, self.repeat), tot_coeffs)
+                self.E = normal((self.repeat, num_coeffs), 1)
+            case 'full':
+                log_dt = einops.repeat(log_dt, 'd -> d n', n=num_coeffs)
+                A = einops.repeat(A, 'n i -> d c n i', c=in_channels, d=out_channels)
+                self.B = None
+                self.C = None
+                self.E = ones((out_channels, in_channels, num_coeffs), in_channels)
+            case 'gate':
+                log_dt = einops.repeat(log_dt, 'j -> (j n)', n=num_coeffs)
+                # Timestep projection: learns input-dependent timesteps
+                self.log_dt = nn.Sequential(
+                    nn.Linear(in_channels, self.repeat, bias=False),
+                    nn.Linear(self.repeat, tot_coeffs, bias=True),
+                )
+                nn.init.zeros_(self.log_dt[-1].weight)
+                self.log_dt[-1].bias = to_parameter(log_dt)
+                # State decay parameters
+                A = einops.repeat(A, 'n -> (j n)', j=self.repeat)
+                # Input and output projections
+                self.B = nn.Sequential(
+                    nn.Linear(in_channels, self.repeat, bias=False),
+                    nn.Linear(self.repeat, tot_coeffs, bias=False),
+                )
+                self.C = nn.Linear(tot_coeffs, out_channels, bias=False)
+        # Register parameters
+        self.A = to_parameter(A)
+        if self.mode not in ['gate']:
+            self.log_dt = to_parameter(log_dt)
+        # Mark certain parameters as "sensitive" for optimizer
+        # (suggests using smaller learning rates for these)
+        match self.mode:
+            case 'dws' | 'full' | 'neck':
+                self._register_sensitives(self.log_dt, self.A)
+            case 'gate':
+                self._register_sensitives(self.A)
+        # Mixer layer: final projection and activations
+        if self.mode in ['dws']:
+            # DWS mode has explicit channel mixing
+            self.mixer = nn.Sequential(
+                self._make_activation_block(in_channels),
+                nn.Conv1d(in_channels, out_channels, 1, bias=False),
+                self._make_activation_block(out_channels) if use_activations else nn.Identity(),
+            )
+        else:
+            self.mixer = (
+                self._make_activation_block(out_channels) if use_activations else nn.Identity()
+            )
+    @staticmethod
+    def _register_sensitives(*args):
+        """Mark parameters as sensitive (for optimizer to use smaller learning rates).
+        Args:
+            *args: Parameters or modules to mark as sensitive
+        """
+        for arg in args:
+            if isinstance(arg, nn.Module):
+                for param in arg.parameters():
+                    param.sensitive = True
+                continue
+            arg.sensitive = True
+    def get_param_groups(self, lr=1e-3, sensitive_lr_factor=0.1):
+        """Get optimizer parameter groups with separate learning rates.
+        Parameters marked as sensitive benefit from smaller learning rates:
+        log_dt and A in dws/full/neck modes, A in gate mode, none in s5 mode.
+        This method returns ready-made param groups for the optimizer.
+        Args:
+            lr: Base learning rate for regular parameters
+            sensitive_lr_factor: Multiplier for sensitive parameter learning rate
+                (default: 0.1, i.e. 10x smaller than base lr)
+        Returns:
+            List of dicts suitable for torch.optim optimizers
+        Example:
+            >>> layer = SSMLayer(64, 128, 256, mode='s5')
+            >>> optimizer = torch.optim.AdamW(layer.get_param_groups(lr=1e-3))
+        """
+        regular, sensitive = [], []
+        for param in self.parameters():
+            if getattr(param, 'sensitive', False):
+                sensitive.append(param)
+            else:
+                regular.append(param)
+        groups = [{'params': regular, 'lr': lr}]
+        if sensitive:
+            groups.append({'params': sensitive, 'lr': lr * sensitive_lr_factor})
+        return groups
+    def _make_activation_block(self, num_features):
+        """Create normalization + activation + dropout block.
+        Args:
+            num_features: Number of features for norm/dropout
+        Returns:
+            Sequential module with norm, activation, dropout
+        """
+        return get_activations(
+            1, num_features, self.norm, self.postact, self.dropout, self.dropout_dim
+        )
+    def forward(self, input):
+        """Forward pass through SSM layer.
+        Args:
+            input: Input tensor of shape (batch, in_channels, length)
+        Returns:
+            Output tensor of shape (batch, out_channels, length)
+        """
+        output = super().forward(input, self.A, self.B, self.C, self.log_dt, E=self.E)
+        if self.bias is not None:
+            output = output + self.bias
+        return self.mixer(output)
+    def to_inference(self):
+        """Convert to streaming inference mode.
+        Returns an SSMLayerInference instance for low-latency streaming processing.
+        The inference layer maintains state across chunks for streaming applications.
+        Returns:
+            SSMLayerInference: Inference layer with copied weights
+        Example:
+            >>> # After training
+            >>> train_layer = SSMLayer(64, 128, 256, mode='s5')
+            >>> # ... training ...
+            >>>
+            >>> # Convert for streaming
+            >>> infer_layer = train_layer.to_inference()
+            >>>
+            >>> # Process audio stream
+            >>> for chunk in audio_stream:
+            ...     output = infer_layer(chunk)
+            >>>
+            >>> # Reset between utterances
+            >>> infer_layer.reset_state()
+        Note:
+            The inference layer uses a sequential scan, which is slower than
+            the FFT path for full sequences but has lower latency for streaming.
+        """
+        from .inference import SSMLayerInference
+        return SSMLayerInference.from_training(self)
+    def __repr__(self):
+        """String representation showing parameters."""
+        param_info = []
+        for name, param in self.named_parameters():
+            if param.requires_grad:
+                param_info.append(f'{name}: {list(param.shape)}')
+        return f'{self.__class__.__name__}(\n  ' + '\n  '.join(param_info) + '\n)'
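The grouping logic in `get_param_groups` above can be tried without PyTorch. The following is a minimal sketch, not the repository's code: `split_param_groups` is a hypothetical free-function version of the method, and the `SimpleNamespace` objects are stand-ins for real parameters, used only to show how the `sensitive` flag selects the learning rate.

```python
from types import SimpleNamespace


def split_param_groups(params, lr=1e-3, sensitive_lr_factor=0.1):
    """Split params into regular and sensitive groups by their `sensitive` flag."""
    regular, sensitive = [], []
    for param in params:
        # Parameters without the attribute are treated as regular.
        if getattr(param, "sensitive", False):
            sensitive.append(param)
        else:
            regular.append(param)
    groups = [{"params": regular, "lr": lr}]
    if sensitive:
        groups.append({"params": sensitive, "lr": lr * sensitive_lr_factor})
    return groups


# Hypothetical stand-ins for parameters; only the flag matters here.
A = SimpleNamespace(name="A", sensitive=True)
log_dt = SimpleNamespace(name="log_dt", sensitive=True)
weight = SimpleNamespace(name="C.weight")

groups = split_param_groups([A, log_dt, weight], lr=1e-3)
for group in groups:
    print([p.name for p in group["params"]], group["lr"])
```

With real modules, the returned list has the same shape as the per-parameter-group dicts that `torch.optim` optimizers accept, as the docstring example in the diff suggests.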

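The `to_inference` docstring above describes a layer that carries state across chunks and is reset between utterances. The sketch below illustrates that idea on a scalar linear recurrence `h_t = a * h_(t-1) + b * x_t`. It is an assumption-laden toy: `scan` and `StreamingScan` are hypothetical names, and nothing here reflects the actual `SSMLayerInference` implementation, which is not shown in this part of the diff.

```python
def scan(xs, a, b, h=0.0):
    """Run h_t = a * h_(t-1) + b * x_t over xs; return outputs and final state."""
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys, h


class StreamingScan:
    """Toy streaming wrapper that keeps the recurrent state between calls."""

    def __init__(self, a, b):
        self.a = a
        self.b = b
        self.h = 0.0

    def __call__(self, chunk):
        ys, self.h = scan(chunk, self.a, self.b, self.h)
        return ys

    def reset_state(self):
        # Forget the history, e.g. between independent utterances.
        self.h = 0.0


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Processing the whole sequence at once ...
full, _ = scan(signal, a=0.5, b=1.0)

# ... matches processing it chunk by chunk when state is carried over.
layer = StreamingScan(a=0.5, b=1.0)
chunked = layer(signal[:2]) + layer(signal[2:5]) + layer(signal[5:])
print(chunked == full)
```

Because the chunked path performs the same operations in the same order, the two results agree exactly in this toy; calling `reset_state()` makes the next chunk start from a zero state.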
tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,17 @@

+{
+  "add_prefix_space": null,
+  "backend": "tokenizers",
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "extra_special_tokens": [],
+  "is_local": false,
+  "legacy": false,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "</s>",
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
+}
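Two values in this config are easy to misread, so here is a short check written as a sketch. It parses an excerpt of the JSON above (only the fields it inspects) and relies on one assumption to verify against the Transformers documentation: that `int(1e30)` is the sentinel the library uses when no maximum length is configured.

```python
import json

# Excerpt of tokenizer_config.json from the diff above.
config_text = """
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "</s>",
  "unk_token": "<unk>",
  "model_max_length": 1000000000000000019884624838656
}
"""

cfg = json.loads(config_text)

# The large integer is exactly int(1e30), so it acts as "no limit set"
# rather than as a real context length (assumption: Transformers sentinel).
has_real_limit = cfg["model_max_length"] < int(1e30)

# Padding reuses the end-of-sequence token, so padded positions must be
# masked explicitly if they should not be confused with real EOS tokens.
pad_is_eos = cfg["pad_token"] == cfg["eos_token"]

print("real max length configured:", has_real_limit)
print("pad token equals eos token:", pad_is_eos)
```

In practice this means callers should pass an explicit maximum length when truncating, since the tokenizer config itself does not supply one.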