Jordan-Spectral Attention (JSA)

This repository is a timestamped public research artifact for Jordan-Spectral Attention (JSA), a spectral-shift replacement for Transformer self-attention in autoregressive language modeling experiments.

JSA replaces token-to-token softmax attention with two structured branches:

  1. Spectral global mixing over a small cosine basis of rank R.
  2. Causal local shift mixing over the previous k tokens.

The current experimental implementation targets the OpenAI Parameter Golf MLX training path.

Core operator

For an input sequence x ∈ R^{B×T×D}:

  • project the token axis into R spectral modes,
  • gate those modes from a pooled sequence representation,
  • reconstruct a global sequence signal,
  • add small causal local shifts,
  • apply a learned channel scale.
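In compact form (a hedged reading of the steps above, with illustrative symbols): let B ∈ R^{T×R} be the fixed cosine basis, g = σ(W_g · pool(x)) ∈ R^{R} the pooled gate, w_s ∈ R^{D} the per-shift channel weights, and γ ∈ R^{D} the channel scale. Then

  y = γ ⊙ [ B (g ⊙ (Bᵀ x)) + Σ_{s=1..k} w_s ⊙ shift_s(x) ]

where shift_s(x) is x delayed by s positions along the token axis (causal).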

The implementation lives in:

jsa/mixer.py
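
For illustration, here is a minimal MLX sketch of the operator. It is a sketch under assumed shapes and names, not the actual jsa/mixer.py API:

import math
import mlx.core as mx
import mlx.nn as nn

class JSAMixerSketch(nn.Module):
    """Illustrative JSA mixer: spectral global branch + causal local shifts."""

    def __init__(self, dim, rank=64, local_k=2, max_len=2048):
        super().__init__()
        # Fixed cosine basis over the token axis, (max_len, rank); the leading
        # underscore keeps it out of the trainable parameters in mlx.nn.
        t = mx.arange(max_len).reshape(-1, 1) + 0.5
        r = mx.arange(rank).reshape(1, -1)
        self._basis = mx.cos(math.pi * t * r / max_len)
        self.gate = nn.Linear(dim, rank)         # gates modes from pooled features
        self.shift_w = mx.zeros((local_k, dim))  # per-shift, per-channel weights
        self.scale = mx.ones((dim,))             # learned channel scale
        self.local_k = local_k

    def __call__(self, x):                       # x: (B, T, D)
        T = x.shape[1]
        basis = self._basis[:T]                  # (T, R)
        # 1. Project the token axis into R spectral modes: (B, R, D).
        modes = mx.matmul(mx.transpose(basis), x)
        # 2. Gate the modes from a pooled sequence representation.
        g = mx.sigmoid(self.gate(mx.mean(x, axis=1)))   # (B, R)
        modes = modes * g[:, :, None]
        # 3. Reconstruct a global sequence signal: (B, T, D).
        y = mx.matmul(basis, modes)
        # 4. Add small causal local shifts over the previous k tokens.
        for s in range(1, self.local_k + 1):
            shifted = mx.pad(x, [(0, 0), (s, 0), (0, 0)])[:, :T, :]
            y = y + shifted * self.shift_w[s - 1]
        # 5. Apply a learned channel scale.
        return y * self.scale

On dummy input x = mx.random.normal((2, 128, 256)), JSAMixerSketch(256)(x) returns a (2, 128, 256) array, matching the input shape as a drop-in mixer.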

The Parameter Golf integration lives in:

train/train_jsa_mlx.py

Current strongest local result

These are local MLX experiments. Official Parameter Golf reproduction is pending.

Setup                                Params   Artifact          Full-val BPB   Notes
SP8192 baseline                      20.73M   13.62 MB          1.9096         Local 500-step baseline
JSA full replacement, rank 32, k=2   14.11M   ~10.55 MB         0.91–1.11      Seed-sensitive but strong
JSA full replacement, rank 64, k=2   14.55M   ~11.74–11.78 MB   0.58–0.60      Best current local result

Key caveat: these runs are local OpenAI Parameter Golf experiments. They used 10 downloaded train shards for local iteration, with full validation over the SP8192 validation split, and should not be presented as official leaderboard results until reproduced through the official track path.
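
For reference, full-val BPB is bits per byte. A minimal sketch of the usual conversion from mean cross-entropy loss, assuming the training script reports loss in nats per token and that token and byte counts are taken over the validation split (the actual script may compute this differently):

import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    # nats/token -> bits/token, then rescale by the token-to-byte ratio.
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes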

Setup

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Get the data:

chmod +x scripts/setup_sp8192.sh
bash scripts/setup_sp8192.sh

Example run

A short 50-step sanity check is included only to verify the standalone repo wiring; headline results use the full 500-step / full-validation runs from the original experiment logs.

RUN_ID=sanity_check \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
VOCAB_SIZE=8192 \
SEED=42 \
USE_JSA=1 \
JSA_RANK=64 \
JSA_LOCAL_K=2 \
JSA_LAST_N_LAYERS=9 \
ITERATIONS=50 \
TRAIN_BATCH_TOKENS=8192 \
VAL_BATCH_SIZE=8192 \
VAL_LOSS_EVERY=0 \
VAL_MAX_SEQS=128 \
python3 train/train_jsa_mlx.py
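
Here JSA_RANK and JSA_LOCAL_K set the spectral rank R and the local window k of the operator above, and JSA_LAST_N_LAYERS appears to control how many of the final Transformer layers have their attention replaced by JSA; setting USE_JSA=0 should presumably fall back to the baseline attention path (see train/train_jsa_mlx.py for the exact semantics).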

Tested environment

Tested on Apple Silicon + MLX 0.31.1.

Dataset note

The SP8192 dataset used for the Parameter Golf runs is downloaded via the alternate manifest:

rm -f datasets/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192 --train-shards 10
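
The --train-shards 10 flag matches the 10-shard local setup noted in the caveat above.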

Jordan-Spectral Attention Block

(Figure: JSA block diagram; see figures/.)

Repository layout

jordan-spectral-attention/
├── jsa/                     # core JSA mixer
├── train/                   # MLX training scripts
├── configs/                 # copyable run configs
├── experiments/             # result summaries
├── figures/                 # block diagram
└── logs/                    # selected logs can be added here

Attribution statement

This repository establishes a public timestamped release of Jordan-Spectral Attention (JSA), proposed and implemented by Karimulla Saheb Naik.

License

MIT. See LICENSE.
