---
license: apache-2.0
datasets:
  - allenai/c4
  - HuggingFaceH4/ultrachat_200k
language:
  - en
metrics:
  - perplexity
pipeline_tag: text-generation
tags:
  - attention
  - transformer
  - language-model
  - wave-based
  - efficient-attention
  - research
  - from-scratch
  - causal-lm
  - fft
model-index:
  - name: Wave-Density Attention (WDA-130M-MOM)
    results:
      - task:
          type: text-generation
        dataset:
          name: UltraChat
          type: HuggingFaceH4/ultrachat_200k
        metrics:
          - name: Perplexity (Mean Evaluation)
            type: perplexity
            value: 20.39
          - name: Perplexity (Best Checkpoint)
            type: perplexity
            value: 17.5
        source:
          name: Internal Evaluation
          url: https://huggingface.co/H0ARK/wave-density-attention
---

# Wave-Density Attention (WDA): A 130M-Parameter Language Model

This repository contains a 130M parameter causal language model built with Wave-Density Attention (WDA), a novel alternative to standard dot-product self-attention.

WDA reframes attention as a wave-interference and density-rendering process, replacing the traditional $QK^\top$ similarity computation with learned frequency-based interactions. This allows attention patterns to emerge from constructive and destructive interference rather than explicit pairwise dot products.
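
The idea can be illustrated with a toy, stdlib-only sketch. This is not the repo's implementation; the frequencies and phases below are made-up values chosen for the demo. Scores between positions come from superposed cosines of the position offset, and a causal mask plus softmax turns the interference surface into attention weights:

```python
import math

# Toy illustration (NOT the repo's implementation): attention-like scores
# between positions i and j emerge from superposed cosines of the offset
# (i - j) instead of a QK^T dot product. Frequencies/phases are made up.
freqs = [0.3, 1.1, 2.7]
phases = [0.0, 0.5, -0.4]
T = 6  # tiny sequence length for the demo

def wave_score(i, j):
    # Components interfere constructively or destructively with the offset.
    return sum(math.cos(w * (i - j) + p) for w, p in zip(freqs, phases))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Causal mask: position i may only attend to positions j <= i.
scores = [[wave_score(i, j) if j <= i else float("-inf") for j in range(T)]
          for i in range(T)]
attn = [softmax(row) for row in scores]  # each row sums to 1
```

Because the score depends only on the offset and a small set of frequencies, the pattern is shared across positions, which is the efficiency angle relative to materializing full pairwise dot products.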

## Model Overview

- Architecture: Decoder-only Transformer with Wave-Density Attention
- Parameters: ~130M
- Context Length: 256 tokens
- Attention Mechanism: Wave-Density Attention (Mixture-of-Masks via learned wave bases)
- Training Regime: Trained from scratch
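
For intuition on the ~130M figure, here is a back-of-the-envelope parameter count. The shape used below (GPT-2-small-like: 768-dim, 12 layers, ~50k vocab) is an assumption for illustration, not this repo's actual configuration:

```python
# Back-of-the-envelope parameter count for a ~130M-class decoder-only
# Transformer. The shape below (GPT-2-small-like) is an assumption for
# illustration, NOT this repo's actual configuration.
d_model, n_layers, vocab_size = 768, 12, 50257

per_layer = 12 * d_model ** 2        # ~4*d^2 attention + ~8*d^2 MLP weights
embedding = vocab_size * d_model     # token embeddings (often tied to output)
total = n_layers * per_layer + embedding
print(f"{total / 1e6:.1f}M")         # 123.5M, i.e. the ~130M class
```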

## Training Data

- Primary: UltraChat 200k (instruction-style supervision)
- Initialization / Mixing: Streaming C4 (broad web text)

This combination provides both general language coverage and instruction-following coherence, while allowing the WDA mechanism to learn stable long-range structure.
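
The mixing described above can be sketched as probability-weighted sampling from two streams. The code below is a toy stand-in: the 0.8 mixing weight, stream names, and example strings are assumptions, not the actual training recipe:

```python
import random

def mix_streams(primary, secondary, p_primary=0.8, seed=0):
    """Toy stand-in for dataset interleaving: draw the next example from
    `primary` with probability p_primary, otherwise from `secondary`."""
    rng = random.Random(seed)
    primary, secondary = iter(primary), iter(secondary)
    while True:
        source = primary if rng.random() < p_primary else secondary
        try:
            yield next(source)
        except StopIteration:  # stop when either stream runs dry
            return

# Hypothetical stand-ins for the UltraChat and streaming-C4 iterators
ultrachat = (f"chat-{i}" for i in range(1000))
c4 = (f"web-{i}" for i in range(1000))

mixer = mix_streams(ultrachat, c4, p_primary=0.8, seed=0)
batch = [next(mixer) for _ in range(10)]
```

In practice the Hugging Face `datasets` library offers `interleave_datasets` with a `probabilities` argument for exactly this pattern over streaming datasets.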

## Performance

- Validation Loss (UltraChat): ~2.86
- Equivalent Perplexity: ~17.5–20 (best checkpoints)
- Model Size: 130M parameters
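
The loss-to-perplexity relationship behind these numbers is just exponentiation of the mean cross-entropy (in nats):

```python
import math

# The reported validation loss is a mean cross-entropy in nats, so
# perplexity is its exponential.
val_loss = 2.86
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")  # 17.46, matching the ~17.5 best-checkpoint figure
```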

Despite using a fundamentally different attention formulation, WDA achieves competitive perplexity and strong qualitative coherence at this scale.

## Usage

To use this model, install or clone the reference implementation from the official repository:

👉 Wave-Density Attention code

Example loading snippet:

```python
import json

from safetensors.torch import load_file
from wave_dencity import WaveCharLM

# Load the model configuration
with open("config.json", "r") as f:
    config = json.load(f)

# Instantiate the model and load the trained weights
model = WaveCharLM(**config)
model.load_state_dict(load_file("model.safetensors"))
model.eval()
```

Note: This model is intended for research and experimentation with alternative attention mechanisms. The codebase exposes WDA internals for inspection and modification.

## Why Wave-Density Attention?

Traditional attention relies on sharp token-to-token similarity scores. WDA instead:

- Uses learned frequencies as its representational basis
- Produces attention surfaces via interference patterns
- Selects among multiple learned attention masks dynamically (Mixture-of-Masks, "MoM")

This approach avoids explicit dot-product similarity while still supporting coherent, causal language modeling.
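
A minimal sketch of the Mixture-of-Masks idea follows; the mask shapes, gate values, and names are all hypothetical, not the repo's API. A softmax gate mixes several precomputed attention masks rather than computing fresh pairwise scores:

```python
import math

# Hypothetical Mixture-of-Masks step (values are illustrative, not the
# repo's API): a softmax gate mixes fixed attention masks per token.
masks = [
    [1.0, 0.0, 0.0],        # attend to self only
    [0.5, 0.5, 0.0],        # self + previous token
    [1/3, 1/3, 1/3],        # uniform over the visible prefix
]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

gate = softmax([0.2, 1.5, -0.3])  # per-token gate logits (made up)
mixed = [sum(g * mask[j] for g, mask in zip(gate, masks)) for j in range(3)]
# Because the gate is convex and each mask is a distribution,
# `mixed` is itself a valid attention distribution.
```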

## Citation

If you use this model or the Wave-Density Attention mechanism in your work, please cite the official repository and paper.