wave-density-130m / README.md
---
license: apache-2.0
datasets:
- allenai/c4
- HuggingFaceH4/ultrachat_200k
language:
- en
metrics:
- perplexity
pipeline_tag: text-generation
tags:
- attention
- transformer
- language-model
- wave-based
- efficient-attention
- research
- from-scratch
- causal-lm
- fft
model-index:
- name: Wave-Density Attention (WDA-130M-MOM)
results:
- task:
type: text-generation
dataset:
name: UltraChat
type: HuggingFaceH4/ultrachat_200k
metrics:
- name: Perplexity (Mean Evaluation)
type: perplexity
value: 20.39
- name: Perplexity (Best Checkpoint)
type: perplexity
value: 17.50
source:
name: Internal Evaluation
url: https://huggingface.co/H0ARK/wave-density-attention
---
# Wave-Density Attention (WDA) — 130M Parameter Language Model
This repository contains a 130M parameter causal language model built with Wave-Density Attention (WDA), a novel alternative to standard dot-product self-attention.
WDA reframes attention as a wave-interference and density-rendering process, replacing the traditional $QK^\top$ similarity computation with learned frequency-based interactions. This allows attention patterns to emerge from constructive and destructive interference rather than explicit pairwise dot products.
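As a rough illustration of this idea (a toy sketch only — the projections, frequency count, and normalization below are invented for illustration and are not the actual WDA implementation), pairwise scores can be made to emerge from interference of per-token waves rather than dot products:

```python
import torch

def toy_interference_attention(x, n_freqs=8):
    """Toy sketch: score token pairs by wave interference.

    Each token gets one phase per frequency; pairwise scores are summed
    cosines of phase differences (constructive vs. destructive
    interference) instead of a QK^T dot product.
    """
    T, d = x.shape
    # Random projection standing in for a learned frequency/phase map
    phase_proj = torch.randn(d, n_freqs)
    phases = x @ phase_proj                            # (T, n_freqs)
    # Interference score: sum over frequencies of cos(phase_i - phase_j)
    diff = phases.unsqueeze(1) - phases.unsqueeze(0)   # (T, T, n_freqs)
    scores = torch.cos(diff).sum(-1)                   # (T, T)
    # Causal mask and normalization, as in a standard decoder
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

out = toy_interference_attention(torch.randn(10, 16))
```

In-phase tokens reinforce each other (high scores), out-of-phase tokens cancel — the "attention surface" is shaped by the phase geometry rather than by explicit similarity.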
## Model Overview
- **Architecture**: Decoder-only Transformer with Wave-Density Attention
- **Parameters**: ~130M
- **Context Length**: 256 tokens
- **Attention Mechanism**: Wave-Density Attention (Mixture-of-Masks via learned wave bases)
- **Training Regime**: From scratch
## Training Data
- **Primary**: UltraChat 200k (instruction-style supervision)
- **Initialization / Mixing**: Streaming C4 (broad web text)
This combination provides both general language coverage and instruction-following coherence, while allowing the WDA mechanism to learn stable long-range structure.
## Performance
- **Validation Loss (UltraChat)**: ~2.86
- **Equivalent Perplexity**: ~17.5–20 (best checkpoints)
- **Model Size**: 130M parameters
Despite using a fundamentally different attention formulation, WDA achieves competitive perplexity and strong qualitative coherence at this scale.
## Usage
To use this model, install or clone the reference implementation from the official repository:
👉 [**Wave-Density Attention code**](https://github.com/H0ARK/wave-density-attention)
Example loading snippet:
```python
from wave_dencity import WaveCharLM
import torch
import json
# Load model configuration
with open("config.json", "r") as f:
    config = json.load(f)

model = WaveCharLM(**config)

# Load weights (assumes checkpoint keys match WaveCharLM's state dict)
from safetensors.torch import load_file
model.load_state_dict(load_file("model.safetensors"))
model.eval()
```
Note: This model is intended for research and experimentation with alternative attention mechanisms. The codebase exposes WDA internals for inspection and modification.
## Why Wave-Density Attention?
Traditional attention relies on sharp token-to-token similarity. WDA instead:
- Uses frequencies as a representational tool
- Produces attention surfaces via interference patterns
- Selects among multiple learned attention masks dynamically (Mixture-of-Masks / “MoM”)
This approach avoids explicit dot-product similarity while still supporting coherent, causal language modeling.
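The Mixture-of-Masks idea can be sketched minimally as follows (an illustrative toy, not the official implementation — the class name, gating scheme, and shapes are assumptions): a small gate mixes a bank of learned attention-bias masks per query token.

```python
import torch
import torch.nn as nn

class ToyMixtureOfMasks(nn.Module):
    """Illustrative Mixture-of-Masks layer: each query token softly
    selects among several learnable attention masks."""

    def __init__(self, d_model, n_masks, max_len):
        super().__init__()
        # Bank of learnable attention-bias masks, (n_masks, T, T)
        self.masks = nn.Parameter(torch.randn(n_masks, max_len, max_len) * 0.02)
        self.gate = nn.Linear(d_model, n_masks)   # per-token mask selector
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (T, d_model)
        T = x.size(0)
        bank = self.masks[:, :T, :T]               # (n_masks, T, T)
        gates = torch.softmax(self.gate(x), dim=-1)        # (T, n_masks)
        # Each query row mixes the mask bank according to its gate
        scores = torch.einsum("qm,mqt->qt", gates, bank)   # (T, T)
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
        scores = scores.masked_fill(~causal, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return weights @ self.value(x)

layer = ToyMixtureOfMasks(d_model=16, n_masks=4, max_len=32)
out = layer(torch.randn(10, 16))
```

The gate makes mask selection input-dependent, so different tokens can route through different attention patterns without computing pairwise similarities.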
## Citation
If you use this model or the Wave-Density Attention mechanism in your work, please cite the official repository and paper.