---
license: apache-2.0
datasets:
- allenai/c4
- HuggingFaceH4/ultrachat_200k
language:
- en
metrics:
- perplexity
pipeline_tag: text-generation
tags:
- attention
- transformer
- language-model
- wave-based
- efficient-attention
- research
- from-scratch
- causal-lm
- fft
model-index:
- name: Wave-Density Attention (WDA-130M-MOM)
  results:
  - task:
      type: text-generation
    dataset:
      name: UltraChat
      type: HuggingFaceH4/ultrachat_200k
    metrics:
    - name: Perplexity (Mean Evaluation)
      type: perplexity
      value: 20.39
    - name: Perplexity (Best Checkpoint)
      type: perplexity
      value: 17.50
    source:
      name: Internal Evaluation
      url: https://huggingface.co/H0ARK/wave-density-attention
---

# Wave-Density Attention (WDA) — 130M Parameter Language Model

This repository contains a 130M parameter causal language model built with Wave-Density Attention (WDA), a novel alternative to standard dot-product self-attention.

WDA reframes attention as a wave-interference and density-rendering process, replacing the traditional $QK^\top$ similarity computation with learned frequency-based interactions. This allows attention patterns to emerge from constructive and destructive interference rather than from explicit pairwise dot products.

⸻

## Model Overview

- **Architecture**: Decoder-only Transformer with Wave-Density Attention
- **Parameters**: ~130M
- **Context Length**: 256 tokens
- **Attention Mechanism**: Wave-Density Attention (Mixture-of-Masks via learned wave bases)
- **Training Regime**: From scratch

## Training Data

- **Primary**: UltraChat 200k (instruction-style supervision)
- **Initialization / Mixing**: Streaming C4 (broad web text)

This combination provides both general language coverage and instruction-following coherence, while allowing the WDA mechanism to learn stable long-range structure.
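The actual WDA layers live in the reference repository. As a rough intuition for how interference can replace dot-product scores, here is a minimal, hypothetical sketch (all names and shapes are illustrative, not the repository's API): each token carries learned per-frequency phases, pairwise scores come from how constructively the tokens' waves interfere, and a causal mask plus softmax turns the surface into attention weights.

```python
import torch

def toy_wave_scores(freqs, phases, positions):
    """Toy interference-based attention (illustrative only, not WDA itself).

    Token i emits waves cos(freq_k * t_i + phase_{i,k}); the score between
    tokens i and j sums the products of their wave values, so aligned waves
    interfere constructively (large score) and misaligned ones cancel.
    """
    t = positions.float()                                    # (T,)
    # each token's K wave values, evaluated at its own position
    waves = torch.cos(freqs[None, :] * t[:, None] + phases)  # (T, K)
    # pairwise interference: sum over the frequency bank
    scores = waves @ waves.T                                 # (T, T)
    # causal mask so each position only attends to the past
    T = t.shape[0]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1)

T, K = 6, 8
attn = toy_wave_scores(torch.rand(K) * 3.0, torch.randn(T, K), torch.arange(T))
print(attn.shape)  # torch.Size([6, 6])
```

Note there is no $QK^\top$ product anywhere: the only learned quantities in the sketch are frequencies and phases, which is the representational shift the model card describes.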
⸻

## Performance

- **Validation Loss (UltraChat)**: ~2.86
- **Equivalent Perplexity**: ~17.5–20 (best checkpoints)
- **Model Size**: 130M parameters

Despite using a fundamentally different attention formulation, WDA achieves competitive perplexity and strong qualitative coherence at this scale.

⸻

## Usage

To use this model, install or clone the reference implementation from the official repository:

👉 [**Wave-Density Attention code**](https://github.com/H0ARK/wave-density-attention)

Example loading snippet:

```python
from wave_dencity import WaveCharLM
import torch
import json

# Load the model configuration
with open("config.json", "r") as f:
    config = json.load(f)

model = WaveCharLM(**config)

# Load weights from model.safetensors
# (assumes the checkpoint keys match the module's state dict)
from safetensors.torch import load_file
model.load_state_dict(load_file("model.safetensors"))
model.eval()
```

Note: This model is intended for research and experimentation with alternative attention mechanisms. The codebase exposes WDA internals for inspection and modification.

⸻

## Why Wave-Density Attention?

Traditional attention relies on sharp token-to-token similarity. WDA instead:

- Uses frequencies as a representational tool
- Produces attention surfaces via interference patterns
- Selects among multiple learned attention masks dynamically (Mixture-of-Masks / "MoM")

This approach avoids explicit dot-product similarity while still supporting coherent, causal language modeling.

⸻

## Citation

If you use this model or the Wave-Density Attention mechanism in your work, please cite the official repository and paper.
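⸻

As a rough illustration of the Mixture-of-Masks idea mentioned earlier, here is a hypothetical sketch (class, shapes, and parameter names are assumptions for this example, not the repository's implementation): a small gate produces per-token weights over a bank of learned attention surfaces, which are blended, causally masked, and normalized.

```python
import torch
import torch.nn as nn

class ToyMixtureOfMasks(nn.Module):
    """Illustrative Mixture-of-Masks: blend M learned (T x T) attention
    surfaces with per-token gate weights. Not the repository's code."""

    def __init__(self, dim, num_masks, seq_len):
        super().__init__()
        self.gate = nn.Linear(dim, num_masks)  # per-token mask logits
        # bank of learned attention surfaces, one per mask
        self.masks = nn.Parameter(torch.randn(num_masks, seq_len, seq_len))

    def forward(self, x):
        # x: (B, T, D) token embeddings
        B, T, _ = x.shape
        w = torch.softmax(self.gate(x), dim=-1)  # (B, T, M) gate weights
        # per-query blended attention surface: sum_m w[b,t,m] * masks[m,t,j]
        surface = torch.einsum("btm,mtj->btj", w, self.masks[:, :T, :T])
        # causal constraint, then normalize into attention weights
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        surface = surface.masked_fill(~causal, float("-inf"))
        attn = torch.softmax(surface, dim=-1)    # (B, T, T)
        return attn @ x                          # mix value vectors

mom = ToyMixtureOfMasks(dim=32, num_masks=4, seq_len=16)
out = mom(torch.randn(2, 16, 32))
print(out.shape)  # torch.Size([2, 16, 32])
```

The key design point the bullets describe: selection among masks is dynamic (the gate depends on the token), while the masks themselves are learned offline as parameters, so no pairwise dot products between queries and keys are needed at inference.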