---
license: apache-2.0
datasets:
- allenai/c4
- HuggingFaceH4/ultrachat_200k
language:
- en
metrics:
- perplexity
pipeline_tag: text-generation
tags:
- attention
- transformer
- language-model
- wave-based
- efficient-attention
- research
- from-scratch
- causal-lm
- fft
model-index:
- name: Wave-Density Attention (WDA-130M-MOM)
  results:
  - task:
      type: text-generation
    dataset:
      name: UltraChat
      type: HuggingFaceH4/ultrachat_200k
    metrics:
    - name: Perplexity (Mean Evaluation)
      type: perplexity
      value: 20.39
    - name: Perplexity (Best Checkpoint)
      type: perplexity
      value: 17.50
    source:
      name: Internal Evaluation
      url: https://huggingface.co/H0ARK/wave-density-attention
---

# Wave-Density Attention (WDA) — 130M Parameter Language Model

This repository contains a 130M parameter causal language model built with Wave-Density Attention (WDA), a novel alternative to standard dot-product self-attention.

WDA reframes attention as a wave-interference and density-rendering process, replacing the traditional $QK^\top$ similarity computation with learned frequency-based interactions. This allows attention patterns to emerge from constructive and destructive interference rather than explicit pairwise dot products.

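To make the interference idea concrete, here is a minimal, self-contained sketch of how an attention surface can arise from overlapping waveforms rather than query-key dot products. This is an illustration only, not the repository's implementation; the parameter names (`freq`, `phase`) and the sampling grid are assumptions.

```python
import torch

T, n_waves, n_samples = 8, 16, 32  # toy sizes: tokens, wave components, grid points

# Hypothetical learned parameters: each token emits a superposition of cosines.
freq = torch.randn(T, n_waves)   # per-token frequencies (assumed parameterization)
phase = torch.randn(T, n_waves)  # per-token phases

t = torch.linspace(0.0, 1.0, n_samples)                    # sampling grid
waves = torch.cos(freq[..., None] * t + phase[..., None])  # (T, n_waves, n_samples)
signal = waves.sum(dim=1)                                  # (T, n_samples) superposed waveform

# Overlap between two tokens' waveforms is large under constructive interference
# and small (or negative) under destructive interference; it replaces the QK^T score.
scores = signal @ signal.T                                 # (T, T)

# Render the interference surface into causal attention weights.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
attn = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)
```

In the actual model the wave parameters are learned end-to-end; consult the repository for the exact formulation.
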
⸻

## Model Overview
- **Architecture**: Decoder-only Transformer with Wave-Density Attention
- **Parameters**: ~130M
- **Context Length**: 256 tokens
- **Attention Mechanism**: Wave-Density Attention (Mixture-of-Masks via learned wave bases)
- **Training Regime**: From scratch

## Training Data
- **Primary**: UltraChat 200k (instruction-style supervision)
- **Initialization / Mixing**: Streaming C4 (broad web text)

This combination provides both general language coverage and instruction-following coherence, while allowing the WDA mechanism to learn stable long-range structure.

⸻

## Performance
- **Validation Loss (UltraChat)**: ~2.86
- **Perplexity**: 17.50 (best checkpoint); 20.39 (mean across evaluations)
- **Model Size**: ~130M parameters

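Perplexity here is the exponentiated cross-entropy loss, $\mathrm{PPL} = e^{\mathcal{L}}$, so the ~2.86 validation loss corresponds to $e^{2.86} \approx 17.5$, matching the best-checkpoint figure.
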
Despite using a fundamentally different attention formulation, WDA achieves competitive perplexity and strong qualitative coherence at this scale.

⸻

## Usage

To use this model, install or clone the reference implementation from the official repository:

👉 [**Wave-Density Attention code**](https://github.com/H0ARK/wave-density-attention)

Example loading snippet:

```python
import json

import torch
from safetensors.torch import load_file  # pip install safetensors
from wave_dencity import WaveCharLM      # note the module spelling used by the reference implementation

# Load the model configuration stored alongside the weights
with open("config.json", "r") as f:
    config = json.load(f)

model = WaveCharLM(**config)

# Load weights from model.safetensors
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
model.eval()
```

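Once loaded, sampling requires a small decoding loop. The sketch below is a generic autoregressive sampler and assumes `model(ids)` returns next-token logits of shape `(batch, seq, vocab)`; the real `WaveCharLM` forward signature may differ, so treat this as a template rather than the confirmed API.

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=50, temperature=0.8):
    for _ in range(max_new_tokens):
        ctx = input_ids[:, -256:]        # stay within the 256-token context window
        logits = model(ctx)              # assumed output: (batch, seq, vocab)
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id], dim=1)
    return input_ids

# Example (placeholder token id 0 as the prompt):
# out = generate(model, torch.zeros(1, 1, dtype=torch.long))
```
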
Note: This model is intended for research and experimentation with alternative attention mechanisms. The codebase exposes WDA internals for inspection and modification.

⸻

## Why Wave-Density Attention?

Traditional attention relies on sharp token-to-token similarity. WDA instead:
- Uses frequencies as a representational tool
- Produces attention surfaces via interference patterns
- Dynamically selects among multiple learned attention masks (Mixture-of-Masks, “MoM”); see the sketch below

This approach avoids explicit dot-product similarity while still supporting coherent, causal language modeling.

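As a toy rendering of the Mixture-of-Masks idea: keep K candidate attention surfaces and blend them with a softmax gate. The shapes and the static gate below are assumptions for illustration; in WDA the masks derive from learned wave bases and the gating may be input-dependent.

```python
import torch

T, K = 8, 4                          # toy sizes: tokens, number of masks
mask_logits = torch.randn(K, T, T)   # K candidate attention surfaces (assumed shape)
gate_logits = torch.randn(K)         # one gate score per mask (assumed static gate)

gate = torch.softmax(gate_logits, dim=0)              # mixture weights over masks
mixed = torch.einsum("k,kij->ij", gate, mask_logits)  # blended surface (T, T)

# Causal masking plus softmax yields the final attention weights.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
attn = torch.softmax(mixed.masked_fill(~causal, float("-inf")), dim=-1)
```
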
⸻

## Citation

If you use this model or the Wave-Density Attention mechanism in your work, please cite the official repository and paper.