|
|
---
language:
- en
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- text-generation
- causal-lm
- transformers
- nanohammer
- holographic-embeddings
- state-space
- efficient-attention
- long-context
pipeline_tag: text-generation
model-index:
- name: NanoHammer-1.5B-Instruct
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (ARC-Challenge)
      type: arc_challenge
    metrics:
    - type: acc_norm
      value: 33.28
      name: normalized accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (ARC-Easy)
      type: arc_easy
    metrics:
    - type: acc
      value: 59.81
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - type: acc_norm
      value: 56.33
      name: normalized accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: PIQA
      type: piqa
    metrics:
    - type: acc
      value: 69.86
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: WinoGrande
      type: winogrande
    metrics:
    - type: acc
      value: 57.14
      name: accuracy
---
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
# NanoHammer-1.5B-Instruct
|
|
|
|
|
**Explicit Causal Modeling with Holographic Integral State Compression** |
|
|
|
|
|
*A novel hybrid architecture combining Transformer attention with O(1) global causal state* |
|
|
|
|
|
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
|
|
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## Key Innovation: Explicit Causal Modeling
|
|
|
|
|
NanoHammer introduces a **hybrid architecture** that augments standard Transformer layers with an **explicit causal state mechanism**. Unlike traditional attention, which learns causal dependencies implicitly across O(n²) token pairs, NanoHammer maintains a **single global state token** that explicitly captures and propagates causal information through the sequence.
|
|
|
|
|
### Core Advantages
|
|
|
|
|
| Feature | Traditional Attention | NanoHammer |
|---------|-----------------------|------------|
| **Causal Modeling** | Implicit (learned) | **Explicit (structured)** |
| **Global State Complexity** | O(n²) pairwise | **O(1) constant** |
| **Extrapolation Cost** | Grows with sequence length | **Constant O(1)** |
| **Long Context Efficiency** | Quadratic scaling | **Quadratic attention + O(1) global state** |
| **State Compression** | Distributed across KV cache | **Single-token compression** |
|
|
|
|
|
### Technical Breakthrough
|
|
|
|
|
```
Traditional Transformer:              NanoHammer Architecture:
Token₁ → Attention → Token₁'          Token₁ ──→ State Update → S(t)
Token₂ → Attention → Token₂'                     ↓
Token₃ → Attention → Token₃'          [S(t)] + [Token₁...Token_n] → Attention → Output
...       O(n²)                       O(1) + O(n²) = O(n²)
Token_n → Attention → Token_n'        But with global causal context!
```
|
|
|
|
|
The state token **S(t)** acts as a **causal information accumulator**, providing: |
|
|
- **Holographic encoding**: Position-aware via complex-domain rotations (e^(iθ))
|
|
- **Fixed-point iteration**: Multi-head Euler method for stable state evolution |
|
|
- **Constant extrapolation**: New tokens always interact with O(1) state, not O(n) history |
|
|
|
|
|
--- |
|
|
|
|
|
## Quick Start
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
import torch |
|
|
|
|
|
# Load model |
|
|
model_path = "NoesisLab/NanoHammer-1.5B-Instruct" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_path, |
|
|
trust_remote_code=True, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
) |
|
|
|
|
|
# Generate response |
|
|
prompt = "Explain the concept of causality in physics." |
|
|
messages = [{"role": "user", "content": prompt}] |
|
|
|
|
|
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer(input_text, return_tensors="pt").to(model.device) |
|
|
|
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=256, |
|
|
temperature=0.7, |
|
|
do_sample=True, |
|
|
top_p=0.9, |
|
|
) |
|
|
|
|
|
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
### Multi-turn Conversation |
|
|
|
|
|
```python |
|
|
messages = [ |
|
|
{"role": "user", "content": "What is a holographic state?"}, |
|
|
{"role": "assistant", "content": "A holographic state is a compressed representation that encodes global information..."}, |
|
|
{"role": "user", "content": "How does it differ from traditional attention?"} |
|
|
] |
|
|
|
|
|
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
# ... generate as above |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Architecture Details
|
|
|
|
|
### Hybrid Decoder Layer Flow |
|
|
|
|
|
Each NanoHammer decoder layer executes the following pipeline: |
|
|
|
|
|
```
Input Tokens (T tokens)
        ↓
[1] State Update Cell
    • Multi-head fixed-point iteration: S_{t+1} = S_t + α·f(S_t)
    • Learnable per-head step sizes
    • Pre-norm → MLP → Post-norm
        ↓
[2] State Token Projection
    • Project state_hidden_size (512) → hidden_size (2048)
    • Create global "state token" encoding causal history
        ↓
[3] State Token Injection
    • Prepend state token: [S(t)] + [Token₁, ..., Token_T]
    • Sequence length: T → T+1
        ↓
[4] Llama Self-Attention
    • Standard Llama attention over T+1 tokens
    • GQA: 32 query heads, 8 KV heads
    • RoPE position encoding
        ↓
[5] Llama MLP
    • SwiGLU activation
    • 2048 → 8192 → 2048
        ↓
[6] State Token Removal
    • Extract and remove state token
    • Return T tokens
        ↓
Output Tokens (T tokens)
```
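Steps [2], [3], and [6] can be sketched in a few lines of PyTorch. This is an illustrative sketch only: the tensor names and the standalone `state_projection` linear layer are stand-ins, not the repository's actual module API.

```python
import torch
import torch.nn as nn

hidden_size, state_hidden_size = 2048, 512
state_projection = nn.Linear(state_hidden_size, hidden_size)  # step [2]

tokens = torch.randn(1, 7, hidden_size)     # T = 7 token hidden states
state = torch.randn(1, state_hidden_size)   # global causal state S(t)

state_token = state_projection(state).unsqueeze(1)   # (1, 1, 2048)
augmented = torch.cat([state_token, tokens], dim=1)  # step [3]: (1, T+1, 2048)
# ... steps [4]-[5]: standard Llama attention + MLP run over the T+1 positions ...
output = augmented[:, 1:, :]                         # step [6]: drop the state token
assert output.shape == tokens.shape                  # back to T tokens
```

Because the state token occupies exactly one position, the extra attention cost per layer is O(T) rather than O(T²).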
|
|
|
|
|
### Core Components |
|
|
|
|
|
#### 1. **HolographicRotaryEmbedding**
|
|
```python |
|
|
# Complex-domain rotational encoding |
|
|
x_i * e^(i*θ_k)   where θ_k = position_id / (10000^(2k/d))
|
|
``` |
|
|
- Encodes **absolute positions** in complex space |
|
|
- Enables **inverse rotation** for relative coordinate transformations |
|
|
- Maintains **temporal coherence** across state updates |
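The rotation principle can be illustrated with a minimal NumPy sketch. The function name, the pairing of consecutive features into complex numbers, and the shapes are assumptions for exposition, not the repository's implementation:

```python
import numpy as np

def holographic_rotate(x, position_id, inverse=False):
    """Rotate feature pairs of x by position-dependent angles (illustrative).

    theta_k = position_id / 10000**(2k/d), applied as multiplication by
    e^(i*theta_k) in the complex plane; inverse=True undoes the rotation.
    """
    d = x.shape[-1]
    k = np.arange(d // 2)
    theta = position_id / (10000.0 ** (2 * k / d))
    if inverse:
        theta = -theta
    # Treat consecutive feature pairs as complex numbers and rotate them.
    z = x[..., 0::2] + 1j * x[..., 1::2]
    z = z * np.exp(1j * theta)
    out = np.empty_like(x)
    out[..., 0::2] = z.real
    out[..., 1::2] = z.imag
    return out

x = np.random.randn(8)
rotated = holographic_rotate(x, position_id=5)
restored = holographic_rotate(rotated, position_id=5, inverse=True)
assert np.allclose(restored, x)  # inverse rotation recovers the input
```

The round-trip assertion demonstrates the "inverse rotation" property: absolute-position encodings can be undone to recover relative coordinates.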
|
|
|
|
|
#### 2. **StateUpdateCell**
|
|
```python |
|
|
# Multi-head Euler iteration |
|
|
for head in range(num_state_heads): |
|
|
S_new[head] = S[head] + step_size[head] * MLP(LayerNorm(S[head])) |
|
|
``` |
|
|
- **16 independent state heads** (512-dim total) |
|
|
- **Learnable step sizes** per head for adaptive evolution |
|
|
- **Pre-norm + MLP + Post-norm** architecture for stability |
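A minimal sketch of this update rule follows, using the head count and state size from the spec table. The MLP width, activation, and step-size initialization are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class StateUpdateCellSketch(nn.Module):
    """Illustrative multi-head Euler step: S <- S + alpha_h * f(norm(S))."""
    def __init__(self, state_hidden_size=512, num_state_heads=16):
        super().__init__()
        self.num_heads = num_state_heads
        self.head_dim = state_hidden_size // num_state_heads
        self.pre_norm = nn.LayerNorm(self.head_dim)
        self.mlp = nn.Sequential(
            nn.Linear(self.head_dim, 4 * self.head_dim),  # expansion ratio assumed
            nn.GELU(),                                    # activation assumed
            nn.Linear(4 * self.head_dim, self.head_dim),
        )
        # One learnable step size per head (small init assumed for stability).
        self.step_size = nn.Parameter(torch.full((num_state_heads, 1), 0.1))
        self.post_norm = nn.LayerNorm(self.head_dim)

    def forward(self, state):                        # state: (batch, 512)
        s = state.view(-1, self.num_heads, self.head_dim)
        s = s + self.step_size * self.mlp(self.pre_norm(s))  # Euler step per head
        return self.post_norm(s).view_as(state)

cell = StateUpdateCellSketch()
s = torch.randn(2, 512)
assert cell(s).shape == (2, 512)  # state size is fixed regardless of context
```

The per-head step sizes let each of the 16 heads evolve at its own rate, which is the "adaptive evolution" described above.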
|
|
|
|
|
#### 3. **StateTokenProjection**
|
|
```python |
|
|
# Compress global state into single token |
|
|
state_token = Linear(state_hidden_size=512 → hidden_size=2048)
|
|
``` |
|
|
- **Dimensional expansion**: 512 → 2048
|
|
- **Single token** represents entire causal history |
|
|
- **O(1) memory footprint** regardless of sequence length |
|
|
|
|
|
### Model Specifications |
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| **Total Parameters** | ~1.5B |
| **Hidden Size** | 2048 |
| **Intermediate Size** | 8192 |
| **Num Layers** | 16 |
| **Attention Heads** | 32 (query) / 8 (KV, GQA) |
| **State Heads** | 16 |
| **State Hidden Size** | 512 |
| **Vocab Size** | 128,256 |
| **Max Position Embeddings** | 131,072 |
| **RoPE Theta** | 500,000 |
|
|
|
|
|
--- |
|
|
|
|
|
## Performance Characteristics
|
|
|
|
|
### Computational Complexity |
|
|
|
|
|
| Operation | Complexity | Description |
|-----------|------------|-------------|
| **State Update** | O(1) | Fixed-size state iteration |
| **State Projection** | O(1) | Single-token transformation |
| **Self-Attention** | O(n²) | Standard Transformer attention |
| **Total per Layer** | **O(n²)** | Dominated by attention (as expected) |
|
|
|
|
|
**Key Insight**: While overall complexity remains O(n²) due to attention, the **state mechanism adds negligible overhead** while providing **explicit causal modeling** that is:
|
|
- **Near-free during inference**: State-update cost is independent of context length
|
|
- **Efficient for extrapolation**: New tokens interact with O(1) state, not O(n) history |
|
|
- **Globally coherent**: Single state token ensures causal consistency |
|
|
|
|
|
### Memory Efficiency |
|
|
|
|
|
``` |
|
|
Traditional KV Cache:  O(n * d * L)   [n tokens × d dims × L layers]
NanoHammer State:      O(d_s * L)     [512 dims × 16 layers = 8,192 values, constant]
|
|
``` |
|
|
|
|
|
The holographic state acts as a **learned compression** of causal history: |
|
|
- **Constant size** regardless of sequence length |
|
|
- **Accumulated knowledge** from all previous tokens |
|
|
- **Efficient transfer** across generation steps |
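A back-of-envelope comparison makes the gap concrete. The numbers below assume bf16 (2 bytes per value) and ignore the GQA reduction of the KV cache; they are illustrative estimates, not measurements:

```python
# Illustrative memory estimate at bf16 (2 bytes/value); GQA reduction ignored.
n, d, L = 8192, 2048, 16            # context length, hidden size, num layers
kv_cache_bytes = 2 * n * d * L * 2  # keys + values, 2 bytes each
state_bytes = 512 * L * 2           # one 512-dim state per layer

print(f"KV cache: {kv_cache_bytes / 2**20:.0f} MiB")  # 1024 MiB, grows with n
print(f"State:    {state_bytes / 2**10:.0f} KiB")     # 16 KiB, constant in n
```

Doubling the context doubles the KV-cache estimate, while the state term does not change.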
|
|
|
|
|
--- |
|
|
|
|
|
## Benchmark Results
|
|
|
|
|
NanoHammer has been evaluated on standard language understanding benchmarks using the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) framework (0-shot evaluation). |
|
|
|
|
|
### Common Sense Reasoning & Knowledge |
|
|
|
|
|
| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| **ARC-Challenge** | 1 | acc | 29.61% | ±1.33% |
| | | acc_norm | **33.28%** | ±1.38% |
| **ARC-Easy** | 1 | acc | **59.81%** | ±1.01% |
| | | acc_norm | 55.68% | ±1.02% |
| **HellaSwag** | 1 | acc | 42.65% | ±0.49% |
| | | acc_norm | **56.33%** | ±0.49% |
| **PIQA** | 1 | acc | **69.86%** | ±1.07% |
| | | acc_norm | **69.86%** | ±1.07% |
| **WinoGrande** | 1 | acc | **57.14%** | ±1.39% |
|
|
|
|
|
### Performance Summary |
|
|
|
|
|
``` |
|
|
Average Accuracy (normalized): 54.86% |
|
|
- Strong performance on physical reasoning (PIQA: 69.86%) |
|
|
- Competitive commonsense reasoning (HellaSwag: 56.33%, WinoGrande: 57.14%) |
|
|
- Moderate performance on knowledge-intensive tasks (ARC: 33-60%) |
|
|
``` |
|
|
|
|
|
**Key Observations:** |
|
|
- The model demonstrates **strong physical and commonsense reasoning** capabilities despite the novel architecture |
|
|
- Performance is competitive with other 1-2B parameter models in the same class |
|
|
- The explicit causal state mechanism does not compromise standard language understanding benchmarks |
|
|
- Results suggest the holographic state successfully captures relevant semantic information |
|
|
|
|
|
### Evaluation Details |
|
|
|
|
|
**Setup:** |
|
|
- Evaluation framework: `lm-evaluation-harness` |
|
|
- Shot configuration: 0-shot (no few-shot examples) |
|
|
- Decoding: Greedy
|
|
- Batch size: Auto |
|
|
|
|
|
**Reproducing Results:** |
|
|
```bash |
|
|
# Install lm-eval |
|
|
pip install lm-eval |
|
|
|
|
|
# Run evaluation |
|
|
lm_eval --model hf \ |
|
|
--model_args pretrained=NoesisLab/NanoHammer-1.5B-Instruct,trust_remote_code=True \ |
|
|
--tasks arc_challenge,arc_easy,hellaswag,piqa,winogrande \ |
|
|
--batch_size auto \ |
|
|
--output_path results/ |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Training
|
|
|
|
|
### Base Model & Weight Transfer |
|
|
|
|
|
NanoHammer initializes from **Llama-3.2-1B-Instruct** via selective weight transfer: |
|
|
|
|
|
**Frozen Components** (from Llama): |
|
|
- Token embeddings (`embed_tokens`) |
|
|
- Language modeling head (`lm_head`) |
|
|
- Self-attention layers (`self_attn`) |
|
|
- MLP layers (`mlp`) |
|
|
- All RMS layer norms |
|
|
|
|
|
**Trainable Components** (NanoHammer-specific): |
|
|
- `token_to_state`: Projects input tokens → state space
|
|
- `holographic_rope`: Position encoding for state |
|
|
- `state_cell`: State update mechanism (per layer) |
|
|
- `state_projection`: State → hidden projection (per layer)
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Dataset**: High-quality instruction-following data |
|
|
- **Precision**: BF16 mixed precision |
|
|
- **Optimization**: AdamW with cosine LR schedule |
|
|
- **Gradient Checkpointing**: Enabled for memory efficiency |
|
|
- **Batch Size**: Scaled with gradient accumulation |
|
|
- **Max Sequence Length**: 2048 tokens (extendable to 131K via RoPE) |
|
|
|
|
|
--- |
|
|
|
|
|
## Why NanoHammer?
|
|
|
|
|
### Problem: Implicit vs Explicit Causal Modeling |
|
|
|
|
|
Traditional Transformers learn causal dependencies **implicitly** through attention weights: |
|
|
``` |
|
|
Q @ K^T → Attention weights → Implicitly capture "what depends on what"
|
|
``` |
|
|
|
|
|
**Limitations**: |
|
|
- Causality is **distributed** across n² attention scores
|
|
- **No explicit structure** for causal information flow |
|
|
- **Quadratic cost** to maintain global context |
|
|
- **Poor extrapolation** to longer sequences |
|
|
|
|
|
### Solution: Holographic Integral State |
|
|
|
|
|
NanoHammer introduces an **explicit causal state token**: |
|
|
``` |
|
|
S(t): accumulated causal information from all previous tokens
  • updated via fixed-point iteration with temporal encoding
  • participates in attention as a "global context token"
|
|
``` |
|
|
|
|
|
**Benefits**: |
|
|
- Causality is **explicit** in a structured state representation |
|
|
- **O(1) state size** provides constant-cost global context |
|
|
- **Natural extrapolation** to unseen sequence lengths |
|
|
- **Interpretable**: State token can be analyzed/visualized |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture Diagram
|
|
|
|
|
```
┌─────────────────────────────────────────────────────────┐
│  Input: "What is the capital of France?"                │
│  Tokens: [What, is, the, capital, of, France, ?]        │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
                 Token Embeddings
                         │
                         ▼
            ┌────────────────────────┐
            │  Token-to-State Proj   │  Project to state space
            └────────────┬───────────┘
                         │
            ┌────────────▼───────────┐
            │    Holographic RoPE    │  Apply position encoding
            │    (Complex rotation)  │
            └────────────┬───────────┘
                         │
              ┌──────────▼──────────┐
              │     Layer 1-16      │  (Repeated 16 times)
              ├─────────────────────┤
              │   ┌─────────────┐   │
              │   │State Update │   │  S(t+1) = S(t) + α·f(S(t))
              │   │    Cell     │   │  [Fixed-point iteration]
              │   └──────┬──────┘   │
              │          │          │
              │   ┌──────▼──────┐   │
              │   │    State    │   │  Project 512 → 2048
              │   │  Projection │   │
              │   └──────┬──────┘   │
              │          │          │
              │ [S] + [T₁, ..., T_T]│  Prepend state token
              │          │          │
              │   ┌──────▼──────┐   │
              │   │    Llama    │   │  Standard attention
              │   │  Attention  │   │  over T+1 tokens
              │   └──────┬──────┘   │
              │          │          │
              │   ┌──────▼──────┐   │
              │   │    Llama    │   │  SwiGLU MLP
              │   │     MLP     │   │
              │   └──────┬──────┘   │
              │          │          │
              │     Remove [S]      │
              │          │          │
              └──────────┼──────────┘
                         │
              ┌──────────▼──────────┐
              │     Final Norm      │
              └──────────┬──────────┘
                         │
              ┌──────────▼──────────┐
              │       LM Head       │  Project to vocab
              └──────────┬──────────┘
                         │
                         ▼
        Output: "Paris" (logits over 128K vocab)
```
|
|
|
|
|
--- |
|
|
|
|
|
## Citation
|
|
|
|
|
If you use NanoHammer in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{nanohammer2025, |
|
|
title={NanoHammer: Explicit Causal Modeling with Holographic Integral State Compression}, |
|
|
author={NoesisLab}, |
|
|
year={2025}, |
|
|
howpublished={\url{https://huggingface.co/NoesisLab/NanoHammer-1.5B-Instruct}}, |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## License
|
|
|
|
|
This model is released under the **Apache 2.0** license. Note that the base model, Llama-3.2-1B-Instruct, is distributed under Meta's Llama 3.2 Community License, whose terms may also apply to the derived weights.
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgments
|
|
|
|
|
- **Base Model**: Meta's Llama-3.2-1B-Instruct |
|
|
- **Inspiration**: State-space models, holographic memory, and causal inference theory |
|
|
- **Framework**: HuggingFace Transformers |
|
|
|
|
|
--- |
|
|
|
|
|
## Links
|
|
|
|
|
- **Model Card**: [NoesisLab/NanoHammer-1.5B-Instruct](https://huggingface.co/NoesisLab/NanoHammer-1.5B-Instruct) |
|
|
- **Paper**: Coming soon |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Built with ❤️ by NoesisLab**
|
|
|
|
|
*Advancing causal modeling in large language models* |
|
|
|
|
|
</div> |
|
|
|