---
language:
  - en
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
  - text-generation
  - causal-lm
  - transformers
  - nanohammer
  - holographic-embeddings
  - state-space
  - efficient-attention
  - long-context
pipeline_tag: text-generation
model-index:
  - name: NanoHammer-1.5B-Instruct
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (ARC-Challenge)
          type: arc_challenge
        metrics:
          - type: acc_norm
            value: 33.28
            name: normalized accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (ARC-Easy)
          type: arc_easy
        metrics:
          - type: acc
            value: 59.81
            name: accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag
          type: hellaswag
        metrics:
          - type: acc_norm
            value: 56.33
            name: normalized accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: PIQA
          type: piqa
        metrics:
          - type: acc
            value: 69.86
            name: accuracy
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: WinoGrande
          type: winogrande
        metrics:
          - type: acc
            value: 57.14
            name: accuracy
---

# 🔨 NanoHammer-1.5B-Instruct

**Explicit Causal Modeling with Holographic Integral State Compression**

*A novel hybrid architecture combining Transformer attention with an O(1) global causal state*


## 🌟 Key Innovation: Explicit Causal Modeling

NanoHammer introduces a hybrid architecture that augments standard Transformer layers with an explicit causal state mechanism. Unlike traditional attention, which implicitly learns causal dependencies across O(n²) token pairs, NanoHammer maintains a single global state token that explicitly captures and propagates causal information through the sequence.

## 🎯 Core Advantages

| Feature | Traditional Attention | NanoHammer |
|---|---|---|
| Causal modeling | Implicit (learned) | Explicit (structured) |
| Global state complexity | O(n²) pairwise | O(1) constant |
| Extrapolation cost | Grows with sequence | Constant O(1) |
| Long-context efficiency | Quadratic scaling | Linear scaling |
| State compression | Distributed across KV cache | Single-token compression |

## 🔬 Technical Breakthrough

```
Traditional Transformer:         NanoHammer Architecture:
Token₁ → Attention → Token₁'     Token₁ ──→ State Update → S(t)
Token₂ → Attention → Token₂'                ↓
Token₃ → Attention → Token₃'     [S(t)] + [Token₁...Tokenₙ] → Attention → Output
  ...        O(n²)                       O(1)  +  O(n²)  =  O(n²)
Tokenₙ → Attention → Tokenₙ'     But with global causal context!
```

The state token S(t) acts as a causal information accumulator, providing:

- **Holographic encoding**: position-aware via complex-domain rotations (e^(iθ))
- **Fixed-point iteration**: multi-head Euler method for stable state evolution
- **Constant extrapolation**: new tokens always interact with the O(1) state, not the O(n) history

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model
model_path = "NoesisLab/NanoHammer-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a response
prompt = "Explain the concept of causality in physics."
messages = [{"role": "user", "content": prompt}]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

### Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is a holographic state?"},
    {"role": "assistant", "content": "A holographic state is a compressed representation that encodes global information..."},
    {"role": "user", "content": "How does it differ from traditional attention?"},
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ... generate as above
```

πŸ—οΈ Architecture Details

Hybrid Decoder Layer Flow

Each NanoHammer decoder layer executes the following pipeline:

Input Tokens (T tokens)
    ↓
[1] State Update Cell
    β€’ Multi-head fixed-point iteration: S_{t+1} = S_t + Ξ±Β·f(S_t)
    β€’ Learnable per-head step sizes
    β€’ Pre-norm β†’ MLP β†’ Post-norm
    ↓
[2] State Token Projection
    β€’ Project state_hidden_size (512) β†’ hidden_size (2048)
    β€’ Create global "state token" encoding causal history
    ↓
[3] State Token Injection
    β€’ Prepend state token: [S(t)] + [Token₁, ..., Tokenβ‚œ]
    β€’ Sequence length: T β†’ T+1
    ↓
[4] Llama Self-Attention
    β€’ Standard Llama attention over T+1 tokens
    β€’ GQA: 32 query heads, 8 KV heads
    β€’ RoPE position encoding
    ↓
[5] Llama MLP
    β€’ SwiGLU activation
    β€’ 2048 β†’ 8192 β†’ 2048
    ↓
[6] State Token Removal
    β€’ Extract and remove state token
    β€’ Return T tokens
    ↓
Output Tokens (T tokens)
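Under illustrative assumptions (NumPy stand-ins for the attention and MLP blocks, placeholder weights, toy dimensions; `nanohammer_layer` and every name inside it are hypothetical sketches, not the model's actual modules), the per-layer flow above can be outlined as:

```python
import numpy as np

def nanohammer_layer(tokens, state, alpha=0.1):
    """Illustrative decoder-layer flow: update the state, prepend it as a
    token, attend causally over T+1 tokens, then strip the state token."""
    T, d = tokens.shape
    d_s = state.shape[-1]

    # [1] State update cell: one Euler step, S <- S + alpha * f(S)
    state = state + alpha * np.tanh(state)            # tanh() stands in for the MLP

    # [2] Project the state (d_s -> d) into a single global "state token"
    W_proj = np.ones((d_s, d)) / d_s                  # placeholder projection weights
    state_token = state @ W_proj                      # shape (1, d)

    # [3] Prepend the state token: sequence length T -> T+1
    seq = np.vstack([state_token, tokens])

    # [4] Causal self-attention placeholder over T+1 tokens
    scores = seq @ seq.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T + 1, T + 1))) == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    seq = weights @ seq
    # ([5] Llama MLP omitted for brevity)

    # [6] Remove the state token; return T tokens plus the carried state
    return seq[1:], state

tokens = np.random.randn(4, 8)                        # T=4 toy tokens, d=8
state = np.random.randn(1, 6)                         # d_s=6 toy state
out, new_state = nanohammer_layer(tokens, state)
print(out.shape, new_state.shape)                     # (4, 8) (1, 6)
```

Note how the state is threaded through the layer and returned: this is what lets subsequent generation steps reuse an O(1) summary rather than re-deriving global context from the full history.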

### Core Components

#### 1️⃣ HolographicRotaryEmbedding

```
# Complex-domain rotational encoding
x_i * e^(i*θ_k)  where θ_k = position_id / (10000^(2k/d))
```

- Encodes absolute positions in complex space
- Enables inverse rotation for relative coordinate transformations
- Maintains temporal coherence across state updates
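As a minimal sketch of the rotation formula above (plain NumPy with a toy 4-dim vector; `holographic_rotate` is an illustrative name, not the model's implementation), pairs of dimensions are treated as complex numbers and rotated by a position-dependent angle:

```python
import numpy as np

def holographic_rotate(x, position_id, base=10000.0):
    """Rotate a d-dim vector in the complex domain: pair up dimensions as
    complex numbers and multiply by e^(i*theta_k), theta_k = pos / base^(2k/d)."""
    d = x.shape[-1]
    k = np.arange(d // 2)
    theta = position_id / (base ** (2 * k / d))
    z = x[0::2] + 1j * x[1::2]          # view (even, odd) pairs as complex numbers
    z = z * np.exp(1j * theta)          # position-dependent rotation
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

x = np.array([1.0, 0.0, 1.0, 0.0])
y = holographic_rotate(x, position_id=3)
# The inverse rotation (negative position) recovers the original vector:
x_back = holographic_rotate(y, position_id=-3)
print(np.allclose(x_back, x))  # True
```

The invertibility shown at the end is what the "inverse rotation for relative coordinate transformations" bullet refers to: rotations are norm-preserving and can be undone or composed to shift between absolute and relative positions.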

#### 2️⃣ StateUpdateCell

```
# Multi-head Euler iteration
for head in range(num_state_heads):
    S_new[head] = S[head] + step_size[head] * MLP(LayerNorm(S[head]))
```

- 16 independent state heads (512 dims total)
- Learnable per-head step sizes for adaptive evolution
- Pre-norm + MLP + post-norm architecture for stability
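A runnable sketch of the multi-head Euler iteration, under stated assumptions (random placeholder MLP weights, toy 32-dim heads; `state_update` and its helpers are illustrative, not the trained module):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def state_update(S, step_sizes, rng):
    """Multi-head Euler step: each head evolves independently with its own
    (learnable, here fixed) step size. MLP weights are random placeholders."""
    num_heads, head_dim = S.shape
    S_new = np.empty_like(S)
    for h in range(num_heads):
        W1 = rng.standard_normal((head_dim, head_dim)) * 0.02   # stand-in MLP
        W2 = rng.standard_normal((head_dim, head_dim)) * 0.02
        f = np.tanh(layer_norm(S[h]) @ W1) @ W2                  # pre-norm + MLP
        S_new[h] = S[h] + step_sizes[h] * f                      # Euler step
    return layer_norm(S_new)                                     # post-norm

rng = np.random.default_rng(0)
S = rng.standard_normal((16, 32))       # 16 heads x 32 dims (the model uses 512 total)
steps = np.full(16, 0.1)                # per-head step sizes
S_next = state_update(S, steps, rng)
print(S_next.shape)  # (16, 32)
```

The small step size keeps each update a mild perturbation of the previous state, which is the usual rationale for Euler-style fixed-point iteration being stable.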

#### 3️⃣ StateTokenProjection

```
# Compress the global state into a single token
state_token = Linear(state_hidden_size=512 → hidden_size=2048)
```

- Dimensional expansion: 512 → 2048
- A single token represents the entire causal history
- O(1) memory footprint regardless of sequence length

### Model Specifications

| Parameter | Value |
|---|---|
| Total parameters | ~1.5B |
| Hidden size | 2048 |
| Intermediate size | 8192 |
| Layers | 16 |
| Attention heads | 32 (query) / 8 (KV, GQA) |
| State heads | 16 |
| State hidden size | 512 |
| Vocab size | 128,256 |
| Max position embeddings | 131,072 |
| RoPE theta | 500,000 |

## ⚡ Performance Characteristics

### Computational Complexity

| Operation | Complexity | Description |
|---|---|---|
| State update | O(1) | Fixed-size state iteration |
| State projection | O(1) | Single-token transformation |
| Self-attention | O(n²) | Standard Transformer attention |
| Total per layer | O(n²) | Dominated by attention (as expected) |

**Key insight**: While the overall complexity remains O(n²) due to attention, the state mechanism adds negligible overhead while providing explicit causal modeling that is:

- **Effectively free during inference**: the state-update cost is independent of context length
- **Efficient for extrapolation**: new tokens interact with the O(1) state, not the O(n) history
- **Globally coherent**: a single state token ensures causal consistency

### Memory Efficiency

```
Traditional KV cache: O(n · d · L)   [n tokens × d dims × L layers]
NanoHammer state:     O(d_s · L)     [512 dims × 16 layers = 8,192 values, constant]
```

The holographic state acts as a learned compression of causal history:

- Constant size regardless of sequence length
- Accumulates knowledge from all previous tokens
- Transfers efficiently across generation steps
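A back-of-envelope comparison makes the scaling gap concrete. This counts elements rather than bytes and assumes K and V are cached at the full GQA width (8 KV heads × 64-dim heads, per the spec table); the function names are illustrative:

```python
# Figures from the spec table above: 16 layers, 8 KV heads x 64-dim heads, 512-dim state.
def kv_cache_elems(n_tokens, layers=16, kv_dims=8 * 64):
    """Elements held by a standard KV cache: K and V per token per layer."""
    return 2 * n_tokens * kv_dims * layers

def state_elems(layers=16, state_dim=512):
    """Elements held by the NanoHammer state: independent of n_tokens."""
    return state_dim * layers

print(state_elems())           # 8192 elements, at any context length
print(kv_cache_elems(1_000))   # 16384000 elements at 1K tokens
print(kv_cache_elems(100_000)) # 1638400000 elements at 100K tokens
```

The KV cache grows linearly with context (100× more tokens means 100× more cache), while the state stays at 8,192 values regardless.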

## 📊 Benchmark Results

NanoHammer has been evaluated on standard language-understanding benchmarks using the LM Evaluation Harness framework (0-shot evaluation).

### Common Sense Reasoning & Knowledge

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| ARC-Challenge | 1 | acc | 29.61% | ±1.33% |
| ARC-Challenge | 1 | acc_norm | 33.28% | ±1.38% |
| ARC-Easy | 1 | acc | 59.81% | ±1.01% |
| ARC-Easy | 1 | acc_norm | 55.68% | ±1.02% |
| HellaSwag | 1 | acc | 42.65% | ±0.49% |
| HellaSwag | 1 | acc_norm | 56.33% | ±0.49% |
| PIQA | 1 | acc | 69.86% | ±1.07% |
| PIQA | 1 | acc_norm | 69.86% | ±1.07% |
| WinoGrande | 1 | acc | 57.14% | ±1.39% |

### Performance Summary

**Average accuracy (normalized): 54.86%**

- Strong performance on physical reasoning (PIQA: 69.86%)
- Competitive commonsense reasoning (HellaSwag: 56.33%, WinoGrande: 57.14%)
- Moderate performance on knowledge-intensive tasks (ARC: 33-60%)

**Key observations**:

- The model demonstrates strong physical and commonsense reasoning despite the novel architecture
- Performance is competitive with other 1-2B-parameter models in the same class
- The explicit causal state mechanism does not compromise standard language-understanding benchmarks
- The results suggest the holographic state successfully captures relevant semantic information

### Evaluation Details

**Setup**:

- Evaluation framework: lm-evaluation-harness
- Shot configuration: 0-shot (no few-shot examples)
- Decoding: greedy
- Batch size: auto

**Reproducing results**:

```bash
# Install lm-eval
pip install lm-eval

# Run evaluation
lm_eval --model hf \
    --model_args pretrained=NoesisLab/NanoHammer-1.5B-Instruct,trust_remote_code=True \
    --tasks arc_challenge,arc_easy,hellaswag,piqa,winogrande \
    --batch_size auto \
    --output_path results/
```

## 🎓 Training

### Base Model & Weight Transfer

NanoHammer initializes from Llama-3.2-1B-Instruct via selective weight transfer.

**Frozen components** (from Llama):

- Token embeddings (`embed_tokens`)
- Language modeling head (`lm_head`)
- Self-attention layers (`self_attn`)
- MLP layers (`mlp`)
- All RMSNorm layers

**Trainable components** (NanoHammer-specific):

- `token_to_state`: projects input tokens into state space
- `holographic_rope`: position encoding for the state
- `state_cell`: state update mechanism (per layer)
- `state_projection`: state → hidden projection (per layer)
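The selective-transfer setup above amounts to partitioning parameters by module name. A minimal sketch (the parameter paths below are illustrative examples, not a dump of the real checkpoint; `split_parameters` is a hypothetical helper):

```python
# NanoHammer-specific modules stay trainable; everything inherited from Llama is frozen.
TRAINABLE_PREFIXES = ("token_to_state", "holographic_rope", "state_cell", "state_projection")

def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) by module name."""
    trainable, frozen = [], []
    for name in param_names:
        if any(p in name for p in TRAINABLE_PREFIXES):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen

names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.state_cell.step_size",
    "model.layers.0.state_projection.weight",
    "lm_head.weight",
]
trainable, frozen = split_parameters(names)
print(trainable)  # ['model.layers.0.state_cell.step_size', 'model.layers.0.state_projection.weight']
```

In PyTorch this translates to iterating `model.named_parameters()` and setting `param.requires_grad` according to the same name check.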

### Training Configuration

- **Dataset**: high-quality instruction-following data
- **Precision**: BF16 mixed precision
- **Optimization**: AdamW with a cosine LR schedule
- **Gradient checkpointing**: enabled for memory efficiency
- **Batch size**: scaled with gradient accumulation
- **Max sequence length**: 2048 tokens (extendable to 131K via RoPE)

πŸ” Why NanoHammer?

Problem: Implicit vs Explicit Causal Modeling

Traditional Transformers learn causal dependencies implicitly through attention weights:

Q @ K^T β†’ Attention weights β†’ Implicitly capture "what depends on what"

Limitations:

  • Causality is distributed across nΒ² attention scores
  • No explicit structure for causal information flow
  • Quadratic cost to maintain global context
  • Poor extrapolation to longer sequences

### Solution: Holographic Integral State

NanoHammer introduces an explicit causal state token:

```
S(t) ← accumulated causal information from all previous tokens
     ← updated via fixed-point iteration with temporal encoding
     ← participates in attention as a "global context token"
```

**Benefits**:

- Causality is explicit in a structured state representation
- The O(1) state size provides constant-cost global context
- Natural extrapolation to unseen sequence lengths
- Interpretable: the state token can be analyzed and visualized

## 📊 Model Architecture Diagram

```
┌─────────────────────────────────────────────────────────┐
│  Input: "What is the capital of France?"                │
│  Tokens: [What, is, the, capital, of, France, ?]        │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
         Token Embeddings
                 │
                 ▼
    ┌────────────────────────┐
    │  Token-to-State Proj   │  Project to state space
    └────────────┬───────────┘
                 │
    ┌────────────▼───────────┐
    │   Holographic RoPE     │  Apply position encoding
    │   (complex rotation)   │
    └────────────┬───────────┘
                 │
         ╔═══════▼════════╗
         ║  Layers 1-16   ║  (repeated 16 times)
         ╠════════════════╣
         ║ ┌────────────┐ ║
         ║ │State Update│ ║  S(t+1) = S(t) + α·f(S(t))
         ║ │   Cell     │ ║  [fixed-point iteration]
         ║ └─────┬──────┘ ║
         ║       │        ║
         ║ ┌─────▼──────┐ ║
         ║ │   State    │ ║  Project 512 → 2048
         ║ │ Projection │ ║
         ║ └─────┬──────┘ ║
         ║       │        ║
         ║   [S] + [T₁, T₂, ..., Tₙ]  ← prepend state token
         ║       │        ║
         ║ ┌─────▼──────┐ ║
         ║ │   Llama    │ ║  Standard attention
         ║ │ Attention  │ ║  over T+1 tokens
         ║ └─────┬──────┘ ║
         ║       │        ║
         ║ ┌─────▼──────┐ ║
         ║ │   Llama    │ ║  SwiGLU MLP
         ║ │    MLP     │ ║
         ║ └─────┬──────┘ ║
         ║       │        ║
         ║   Remove [S] from output
         ║       │        ║
         ╚═══════▼════════╝
                 │
         ┌───────▼────────┐
         │   Final Norm   │
         └───────┬────────┘
                 │
         ┌───────▼────────┐
         │    LM Head     │  Project to vocab
         └───────┬────────┘
                 │
                 ▼
    Output: "Paris" (logits over 128K vocab)
```

## 📚 Citation

If you use NanoHammer in your research, please cite:

```bibtex
@misc{nanohammer2025,
  title={NanoHammer: Explicit Causal Modeling with Holographic Integral State Compression},
  author={NoesisLab},
  year={2025},
  howpublished={\url{https://huggingface.co/NoesisLab/NanoHammer-1.5B-Instruct}},
}
```

πŸ“ License

This model is released under the Apache 2.0 license, inheriting from the base Llama-3.2-1B-Instruct model.


πŸ™ Acknowledgments

  • Base Model: Meta's Llama-3.2-1B-Instruct
  • Inspiration: State-space models, holographic memory, and causal inference theory
  • Framework: HuggingFace Transformers

πŸ”— Links


Built with ❀️ by NoesisLab

Advancing causal modeling in large language models