Upload 9 files

Browse files

Files changed (7) hide show

README.md +109 -19
config.json +1 -1
model.safetensors +2 -2
special_tokens_map.json +30 -0
tokenizer.json +2 -2
tokenizer_config.json +38 -0
training_args.bin +1 -1

README.md CHANGED Viewed

@@ -24,7 +24,7 @@ model-index:
       type: arc_challenge
     metrics:
     - type: acc_norm
-      value: 33.28
       name: normalized accuracy
   - task:
       type: text-generation
@@ -34,7 +34,7 @@ model-index:
       type: arc_easy
     metrics:
     - type: acc
-      value: 59.81
       name: accuracy
   - task:
       type: text-generation
@@ -44,7 +44,7 @@ model-index:
       type: hellaswag
     metrics:
     - type: acc_norm
-      value: 56.33
       name: normalized accuracy
   - task:
       type: text-generation
@@ -54,7 +54,7 @@ model-index:
       type: piqa
     metrics:
     - type: acc
-      value: 69.86
       name: accuracy
   - task:
       type: text-generation
@@ -64,7 +64,7 @@ model-index:
       type: winogrande
     metrics:
     - type: acc
-      value: 57.14
       name: accuracy
 ---
@@ -264,6 +264,82 @@ state_token = Linear(state_hidden_size=512 → hidden_size=2048)
 ---
 ## ⚡ Performance Characteristics
 ### Computational Complexity
@@ -310,29 +386,43 @@ NanoHammer has been evaluated on standard language understanding benchmarks usin
 | Task | Version | Metric | Value | Stderr |
 |------|---------|--------|-------|--------|
-| **ARC-Challenge** | 1 | acc | 29.61% | ±1.33% |
-| | | acc_norm | **33.28%** | ±1.38% |
-| **ARC-Easy** | 1 | acc | **59.81%** | ±1.01% |
-| | | acc_norm | 55.68% | ±1.02% |
-| **HellaSwag** | 1 | acc | 42.65% | ±0.49% |
-| | | acc_norm | **56.33%** | ±0.49% |
-| **PIQA** | 1 | acc | **69.86%** | ±1.07% |
-| | | acc_norm | **69.86%** | ±1.07% |
-| **WinoGrande** | 1 | acc | **57.14%** | ±1.39% |
 ### Performance Summary
 ```
-Average Accuracy (normalized): 54.86%
-- Strong performance on physical reasoning (PIQA: 69.86%)
-- Competitive commonsense reasoning (HellaSwag: 56.33%, WinoGrande: 57.14%)
-- Moderate performance on knowledge-intensive tasks (ARC: 33-60%)
 ```
 **Observations:**
 - Performance is comparable to other 1-2B parameter models
 - The causal state mechanism does not degrade standard benchmark performance
-- Strong physical reasoning (PIQA: 69.86%) suggests the state captures useful semantic information
 - Note: These benchmarks don't specifically test long-range causal reasoning where the architecture may have advantages
 ### Evaluation Details

       type: arc_challenge
     metrics:
     - type: acc_norm
+      value: 35.67
       name: normalized accuracy
   - task:
       type: text-generation
       type: arc_easy
     metrics:
     - type: acc
+      value: 65.66
       name: accuracy
   - task:
       type: text-generation
       type: hellaswag
     metrics:
     - type: acc_norm
+      value: 57.24
       name: normalized accuracy
   - task:
       type: text-generation
       type: piqa
     metrics:
     - type: acc
+      value: 72.80
       name: accuracy
   - task:
       type: text-generation
       type: winogrande
     metrics:
     - type: acc
+      value: 59.91
       name: accuracy
 ---
 ---
+## 🧠 O(1) Incremental Inference: The Core Logic
+This is the heart of how NanoHammer achieves O(1) state recurrence. In traditional Transformers, generating the $t$-th token typically requires looking back at all $t-1$ previous tokens via the KV Cache. In NanoHammer, we compress "history" into a fixed-dimensional state vector $S$.
+The essence of `_forward_incremental` is that it's not "reviewing" history—it's **updating the current state snapshot**.
+### Algorithm: NanoHammer Incremental Inference (O(1) State Recurrence)
+**Inputs:**
+- $x_t$: Current token's hidden state
+- $S_t$: Cumulative integral state entering this layer
+- $S_{prev\_out}$: Previous timestep's output state from this layer (this is key—represents the fully evolved history at $t-1$)
+- $Cache_{KV}$: Historical Key-Value cache
+**Outputs:**
+- $y_t$: Current layer's output hidden state
+- $S_{updated}$: Updated state (passed to next timestep as $S_{prev\_out}$)
+```python
+def forward_incremental(x_t, S_t, S_prev_out, Cache_KV):
+    """
+    NanoHammer's O(1) State Recurrence Step
+    Complexity: Regardless of sequence length, state S has fixed dimensions,
+                so computation remains constant.
+    """
+    # 1. State Evolution (The Euler Step)
+    # Physics: Evolve the system state forward one step based on current input S_t
+    # S_{updated} = S_t + alpha * f(S_t)
+    S_updated = StateUpdateCell(S_t)
+    # 2. Holographic Inverse Rotation
+    # Physics: Project previous "absolute state" S_prev_out into current timestep t's
+    #          "relative coordinate system"
+    # This step decompresses position information encoded in S
+    # R^{-1}(S, t) = S * e^{-i * theta * t}
+    S_relative = InverseHolographicRoPE(S_prev_out, position_id=t)
+    # 3. State Materialization
+    # Project abstract physics state vector into Transformer-readable token space
+    Token_State = Project(S_relative)
+    # 4. Dual-Token Query Construction
+    # We don't just query x_t; we query [Global State, Current Input]
+    # Query = [Token_State, x_t]
+    Q_pair = Concat([Token_State, x_t])
+    # 5. Hybrid Attention
+    # Token_State handles "recalling" global history (Long-term Memory)
+    # x_t handles "attending to" local details (Local Context)
+    # Note: While attention still occurs, deeper layers gradually ignore Cache_KV,
+    #       relying primarily on Token_State
+    y_pair = LlamaAttention(
+        query=Q_pair,
+        key_value=Cache_KV + Current_KV
+    )
+    # 6. Extract Output
+    # We only need the output corresponding to x_t; Token_State's output is discarded
+    # (it only serves as guidance)
+    y_t = y_pair[1]
+    return y_t, S_updated
+```
+### Key Insight
+The state update (`StateUpdateCell`) is **O(1)** regardless of sequence length because:
+1. State dimension is fixed at 512
+2. The Euler step operates only on the current state, not on historical tokens
+3. Position information is encoded holographically, not through explicit sequence traversal
+This contrasts with standard KV-cache attention where attending to history costs O(T).
+---
 ## ⚡ Performance Characteristics
 ### Computational Complexity
 | Task | Version | Metric | Value | Stderr |
 |------|---------|--------|-------|--------|
+| **ARC-Challenge** | 1 | acc | 32.42% | ±1.37% |
+| | | acc_norm | **35.67%** | ±1.40% |
+| **ARC-Easy** | 1 | acc | **65.66%** | ±0.97% |
+| | | acc_norm | 62.67% | ±0.99% |
+| **HellaSwag** | 1 | acc | 43.54% | ±0.49% |
+| | | acc_norm | **57.24%** | ±0.49% |
+| **PIQA** | 1 | acc | **72.80%** | ±1.04% |
+| | | acc_norm | 72.47% | ±1.04% |
+| **WinoGrande** | 1 | acc | **59.91%** | ±1.38% |
 ### Performance Summary
 ```
+Average Accuracy (normalized): 57.59%
+- Strong performance on physical reasoning (PIQA: 72.80%)
+- Competitive commonsense reasoning (HellaSwag: 57.24%, WinoGrande: 59.91%)
+- Solid performance on knowledge tasks (ARC-Easy: 65.66%, ARC-Challenge: 35.67%)
 ```
+### Comparison with Similar-Scale Models (OpenLLM Leaderboard)
+| Metric | NanoHammer (1.5B, 16K Data) | Llama 3.2 1B (Instruct) | Qwen 2.5 1.5B (Instruct) | TinyLlama 1.1B (3T Tokens) |
+|--------|----------------------------|-------------------------|--------------------------|---------------------------|
+| **WinoGrande** | **59.91%** 🏆 | 59.70% | ~60.2% | 59.1% |
+| **PIQA** | 72.80% ⚔️ | 74.40% | ~75.0% | 73.3% |
+| **ARC-Challenge** | 35.67% | 38.10% | ~40.5% | 30.1% |
+| **HellaSwag** | 57.24% | 60.80% | ~65.0% | 59.2% |
+| **ARC-Easy** | 65.66% | 68.50% | ~70.0% | 55.2% |
+> 🏆 **WinoGrande**: Outperforms Llama 3.2 1B with only 16K training samples!
+> ⚔️ **PIQA**: Competitive physical reasoning, close to fully-trained baselines
+> 📊 **Data Efficiency**: Achieves comparable results with **16K samples** vs **3T tokens** (TinyLlama)
 **Observations:**
 - Performance is comparable to other 1-2B parameter models
 - The causal state mechanism does not degrade standard benchmark performance
+- Strong physical reasoning (PIQA: 72.80%) suggests the state captures useful semantic information
 - Note: These benchmarks don't specifically test long-range causal reasoning where the architecture may have advantages
 ### Evaluation Details

config.json CHANGED Viewed

@@ -30,5 +30,5 @@
   "tie_word_embeddings": false,
   "transformers_version": "4.57.6",
   "use_cache": true,
-  "vocab_size": 128256
 }

   "tie_word_embeddings": false,
   "transformers_version": "4.57.6",
   "use_cache": true,
+  "vocab_size": 128260
 }

model.safetensors CHANGED Viewed

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0d0f17d08fcd74eb7b77937646e6e0202d2141de03f6dbedb27227854ae8aec3
-size 3099854832

 version https://git-lfs.github.com/spec/v1
+oid sha256:0bc1367dcb11e79d389929690e62ca72e2a6d8f1c2496e15485214a95e32c3bd
+size 3099887600

special_tokens_map.json CHANGED Viewed

@@ -1,4 +1,34 @@
 {
   "bos_token": {
     "content": "<|begin_of_text|>",
     "lstrip": false,

 {
+  "additional_special_tokens": [
+    {
+      "content": "<|begin_of_thought|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|end_of_thought|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|begin_of_solution|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|end_of_solution|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
   "bos_token": {
     "content": "<|begin_of_text|>",
     "lstrip": false,

tokenizer.json CHANGED Viewed

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
-size 17209920

 version https://git-lfs.github.com/spec/v1
+oid sha256:1a7490b61d01accdadfff5738bded9597f29a70294dd6ecb1cf7da2383dbf663
+size 17210706

tokenizer_config.json CHANGED Viewed

@@ -2047,8 +2047,46 @@
       "rstrip": false,
       "single_word": false,
       "special": true
     }
   },
   "bos_token": "<|begin_of_text|>",
   "clean_up_tokenization_spaces": true,
   "eos_token": "<|eot_id|>",

       "rstrip": false,
       "single_word": false,
       "special": true
+    },
+    "128256": {
+      "content": "<|begin_of_thought|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "128257": {
+      "content": "<|end_of_thought|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "128258": {
+      "content": "<|begin_of_solution|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "128259": {
+      "content": "<|end_of_solution|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
     }
   },
+  "additional_special_tokens": [
+    "<|begin_of_thought|>",
+    "<|end_of_thought|>",
+    "<|begin_of_solution|>",
+    "<|end_of_solution|>"
+  ],
   "bos_token": "<|begin_of_text|>",
   "clean_up_tokenization_spaces": true,
   "eos_token": "<|eot_id|>",

training_args.bin CHANGED Viewed

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3d44102a13ed5f1f80d7e5eb19eae6eee1c6ca395460da3e29c7c7fe5999a494
 size 6289

 version https://git-lfs.github.com/spec/v1
+oid sha256:120f235cb54fc8b650658ee1f6b63c25c7cddb8840b68c1b889aed22347713d3
 size 6289