---
language:
- en
- code
license: apache-2.0
tags:
- recursive-language-model
- causal-lm
- multimodal
- long-context
- mixture-of-experts
- continual-learning
- meta-learning
- self-automated
- safetensors
- pytorch
model_name: Infinite.Code.III
pipeline_tag: text-generation
library_name: transformers
---
# Infinite.Code.III: Recursive Language Model
> *"Not a Large Language Model. A Recursive Mind."*
## Overview
**Infinite.Code.III** is a **1.210B-parameter Recursive Language Model (RLM)**
built from scratch as a unified Hybrid Mind architecture. Unlike standard LLMs, which apply a
fixed forward-pass transformer, Infinite.Code.III integrates Self-Automated (S.A.) learning
systems as architectural primitives: they are not pipeline steps but components built into the
decoder layers themselves.
| Property | Value |
|---|---|
| Parameters | **1.210B** |
| Context Window | **1,000,000 tokens** |
| Architecture | Recursive Language Model (RLM) |
| Attention | Grouped-Query Attention (GQA) 10/5 heads |
| Positional Encoding | RoPE (θ = 500,000, long-context scaled) |
| FFN | Alternating Dense / Mixture-of-Experts (8 experts, top-2) |
| Vocabulary | 65,536 BPE tokens |
| Layers | 20 |
| Hidden Size | 1280 |
| Weight Format | safetensors (bfloat16 trained, float32 saved) |
| Modalities | Text · Image · Audio · Video |
| License | Apache 2.0 |
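
A few derived quantities follow from the table by plain arithmetic. The sketch below assumes that "10/5 heads" means 10 query heads sharing 5 key/value heads and that the head dimension is the hidden size divided by the query-head count; the KV-cache estimate counts only the main GQA attention and is an illustration, not a measured figure.

```python
# Figures taken from the table above.
PARAMS = 1.210e9
HIDDEN = 1280
LAYERS = 20
VOCAB = 65_536
CONTEXT = 1_000_000

# Assumption: "10/5 heads" = 10 query heads sharing 5 key/value heads.
Q_HEADS, KV_HEADS = 10, 5
head_dim = HIDDEN // Q_HEADS  # 128

# Token embedding table: one HIDDEN-sized row per vocabulary entry.
embedding_params = VOCAB * HIDDEN  # 83,886,080

# Checkpoint size when the weights are saved as float32 (4 bytes each).
checkpoint_gb = PARAMS * 4 / 1e9  # about 4.84 GB

# KV cache for one full-context sequence in bfloat16 (2 bytes per value):
# keys and values (x2), per layer, per KV head, per head dimension, per token.
kv_cache_gb = 2 * LAYERS * KV_HEADS * head_dim * CONTEXT * 2 / 1e9  # about 51.2 GB

print(head_dim, embedding_params, checkpoint_gb, kv_cache_gb)
```

Under these assumptions a single 1M-token sequence needs roughly ten times more memory for its KV cache than for the float32 checkpoint, which is why the fine-tuning section suggests starting at a much shorter sequence length.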
---
## S.A. System Architecture
### S.A. Meta Learning
Each layer has a learnable `adaptive_alpha` scalar (sigmoid-gated) that blends the
transformed output with the residual taken at the top of the layer. This is the meta-learning
channel: it learns *how much* each layer's transformation contributes.
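
A minimal sketch of such a gate, assuming the blend is a convex combination `g * transformed + (1 - g) * residual` with `g = sigmoid(adaptive_alpha)`. The exact form used in the model is not stated here, and the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def blend(residual, transformed, adaptive_alpha):
    """Blend two equal-length vectors with a sigmoid-gated scalar (assumed form)."""
    g = sigmoid(adaptive_alpha)  # gate value in (0, 1)
    return [g * t + (1.0 - g) * r for r, t in zip(residual, transformed)]


residual = [1.0, 2.0, 3.0]
transformed = [3.0, 2.0, -1.0]

# alpha = 0 gives a gate of 0.5: the output is the midpoint of the two vectors.
half = blend(residual, transformed, 0.0)

# A strongly negative alpha closes the gate: the layer passes its input almost unchanged.
mostly_residual = blend(residual, transformed, -6.0)

print(half, mostly_residual)
```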
### S.A. Reinforcement Learning
`RewardHead` (D → 512 → 1 scalar) attaches to the final hidden states.
During RL fine-tuning (RLHF / GRPO), this head provides the value signal.
Pass `output_reward=True` during rollout collection.
### S.A. Continual Learning
`HybridMemory` LTM uses exponential moving average write-back
(`0.95 × old + 0.05 × new`), so knowledge accumulates across forward passes
without being overwritten outright, which is intended to resist catastrophic forgetting.
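
The write-back rule can be traced by hand. The sketch below applies the stated formula to a single toy memory slot; the slot contents and the helper name `ema_write` are illustrative, not taken from the model code.

```python
import math

DECAY = 0.95  # weight on the old slot content, from the formula above


def ema_write(old, new, decay=DECAY):
    """One write-back step: decay * old + (1 - decay) * new, element-wise."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old, new)]


slot = [1.0, 0.0]    # existing memory content
target = [0.0, 1.0]  # new content written repeatedly

for _ in range(10):
    slot = ema_write(slot, target)

# After n writes toward a fixed target, the original content keeps weight DECAY**n.
retained = DECAY ** 10

# Number of writes after which the original content has decayed to half weight.
half_life = math.log(0.5) / math.log(DECAY)

print(slot, retained, half_life)
```

With a decay of 0.95, old content fades to half weight after about 13.5 writes, so the store favors gradual drift over abrupt replacement.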
### S.A. Adaptive Learning
The per-layer `adaptive_alpha` gate is trained end-to-end, self-calibrating
each layer's write strength to the residual stream.
### S.A. Rewriting Learning
Every 3rd layer runs `RewriteAttention`, a 4-head causal self-attention
pass that lets the model revise its own intermediate token representations
within a single forward pass.
### S.A. NLP + S.A. Problem Solving
`MetaOutputMixer` at decoder output applies a 3-way soft gate
(language / code / math-logic) via `NLPGate`. The final representation
is a content-adaptive weighted mixture of three parallel projections.
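
A sketch of a 3-way soft gate, assuming the gate is a softmax over three logits and the output is the weighted sum of the three projections. The toy vectors stand in for the language, code, and math-logic projections; the function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix(gate_logits, language, code, math_logic):
    """Return the gated mixture of three equal-length vectors and the gate weights."""
    w = softmax(gate_logits)
    mixed = [w[0] * a + w[1] * b + w[2] * c
             for a, b, c in zip(language, code, math_logic)]
    return mixed, w


# Equal logits give equal weights of 1/3 each.
mixed, weights = mix([0.0, 0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0])

# A dominant "code" logit makes the output follow the code projection.
code_heavy, code_weights = mix([0.0, 8.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0])

print(mixed, weights, code_heavy)
```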
### S.A. Innovation Learning
Odd-numbered layers use `MoELayer`: 8 experts, top-2 routing,
each a SwiGLU FFN with 2048-dim intermediate.
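
Top-2 routing over 8 experts can be sketched as follows. This assumes the two selected router scores are renormalized with a softmax, which is one common convention and not confirmed for this model; the toy experts simply scale their input, whereas the real experts are SwiGLU FFNs.

```python
import math

NUM_EXPERTS = 8
TOP_K = 2


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    m = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - m) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, w in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

logits = [0.0, 2.0, 0.0, 2.0, -1.0, 0.0, 0.0, 0.0]
routes = top2_route(logits)                # experts 1 and 3, equal weights
y = moe_forward([1.0], logits, experts)    # 0.5 * 2 + 0.5 * 4

print(routes, y)
```

Only two of the eight experts run per token, so the per-token compute of an MoE layer stays close to that of two dense FFNs while the layer holds eight experts' worth of parameters.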
### S.A. DeBugging
`DebugHookManager` is a gradient hook registry. Set `debug_mode: true` in the config to
activate mean-absolute-gradient logging on the embedding and any registered tensor.
It adds no overhead when disabled.
### S.A. Advanced Long/Short-Term Memory
`HybridMemory` (every 4th layer):
- **STM**: 512-slot soft-attention read buffer (refreshed each pass)
- **LTM**: 2048-slot persistent EMA key-value store (continual write-back)
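
The placement rules given so far (MoE on odd-numbered layers, `RewriteAttention` on every 3rd layer, `HybridMemory` on every 4th layer) can be laid out for the 20-layer stack. The sketch assumes 1-based layer numbering, which is not stated here; with 0-based indexing the counts would differ. `DenseFFN` is a placeholder label for the non-MoE feed-forward block.

```python
NUM_LAYERS = 20


def layer_plan(num_layers=NUM_LAYERS):
    """List the components of each layer under 1-based numbering (assumption)."""
    plan = []
    for n in range(1, num_layers + 1):
        parts = ["MoELayer" if n % 2 == 1 else "DenseFFN"]
        if n % 3 == 0:
            parts.append("RewriteAttention")
        if n % 4 == 0:
            parts.append("HybridMemory")
        plan.append((n, parts))
    return plan


plan = layer_plan()
counts = {
    name: sum(name in parts for _, parts in plan)
    for name in ("MoELayer", "RewriteAttention", "HybridMemory")
}

for n, parts in plan:
    print(n, " + ".join(parts))
print(counts)
```

Under this numbering, layer 12 is the only layer that carries both `RewriteAttention` and `HybridMemory`.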
### S.A. Recursive Seed Learning
`RecursiveSeedGate` on **every layer** performs depth-4 intra-layer recursion:
it seeds a 256-dim vector, projects it to the full hidden size D, gates with a sigmoid,
and re-seeds from the updated h. This creates feedback loops within a single layer.
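
One way to read the steps above is sketched below in plain Python with toy sizes (D = 8, seed size 2, instead of 1280 and 256). The additive update, the separate gate projection, and all weight names are assumptions made for illustration; the model's actual wiring may differ.

```python
import math
import random


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def recursive_seed_gate(h, w_down, w_up, w_gate, depth=4):
    """Apply `depth` rounds of seed -> project -> gate -> update (assumed form)."""
    for _ in range(depth):
        seed = matvec(w_down, h)                           # D -> seed_dim
        update = matvec(w_up, seed)                        # seed_dim -> D
        gate = [sigmoid(g) for g in matvec(w_gate, seed)]  # per-dimension gate in (0, 1)
        h = [hi + gi * ui for hi, gi, ui in zip(h, gate, update)]
        # The next iteration re-seeds from the updated h.
    return h


D, SEED_DIM = 8, 2  # toy sizes; the model uses D = 1280 and a 256-dim seed
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


w_down = rand_matrix(SEED_DIM, D)
w_up = rand_matrix(D, SEED_DIM)
w_gate = rand_matrix(D, SEED_DIM)

h0 = [1.0] * D
h4 = recursive_seed_gate(h0, w_down, w_up, w_gate, depth=4)
print(h4)
```

Because each round reads the output of the previous one, the four rounds are sequential and cannot be collapsed into a single matrix product.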
---
## Multimodal Inputs
| Modality | Projector | Input Shape |
|---|---|---|
| Image | `ImageProjector` Linear(1024→2560→1280) | `(B, N_patches, 1024)` |
| Audio | `AudioProjector` GRU(80β†’512) + Linear | `(B, T_frames, 80)` |
| Video | `VideoProjector` Linear + TransformerEncoderLayer | `(B, F_frames, 1024)` |
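
As a size check for the image path, the two linear layers listed above can be counted directly. The sketch assumes both layers carry bias terms and that any activation between them adds no parameters; neither detail is stated in the table.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


# ImageProjector: 1024 -> 2560 -> 1280, as listed in the table.
image_projector_params = linear_params(1024, 2560) + linear_params(2560, 1280)

# Share of the stated 1.210B total.
share = image_projector_params / 1.210e9

print(image_projector_params, f"{share:.2%}")
```

Each patch feature of width 1024 becomes one 1280-wide embedding, so an input of shape `(B, N_patches, 1024)` yields `(B, N_patches, 1280)`.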
---
## Fine-Tuning
### SFT Recommended Hyperparameters
| Setting | Value |
|---|---|
| Learning Rate | 2e-5 |
| LR Schedule | cosine + 100-step warmup |
| Batch Size | 1–4 per GPU + grad accumulation ×8 |
| Max Seq Length | start at 8192, scale to 1M |
| Precision | bfloat16 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8, wd=0.1) |
| Grad Clip | 1.0 |
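
The schedule row can be written out as a function of the optimizer step. The sketch assumes linear warmup and decay to zero; the total step count is a hypothetical value chosen for the example, since the table does not specify one.

```python
import math

PEAK_LR = 2e-5      # from the table
WARMUP_STEPS = 100  # from the table
GRAD_ACCUM = 8      # from the table


def lr_at(step, total_steps, peak=PEAK_LR, warmup=WARMUP_STEPS, min_lr=0.0):
    """Linear warmup (assumption) followed by cosine decay to min_lr."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1_000  # hypothetical run length

for step in (0, 99, 100, 550, 1_000):
    print(step, lr_at(step, TOTAL_STEPS))

# Sequences per optimizer step on one GPU: per-GPU batch times accumulation steps.
effective_batch_range = (1 * GRAD_ACCUM, 4 * GRAD_ACCUM)
print(effective_batch_range)
```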
### RLHF / GRPO
The `reward_head` is the built-in value model. Pass `output_reward=True`
during rollout. The scalar is differentiable, so it can be plugged directly into TRL `GRPOTrainer`.
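
GRPO scores each sampled completion relative to the other completions for the same prompt. The sketch below shows that normalization on a toy group of scalar rewards, assuming the scalars come from `reward_head`. It does not call TRL, and the use of the population standard deviation is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's completions to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward scalars for four completions of the same prompt.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

print(advantages)
```

Completions that score above the group mean receive positive advantages and those below receive negative ones, so only the relative ordering and spread within the group drive the update.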
---
## Citation
```bibtex
@misc{infinite_code_iii_2025,
title = {Infinite.Code.III: A Recursive Language Model with Self-Automated Learning},
author = {GODsStrongestSoldier},
year = {2025},
url = {https://huggingface.co/GODsStrongestSoldier/Infinite.Code.III},
note = {1.210B Recursive Language Model, 1M context window}
}
```