	---
	language:
	- en
	- code
	license: apache-2.0
	tags:
	- recursive-language-model
	- causal-lm
	- multimodal
	- long-context
	- mixture-of-experts
	- continual-learning
	- meta-learning
	- self-automated
	- safetensors
	- pytorch
	model_name: Infinite.Code.III
	pipeline_tag: text-generation
	library_name: transformers
	---

	# Infinite.Code.III — Recursive Language Model

	> "Not a Large Language Model. A Recursive Mind."

	## Overview

	Infinite.Code.III is a 1.210B-parameter Recursive Language Model (RLM)
	built from scratch as a unified Hybrid Mind architecture. Unlike standard LLMs, which apply a
	fixed forward-pass transformer, Infinite.Code.III integrates Self-Automated (S.A.) learning
	systems as architectural primitives: they are not pipeline steps but components woven into
	the decoder layers themselves.

	| Property | Value |
	|---|---|
	| Parameters | 1.210B |
	| Context Window | 1,000,000 tokens |
	| Architecture | Recursive Language Model (RLM) |
	| Attention | Grouped-Query Attention (GQA) 10/5 heads |
	| Positional Encoding | RoPE (θ = 500,000, long-ctx scaled) |
	| FFN | Alternating Dense / Mixture-of-Experts (8 experts, top-2) |
	| Vocabulary | 65,536 BPE tokens |
	| Layers | 20 |
	| Hidden Size | 1280 |
	| Weight Format | safetensors (bfloat16 trained, float32 saved) |
	| Modalities | Text · Image · Audio · Video |
	| License | Apache 2.0 |
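
To make the attention row concrete, the sketch below works out the head geometry and a rough KV-cache size at the full context length. It assumes that "10/5 heads" means 10 query heads sharing 5 key/value heads, that every one of the 20 layers keeps a standard bfloat16 KV cache, and it ignores the memory modules entirely. None of these assumptions is confirmed by the table.

```python
# Assumption: "10/5 heads" = 10 query heads sharing 5 key/value heads.
HIDDEN_SIZE = 1280
NUM_LAYERS = 20
NUM_Q_HEADS = 10
NUM_KV_HEADS = 5

head_dim = HIDDEN_SIZE // NUM_Q_HEADS        # 1280 / 10
group_size = NUM_Q_HEADS // NUM_KV_HEADS     # query heads per KV head
kv_head_for_query = [q // group_size for q in range(NUM_Q_HEADS)]


def kv_cache_bytes(num_tokens, num_kv_heads, bytes_per_value=2):
    """Keys plus values for every layer; 2 bytes per value assumes a bf16 cache."""
    per_token = 2 * NUM_LAYERS * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


gqa_gb = kv_cache_bytes(1_000_000, NUM_KV_HEADS) / 1e9
mha_gb = kv_cache_bytes(1_000_000, NUM_Q_HEADS) / 1e9   # if every query head had its own KV head

print(head_dim, kv_head_for_query)
print(f"GQA cache: {gqa_gb:.1f} GB, full multi-head cache: {mha_gb:.1f} GB")
```

Under these assumptions, halving the number of KV heads halves the cache, which matters most at the 1M-token end of the context range.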

	---

	## S.A. System Architecture

	### S.A. Meta Learning
	Each layer has a learnable `adaptive_alpha` scalar (sigmoid-gated) that blends the
	layer's transformed output with the residual taken at the top of the layer. This is the
	meta-learning channel: each layer learns how much its transformation contributes.
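
A minimal sketch of such a gate in plain Python, assuming the blend is a convex combination `alpha * transformed + (1 - alpha) * residual`. The exact formula is not given here, and `blend` and `alpha_logit` are illustrative names; in a real module the logit would be a learnable parameter.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def blend(residual, transformed, alpha_logit):
    """Mix the transformed output with the residual via a sigmoid-gated scalar."""
    alpha = sigmoid(alpha_logit)   # always in (0, 1)
    return [alpha * t + (1.0 - alpha) * r for r, t in zip(residual, transformed)]


residual = [1.0, 2.0, 3.0]
transformed = [3.0, 2.0, 1.0]

half = blend(residual, transformed, alpha_logit=0.0)               # alpha = 0.5
mostly_residual = blend(residual, transformed, alpha_logit=-10.0)  # alpha near 0
print(half, mostly_residual)
```

A strongly negative logit lets a layer nearly pass its input through unchanged, while a strongly positive one lets the transformation dominate.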

	### S.A. Reinforcement Learning
	`RewardHead` projects the final hidden states to a single scalar (D → 512 → 1,
	where D is the hidden size, 1280). During RL fine-tuning (RLHF / GRPO), this head
	provides the value signal. Pass `output_reward=True` during rollout collection.
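
The D → 512 → 1 shape suggests a two-layer MLP. The PyTorch sketch below is one way to write it; the GELU activation and the choice to read the last token position are assumptions, and `RewardHeadSketch` is a hypothetical name rather than the model's own class.

```python
import torch
import torch.nn as nn


class RewardHeadSketch(nn.Module):
    """Hypothetical stand-in for a D -> 512 -> 1 scalar head."""

    def __init__(self, hidden_size: int = 1280, inner: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner),
            nn.GELU(),               # assumption: the activation is not stated
            nn.Linear(inner, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        last_token = hidden_states[:, -1, :]      # assumption: pool the last position
        return self.net(last_token).squeeze(-1)   # (batch,)


head = RewardHeadSketch()
reward = head(torch.randn(2, 16, 1280))
print(reward.shape)
```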

	### S.A. Continual Learning
	`HybridMemory` LTM uses exponential-moving-average write-back
	(`0.95 × old + 0.05 × new`). Each write shifts a slot by only 5%, so knowledge
	accumulates across forward passes instead of being overwritten outright, which is
	intended to resist catastrophic forgetting.
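
The write-back rule can be traced by hand. The plain-Python sketch below applies it to a toy two-element slot; `ema_write` is an illustrative name, and only the 0.95 / 0.05 coefficients come from the text above.

```python
import math

DECAY = 0.95


def ema_write(old_slot, new_value, decay=DECAY):
    """One write-back step: decay * old + (1 - decay) * new."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old_slot, new_value)]


slot = [1.0, 0.0]      # existing memory content
target = [0.0, 1.0]    # content being written repeatedly
history = []
for _ in range(3):
    slot = ema_write(slot, target)
    history.append(slot[0])   # how much of the old content remains

# Number of writes after which the original content has decayed to half.
half_life = math.log(0.5) / math.log(DECAY)
print(history, half_life)
```

The old content shrinks geometrically (0.95, then 0.95², and so on), so a slot retains earlier information for many writes but not indefinitely.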

	### S.A. Adaptive Learning
	The per-layer `adaptive_alpha` gate is trained end-to-end, self-calibrating
	each layer's write strength to the residual stream.

	### S.A. Rewriting Learning
	Every 3rd layer runs `RewriteAttention` — a 4-head causal self-attention
	pass that lets the model revise its own intermediate token representations
	within a single forward pass.
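
As a rough illustration, a 4-head causal self-attention pass can be built from `torch.nn.MultiheadAttention` with a boolean upper-triangular mask. The pre-norm, the residual add, and the class name `RewriteAttentionSketch` are assumptions; the small hidden size in the usage lines only keeps the example fast.

```python
import torch
import torch.nn as nn


class RewriteAttentionSketch(nn.Module):
    """Hypothetical 4-head causal self-attention 'rewrite' pass."""

    def __init__(self, hidden_size: int = 1280, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)   # assumption: pre-norm
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        seq_len = h.size(1)
        # True above the diagonal marks positions that may NOT be attended to.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=h.device), diagonal=1
        )
        x = self.norm(h)
        revised, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        return h + revised                      # assumption: residual update


torch.manual_seed(0)
layer = RewriteAttentionSketch(hidden_size=64, num_heads=4)
h = torch.randn(1, 6, 64)
h_changed = h.clone()
h_changed[:, -1] += 1.0                         # perturb only the last token
with torch.no_grad():
    out, out_changed = layer(h), layer(h_changed)
earlier_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5)
print(earlier_unchanged)
```

The check at the end confirms the causal property: changing the last token leaves the revised representations of all earlier tokens untouched.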

	### S.A. NLP + S.A. Problem Solving
	`MetaOutputMixer` at decoder output applies a 3-way soft gate
	(language / code / math-logic) via `NLPGate`. The final representation
	is a content-adaptive weighted mixture of three parallel projections.
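
One common way to build a 3-way soft gate is a softmax over three logits, used to weight the three projections. The plain-Python sketch below assumes that form; `mix_outputs` and the toy vectors are illustrative, not taken from the model code.

```python
import math


def softmax(logits):
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_outputs(gate_logits, language_out, code_out, math_out):
    """Weighted mixture of three parallel projections."""
    w_lang, w_code, w_math = softmax(gate_logits)
    return [
        w_lang * a + w_code * b + w_math * c
        for a, b, c in zip(language_out, code_out, math_out)
    ]


language_out, code_out, math_out = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
balanced = mix_outputs([0.0, 0.0, 0.0], language_out, code_out, math_out)
code_heavy = mix_outputs([0.0, 20.0, 0.0], language_out, code_out, math_out)
print(balanced, code_heavy)
```

With equal logits the three projections contribute equally; a large logit on one branch makes the output follow that branch almost exclusively.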

	### S.A. Innovation Learning
	Odd-numbered layers use `MoELayer`: 8 experts with top-2 routing, where
	each expert is a SwiGLU FFN with a 2048-dim intermediate size.
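
The PyTorch sketch below shows one conventional way to implement top-2 routing over SwiGLU experts. Renormalising the router weights with a softmax over the two selected logits is an assumption, as are the bias-free projections and the names `SwiGLUExpert` and `MoESketch`; the usage lines shrink the dimensions to keep the example light.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    def __init__(self, hidden_size: int, intermediate: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: SiLU-gated product of two projections, then project back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MoESketch(nn.Module):
    def __init__(self, hidden_size=1280, intermediate=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(hidden_size, intermediate) for _ in range(num_experts)]
        )

    def forward(self, x):
        tokens = x.reshape(-1, x.size(-1))                        # (N, D)
        top_logits, top_idx = self.router(tokens).topk(self.top_k, dim=-1)
        top_weights = F.softmax(top_logits, dim=-1)               # assumption: renormalise over the top 2
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                weight = top_weights[token_ids, slot].unsqueeze(-1)
                out[token_ids] += weight * expert(tokens[token_ids])
        return out.reshape_as(x)


moe = MoESketch(hidden_size=64, intermediate=128)   # small sizes for a quick run
x = torch.randn(2, 5, 64)
y = moe(x)
expert_params = 3 * 1280 * 2048                      # per-expert weights at the stated sizes
print(y.shape, expert_params)
```

Each token runs through only 2 of the 8 experts, so the compute per token is a fraction of the layer's total parameter count.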

	### S.A. Debugging
	`DebugHookManager` is a gradient hook registry. Set `debug_mode: true` in the config to
	activate mean-absolute-gradient logging on the embedding and any registered tensor.
	It adds no overhead when disabled.
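
Mean-absolute-gradient logging of this kind can be built on `torch.Tensor.register_hook`. The sketch below is a hypothetical minimal registry, not the model's `DebugHookManager`; the class name, the `log` dictionary, and the `register` method are illustrative.

```python
import torch


class DebugHooksSketch:
    """Hypothetical registry that logs mean |grad| for registered tensors."""

    def __init__(self, enabled: bool):
        self.enabled = enabled
        self.log = {}

    def register(self, name: str, tensor: torch.Tensor):
        if not self.enabled or not tensor.requires_grad:
            return None                      # disabled: no hook is attached at all

        def hook(grad):
            # Returning None leaves the gradient itself unchanged.
            self.log[name] = grad.abs().mean().item()

        return tensor.register_hook(hook)


hooks = DebugHooksSketch(enabled=True)
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
hooks.register("w", w)
(w * w).sum().backward()                     # gradient is 2 * w
print(hooks.log)

disabled = DebugHooksSketch(enabled=False)
handle = disabled.register("w", w)           # nothing registered, nothing logged
```

Because the disabled path returns before attaching anything, the backward pass runs exactly as it would without the registry.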

	### S.A. Advanced Long/Short-Term Memory
	`HybridMemory` (every 4th layer):
	- STM: 512-slot soft-attention read buffer (refreshed each pass)
	- LTM: 2048-slot persistent EMA key-value store (continual write-back)
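
A soft-attention read over a fixed bank of slots can be written in a few lines of PyTorch. The sketch below assumes scaled dot-product attention over both banks and simply sums the two reads; neither detail is stated above, and `soft_read` is an illustrative name.

```python
import math

import torch
import torch.nn.functional as F


def soft_read(query, keys, values):
    """Attention-weighted read: query (B, T, D); keys and values (S, D)."""
    scores = query @ keys.T / math.sqrt(query.size(-1))   # (B, T, S)
    weights = F.softmax(scores, dim=-1)                    # each row sums to 1 over the slots
    return weights @ values                                # (B, T, D)


hidden = 1280
stm_keys, stm_values = torch.randn(512, hidden), torch.randn(512, hidden)     # short-term slots
ltm_keys, ltm_values = torch.randn(2048, hidden), torch.randn(2048, hidden)   # long-term slots

h = torch.randn(2, 8, hidden)
# Assumption: the two reads are added together before being merged into the layer.
memory_read = soft_read(h, stm_keys, stm_values) + soft_read(h, ltm_keys, ltm_values)
print(memory_read.shape)
```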

	### S.A. Recursive Seed Learning
	`RecursiveSeedGate` runs on every layer as a depth-4 intra-layer recursion.
	Each step seeds a 256-dim vector, projects it to the full hidden size D, gates with
	a sigmoid, and re-seeds from the updated `h`. This creates feedback loops within a single layer.
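
The loop described above might look roughly like the following PyTorch sketch. How the sigmoid gate is applied to `h` is not specified, so the multiplicative update here is an assumption, as are the two linear projections and the name `RecursiveSeedGateSketch`.

```python
import torch
import torch.nn as nn


class RecursiveSeedGateSketch(nn.Module):
    """Hypothetical depth-4 seed -> project -> gate -> re-seed loop."""

    def __init__(self, hidden_size: int = 1280, seed_dim: int = 256, depth: int = 4):
        super().__init__()
        self.depth = depth
        self.to_seed = nn.Linear(hidden_size, seed_dim)
        self.from_seed = nn.Linear(seed_dim, hidden_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for _ in range(self.depth):
            seed = self.to_seed(h)                        # (..., 256) seed from the current h
            gate = torch.sigmoid(self.from_seed(seed))    # (..., D), values in (0, 1)
            h = h * gate                                  # assumption: multiplicative gating
        return h


gate_layer = RecursiveSeedGateSketch(hidden_size=64, seed_dim=16)   # small sizes for a quick run
h_in = torch.randn(2, 5, 64)
h_out = gate_layer(h_in)
print(h_out.shape)
```

Each pass computes its seed from the `h` produced by the previous pass, which is what makes the recursion a feedback loop rather than four independent gates.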

	---

	## Multimodal Inputs

	| Modality | Projector | Input Shape |
	|---|---|---|
	| Image | `ImageProjector` Linear(1024→2560→1280) | `(B, N_patches, 1024)` |
	| Audio | `AudioProjector` GRU(80→512) + Linear | `(B, T_frames, 80)` |
	| Video | `VideoProjector` Linear + TransformerEncoderLayer | `(B, F_frames, 1024)` |
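
To illustrate the image row, the PyTorch sketch below maps patch features into the 1280-dim hidden space with two linear layers. The GELU between them, the 196-patch input, and prepending the image tokens to the text embeddings are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for ImageProjector: Linear(1024 -> 2560 -> 1280).
image_projector = nn.Sequential(
    nn.Linear(1024, 2560),
    nn.GELU(),                 # assumption: the activation is not stated
    nn.Linear(2560, 1280),
)

patches = torch.randn(2, 196, 1024)        # (B, N_patches, 1024), e.g. from a vision encoder
image_tokens = image_projector(patches)    # (B, N_patches, 1280)

text_embeds = torch.randn(2, 32, 1280)     # (B, T_text, 1280) token embeddings
# Assumption: image tokens are placed in front of the text sequence.
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
print(inputs_embeds.shape)
```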

	---

	## Fine-Tuning

	### SFT Recommended Hyperparameters
	| Setting | Value |
	|---|---|
	| Learning Rate | 2e-5 |
	| LR Schedule | cosine + 100-step warmup |
	| Batch Size | 1–4 per GPU + grad accumulation ×8 |
	| Max Seq Length | start at 8192, scale to 1M |
	| Precision | bfloat16 |
	| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8, wd=0.1) |
	| Grad Clip | 1.0 |
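
The schedule and batch-size rows can be checked with a little arithmetic. The plain-Python sketch below assumes linear warmup followed by cosine decay to zero; `total_steps`, `min_lr`, and `num_gpus` are hypothetical values chosen for the example.

```python
import math

PEAK_LR = 2e-5
WARMUP_STEPS = 100


def lr_at(step, total_steps, min_lr=0.0):
    """Linear warmup to PEAK_LR, then cosine decay towards min_lr."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 1000                              # hypothetical run length
schedule = [lr_at(s, total_steps) for s in (0, 99, 100, 550, 1000)]

per_gpu_batch, grad_accum, num_gpus = 4, 8, 1   # num_gpus is an assumption
effective_batch = per_gpu_batch * grad_accum * num_gpus
print(schedule, effective_batch)
```

With a per-GPU batch of 4 and 8 accumulation steps, each optimizer update sees 32 sequences per GPU.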

	### RLHF / GRPO
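	The `reward_head` is the built-in value model. Pass `output_reward=True`
	during rollout to obtain its scalar output. The scalar is differentiable and can supply
	the reward signal for TRL `GRPOTrainer`, for example through a custom reward function
	passed as `reward_funcs`.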
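
GRPO scores each completion relative to the other completions sampled for the same prompt, so the reward scalars are normalised within a group before they drive the policy update. The plain-Python sketch below shows that normalisation on hypothetical reward values; implementations differ on details such as population versus sample standard deviation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward-head outputs for 4 completions of the same prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
best = advantages.index(max(advantages))
print(advantages, best)
```

Completions that score above the group mean receive positive advantages and are reinforced; those below the mean are pushed down.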

	---

	## Citation

	```bibtex
	@misc{infinite_code_iii_2025,
	title = {Infinite.Code.III: A Recursive Language Model with Self-Automated Learning},
	author = {GODsStrongestSoldier},
	year = {2025},
	url = {https://huggingface.co/GODsStrongestSoldier/Infinite.Code.III},
	note = {1.210B Recursive Language Model, 1M context window}
	}
	```