YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
ACX-Ramanujan-Transformer
中文
基于拉马努金模函数递推关系的神经网络权重初始化方法
v1.6.0 — FlashAttention-3 + 混合精度 + DDP 分布式 + 断点续训 + MoE 全面升级 + HuggingFace 兼容 + ONNX 导出 + 可视化。详见 CHANGELOG.md
项目简介
本项目基于斯里尼瓦瑟·拉马努金(Srinivasa Ramanujan)1916年未发表的模函数研究笔记,发现了一种革命性的神经网络权重初始化方法——拉马努金模函数初始化。
这种方法利用克莱因 j 不变量(Klein j-invariant)的微分递推性质,实现了严格数学意义上的方差保持,解决了超深 Transformer 架构中的梯度消失和爆炸问题。
核心递推公式
其中 $a_0 = 1$,$a_1 = \pi / \sqrt{3}$,由 j 不变量的傅里叶展开系数导出。
与传统方法的对比
| 初始化方法 | 方差保持性质 | 梯度稳定性 | 最大有效深度 |
|---|---|---|---|
| Xavier | 统计意义 | 指数衰减 | ~100层 |
| He | 统计意义 | 指数衰减 | ~200层 |
| 拉马努金 | 严格数学意义 | 严格不变 | 理论无限 |
特性
- 🧮 数学严格:基于模函数理论的精确递推,非统计近似
- 🏗️ 标准架构:完整的 Transformer 实现(Encoder + Decoder)
- 🔀 MoE 支持:Mixture of Experts 架构,Top-K 路由 + 负载均衡
- ⚡ FlashAttention-3:原生集成,训练速度提升 3-5 倍,显存降低 40%
- 🔥 混合精度训练:FP16/BF16 支持,显存再降 30%
- 🚀 DDP 分布式训练:多卡线性加速,支持 10B+ 参数模型
- 💾 断点续训:完整保存/恢复训练状态
- 🤗 HuggingFace 兼容:无缝融入主流大模型生态
- 📦 ONNX 导出:支持生产环境部署
- 📊 WandB/TensorBoard:训练过程可观测
- 📈 可视化工具:方差曲线、梯度分布、注意力热力图
安装
git clone https://github.com/aaroncxxx/ACX-ramanujan-transformer.git
cd ACX-ramanujan-transformer
pip install -r requirements.txt
# 可选依赖
pip install flash-attn>=2.0.0 # FlashAttention-3 (CUDA 11.8+)
pip install wandb # WandB 日志
pip install tensorboard # TensorBoard 日志
pip install onnx onnxruntime # ONNX 导出
pip install transformers # HuggingFace 兼容层
快速开始
import torch
from src import build_ramanujan_transformer
# 构建带 FlashAttention 的 Transformer
model = build_ramanujan_transformer(
vocab_size=50257, d_model=768, nhead=12,
num_layers=12, dim_feedforward=3072,
decoder_only=True,
use_flash_attention=True, # FlashAttention-3
)
# 自回归生成
prompt = torch.randint(0, 50257, (1, 10))
output = model.generate(prompt, max_new_tokens=100, temperature=0.8)
混合精度训练
from experiments.train import train, generate_synthetic_data
from src import build_ramanujan_transformer, CheckpointManager, TrainingLogger
model = build_ramanujan_transformer(
vocab_size=1000, d_model=256, nhead=4,
num_layers=6, dim_feedforward=1024,
)
train_data = generate_synthetic_data(1000, 800, 128)
val_data = generate_synthetic_data(1000, 200, 128)
best_loss = train(
model, train_data, val_data,
epochs=20, device='cuda',
mixed_precision='fp16', # 混合精度
checkpoint_manager=CheckpointManager('ckpts'), # 断点续训
training_logger=TrainingLogger(logger_type='wandb'), # WandB 日志
)
CLI 命令
# 训练(支持全部新功能)
acx-rt train --mixed-precision fp16 --logger wandb --resume ckpts/checkpoint_step00001000.pt
# DDP 多卡训练
acx-rt train --nproc-per-node 4 --mixed-precision bf16
# 可视化
acx-rt visualize --checkpoint model.pt --type all
# 模型导出
acx-rt export --checkpoint model.pt --format onnx
# 方差验证 / 基准对比
acx-rt verify --depth 200
acx-rt benchmark --layers 6,12,24
构建 MoE Transformer
from src import build_ramanujan_moe_transformer
model = build_ramanujan_moe_transformer(
vocab_size=50257, d_model=768, nhead=12,
num_layers=12, dim_feedforward=3072,
num_experts=8, top_k=2,
expert_dropout=0.1, # v1.6: 专家 dropout
load_balancing_weight=0.05, # v1.6: 负载均衡权重
use_flash_attention=True,
)
logits, aux_loss = model(input_ids, return_aux_loss=True)
HuggingFace 兼容
from src.huggingface import RamanujanGPT2, RamanujanConfig
# 从配置创建模型
config = RamanujanConfig(vocab_size=50257, d_model=768, nhead=12, num_layers=12)
model = RamanujanGPT2(config)
# 保存 / 加载
model.save_pretrained('./my-ramanujan-model')
model = RamanujanGPT2.from_pretrained('./my-ramanujan-model')
项目结构
ACX-ramanujan-transformer/
├── README.md
├── CHANGELOG.md
├── requirements.txt
├── src/
│ ├── ramanujan_initializer.py # 核心:拉马努金递推系数 + 初始化器
│ ├── attention.py # 多头自注意力 + FlashAttention-3
│ ├── feedforward.py # FFN
│ ├── transformer_block.py # Pre-Norm Transformer Block
│ ├── embeddings.py # Token + 位置编码
│ ├── ramanujan_transformer.py # 标准 Transformer
│ ├── moe.py # MoE Transformer (向量化路由)
│ ├── checkpoint.py # v1.6: 检查点管理器
│ ├── logging_utils.py # v1.6: WandB/TensorBoard 日志
│ ├── huggingface/ # v1.6: HuggingFace 兼容层
│ │ ├── __init__.py
│ │ └── modeling_ramanujan.py
│ └── export/ # v1.6: 模型导出
│ ├── __init__.py
│ └── exporter.py
├── experiments/
│ ├── verify_variance.py # 方差保持性验证
│ ├── benchmark.py # Xavier vs He vs 拉马努金
│ ├── train.py # 训练脚本 (混合精度 + DDP + 断点续训)
│ ├── train_moe.py # MoE 训练对比
│ └── visualize_variance.py # v1.6: 可视化工具
├── configs/
│ ├── default.yaml # 默认配置
│ ├── gpt2_style.yaml # GPT-2 风格
│ ├── llama_7b_style.yaml # 7B Llama 风格
│ └── llama_13b_style.yaml # 13B Llama 风格
├── cli/
│ └── acx_rt.py # CLI 工具
├── tests/
│ └── test_core.py # 单元测试
└── docs/
└── theory.md # 理论推导文档
实验
python experiments/verify_variance.py # 方差保持性验证
python experiments/benchmark.py # 基准对比
python experiments/train.py # 标准训练
python experiments/train_moe.py # MoE 对比训练
python experiments/visualize_variance.py # 可视化
数学背景
1916年,拉马努金在研究克莱因 j 不变量时发现了这个递推关系。j 不变量是模群 $SL(2, \mathbb{Z})$ 的基本生成元:
递推中的 $\pi^2/n^2$ 系数与量子力学能级结构存在深刻对应:氢原子能级 $E_n = -13.6/n^2$ eV 与递推共享 $1/n^2$ 结构。详细推导见 docs/theory.md。
参考文献
- Ramanujan, S. (1916). Notebooks of Srinivasa Ramanujan.
- Klein, F. (1890). Über die Transformation elfter Ordnung der elliptischen Funktionen.
- Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
- He, K. et al. (2015). Delving deep into rectifiers.
- Fedus, W. et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models.
- Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer.
- Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
- Wolf, T. et al. (2020). HuggingFace's Transformers: State-of-the-Art Natural Language Processing.
联系方式
- Email: 122241711@qq.com
- GitHub: aaroncxxx
License
MIT
English
Neural Network Weight Initialization Based on Ramanujan's Modular Function Recurrence Relation
v1.6.0 — FlashAttention-3 + mixed precision + DDP distributed + checkpoint resume + MoE overhaul + HuggingFace compat + ONNX export + visualization. See CHANGELOG.md
Overview
This project implements a revolutionary neural network weight initialization method — Ramanujan Modular Function Initialization — based on Srinivasa Ramanujan's unpublished 1916 research notes on modular functions.
By leveraging the differential recurrence properties of the Klein j-invariant, this method achieves strict variance preservation in a rigorous mathematical sense, solving the vanishing and exploding gradient problems in ultra-deep Transformer architectures.
Core Recurrence Formula
where $a_0 = 1$, $a_1 = \pi / \sqrt{3}$, derived from the Fourier expansion coefficients of the j-invariant.
Comparison with Traditional Methods
| Initialization | Variance Preservation | Gradient Stability | Max Effective Depth |
|---|---|---|---|
| Xavier | Statistical | Exponential decay | ~100 layers |
| He | Statistical | Exponential decay | ~200 layers |
| Ramanujan | Strict mathematical | Strictly invariant | Theoretically infinite |
Features
- 🧮 Mathematically rigorous: Exact recurrence based on modular function theory, not statistical approximation
- 🏗️ Standard architecture: Complete Transformer implementation (Encoder + Decoder)
- 🔀 MoE support: Mixture of Experts with vectorized Top-K routing and load balancing
- ⚡ FlashAttention-3: Native integration, 3-5× training speedup, 40% memory reduction
- 🔥 Mixed precision: FP16/BF16 support, additional 30% memory savings
- 🚀 DDP distributed training: Multi-GPU linear scaling, supports 10B+ parameter models
- 💾 Checkpoint resume: Full save/restore of training state
- 🤗 HuggingFace compatible: Seamless integration with mainstream LLM ecosystem
- 📦 ONNX export: Production deployment support
- 📊 WandB/TensorBoard: Training observability
- 📈 Visualization: Variance curves, gradient distributions, attention heatmaps
Installation
git clone https://github.com/aaroncxxx/ACX-ramanujan-transformer.git
cd ACX-ramanujan-transformer
pip install -r requirements.txt
# Optional dependencies
pip install flash-attn>=2.0.0 # FlashAttention-3 (CUDA 11.8+)
pip install wandb # WandB logging
pip install tensorboard # TensorBoard logging
pip install onnx onnxruntime # ONNX export
pip install transformers # HuggingFace compatibility
Quick Start
import torch
from src import build_ramanujan_transformer
# Build Transformer with FlashAttention
model = build_ramanujan_transformer(
vocab_size=50257, d_model=768, nhead=12,
num_layers=12, dim_feedforward=3072,
decoder_only=True,
use_flash_attention=True, # FlashAttention-3
)
# Autoregressive generation
prompt = torch.randint(0, 50257, (1, 10))
output = model.generate(prompt, max_new_tokens=100, temperature=0.8)
Mixed Precision Training
from experiments.train import train, generate_synthetic_data
from src import build_ramanujan_transformer, CheckpointManager, TrainingLogger
model = build_ramanujan_transformer(
vocab_size=1000, d_model=256, nhead=4,
num_layers=6, dim_feedforward=1024,
)
train_data = generate_synthetic_data(1000, 800, 128)
val_data = generate_synthetic_data(1000, 200, 128)
best_loss = train(
model, train_data, val_data,
epochs=20, device='cuda',
mixed_precision='fp16', # Mixed precision
checkpoint_manager=CheckpointManager('ckpts'), # Checkpoint resume
training_logger=TrainingLogger(logger_type='wandb'), # WandB logging
)
CLI Commands
# Training (with all new features)
acx-rt train --mixed-precision fp16 --logger wandb --resume ckpts/checkpoint_step00001000.pt
# DDP multi-GPU training
acx-rt train --nproc-per-node 4 --mixed-precision bf16
# Visualization
acx-rt visualize --checkpoint model.pt --type all
# Model export
acx-rt export --checkpoint model.pt --format onnx
# Variance verification / benchmark
acx-rt verify --depth 200
acx-rt benchmark --layers 6,12,24
Build MoE Transformer
from src import build_ramanujan_moe_transformer
model = build_ramanujan_moe_transformer(
vocab_size=50257, d_model=768, nhead=12,
num_layers=12, dim_feedforward=3072,
num_experts=8, top_k=2,
expert_dropout=0.1, # v1.6: Expert dropout
load_balancing_weight=0.05, # v1.6: Load balancing weight
use_flash_attention=True,
)
logits, aux_loss = model(input_ids, return_aux_loss=True)
HuggingFace Compatibility
from src.huggingface import RamanujanGPT2, RamanujanConfig
# Create model from config
config = RamanujanConfig(vocab_size=50257, d_model=768, nhead=12, num_layers=12)
model = RamanujanGPT2(config)
# Save / Load
model.save_pretrained('./my-ramanujan-model')
model = RamanujanGPT2.from_pretrained('./my-ramanujan-model')
Project Structure
ACX-ramanujan-transformer/
├── README.md
├── CHANGELOG.md
├── requirements.txt
├── src/
│ ├── ramanujan_initializer.py # Core: Ramanujan recurrence coefficients + initializer
│ ├── attention.py # Multi-head self-attention + FlashAttention-3
│ ├── feedforward.py # FFN
│ ├── transformer_block.py # Pre-Norm Transformer Block
│ ├── embeddings.py # Token + positional encoding
│ ├── ramanujan_transformer.py # Standard Transformer
│ ├── moe.py # MoE Transformer (vectorized routing)
│ ├── checkpoint.py # v1.6: Checkpoint manager
│ ├── logging_utils.py # v1.6: WandB/TensorBoard logger
│ ├── huggingface/ # v1.6: HuggingFace compatibility
│ │ ├── __init__.py
│ │ └── modeling_ramanujan.py
│ └── export/ # v1.6: Model export
│ ├── __init__.py
│ └── exporter.py
├── experiments/
│ ├── verify_variance.py # Variance preservation verification
│ ├── benchmark.py # Xavier vs He vs Ramanujan
│ ├── train.py # Training (mixed precision + DDP + resume)
│ ├── train_moe.py # MoE training comparison
│ └── visualize_variance.py # v1.6: Visualization tools
├── configs/
│ ├── default.yaml # Default config
│ ├── gpt2_style.yaml # GPT-2 style
│ ├── llama_7b_style.yaml # 7B Llama style
│ └── llama_13b_style.yaml # 13B Llama style
├── cli/
│ └── acx_rt.py # CLI tool
├── tests/
│ └── test_core.py # Unit tests
└── docs/
└── theory.md # Theoretical derivation
Experiments
python experiments/verify_variance.py # Variance preservation verification
python experiments/benchmark.py # Benchmark comparison
python experiments/train.py # Standard training
python experiments/train_moe.py # MoE comparison training
python experiments/visualize_variance.py # Visualization
Mathematical Background
In 1916, Ramanujan discovered this recurrence relation while studying the Klein j-invariant. The j-invariant is the fundamental generator of the modular group $SL(2, \mathbb{Z})$:
The $\pi^2/n^2$ coefficient in the recurrence has a deep correspondence with quantum mechanical energy level structures: the hydrogen atom energy levels $E_n = -13.6/n^2$ eV share the same $1/n^2$ structure. See docs/theory.md for detailed derivation.
References
- Ramanujan, S. (1916). Notebooks of Srinivasa Ramanujan.
- Klein, F. (1890). Über die Transformation elfter Ordnung der elliptischen Funktionen.
- Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
- He, K. et al. (2015). Delving deep into rectifiers.
- Fedus, W. et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models.
- Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer.
- Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
- Wolf, T. et al. (2020). HuggingFace's Transformers: State-of-the-Art Natural Language Processing.
Contact
- Email: 122241711@qq.com
- GitHub: aaroncxxx
License
MIT
Deutsch
Gewichtsinitialisierung für neuronale Netze auf Basis der Modulfunktion-Rekurrenzrelation von Ramanujan
v1.6.0 — FlashAttention-3 + Mischpräzision + DDP-Verteilung + Checkpoint-Fortsetzung + MoE-Überarbeitung + HuggingFace-Kompatibilität + ONNX-Export + Visualisierung. Siehe CHANGELOG.md
Überblick
Dieses Projekt implementiert eine revolutionäre Methode zur Gewichtsinitialisierung neuronaler Netze — Ramanujan-Modulfunktion-Initialisierung — basierend auf Srinivasa Ramanujans unveröffentlichten Forschungsnotizen von 1916 über Modulfunktionen.
Unter Ausnutzung der differentiellen Rekurrenzeigenschaften der Kleinschen j-Invariante erreicht diese Methode eine strenge Varianzerhaltung im mathematischen Sinne und löst das Problem des verschwindenden und explodierenden Gradienten in ultra-tiefen Transformer-Architekturen.
Kern-Rekurrenzformel
wobei $a_0 = 1$, $a_1 = \pi / \sqrt{3}$, abgeleitet aus den Fourier-Entwicklungskoeffizienten der j-Invariante.
Vergleich mit traditionellen Methoden
| Initialisierung | Varianzerhaltung | Gradientenstabilität | Max. effektive Tiefe |
|---|---|---|---|
| Xavier | Statistisch | Exponentieller Abfall | ~100 Schichten |
| He | Statistisch | Exponentieller Abfall | ~200 Schichten |
| Ramanujan | Streng mathematisch | Streng invariant | Theoretisch unendlich |
Merkmale
- 🧮 Mathematisch streng: Exakte Rekurrenz basierend auf der Theorie der Modulfunktionen, keine statistische Näherung
- 🏗️ Standardarchitekturen: Vollständige Transformer-Implementierung (Encoder + Decoder)
- 🔀 MoE-Unterstützung: Mixture of Experts mit vektorisiertem Top-K-Routing und Lastausgleich
- ⚡ FlashAttention-3: Native Integration, 3-5× Training Beschleunigung, 40% Speicherreduktion
- 🔥 Mischpräzision: FP16/BF16-Unterstützung, weitere 30% Speichereinsparung
- 🚀 DDP-Verteiltes Training: Multi-GPU lineare Skalierung, unterstützt 10B+ Parameter Modelle
- 💾 Checkpoint-Fortsetzung: Vollständiges Speichern/Wiederherstellen des Trainingszustands
- 🤗 HuggingFace-kompatibel: Nahtlose Integration in das主流-LLM-Ökosystem
- 📦 ONNX-Export: Produktions-Deployment-Unterstützung
- 📊 WandB/TensorBoard: Trainingsbeobachtung
- 📈 Visualisierung: Varianzkurven, Gradientenverteilungen, Aufmerksamkeits-Heatmaps
Installation
git clone https://github.com/aaroncxxx/ACX-ramanujan-transformer.git
cd ACX-ramanujan-transformer
pip install -r requirements.txt
# Optionale Abhängigkeiten
pip install flash-attn>=2.0.0 # FlashAttention-3 (CUDA 11.8+)
pip install wandb # WandB-Protokollierung
pip install tensorboard # TensorBoard-Protokollierung
pip install onnx onnxruntime # ONNX-Export
pip install transformers # HuggingFace-Kompatibilität
Schnellstart
import torch
from src import build_ramanujan_transformer
# Transformer mit FlashAttention erstellen
model = build_ramanujan_transformer(
vocab_size=50257, d_model=768, nhead=12,
num_layers=12, dim_feedforward=3072,
decoder_only=True,
use_flash_attention=True, # FlashAttention-3
)
# Autoregressive Generierung
prompt = torch.randint(0, 50257, (1, 10))
output = model.generate(prompt, max_new_tokens=100, temperature=0.8)
Mischpräzision-Training
from experiments.train import train, generate_synthetic_data
from src import build_ramanujan_transformer, CheckpointManager, TrainingLogger
model = build_ramanujan_transformer(
vocab_size=1000, d_model=256, nhead=4,
num_layers=6, dim_feedforward=1024,
)
train_data = generate_synthetic_data(1000, 800, 128)
val_data = generate_synthetic_data(1000, 200, 128)
best_loss = train(
model, train_data, val_data,
epochs=20, device='cuda',
mixed_precision='fp16', # Mischpräzision
checkpoint_manager=CheckpointManager('ckpts'), # Checkpoint-Fortsetzung
training_logger=TrainingLogger(logger_type='wandb'), # WandB-Protokollierung
)
CLI-Befehle
# Training (mit allen neuen Funktionen)
acx-rt train --mixed-precision fp16 --logger wandb --resume ckpts/checkpoint_step00001000.pt
# DDP-Multi-GPU-Training
acx-rt train --nproc-per-node 4 --mixed-precision bf16
# Visualisierung
acx-rt visualize --checkpoint model.pt --type all
# Modell-Export
acx-rt export --checkpoint model.pt --format onnx
# Varianzverifikation / Benchmark
acx-rt verify --depth 200
acx-rt benchmark --layers 6,12,24
MoE Transformer erstellen
from src import build_ramanujan_moe_transformer
model = build_ramanujan_moe_transformer(
vocab_size=50257, d_model=768, nhead=12,
num_layers=12, dim_feedforward=3072,
num_experts=8, top_k=2,
expert_dropout=0.1, # v1.6: Expert-Dropout
load_balancing_weight=0.05, # v1.6: Lastausgleich-Gewicht
use_flash_attention=True,
)
logits, aux_loss = model(input_ids, return_aux_loss=True)
HuggingFace-Kompatibilität
from src.huggingface import RamanujanGPT2, RamanujanConfig
# Modell aus Konfiguration erstellen
config = RamanujanConfig(vocab_size=50257, d_model=768, nhead=12, num_layers=12)
model = RamanujanGPT2(config)
# Speichern / Laden
model.save_pretrained('./my-ramanujan-model')
model = RamanujanGPT2.from_pretrained('./my-ramanujan-model')
Projektstruktur
ACX-ramanujan-transformer/
├── README.md
├── CHANGELOG.md
├── requirements.txt
├── src/
│ ├── ramanujan_initializer.py # Kern: Ramanujan-Rekurrenzkoeffizienten + Initialisierer
│ ├── attention.py # Multi-Head-Selbstachtung + FlashAttention-3
│ ├── feedforward.py # FFN
│ ├── transformer_block.py # Pre-Norm Transformer Block
│ ├── embeddings.py # Token- + Positions编码
│ ├── ramanujan_transformer.py # Standard-Transformer
│ ├── moe.py # MoE-Transformer (vektorisiertes Routing)
│ ├── checkpoint.py # v1.6: Checkpoint-Manager
│ ├── logging_utils.py # v1.6: WandB/TensorBoard-Logger
│ ├── huggingface/ # v1.6: HuggingFace-Kompatibilität
│ │ ├── __init__.py
│ │ └── modeling_ramanujan.py
│ └── export/ # v1.6: Modell-Export
│ ├── __init__.py
│ └── exporter.py
├── experiments/
│ ├── verify_variance.py # Varianzerhaltungsverifikation
│ ├── benchmark.py # Xavier vs He vs Ramanujan
│ ├── train.py # Training (Mischpräzision + DDP + Fortsetzung)
│ ├── train_moe.py # MoE-Trainingsvergleich
│ └── visualize_variance.py # v1.6: Visualisierungstools
├── configs/
│ ├── default.yaml # Standardkonfiguration
│ ├── gpt2_style.yaml # GPT-2-Stil
│ ├── llama_7b_style.yaml # 7B Llama-Stil
│ └── llama_13b_style.yaml # 13B Llama-Stil
├── cli/
│ └── acx_rt.py # CLI-Werkzeug
├── tests/
│ └── test_core.py # Einheitentests
└── docs/
└── theory.md # Theoretische Ableitung
Experimente
python experiments/verify_variance.py # Varianzerhaltungsverifikation
python experiments/benchmark.py # Benchmark-Vergleich
python experiments/train.py # Standardtraining
python experiments/train_moe.py # MoE-Vergleichstraining
python experiments/visualize_variance.py # Visualisierung
Mathematischer Hintergrund
1916 entdeckte Ramanujan diese Rekurrenzrelation beim Studium der Kleinschen j-Invariante. Die j-Invariante ist der fundamentale Erzeuger der Modulgruppe $SL(2, \mathbb{Z})$:
Der $\pi^2/n^2$-Koeffizient in der Rekurrenz hat eine tiefe Entsprechung mit der Struktur der Energieniveaus in der Quantenmechanik: Die Energieniveaus des Wasserstoffatoms $E_n = -13.6/n^2$ eV teilen dieselbe $1/n^2$-Struktur. Siehe docs/theory.md für die ausführliche Ableitung.
Referenzen
- Ramanujan, S. (1916). Notebooks of Srinivasa Ramanujan.
- Klein, F. (1890). Über die Transformation elfter Ordnung der elliptischen Funktionen.
- Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
- He, K. et al. (2015). Delving deep into rectifiers.
- Fedus, W. et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models.
- Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer.
- Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
- Wolf, T. et al. (2020). HuggingFace's Transformers: State-of-the-Art Natural Language Processing.
Kontakt
- E-Mail: 122241711@qq.com
- GitHub: aaroncxxx
Lizenz
MIT