# CoFrGeNet-F: Continued Fraction Language Model
An open-source implementation of the CoFrGeNet-F architecture from IBM Research's paper arXiv:2601.21766. CoFrGeNet-F replaces standard Transformer FFN layers with continued fraction networks. This repo contains model weights for two CoFrGeNet-F experiments and the standard Transformer baseline, all trained on identical data with identical hyperparameters.
## The Mathematics of Continued Fractions
At the heart of this architecture lies the generalized continued fraction:

$$
f(x) \;=\; b_0(x) + \cfrac{a_1(x)}{b_1(x) + \cfrac{a_2(x)}{b_2(x) + \cfrac{a_3(x)}{\;\ddots\; + \cfrac{a_d(x)}{b_d(x)}}}}
$$
where each coefficient is a learned linear function of the input. This recursive structure is computed efficiently via continuant polynomials:

$$
A_i = b_i A_{i-1} + a_i A_{i-2}, \qquad B_i = b_i B_{i-1} + a_i B_{i-2},
$$

with initial conditions $A_{-1} = 1$, $A_0 = b_0$, $B_{-1} = 0$, $B_0 = 1$.
The continued fraction then evaluates to a simple ratio of consecutive continuants:

$$
f(x) = \frac{A_d}{B_d}
$$
### Efficient Gradients via Proposition 1
The continuant formulation yields gradients that require only one division instead of $d$:

$$
\frac{\partial f}{\partial \theta} \;=\; \frac{1}{B_d}\left(\frac{\partial A_d}{\partial \theta} - f\,\frac{\partial B_d}{\partial \theta}\right)
$$
By caching the reciprocal of the final continuant and reusing it across all depth levels, we reduce the cost from O(d) divisions to O(1).
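A minimal NumPy sketch (not the repo's implementation) makes this concrete: the ladder is evaluated with the continuant recurrences, and only the final ratio A_d / B_d requires a division.

```python
import numpy as np

def continued_fraction(b, a):
    """Evaluate b0 + a1/(b1 + a2/(b2 + ...)) element-wise.

    b: array of shape (d+1, p) holding b_0 .. b_d
    a: array of shape (d, p) holding a_1 .. a_d
    Uses the continuant recurrences A_i = b_i * A_{i-1} + a_i * A_{i-2}
    (likewise for B) and performs only the single final division A_d / B_d.
    """
    A_prev, A = np.ones_like(b[0]), b[0].astype(float)   # A_{-1} = 1, A_0 = b_0
    B_prev, B = np.zeros_like(b[0]), np.ones_like(b[0])  # B_{-1} = 0, B_0 = 1
    for i in range(1, b.shape[0]):
        A, A_prev = b[i] * A + a[i - 1] * A_prev, A
        B, B_prev = b[i] * B + a[i - 1] * B_prev, B
    return A / B

# Depth-2 approximation of sqrt(2): 1 + 1/(2 + 1/2) = 7/5 = 1.4
b = np.array([[1.0], [2.0], [2.0]])
a = np.array([[1.0], [1.0]])
print(continued_fraction(b, a))  # -> [1.4]
```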
## The Cffn Layer
Each Continued Fraction FFN (Cffn) replaces the standard two-layer FFN. Where a standard Transformer block uses Linear → GELU → Linear with a 4x expansion, the Cffn instead combines:

- a direct linear path,
- a gating projection,
- continued fraction ladders (each computing p independent continued fractions element-wise), and
- learned combination weights over these components.
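The pieces fit together roughly as follows. This is a speculative toy sketch, not the repo's implementation: the names `U` and `G` are taken from the dyadic-schedule table below ("linear components U, G, V"), the element-wise coefficient shapes are chosen to be plausible for the reported per-layer count, and the tanh gate, unit partial numerators, and `eps` stabilizer are pure assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, d = 16, 3, 5   # toy width; the repo uses n = 768 with L = 3 ladders of depth d = 5

# Hypothetical parameters; names and roles are guesses (see above)
U = rng.normal(size=(n, n)) / np.sqrt(n)   # direct linear path
G = rng.normal(size=(n, n)) / np.sqrt(n)   # gating projection
W = rng.normal(size=(L, d + 1, n)) * 0.1   # element-wise ladder coefficients
c = np.full(L + 1, 1.0 / (L + 1))          # combination weights

def ladder(w, z, eps=1e-3):
    """One depth-d continued fraction, element-wise over the channels, with
    denominators b_i = w_i * z and partial numerators fixed to 1 (an
    assumption); eps keeps the toy numerically stable."""
    f = w[-1] * z
    for w_i in w[-2::-1]:
        f = w_i * z + 1.0 / (f + np.sign(f) * eps)
    return f

def cffn(x):
    z = np.tanh(G @ x)   # gated input to the ladders (activation is a guess)
    paths = [U @ x] + [ladder(W[k], z) for k in range(L)]
    return sum(ck * pk for ck, pk in zip(c, paths))

y = cffn(rng.normal(size=n))
print(y.shape)  # (16,)
```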
### Parameter Count per Cffn Layer
At p=768: 1,193,472 params vs a standard FFN's 4,718,592, a 3.95x reduction per layer.
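A quick arithmetic check of these counts (assuming the FFN figure counts the two 4x-expansion weight matrices and ignores biases):

```python
n = 768
ffn = n * (4 * n) + (4 * n) * n   # up-projection + down-projection, biases ignored
cffn = 1_193_472                  # reported Cffn parameter count at p = 768
print(ffn, round(ffn / cffn, 2))  # -> 4718592 3.95
```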
## Experiments
All models were trained on FineWeb-Edu 10BT (~10B tokens) with identical hyperparameters and evaluated on the same NVIDIA H200 with identical code.
### Experiment 1: Parameter-Efficient (82M vs 124M)
Can CoFrGeNet-F match baseline quality with 34% fewer parameters?
| Model | Config | Parameters |
|---|---|---|
| Baseline | 12L, 768d, 12h, standard FFN | 124.3M |
| CoFrGeNet-F | 12L, 768d, 12h, Cffn (L=3, d=5) | 82.0M |
### Experiment 2: Iso-Parameter (128M vs 124M)
With an equal parameter budget, does CoFrGeNet-F match or beat the baseline? By widening the hidden dimension from 768 to 1024, the Cffn model reaches ~128M parameters. Because the Cffn scales as

$$
P_{\text{Cffn}}(n) \approx 2n^2 + L\,(d+1)\,n,
$$

while the standard FFN scales as

$$
P_{\text{FFN}}(n) = 2 \cdot n \cdot 4n = 8n^2,
$$
the Cffn model can use a wider hidden dimension within the same parameter budget, giving it richer representations and more attention heads.
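As a sanity check, assume the per-layer counts scale as 8n² for the standard FFN (two 4x-expansion matrices, no biases) and approximately 2n² + L(d+1)n for the Cffn; the latter is an inference from the reported 1,193,472 figure at n = 768, not a formula quoted from the paper:

```python
def ffn_params(n):
    return 8 * n * n   # two weight matrices of a 4x-expansion FFN, no biases

def cffn_params(n, L=3, d=5):
    # Inferred from the reported per-layer count at n = 768 (assumption)
    return 2 * n * n + L * (d + 1) * n

print(cffn_params(768))                      # -> 1193472, matches the reported count
print(ffn_params(1024) - cffn_params(1024))  # per-layer savings at n = 1024 -> 6273024
```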
| Model | Config | Parameters |
|---|---|---|
| Baseline | 12L, 768d, 12h, standard FFN | 124.3M |
| CoFrGeNet-F | 12L, 1024d, 16h, Cffn (L=3, d=5) | 128.3M |
## Results
| Metric | Baseline (124M) | CoFrGeNet-F (82M) | CoFrGeNet-F (128M) |
|---|---|---|---|
| WikiText-2 PPL | 40.79 | 110.32 | 82.46 |
| WikiText-103 PPL | 40.79 | 110.32 | 82.46 |
| LAMBADA PPL | 37.45 | 166.57 | 111.26 |
| LAMBADA Accuracy | 19.06% | 8.77% | 11.41% |
| Throughput | 452,622 tok/s | 103,455 tok/s | 128,206 tok/s |
| Generation Speed | 3.68 ms/tok | 10.92 ms/tok | 10.50 ms/tok |
### Analysis
The 128M model improves significantly over the 82M one (WikiText-2 PPL drops from 110 to 82; LAMBADA accuracy rises from 8.8% to 11.4%), confirming that the architecture benefits from scale. The baseline still wins at this scale, however. The paper's strongest results were at GPT-2 XL scale (985M CoFrGeNet-F vs 1.5B GPT-2 XL), suggesting the gap may close at larger model sizes.
### Experiment 3: More Ladders (128M, L=8) [In Progress]
Each ladder is an independent rational approximation, so more ladders give a richer function space at negligible parameter cost (+0.3%). Training is running on an H200 (ETA ~March 10, 2026); results will be added here upon completion.
## Training Details
| Hyperparameter | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~10B tokens) |
| Tokenizer | GPT-2 (tiktoken) |
| Optimizer | AdamW |
| Learning rate | 6e-4 peak, cosine decay to 0 |
| Warmup | 700 steps |
| Weight decay | 0.1 |
| Batch size | 524,288 tokens per update |
| Total steps | 19,073 (one epoch) |
| Precision | bfloat16 |
| Seed | 42 |
### Dyadic Training Schedule
The paper's dyadic schedule is critical: without it, performance degrades by 10-80%. Each depth level $i \geq 1$ unfreezes at step

$$
t_i = \left\lceil\, T \left(1 - 2^{-i}\right) \right\rceil,
$$

where $T = 19{,}073$ is the total number of training steps:
| Depth | Unfreeze Step | % of Training |
|---|---|---|
| 0 (linear components U, G, V) | 0 | 0% |
| 1 (ladder column 0) | 9,537 | 50% |
| 2 (ladder column 1) | 14,305 | 75% |
| 3 (ladder column 2) | 16,689 | 87.5% |
| 4 (ladder column 3) | 17,881 | 93.75% |
| 5 (ladder column 4) | 18,477 | 96.875% |
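The table can be reproduced with a dyadic rule of the form `t_i = ceil(T * (1 - 2^-i))`, which matches every listed step exactly:

```python
import math

T = 19_073  # total training steps (one epoch)

def unfreeze_step(i: int) -> int:
    """Dyadic schedule: depth level i >= 1 unfreezes after a (1 - 2^-i)
    fraction of training; level 0 (the linear components) trains from step 0."""
    return math.ceil(T * (1 - 2 ** -i)) if i >= 1 else 0

print([unfreeze_step(i) for i in range(6)])
# -> [0, 9537, 14305, 16689, 17881, 18477]
```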
### Hardware
| Model | GPU | Throughput | Training Time |
|---|---|---|---|
| Baseline (124M) | RTX 5090 | ~141K tok/s | ~19.7 hours |
| CoFrGeNet-F (82M) | H200 NVL | ~74K tok/s | ~37.3 hours |
| CoFrGeNet-F (128M) | H200 NVL | ~114K tok/s | ~24.3 hours |
The 128M model used torch.compile for a ~2.3x throughput boost.
## Repo Structure
```
cahlen/cofrgenet-f/
├── baseline/
│   ├── model.safetensors     # Standard Transformer (124M params)
│   └── eval_results.json
├── cofrgenet/
│   ├── model.safetensors     # CoFrGeNet-F 82M (Experiment 1)
│   └── eval_results.json
├── cofrgenet-128m/
│   ├── model.safetensors     # CoFrGeNet-F 128M (Experiment 2)
│   └── eval_results.json
└── src/                      # Full model source code
    ├── cofrgenet/
    └── baseline/
```
## Usage
CoFrGeNet-F 128M (Experiment 2):
```python
from safetensors.torch import load_file
from src.cofrgenet.config import CoFrGeNetConfig
from src.cofrgenet.model import CoFrGeNetTransformer

config = CoFrGeNetConfig(n_embd=1024, n_head=16)
model = CoFrGeNetTransformer(config)

state_dict = load_file("cofrgenet-128m/model.safetensors")
# Strip torch.compile prefix if present
cleaned = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
model.load_state_dict(cleaned, strict=False)
model.eval()
```
CoFrGeNet-F 82M (Experiment 1):
```python
config = CoFrGeNetConfig()  # defaults: 768d, 12h
model = CoFrGeNetTransformer(config)
state_dict = load_file("cofrgenet/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()
```
Baseline (124M):
```python
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig()
model = BaselineTransformer(config)
state_dict = load_file("baseline/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()
```
## Source Code
Full implementation: github.com/cahlen/cofrgenet-f
## Citation
```bibtex
@article{dey2025cofrgenet,
  title={CoFrGeNet: Continued Fraction based Geometric Network for target benchmarks},
  author={Dey, Subhajit and others},
  journal={arXiv preprint arXiv:2601.21766},
  year={2025}
}
```
## License
MIT