CoFrGeNet-F: Continued Fraction Language Model

An open-source implementation of the CoFrGeNet-F architecture from IBM Research's paper arXiv:2601.21766. CoFrGeNet-F replaces standard Transformer FFN layers with continued fraction networks. This repo contains model weights for two CoFrGeNet-F experiments and the standard Transformer baseline, all trained on identical data with identical hyperparameters.


The Mathematics of Continued Fractions

At the heart of this architecture lies the generalized continued fraction:

$$\tilde{f}(a_1, a_2, \ldots, a_d) \;=\; \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{\ddots + \cfrac{1}{a_d}}}}$$

where each coefficient is a learned linear function of the input. This recursive structure is computed efficiently via continuant polynomials:

$$K_0 = 1, \qquad K_1 = a_d, \qquad K_k = a_{d-k+1} \cdot K_{k-1} + K_{k-2}$$

The continued fraction then evaluates to a simple ratio of consecutive continuants:

$$\tilde{f} = \frac{K_{d-1}}{K_d}$$
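As a concrete illustration (a minimal sketch, not the repository's actual code), the bottom-up continuant recurrence and the final ratio can be written as:

```python
def continuants(a):
    """Continuants K_0..K_d for coefficients a = [a_1, ..., a_d].

    The recurrence consumes coefficients from the innermost term outward:
    K_0 = 1, K_1 = a_d, K_k = a_{d-k+1} * K_{k-1} + K_{k-2}.
    """
    d = len(a)
    K = [1.0, float(a[-1])]
    for k in range(2, d + 1):
        K.append(a[d - k] * K[-1] + K[-2])
    return K

def cf_value(a):
    K = continuants(a)
    return K[-2] / K[-1]  # f = K_{d-1} / K_d
```

For example, `cf_value([1.0, 1.0, 1.0])` evaluates 1/(1 + 1/(1 + 1/1)) = 2/3, matching direct nested evaluation.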

Efficient Gradients via Proposition 1

The continuant formulation yields gradients that require only one division instead of d:

$$\frac{\partial \tilde{f}}{\partial a_k} = (-1)^{k} \left( \frac{K_{d-k}}{K_d} \right)^{2}$$

By caching the reciprocal of the final continuant and reusing it across all depth levels, we reduce the cost from O(d) divisions to O(1).
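A minimal sketch of this cached-reciprocal gradient (not the repo's implementation; it recomputes the continuants inline so the snippet is self-contained):

```python
def cf_grad(a):
    """Gradients of f = K_{d-1}/K_d w.r.t. a_1..a_d, per Proposition 1.

    Builds the continuants once, then reuses a single cached 1/K_d,
    so the whole gradient costs one division instead of d.
    """
    d = len(a)
    K = [1.0, float(a[-1])]                  # K_0 = 1, K_1 = a_d
    for k in range(2, d + 1):
        K.append(a[d - k] * K[-1] + K[-2])   # K_k = a_{d-k+1} K_{k-1} + K_{k-2}
    inv_Kd = 1.0 / K[d]                      # the only division
    return [(-1) ** k * (K[d - k] * inv_Kd) ** 2 for k in range(1, d + 1)]
```

The signs alternate with k: increasing the outermost coefficient a_1 always shrinks the fraction, increasing a_2 grows it, and so on. The result agrees with central finite differences of the nested evaluation.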


The Cffn Layer

Each Continued Fraction FFN (Cffn) replaces the standard two-layer FFN. Where a standard Transformer block uses a Linear, GELU, Linear stack with a 4x expansion, the Cffn instead computes:

$$y = U x + \sum_{j=1}^{L} V_j \cdot z_j$$

Direct linear path:

$$U \in \mathbb{R}^{p \times p}$$

Gating projection:

$$\hat{x} = \sigma(G x) \odot x, \qquad G \in \mathbb{R}^{p \times p}$$

Continued fraction ladders (each computes p independent continued fractions element-wise):

$$z_j = \tilde{f}\!\left( \hat{x} \odot W^{(j)} \right), \qquad W^{(j)} \in \mathbb{R}^{p \times d}, \quad j = 1, \ldots, L$$

Combination weights:

$$V \in \mathbb{R}^{p \times L}$$
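Putting the pieces together, here is a NumPy sketch of the Cffn forward pass under the shapes above (an illustration, not the repo's code; the V_j · z_j combination is read as element-wise, which the shapes require):

```python
import numpy as np

def cf_eval(A):
    # A: (p, d) coefficient rows; evaluates p continued fractions element-wise
    # via the continuant recurrence K_k = a_{d-k+1} * K_{k-1} + K_{k-2}.
    p, d = A.shape
    K_prev, K = np.ones(p), A[:, -1].copy()
    for k in range(2, d + 1):
        K_prev, K = K, A[:, d - k] * K + K_prev
    return K_prev / K                                    # K_{d-1} / K_d

def cffn_forward(x, U, G, V, W):
    # x: (p,)   U, G: (p, p)   V: (p, L)   W: (L, p, d)
    x_hat = (1.0 / (1.0 + np.exp(-(G @ x)))) * x         # gating: sigmoid(Gx) * x
    z = np.stack([cf_eval(x_hat[:, None] * Wj) for Wj in W])  # (L, p) ladder outputs
    return U @ x + (V * z.T).sum(axis=1)                 # direct path + combined ladders
```

Each ladder j scales the gated input into a (p, d) coefficient grid, so a single `cf_eval` call yields p independent depth-d continued fractions.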

Parameter Count per Cffn Layer

$$2p^2 + L \cdot p \cdot (d + 1)$$

At p=768: 1,193,472 params vs a standard FFN's 4,718,592, a 3.95x reduction per layer.
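These counts follow directly from the formula (plain arithmetic, not repo code): U and G contribute p^2 each, the L ladders contribute L·p·d, and the combination weights V contribute L·p.

```python
p, L, d = 768, 3, 5

cffn_params = 2 * p**2 + L * p * (d + 1)  # U + G, ladders W, combination weights V
ffn_params = 8 * p**2                     # standard FFN: p -> 4p -> p

print(cffn_params)                        # 1193472
print(ffn_params)                         # 4718592
print(round(ffn_params / cffn_params, 2)) # 3.95
```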


Experiments

All models trained on FineWeb-Edu 10BT (~10B tokens) with identical hyperparameters. Evaluated on the same NVIDIA H200 with identical code.

Experiment 1: Parameter-Efficient (82M vs 124M)

Can CoFrGeNet-F match baseline quality with 34% fewer parameters?

| Model | Config | Parameters |
|---|---|---|
| Baseline | 12L, 768d, 12h, standard FFN | 124.3M |
| CoFrGeNet-F | 12L, 768d, 12h, Cffn (L=3, d=5) | 82.0M |

Experiment 2: Iso-Parameter (128M vs 124M)

With equal parameter budget, does CoFrGeNet-F match or beat the baseline? By widening the hidden dimension from 768 to 1024, the Cffn model reaches ~128M parameters. Because Cffn scales as:

$$\text{Cffn}(p) = 2p^2 + L \cdot p \cdot (d + 1)$$

while the standard FFN scales as:

$$\text{FFN}(p) = 8p^2$$

the Cffn model can use a wider hidden dimension within the same parameter budget, giving it richer representations and more attention heads.
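A quick sanity check on the widening argument, computed from the two scaling formulas above:

```python
def cffn_per_layer(p, L=3, d=5):
    return 2 * p**2 + L * p * (d + 1)

def ffn_per_layer(p):
    return 8 * p**2

# Even after widening to p = 1024, a Cffn layer is cheaper than the
# baseline's FFN layer at p = 768; that slack is what pays for the wider
# residual stream and the extra attention heads.
print(ffn_per_layer(768))    # 4718592
print(cffn_per_layer(1024))  # 2115584
```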

| Model | Config | Parameters |
|---|---|---|
| Baseline | 12L, 768d, 12h, standard FFN | 124.3M |
| CoFrGeNet-F | 12L, 1024d, 16h, Cffn (L=3, d=5) | 128.3M |

Results

| Metric | Baseline (124M) | CoFrGeNet-F (82M) | CoFrGeNet-F (128M) |
|---|---|---|---|
| WikiText-2 PPL | 40.79 | 110.32 | 82.46 |
| WikiText-103 PPL | 40.79 | 110.32 | 82.46 |
| LAMBADA PPL | 37.45 | 166.57 | 111.26 |
| LAMBADA Accuracy | 19.06% | 8.77% | 11.41% |
| Throughput | 452,622 tok/s | 103,455 tok/s | 128,206 tok/s |
| Generation Speed | 3.68 ms/tok | 10.92 ms/tok | 10.50 ms/tok |

Analysis

The 128M model improves markedly over the 82M (WikiText-2 PPL 110.32 to 82.46, LAMBADA accuracy 8.77% to 11.41%), confirming that the architecture benefits from scale. However, the baseline still wins at this scale. The paper's strongest results were at GPT-2 XL scale (985M CoFrGeNet-F vs 1.5B GPT-2 XL), suggesting the gap may close at larger model sizes.

Experiment 3: More Ladders (128M, L=8) (In Progress)

Each ladder is an independent rational approximation. More ladders give a richer function space at negligible parameter cost (+0.3%). Training on H200, ETA ~March 10, 2026. Results will be added here upon completion.


Training Details

| Hyperparameter | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~10B tokens) |
| Tokenizer | GPT-2 (tiktoken) |
| Optimizer | AdamW |
| Learning rate | 6e-4 peak, cosine decay to 0 |
| Warmup | 700 steps |
| Weight decay | 0.1 |
| Batch size | 524,288 tokens per update |
| Total steps | 19,073 (one epoch) |
| Precision | bfloat16 |
| Seed | 42 |

Dyadic Training Schedule

The paper's dyadic schedule is critical: without it, performance degrades 10-80%. Each depth level i unfreezes at:

$$s_i = \left(1 - \frac{1}{2^i}\right) \times S_{\text{total}}$$

| Depth | Unfreeze Step | % of Training |
|---|---|---|
| 0 (linear components U, G, V) | 0 | 0% |
| 1 (ladder column 0) | 9,537 | 50% |
| 2 (ladder column 1) | 14,305 | 75% |
| 3 (ladder column 2) | 16,689 | 87.5% |
| 4 (ladder column 3) | 17,881 | 93.75% |
| 5 (ladder column 4) | 18,477 | 96.875% |
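The unfreeze steps can be reproduced from the formula; rounding up to the next whole step matches the listed values (a sketch, since the repo's exact rounding convention is not stated):

```python
import math

S_TOTAL = 19073  # total optimizer steps (one epoch)

def unfreeze_step(i, s_total=S_TOTAL):
    # s_i = (1 - 1/2**i) * S_total, rounded up to a whole step
    return math.ceil((1.0 - 0.5 ** i) * s_total)

print([unfreeze_step(i) for i in range(6)])
# [0, 9537, 14305, 16689, 17881, 18477]
```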

Hardware

| Model | GPU | Throughput | Training Time |
|---|---|---|---|
| Baseline (124M) | RTX 5090 | ~141K tok/s | ~19.7 hours |
| CoFrGeNet-F (82M) | H200 NVL | ~74K tok/s | ~37.3 hours |
| CoFrGeNet-F (128M) | H200 NVL | ~114K tok/s | ~24.3 hours |

The 128M model used torch.compile for a ~2.3x throughput boost.


Repo Structure

```
cahlen/cofrgenet-f/
├── baseline/
│   ├── model.safetensors       # Standard Transformer (124M params)
│   └── eval_results.json
├── cofrgenet/
│   ├── model.safetensors       # CoFrGeNet-F 82M (Experiment 1)
│   └── eval_results.json
├── cofrgenet-128m/
│   ├── model.safetensors       # CoFrGeNet-F 128M (Experiment 2)
│   └── eval_results.json
└── src/                        # Full model source code
    ├── cofrgenet/
    └── baseline/
```

Usage

CoFrGeNet-F 128M (Experiment 2):

```python
from safetensors.torch import load_file
from src.cofrgenet.config import CoFrGeNetConfig
from src.cofrgenet.model import CoFrGeNetTransformer

config = CoFrGeNetConfig(n_embd=1024, n_head=16)
model = CoFrGeNetTransformer(config)
state_dict = load_file("cofrgenet-128m/model.safetensors")
# Strip the torch.compile prefix if present
cleaned = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
model.load_state_dict(cleaned, strict=False)
model.eval()
```

CoFrGeNet-F 82M (Experiment 1):

```python
config = CoFrGeNetConfig()  # defaults: 768d, 12h
model = CoFrGeNetTransformer(config)
state_dict = load_file("cofrgenet/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()
```

Baseline (124M):

```python
from src.baseline.config import BaselineConfig
from src.baseline.model import BaselineTransformer

config = BaselineConfig()
model = BaselineTransformer(config)
state_dict = load_file("baseline/model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()
```

Source Code

Full implementation: github.com/cahlen/cofrgenet-f

Citation

```bibtex
@article{dey2025cofrgenet,
  title={CoFrGeNet: Continued Fraction based Geometric Network for target benchmarks},
  author={Dey, Subhajit and others},
  journal={arXiv preprint arXiv:2601.21766},
  year={2025}
}
```

License

MIT
